# Part 4: Spark Production Issues - Streaming

**Objective**: Identify, diagnose, and fix the most critical Structured Streaming production issues.

**Duration**: 20 minutes

**What You'll Learn**:
1. Checkpoint management for fault tolerance
2. Watermarks to prevent unbounded state growth
3. Choosing the right output mode
4. Handling small files in streaming sinks
5. Idempotent writes with foreachBatch


In [None]:
# Setup: Import required libraries
from pyspark.sql.functions import *
from pyspark.sql.types import *
import time
import os

# Configure for streaming demos
spark.conf.set("spark.sql.shuffle.partitions", 8)  # Lower for faster demos

print("✅ Environment ready for streaming demos!")

## Create Simulated Event Stream

**Setup**: We'll use Spark's `rate` source to simulate real-time events (like IoT sensors, clickstreams, or transactions).


In [None]:
# Create a simulated event stream (like IoT sensor data or user clicks)
# Rate source generates events with timestamp and sequential value

events_stream = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", 100) \
    .option("rampUpTime", "0s") \
    .load() \
    .withColumn("user_id", (col("value") % 1000).cast("string")) \
    .withColumn("event_type", 
        when(rand() > 0.7, "purchase")
        .when(rand() > 0.4, "add_to_cart")
        .otherwise("page_view")
    ) \
    .withColumn("amount", (rand() * 500).cast("double")) \
    .select(
        col("timestamp").alias("event_time"),
        col("user_id"),
        col("event_type"),
        col("amount")
    )

print("✅ Simulated event stream created!")
print("📊 Schema:")
events_stream.printSchema()


## Issue #1: Missing Checkpoint Location

**The Problem**: Without checkpointing, a stream can't recover from failures and may duplicate or lose data.

**Symptoms**:
- Stream fails on restart
- Duplicate processing after crashes
- No fault tolerance

**Solution**: ALWAYS set checkpointLocation for production streams.
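
To make the recovery behavior concrete, here is a minimal plain-Python sketch of the idea behind offset checkpointing. It is a toy model, not Spark's actual checkpoint format: `run_batches`, the `offset.json` file, and the batch size are all hypothetical names chosen for illustration.

```python
import json
import os
import tempfile


def run_batches(source, checkpoint_dir, max_batches, batch_size=3):
    """Process `source` in batches, committing the next offset after each batch."""
    offset_file = os.path.join(checkpoint_dir, "offset.json")  # hypothetical layout
    start = 0
    if os.path.exists(offset_file):
        # A restart resumes from the last committed offset
        with open(offset_file) as f:
            start = json.load(f)["next_offset"]

    processed = []
    for _ in range(max_batches):
        batch = source[start:start + batch_size]
        if not batch:
            break
        processed.extend(batch)
        start += len(batch)
        # Commit progress only after the batch has been handled
        with open(offset_file, "w") as f:
            json.dump({"next_offset": start}, f)
    return processed


events = list(range(10))
checkpoint_dir = tempfile.mkdtemp()

first_run = run_batches(events, checkpoint_dir, max_batches=2)    # simulated crash after 2 batches
second_run = run_batches(events, checkpoint_dir, max_batches=10)  # restart picks up where it stopped
print(first_run, second_run)
```

Without the offset file, the second run would start again at offset 0 and reprocess everything, which is the duplicate-processing symptom listed above.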


In [None]:
### ❌ BAD: No checkpoint (stream can't recover!)

# This will work initially but has no fault tolerance
# If it crashes, you lose progress and may duplicate data

try:
    bad_query = events_stream \
        .groupBy("event_type") \
        .count() \
        .writeStream \
        .outputMode("complete") \
        .format("memory") \
        .queryName("bad_no_checkpoint") \
        .start()
    
    time.sleep(3)  # Let it run briefly
    bad_query.stop()
    
    print("⚠️ Stream ran but has NO fault tolerance!")
    print("🔴 If this crashes, progress is lost")
    print("🔴 On restart, may duplicate or skip data")
    
except Exception as e:
    print(f"Error: {e}")


In [None]:
### ✅ GOOD: With checkpoint (can recover from failures!)

# Clean up checkpoint directory if exists (for demo)
checkpoint_path = "/tmp/streaming_checkpoint_demo"

# dbutils.fs.rm(checkpoint_path, True)  # Uncomment in Databricks

good_query = events_stream \
    .groupBy("event_type") \
    .count() \
    .writeStream \
    .outputMode("complete") \
    .format("memory") \
    .queryName("good_with_checkpoint") \
    .option("checkpointLocation", checkpoint_path) \
    .start()

time.sleep(3)
print("✅ Stream running with checkpoint!")
print(f"📁 Checkpoint location: {checkpoint_path}")
print("\n💾 What's checkpointed:")
print("   • Source offsets (progress tracking)")
print("   • State store data (aggregations, joins)")
print("   • Metadata (schema, config)")
print("\n🔄 On restart: Resumes from last committed offset")

good_query.stop()


## Issue #2: No Watermarks = Unbounded State Growth

**The Problem**: Without watermarks, state for windowed/stateful operations grows forever.

**Symptoms**:
- Memory pressure and OOM
- Increasingly slow query performance
- State store keeps growing

**Solution**: Define watermarks for event-time operations.
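
The eviction rule can be sketched in plain Python. This is a simplified model rather than Spark's state store: `advance_watermark` is a hypothetical helper, and the window end times and counts are made-up sample values.

```python
from datetime import datetime, timedelta


def advance_watermark(open_windows, max_event_time, delay):
    """Return the watermark, the windows still kept in state, and the evicted window ends.

    open_windows maps each window's end time to its running count.
    """
    watermark = max_event_time - delay
    # A window can be finalized once the watermark reaches its end time
    kept = {end: n for end, n in open_windows.items() if end > watermark}
    evicted = sorted(end for end in open_windows if end <= watermark)
    return watermark, kept, evicted


open_windows = {
    datetime(2024, 1, 1, 12, 5): 40,
    datetime(2024, 1, 1, 12, 10): 55,
    datetime(2024, 1, 1, 12, 11): 61,
    datetime(2024, 1, 1, 12, 20): 12,
}

watermark, kept, evicted = advance_watermark(
    open_windows,
    max_event_time=datetime(2024, 1, 1, 12, 20),
    delay=timedelta(minutes=10),
)
print(watermark, sorted(kept), evicted)
```

With no watermark there is no cutoff, so nothing is ever evicted and the dictionary grows for as long as the stream runs.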


In [None]:
### ❌ BAD: Windowed aggregation without watermark (unbounded state!)

from pyspark.sql.functions import window

# Without watermark, Spark keeps ALL windows in memory forever
bad_windowed = events_stream \
    .groupBy(
        window(col("event_time"), "1 minute"),  # NO WATERMARK!
        col("event_type")
    ).agg(
        count("*").alias("event_count"),
        sum("amount").alias("total_amount")
    )

# This query would accumulate state forever
print("⚠️ This aggregation has NO watermark!")
print("🔴 State will grow unbounded → eventual OOM")
print("🔴 Every window from beginning of time is kept in memory")
print("\n📈 After 30 days: ~43,000 windows per event_type in memory!")


In [None]:
### ✅ GOOD: With watermark (bounded state, can evict old windows)

# Watermark allows Spark to drop state for windows older than threshold
good_windowed = events_stream \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(
        window(col("event_time"), "1 minute"),
        col("event_type")
    ).agg(
        count("*").alias("event_count"),
        sum("amount").alias("total_amount")
    )

print("✅ This aggregation has a 10-minute watermark!")
print("💡 Spark can drop state for windows that ended before the watermark")
print("📉 Bounded memory usage")
print("\n🎯 How watermark works:")
print("   1. Track max event_time seen so far")
print("   2. Watermark = max_event_time - threshold (10 min)")
print("   3. Drop all windows that end before watermark")
print("   4. Data arriving more than 10 min late may be dropped")

# Start the query briefly to show it works
checkpoint_wm = "/tmp/checkpoint_with_watermark"
query_wm = good_windowed \
    .writeStream \
    .outputMode("append") \
    .format("memory") \
    .queryName("with_watermark") \
    .option("checkpointLocation", checkpoint_wm) \
    .start()

time.sleep(3)
print("\n✅ Query running with bounded state!")
query_wm.stop()


## Issue #3: Wrong Output Mode

**The Problem**: Choosing the wrong output mode causes empty results, massive output, or a query that fails to start.

**Symptoms**:
- No data in sink
- Exploding output size
- Query fails with unsupported operation

**Solution**: Match output mode to your operation type.
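
As a rough decision aid, the matching rule can be written as a small lookup. This is a sketch covering only the simple cases discussed in this section; `supported_output_modes` is a hypothetical helper, and joins or arbitrary stateful operators have their own rules that it does not model.

```python
def supported_output_modes(has_aggregation, has_event_time_watermark=False):
    """Return the output modes a simple streaming query can use."""
    if not has_aggregation:
        # Stateless queries (filter, select, map): rows are never revised
        return ["append", "update"]
    if has_event_time_watermark:
        # Windows are finalized once the watermark passes them
        return ["append", "update", "complete"]
    # Results can keep changing forever, so append is not allowed
    return ["update", "complete"]


print(supported_output_modes(has_aggregation=False))
print(supported_output_modes(has_aggregation=True))
print(supported_output_modes(has_aggregation=True, has_event_time_watermark=True))
```

Asking for append on an aggregation without a watermark is the typical "unsupported operation" failure, because Spark can never tell when a group's result is final.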


In [None]:
### Understanding Output Modes

print("📝 Output Mode Guide:\n")

print("1️⃣ APPEND (default)")
print("   ✓ Use for: stateless operations, watermarked aggregations")
print("   • Only new rows since last trigger")
print("   • Lowest output volume")
print("   • Example: filtering, simple transformations\n")

print("2️⃣ UPDATE")
print("   ✓ Use for: aggregations, stateful operations")
print("   • Only changed rows since last trigger")
print("   • Medium output volume")
print("   • Example: running counts, per-key totals\n")

print("3️⃣ COMPLETE")
print("   ⚠️ Use sparingly: full table output every trigger")
print("   • ALL rows every time")
print("   • Highest output volume (can explode)")
print("   • Example: dashboards needing full state, small result sets")


In [None]:
### Example: Choosing the right mode

# Stateless transformation → APPEND
simple_filter = events_stream \
    .filter(col("event_type") == "purchase") \
    .select("event_time", "user_id", "amount")

print("✅ Simple filter → Use APPEND mode")

# Aggregation without watermark → UPDATE or COMPLETE
running_totals = events_stream \
    .groupBy("user_id") \
    .agg(
        count("*").alias("total_events"),
        sum("amount").alias("total_spent")
    )

print("✅ Aggregation (no watermark) → Use UPDATE or COMPLETE")

# Aggregation with watermark → APPEND
windowed_with_wm = events_stream \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(
        window(col("event_time"), "5 minutes"),
        col("event_type")
    ).count()

print("✅ Aggregation (with watermark) → Use APPEND (finalized windows only)")


## Issue #4: Small Files Problem in Streaming

**The Problem**: Streaming writes create many tiny files (one per trigger per partition).

**Symptoms**:
- Thousands of small files in sink
- Slow downstream reads
- High metadata overhead
- S3/cloud storage throttling

**Solution**: Coalesce before writing OR use adaptive sink features.
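
The file-count arithmetic behind this issue is easy to check. The sketch below assumes one output file per partition per trigger, as stated above; `files_per_hour` is a hypothetical helper.

```python
def files_per_hour(partitions, trigger_seconds):
    """Estimate files written per hour: one file per partition per trigger."""
    triggers_per_hour = 3600 // trigger_seconds
    return partitions * triggers_per_hour


default = files_per_hour(partitions=8, trigger_seconds=1)
coalesced = files_per_hour(partitions=2, trigger_seconds=1)
slower_trigger = files_per_hour(partitions=8, trigger_seconds=10)
both = files_per_hour(partitions=2, trigger_seconds=10)

print(default, coalesced, slower_trigger, both)
```

Combining fewer partitions with a longer trigger compounds the savings, which is why the two are usually tuned together.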


In [None]:
### ❌ BAD: Default behavior (many tiny files)

# With 8 shuffle partitions and triggers every second:
# → 8 files per second = 28,800 files per hour!

print("⚠️ Default streaming sink behavior:")
print(f"   • Shuffle partitions: {spark.conf.get('spark.sql.shuffle.partitions')}")
print("   • Trigger interval: every batch (e.g., 1 second)")
print(f"   • Files per trigger: {spark.conf.get('spark.sql.shuffle.partitions')} (one per partition)")
print("\n🔴 After 1 hour: ~28,800 tiny files!")
print("🔴 Slow reads, metadata overhead, cloud storage issues")


In [None]:
### ✅ GOOD: Coalesce before write (fewer, larger files)

# Solution 1: Coalesce to fewer partitions
coalesced_stream = events_stream \
    .filter(col("event_type") == "purchase") \
    .coalesce(2)  # Reduce to 2 files per trigger

print("✅ Solution 1: Coalesce")
print("   • Reduces partitions before write")
print("   • Trade-off: Less parallelism in writing")
print("   • Files per trigger: 2")
print("   • After 1 hour: ~7,200 files (vs 28,800)\n")

# Solution 2: Increase trigger interval
print("✅ Solution 2: Longer trigger interval")
print("   • .trigger(processingTime='10 seconds')")
print("   • Fewer triggers = fewer file batches")
print("   • After 1 hour: ~2,880 files (with 8 partitions)\n")

# Solution 3: Use Delta Lake with OPTIMIZE
print("✅ Solution 3: Delta Lake + OPTIMIZE")
print("   • Write to Delta Lake normally")
print("   • Run periodic OPTIMIZE command to compact")
print("   • Best of both worlds: fast writes + optimized reads")


## Issue #5: Non-Idempotent Sinks (Duplicates on Retry)

**The Problem**: Spark retries failed micro-batches, which causes duplicates if the sink isn't idempotent.

**Symptoms**:
- Duplicate records after restarts
- Incorrect aggregates downstream
- Double-charging customers

**Solution**: Use foreachBatch with upsert/merge logic.
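
The difference between an append sink and a keyed upsert can be shown with a plain-Python stand-in. This is only a model of the behavior, not Delta's MERGE: `append_sink` and `upsert_sink` are hypothetical functions, and the rows are made-up sample data keyed on `(user_id, event_time)`.

```python
def append_sink(table, batch):
    """Blind append: a replayed batch adds the same rows again."""
    table.extend(batch)


def upsert_sink(table, batch):
    """Keyed upsert: a replayed batch overwrites rows it already wrote."""
    for row in batch:
        table[(row["user_id"], row["event_time"])] = row


batch = [
    {"user_id": "1", "event_time": "2024-01-01T12:00:00", "amount": 20.0},
    {"user_id": "2", "event_time": "2024-01-01T12:00:01", "amount": 35.5},
]

appended, upserted = [], {}
for _ in range(2):  # the same micro-batch is delivered twice (a retry)
    append_sink(appended, batch)
    upsert_sink(upserted, batch)

print(len(appended), len(upserted))
print(sum(r["amount"] for r in appended), sum(r["amount"] for r in upserted.values()))
```

The appended table double-counts the revenue after the retry, while the upserted table is unchanged by it. This only holds if the key really identifies a record uniquely.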


In [None]:
### ❌ BAD: Direct append (not idempotent, can duplicate)

print("⚠️ Problem with simple append:")
print("""
# This looks innocent but can duplicate on retry
query = stream.writeStream \\
    .format("parquet") \\
    .outputMode("append") \\
    .option("checkpointLocation", "/chk") \\
    .start("/output")

🔴 If Spark retries a batch, the same files can be written twice!
🔴 Spark's file sink hides replayed files only from readers that honor its _spark_metadata log
🔴 Other readers, and sinks without transactional commits, see duplicates
""")


In [None]:
### ✅ GOOD: foreachBatch with idempotent logic (Delta MERGE)

print("✅ Solution: foreachBatch with upsert/merge\n")

# Example idempotent sink function
def write_to_delta_idempotent(batch_df, batch_id):
    """
    Idempotent write using Delta MERGE (upsert)
    If record exists (by key), update it; else insert
    """
    print(f"Processing batch {batch_id} with {batch_df.count()} records")
    
    # This is pseudocode - actual Delta merge example:
    # from delta.tables import DeltaTable
    # target = DeltaTable.forPath(spark, "/delta/purchases")
    # (target.alias("t")
    #   .merge(batch_df.alias("s"), "t.user_id = s.user_id AND t.event_time = s.event_time")
    #   .whenMatchedUpdateAll()  # If exists, update (idempotent)
    #   .whenNotMatchedInsertAll()  # If new, insert
    #   .execute())
    
    # For demo, just show the pattern
    batch_df.write.mode("append").format("noop").save()

print("""
def write_idempotent(batch_df, batch_id):
    # Delta MERGE keyed on a unique id makes replays safe
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

query = stream.writeStream \\
    .foreachBatch(write_idempotent) \\
    .option("checkpointLocation", "/chk") \\
    .start()

✅ Same batch processed twice → same result (idempotent)
✅ No duplicates
✅ Effectively exactly-once results in the target table
""")


## 🎯 Production Streaming Template

**Copy-paste starter for production streaming jobs:**


In [None]:
production_template = """
# Production Structured Streaming Template
from pyspark.sql.functions import *

# 1. Read from reliable source (Kafka, Kinesis, etc.)
stream = spark.readStream \\
    .format("kafka") \\
    .option("kafka.bootstrap.servers", "broker:9092") \\
    .option("subscribe", "events") \\
    .option("startingOffsets", "latest") \\
    .load()

# 2. Parse and transform
parsed = stream \\
    .select(from_json(col("value").cast("string"), schema).alias("data")) \\
    .select("data.*")

# 3. Add watermark for stateful operations
with_watermark = parsed \\
    .withWatermark("event_time", "10 minutes")

# 4. Business logic (aggregations, joins, etc.)
result = with_watermark \\
    .groupBy(
        window(col("event_time"), "5 minutes"),
        col("user_id")
    ).agg(
        count("*").alias("event_count"),
        sum("amount").alias("total_amount")
    )

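# 5. Write with idempotency
def write_batch(batch_df, batch_id):
    # Delta idempotent write: a replayed batch_id is skipped.
    # "events_stream_job" is a placeholder app id; use MERGE or a
    # database upsert instead if rows must be updated in place.
    batch_df.write \\
        .format("delta") \\
        .option("txnAppId", "events_stream_job") \\
        .option("txnVersion", batch_id) \\
        .mode("append") \\
        .save("/output/path")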

# 6. Start query with all safeguards
query = result.writeStream \\
    .foreachBatch(write_batch) \\
    .outputMode("append") \\
    .option("checkpointLocation", "/checkpoint/path") \\
    .trigger(processingTime="10 seconds") \\
    .start()

# 7. Block until the query stops (monitor via query.lastProgress)
query.awaitTermination()
"""

print(production_template)


## 🎯 Part 4 Summary: Streaming Production Checklist

### Top 5 Streaming Issues & Fixes

| Issue | Symptom | Fix | Critical? |
|-------|---------|-----|-----------|
| **No Checkpoint** | No fault tolerance, data loss | Set `checkpointLocation` | ⚠️ MUST HAVE |
| **No Watermark** | Unbounded state growth → OOM | `.withWatermark()` on event time | ⚠️ MUST HAVE |
| **Wrong Output Mode** | Empty results or exploding output | Match mode to operation | ⚠️ MUST HAVE |
| **Small Files** | Thousands of tiny files | Coalesce + longer triggers + OPTIMIZE | 🔧 SHOULD HAVE |
| **Not Idempotent** | Duplicates on retry | Use foreachBatch + MERGE | ⚠️ MUST HAVE |

### Pre-Production Checklist

**MUST HAVE** (correctness is at risk without these):
- ✅ Checkpoint location configured
- ✅ Watermark defined for stateful operations
- ✅ Correct output mode for operation type
- ✅ Idempotent sink (foreachBatch + merge/upsert)
- ✅ Schema explicitly defined (no inference)

**SHOULD HAVE** (performance & operations):
- ✅ Trigger interval tuned (not too frequent)
- ✅ Coalesce partitions before write
- ✅ Monitoring & alerting on lag/throughput
- ✅ Plan for schema evolution
- ✅ Periodic OPTIMIZE for Delta tables

**Monitoring Metrics**:
- Input rate (rows/sec)
- Processing time per batch
- End-to-end latency
- State store memory
- Consumer lag (Kafka)
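
Several of these metrics appear in the dictionary that `query.lastProgress` returns in PySpark. The sketch below checks a hand-written sample dictionary instead of a live query. The helper `check_progress`, the thresholds, and the sample values are all illustrative assumptions; only the field names (`inputRowsPerSecond`, `processedRowsPerSecond`, `durationMs`, `stateOperators`) follow Spark's progress report.

```python
def check_progress(progress, trigger_interval_ms, max_state_bytes):
    """Return a list of warnings for one streaming progress report."""
    warnings = []
    if progress["processedRowsPerSecond"] < progress["inputRowsPerSecond"]:
        warnings.append("falling behind: processing rate is below input rate")
    if progress["durationMs"]["triggerExecution"] > trigger_interval_ms:
        warnings.append("batch duration exceeds the trigger interval")
    state_bytes = sum(op["memoryUsedBytes"] for op in progress.get("stateOperators", []))
    if state_bytes > max_state_bytes:
        warnings.append("state store memory is above the configured limit")
    return warnings


# Illustrative sample values, not real measurements
sample_progress = {
    "inputRowsPerSecond": 1200.0,
    "processedRowsPerSecond": 800.0,
    "durationMs": {"triggerExecution": 14000},
    "stateOperators": [{"memoryUsedBytes": 600 * 1024 * 1024}],
}

alerts = check_progress(
    sample_progress,
    trigger_interval_ms=10_000,
    max_state_bytes=512 * 1024 * 1024,
)
for alert in alerts:
    print(alert)
```

In a real job the same function could be called on `query.lastProgress` after each trigger, once a `None` check is added for the period before the first batch completes.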

### Quick Debugging Commands

```python
# Check active streams
spark.streams.active

# Get stream status
query.status

# Check last progress
query.lastProgress

# View recent errors
query.exception()
```

### Common Patterns

**Pattern 1: Event-Time Windows**
```python
stream.withWatermark("event_time", "1 hour") \
    .groupBy(window("event_time", "10 minutes")) \
    .count()
```

**Pattern 2: Stream-Stream Join**
```python
stream1.withWatermark("time1", "10 minutes") \
    .join(stream2.withWatermark("time2", "10 minutes"), "key")
```

**Pattern 3: Deduplication**
```python
stream.withWatermark("event_time", "1 hour") \
    .dropDuplicates(["id", "event_time"])
```

---

## 🏆 Workshop Complete!

You now know how to identify and fix the most common Spark production issues:

**Part 3 - Batch**: Shuffles, skew, broadcast joins, UDFs, AQE  
**Part 4 - Streaming**: Checkpoints, watermarks, output modes, idempotency

**Next Steps**:
1. Practice with your own data
2. Use Spark UI to diagnose issues
3. Apply the SPARK_PRODUCTION_ISSUES.md playbook
4. Monitor, measure, optimize!

**Remember**: 
- 📊 Always check Spark UI first
- 🔧 Use `explain("formatted")` to verify plans
- 📈 Measure before and after optimizations
- 🎯 Focus on the 20% of issues causing 80% of problems!
