# Module 11: Spark Streaming Basics

**Difficulty**: ⭐⭐⭐  
**Estimated Time**: 80 minutes  
**Prerequisites**: 
- [Module 03: DataFrames and Datasets](03_dataframes_and_datasets.ipynb)
- [Module 05: DataFrame Operations](05_dataframe_operations.ipynb)
- Understanding of streaming concepts

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand Structured Streaming architecture and concepts (micro-batches, watermarks, triggers)
2. Read streaming data from various sources (files, sockets, generated streams)
3. Apply window operations (tumbling and sliding windows) for time-based aggregations
4. Use different output modes (append, complete, update) appropriately
5. Build real-time streaming applications with stateful operations and aggregations

## 1. Setup and Introduction

**What is Spark Structured Streaming?**

Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. It treats streaming data as a table that is continuously appended to.

**Key Concepts:**

1. **Streaming DataFrame**: Unbounded table that grows as new data arrives
2. **Micro-batch Processing**: Processes data in small batches (by default, a new batch starts as soon as the previous one finishes)
3. **Trigger**: Defines when Spark should process new data (e.g., every 5 seconds)
4. **Watermark**: Bounds how long Spark waits for late-arriving data in time-windowed operations, so old state can be dropped
5. **State Management**: Maintains intermediate state for aggregations

**Output Modes:**

- **Append**: Only new rows are written to the sink (for non-aggregated queries, or aggregations with a watermark)
- **Complete**: Entire result table is written every trigger (for aggregations only)
- **Update**: Only rows that changed since the last trigger are written (for aggregations; usually cheaper than complete)

**Sources:**
- File sources (JSON, CSV, Parquet)
- Socket sources (for testing)
- Kafka (production message queues)
- Rate source (for testing, generates data)

**Sinks:**
- Console (for debugging)
- File sink (Parquet, JSON, CSV)
- Memory sink (for testing)
- Kafka sink (for downstream processing)
- ForeachBatch (custom logic)

In [None]:
# Import required libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, window, count, sum as spark_sum, avg, max as spark_max, min as spark_min,
    current_timestamp, to_timestamp, date_format, hour, minute,
    explode, split, length, lower, when, lit, expr
)
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, 
    DoubleType, TimestampType, LongType
)

import time
import random
import threading
from datetime import datetime, timedelta

In [None]:
# Create Spark session for streaming
spark = SparkSession.builder \
    .appName("Spark Streaming Basics") \
    .config("spark.driver.memory", "2g") \
    .config("spark.sql.streaming.checkpointLocation", "/tmp/checkpoint") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
print(f"Spark version: {spark.version}")
print("Spark session created successfully!")

## 2. Reading Streaming Data

We'll start with the **Rate Source**, which generates streaming data automatically.

**Rate Source** generates rows with:
- `timestamp`: The time the row was generated
- `value`: A counter (0, 1, 2, ...)

**Parameters:**
- `rowsPerSecond`: How many rows to generate per second
- `rampUpTime`: Time to reach full rate (optional)
- `numPartitions`: Number of partitions for generated data

In [None]:
# Create a streaming DataFrame using rate source
# Generates 5 rows per second
rate_stream = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", 5) \
    .load()

# Check if it's a streaming DataFrame
print(f"Is streaming: {rate_stream.isStreaming}")
print("\nSchema:")
rate_stream.printSchema()

In [None]:
# Write to console sink to see the data
# This starts a streaming query
query = rate_stream.writeStream \
    .outputMode("append") \
    .format("console") \
    .option("truncate", "false") \
    .option("numRows", 10) \
    .start()

print("Streaming query started. Will show data for 10 seconds...")
time.sleep(10)  # Let it run for 10 seconds

# Stop the query
query.stop()
print("\nQuery stopped.")

## 3. Basic Stream Transformations

Streaming DataFrames support most of the same operations as batch DataFrames:
- `select`, `filter`, `where`
- `withColumn`, `withColumnRenamed`
- Aggregations (with special considerations)

Let's create a more realistic streaming scenario: **sensor data**.

In [None]:
# Transform rate stream into sensor data
# Simulate temperature sensors from different rooms
sensor_stream = rate_stream \
    .select(
        col("timestamp"),
        col("value"),
        # Simulate room IDs (Room_0 to Room_4)
        (col("value") % 5).cast("string").alias("room_id"),
        # Simulate temperature readings (20.0-29.2°C)
        (20 + (col("value") % 10) + (col("value") % 3) / 10.0).alias("temperature"),
        # Simulate humidity (50-69%)
        (50 + (col("value") % 20)).alias("humidity")
    ) \
    .withColumn("room_name", expr("concat('Room_', room_id)")) \
    .select("timestamp", "room_name", "temperature", "humidity")

print("Sensor stream schema:")
sensor_stream.printSchema()
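
The column expressions in `sensor_stream` are plain modular arithmetic, so their value ranges can be checked without starting a stream. The sketch below re-implements them in ordinary Python; the helper name `simulate_reading` is made up for illustration and is not part of the notebook or of Spark.

```python
def simulate_reading(value: int) -> dict:
    """Mirror the sensor_stream expressions for one rate-source counter value."""
    return {
        "room_name": f"Room_{value % 5}",
        # round() only hides floating-point noise from the division by 10.0
        "temperature": round(20 + (value % 10) + (value % 3) / 10.0, 1),
        "humidity": 50 + (value % 20),
    }


# 60 consecutive values cover every combination of value % 10, % 3, % 20 and % 5.
readings = [simulate_reading(v) for v in range(60)]

temperatures = [r["temperature"] for r in readings]
humidities = [r["humidity"] for r in readings]
alerts = [r for r in readings if r["temperature"] > 25]

print("temperature range:", min(temperatures), "to", max(temperatures))
print("humidity range:", min(humidities), "to", max(humidities))
print(len(alerts), "of", len(readings), "readings exceed 25°C")
```

Because the inputs are deterministic, a filter such as `temperature > 25` selects the same share of rows in every cycle of 60 counter values.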

In [None]:
# Filter and transform the stream
# Alert when temperature is too high (>25°C)
high_temp_stream = sensor_stream \
    .filter(col("temperature") > 25) \
    .select(
        col("timestamp"),
        col("room_name"),
        col("temperature"),
        lit("HIGH TEMPERATURE ALERT").alias("alert_type")
    )

# Start query to see high temperature alerts
alert_query = high_temp_stream.writeStream \
    .outputMode("append") \
    .format("console") \
    .option("truncate", "false") \
    .start()

print("High temperature alert stream started. Running for 10 seconds...")
time.sleep(10)

alert_query.stop()
print("Alert query stopped.")

## 4. Aggregations in Streaming

**Streaming aggregations** are more complex than batch because:
- Data arrives continuously
- We need to maintain state
- Late-arriving data must be handled

**Types of Aggregations:**

1. **Global Aggregations**: Aggregate all data (`complete` or `update` mode; `append` is not supported without a watermark)
2. **Grouped Aggregations**: Group by key (`update` or `complete` mode)
3. **Windowed Aggregations**: Group by time windows (`update` or `complete`; `append` only with a watermark)

In [None]:
# Global aggregation: count all events
# Use outputMode("complete") or "update"; "append" is rejected for an aggregation without a watermark
global_count = sensor_stream \
    .groupBy() \
    .count()

global_query = global_count.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

print("Global count query started. Running for 10 seconds...")
time.sleep(10)

global_query.stop()
print("Global count query stopped.")

In [None]:
# Grouped aggregation: statistics per room
# Can use outputMode("update") to only show changed rows
room_stats = sensor_stream \
    .groupBy("room_name") \
    .agg(
        count("*").alias("num_readings"),
        avg("temperature").alias("avg_temperature"),
        spark_max("temperature").alias("max_temperature"),
        spark_min("temperature").alias("min_temperature"),
        avg("humidity").alias("avg_humidity")
    )

room_query = room_stats.writeStream \
    .outputMode("update") \
    .format("console") \
    .option("truncate", "false") \
    .start()

print("Room statistics query started. Running for 15 seconds...")
time.sleep(15)

room_query.stop()
print("Room statistics query stopped.")

## 5. Window Operations

**Time Windows** allow aggregating data over sliding or tumbling time intervals.

**Tumbling Windows:**
- Non-overlapping, fixed-size intervals
- Example: Every 1 minute (00:00-01:00, 01:00-02:00, ...)
- Syntax: `window(col("timestamp"), "1 minute")`

**Sliding Windows:**
- Overlapping intervals
- Example: 10-minute window, sliding every 5 minutes
- Syntax: `window(col("timestamp"), "10 minutes", "5 minutes")`

**Window Column:**
- Contains `start` and `end` timestamps
- Can be used in `groupBy`
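
To make the difference between tumbling and sliding windows concrete, here is a minimal plain-Python sketch of how an event time maps to windows. It assumes times are whole seconds and that windows are aligned to time 0, which corresponds to Spark's default start offset. The function name `assign_windows` is hypothetical; it is a model of the idea, not Spark's implementation.

```python
from typing import List, Optional, Tuple


def assign_windows(
    event_time: int, window: int, slide: Optional[int] = None
) -> List[Tuple[int, int]]:
    """Return every [start, end) window, in seconds, that contains event_time.

    Leaving slide unset gives tumbling windows (slide == window).
    """
    slide = window if slide is None else slide
    # Start of the latest window that begins at or before the event.
    start = event_time - (event_time % slide)
    windows = []
    # Walk backwards one slide at a time while the window still covers the event.
    while start > event_time - window:
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)


# Tumbling: each event falls into exactly one 10-second window.
print(assign_windows(25, window=10))

# Sliding: a 20-second window that slides every 10 seconds covers each event twice.
print(assign_windows(25, window=20, slide=10))
```

With a sliding window, every event contributes to `window / slide` overlapping groups, which is why sliding aggregations keep more state than tumbling ones.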

In [None]:
# Tumbling window: 10-second windows
# Count events in each 10-second window
tumbling_window = sensor_stream \
    .groupBy(
        window(col("timestamp"), "10 seconds")
    ) \
    .count() \
    .select(
        col("window.start").alias("window_start"),
        col("window.end").alias("window_end"),
        col("count")
    )

tumbling_query = tumbling_window.writeStream \
    .outputMode("update") \
    .format("console") \
    .option("truncate", "false") \
    .start()

print("Tumbling window query (10-second windows) started...")
time.sleep(25)  # Run long enough to see multiple windows

tumbling_query.stop()
print("Tumbling window query stopped.")

In [None]:
# Sliding window: 20-second window, sliding every 10 seconds
# This creates overlapping windows
sliding_window = sensor_stream \
    .groupBy(
        window(col("timestamp"), "20 seconds", "10 seconds"),
        col("room_name")
    ) \
    .agg(
        count("*").alias("num_readings"),
        avg("temperature").alias("avg_temp")
    ) \
    .select(
        col("room_name"),
        col("window.start").alias("window_start"),
        col("window.end").alias("window_end"),
        col("num_readings"),
        col("avg_temp")
    )

sliding_query = sliding_window.writeStream \
    .outputMode("update") \
    .format("console") \
    .option("truncate", "false") \
    .start()

print("Sliding window query (20s window, 10s slide) started...")
time.sleep(30)

sliding_query.stop()
print("Sliding window query stopped.")

## 6. Watermarking for Late Data

**Problem**: In real-world streaming, data can arrive late (network delays, clock skew, etc.)

**Solution**: Watermarks define how long to wait for late data

**How it works:**
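- Watermark = `max_event_time - threshold`, recomputed as micro-batches complete
- Events older than the watermark are considered too late and may be dropped
- State for windows that end before the watermark can be cleaned up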

**Example:**
- If watermark is "10 minutes"
- And latest event time is 12:30
- Then events before 12:20 are treated as too late and may be dropped
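
The arithmetic in this example can be written out directly. The sketch below is a simplified model in plain Python: the calendar date and the function names `compute_watermark` and `window_is_final` are assumptions made for illustration. In Spark the watermark is updated between micro-batches, so the engine may still accept some rows that this model would call late.

```python
from datetime import datetime, timedelta


def compute_watermark(max_event_time: datetime, threshold: timedelta) -> datetime:
    """Watermark = latest event time seen so far minus the allowed lateness."""
    return max_event_time - threshold


def window_is_final(window_end: datetime, watermark: datetime) -> bool:
    """A window stops accepting updates once the watermark reaches its end."""
    return window_end <= watermark


# Latest event seen at 12:30 with a 10-minute threshold (the date is arbitrary).
latest_event = datetime(2024, 1, 1, 12, 30)
wm = compute_watermark(latest_event, timedelta(minutes=10))
print("watermark:", wm.time())

# A window ending at 12:15 is already final; one ending at 12:25 is still open.
print(window_is_final(datetime(2024, 1, 1, 12, 15), wm))
print(window_is_final(datetime(2024, 1, 1, 12, 25), wm))
```

The same rule explains the delay seen in append mode: a window is only written out after the watermark has passed its end, that is, roughly one threshold after the window closes in event time.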

In [None]:
# Window aggregation with watermark
# Wait up to 30 seconds for late data
watermarked_stream = sensor_stream \
    .withWatermark("timestamp", "30 seconds") \
    .groupBy(
        window(col("timestamp"), "10 seconds"),
        col("room_name")
    ) \
    .agg(
        count("*").alias("count"),
        avg("temperature").alias("avg_temp")
    ) \
    .select(
        col("room_name"),
        col("window.start").alias("start"),
        col("window.end").alias("end"),
        col("count"),
        col("avg_temp")
    )

# With watermark, we can use append mode for windowed aggregations
watermark_query = watermarked_stream.writeStream \
    .outputMode("append") \
    .format("console") \
    .option("truncate", "false") \
    .start()

print("Watermarked window query started (30s watermark)...")
print("Windows will be finalized after watermark delay...")
time.sleep(45)  # Must exceed window length + watermark delay (10s + 30s) before the first window is emitted

watermark_query.stop()
print("Watermarked query stopped.")

## 7. Output Modes in Detail

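Let's compare output modes using the same query. The cells below run it in complete and update mode; append mode is not supported for this aggregation because it has no watermark.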

**Append Mode:**
- Only outputs new rows
- Works for: non-aggregated queries, watermarked aggregations
- Doesn't work for: aggregations without watermark

**Complete Mode:**
- Outputs entire result table every time
- Works for: aggregations
- Expensive for large result sets

**Update Mode:**
- Only outputs rows that changed since last trigger
- Works for: aggregations
- Usually the cheapest choice for aggregations, since unchanged rows are not rewritten
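
The difference between complete and update mode is easiest to see on a toy example. The sketch below keeps a running count per key in plain Python and records what each mode would hand to the sink after every micro-batch. The function name `run_batches` and the sample batches are invented for illustration. Append mode is left out because a grouped count without a watermark cannot use it.

```python
from collections import Counter
from typing import Dict, Iterable, List


def run_batches(batches: Iterable[List[str]]) -> List[Dict[str, Dict[str, int]]]:
    """Maintain a running count per key and record each mode's output per batch."""
    state: Counter = Counter()
    emitted = []
    for batch in batches:
        touched = set(batch)   # keys whose count changes in this batch
        state.update(batch)    # incremental state update
        emitted.append({
            # complete: the whole result table, changed or not
            "complete": dict(state),
            # update: only the rows changed by this batch
            "update": {key: state[key] for key in sorted(touched)},
        })
    return emitted


batches = [["Room_0", "Room_1"], ["Room_1"], ["Room_2", "Room_2"]]
outputs = run_batches(batches)
for i, out in enumerate(outputs):
    print(f"batch {i}: complete={out['complete']} update={out['update']}")
```

In this model the complete output grows with the number of keys, while the update output stays proportional to the size of each batch.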

In [None]:
# Same aggregation query to test different modes
test_agg = sensor_stream \
    .groupBy("room_name") \
    .agg(count("*").alias("count"))

# Complete mode: Shows ALL rooms every time
print("\n=== COMPLETE MODE ===")
print("Shows all rooms every trigger (even unchanged ones)\n")

complete_query = test_agg.writeStream \
    .outputMode("complete") \
    .format("console") \
    .trigger(processingTime="5 seconds") \
    .start()

time.sleep(12)
complete_query.stop()
print("\nComplete mode query stopped.")

In [None]:
# Update mode: Shows only CHANGED rooms
print("\n=== UPDATE MODE ===")
print("Shows only rooms with new data since last trigger\n")

update_query = test_agg.writeStream \
    .outputMode("update") \
    .format("console") \
    .trigger(processingTime="5 seconds") \
    .start()

time.sleep(12)
update_query.stop()
print("\nUpdate mode query stopped.")

## 8. Triggers

**Triggers** control when Spark processes new data.

**Types:**

1. **Default (micro-batch)**: Processes as soon as previous batch finishes
2. **Fixed interval**: `trigger(processingTime="10 seconds")`
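3. **One-time**: `trigger(once=True)` - processes all available data in a single batch and stops (deprecated since Spark 3.4 in favor of `trigger(availableNow=True)`, which may split the work across several batches)
4. **Continuous**: `trigger(continuous="1 second")` - low-latency processing (experimental)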

**Choosing triggers:**
- **High throughput**: Default or long intervals (e.g., 1 minute)
- **Low latency**: Short intervals (e.g., 1 second) or continuous
- **Batch-like**: One-time trigger
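
A fixed-interval trigger does not guarantee perfectly regular batches. According to the Structured Streaming guide, if a micro-batch finishes within the interval the engine waits for the interval to end, and if it overruns, the next micro-batch starts as soon as the previous one completes. The sketch below is a simplified plain-Python model of that rule. The function name `simulate_trigger_starts`, the integer-second clock, and the batch durations are all assumptions for illustration.

```python
from typing import List


def simulate_trigger_starts(batch_durations: List[int], interval: int) -> List[int]:
    """Return the start time (in seconds) of each micro-batch.

    Interval boundaries sit at multiples of `interval`, counted from time 0.
    """
    starts = []
    now = 0
    for duration in batch_durations:
        starts.append(now)
        next_boundary = (now // interval + 1) * interval
        finished = now + duration
        # Wait for the boundary, unless the batch already ran past it.
        now = max(finished, next_boundary)
    return starts


# The third batch takes 7 seconds, longer than the 5-second interval,
# so the fourth batch starts late, at 17 instead of 15.
starts = simulate_trigger_starts([1, 2, 7, 1], interval=5)
print(starts)
```

When batches regularly take longer than the interval, the trigger setting no longer controls the cadence and the query falls behind its input.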

In [None]:
# Fixed interval trigger: Process every 5 seconds
triggered_stream = sensor_stream \
    .groupBy("room_name") \
    .count()

print("Starting query with 5-second trigger interval...")
print("Notice that batches start roughly every 5 seconds\n")

trigger_query = triggered_stream.writeStream \
    .outputMode("update") \
    .format("console") \
    .trigger(processingTime="5 seconds") \
    .start()

time.sleep(16)  # Long enough for about 3 triggers
trigger_query.stop()
print("Triggered query stopped.")

## 9. Memory Sink for Testing

**Memory sink** writes to an in-memory table that can be queried.

**Use cases:**
- Testing streaming logic
- Debugging aggregations
- Quick prototyping

**Warning**: Only for testing! The whole output is held in driver memory and is lost when the session ends.

In [None]:
# Write to memory sink with table name
memory_stream = sensor_stream \
    .groupBy("room_name") \
    .agg(
        count("*").alias("total_readings"),
        avg("temperature").alias("avg_temp"),
        avg("humidity").alias("avg_humidity")
    )

memory_query = memory_stream.writeStream \
    .outputMode("complete") \
    .format("memory") \
    .queryName("room_stats_table") \
    .start()

print("Memory sink query started. Letting it accumulate data...")
time.sleep(10)

# Query the in-memory table
print("\nQuerying the in-memory table:")
spark.sql("SELECT * FROM room_stats_table ORDER BY room_name").show(truncate=False)

# Wait and query again to see updates
time.sleep(5)
print("\nQuerying again after 5 more seconds:")
spark.sql("SELECT * FROM room_stats_table ORDER BY room_name").show(truncate=False)

memory_query.stop()
print("\nMemory query stopped.")

## 10. Exercises

### Exercise 1: Click Stream Analysis

Simulate and analyze website click stream data.

**Tasks:**
1. Create a streaming source using rate source
2. Transform into click events with: user_id, page, action (view/click/purchase)
3. Count actions per page in 30-second tumbling windows
4. Filter for 'purchase' actions and display in real-time
5. Run for 1 minute and observe results

In [None]:
# Your code here
# TODO: Create click stream
# TODO: Add window aggregations
# TODO: Filter and display purchases

### Exercise 2: Real-time Stock Price Monitoring

Build a real-time stock price monitoring system.

**Tasks:**
1. Generate streaming stock prices (symbol, price, volume)
2. Calculate 1-minute moving average for each stock
3. Alert when price changes by >5% in a 1-minute window
4. Display top 3 stocks by volume every 30 seconds
5. Use update mode for efficiency

In [None]:
# Your code here
# TODO: Generate stock price stream
# TODO: Calculate moving averages
# TODO: Implement price change alerts
# TODO: Track volume leaders

### Exercise 3: IoT Device Monitoring with Watermarks

Monitor IoT devices and handle late-arriving data.

**Tasks:**
1. Simulate IoT device data (device_id, metric_type, value, timestamp)
2. Add watermark of 1 minute to handle late data
3. Calculate statistics per device in 2-minute sliding windows (slide: 1 minute)
4. Use append mode to only output finalized windows
5. Verify that late data beyond watermark is dropped

In [None]:
# Your code here
# TODO: Generate IoT data stream
# TODO: Add watermark
# TODO: Implement sliding windows
# TODO: Use append mode

## 11. Exercise Solutions

### Solution 1: Click Stream Analysis

In [None]:
# Generate click stream data
click_stream = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", 10) \
    .load() \
    .select(
        col("timestamp"),
        (col("value") % 100).alias("user_id"),
        expr("CASE WHEN value % 5 = 0 THEN 'home' "
             "WHEN value % 5 = 1 THEN 'products' "
             "WHEN value % 5 = 2 THEN 'cart' "
             "WHEN value % 5 = 3 THEN 'checkout' "
             "ELSE 'about' END").alias("page"),
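        # Note: value % 10 == 9 implies value % 5 == 4, so every simulated purchase lands on the 'about' page
        expr("CASE WHEN value % 10 < 6 THEN 'view' "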
             "WHEN value % 10 < 9 THEN 'click' "
             "ELSE 'purchase' END").alias("action")
    )

# Window aggregation: count actions per page in 30-second windows
click_agg = click_stream \
    .groupBy(
        window(col("timestamp"), "30 seconds"),
        col("page"),
        col("action")
    ) \
    .count() \
    .select(
        col("window.start").alias("window_start"),
        col("page"),
        col("action"),
        col("count")
    )

# Start aggregation query
agg_query = click_agg.writeStream \
    .outputMode("update") \
    .format("console") \
    .option("truncate", "false") \
    .start()

# Filter for purchases
purchases = click_stream \
    .filter(col("action") == "purchase") \
    .select("timestamp", "user_id", "page")

purchase_query = purchases.writeStream \
    .outputMode("append") \
    .format("console") \
    .option("truncate", "false") \
    .start()

print("Click stream analysis started. Running for 1 minute...\n")
time.sleep(60)

agg_query.stop()
purchase_query.stop()
print("\nClick stream queries stopped.")

### Solution 2: Real-time Stock Price Monitoring

In [None]:
# Generate stock price stream
stocks = ["AAPL", "GOOGL", "MSFT", "AMZN", "TSLA"]

stock_stream = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", 15) \
    .load() \
    .select(
        col("timestamp"),
        expr(f"CASE WHEN value % 5 = 0 THEN '{stocks[0]}' "
             f"WHEN value % 5 = 1 THEN '{stocks[1]}' "
             f"WHEN value % 5 = 2 THEN '{stocks[2]}' "
             f"WHEN value % 5 = 3 THEN '{stocks[3]}' "
             f"ELSE '{stocks[4]}' END").alias("symbol"),
        (100 + (col("value") % 50) + (col("value") % 10) / 10.0).alias("price"),
        (1000 + (col("value") % 500) * 100).alias("volume")
    )

# Per-stock statistics over 1-minute tumbling windows (a simple stand-in for a moving average)
moving_avg = stock_stream \
    .withWatermark("timestamp", "2 minutes") \
    .groupBy(
        window(col("timestamp"), "1 minute"),
        col("symbol")
    ) \
    .agg(
        avg("price").alias("avg_price"),
        spark_min("price").alias("min_price"),
        spark_max("price").alias("max_price"),
        spark_sum("volume").alias("total_volume")
    ) \
    .withColumn(
        "price_change_pct",
        ((col("max_price") - col("min_price")) / col("min_price") * 100)
    ) \
    .select(
        col("symbol"),
        col("window.start").alias("window_start"),
        col("avg_price"),
        col("min_price"),
        col("max_price"),
        col("price_change_pct"),
        col("total_volume")
    )

# Alert on >5% price change
price_alerts = moving_avg \
    .filter(col("price_change_pct") > 5) \
    .select(
        col("symbol"),
        col("window_start"),
        col("price_change_pct"),
        lit("SIGNIFICANT PRICE MOVEMENT").alias("alert")
    )

# Start queries (Task 4, the top-3 ranking by volume, is not implemented in this solution)
avg_query = moving_avg.writeStream \
    .outputMode("update") \
    .format("console") \
    .option("truncate", "false") \
    .trigger(processingTime="30 seconds") \
    .start()

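# Update mode so alerts appear as soon as a window crosses the threshold.
# Append mode would hold each row until the 2-minute watermark passes the window end,
# which is later than this 90-second run.
alert_query = price_alerts.writeStream \
    .outputMode("update") \
    .format("console") \
    .option("truncate", "false") \
    .start()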

print("Stock monitoring started. Running for 90 seconds...\n")
time.sleep(90)

avg_query.stop()
alert_query.stop()
print("\nStock monitoring stopped.")

### Solution 3: IoT Device Monitoring with Watermarks

In [None]:
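# Generate IoT device stream
# Note: the rate source never emits late rows, so Task 5 (dropped late data) cannot be observed with this data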
iot_stream = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", 8) \
    .load() \
    .select(
        col("timestamp"),
        expr("concat('device_', cast(value % 10 as string))").alias("device_id"),
        expr("CASE WHEN value % 3 = 0 THEN 'temperature' "
             "WHEN value % 3 = 1 THEN 'humidity' "
             "ELSE 'pressure' END").alias("metric_type"),
        (20 + (col("value") % 30) + (col("value") % 5) / 10.0).alias("value")
    )

# Apply watermark and sliding windows
iot_stats = iot_stream \
    .withWatermark("timestamp", "1 minute") \
    .groupBy(
        window(col("timestamp"), "2 minutes", "1 minute"),
        col("device_id"),
        col("metric_type")
    ) \
    .agg(
        count("*").alias("num_readings"),
        avg("value").alias("avg_value"),
        spark_min("value").alias("min_value"),
        spark_max("value").alias("max_value")
    ) \
    .select(
        col("device_id"),
        col("metric_type"),
        col("window.start").alias("window_start"),
        col("window.end").alias("window_end"),
        col("num_readings"),
        col("avg_value"),
        col("min_value"),
        col("max_value")
    )

# Use append mode (for aggregations this requires a watermark)
iot_query = iot_stats.writeStream \
    .outputMode("append") \
    .format("console") \
    .option("truncate", "false") \
    .start()

print("IoT monitoring with watermark started...")
print("Windows will be finalized 1 minute after their end time.\n")
time.sleep(180)  # Run for 3 minutes to see multiple windows

iot_query.stop()
print("\nIoT monitoring stopped.")

## 12. Summary

Congratulations! You've learned the fundamentals of Spark Structured Streaming.

### Key Takeaways:

1. **Structured Streaming Basics**:
   - Streaming DataFrames are unbounded tables
   - Largely the same API as batch DataFrames
   - Micro-batch processing model
   - Fault-tolerant with checkpointing

2. **Data Sources**:
   - Rate source: For testing and demos
   - File source: JSON, CSV, Parquet from directories
   - Socket source: For simple network streams
   - Kafka: Production message queues (not covered here)

3. **Aggregations**:
   - Global aggregations: Use complete or update mode
   - Grouped aggregations: Can use update mode
   - Windowed aggregations: Time-based grouping
   - State is automatically managed by Spark

4. **Window Operations**:
   - Tumbling windows: Non-overlapping, fixed intervals
   - Sliding windows: Overlapping intervals
   - Used with `window()` function in `groupBy`
   - Essential for time-series analytics

5. **Watermarks**:
   - Handle late-arriving data
   - Allow state cleanup
   - Enable append mode for aggregations
   - Balance between completeness and resource usage

6. **Output Modes**:
   - Append: New rows only (non-agg or watermarked agg)
   - Complete: Full result every time (memory intensive)
   - Update: Only changed rows (most efficient for agg)

7. **Triggers**:
   - Control processing frequency
   - Trade-off between latency and throughput
   - Fixed intervals for predictable processing
   - One-time for batch-like processing

### Best Practices:

- Use watermarks for time-windowed aggregations to prevent unbounded state
- Choose appropriate trigger intervals based on latency requirements
- Use update mode instead of complete mode when possible (more efficient)
- Always configure checkpoint location for production
- Monitor streaming query progress and metrics
- Test with rate/socket sources before deploying with Kafka
- Handle schema evolution carefully in production

### Production Considerations:

- **Checkpointing**: Enable for fault tolerance (`checkpointLocation`)
- **Monitoring**: Track metrics like processing time, input rate, batch duration
- **Backpressure**: Cap the input per trigger (e.g., `maxOffsetsPerTrigger` for Kafka, `maxFilesPerTrigger` for file sources) to prevent overload
- **State Management**: Tune state store configurations for large state
- **Resource Allocation**: Allocate enough memory for state and processing

### What's Next?

In [Module 12: Performance Optimization](12_performance_optimization.ipynb), you'll learn:
- Partitioning strategies for optimal data distribution
- Caching and persistence levels
- Broadcast variables and joins
- Avoiding shuffle operations
- Performance tuning and monitoring

### Additional Resources:

- [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
- [Structured Streaming + Kafka Integration](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html)
- [Watermarking in Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking)

In [None]:
# Clean up
spark.stop()
print("Spark session stopped. Excellent work on streaming!")