# Spark Streaming Concepts

Interactive exploration of streaming data processing with PySpark. This notebook covers the concepts demonstrated in the Scala streaming examples.

## Streaming Types Overview

### DStream (Discretized Stream)
- **Basic streaming**: Socket, file, Kafka sources
- **Micro-batch processing**: Fixed time intervals
- **RDD-based**: Each batch is an RDD
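
To make the micro-batch idea concrete, here is a minimal plain-Python sketch (not the Spark API) of how records might be grouped into fixed intervals and counted per batch. The helper names `discretize` and `word_count` and the sample arrival times are hypothetical, and unlike a real DStream the sketch simply skips intervals that received no data.

```python
from collections import Counter, defaultdict

def discretize(records, batch_seconds):
    """Group (arrival_time_seconds, line) pairs into fixed-interval batches."""
    batches = defaultdict(list)
    for arrival, line in records:
        batch_index = int(arrival // batch_seconds)  # interval the record arrived in
        batches[batch_index].append(line)
    return [batches[i] for i in sorted(batches)]

def word_count(batch):
    """Per-batch word count, a plain-Python analogue of flatMap + map + reduceByKey."""
    return Counter(word for line in batch for word in line.split())

# Hypothetical arrivals: (seconds since start, text line)
records = [(0.5, "spark kafka"), (3.0, "spark"), (6.2, "kafka kafka"), (9.9, "hadoop")]

batches = discretize(records, batch_seconds=5)
per_batch_counts = [dict(word_count(batch)) for batch in batches]
print(per_batch_counts)
```

Each batch is counted independently, which is why per-batch counts reset between intervals unless state is carried over explicitly.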

### Structured Streaming
- **DataFrame/Dataset API**: Unified with batch processing
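- **Event-time processing**: Windows and watermarks operate on event timestamps; queries run as micro-batches by default, with an experimental continuous mode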
- **SQL integration**: Declarative streaming queries
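
Structured Streaming treats a stream as a table that keeps growing, and updates the query result incrementally. The following plain-Python sketch (not the Spark API) illustrates the difference between emitting the whole result ("complete" output mode) and emitting only changed rows ("update" output mode). The class `RunningWordCount` is a hypothetical stand-in for a streaming aggregation.

```python
from collections import Counter

class RunningWordCount:
    """Keeps a result table that is updated incrementally as batches arrive."""

    def __init__(self):
        self.result = Counter()

    def process(self, batch_lines):
        before = dict(self.result)
        for line in batch_lines:
            self.result.update(line.split())
        complete = dict(self.result)  # whole result table
        # Only the rows whose count changed in this batch
        update = {w: c for w, c in complete.items() if before.get(w) != c}
        return complete, update

query = RunningWordCount()
complete_1, update_1 = query.process(["spark kafka"])
complete_2, update_2 = query.process(["spark"])
print(complete_2, update_2)
```

After the second batch the complete view still contains `kafka`, while the update view contains only the changed `spark` row.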

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import time

# Create Spark session for streaming
spark = SparkSession.builder \
    .appName("StreamingConcepts") \
    .config("spark.sql.streaming.checkpointLocation", "/tmp/checkpoint") \
    .getOrCreate()

print(f"Spark version: {spark.version}")
print("Streaming session ready")

## 1. Socket Streaming (DStream)

Equivalent to basic-streaming/streaming_*.scala examples:

In [None]:
# Note: Socket streaming requires a netcat server running
# Terminal command: nc -lk 9999
# Then send text data to see streaming in action

# Socket streaming setup (uncomment to run)
# from pyspark.streaming import StreamingContext

# ssc = StreamingContext(spark.sparkContext, 5)  # 5 second batches

# Create socket stream
# lines = ssc.socketTextStream("localhost", 9999)

# Process each batch
# words = lines.flatMap(lambda line: line.split(" "))
# word_counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# Print results
# word_counts.pprint()

# Start streaming
# ssc.start()
# ssc.awaitTermination()

print("Socket streaming code ready (requires netcat server on port 9999)")

## 2. File Streaming

Monitor directory for new files - equivalent to file-streaming examples:

In [None]:
# Create sample data directory
import os
os.makedirs("sample_data", exist_ok=True)

# Create sample files
sample_texts = [
    "spark hadoop kafka streaming",
    "machine learning data science",
    "big data analytics pipeline"
]

for i, text in enumerate(sample_texts):
    with open(f"sample_data/file_{i+1}.txt", "w") as f:
        f.write(text)

print("Sample files created in sample_data/ directory")
!ls -la sample_data/

In [None]:
# File streaming setup (uncomment to run)
# file_stream = spark \
#     .readStream \
#     .text("sample_data/") \
#     .withColumn("timestamp", current_timestamp())

# Process streaming files
# word_counts = file_stream \
#     .select(split(col("value"), " ").alias("words"), "timestamp") \
#     .select(explode(col("words")).alias("word"), "timestamp") \
#     .groupBy("word") \
#     .count() \
#     .orderBy(desc("count"))

# Display streaming results
# query = word_counts \
#     .writeStream \
#     .outputMode("complete") \
#     .format("console") \
#     .trigger(processingTime="10 seconds") \
#     .start()

# query.awaitTermination()

print("File streaming code ready - monitors sample_data/ directory")
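
The idea of monitoring a directory can be sketched in plain Python as a poll that remembers which files it has already handled, so each file is processed once. This is only an illustration of the concept, not how Spark's file source is implemented; `discover_new_files` is a hypothetical helper, and the demo writes into a temporary directory rather than `sample_data/`.

```python
import os
import tempfile

def discover_new_files(directory, seen):
    """Return files in `directory` not processed before, and record them in `seen`."""
    current = sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    new_files = [name for name in current if name not in seen]
    seen.update(new_files)
    return new_files

# Use a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as watch_dir:
    seen = set()

    with open(os.path.join(watch_dir, "file_1.txt"), "w") as f:
        f.write("spark hadoop kafka streaming")
    first_poll = discover_new_files(watch_dir, seen)

    with open(os.path.join(watch_dir, "file_2.txt"), "w") as f:
        f.write("machine learning data science")
    second_poll = discover_new_files(watch_dir, seen)

    # Nothing new arrived, so the third poll finds no work.
    third_poll = discover_new_files(watch_dir, seen)

print(first_poll, second_poll, third_poll)
```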

## 3. Structured Streaming

Advanced streaming with DataFrame API - equivalent to structured-streaming examples:

In [None]:
# Sample structured data
streaming_data = [
    {"event_id": 1, "user_id": "alice", "action": "login", "timestamp": "2023-01-01 10:00:00"},
    {"event_id": 2, "user_id": "bob", "action": "view", "timestamp": "2023-01-01 10:01:00"},
    {"event_id": 3, "user_id": "alice", "action": "purchase", "timestamp": "2023-01-01 10:02:00"},
    {"event_id": 4, "user_id": "charlie", "action": "login", "timestamp": "2023-01-01 10:03:00"}
]

# Define schema
schema = StructType([
    StructField("event_id", IntegerType()),
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("timestamp", StringType())
])

# Create static DataFrame for demonstration
df = spark.createDataFrame(streaming_data, schema)
df = df.withColumn("timestamp", to_timestamp("timestamp"))
df.show()
print("Structured data ready for streaming operations")

## Windowing Operations

Time-based aggregations - equivalent to structured-streaming windowing:

In [None]:
# Windowed aggregations (would run on streaming data)
windowed_stats = df \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window("timestamp", "5 minutes"),
        "user_id",
        "action"
    ) \
    .count() \
    .orderBy("window", "user_id")

windowed_stats.show(truncate=False)

print("Windowing operations demonstrated on static data")
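
The arithmetic behind a tumbling window can be sketched in plain Python: each event falls into the fixed-width interval, aligned to the Unix epoch, that contains its timestamp. This is an illustration rather than Spark's implementation. The helpers `tumbling_window` and `utc` are hypothetical, timestamps are assumed to be UTC, the sketch groups by window and user only, and the 10:07 event is an extra record added here to show a second window.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def tumbling_window(event_time, width):
    """Return the [start, end) window of the given width that contains event_time."""
    offset = (event_time - EPOCH) % width  # time elapsed since the window opened
    start = event_time - offset
    return start, start + width

def utc(hour, minute):
    """Build a UTC timestamp on 2023-01-01 (the date used in the sample data)."""
    return datetime(2023, 1, 1, hour, minute, tzinfo=timezone.utc)

five_minutes = timedelta(minutes=5)
events = [
    ("alice", utc(10, 0)),
    ("bob", utc(10, 1)),
    ("alice", utc(10, 2)),
    ("charlie", utc(10, 3)),
    ("alice", utc(10, 7)),  # hypothetical extra event, not in the sample data
]

# Count events per (window start, user)
window_counts = Counter(
    (tumbling_window(ts, five_minutes)[0], user) for user, ts in events
)
for (start, user), count in sorted(window_counts.items()):
    print(start.strftime("%H:%M"), user, count)
```

An event exactly on a boundary, such as 10:05, belongs to the window that starts there, because windows are half-open intervals.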

## State Management

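Per-user session metrics, computed here on static data with window functions. A streaming query would have to carry this state across micro-batches using stateful operators: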

In [None]:
# User session tracking on the static DataFrame.
# Note: row_number() and lag() over a Window only work on batch DataFrames;
# a streaming query needs stateful operators to carry this across micro-batches.
from pyspark.sql.window import Window

# Calculate session metrics
session_window = Window.partitionBy("user_id").orderBy("timestamp")

user_sessions = df \
    .withColumn("session_id",
                concat(col("user_id"),
                       lit("_"),
                       date_format("timestamp", "yyyyMMdd"))) \
    .withColumn("event_number", row_number().over(session_window)) \
    .withColumn("time_diff",
                unix_timestamp("timestamp") -
                lag(unix_timestamp("timestamp")).over(session_window))

user_sessions.show()

print("State management concepts demonstrated")
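
What "state across batches" means can be sketched in plain Python: a per-user record survives from one batch to the next and a new session starts after a period of inactivity. This is a conceptual illustration, not the Spark API. The class `SessionTracker`, the 30-minute gap, and the event times (in seconds) are all assumptions made for the example.

```python
class SessionTracker:
    """Per-user state carried across batches: session number, event count, last-seen time."""

    def __init__(self, gap_seconds):
        self.gap_seconds = gap_seconds
        self.state = {}  # user_id -> {"session": int, "events": int, "last_seen": int}

    def update(self, batch):
        """Fold one batch of (user_id, event_time_seconds) pairs into the state."""
        for user_id, ts in sorted(batch, key=lambda event: event[1]):
            current = self.state.get(user_id)
            # Start a new session for a new user or after a long inactivity gap
            if current is None or ts - current["last_seen"] > self.gap_seconds:
                session = 1 if current is None else current["session"] + 1
                current = {"session": session, "events": 0, "last_seen": ts}
                self.state[user_id] = current
            current["events"] += 1
            current["last_seen"] = ts
        return self.state

tracker = SessionTracker(gap_seconds=1800)
tracker.update([("alice", 0), ("bob", 60)])
tracker.update([("alice", 120)])  # same session, state carried over
state_after_two = {user: dict(s) for user, s in tracker.state.items()}
tracker.update([("alice", 3720)])  # one hour later: a new session
print(state_after_two["alice"], tracker.state["alice"])
```

The state dictionary is exactly what a streaming engine would need to persist in its checkpoint so that sessions survive a restart.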

## Streaming Best Practices

### Checkpointing
- **Purpose**: Fault tolerance and state recovery
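- **Location**: Reliable shared storage such as HDFS or S3 (the `/tmp/checkpoint` path used above is suitable for local experiments only)
- **Frequency**: Structured Streaming records progress with every micro-batch; for DStreams the checkpoint interval is configurable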
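
The recovery idea behind checkpointing can be sketched in plain Python: persist the next offset to read together with the aggregation state after each batch, and reload both on restart. This is a simplified illustration, not Spark's checkpoint format. The functions `save_checkpoint`, `load_checkpoint`, and `run`, the JSON layout, and the in-memory `source` list are all hypothetical.

```python
import json
import os
import tempfile

def save_checkpoint(path, offset, counts):
    """Persist the next offset and the state; write-then-rename avoids partial files."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"offset": offset, "counts": counts}, f)
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Return (offset, counts) from the checkpoint, or a fresh start if none exists."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        data = json.load(f)
    return data["offset"], data["counts"]

def run(source, path, max_batches, batch_size=2):
    """Process up to max_batches batches, checkpointing after each one."""
    offset, counts = load_checkpoint(path)
    for _ in range(max_batches):
        batch = source[offset:offset + batch_size]
        if not batch:
            break
        for word in batch:
            counts[word] = counts.get(word, 0) + 1
        offset += len(batch)
        save_checkpoint(path, offset, counts)
    return offset, counts

source = ["spark", "kafka", "spark", "hadoop", "spark"]
with tempfile.TemporaryDirectory() as checkpoint_dir:
    checkpoint_path = os.path.join(checkpoint_dir, "checkpoint.json")
    first_run = run(source, checkpoint_path, max_batches=1)    # simulated crash after one batch
    second_run = run(source, checkpoint_path, max_batches=10)  # restart resumes from offset 2

print(first_run, second_run)
```

Because the offset and the counts are saved together, the restarted run neither re-reads the first two records nor loses their contribution to the totals.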

### Watermarking
- **Purpose**: Handle late-arriving data
- **Configuration**: Based on expected delays
- **Impact**: A longer delay keeps more window state in memory; a shorter delay drops more late data
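
The trade-off above can be illustrated with a simplified plain-Python model of a watermark: it trails the largest event time seen so far by a fixed delay, windows that end before it are finalized and removed from state, and events arriving for those windows are dropped. This is a sketch of the concept, not Spark's exact semantics (boundary comparisons and timing details are simplified). The function `run_with_watermark` and the event times in seconds are hypothetical.

```python
def run_with_watermark(batches, delay, window):
    """Count events per tumbling window, dropping events whose window is already closed."""
    max_event_time = None
    watermark = None
    open_counts = {}  # window start -> count, state still held in memory
    finalized = {}    # window start -> count, windows closed by the watermark
    dropped = []      # late events that arrived after their window closed

    for batch in batches:
        for ts in batch:
            window_start = ts - ts % window
            window_end = window_start + window
            if watermark is not None and watermark > window_end:
                dropped.append(ts)
                continue
            open_counts[window_start] = open_counts.get(window_start, 0) + 1

        # Advance the watermark once the batch has been processed
        if batch:
            batch_max = max(batch)
            max_event_time = batch_max if max_event_time is None else max(max_event_time, batch_max)
            watermark = max_event_time - delay

        # Evict windows the watermark has passed; this is what bounds memory use
        for window_start in list(open_counts):
            if watermark is not None and watermark > window_start + window:
                finalized[window_start] = open_counts.pop(window_start)

    return finalized, open_counts, dropped

# 10-minute delay, 5-minute windows; the event at t=100 arrives far too late
finalized, open_counts, dropped = run_with_watermark(
    batches=[[0, 60, 120], [1500], [100, 1510]],
    delay=600,
    window=300,
)
print(finalized, open_counts, dropped)
```

With a larger `delay`, the first window would stay in `open_counts` longer and the late event at `t=100` would still be counted, at the cost of holding more state.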

### Trigger Intervals
- **Default**: As fast as possible
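- **Cost-sensitive workloads**: Longer intervals (for example 1-10 minutes) trade latency for fewer, larger batches
- **Low latency**: Micro-batches add per-batch overhead; the experimental continuous trigger targets millisecond latencies but supports a restricted set of operations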
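
One way to keep these choices explicit is a small helper that maps a goal to keyword arguments for `writeStream.trigger(...)`. The helper `trigger_options` and its mode names are hypothetical. The keywords `processingTime`, `availableNow`, and `continuous` are the ones PySpark's `trigger()` accepts to the best of my knowledge, with `availableNow` assumed to require a recent Spark release. The sketch only builds the arguments and does not start a query.

```python
def trigger_options(mode, interval=None):
    """Map a latency/cost goal to keyword arguments for writeStream.trigger()."""
    if mode == "default":
        return {}  # no trigger() call: next micro-batch starts as soon as the previous ends
    if mode == "fixed":
        return {"processingTime": interval}  # e.g. "1 minute"
    if mode == "drain":
        return {"availableNow": True}  # process what is available, then stop
    if mode == "continuous":
        return {"continuous": interval}  # experimental; interval is the checkpoint interval
    raise ValueError(f"unknown trigger mode: {mode!r}")

cost_optimized = trigger_options("fixed", "5 minutes")
low_latency = trigger_options("continuous", "1 second")
print(cost_optimized, low_latency)

# Assumed usage with an existing streaming DataFrame `word_counts`:
# writer = word_counts.writeStream.format("console")
# options = trigger_options("fixed", "5 minutes")
# writer = writer.trigger(**options) if options else writer
```

The guard in the usage comment matters because `trigger()` expects exactly one trigger setting, so the default case should skip the call entirely.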

## Scala vs Python Streaming

### Scala Examples (streaming/):
- **streaming_*.scala**: Low-level DStream operations
- **Focus**: Infrastructure and performance
- **Use case**: Custom streaming logic

### Python Notebooks (notebooks/):
- **Structured Streaming**: High-level DataFrame API
- **Focus**: Data processing and analytics
- **Use case**: ML pipelines, data transformations

### Choosing the Right Approach:
- **Scala**: Performance-critical streaming, custom receivers
- **Python**: Data science workflows, ML integration, rapid development