# 🌊 Streaming Fundamentals: A Data Forge Lesson

**Learn real-time data streaming with Apache Spark, Kafka, and Avro**

This lesson demonstrates core streaming concepts using Data Forge's retail data generator. You'll understand why Avro beats JSON, how Schema Registry enables evolution, and how PySpark Structured Streaming works.

---

## 🎯 Learning Objectives

By the end of this lesson, you'll understand:

1. **Prerequisites** → How to start Data Forge's data generator
2. **Data Examination** → What streaming retail data looks like
3. **Avro vs JSON** → Why Avro is superior for streaming (with evidence)
4. **Schema Registry** → How it enables safe schema evolution
5. **PySpark Streaming** → How Structured Streaming processes infinite data
6. **Checkpoints** → Why they're critical for fault tolerance

---

## 📋 Prerequisites

**Before starting this lesson:**

```bash
# 1. Start Data Forge core services
docker compose --profile core up -d

# 2. Start the data generator
docker compose --profile datagen up -d

# 3. Verify data is flowing
docker compose logs --tail 20 data-generator
```

🛑 **Without the data generator, this lesson won't work.** The generator produces the streaming events we'll analyze.

---

## ⚙️ Setup & Configuration

Initialize our streaming environment and validate connections.

In [None]:
import os
from datetime import datetime

# Data Forge service endpoints
KAFKA_BOOTSTRAP = os.getenv('KAFKA_BOOTSTRAP_SERVERS', 'kafka:9092')
SCHEMA_REGISTRY_URL = os.getenv('SCHEMA_REGISTRY_URL', 'http://schema-registry:8081')
SPARK_MASTER = os.getenv('SPARK_MASTER_URL', 'spark://spark-master:7077')

print("🔥 Data Forge Streaming Configuration:")
print(f"   Kafka Bootstrap: {KAFKA_BOOTSTRAP}")
print(f"   Schema Registry: {SCHEMA_REGISTRY_URL}")
print(f"   Spark Master: {SPARK_MASTER}")
print(f"   Lesson Start: {datetime.now()}")
print("\n✅ Ready to learn streaming fundamentals!")

---

## 🔍 Lesson 1: Examining Streaming Data

**Goal:** Understand what real-time retail data looks like.

Data Forge's generator produces realistic business events: orders, payments, shipments, inventory changes, and customer interactions. Let's examine this data to understand streaming patterns.

In [None]:
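import json
import uuid

from confluent_kafka import Consumer

kafka_config = {
    'bootstrap.servers': KAFKA_BOOTSTRAP,  # reuse the endpoint configured above
    # Fresh consumer group per run, so 'earliest' always starts from the beginning of the topic
    'group.id': f'streaming-lesson-{uuid.uuid4().hex[:8]}',
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False
}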

def sample_streaming_data(topic_name, num_messages=2):
    """Sample live messages from a Kafka topic to understand data structure"""
    print(f"📊 Sampling {num_messages} live messages from {topic_name}...")
    
    consumer = Consumer(kafka_config)
    
    try:
        consumer.subscribe([topic_name])
        messages = []
        attempts = 0
        max_attempts = 30
        
        while len(messages) < num_messages and attempts < max_attempts:
            msg = consumer.poll(timeout=1.0)
            attempts += 1
            
            if attempts % 10 == 0:
                print(f"   🔄 Polling attempt {attempts}/{max_attempts}...")
            
            if msg is None:
                continue
            if msg.error():
                print(f"❌ Consumer error: {msg.error()}")
                continue
                
            messages.append(msg)
            
            print(f"\n✅ MESSAGE {len(messages)} CAPTURED:")
            print(f"   📍 Topic: {msg.topic()}, Partition: {msg.partition()}, Offset: {msg.offset()}")
            print(f"   ⏰ Timestamp: {msg.timestamp()}")
            key = msg.key()
            if key:
                try:
                    key_str = key.decode('utf-8') if isinstance(key, bytes) else str(key)
                    print(f"   🔑 Key: {key_str}")
                except UnicodeDecodeError:
                    print(f"   🔑 Key: {key} (binary)")
            else:
                print(f"   🔑 Key: None")
            value = msg.value()
            if value:
                print(f"   📦 Value length: {len(value)} bytes")
                hex_preview = ' '.join([f'{b:02x}' for b in value[:20]])
                print(f"   🔍 Hex preview: {hex_preview}...")
                if len(value) >= 5 and value[0] == 0:
                    schema_id = int.from_bytes(value[1:5], byteorder='big')
                    print(f"   📋 Avro schema ID: {schema_id}")
                    print(f"   ✅ Confluent Schema Registry format detected")
                else:
                    print(f"   ⚠️ Not standard Confluent Avro format")
            else:
                print(f"   📦 Value: None")
        
        if len(messages) == 0:
            print("🛑 No messages found - check if data generator is running!")
        else:
            print(f"\n🎯 Successfully sampled {len(messages)} messages from {topic_name}")
            
        return messages
        
    except Exception as e:
        print(f"❌ Error: {e}")
        return []
    finally:
        consumer.close()

print("🔗 Testing Kafka connectivity...")
try:
    consumer = Consumer({'bootstrap.servers': KAFKA_BOOTSTRAP, 'group.id': 'test-connectivity'})
    metadata = consumer.list_topics(timeout=10)
    print(f"✅ Connected to Kafka. Found {len(metadata.topics)} topics:")
    
    retail_topics = []
    for topic_name in sorted(metadata.topics.keys()):
        if not topic_name.startswith('_'):
            partitions = len(metadata.topics[topic_name].partitions)
            print(f"   📊 {topic_name} ({partitions} partitions)")
            if topic_name.endswith('.v1'):
                retail_topics.append(topic_name)
    
    print(f"\n🏪 Retail streaming topics: {retail_topics}")
    consumer.close()
except Exception as e:
    print(f"❌ Kafka connection failed: {e}")

In [None]:
print("🛒 EXAMINING ORDERS DATA:")
print("=" * 50)

orders_messages = sample_streaming_data("orders.v1", 2)

print("\n🎓 LESSON INSIGHT:")
print("Notice the binary data format - this is Avro, not JSON.")
print("The magic byte (00) + schema ID (4 bytes) tells us it's Confluent format.")
print("This compact binary encoding is why Avro beats JSON for streaming.")
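
The header check performed inside `sample_streaming_data` can also be written as a small standalone helper. The sketch below assumes the Confluent wire format described above (one zero magic byte followed by a 4-byte big-endian schema ID); `split_confluent_frame` is a hypothetical name for illustration, not part of any library, and the frame is hand-built rather than read from Kafka.

```python
import struct
from typing import Optional, Tuple

MAGIC_BYTE = 0  # marker that starts every Confluent-framed message

def split_confluent_frame(value: bytes) -> Optional[Tuple[int, bytes]]:
    """Return (schema_id, avro_payload), or None if the bytes are not Confluent-framed."""
    if value is None or len(value) < 5:
        return None
    # '>bI' = big-endian: 1 signed byte (magic) + 4-byte unsigned int (schema ID)
    magic, schema_id = struct.unpack('>bI', value[:5])
    if magic != MAGIC_BYTE:
        return None
    return schema_id, value[5:]

# Hand-built frame for illustration: magic byte 0, schema ID 42, three payload bytes
frame = b'\x00' + (42).to_bytes(4, 'big') + b'\x02\x04\x06'
print(split_confluent_frame(frame))
print(split_confluent_frame(b'{"order_id": 1}'))  # JSON text starts with '{', not 0
```

Returning `None` instead of raising keeps the helper easy to use in a polling loop, where a non-Avro message should be skipped rather than stop the consumer.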

---

## 📋 Lesson 2: Schema Registry Deep Dive

**Goal:** Understand how Schema Registry enables safe schema evolution.

Schema Registry is like a "contract database" for your streaming data. It stores Avro schemas and enforces compatibility rules, preventing breaking changes that would crash your streaming pipelines.
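
To make "compatibility rules" concrete, here is a deliberately simplified sketch of one rule behind BACKWARD compatibility: a new schema can read old records only if every field it adds carries a default. The function `is_backward_compatible` and the `Order` schemas are hypothetical, and the real registry checks considerably more (type promotions, unions, aliases), so treat this as an illustration of the idea rather than the registry's algorithm.

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """True if a reader using new_schema could read records written with old_schema.

    Simplified: only checks that fields added in new_schema have a default.
    """
    old_fields = {f["name"] for f in old_schema["fields"]}
    for field in new_schema["fields"]:
        if field["name"] not in old_fields and "default" not in field:
            return False  # old records have no value for this field and no fallback
    return True

order_v1 = {"type": "record", "name": "Order", "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
]}

# Adds an optional field with a default: old records can still be read
order_v2_ok = {"type": "record", "name": "Order", "fields": order_v1["fields"] + [
    {"name": "coupon", "type": ["null", "string"], "default": None},
]}

# Adds a required field without a default: old records cannot be read
order_v2_bad = {"type": "record", "name": "Order", "fields": order_v1["fields"] + [
    {"name": "currency", "type": "string"},
]}

print(is_backward_compatible(order_v1, order_v2_ok))
print(is_backward_compatible(order_v1, order_v2_bad))
```

A registry configured for backward compatibility would refuse to register a schema like `order_v2_bad`, which is how a breaking change gets stopped before any producer uses it.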

In [None]:
from confluent_kafka.schema_registry import SchemaRegistryClient

sr_client = SchemaRegistryClient({'url': SCHEMA_REGISTRY_URL})

print("📋 SCHEMA REGISTRY EXPLORATION:")
print("=" * 50)

try:
    subjects = sr_client.get_subjects()
    print(f"📊 Total schema subjects: {len(subjects)}")
    print("\n🔍 Available schemas:")
    
    for subject in sorted(subjects):
        try:
            versions = sr_client.get_versions(subject)
            latest_version = sr_client.get_latest_version(subject)
            print(f"   📋 {subject}: {len(versions)} versions (latest: v{latest_version.version}, schema ID: {latest_version.schema_id})")
            if subject == "orders.v1-value":
                schema_str = latest_version.schema.schema_str
                schema_obj = json.loads(schema_str)
                
                print(f"\n🛒 ORDERS SCHEMA BREAKDOWN:")
                print(f"   📝 Schema name: {schema_obj['name']}")
                print(f"   📊 Number of fields: {len(schema_obj['fields'])}")
                print(f"   🔧 Fields:")
                for field in schema_obj['fields']:
                    field_type = field['type']
                    print(f"      • {field['name']}: {field_type}")
                
        except Exception as e:
            print(f"   ❌ {subject}: (error getting version info: {e})")
            
except Exception as e:
    print(f"❌ Schema Registry error: {e}")

print(f"\n🎓 LESSON INSIGHT:")
print(f"Schema Registry acts as a 'contract database' for streaming data.")
print(f"Each message references a schema ID, enabling safe evolution without breaking consumers.")
print(f"Plain JSON carries no schema reference - consumers must parse each message and infer its structure.")

---

## 🆚 Lesson 3: Avro vs JSON - The Evidence

**Goal:** Understand why Avro dominates streaming with concrete evidence.

JSON seems simpler, but Avro typically wins on the metrics that matter most for streaming: message size, parse speed, schema evolution, and type safety. Let's measure the size difference on a live message.
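
To see where the size difference comes from, the sketch below hand-encodes one record using the Avro binary encoding rules (zigzag varint for `long`, length-prefixed UTF-8 for `string`, 8-byte little-endian IEEE 754 for `double`) and compares it with compact JSON. The record and its field names are made up for illustration; real Data Forge messages also carry the 5-byte Confluent header.

```python
import json
import struct

def encode_long(n: int) -> bytes:
    """Avro long: zigzag-encode, then emit 7 bits per byte, least significant group first."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small negative numbers to small positive ones
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set means more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(s: str) -> bytes:
    """Avro string: byte length as a long, followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

def encode_double(x: float) -> bytes:
    """Avro double: 8 bytes, little-endian IEEE 754."""
    return struct.pack("<d", x)

# Hypothetical order record; field order is fixed by the schema, so no names are written
order = {"order_id": "ord-1001", "quantity": 3, "amount": 59.97}

avro_bytes = (encode_string(order["order_id"])
              + encode_long(order["quantity"])
              + encode_double(order["amount"]))
json_bytes = json.dumps(order, separators=(",", ":")).encode("utf-8")

print(f"Avro: {len(avro_bytes)} bytes, JSON: {len(json_bytes)} bytes")
```

Most of the saving comes from leaving out field names and punctuation, so the gap narrows for records dominated by long string values.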

In [None]:
import io

import avro.io
import avro.schema

def decode_avro_message(message_value):
    """Decode a Confluent-framed Avro message into a dict, or return None on failure.

    Wire format: 1 magic byte (0) + 4-byte big-endian schema ID + Avro binary payload.
    """
    # Reject empty values, frames too short for the header, and non-Confluent framing
    if not message_value or len(message_value) < 5 or message_value[0] != 0:
        return None

    schema_id = int.from_bytes(message_value[1:5], byteorder='big')
    avro_payload = message_value[5:]

    try:
        # Look up the writer schema by the ID embedded in the message
        schema = sr_client.get_schema(schema_id)
        avro_schema = avro.schema.parse(schema.schema_str)

        decoder = avro.io.BinaryDecoder(io.BytesIO(avro_payload))
        reader = avro.io.DatumReader(avro_schema)
        return reader.read(decoder)
    except Exception as e:
        # Report the cause instead of failing silently, then fall back to None
        print(f"⚠️ Avro decode failed for schema ID {schema_id}: {e}")
        return None

print("🆚 AVRO vs JSON COMPARISON:")
print("=" * 50)

if orders_messages:
    sample_msg = orders_messages[0]
    avro_data = decode_avro_message(sample_msg.value())
    
    if avro_data:
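        # Compact separators keep the comparison fair: pretty-printing would inflate the JSON size
        json_equivalent = json.dumps(avro_data, separators=(',', ':'), default=str)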
        
        avro_size = len(sample_msg.value())
        json_size = len(json_equivalent.encode('utf-8'))
        
        print(f"📊 SIZE COMPARISON:")
        print(f"   🔹 Avro binary: {avro_size} bytes")
        print(f"   🔹 JSON equivalent: {json_size} bytes")
        print(f"   📈 Space savings: {((json_size - avro_size) / json_size * 100):.1f}% smaller with Avro")
        
        print(f"\n📋 DECODED DATA:")
        for key, value in avro_data.items():
            if isinstance(value, (int, float, bool)):
                print(f"   {key}: {value}")
            elif isinstance(value, str):
                safe_value = ''.join(c if ord(c) < 128 else '?' for c in value)
                print(f"   {key}: '{safe_value}'")
            else:
                print(f"   {key}: {str(value)}")
        if 'ts' in avro_data:
            try:
                from datetime import datetime
                ts_ms = avro_data['ts']
                if isinstance(ts_ms, (int, float)):
                    ts_readable = datetime.fromtimestamp(ts_ms / 1000).strftime('%Y-%m-%d %H:%M:%S')
                    print(f"   ts_readable: {ts_readable}")
            except Exception:
                pass
        
        print(f"\n🎓 WHY AVRO WINS:")
        print(f"   ✅ Size: {((json_size - avro_size) / json_size * 100):.1f}% smaller → less network/storage cost")
        print(f"   ✅ Speed: Binary decoding is typically faster than parsing JSON text")
        print(f"   ✅ Schema: Enforced types catch bad data at write time instead of in consumers")
        print(f"   ✅ Evolution: Add/remove fields with defaults without breaking consumers")
        print(f"   ✅ Compactness: Field names live in the schema, not in every message")
        
        print(f"\n❌ JSON PROBLEMS:")
        print(f"   ❌ No schema enforcement → runtime type errors")
        print(f"   ❌ Text parsing overhead")
        print(f"   ❌ Field name repetition in every message")
        print(f"   ❌ No built-in evolution rules (plain JSON has no schema to check against)")
        
    else:
        print("❌ Could not decode Avro message")
else:
    print("❌ No orders messages available for comparison")

---

## ⚡ Lesson 4: PySpark Structured Streaming Fundamentals

**Goal:** Understand how Spark processes infinite data streams.

Traditional batch processing reads finite data, processes it, and stops. Streaming is different - data never stops arriving. Spark Structured Streaming treats streams as "unbounded tables" that grow continuously.
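
The "unbounded table" idea can be illustrated without Spark. The sketch below is a plain-Python simulation, not Spark's implementation: it cuts a stream into micro-batches and updates a running result from the new rows only. The helper `micro_batches` and the event fields are hypothetical, and the stream is finite so the example terminates.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List

def micro_batches(events: Iterable, batch_size: int) -> Iterator[List]:
    """Cut an event stream into small batches, much as a trigger interval would."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Stand-in for an endless Kafka topic
events = [{"status": s} for s in ["NEW", "PAID", "NEW", "SHIPPED", "PAID", "NEW", "NEW"]]

running_counts = Counter()  # the "result table", updated incrementally
for batch_id, batch in enumerate(micro_batches(events, batch_size=3)):
    running_counts.update(e["status"] for e in batch)  # only the new rows are processed
    print(f"batch {batch_id}: {dict(running_counts)}")
```

Each iteration touches only the rows that arrived since the previous one, which is why the cost per batch stays roughly constant even though the conceptual table keeps growing.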

In [None]:
from pyspark.sql import SparkSession
# Import only what the lesson uses; a wildcard import would shadow builtins such as sum, max, and hash
from pyspark.sql.functions import col, expr, length
import os

print("⚡ SPARK STREAMING SETUP:")
print("=" * 50)

spark = SparkSession.builder \
    .appName("StreamingFundamentalsLesson") \
    .master(SPARK_MASTER) \
    .config("spark.jars.packages", 
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0,"
            "org.apache.spark:spark-avro_2.12:3.4.0") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.streaming.checkpointLocation", "/tmp/streaming-lesson-checkpoint") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

print(f"✅ Spark session created successfully")
print(f"   📊 Spark version: {spark.version}")
print(f"   🆔 Application ID: {spark.sparkContext.applicationId}")
print(f"   🖥️ Master: {SPARK_MASTER}")
print(f"   💾 Checkpoint location: /tmp/streaming-lesson-checkpoint")
print(f"\n🔗 Testing Spark-Kafka connectivity...")
try:
    test_df = spark.read \
        .format("kafka") \
        .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP) \
        .option("subscribe", "orders.v1") \
        .option("startingOffsets", "earliest") \
        .option("endingOffsets", "latest") \
        .load()
    
    message_count = test_df.count()
    print(f"✅ Kafka connectivity test passed")
    print(f"   📊 Found {message_count} messages in orders.v1 topic")
    
except Exception as e:
    print(f"❌ Kafka connectivity test failed: {e}")

print(f"\n🎓 STREAMING FUNDAMENTALS:")
print(f"   • Streaming = processing unbounded (infinite) data")
print(f"   • Spark treats streams as 'growing tables'")
print(f"   • Each micro-batch processes new data incrementally")
print(f"   • Checkpoints track progress for fault tolerance")

In [None]:
def create_streaming_dataframe(topic_name):
    """Create a streaming DataFrame - this represents infinite data"""
    print(f"🌊 Creating streaming DataFrame for {topic_name}")
    
    try:
        # This creates a streaming DataFrame - it represents infinite data
        kafka_stream = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP)
            .option("subscribe", topic_name)
            .option("startingOffsets", "latest")  # Only process new data
            .option("failOnDataLoss", "false")    # Keep running if offsets were deleted (fine for a lesson, risky in production)
            .load())
        
        # Transform the raw Kafka data into a structured format
        structured_stream = kafka_stream.select(
            col("key").cast("string").alias("message_key"),
            col("topic"),
            col("partition"),
            col("offset"),
            col("timestamp").alias("kafka_timestamp"),
            length(col("value")).alias("message_size_bytes"),
            col("value")  # Raw Avro binary data
        )
        
        print(f"✅ Streaming DataFrame created for {topic_name}")
        print(f"   📊 Schema (what each streaming record looks like):")
        structured_stream.printSchema()
        
        return structured_stream
        
    except Exception as e:
        print(f"❌ Error creating streaming DataFrame: {e}")
        return None

print("🛒 CREATING ORDERS STREAM:")
print("=" * 40)

orders_stream = create_streaming_dataframe("orders.v1")

if orders_stream:
    print(f"\n🎓 KEY CONCEPTS:")
    print(f"   • This DataFrame represents INFINITE data - it never ends")
    print(f"   • isStreaming = {orders_stream.isStreaming}")
    print(f"   • You can't call .show() or .collect() on it directly")
    print(f"   • You need a 'streaming query' to process the data")
    print(f"   • Spark processes data in micro-batches (e.g., every 2 seconds)")
else:
    print("❌ Failed to create orders stream")

---

## 💾 Lesson 5: Checkpoints - The Fault Tolerance Foundation

**Goal:** Understand why checkpoints are critical for production streaming.

Streaming applications run 24/7. Networks fail, machines crash, code gets deployed. Without checkpoints, you'd lose progress and either miss data or reprocess everything. Checkpoints are your streaming safety net.
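
The core mechanism is small enough to simulate in plain Python. The sketch below is not Spark's checkpoint format; it only shows the idea of persisting the next offset after each record so that a restart resumes where the previous run stopped. The file layout and the function names (`load_offset`, `save_offset`, `run`) are assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import List, Optional

def load_offset(path: str) -> int:
    """Return the next offset to process, or 0 if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_offset"]

def save_offset(path: str, next_offset: int) -> None:
    """Write the checkpoint atomically: write a temp file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_offset": next_offset}, f)
    os.replace(tmp, path)

def run(log: List[str], path: str, crash_after: Optional[int] = None) -> List[str]:
    """Process records from the checkpointed offset; optionally 'crash' after N records."""
    processed = []
    for offset in range(load_offset(path), len(log)):
        if crash_after is not None and len(processed) == crash_after:
            break  # simulated failure
        processed.append(log[offset])
        save_offset(path, offset + 1)
    return processed

log = ["order-0", "order-1", "order-2", "order-3", "order-4"]  # stand-in for a Kafka partition
checkpoint = os.path.join(tempfile.mkdtemp(), "offsets.json")

first_run = run(log, checkpoint, crash_after=2)  # stops after two records
second_run = run(log, checkpoint)                # restart resumes at offset 2
print(first_run, second_run)
```

Without the checkpoint file the second run would start again at offset 0 and reprocess everything, which is exactly the situation described above.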

In [None]:
import time

print("💾 CHECKPOINTING DEMONSTRATION:")
print("=" * 50)

if orders_stream:
    print("🔍 What checkpoints store:")
    print("   • Stream metadata (offsets, batch IDs)")
    print("   • Query progress information")
    print("   • State store data for aggregations")
    print("   • Watermark information for event time")
    
    # Create a streaming query with a memory sink (handy in Jupyter, but debug-only and not fault-tolerant)
    checkpoint_demo_query = orders_stream.select(
        col("message_key"),
        col("offset"),
        col("kafka_timestamp"),
        col("message_size_bytes"),
        # Extract hex preview for educational purposes
        expr("hex(substring(value, 1, 20))").alias("avro_hex_preview")
    ).writeStream \
        .queryName("checkpoint_demo") \
        .outputMode("append") \
        .format("memory") \
        .option("checkpointLocation", "/tmp/streaming-lesson-checkpoint/demo") \
        .trigger(processingTime="3 seconds") \
        .start()

    print(f"\n✅ Streaming query started with checkpointing!")
    print(f"   📍 Query name: {checkpoint_demo_query.name}")
    print(f"   📁 Checkpoint location: /tmp/streaming-lesson-checkpoint/demo")
    print(f"   ⏱️ Processing trigger: every 3 seconds")
    print(f"\n🔄 Watching streaming progress (15 seconds)...")
    
    for i in range(5):
        time.sleep(3)
        
        try:
            current_data = spark.sql("SELECT COUNT(*) as total FROM checkpoint_demo")
            total_messages = current_data.collect()[0].total
            progress = checkpoint_demo_query.lastProgress
            
            print(f"   📊 Progress check {i+1}/5:")
            print(f"      Messages processed: {total_messages}")
            
            if progress:
                batch_id = progress.get('batchId', 'N/A')
                input_rate = progress.get('inputRowsPerSecond', 0)
                print(f"      Current batch: {batch_id}")
                print(f"      Input rate: {input_rate:.1f} rows/sec")
            if total_messages > 0:
                sample = spark.sql("SELECT message_key, offset, message_size_bytes FROM checkpoint_demo ORDER BY offset DESC LIMIT 2")
                rows = sample.collect()
                print(f"      Latest messages:")
                for row in rows:
                    print(f"        Key: {row.message_key}, Offset: {row.offset}, Size: {row.message_size_bytes}B")
            
        except Exception as e:
            print(f"      ⚠️ Progress check {i+1}: {e}")
    checkpoint_demo_query.stop()
    print(f"\n🛑 Streaming query stopped")
    
    print(f"\n🎓 CHECKPOINT BENEFITS:")
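    print(f"   ✅ Exactly-once results when paired with a replayable source and an idempotent or transactional sink")
    print(f"   ✅ Fault tolerance - resume from failure point")
    print(f"   ✅ State preservation for aggregations")
    print(f"   ✅ No gaps after a restart; no duplicates as long as the sink tolerates replays")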
    
    print(f"\n⚠️ CHECKPOINT CONSIDERATIONS:")
    print(f"   • Choose reliable storage (HDFS, S3, not local /tmp in production)")
    print(f"   • Test checkpoint compatibility before upgrading Spark")
    print(f"   • Changing the query (e.g. stateful operations or their schema) may require a fresh checkpoint")
    print(f"   • Checkpoint size grows with state (aggregations)")

else:
    print("❌ No stream available for checkpoint demonstration")
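
Checkpoints cover the read side only. If a batch is written and the job dies before the checkpoint is updated, that batch is replayed on restart, so the end result is exactly-once only when the sink tolerates replays. Below is a minimal sketch of that idea, using a hypothetical `IdempotentSink` keyed by (partition, offset); it is an illustration, not a Spark sink implementation.

```python
class IdempotentSink:
    """Stores each record under its (partition, offset) key, so replays overwrite instead of duplicating."""

    def __init__(self):
        self.rows = {}

    def write_batch(self, batch):
        for partition, offset, value in batch:
            self.rows[(partition, offset)] = value

batch = [(0, 10, "order-a"), (0, 11, "order-b")]

naive_sink = []                     # append-only sink: replays create duplicates
idempotent_sink = IdempotentSink()

for attempt in range(2):            # the second iteration simulates a replay after a crash
    naive_sink.extend(batch)
    idempotent_sink.write_batch(batch)

print(len(naive_sink), len(idempotent_sink.rows))
```

The append-only list ends up with every record twice, while the keyed sink holds each record once, which is the behavior a replayed micro-batch needs.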

---

## 🌊 Lesson 6: Advanced Streaming - Avro Decoding in Real-Time

**Goal:** Combine everything - stream processing with Avro decoding.

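Now let's put it all together: read an unbounded Kafka stream, decode Avro messages as they arrive, and display business-readable data. To keep the notebook simple, decoding happens on the driver after each micro-batch lands in the memory sink; a production pipeline would typically decode inside the query itself (for example with Spark's `from_avro` after stripping the 5-byte Confluent header) and write to a durable sink.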

In [None]:
print("🚀 ADVANCED STREAMING WITH AVRO DECODING:")
print("=" * 60)

if orders_stream:
    print("🎯 This demonstrates the building blocks of production streaming:")
    print("   • Infinite data processing")
    print("   • Real-time Avro decoding")
    print("   • Business data extraction")
    print("   • Fault-tolerant checkpointing")
    advanced_query = orders_stream.select(
        col("message_key"),
        col("offset"),
        col("kafka_timestamp"),
        col("value"),  # Full Avro binary for decoding
        col("message_size_bytes")
    ).writeStream \
        .queryName("advanced_avro_streaming") \
        .outputMode("append") \
        .format("memory") \
        .option("checkpointLocation", "/tmp/streaming-lesson-checkpoint/advanced") \
        .trigger(processingTime="2 seconds") \
        .start()

    print(f"\n✅ Advanced streaming query started!")
    print(f"   🔧 Processing: Kafka → Spark → Avro decode → Business data")
    print(f"   📊 Collecting and decoding live data...")
    for i in range(6):
        time.sleep(2)
        
        try:
            current_data = spark.sql("SELECT * FROM advanced_avro_streaming ORDER BY offset DESC LIMIT 1")
            data_count = current_data.count()
            
            if data_count > 0:
                print(f"\n📊 LIVE UPDATE {i+1}/6:")
                print("─" * 40)
                
                row = current_data.collect()[0]
                print(f"📍 Kafka metadata:")
                print(f"   Key: {row.message_key}")
                print(f"   Offset: {row.offset}")
                print(f"   Timestamp: {row.kafka_timestamp}")
                print(f"   Size: {row.message_size_bytes} bytes")
                # Reuse decode_avro_message from Lesson 3 rather than repeating the parsing logic
                decoded = decode_avro_message(bytes(row.value)) if row.value else None
                if decoded:
                    print(f"\n🎯 DECODED BUSINESS DATA:")
                    for key, value in decoded.items():
                        if isinstance(value, (int, float)):
                            print(f"   {key}: {value}")
                        elif isinstance(value, str):
                            safe_value = ''.join(c if ord(c) < 128 else '?' for c in value)
                            print(f"   {key}: '{safe_value}'")
                        else:
                            print(f"   {key}: {str(value)}")

                    # 'ts' is treated as epoch milliseconds, as in Lesson 3
                    if isinstance(decoded.get('ts'), (int, float)):
                        ts_readable = datetime.fromtimestamp(decoded['ts'] / 1000).strftime('%Y-%m-%d %H:%M:%S')
                        print(f"   event_time: {ts_readable}")

                    print(f"   ✅ Real-time Avro decoding successful!")
                else:
                    print(f"   ⚠️ Could not decode Avro payload (unexpected framing or schema lookup failure)")
            else:
                print(f"⏳ Update {i+1}/6: Waiting for streaming data...")
                
        except Exception as e:
            print(f"⚠️ Update {i+1}/6: {e}")
    advanced_query.stop()
    print(f"\n🛑 Advanced streaming query stopped")
    
    print(f"\n🎓 WHAT YOU JUST SAW:")
    print(f"   🌊 Infinite data stream processing")
    print(f"   📋 Schema Registry integration")
    print(f"   🔧 Real-time Avro decoding")
    print(f"   💾 Fault-tolerant checkpointing")
    print(f"   📊 Business data extraction from binary streams")
    print(f"   ⚡ These are the building blocks of production streaming systems!")

else:
    print("❌ No stream available for advanced demonstration")

---

## 🎯 Lesson Summary & Production Insights

**Congratulations!** You've learned the fundamentals of modern streaming architecture.

### 🎓 What You Learned

1. **Prerequisites** → Data Forge's generator creates realistic streaming data
2. **Data Examination** → Data Forge's topics carry binary Avro in Confluent wire format, not JSON
3. **Avro Superiority** → Smaller on the wire (see your Lesson 3 measurement), typically faster to parse, type-safe, evolvable
4. **Schema Registry** → Enables safe schema evolution without breaking consumers
5. **Spark Streaming** → Treats infinite streams as "growing tables"
6. **Checkpoints** → Critical for fault tolerance and a prerequisite for exactly-once results

### 🏭 Production Patterns

**You're now ready for real-world streaming:**

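- **Schema Evolution** → Add/remove fields without downtime, within the registry's compatibility rules
- **Fault Tolerance** → Streams survive failures and restarts
- **Exactly-Once Processing** → No data loss or duplication, given a replayable source and an idempotent or transactional sink
- **Type Safety** → Avro rejects records that don't match the schema at write time
- **Performance** → Binary encoding reduces network and storage costs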

### 🚀 Next Steps

**Explore more Data Forge capabilities:**

```bash
# Explore with Trino SQL
# Visit http://localhost:8081 for Trino UI

# Build dashboards in Superset
# Visit http://localhost:8088 (admin/admin)
```

**Additional learning resources:**
- `notebooks/lessons/streaming` → More streaming examples
- `docs/` → Service-specific guides
- Data Forge README → Architecture overview

---

## 🔧 Cleanup

In [None]:
print("🧹 LESSON CLEANUP:")
print("=" * 30)
active_queries = spark.streams.active
if active_queries:
    print(f"🛑 Stopping {len(active_queries)} active streaming queries...")
    for query in active_queries:
        query.stop()
        print(f"   ✅ Stopped: {query.name}")
else:
    print("ℹ️ No active queries to stop")
try:
    import shutil
    if os.path.exists("/tmp/streaming-lesson-checkpoint"):
        shutil.rmtree("/tmp/streaming-lesson-checkpoint")
        print("🗑️ Cleaned up checkpoint directory")
except Exception as e:
    print(f"⚠️ Checkpoint cleanup: {e}")

print(f"\n✅ Lesson cleanup complete!")
print(f"💡 Spark session remains active for further experimentation")
print(f"🎉 You've mastered streaming fundamentals with Data Forge!")

# Optional: Uncomment to stop Spark completely
# spark.stop()
# print("🏁 Spark session stopped")