# Module 03: Stream Processing Fundamentals

**Estimated Time:** 90 minutes

## Learning Objectives

By the end of this module, you will:
- Understand stream processing concepts and patterns
- Differentiate between stateless and stateful operations
- Master windowing techniques (tumbling, sliding, session)
- Work with event time vs processing time
- Handle late-arriving data with watermarks
- Build real-time aggregations and transformations

---

## 1. What is Stream Processing?

### Stream Processing vs Batch Processing

**Batch Processing (Traditional):**
```
Data Collection (1 hour) → Process → Results
    [...........]          [compute]   [output]
    Wait, wait, wait...    ALL at once  

Characteristics:
- Process data in bounded chunks (batches)
- High latency (minutes to hours)
- Complete dataset available
- Examples: Daily reports, ETL jobs
```

**Stream Processing (Real-time):**
```
Continuous Data → Process Each Event → Continuous Results
   [→][→][→][→]      [compute]           [→][→][→][→]
   Process immediately, one by one or in micro-batches

Characteristics:
- Process data as it arrives
- Low latency (milliseconds to seconds)
- Infinite, unbounded dataset
- Examples: Fraud detection, monitoring
```

### Stream Processing Operations

**1. Stateless Operations:**
```
Input Stream:    [1] → [2] → [3] → [4] → [5]
                  ↓     ↓     ↓     ↓     ↓
Operation:       *2    *2    *2    *2    *2  (independent)
                  ↓     ↓     ↓     ↓     ↓
Output Stream:   [2] → [4] → [6] → [8] → [10]

Examples: filter, map, flatMap
```

**2. Stateful Operations:**
```
Input Stream:    [1] → [2] → [3] → [4] → [5]
                  ↓     ↓     ↓     ↓     ↓
State:          sum=1 sum=3 sum=6 sum=10 sum=15
                  ↓     ↓     ↓     ↓     ↓
Output Stream:   [1] → [3] → [6] → [10] → [15]

Examples: count, sum, aggregations, joins
```
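The contrast between the two diagrams can be sketched in plain Python, with no Kafka involved. The function names `double` and `running_sum` are hypothetical and chosen only for this illustration.

```python
from typing import Iterable, Iterator


def double(stream: Iterable[int]) -> Iterator[int]:
    """Stateless: each output depends only on the current event."""
    for value in stream:
        yield value * 2


def running_sum(stream: Iterable[int]) -> Iterator[int]:
    """Stateful: each output depends on every event seen so far."""
    total = 0  # state carried from one event to the next
    for value in stream:
        total += value
        yield total


events = [1, 2, 3, 4, 5]
print(list(double(events)))       # [2, 4, 6, 8, 10]
print(list(running_sum(events)))  # [1, 3, 6, 10, 15]
```

If the process restarts, `double` loses nothing, while `running_sum` loses `total`. That difference is why stateful operators need a state store.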

### Real-World Use Cases

| Use Case | Input | Processing | Output |
|----------|-------|------------|--------|
| Fraud Detection | Transactions | Pattern matching | Alerts |
| Real-time Analytics | User clicks | Aggregation | Dashboard |
| IoT Monitoring | Sensor data | Threshold checks | Notifications |
| Recommendation | User behavior | ML inference | Recommendations |
| Log Monitoring | Log events | Filtering + aggregation | Metrics |

In [None]:
# Setup: Import libraries
from confluent_kafka import Producer, Consumer, KafkaException
from confluent_kafka.admin import AdminClient, NewTopic
import json
import time
from datetime import datetime, timedelta
from collections import defaultdict, deque
import random
import threading

admin_client = AdminClient({"bootstrap.servers": "localhost:9092"})
print("[OK] Ready for stream processing!")

In [None]:
# Create topics for stream processing examples
TOPICS = {
    "click-stream": "Raw user click events",
    "transactions": "Financial transactions",
    "sensor-data": "IoT sensor readings",
    "processed-clicks": "Processed click events",
}

new_topics = [
    NewTopic(topic=name, num_partitions=3, replication_factor=1) for name in TOPICS.keys()
]

try:
    futures = admin_client.create_topics(new_topics)
    for topic, future in futures.items():
        try:
            future.result()
            print(f"[OK] Created topic '{topic}'")
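        except KafkaException as e:
            if "TOPIC_ALREADY_EXISTS" in str(e):
                print(f"[OK] Topic '{topic}' exists")
            else:
                # Report any other failure instead of silently ignoring it
                print(f"[FAIL] Could not create topic '{topic}': {e}")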
except Exception as e:
    print(f"[FAIL] Error: {e}")

---

## 2. Stateless Stream Processing

### Filter Operation

**Concept**: Select events that match a condition
```
Input:  [1] [2] [3] [4] [5] [6] [7] [8] [9]
         ↓   ↓   ↓   ↓   ↓   ↓   ↓   ↓   ↓
Filter: even numbers only
         ↓       ↓       ↓       ↓
Output: [2]     [4]     [6]     [8]
```

### Map Operation

**Concept**: Transform each event independently
```
Input:  [a] [b] [c] [d]
         ↓   ↓   ↓   ↓
Map:    uppercase
         ↓   ↓   ↓   ↓
Output: [A] [B] [C] [D]
```

### FlatMap Operation

**Concept**: Transform one event into zero or more events
```
Input:  ["hello world"] ["foo bar"]
         ↓               ↓
FlatMap: split by space
         ↓               ↓
Output: ["hello"]["world"] ["foo"]["bar"]
```
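The three operations above can be sketched with Python generators before wiring them to Kafka. The helper names `filter_stream`, `map_stream`, and `flat_map_stream` are hypothetical; they are not part of `confluent_kafka` or any other library.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def filter_stream(stream: Iterable[T], predicate: Callable[[T], bool]) -> Iterator[T]:
    """Keep only the events for which predicate(event) is true."""
    return (event for event in stream if predicate(event))


def map_stream(stream: Iterable[T], fn: Callable[[T], U]) -> Iterator[U]:
    """Emit exactly one output event per input event."""
    return (fn(event) for event in stream)


def flat_map_stream(stream: Iterable[T], fn: Callable[[T], Iterable[U]]) -> Iterator[U]:
    """Emit zero or more output events per input event."""
    for event in stream:
        yield from fn(event)


evens = list(filter_stream(range(1, 10), lambda n: n % 2 == 0))
upper = list(map_stream(["a", "b", "c", "d"], str.upper))
words = list(flat_map_stream(["hello world", "foo bar"], str.split))

print(evens)  # [2, 4, 6, 8]
print(upper)  # ['A', 'B', 'C', 'D']
print(words)  # ['hello', 'world', 'foo', 'bar']
```

Because each helper looks at one event at a time and keeps no state, the three can be chained in any order and run in parallel across partitions.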

In [None]:
# Example: Stateless stream processor - Filter and Transform
class StatelessProcessor:
    """Process events without maintaining state"""

    def __init__(self, input_topic, output_topic):
        self.input_topic = input_topic
        self.output_topic = output_topic

        self.consumer = Consumer(
            {
                "bootstrap.servers": "localhost:9092",
                "group.id": "stateless-processor",
                "auto.offset.reset": "earliest",
            }
        )

        self.producer = Producer({"bootstrap.servers": "localhost:9092"})

    def filter_event(self, event):
        """Filter: Only process events from premium users"""
        return event.get("user_tier") == "premium"

    def transform_event(self, event):
        """Map: Enrich event with additional fields"""
        return {
            **event,
            "processed_at": datetime.now().isoformat(),
            "is_premium": True,
            "priority": "high",
        }

    def process(self, duration_seconds=10):
        """Run the stream processor"""
        self.consumer.subscribe([self.input_topic])

        processed_count = 0
        filtered_count = 0
        start_time = time.time()

        print(f"[OK] Starting stateless processor...\n")

        try:
            while time.time() - start_time < duration_seconds:
                msg = self.consumer.poll(timeout=1.0)

                if msg is None:
                    continue

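                if msg.error():
                    # Surface consumer errors instead of skipping them silently
                    print(f"[FAIL] Consumer error: {msg.error()}")
                    continue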

                # Deserialize
                event = json.loads(msg.value().decode("utf-8"))

                # Filter
                if not self.filter_event(event):
                    filtered_count += 1
                    continue

                # Transform
                transformed = self.transform_event(event)

                # Produce to output topic
                self.producer.produce(
                    topic=self.output_topic, key=event.get("user_id"), value=json.dumps(transformed)
                )
                self.producer.poll(0)

                processed_count += 1

                if processed_count <= 5:
                    print(f"[{processed_count}] Processed: {event['user_id']} - {event['action']}")

        finally:
            self.producer.flush()
            self.consumer.close()

            print(f"\n[DATA] Processor Summary:")
            print(f"  Processed: {processed_count}")
            print(f"  Filtered out: {filtered_count}")
            print(f"  Total events: {processed_count + filtered_count}")


print("[OK] StatelessProcessor class defined")

In [None]:
# Generate sample click events
def generate_click_events(num_events=50):
    """Generate sample click stream events"""
    producer = Producer({"bootstrap.servers": "localhost:9092"})

    actions = ["view_page", "click_button", "add_to_cart", "checkout"]
    tiers = ["free", "premium", "free", "premium", "free"]  # Mix of tiers

    for i in range(num_events):
        event = {
            "event_id": f"evt_{i}",
            "user_id": f"user_{random.randint(1, 10)}",
            "user_tier": random.choice(tiers),
            "action": random.choice(actions),
            "timestamp": datetime.now().isoformat(),
            "page": f"/page{random.randint(1, 5)}",
        }

        producer.produce(topic="click-stream", value=json.dumps(event))
        producer.poll(0)
        time.sleep(0.05)

    producer.flush()
    print(f"[OK] Generated {num_events} click events")


# Generate events in background
generator_thread = threading.Thread(target=generate_click_events, args=(50,))
generator_thread.start()

# Wait a moment for events to start flowing
time.sleep(1)

# Run stateless processor
processor = StatelessProcessor("click-stream", "processed-clicks")
processor.process(duration_seconds=8)

generator_thread.join()
print("\n[SUCCESS] Stateless processing complete!")

---

## 3. Stateful Stream Processing

### State Management

**Stateful operations need to remember previous events:**
```
Example: Count clicks per user

Event 1: {user: 'Alice', action: 'click'}
State: {'Alice': 1}

Event 2: {user: 'Bob', action: 'click'}
State: {'Alice': 1, 'Bob': 1}

Event 3: {user: 'Alice', action: 'click'}
State: {'Alice': 2, 'Bob': 1}  ← Updated state!
```

### Types of Stateful Operations

**1. Aggregations:**
- Count: How many events?
- Sum: Total of values
- Average: Mean of values
- Min/Max: Extremes

**2. Joins:**
- Stream-Stream: Join two event streams
- Stream-Table: Enrich stream with reference data

**3. Pattern Detection:**
- Sequence detection: A followed by B within time window
- Anomaly detection: Values outside normal range
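
As a rough illustration of sequence detection, the sketch below reports every user whose `first` action is followed by a `second` action within a time limit. It assumes events arrive in timestamp order and are represented as `(user_id, action, timestamp_seconds)` tuples; the function name `detect_sequence` and the tuple layout are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, str, float]  # (user_id, action, timestamp in seconds)


def detect_sequence(
    events: Iterable[Event], first: str, second: str, within_seconds: float
) -> List[Tuple[str, float, float]]:
    """Return (user_id, t_first, t_second) for each matched sequence."""
    pending: Dict[str, float] = {}  # state: latest unmatched `first` per user
    matches = []
    for user_id, action, ts in events:
        if action == first:
            pending[user_id] = ts
        elif action == second and user_id in pending:
            t_first = pending.pop(user_id)
            if ts - t_first <= within_seconds:
                matches.append((user_id, t_first, ts))
    return matches


sample = [
    ("alice", "add_to_cart", 0.0),
    ("bob", "add_to_cart", 5.0),
    ("alice", "checkout", 30.0),   # 30s after add_to_cart: match
    ("bob", "checkout", 200.0),    # 195s after add_to_cart: too slow
]
print(detect_sequence(sample, "add_to_cart", "checkout", within_seconds=60.0))
# [('alice', 0.0, 30.0)]
```

The `pending` dictionary is the operator's state: without it, the `checkout` event alone says nothing about what came before.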

### State Storage

```
Stream Processor
┌────────────────────────────────┐
│  Processing Logic              │
│  ┌──────────────┐              │
│  │ Local State  │              │
│  │ (in-memory)  │              │
│  └──────┬───────┘              │
│         │                      │
│         ↓                      │
│  ┌──────────────┐              │
│  │ State Store  │              │
│  │ (persistent) │              │
│  └──────────────┘              │
└────────────────────────────────┘

Benefits:
- Fast access (in-memory)
- Fault tolerant (persisted)
- Recoverable (from checkpoints)
```
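
The diagram can be approximated with a small sketch: counts live in memory for fast access and are periodically written to a JSON file that acts as the persistent store. The class name `CheckpointedCounter` and the file layout are assumptions made for illustration. Production frameworks use dedicated state backends, and they also record the input offsets alongside the state so that recovery does not double-count events.

```python
import json
import os
import tempfile
from collections import defaultdict


class CheckpointedCounter:
    """In-memory counts with a JSON file as a simple persistent checkpoint."""

    def __init__(self, path: str):
        self.path = path
        self.counts = defaultdict(int)
        # Recover state from the last checkpoint, if one exists
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self.counts.update(json.load(f))

    def increment(self, key: str) -> None:
        self.counts[key] += 1

    def checkpoint(self) -> None:
        """Write to a temp file, then swap it in so a crash never leaves a partial file."""
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(self.counts, f)
        os.replace(tmp_path, self.path)


path = os.path.join(tempfile.mkdtemp(), "counts.json")
counter = CheckpointedCounter(path)
for user in ["alice", "bob", "alice"]:
    counter.increment(user)
counter.checkpoint()

restored = CheckpointedCounter(path)  # simulates a restart
print(dict(restored.counts))  # {'alice': 2, 'bob': 1}
```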

In [None]:
# Example: Stateful stream processor - Aggregations
class StatefulAggregator:
    """Process events with state (counting, summing, etc.)"""

    def __init__(self, input_topic):
        self.input_topic = input_topic

        # State: counts per user
        self.user_counts = defaultdict(int)

        # State: total spend per user
        self.user_spend = defaultdict(float)

        # State: actions per user
        self.user_actions = defaultdict(list)

        self.consumer = Consumer(
            {
                "bootstrap.servers": "localhost:9092",
                "group.id": "stateful-aggregator",
                "auto.offset.reset": "earliest",
            }
        )

    def update_state(self, event):
        """Update internal state with new event"""
        user_id = event["user_id"]

        # Increment count
        self.user_counts[user_id] += 1

        # Track spend
        if "amount" in event:
            self.user_spend[user_id] += event["amount"]

        # Track actions
        self.user_actions[user_id].append(event.get("action", "unknown"))

    def get_user_summary(self, user_id):
        """Get aggregated state for a user"""
        return {
            "user_id": user_id,
            "event_count": self.user_counts[user_id],
            "total_spend": self.user_spend[user_id],
            "actions": self.user_actions[user_id][-5:],  # Last 5 actions
        }

    def process(self, duration_seconds=10):
        """Run the stateful processor"""
        self.consumer.subscribe([self.input_topic])

        events_processed = 0
        start_time = time.time()

        print(f"[OK] Starting stateful aggregator...\n")

        try:
            while time.time() - start_time < duration_seconds:
                msg = self.consumer.poll(timeout=1.0)

                if msg is None:
                    continue

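                if msg.error():
                    # Surface consumer errors instead of skipping them silently
                    print(f"[FAIL] Consumer error: {msg.error()}")
                    continue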

                event = json.loads(msg.value().decode("utf-8"))

                # Update state
                self.update_state(event)
                events_processed += 1

                # Show progress
                if events_processed % 10 == 0:
                    print(
                        f"[{events_processed}] Processed events, tracking {len(self.user_counts)} users"
                    )

        finally:
            self.consumer.close()

            # Show final state
            print(f"\n[DATA] Final Aggregation Results:\n")

            # Top 5 users by event count
            top_users = sorted(self.user_counts.items(), key=lambda x: x[1], reverse=True)[:5]

            for user_id, count in top_users:
                summary = self.get_user_summary(user_id)
                print(f"{user_id}:")
                print(f"  Events: {summary['event_count']}")
                print(f"  Spend: ${summary['total_spend']:.2f}")
                print(f"  Recent actions: {summary['actions']}")
                print()


print("[OK] StatefulAggregator class defined")

In [None]:
# Generate transaction events
def generate_transactions(num_events=100):
    """Generate sample transaction events"""
    producer = Producer({"bootstrap.servers": "localhost:9092"})

    actions = ["view", "add_to_cart", "purchase", "return"]

    for i in range(num_events):
        action = random.choice(actions)
        event = {
            "event_id": f"txn_{i}",
            "user_id": f"user_{random.randint(1, 5)}",  # 5 users
            "action": action,
            "timestamp": datetime.now().isoformat(),
        }

        # Add amount for purchases
        if action == "purchase":
            event["amount"] = random.randint(10, 200)

        producer.produce(topic="transactions", value=json.dumps(event))
        producer.poll(0)
        time.sleep(0.03)

    producer.flush()
    print(f"[OK] Generated {num_events} transaction events")


# Generate and process
generator_thread = threading.Thread(target=generate_transactions, args=(100,))
generator_thread.start()

time.sleep(1)

aggregator = StatefulAggregator("transactions")
aggregator.process(duration_seconds=8)

generator_thread.join()
print("[SUCCESS] Stateful aggregation complete!")

---

## 4. Windowing

### Why Windowing?

**Problem**: Infinite streams need bounded computations
```
Question: "Count events per hour"
Stream: [e1][e2][e3]... (infinite)

Solution: Divide stream into windows!
```

### Types of Windows

**1. Tumbling Windows (Fixed, Non-overlapping):**
```
Window size: 1 hour

00:00 ────────────── 01:00 ────────────── 02:00
  [  Window 1    ]     [  Window 2    ]
  Count: 10 events     Count: 15 events

Properties:
- Fixed size
- No overlap
- Each event in exactly ONE window
```

**2. Sliding Windows (Overlapping):**
```
Window size: 1 hour, Slide: 15 minutes

00:00 ──────── 00:15 ──────── 00:30 ──────── 00:45 ──────── 01:00
  [    W1     ]
       [    W2     ]
            [    W3     ]
                 [    W4     ]

Properties:
- Fixed size
- Windows overlap
- Each event in MULTIPLE windows
- Good for: Moving averages, trend detection
```
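
To see why each event falls into multiple sliding windows, here is a minimal sketch of window assignment. It assumes integer timestamps in seconds, half-open windows `[start, start + size)`, and window starts aligned to multiples of the slide; the function name `sliding_window_starts` is hypothetical.

```python
from typing import List


def sliding_window_starts(ts: int, size: int, slide: int) -> List[int]:
    """Return the start of every window [start, start + size) that contains ts."""
    start = (ts // slide) * slide  # latest window that has already started
    starts = []
    while start > ts - size:       # window still covers ts
        starts.append(start)
        start -= slide
    return sorted(starts)


# 1-hour windows sliding every 15 minutes; event at second 4000 (01:06:40)
print(sliding_window_starts(4000, size=3600, slide=900))
# [900, 1800, 2700, 3600]

# With slide == size the result collapses to a single tumbling window
print(sliding_window_starts(4000, size=3600, slide=3600))
# [3600]
```

Each event belongs to `size / slide` windows (four here), which is also the factor by which sliding windows multiply state and output compared with tumbling windows.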

**3. Session Windows (Gap-based):**
```
Inactivity gap: 5 minutes

Events: [e1]─2min─[e2]─1min─[e3]───7min───[e4]─3min─[e5]
        └──────── Session 1 ──────┘       └── Session 2 ──┘

Properties:
- Dynamic size
- Ends after inactivity gap
- Good for: User sessions, activity bursts
```
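
The session diagram can be reproduced with a short sketch that groups timestamps whenever the pause between neighbours stays below the inactivity gap. It works on a finished list of timestamps rather than a live stream, and the name `sessionize` plus the choice that a pause exactly equal to the gap starts a new session are assumptions.

```python
from typing import Iterable, List


def sessionize(timestamps: Iterable[float], gap: float) -> List[List[float]]:
    """Group timestamps into sessions separated by at least `gap` of inactivity."""
    sessions: List[List[float]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # inactivity gap reached: new session
    return sessions


# Minutes from the diagram: e1=0, e2=2, e3=3, e4=10, e5=13; gap = 5 minutes
print(sessionize([0, 2, 3, 10, 13], gap=5))
# [[0, 2, 3], [10, 13]]
```

In a live stream the processor cannot know a session has ended until the gap has elapsed, so session results are always emitted with at least `gap` of delay.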

In [None]:
# Implement Tumbling Window Aggregator
class TumblingWindowAggregator:
    """Aggregate events in fixed-size, non-overlapping windows"""

    def __init__(self, window_size_seconds=10):
        self.window_size = window_size_seconds
        self.windows = {}  # window_id -> events
        self.current_window_start = None

    def get_window_id(self, timestamp):
        """Determine which window this event belongs to"""
        event_time = datetime.fromisoformat(timestamp)
        epoch = int(event_time.timestamp())
        window_id = (epoch // self.window_size) * self.window_size
        return window_id

    def add_event(self, event):
        """Add event to appropriate window"""
        window_id = self.get_window_id(event["timestamp"])

        if window_id not in self.windows:
            self.windows[window_id] = []

        self.windows[window_id].append(event)

    def get_window_results(self, window_id):
        """Get aggregated results for a window"""
        events = self.windows.get(window_id, [])

        if not events:
            return None

        # Aggregate
        action_counts = defaultdict(int)
        for event in events:
            action_counts[event.get("action", "unknown")] += 1

        window_start = datetime.fromtimestamp(window_id)
        window_end = window_start + timedelta(seconds=self.window_size)

        return {
            "window_start": window_start.isoformat(),
            "window_end": window_end.isoformat(),
            "event_count": len(events),
            "action_counts": dict(action_counts),
        }

    def print_all_windows(self):
        """Display results for all windows"""
        print("\n[DATA] Tumbling Window Results:\n")

        for window_id in sorted(self.windows.keys()):
            result = self.get_window_results(window_id)
            if result:
                print(f"Window {window_id}:")
                print(f"  Time: {result['window_start'][:19]} to {result['window_end'][11:19]}")
                print(f"  Events: {result['event_count']}")
                print(f"  Actions: {result['action_counts']}")
                print()


print("[OK] TumblingWindowAggregator class defined")

In [None]:
# Test tumbling windows
def generate_sensor_data(num_events=60, window_aggregator=None):
    """Generate sensor events over time"""
    actions = ["temp_reading", "humidity_reading", "motion_detected"]

    for i in range(num_events):
        event = {
            "event_id": f"sensor_{i}",
            "sensor_id": f"sensor_{random.randint(1, 3)}",
            "action": random.choice(actions),
            "value": random.randint(20, 30),
            "timestamp": datetime.now().isoformat(),
        }

        if window_aggregator:
            window_aggregator.add_event(event)

        time.sleep(0.1)  # 100ms between events

    print(f"[OK] Generated {num_events} sensor events")


# Create aggregator with 10-second windows
window_agg = TumblingWindowAggregator(window_size_seconds=10)

print("[OK] Generating events with 10-second tumbling windows...\n")
generate_sensor_data(num_events=60, window_aggregator=window_agg)

# Show results
window_agg.print_all_windows()

print("[SUCCESS] Tumbling window aggregation complete!")
print("\n[OK] Notice: Events are grouped into non-overlapping 10-second windows")

---

## 5. Event Time vs Processing Time

### The Two Notions of Time

**Event Time**: When the event actually occurred
```
Mobile phone disconnected: 10:00:00 AM
Event created: 10:00:00 AM ← Event Time
```

**Processing Time**: When the system processes the event
```
Phone reconnects: 10:30:00 AM
Event reaches Kafka: 10:30:01 AM
Processed: 10:30:02 AM ← Processing Time
```

### Why Event Time Matters

**Problem with Processing Time:**
```
Events:     A(10:00) → B(10:01) → C(10:02)
            Network delay...
Arrive:     C(10:05) → A(10:07) → B(10:08)
            ↑ Out of order!

Processing Time Windows (10:05-10:10):
  Would count: C, A, B ← Wrong window: reflects when events ARRIVED

Event Time Windows (10:00-10:05):
  Would count: A, B, C ← Correct window: reflects when events OCCURRED
```
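
The example above can be replayed in a few lines. Times are expressed as minutes after 10:00, windows are 5 minutes wide, and the helper name `window_start` is hypothetical.

```python
from collections import defaultdict


def window_start(minute: int, size: int = 5) -> int:
    """Start of the tumbling window (in minutes after 10:00) containing `minute`."""
    return (minute // size) * size


# (name, event_minute, arrival_minute), listed in arrival order
events = [("C", 2, 5), ("A", 0, 7), ("B", 1, 8)]

by_event_time = defaultdict(list)
by_processing_time = defaultdict(list)
for name, event_minute, arrival_minute in events:
    by_event_time[window_start(event_minute)].append(name)
    by_processing_time[window_start(arrival_minute)].append(name)

print(dict(by_event_time))       # {0: ['C', 'A', 'B']}  -> window 10:00-10:05
print(dict(by_processing_time))  # {5: ['C', 'A', 'B']}  -> window 10:05-10:10
```

Both approaches see the same three events, but only event-time windowing credits them to the interval in which they happened. A report for 10:00-10:05 built on processing time would show zero events.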

### Watermarks

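**Definition**: A watermark with value T is an assertion that no more events with a timestamp earlier than T are expected to arrive. Any event that does arrive with an earlier timestamp is treated as late.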

```
Events arrive:    E1(10:00) E2(10:01) E3(10:03)
Watermark:        ────────────────────────────→ 10:02
                  "No events before 10:02 will arrive"

Late event:       E4(10:01) ← Timestamp before watermark!
                  Options:
                  1. Drop (ignore)
                  2. Accept (adjust results)
                  3. Side output (special handling)
```
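
One common way to produce a watermark is to take the highest event time seen so far and subtract a fixed delay. The sketch below replays the diagram under that assumption, with times in minutes after 10:00 and a delay of 1 minute (chosen so that the watermark lands on 10:02). The class name `BoundedDelayWatermark` is hypothetical.

```python
from typing import Optional


class BoundedDelayWatermark:
    """Watermark = highest event time seen so far minus a fixed delay."""

    def __init__(self, max_delay: int):
        self.max_delay = max_delay
        self.max_seen: Optional[int] = None

    def observe(self, event_time: int) -> int:
        """Record an event time and return the current watermark."""
        if self.max_seen is None or event_time > self.max_seen:
            self.max_seen = event_time
        # The watermark never moves backwards because max_seen never decreases
        return self.max_seen - self.max_delay


wm = BoundedDelayWatermark(max_delay=1)
watermarks = [wm.observe(t) for t in (0, 1, 3)]  # E1, E2, E3
print(watermarks)  # [-1, 0, 2]  -> watermark is 10:02 after E3

e4 = 1                     # E4 carries timestamp 10:01
current = wm.observe(e4)   # watermark stays at 10:02
e4_is_late = e4 < current
print(current, e4_is_late)  # 2 True
```

A larger delay tolerates more disorder but postpones every window result by the same amount, so the delay is a direct trade-off between completeness and latency.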

### Handling Late Data

**Strategy 1: Allowed Lateness**
```
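Window: 10:00-10:10
Watermark delay: 2 minutes (watermark = max event time seen - 2min)
Allowed lateness: 5 minutes

Window first fires when the watermark reaches 10:10
  (max event time seen: 10:12)
Late events are accepted until the watermark reaches 10:10 + 5min = 10:15
  (max event time seen: 10:12 + 5min = 10:17)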
```

**Strategy 2: Side Outputs**
```
On-time events → Main output
Late events → Side output (for investigation)
```
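
The two strategies can be combined in a single routing decision. The sketch below compares the watermark with the end of the event's window: before the window end the event is on time, within the allowed lateness it is accepted as a late update, and after that it goes to a side output. Times are minutes after 10:00, and the function name `route_event`, the return labels, and the exact boundary handling are assumptions; real frameworks differ in how they treat the boundary instants.

```python
from typing import List, Tuple


def route_event(event_ts: int, watermark: int, window_size: int, allowed_lateness: int) -> str:
    """Classify an event as 'on_time', 'late_accepted', or 'side_output'."""
    window_end = (event_ts // window_size) * window_size + window_size
    if watermark < window_end:
        return "on_time"            # window has not fired yet
    if watermark < window_end + allowed_lateness:
        return "late_accepted"      # window fired, but its state is still kept
    return "side_output"            # window state already discarded


main_output: List[Tuple[int, str]] = []
side_output: List[Tuple[int, str]] = []

# The same 10:09 event (window 10:00-10:10) seen at three watermark positions
for ts, wm in [(9, 8), (9, 12), (9, 15)]:
    decision = route_event(ts, wm, window_size=10, allowed_lateness=5)
    target = side_output if decision == "side_output" else main_output
    target.append((ts, decision))

print(main_output)  # [(9, 'on_time'), (9, 'late_accepted')]
print(side_output)  # [(9, 'side_output')]
```

Routing to a side output instead of dropping keeps the late events available for auditing or for a later batch correction.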

In [None]:
# Simulate late-arriving events
class EventTimeProcessor:
    """Process events using event time with watermarks"""

    def __init__(self, window_size_seconds=10, allowed_lateness_seconds=5):
        self.window_size = window_size_seconds
        self.allowed_lateness = allowed_lateness_seconds
        self.windows = {}
        self.watermark = None
        self.late_events = []

    def update_watermark(self, event_time):
        """Update watermark (event_time - 2 seconds)"""
        event_dt = datetime.fromisoformat(event_time)
        new_watermark = event_dt - timedelta(seconds=2)

        if self.watermark is None or new_watermark > self.watermark:
            self.watermark = new_watermark
            return True
        return False

    def get_window_id(self, timestamp):
        """Get window ID from event timestamp"""
        event_time = datetime.fromisoformat(timestamp)
        epoch = int(event_time.timestamp())
        return (epoch // self.window_size) * self.window_size

    def is_late(self, event_time):
        """Check if event is late"""
        if self.watermark is None:
            return False

        event_dt = datetime.fromisoformat(event_time)
        return event_dt < self.watermark

    def process_event(self, event):
        """Process event using event time"""
        event_time = event["timestamp"]

        # Update watermark
        self.update_watermark(event_time)

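        # Check if late.
        # Simplification: lateness is measured as the distance behind the
        # watermark, not against the end of the event's window.
        if self.is_late(event_time):
            event_dt = datetime.fromisoformat(event_time)
            lateness = (self.watermark - event_dt).total_seconds()

            if lateness <= self.allowed_lateness:
                # Late but within allowed lateness: still add to its window
                window_id = self.get_window_id(event_time)
                if window_id not in self.windows:
                    self.windows[window_id] = []
                self.windows[window_id].append(event)
                return "late_accepted"
            else:
                # Too late: keep it out of the windows, but record it
                # (acts as a simple side output for inspection)
                self.late_events.append(event)
                return "too_late"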
        else:
            # On-time event
            window_id = self.get_window_id(event_time)
            if window_id not in self.windows:
                self.windows[window_id] = []
            self.windows[window_id].append(event)
            return "on_time"

    def print_stats(self):
        """Print processing statistics"""
        print("\n[DATA] Event Time Processing Results:\n")
        print(f"Watermark: {self.watermark.isoformat() if self.watermark else 'None'}")
        print(f"Windows processed: {len(self.windows)}")
        print(f"Late events dropped: {len(self.late_events)}")

        total_events = sum(len(events) for events in self.windows.values())
        print(f"Total events in windows: {total_events}")


print("[OK] EventTimeProcessor class defined")

In [None]:
# Simulate events with realistic delays
def generate_events_with_delays():
    """Generate events with some arriving late"""
    processor = EventTimeProcessor(window_size_seconds=10, allowed_lateness_seconds=3)

    base_time = datetime.now()
    events = []

    # Create events with timestamps
    for i in range(20):
        event_time = base_time + timedelta(seconds=i)
        events.append(
            {"id": i, "timestamp": event_time.isoformat(), "value": random.randint(1, 100)}
        )

    # Shuffle to simulate out-of-order arrival
    random.shuffle(events)

    # Process events
    on_time = 0
    late_accepted = 0
    too_late = 0

    print("[OK] Processing events with out-of-order arrival...\n")

    for i, event in enumerate(events):
        result = processor.process_event(event)

        if result == "on_time":
            on_time += 1
        elif result == "late_accepted":
            late_accepted += 1
            print(f"[{i+1}] Event {event['id']} arrived LATE but ACCEPTED")
        elif result == "too_late":
            too_late += 1
            print(f"[{i+1}] Event {event['id']} arrived TOO LATE, DROPPED")

        time.sleep(0.05)

    processor.print_stats()

    print(f"\n[DATA] Event Classification:")
    print(f"  On-time: {on_time}")
    print(f"  Late (accepted): {late_accepted}")
    print(f"  Too late (dropped): {too_late}")


generate_events_with_delays()

print("\n[SUCCESS] Event time processing complete!")
print("\n[OK] Key insight: Event time ensures correct results despite out-of-order arrival")

---

## 6. Mini-Project: Real-Time Analytics Dashboard

Let's build a complete stream processor that:
- Filters events
- Aggregates in tumbling windows
- Handles late data
- Produces real-time metrics

In [None]:
# Complete real-time analytics processor
class RealTimeAnalytics:
    """
    Real-time analytics processor combining:
    - Filtering
    - Tumbling windows
    - Event time processing
    - Stateful aggregations
    """

    def __init__(self, window_size_seconds=10):
        self.window_size = window_size_seconds
        self.windows = {}
        self.stats = {"total_events": 0, "filtered_events": 0, "windows_completed": 0}

    def should_process(self, event):
        """Filter: Only process high-value events"""
        return event.get("value", 0) > 50

    def get_window_id(self, timestamp):
        """Assign event to window"""
        event_time = datetime.fromisoformat(timestamp)
        epoch = int(event_time.timestamp())
        return (epoch // self.window_size) * self.window_size

    def process_event(self, event):
        """Process incoming event"""
        self.stats["total_events"] += 1

        # Filter
        if not self.should_process(event):
            self.stats["filtered_events"] += 1
            return False

        # Assign to window
        window_id = self.get_window_id(event["timestamp"])

        if window_id not in self.windows:
            self.windows[window_id] = {
                "events": [],
                "count": 0,
                "sum": 0,
                "max": float("-inf"),
                "min": float("inf"),
            }

        # Update window state
        window = self.windows[window_id]
        window["events"].append(event)
        window["count"] += 1
        window["sum"] += event.get("value", 0)
        window["max"] = max(window["max"], event.get("value", 0))
        window["min"] = min(window["min"], event.get("value", 0))

        return True

    def get_window_metrics(self, window_id):
        """Calculate metrics for a window"""
        if window_id not in self.windows:
            return None

        window = self.windows[window_id]

        if window["count"] == 0:
            return None

        window_start = datetime.fromtimestamp(window_id)
        window_end = window_start + timedelta(seconds=self.window_size)

        return {
            "window_start": window_start.isoformat(),
            "window_end": window_end.isoformat(),
            "count": window["count"],
            "sum": window["sum"],
            "avg": window["sum"] / window["count"],
            "max": window["max"],
            "min": window["min"],
        }

    def print_dashboard(self):
        """Print real-time dashboard"""
        print("\n" + "=" * 60)
        print("           REAL-TIME ANALYTICS DASHBOARD")
        print("=" * 60)

        print(f"\nOverall Statistics:")
        print(f"  Total events received: {self.stats['total_events']}")
        print(f"  Events filtered out: {self.stats['filtered_events']}")
        print(f"  Events processed: {self.stats['total_events'] - self.stats['filtered_events']}")
        print(f"  Active windows: {len(self.windows)}")

        print(f"\nWindow Metrics:")
        print(f"{'Window':<20} {'Count':<8} {'Avg':<10} {'Max':<8} {'Min':<8}")
        print("-" * 60)

        for window_id in sorted(self.windows.keys()):
            metrics = self.get_window_metrics(window_id)
            if metrics:
                window_str = metrics["window_start"][11:19]
                print(
                    f"{window_str:<20} {metrics['count']:<8} {metrics['avg']:<10.2f} "
                    f"{metrics['max']:<8} {metrics['min']:<8}"
                )

        print("=" * 60)


print("[OK] RealTimeAnalytics class defined")

In [None]:
# Run the analytics processor
def generate_analytics_events(analytics, num_events=100):
    """Generate events for analytics"""
    base_time = datetime.now()

    for i in range(num_events):
        event = {
            "id": i,
            "timestamp": (base_time + timedelta(seconds=i * 0.5)).isoformat(),
            "value": random.randint(1, 100),
            "sensor_id": f"sensor_{random.randint(1, 5)}",
        }

        analytics.process_event(event)

        # Print dashboard every 20 events
        if (i + 1) % 20 == 0:
            analytics.print_dashboard()
            time.sleep(0.5)


# Create and run analytics
analytics = RealTimeAnalytics(window_size_seconds=10)

print("[OK] Starting real-time analytics...\n")
generate_analytics_events(analytics, num_events=100)

# Final dashboard
analytics.print_dashboard()

print("\n[SUCCESS] Real-time analytics complete!")

---

## 7. Key Takeaways

[OK] **Stream Processing**: Process data continuously as it arrives

[OK] **Stateless vs Stateful**: Stateless operations treat each event independently; stateful operations carry state across events

[OK] **Windowing**: Divide infinite streams into bounded computations

[OK] **Window Types**: Tumbling (non-overlapping), Sliding (overlapping), Session (gap-based)

[OK] **Event Time**: Use event timestamps for correctness

[OK] **Watermarks**: Track event-time progress so windows can close and late data can be handled explicitly

### Design Patterns

**1. Filter-Map-Reduce:**
```
Stream → Filter → Map → Reduce → Output
```

**2. Windowed Aggregation:**
```
Stream → Assign to Windows → Aggregate → Output
```

**3. Event Time Processing:**
```
Stream → Extract Event Time → Generate Watermarks → Window → Output
```

### Production Considerations

1. **Choose appropriate window size** based on latency requirements
2. **Handle late data** with allowed lateness
3. **Use event time** for correctness
4. **Monitor watermark lag** to detect issues
5. **Checkpoint state** for fault tolerance
6. **Scale with partitions** for high throughput

---

## 8. Practice Exercises

1. **Implement sliding window** aggregation (overlapping windows)
2. **Create session window** processor (gap-based windows)
3. **Build anomaly detector** that alerts when values exceed threshold
4. **Implement join** of two streams (correlation)
5. **Add watermark visualization** to see progression over time

In [None]:
# Your practice code here

---

## 9. Next Steps

Congratulations on completing Module 03!

### What You've Learned

- [OK] Stream processing fundamentals
- [OK] Stateless and stateful operations
- [OK] Windowing techniques
- [OK] Event time vs processing time
- [OK] Late data handling with watermarks

### Coming Up in Module 04: Apache Flink Basics

You'll learn:
- Apache Flink architecture
- DataStream API
- Flink operators and transformations
- Connectors (Kafka source/sink)
- Running Flink jobs

### Resources

- [Stream Processing Concepts](https://www.oreilly.com/library/view/streaming-systems/9781491983867/)
- [Windowing in Stream Processing](https://www.confluent.io/blog/windowing-in-kafka-streams/)
- [Event Time and Watermarks](https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/time/)

---

**Ready for Apache Flink?** Open `04_apache_flink_basics.ipynb` to continue!