# Module 00: Setup and Introduction to Streaming

**Estimated Time:** 30 minutes

## Learning Objectives

By the end of this module, you will:
- Verify your streaming environment is set up correctly
- Understand the difference between batch and stream processing
- Learn core streaming concepts (events, topics, producers, consumers)
- Connect to a local Kafka cluster and open the Flink dashboard
- Run your first streaming "Hello World" example

---

## 1. Environment Verification

Let's verify that all required components are installed and running.

In [None]:
# Check Python version
import sys

print(f"Python version: {sys.version}")
print(f"Python version info: {sys.version_info}")

if sys.version_info < (3, 8):
    print("\n[WARNING] Python 3.8 or higher is recommended")
else:
    print("\n[OK] Python version is compatible")

In [None]:
# Verify core streaming libraries
try:
    from confluent_kafka import Producer, Consumer, KafkaException
    from confluent_kafka.admin import AdminClient

    print("[OK] confluent-kafka library installed")
except ImportError as e:
    print(f"[FAIL] confluent-kafka not installed: {e}")
    print("\nPlease run: pip install confluent-kafka")

In [None]:
# Check if Docker is running
import subprocess

try:
    result = subprocess.run(["docker", "ps"], capture_output=True, text=True, timeout=5)
    if result.returncode == 0:
        print("[OK] Docker is running")
        print("\nRunning containers:")
        print(result.stdout[:500])  # Show first 500 chars
    else:
        print("[FAIL] Docker command failed")
        print("Please start Docker Desktop")
except Exception as e:
    print(f"[FAIL] Cannot connect to Docker: {e}")
    print("\nPlease ensure Docker Desktop is installed and running")

In [None]:
# Test Kafka connection
from confluent_kafka.admin import AdminClient

kafka_config = {"bootstrap.servers": "localhost:9092"}

try:
    admin_client = AdminClient(kafka_config)
    # Test connection by listing topics
    metadata = admin_client.list_topics(timeout=5)
    print("[OK] Successfully connected to Kafka")
    print(f"\nKafka cluster has {len(metadata.topics)} topics")
    if metadata.topics:
        print(f"Topics: {list(metadata.topics.keys())[:5]}")
except Exception as e:
    print(f"[FAIL] Cannot connect to Kafka: {e}")
    print("\nPlease ensure Kafka is running:")
    print("  docker-compose up -d")

---

## 2. Batch vs Stream Processing

### What's the Difference?

**Batch Processing:**
- Processes data in **bounded chunks** (for example, one file or one day of records)
- Runs on a **schedule** (hourly, daily, weekly)
- Higher **latency** (minutes to hours)
- Examples: Daily reports, monthly analytics, ETL jobs

**Stream Processing:**
- Processes data **continuously** as it arrives
- Runs **24/7** as a long-lived job, in near real time
- Lower **latency** (milliseconds to seconds)
- Examples: Fraud detection, real-time dashboards, monitoring

### Visual Comparison

```
BATCH PROCESSING:
Events → [Buffer] → Process Every Hour → Results
   |        |           ↓
  1hr      1hr        Latency: 1 hour

STREAM PROCESSING:
Events → Process Immediately → Results
   |            ↓
  Continuous   Latency: milliseconds
```

### When to Use Each?

| Use Case | Batch | Stream |
|----------|-------|--------|
| Fraud detection | ❌ | [OK] |
| Monthly reports | [OK] | ❌ |
| Real-time recommendations | ❌ | [OK] |
| Historical analysis | [OK] | ❌ |
| Live dashboards | ❌ | [OK] |
| Data warehousing | [OK] | ❌ |
| IoT monitoring | ❌ | [OK] |

**Key Insight**: Many modern systems use **both** - stream for real-time, batch for historical analysis.
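
The contrast can be sketched in plain Python, without Kafka. This is only an illustration: `batch_total` and `stream_totals` are hypothetical names, and a short list of payment amounts stands in for the event source.

```python
from typing import Iterable, Iterator, List


def batch_total(events: List[int]) -> int:
    """Batch style: wait until the whole chunk is collected, then process it once."""
    return sum(events)


def stream_totals(events: Iterable[int]) -> Iterator[int]:
    """Stream style: update and emit the result as each event arrives."""
    running = 0
    for amount in events:
        running += amount
        yield running


payments = [10, 25, 5, 60]

# Batch: one result, available only after all four events are in.
print("Batch result:", batch_total(payments))

# Stream: an up-to-date result after every event.
for total in stream_totals(payments):
    print("Running total:", total)
```

Both approaches end at the same total; the difference is how early intermediate results become available.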

---

## 3. Core Streaming Concepts

### Events
An **event** is something that happened at a point in time.
- User clicked a button
- Sensor recorded a temperature
- Payment was processed

### Topics
A **topic** is a named feed of related events.
- "user-clicks" topic
- "temperature-readings" topic
- "payments" topic

### Producers
**Producers** write events to topics.
- Web application sends user clicks
- IoT device sends sensor data
- Payment service sends transactions

### Consumers
**Consumers** read events from topics.
- Analytics service reads clicks
- Monitoring service reads sensor data
- Fraud detection reads payments

### Architecture

```
Producers                  Kafka Cluster              Consumers
    |                            |                        |
[Web App] ────┐                  |                    ┌──[Analytics]
              ├─→ Topic: clicks ─┤                    |
[Mobile] ─────┘                  |                    ├──[Dashboard]
                                 |                    |
[Sensors] ─────→ Topic: sensors ─┤                    └──[Alerts]
                                 |
[PaymentAPI] ─→ Topic: payments ─┤
```
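
To make these roles concrete, here is a toy in-memory model of a topic. It is a teaching sketch under simplifying assumptions (one partition, no network, no persistence); `ToyTopic` and `ToyConsumer` are hypothetical classes, not part of any Kafka client.

```python
from typing import Any, List


class ToyTopic:
    """An append-only list of events; a stand-in for one single-partition topic."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._log: List[Any] = []

    def append(self, event: Any) -> int:
        """Producer side: add an event to the end and return its offset."""
        self._log.append(event)
        return len(self._log) - 1

    def read_from(self, offset: int) -> List[Any]:
        """Consumer side: return every event at or after the given offset."""
        return self._log[offset:]


class ToyConsumer:
    """Tracks its own read position, so consumers do not affect each other."""

    def __init__(self, topic: ToyTopic) -> None:
        self.topic = topic
        self.offset = 0

    def poll(self) -> List[Any]:
        """Return all unread events and advance this consumer's offset."""
        events = self.topic.read_from(self.offset)
        self.offset += len(events)
        return events


clicks = ToyTopic("clicks")
clicks.append({"user": "u1", "button": "buy"})
clicks.append({"user": "u2", "button": "cancel"})

analytics = ToyConsumer(clicks)
dashboard = ToyConsumer(clicks)

print(len(analytics.poll()))  # 2: analytics reads both events
clicks.append({"user": "u3", "button": "buy"})
print(len(analytics.poll()))  # 1: only the new event
print(len(dashboard.poll()))  # 3: dashboard starts at offset 0 and sees everything
```

Reading does not remove events from the log, which is why `dashboard` still sees all three events after `analytics` has consumed them.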

---

## 4. Your First Streaming Example

Let's create a simple producer-consumer example!

In [None]:
import json
from datetime import datetime
from confluent_kafka import Producer, Consumer, KafkaException
import time

# Configuration
TOPIC_NAME = "hello-streaming"

producer_config = {"bootstrap.servers": "localhost:9092", "client.id": "hello-producer"}

consumer_config = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "hello-consumer-group",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": True,
}

print("[OK] Configuration ready")

In [None]:
# Step 1: Create a Producer and send events
def delivery_callback(err, msg):
    """Callback for message delivery confirmation"""
    if err:
        print(f"[FAIL] Message delivery failed: {err}")
    else:
        print(
            f"[OK] Message delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}"
        )


# Create producer
producer = Producer(producer_config)

# Send 5 events
print("Sending events to Kafka...\n")
for i in range(5):
    event = {
        "event_id": i,
        "message": f"Hello from event {i}",
        # astimezone() attaches the local UTC offset so the timestamp is unambiguous
        "timestamp": datetime.now().astimezone().isoformat(),
    }

    # Convert to JSON string
    event_json = json.dumps(event)

    # Produce event
    producer.produce(topic=TOPIC_NAME, key=str(i), value=event_json, callback=delivery_callback)

    # Trigger delivery callbacks
    producer.poll(0)

    time.sleep(0.5)  # Small delay

# Wait for all messages to be delivered
producer.flush()

print("\n[SUCCESS] All events sent to Kafka!")

In [None]:
# Step 2: Create a Consumer and read events
print("Reading events from Kafka...\n")

consumer = Consumer(consumer_config)
consumer.subscribe([TOPIC_NAME])

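messages_read = 0
max_messages = 5
empty_polls = 0
max_empty_polls = 5  # Give up after about 10 seconds of silence instead of looping forever

try:
    while messages_read < max_messages and empty_polls < max_empty_polls:
        # Poll for messages (timeout in seconds)
        msg = consumer.poll(timeout=2.0)

        if msg is None:
            empty_polls += 1
            print("[WARNING] No message received, waiting...")
            continue

        if msg.error():
            print(f"[FAIL] Consumer error: {msg.error()}")
            continue

        # Successfully received a message
        key = msg.key().decode("utf-8") if msg.key() else None
        value = msg.value().decode("utf-8")
        event = json.loads(value)

        print(f"[OK] Received: {event['message']} (ID: {event['event_id']})")
        print(f"     Key: {key}, Partition: {msg.partition()}, Offset: {msg.offset()}")

        messages_read += 1

finally:
    # Always close the consumer so it leaves the group cleanly
    consumer.close()

if messages_read == max_messages:
    print(f"\n[SUCCESS] Read {messages_read} events from Kafka!")
else:
    print(f"\n[WARNING] Read only {messages_read} of {max_messages} events.")
    print("If you re-ran this cell, the group's committed offsets may already be past them.")
    print("Re-run the producer cell or change 'group.id' and try again.")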

### What Just Happened?

1. **Producer** created 5 events and sent them to Kafka topic `hello-streaming`
2. Events were **stored** durably in Kafka (on a multi-broker cluster they would also be replicated)
3. **Consumer** read the events from the topic
4. Each event has:
   - **Key**: Used for partitioning
   - **Value**: The actual event data (JSON)
   - **Partition**: Which partition stored it
   - **Offset**: Position in the partition

This is the foundation of **event streaming**!
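
To see why the key matters for partitioning, the sketch below maps keys to partitions by hashing. It is a simplified illustration, not Kafka's actual partitioner: real clients use their own hash functions, and the exact algorithm depends on the client and its configuration. `toy_partition` and the partition count of 3 are assumptions made for this example.

```python
import zlib


def toy_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 here, for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 3  # assumed value for the example

for key in ["user-1", "user-2", "user-1", "user-3", "user-1"]:
    print(f"key={key} -> partition {toy_partition(key, NUM_PARTITIONS)}")
```

Every `user-1` event maps to the same partition, so all events for that key stay in one ordered log.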

---

## 5. Event Time vs Processing Time

In streaming, there are different concepts of "time":

### Event Time
**When the event actually happened** in the real world.
- User clicked at 10:00:00 AM
- Sensor reading at 10:00:05 AM

### Processing Time
**When the system processes the event**.
- Event arrives at Kafka at 10:00:10 AM (strictly speaking, this is *ingestion time*)
- Consumer processes it at 10:00:15 AM

### Why Does This Matter?

```
Event Happens → Network Delay → Kafka → Processing Delay → Result
  10:00:00         (5 sec)      10:00:05     (3 sec)      10:00:08
     ↑                                                        ↑
  Event Time                                           Processing Time
```

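**Challenge**: Events can arrive **out of order** or **late**!
- Event A happens at 10:00:00 but arrives at 10:00:10
- Event B happens at 10:00:05 but arrives at 10:00:08

B happened after A but arrives first, so a consumer that trusts arrival order sees the events in the wrong sequence.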

**Solution**: Use **watermarks** and **event time processing** (covered in Module 05)
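
Events A and B from the challenge above can be replayed in a few lines of Python to make the reordering concrete. This is only an illustration: the calendar date is an arbitrary assumption, and sorting a finished list is not how a stream processor handles late data (that is the job of watermarks).

```python
from datetime import datetime

# Each tuple: (name, event_time, arrival_time), using the times of events A and B.
arrivals = [
    ("A", datetime(2024, 1, 1, 10, 0, 0), datetime(2024, 1, 1, 10, 0, 10)),
    ("B", datetime(2024, 1, 1, 10, 0, 5), datetime(2024, 1, 1, 10, 0, 8)),
]

# Order in which a consumer would see the events (by arrival time).
by_arrival = [name for name, _, _ in sorted(arrivals, key=lambda e: e[2])]

# Order in which the events really happened (by event time).
by_event_time = [name for name, _, _ in sorted(arrivals, key=lambda e: e[1])]

print("Order seen by the consumer:", by_arrival)
print("Order in which events happened:", by_event_time)

for name, happened, arrived in arrivals:
    lag = (arrived - happened).total_seconds()
    print(f"Event {name}: {lag:.0f} s between event time and arrival")
```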

In [None]:
# Example: Event time vs Processing time
import time
from datetime import datetime


def simulate_event_delay():
    """
    Simulate events with delays
    """
    events = []

    # Create events
    for i in range(3):
        event_time = datetime.now()

        # Simulate network delay
        time.sleep(0.5)

        processing_time = datetime.now()

        delay_ms = (processing_time - event_time).total_seconds() * 1000

        events.append(
            {
                "event_id": i,
                "event_time": event_time.isoformat(),
                "processing_time": processing_time.isoformat(),
                "delay_ms": round(delay_ms, 2),
            }
        )

        print(f"Event {i}: Delay = {delay_ms:.2f} ms")

    return events


events = simulate_event_delay()
print(f"\n[DATA] Average delay: {sum(e['delay_ms'] for e in events) / len(events):.2f} ms")

---

## 6. Accessing Web UIs

Several web interfaces are available for monitoring:

### Kafka UI
- URL: http://localhost:8080
- View topics, messages, consumer groups
- Browse event data

### Flink Dashboard
- URL: http://localhost:8082
- Monitor running jobs
- View metrics and checkpoints

### Schema Registry
- URL: http://localhost:8081
- Stores and serves Avro schemas
- A REST API that returns JSON rather than a graphical UI; `/subjects` lists the registered subjects

Open these URLs in your browser to explore!
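
If a page does not load, a quick reachability check from the notebook can narrow down the problem. The sketch below uses only the standard library; `is_reachable` is a hypothetical helper, and the URLs are the ones listed in this section, so adjust them if your setup maps different ports.

```python
import http.client
import urllib.error
import urllib.request

# URLs as listed in this section.
SERVICES = {
    "Kafka UI": "http://localhost:8080",
    "Schema Registry": "http://localhost:8081",
    "Flink Dashboard": "http://localhost:8082",
}


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers an HTTP request, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, or a malformed response.
        return False


for name, url in SERVICES.items():
    status = "[OK]" if is_reachable(url) else "[FAIL]"
    print(f"{status} {name}: {url}")
```

A `[FAIL]` here usually means the container is not running or the port is mapped differently.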

---

## 7. Key Takeaways

[OK] **Environment Setup**: Kafka and Flink are running locally via Docker

[OK] **Batch vs Stream**: Stream processing offers low latency, continuous processing

[OK] **Core Concepts**: Events, topics, producers, consumers

[OK] **First Example**: Successfully produced and consumed events

[OK] **Time Semantics**: Event time vs processing time matters for correctness

### Important Points

1. **Events are immutable** - Once written, they don't change
2. **Topics are logs** - Events are appended, not updated
3. **Multiple consumers** can read the same topic independently
4. **Kafka persists** events (configurable retention)
5. **Order is guaranteed** within a partition, but not across the partitions of a topic

---

## 8. Practice Exercise

Try modifying the producer-consumer example:

1. **Change the event structure**: Add more fields (user_id, action, etc.)
2. **Send more events**: Increase from 5 to 20
3. **Add event time**: Include a separate `event_time` field that records when the action happened, alongside the existing `timestamp`
4. **Multiple consumers**: Run the consumer cell twice (different group IDs)
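
As a possible starting point for items 1 to 3, the sketch below builds richer event payloads. The field names `user_id`, `action`, and `event_time`, and the helper `build_event`, are suggestions for this exercise rather than a required schema.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_event(
    event_id: int,
    user_id: str,
    action: str,
    event_time: Optional[datetime] = None,
) -> str:
    """Serialize one event as JSON, with an explicit UTC event time."""
    if event_time is None:
        event_time = datetime.now(timezone.utc)
    event = {
        "event_id": event_id,
        "user_id": user_id,
        "action": action,
        "event_time": event_time.isoformat(),
    }
    return json.dumps(event)


# Build 20 sample events; each string can be passed as `value` to producer.produce(...).
actions = ["view", "click", "purchase"]
payloads = [build_event(i, f"user-{i % 4}", actions[i % 3]) for i in range(20)]
print(payloads[0])
```

Using `user_id` as the message key would keep each user's events together in one partition.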

Use the cell below to experiment:

In [None]:
# Your code here: Experiment with producer/consumer

---

## 9. Next Steps

Congratulations on completing Module 00!

### Ready to Continue?

In **Module 01: Introduction to Event Streaming**, you'll learn:
- Event-driven architecture principles
- Kafka architecture deep dive
- Partitions, replication, and fault tolerance
- Consumer groups and coordination
- Build your first mini-project

### Before Moving On

Make sure you:
- [OK] Have Docker containers running
- [OK] Successfully connected to Kafka
- [OK] Produced and consumed events
- [OK] Can access web UIs (Kafka UI, Flink Dashboard)

### Resources

- [Kafka Documentation](https://kafka.apache.org/documentation/)
- [Confluent Python Client](https://docs.confluent.io/kafka-clients/python/current/overview.html)
- [Event-Driven Architecture](https://martinfowler.com/articles/201701-event-driven.html)

---

**Ready?** Open `01_introduction_to_event_streaming.ipynb` to continue your streaming journey!