# Module 04: Apache Flink Basics

**Estimated Time:** 90 minutes

## Learning Objectives

By the end of this module, you will:
- Understand Apache Flink architecture and components
- Work with the PyFlink DataStream API
- Implement stream transformations and operators
- Connect Flink to Kafka (source and sink)
- Build and run your first Flink streaming job
- Monitor Flink jobs via the web dashboard

---

## 1. What is Apache Flink?

### Overview

Apache Flink is a **distributed stream processing framework** for:
- Processing unbounded (streaming) and bounded (batch) data
- Stateful computations with exactly-once semantics
- Event-time processing with watermarks
- Low-latency, high-throughput data processing

### Flink vs Other Systems

| Feature | Flink | Spark Streaming | Kafka Streams |
|---------|-------|-----------------|---------------|
| Processing Model | True streaming | Micro-batch | True streaming |
| Latency | Milliseconds | Seconds | Milliseconds |
| State Management | Advanced | Basic | Good |
| Exactly-Once | Yes | Yes | Yes |
| Deployment | Cluster (standalone, YARN, Kubernetes) | Spark cluster | Embedded library |
| Windowing | Advanced | Good | Good |

### Flink Architecture

```
┌─────────────────────────────────────────────────────────┐
│                   Flink Cluster                         │
│                                                         │
│  ┌──────────────────┐        ┌──────────────────┐      │
│  │  JobManager      │        │  TaskManager     │      │
│  │  (Master)        │        │  (Worker)        │      │
│  │                  │        │                  │      │
│  │  - Job scheduling│───────→│  - Execute tasks │      │
│  │  - Checkpointing │        │  - Manage state  │      │
│  │  - Coordination  │        │  - Shuffle data  │      │
│  └──────────────────┘        └──────────────────┘      │
│                                                         │
│                              ┌──────────────────┐      │
│                              │  TaskManager     │      │
│                              │  (Worker)        │      │
│                              └──────────────────┘      │
└─────────────────────────────────────────────────────────┘
         ↑                                        ↓
    Input Sources                            Output Sinks
    (Kafka, files)                          (Kafka, DB)
```

### Key Components

1. **JobManager**: Master node that coordinates job execution
2. **TaskManager**: Worker nodes that execute tasks and store state
3. **Job**: User-defined streaming application
4. **Task**: Unit of execution (parallel operator instance)
5. **Checkpoint**: Snapshot of state for fault tolerance
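
To make the relationship between jobs, tasks, and TaskManager slots more concrete, here is a minimal pure-Python sketch (not Flink API). The operator names, the parallelism values, and the helpers `expand_tasks` and `slots_needed` are hypothetical. The sketch ignores operator chaining and assumes Flink's default slot sharing, under which one slot can hold one parallel instance of each operator, so a job needs as many slots as its highest operator parallelism.

```python
# Illustrative model only: none of these names belong to the Flink API.
job_graph = {"kafka-source": 2, "map": 4, "kafka-sink": 2}  # operator -> parallelism


def expand_tasks(graph):
    """List every parallel operator instance as an (operator, index) pair."""
    return [(op, i) for op, parallelism in graph.items() for i in range(parallelism)]


def slots_needed(graph):
    """Assuming default slot sharing, a job needs max(parallelism) slots."""
    return max(graph.values())


tasks = expand_tasks(job_graph)
print(f"{len(tasks)} parallel instances, {slots_needed(job_graph)} slots needed")
```

Under these assumptions, two TaskManagers with two slots each would be just enough to run this job.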

In [None]:
# Setup: Import PyFlink libraries
try:
    from pyflink.datastream import StreamExecutionEnvironment
    from pyflink.table import StreamTableEnvironment, EnvironmentSettings
    from pyflink.datastream.connectors.kafka import (
        KafkaSource,
        KafkaOffsetsInitializer,
        KafkaSink,
        KafkaRecordSerializationSchema,
    )
    from pyflink.common import Types, WatermarkStrategy, Encoder
    from pyflink.datastream.functions import MapFunction, FilterFunction, FlatMapFunction

    print("[OK] PyFlink libraries loaded")
except ImportError as e:
    print(f"[WARNING] PyFlink not available: {e}")
    print("         Some examples will use simplified implementations")

import json
from datetime import datetime
from confluent_kafka import Producer, Consumer
from confluent_kafka.admin import AdminClient, NewTopic
import time
import random

print("[OK] Libraries ready for Flink examples")

---

## 2. PyFlink DataStream API

### Execution Environment

The execution environment is the entry point for Flink programs:

```python
# Create execution environment
env = StreamExecutionEnvironment.get_execution_environment()

# Configure parallelism
env.set_parallelism(4)

# Enable checkpointing
env.enable_checkpointing(10000)  # Interval in milliseconds: every 10 seconds
```

### DataStream Operations

**Basic Transformation Pattern:**
```
Source → Transformation(s) → Sink

env.from_source(...) \       # Read data
   .map(...) \                # Transform
   .filter(...) \             # Filter
   .key_by(...) \             # Partition
   .window(...) \             # Window
   .reduce(...) \             # Aggregate
   .sink_to(...)              # Write data
```

### Common Transformations

| Operation | Description | Example |
|-----------|-------------|----------|
| `map()` | 1-to-1 transformation | Convert Celsius to Fahrenheit |
| `filter()` | Select matching elements | Filter premium users |
| `flat_map()` | 1-to-N transformation | Split sentence into words |
| `key_by()` | Partition by key | Group by user_id |
| `reduce()` | Combine elements | Sum, count, max |
| `aggregate()` | Custom aggregation | Moving average |
| `window()` | Group into windows | 1-minute tumbling window |
| `union()` | Merge streams | Combine web + mobile events |
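
The semantics of `reduce()` and `union()` can be illustrated in plain Python rather than PyFlink. The sketch below assumes the behavior of a keyed `reduce()` in the DataStream API, which emits an updated running result for every incoming element. The helper names `union` and `keyed_rolling_reduce` are made up for illustration, and the concatenation used here is a simplification: a real union interleaves elements in no guaranteed order.

```python
from itertools import chain


def union(*streams):
    """Merge several streams into one (simplified to concatenation)."""
    return chain(*streams)


def keyed_rolling_reduce(stream, key_func, reduce_func):
    """Yield the running reduced value per key, one output per input."""
    state = {}  # key -> last reduced element
    for element in stream:
        key = key_func(element)
        if key in state:
            state[key] = reduce_func(state[key], element)
        else:
            state[key] = element
        yield state[key]


web = [("user_1", 1), ("user_2", 1)]
mobile = [("user_1", 1), ("user_1", 1)]

# Count events per user across both streams
running = list(
    keyed_rolling_reduce(
        union(web, mobile),
        key_func=lambda e: e[0],
        reduce_func=lambda a, b: (a[0], a[1] + b[1]),
    )
)
print(running)
```

Each input produces one output, so `user_1` appears three times with counts 1, 2, and 3.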

In [None]:
# Simple Flink job structure (conceptual)
print(
    """[DATA] Flink Job Structure:

from pyflink.datastream import StreamExecutionEnvironment

# 1. Create execution environment
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(2)

# 2. Define source
stream = env.from_collection([
    {'temperature': 20, 'sensor': 'A'},
    {'temperature': 25, 'sensor': 'B'},
    {'temperature': 30, 'sensor': 'A'}
])

# 3. Apply transformations
result = stream \\
    .map(lambda x: {'temp_f': x['temperature'] * 9/5 + 32, 'sensor': x['sensor']}) \\
    .filter(lambda x: x['temp_f'] > 70)

# 4. Define sink (output)
result.print()

# 5. Execute job
env.execute('Temperature Conversion Job')
"""
)

print("[OK] This shows the basic structure of a Flink streaming job")

---

## 3. Flink Transformations

### Map Transformation

**Concept**: Transform each element independently
```
Input:  [1, 2, 3, 4, 5]
Map(x → x * 2)
Output: [2, 4, 6, 8, 10]
```

### Filter Transformation

**Concept**: Select elements matching a condition
```
Input:  [1, 2, 3, 4, 5, 6]
Filter(x → x % 2 == 0)
Output: [2, 4, 6]
```

### FlatMap Transformation

**Concept**: Transform one element into zero or more elements
```
Input:  ['hello world', 'foo bar']
FlatMap(s → s.split(' '))
Output: ['hello', 'world', 'foo', 'bar']
```

### KeyBy Transformation

**Concept**: Partition stream by key for parallel processing
```
Input:  [{user: 'A', val: 1}, {user: 'B', val: 2}, {user: 'A', val: 3}]
KeyBy('user')
Result: 
  Partition 0: [{user: 'A', val: 1}, {user: 'A', val: 3}]
  Partition 1: [{user: 'B', val: 2}]
```
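
How a key ends up on a particular partition can be sketched in a few lines. Flink hashes each key into a key group and then maps key groups onto parallel subtasks. The sketch below follows that two-step idea, with two loudly stated assumptions: `zlib.crc32` stands in for Flink's internal hash function, and a maximum parallelism of 128 is assumed. The function names are hypothetical.

```python
import zlib

MAX_PARALLELISM = 128  # assumed value for this sketch


def key_group(key, max_parallelism=MAX_PARALLELISM):
    """Map a key to a key group. crc32 is a stand-in for Flink's own hash."""
    return zlib.crc32(str(key).encode("utf-8")) % max_parallelism


def subtask_index(key, parallelism, max_parallelism=MAX_PARALLELISM):
    """Map the key's key group to the parallel subtask that owns it."""
    return key_group(key, max_parallelism) * parallelism // max_parallelism


for user in ["A", "B", "A"]:
    print(user, "-> subtask", subtask_index(user, parallelism=2))
```

The property that matters is determinism: the same key always maps to the same subtask, which is what lets per-key state live on a single worker.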

In [None]:
# Demonstrate transformations with Python (Flink-style)
class FlinkStyleProcessor:
    """Simulate Flink transformations with Python"""

    def __init__(self, data):
        self.data = data

    def map(self, func):
        """Apply function to each element"""
        self.data = [func(x) for x in self.data]
        return self

    def filter(self, func):
        """Keep elements matching condition"""
        self.data = [x for x in self.data if func(x)]
        return self

    def flat_map(self, func):
        """Transform one element into multiple"""
        result = []
        for x in self.data:
            result.extend(func(x))
        self.data = result
        return self

    def key_by(self, key_func):
        """Group by key"""
        from collections import defaultdict

        grouped = defaultdict(list)
        for x in self.data:
            key = key_func(x)
            grouped[key].append(x)
        return grouped

    def collect(self):
        """Get results"""
        return self.data


# Example: Transform temperature data
data = [
    {"temperature": 20, "sensor": "A", "unit": "C"},
    {"temperature": 25, "sensor": "B", "unit": "C"},
    {"temperature": 15, "sensor": "A", "unit": "C"},
    {"temperature": 30, "sensor": "C", "unit": "C"},
]

print("[DATA] Original data:")
for d in data:
    print(f"  {d}")

# Apply transformations
result = (
    FlinkStyleProcessor(data)
    .map(lambda x: {**x, "temperature": x["temperature"] * 9 / 5 + 32, "unit": "F"})
    .filter(lambda x: x["temperature"] > 70)
    .collect()
)

print("\n[DATA] After map (C to F) and filter (> 70F):")
for d in result:
    print(f"  {d}")

# KeyBy example
grouped = FlinkStyleProcessor(result).key_by(lambda x: x["sensor"])

print("\n[DATA] After key_by('sensor'):")
for sensor, readings in grouped.items():
    print(f"  Sensor {sensor}: {len(readings)} readings")

print("\n[OK] Transformations complete!")

---

## 4. Kafka Integration

### Kafka Source

**Reading from Kafka:**
```python
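from pyflink.common import WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream.connectors.kafka import KafkaSource, KafkaOffsetsInitializer

# Requires the Flink Kafka connector JAR (e.g. registered with env.add_jars(...))
kafka_source = (
    KafkaSource.builder()
    .set_bootstrap_servers('localhost:9092')
    .set_topics('input-topic')
    .set_group_id('flink-consumer')
    .set_starting_offsets(KafkaOffsetsInitializer.earliest())
    .set_value_only_deserializer(SimpleStringSchema())  # read values as strings
    .build()
)

stream = env.from_source(
    kafka_source,
    WatermarkStrategy.no_watermarks(),
    'Kafka Source'
)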
```

### Kafka Sink

**Writing to Kafka:**
```python
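from pyflink.datastream.connectors.kafka import KafkaSink

# set_record_serializer(...) takes a KafkaRecordSerializationSchema that
# defines the target topic and how record values are serialized
kafka_sink = (
    KafkaSink.builder()
    .set_bootstrap_servers('localhost:9092')
    .set_record_serializer(...)
    .build()
)

stream.sink_to(kafka_sink)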
```

### End-to-End Flow

```
Kafka Topic          Flink Job          Kafka Topic
  (Input)                                (Output)
     │                                       ↑
     │                                       │
     └→ KafkaSource → Transform → KafkaSink ┘
                         ↓
                    map, filter,
                    aggregate, etc.
```

In [None]:
# Simulate Kafka → Flink → Kafka pipeline
class FlinkKafkaPipeline:
    """
    Simulates a Flink job that:
    1. Reads from Kafka
    2. Processes events
    3. Writes to Kafka
    """

    def __init__(self, input_topic, output_topic, group_id="flink-processor"):
        self.input_topic = input_topic
        self.output_topic = output_topic

        # Kafka consumer (source)
        self.consumer = Consumer(
            {
                "bootstrap.servers": "localhost:9092",
                "group.id": group_id,
                "auto.offset.reset": "earliest",
            }
        )

        # Kafka producer (sink)
        self.producer = Producer({"bootstrap.servers": "localhost:9092"})

    def process_event(self, event):
        """
        Transform event (Flink processing logic)
        """
        # Example: Enrich event with processing timestamp
        return {**event, "processed_at": datetime.now().isoformat(), "processor": "flink-job-1"}

    def run(self, duration_seconds=10):
        """Run the processing pipeline"""
        self.consumer.subscribe([self.input_topic])

        processed_count = 0
        start_time = time.time()

        print("[OK] Flink pipeline started")
        print(f"     Input: {self.input_topic}")
        print(f"     Output: {self.output_topic}\n")

        try:
            while time.time() - start_time < duration_seconds:
                msg = self.consumer.poll(timeout=1.0)

                if msg is None:
                    continue

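                if msg.error():
                    # Report consumer errors instead of dropping them silently
                    print(f"[WARNING] Consumer error: {msg.error()}")
                    continue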

                # Read event
                event = json.loads(msg.value().decode("utf-8"))

                # Process (transform)
                processed = self.process_event(event)

                # Write to output topic
                self.producer.produce(
                    topic=self.output_topic, key=event.get("user_id"), value=json.dumps(processed)
                )
                self.producer.poll(0)

                processed_count += 1

                if processed_count <= 5:
                    print(f"[{processed_count}] Processed: {event.get('event_type', 'unknown')}")

        finally:
            self.producer.flush()
            self.consumer.close()

            print(f"\n[SUCCESS] Processed {processed_count} events")


print("[OK] FlinkKafkaPipeline class defined")

In [None]:
# Create topics for Flink pipeline
admin_client = AdminClient({"bootstrap.servers": "localhost:9092"})

topics = [
    NewTopic("flink-input", num_partitions=3, replication_factor=1),
    NewTopic("flink-output", num_partitions=3, replication_factor=1),
]

try:
    futures = admin_client.create_topics(topics)
    for topic, future in futures.items():
        try:
            future.result()
            print(f"[OK] Created topic '{topic}'")
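        except Exception as e:
            if "TOPIC_ALREADY_EXISTS" in str(e):
                print(f"[OK] Topic '{topic}' exists")
            else:
                # Surface any other failure instead of swallowing it
                print(f"[WARNING] Could not create topic '{topic}': {e}")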
except Exception as e:
    print(f"[WARNING] {e}")

In [None]:
# Generate sample events and run pipeline
import threading


def generate_flink_input(num_events=50):
    """Generate events for Flink processing"""
    producer = Producer({"bootstrap.servers": "localhost:9092"})

    event_types = ["click", "view", "purchase", "search"]

    for i in range(num_events):
        event = {
            "event_id": f"evt_{i}",
            "user_id": f"user_{random.randint(1, 10)}",
            "event_type": random.choice(event_types),
            "timestamp": datetime.now().isoformat(),
            "value": random.randint(1, 100),
        }

        producer.produce(topic="flink-input", value=json.dumps(event))
        producer.poll(0)
        time.sleep(0.1)

    producer.flush()
    print(f"\n[OK] Generated {num_events} input events")


# Start event generator in background
generator = threading.Thread(target=generate_flink_input, args=(50,))
generator.start()

time.sleep(1)

# Run Flink pipeline
pipeline = FlinkKafkaPipeline("flink-input", "flink-output")
pipeline.run(duration_seconds=8)

generator.join()

print("\n[SUCCESS] Flink pipeline complete!")

---

## 5. Stateful Processing in Flink

### State Types

**1. ValueState**: Stores a single value
```python
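# Example: Track last value per key (keyed state requires a key_by() upstream)
from pyflink.common import Types
from pyflink.datastream.functions import MapFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor

class MyFunction(MapFunction):
    # PyFlink has no RichMapFunction; MapFunction itself provides open()
    def open(self, runtime_context: RuntimeContext):
        self.state = runtime_context.get_state(
            ValueStateDescriptor('last-value', Types.INT())
        )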
```

**2. ListState**: Stores a list of values
```python
# Example: Track last N values
self.state = context.get_list_state(
    ListStateDescriptor('history', Types.INT())
)
```

**3. MapState**: Stores key-value pairs
```python
# Example: Count by category
self.state = context.get_map_state(
    MapStateDescriptor('counts', Types.STRING(), Types.INT())
)
```
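
The descriptors above only declare state. To show how keyed state behaves at runtime, here is a small pure-Python stand-in (not PyFlink). It mirrors the `value()` / `update()` methods of `ValueState`, but the class `SimulatedValueState`, the function `detect_change`, and the explicit `current_key` attribute are hypothetical. In Flink the current key is set implicitly for each element of a keyed stream.

```python
class SimulatedValueState:
    """Stand-in for ValueState: one value per key, scoped by the current key."""

    def __init__(self):
        self._values = {}        # key -> stored value
        self.current_key = None  # set by the "runtime" before each element

    def value(self):
        return self._values.get(self.current_key)

    def update(self, new_value):
        self._values[self.current_key] = new_value


def detect_change(state, value):
    """User logic: report the delta from the previous value of the same key."""
    last = state.value()
    state.update(value)
    return None if last is None else value - last


state = SimulatedValueState()
readings = [("sensor_A", 20), ("sensor_B", 30), ("sensor_A", 25), ("sensor_B", 28)]

deltas = []
for key, value in readings:
    state.current_key = key  # Flink does this implicitly on a keyed stream
    deltas.append((key, detect_change(state, value)))

print(deltas)
```

The user function never mentions the key, yet `sensor_A` and `sensor_B` each see only their own previous value.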

### Checkpointing

**Purpose**: Save state periodically for fault tolerance
```
Timeline:
  ├── Process events ──┬── Checkpoint 1 (save state)
  ├── Process events ──┬── Checkpoint 2 (save state)
  ├── [FAILURE] ───────┘
  └── Restore from Checkpoint 2 → Resume

Configuration:
env.enable_checkpointing(10000)  # Interval in milliseconds: every 10 seconds
env.get_checkpoint_config().set_min_pause_between_checkpoints(5000)
```

### Exactly-Once Semantics

**How Flink achieves exactly-once:**
```
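1. Checkpoint barriers flow through the pipeline together with the data
2. Each operator snapshots its state when the barrier arrives
3. Kafka source offsets are stored as part of the same checkpoint
4. On failure: restore state and rewind offsets to the last checkpoint

Result: Each event affects state exactly once, even if it is replayed.
End-to-end exactly-once output also needs a transactional or idempotent sink.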
```
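
The key idea is that state and source offsets are snapshotted together, so replaying from the saved offset cannot double-count. The following is a single-process sketch of that idea only; it does not model barriers or distributed snapshots, and the function `run` with its `checkpoint_every` and `fail_at` parameters is hypothetical.

```python
import copy

events = [("user_1", 10), ("user_2", 5), ("user_1", 15), ("user_3", 20), ("user_1", 25)]


def run(events, checkpoint_every=2, fail_at=None):
    """Sum values per key; optionally crash once before the event at fail_at."""
    state, offset = {}, 0
    checkpoint = {"offset": 0, "state": {}}
    failed = False
    while offset < len(events):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            # Recovery: drop uncheckpointed state, rewind source to saved offset
            state = copy.deepcopy(checkpoint["state"])
            offset = checkpoint["offset"]
            continue
        key, value = events[offset]
        state[key] = state.get(key, 0) + value
        offset += 1
        if offset % checkpoint_every == 0:
            # Offset and state are saved together as one consistent snapshot
            checkpoint = {"offset": offset, "state": copy.deepcopy(state)}
    return state


without_failure = run(events)
with_failure = run(events, fail_at=3)
print(without_failure == with_failure, with_failure)
```

In the failing run, the third event is processed twice (once before the crash, once during replay), but its effect appears in the final state only once because the first attempt was rolled back.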

In [None]:
# Simulate stateful processing
class StatefulCounter:
    """Maintain counts per key (like Flink state)"""

    def __init__(self):
        self.state = {}  # key -> count
        self.checkpoints = []  # Saved state snapshots

    def process(self, key, value):
        """Update state for key"""
        if key not in self.state:
            self.state[key] = 0
        self.state[key] += value
        return self.state[key]

    def checkpoint(self):
        """Save current state (checkpoint)"""
        snapshot = self.state.copy()
        self.checkpoints.append(snapshot)
        return len(self.checkpoints) - 1

    def restore(self, checkpoint_id):
        """Restore from checkpoint"""
        if 0 <= checkpoint_id < len(self.checkpoints):
            self.state = self.checkpoints[checkpoint_id].copy()
            return True
        return False

    def get_state(self):
        """Get current state"""
        return self.state.copy()


# Example: Process events with checkpointing
counter = StatefulCounter()

events = [
    ("user_1", 10),
    ("user_2", 5),
    ("user_1", 15),  # Checkpoint here
    ("user_3", 20),
    ("user_1", 25),
]

print("[DATA] Processing events with state:\n")

for i, (key, value) in enumerate(events):
    result = counter.process(key, value)
    print(f"Event {i+1}: {key} + {value} = {result}")

    # Checkpoint after event 3
    if i == 2:
        checkpoint_id = counter.checkpoint()
        print(f"  [CHECKPOINT] Saved state snapshot {checkpoint_id}")

print(f"\n[DATA] Final state: {counter.get_state()}")

# Simulate failure and recovery
print("\n[WARNING] Simulating failure...")
print(f"[OK] Restoring from checkpoint {checkpoint_id}...")
counter.restore(checkpoint_id)
print(f"[DATA] Restored state: {counter.get_state()}")

print("\n[OK] This demonstrates how Flink maintains and recovers state")

---

## 6. Flink Web Dashboard

### Accessing the Dashboard

**URL**: http://localhost:8082 (Flink's default web UI port is 8081; use whichever port your setup exposes)

**Key Screens:**

1. **Overview**: Cluster status, running jobs
2. **Jobs**: All jobs (running, finished, failed)
3. **Task Managers**: Worker nodes, available slots
4. **Job Details**: 
   - Execution graph
   - Metrics (records processed, backpressure)
   - Checkpoints
   - Exceptions

### Key Metrics

| Metric | Description | Good Value |
|--------|-------------|------------|
| Records In | Events received | Steady |
| Records Out | Events produced | Steady |
| Backpressure | Downstream slow | Low/None |
| Checkpoint Duration | Time to checkpoint | < 1 second |
| State Size | Total state | Manageable |

### Troubleshooting

**Problem: High Backpressure**
```
Cause: Downstream operator is slow
Solution: 
  - Increase parallelism
  - Optimize processing logic
  - Add more TaskManagers
```

**Problem: Checkpoint Failures**
```
Cause: State too large or timeout
Solution:
  - Increase checkpoint timeout
  - Reduce state size
  - Use RocksDB state backend
```
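
To see why increasing parallelism relieves backpressure, consider a deliberately simple discrete-time model, not a measurement of Flink itself. An upstream operator emits into a bounded buffer, and a downstream operator drains it at a fixed rate per subtask. The function `simulate` and all rates and capacities are made-up numbers for illustration.

```python
def simulate(ticks, produce_rate, per_subtask_rate, parallelism, buffer_capacity):
    """Return the fraction of ticks in which the upstream operator was blocked."""
    buffered = 0
    blocked_ticks = 0
    for _ in range(ticks):
        # Downstream drains first, limited by its total processing capacity
        buffered -= min(buffered, per_subtask_rate * parallelism)
        # Upstream emits only as much as fits into the free buffer space
        free = buffer_capacity - buffered
        emitted = min(produce_rate, free)
        buffered += emitted
        if emitted < produce_rate:
            blocked_ticks += 1  # upstream had to wait: backpressure
    return blocked_ticks / ticks


slow = simulate(100, produce_rate=100, per_subtask_rate=40, parallelism=2, buffer_capacity=200)
scaled = simulate(100, produce_rate=100, per_subtask_rate=40, parallelism=3, buffer_capacity=200)
print(f"parallelism=2: blocked {slow:.0%} of ticks")
print(f"parallelism=3: blocked {scaled:.0%} of ticks")
```

With two subtasks the downstream capacity (80 per tick) is below the input rate (100 per tick), so the buffer fills and the upstream operator is blocked most of the time. A third subtask raises capacity above the input rate and the blocking disappears.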

In [None]:
# Instructions for monitoring Flink
print(
    """[DATA] Monitoring Your Flink Job:

1. Access Flink Dashboard:
   URL: http://localhost:8082

2. Check Running Jobs:
   - Go to "Running Jobs" tab
   - Click on your job name
   - View execution graph

3. Monitor Metrics:
   - Records In/Out: Event throughput
   - Backpressure: Green = good, Red = problem
   - Checkpoint: Should complete successfully

4. View Task Managers:
   - Check available slots
   - Monitor memory usage
   - View task distribution

5. Debug Issues:
   - Check "Exceptions" tab
   - View task logs
   - Analyze checkpoint history
"""
)

print("[OK] Open http://localhost:8082 in your browser to explore!")

---

## 7. Mini-Project: Word Count Stream

Build a classic streaming word count application!

In [None]:
# Word Count Stream Processor
from collections import defaultdict
from datetime import datetime, timedelta  # timedelta is used to space out timestamps below


class WordCountStream:
    """Streaming word count with tumbling windows"""

    def __init__(self, window_size_seconds=10):
        self.window_size = window_size_seconds
        self.windows = {}  # window_id -> word counts

    def get_window_id(self, timestamp):
        """Determine window for event"""
        event_time = datetime.fromisoformat(timestamp)
        epoch = int(event_time.timestamp())
        return (epoch // self.window_size) * self.window_size

    def process_sentence(self, sentence, timestamp):
        """Process a sentence: split into words and count"""
        window_id = self.get_window_id(timestamp)

        if window_id not in self.windows:
            self.windows[window_id] = defaultdict(int)

        # FlatMap: split into words
        words = sentence.lower().split()

        # Map: count each word
        for word in words:
            # Filter: only alphanumeric
            word = "".join(c for c in word if c.isalnum())
            if word:
                self.windows[window_id][word] += 1

    def get_top_words(self, window_id, n=10):
        """Get top N words in a window"""
        if window_id not in self.windows:
            return []

        word_counts = self.windows[window_id]
        return sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:n]

    def print_results(self):
        """Print word count results"""
        print("\n[DATA] Word Count Results:\n")

        for window_id in sorted(self.windows.keys()):
            window_time = datetime.fromtimestamp(window_id)
            top_words = self.get_top_words(window_id, n=5)

            print(f"Window {window_time.strftime('%H:%M:%S')}:")
            for word, count in top_words:
                print(f"  {word:15s}: {count}")
            print()


# Generate sample text stream
sample_sentences = [
    "Apache Flink is a stream processing framework",
    "Flink provides stateful computations over data streams",
    "Stream processing enables real-time analytics",
    "Kafka and Flink work great together",
    "Event-driven architectures use stream processing",
    "Flink supports exactly-once semantics",
    "Real-time data processing is critical",
    "Apache Kafka is a distributed streaming platform",
    "Flink can process millions of events per second",
    "Stream processing frameworks are powerful tools",
]

# Process sentences
word_counter = WordCountStream(window_size_seconds=5)

print("[OK] Processing text stream...\n")

base_time = datetime.now()
for i, sentence in enumerate(sample_sentences):
    timestamp = (base_time + timedelta(seconds=i)).isoformat()
    word_counter.process_sentence(sentence, timestamp)
    print(f"[{i+1}] Processed: {sentence[:50]}...")
    time.sleep(0.2)

# Show results
word_counter.print_results()

print("[SUCCESS] Word count stream processing complete!")

---

## 8. Key Takeaways

[OK] **Flink Architecture**: JobManager coordinates, TaskManagers execute

[OK] **DataStream API**: Fluent API for building streaming pipelines

[OK] **Transformations**: `map`, `filter`, `flat_map`, `key_by`, `reduce`, `window`

[OK] **Kafka Integration**: Read from and write to Kafka seamlessly

[OK] **Stateful Processing**: Manage state with checkpoints for fault tolerance

[OK] **Exactly-Once**: With checkpointing enabled, each event affects state exactly once, even after a failure and recovery

### Common Patterns

**1. ETL Pipeline:**
```
Kafka → Flink (Extract, Transform) → Kafka/Database
```

**2. Real-Time Analytics:**
```
Events → Flink (Aggregate in windows) → Metrics Dashboard
```

**3. Event Enrichment:**
```
Stream → Flink (Join with reference data) → Enriched Stream
```

### Best Practices

1. **Use keyed streams** for stateful operations
2. **Enable checkpointing** for production jobs
3. **Monitor backpressure** to detect bottlenecks
4. **Choose parallelism** based on workload
5. **Test with small data** before scaling up
6. **Use event time** for correctness
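
The last practice is easiest to appreciate with numbers. The sketch below assigns the same five events to 10-second tumbling windows twice: once by the time the event happened, and once by the time it arrived. The timestamps are invented, and the sketch does not model watermarks or late-data handling.

```python
from collections import Counter

WINDOW = 10  # tumbling window size in seconds

# (event_time, arrival_time) in seconds; the third event is delayed in transit
events = [(1, 2), (4, 5), (8, 12), (11, 13), (15, 16)]


def window_start(ts, size=WINDOW):
    """Return the start of the tumbling window that contains ts."""
    return (ts // size) * size


by_event_time = Counter(window_start(event_ts) for event_ts, _ in events)
by_arrival_time = Counter(window_start(arrival_ts) for _, arrival_ts in events)

print("event time:  ", dict(by_event_time))
print("arrival time:", dict(by_arrival_time))
```

Grouping by arrival time moves the delayed event into the wrong window, so the counts depend on network delay rather than on what actually happened.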

---

## 9. Practice Exercises

1. **Build a filter pipeline**: Read from Kafka, filter events, write to new topic
2. **Implement aggregation**: Count events per user in 1-minute windows
3. **Create enrichment job**: Add timestamp and metadata to events
4. **Monitor via dashboard**: Deploy a job and observe metrics
5. **Test fault tolerance**: Simulate failure and verify recovery

In [None]:
# Your practice code here

---

## 10. Next Steps

Congratulations on completing Module 04!

### What You've Learned

- [OK] Apache Flink architecture and components
- [OK] PyFlink DataStream API
- [OK] Stream transformations and operators
- [OK] Kafka source and sink connectors
- [OK] Stateful processing and checkpointing

### Coming Up in Module 05: Advanced Stream Processing

You'll learn:
- Advanced windowing (session, sliding)
- Watermarks and late data handling
- Stream joins (window joins, interval joins)
- Custom functions and operators
- Performance optimization

### Resources

- [Flink Documentation](https://nightlies.apache.org/flink/flink-docs-master/)
- [PyFlink Tutorial](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/python/overview/)
- [Flink Kafka Connector](https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/kafka/)
- [Flink Best Practices](https://flink.apache.org/news/2020/07/28/flink-best-practices.html)

---

**Ready for advanced topics?** Open `05_advanced_stream_processing.ipynb` to continue!