# **Chapter 9: Data-Intensive Systems**

Modern applications generate petabytes of data daily—user interactions, sensor readings, logs, and transactions. Processing this data efficiently requires specialized architectures that go beyond traditional request-response models. This chapter explores batch processing, stream processing, data pipelines, and the architectural patterns that enable organizations to derive value from massive datasets.

---

## **9.1 Introduction to Data Processing Paradigms**

Data processing systems are categorized by latency requirements and data volume:

```
┌─────────────────────────────────────────────────────────────────────┐
│                    Data Processing Spectrum                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Batch Processing              Micro-Batching           Stream       │
│  (Hours/Days)                  (Minutes/Seconds)        Processing   │
│  (Terabytes/Petabytes)                                  (Milliseconds)│
│                                                                      │
│  ┌──────────────┐             ┌──────────────┐       ┌────────────┐ │
│  │  Historical  │             │  Near Real-  │       │  Real-Time │ │
│  │  Analytics   │             │  Time        │       │  Analytics │ │
│  │              │             │              │       │            │ │
│  │ • Monthly    │             │ • 5-minute   │       │ • Fraud    │ │
│  │   reports    │             │   aggregates │       │   detection│ │
│  │ • Training   │             │ • Lambda     │       │ • IoT      │ │
│  │   ML models  │             │   arch       │       │   alerts   │ │
│  │ • Data       │             │              │       │ • Live     │ │
│  │   warehousing│             │              │       │   dashboards│ │
│  └──────────────┘             └──────────────┘       └────────────┘ │
│         │                            │                      │       │
│         ▼                            ▼                      ▼       │
│    Hadoop/Spark                 Spark Streaming         Flink/      │
│    MapReduce                    Structured Streaming    Kafka       │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

**Key Trade-offs**:
- **Throughput vs. Latency**: Batch systems optimize for throughput (data volume), stream systems optimize for latency (speed)
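- **Accuracy vs. Speed**: Batch computes over a complete, bounded dataset, so its results are final; streaming emits results before all data has arrived, so they may be provisional and need correction when late events show up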
- **Resource Efficiency**: Batch uses resources intensively then releases; streaming requires persistent resource allocation

---

## **9.2 Batch Processing**

Batch processing handles large volumes of data collected over time. It's optimized for throughput rather than latency.

### **The MapReduce Paradigm**

**Concept**: Process data in two phases—Map (transform/filter) and Reduce (aggregate). Inspired by functional programming.

**How It Works**:
```
Input Data (Distributed across nodes)
    │
    ▼
┌─────────────────────────────────────────────┐
│              Map Phase                       │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐       │
│  │ Node 1  │ │ Node 2  │ │ Node 3  │       │
│  │         │ │         │ │         │       │
│  │ Input:  │ │ Input:  │ │ Input:  │       │
│  │ "hello  │ │ "world  │ │ "hello  │       │
│  │ world"  │ │ hello"  │ │ hello"  │       │
│  │         │ │         │ │         │       │
│  │ Output: │ │ Output: │ │ Output: │       │
│  │ (hello,1)│ │ (world,1)│ │ (hello,1)│    │
│  │ (world,1)│ │ (hello,1)│ │ (hello,1)│    │
│  └────┬────┘ └────┬────┘ └────┬────┘       │
│       │           │           │             │
│       └───────────┼───────────┘             │
│                   │                         │
│                   ▼                         │
│           Shuffle/Sort                      │
│    (Group by key across all nodes)          │
│                   │                         │
│       ┌───────────┼───────────┐             │
│       ▼           ▼           ▼             │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐       │
│  │ (hello, │ │ (world, │ │         │       │
│  │  [1,1,  │ │  [1,1]) │ │         │       │
│  │   1,1]) │ │         │ │         │       │
│  └────┬────┘ └────┬────┘ └─────────┘       │
│       │           │                         │
└───────┼───────────┼─────────────────────────┘
        │           │
        ▼           ▼
┌─────────────────────────────────────────────┐
│             Reduce Phase                     │
│  ┌─────────┐ ┌─────────┐                   │
│  │ Node 1  │ │ Node 2  │                   │
│  │         │ │         │                   │
│  │ Sum:    │ │ Sum:    │                   │
│  │ 1+1+1+1 │ │ 1+1     │                   │
│  │ = 4     │ │ = 2     │                   │
│  │         │ │         │                   │
│  │ Output: │ │ Output: │                   │
│  │ (hello,4)│ │ (world,2)│                  │
│  └─────────┘ └─────────┘                   │
└─────────────────────────────────────────────┘

Final Output:
hello: 4
world: 2
```
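The three phases in the diagram can be sketched in plain Python on a single machine. This is only an illustration of the data flow, not how Hadoop distributes the work; the function names `map_phase`, `shuffle`, and `reduce_phase` are made up for this example.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every token in one input split."""
    return [(word, 1) for word in document.split()]

def shuffle(mapped_pairs):
    """Shuffle/sort: group all values by key across every mapper's output."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reduce: collapse the list of counts for one key into a single total."""
    return key, sum(values)

# The three input splits from the diagram, one per node.
splits = ["hello world", "world hello", "hello hello"]

mapped = chain.from_iterable(map_phase(s) for s in splits)
grouped = shuffle(mapped)
word_counts = dict(reduce_phase(k, v) for k, v in grouped.items())

print(word_counts)  # {'hello': 4, 'world': 2}
```

In a real cluster each call to `map_phase` would run on the node holding that split, and the shuffle would move data over the network so that all values for one key reach the same reducer.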

**Implementation** (Hadoop MapReduce - Java):
```java
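// Word Count Example
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The job driver (main method that configures and submits the Job) is omitted here
public class WordCount {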
    
    // Mapper Class
    public static class TokenizerMapper 
        extends Mapper<Object, Text, Text, IntWritable> {
        
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        
        public void map(Object key, Text value, Context context) 
            throws IOException, InterruptedException {
            
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);  // Emit (word, 1)
            }
        }
    }
    
    // Reducer Class
    public static class IntSumReducer 
        extends Reducer<Text, IntWritable, Text, IntWritable> {
        
        private IntWritable result = new IntWritable();
        
        public void reduce(Text key, Iterable<IntWritable> values, Context context) 
            throws IOException, InterruptedException {
            
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);  // Emit (word, sum)
        }
    }
}
```

**Limitations of MapReduce**:
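- **High Latency**: Each job writes intermediate map output to local disk and its final output to the distributed file system, so multi-job pipelines and iterative algorithms pay for repeated disk I/O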
- **Complexity**: Simple operations require verbose Java code
- **Not Real-Time**: Designed for batch, not streaming

---

### **Apache Spark: In-Memory Batch Processing**

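**Concept**: Keeps intermediate data in memory between transformations instead of writing it to disk after each stage. The Spark project cites speedups of up to 100x over MapReduce for in-memory iterative workloads; actual gains depend on the job and on how much of the data fits in memory.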

**Architecture**:
```
Driver Program
    │
    ▼
SparkContext
    │
    ├───► Cluster Manager (YARN, Mesos, Kubernetes)
    │         │
    │         ▼
    │    ┌─────────────────────────────────────┐
    │    │         Worker Nodes                │
    │    │  ┌─────────┐ ┌─────────┐          │
    │    │  │Executor │ │Executor │          │
    │    │  │ ┌─────┐ │ │ ┌─────┐ │          │
    │    │  │ │Task │ │ │ │Task │ │          │
    │    │  │ └─────┘ │ │ └─────┘ │          │
    │    │  │ ┌─────┐ │ │ ┌─────┐ │          │
    │    │  │ │Task │ │ │ │Task │ │          │
    │    │  │ └─────┘ │ │ └─────┘ │          │
    │    │  └─────────┘ └─────────┘          │
    │    └─────────────────────────────────────┘
    │
    ▼
RDD/DataFrame (Resilient Distributed Dataset)
```

**Implementation** (PySpark):
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, avg, window

# Initialize Spark
spark = SparkSession.builder \
    .appName("DataProcessing") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()

# Read data (from S3, HDFS, or local)
df = spark.read.parquet("s3://data-lake/events/")

# Transformations (lazy evaluation - nothing executed yet)
processed_df = df \
    .filter(col("event_type") == "purchase") \
    .groupBy("user_id") \
    .agg(
        count("*").alias("purchase_count"),
        avg("amount").alias("avg_amount")
    ) \
    .filter(col("purchase_count") > 5)

# Action (triggers computation); collect() pulls every row to the driver,
# so use it only when the result is known to be small
results = processed_df.collect()

# Write back to data lake
processed_df.write \
    .mode("overwrite") \
    .parquet("s3://data-lake/processed/high_value_users/")

# Stop Spark
spark.stop()
```

**Spark Optimization Techniques**:
```python
# 1. Partitioning (co-locate rows that share a key; returns a new DataFrame)
df = df.repartition("user_id")  # Hash partition by key; skewed keys still need salting (see 5)

# 2. Broadcast Join (for small tables)
from pyspark.sql.functions import broadcast
large_df.join(broadcast(small_df), "key")

# 3. Caching (reuse intermediate results)
df.cache()  # Keep in memory
df.count()  # Materialize
df.filter(...).show()  # Uses cached version

# 4. Predicate Pushdown (filter at source)
spark.read \
    .option("basePath", "s3://data-lake/") \
    .parquet("s3://data-lake/events/") \
    .filter(col("date") == "2024-01-15")  # Pushed to Parquet reader

# 5. Salting (handle skewed keys)
from pyspark.sql.functions import rand, lit, concat

# Add a random salt (0-9) so one hot key spreads across up to 10 partitions
salted_df = skewed_df.withColumn(
    "salted_key", 
    concat(col("skewed_key"), lit("_"), (rand() * 10).cast("int").cast("string"))
)
```
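Salting only pays off when it is paired with a two-stage aggregation: aggregate per salted key first, then strip the salt and combine. The sketch below shows that idea without Spark. It assumes a CRC32-based stand-in for the hash partitioner and a round-robin salt (Spark would use `rand()`) so the example stays deterministic; `partition_of` and the sample keys are hypothetical.

```python
import zlib
from collections import Counter, defaultdict

NUM_PARTITIONS = 4
NUM_SALTS = 10  # same fan-out as rand() * 10 in the Spark snippet

def partition_of(key: str) -> int:
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

# 1,000 rows for one hot key plus a handful for a normal key.
rows = [("hot_user", 1)] * 1000 + [("quiet_user", 1)] * 10

# Without salting: every hot_user row lands in the same partition.
unsalted_load = Counter(partition_of(key) for key, _ in rows)

# With salting: append a salt so the hot key maps to several partitions.
salted_rows = [(f"{key}_{i % NUM_SALTS}", value)
               for i, (key, value) in enumerate(rows)]
salted_load = Counter(partition_of(key) for key, _ in salted_rows)

# Stage 1: aggregate per salted key (this is the work that runs in parallel).
partial = defaultdict(int)
for salted_key, value in salted_rows:
    partial[salted_key] += value

# Stage 2: strip the salt and combine the much smaller partial results.
final = defaultdict(int)
for salted_key, subtotal in partial.items():
    original_key = salted_key.rsplit("_", 1)[0]
    final[original_key] += subtotal

print("largest partition before/after salting:",
      max(unsalted_load.values()), max(salted_load.values()))
print(dict(final))
```

The final totals match the unsalted aggregation; what changes is that no single partition has to process all rows for the hot key.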

---

## **9.3 Stream Processing**

Stream processing analyzes data in motion—processing events as they arrive rather than waiting for batches to accumulate.

### **Windowing Strategies**

Since streams are unbounded, we process data in windows:

**1. Tumbling Windows** (Fixed, non-overlapping):
```
Time: 0----10----20----30----40----50----60
      [-W1-)[-W2-)[-W3-)

Window size: 10 seconds
Events in Window 1: 0 <= t < 10 seconds
Events in Window 2: 10 <= t < 20 seconds
No overlap between windows; each event belongs to exactly one window
```
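Assigning an event to its tumbling window is a single integer division. A minimal sketch, assuming integer timestamps in seconds and windows aligned to zero (the helper name `tumbling_window` is made up for this example):

```python
from collections import defaultdict

def tumbling_window(timestamp: int, size: int) -> tuple:
    """Return the half-open [start, end) window that contains the timestamp."""
    start = (timestamp // size) * size
    return (start, start + size)

# (timestamp_seconds, payload) pairs
events = [(1, "a"), (9, "b"), (10, "c"), (14, "d"), (27, "e")]

windows = defaultdict(list)
for ts, payload in events:
    windows[tumbling_window(ts, size=10)].append(payload)

for (start, end), payloads in sorted(windows.items()):
    print(f"[{start}, {end}): {payloads}")
# [0, 10): ['a', 'b']
# [10, 20): ['c', 'd']
# [20, 30): ['e']
```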

**2. Sliding Windows** (Fixed size, overlapping):
```
Time: 0----5----10----15----20----25----30
      [----Window 1----]
           [----Window 2----]
                [----Window 3----]

Window size: 10 seconds
Slide interval: 5 seconds
Overlap: 5 seconds
```
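With sliding windows an event can belong to several windows at once (size / slide of them in steady state). The sketch below enumerates them, assuming integer timestamps and windows whose starts are non-negative multiples of the slide; `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(timestamp: int, size: int, slide: int) -> list:
    """Return every [start, end) window of this size and slide containing timestamp."""
    # Latest window start that is <= timestamp.
    last_start = (timestamp // slide) * slide
    # Walk backwards while the window still covers the timestamp (start > timestamp - size).
    starts = range(last_start, timestamp - size, -slide)
    return [(s, s + size) for s in sorted(starts) if s >= 0]

print(sliding_windows(12, size=10, slide=5))  # [(5, 15), (10, 20)]
print(sliding_windows(3, size=10, slide=5))   # [(0, 10)]
```

An event at t=12 falls into both the 5-15 and 10-20 windows, which is why sliding-window aggregations cost more state and computation than tumbling ones.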

**3. Session Windows** (Dynamic, activity-based):
```
User Activity (session gap timeout: 10s):
Event 1: t=0
Event 2: t=5 (gap < 10s, same session)
Event 3: t=25 (gap > 10s, new session)
Event 4: t=28

Session 1: [0, 5] (gap of 20s closes session)
Session 2: [25, 28]
```
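Session assignment for a single user can be sketched as a pass over sorted timestamps that starts a new session whenever the inactivity gap is exceeded. This assumes all events are already available and in memory; a streaming engine has to do the same thing incrementally and merge sessions when late events arrive. The name `sessionize` is illustrative.

```python
def sessionize(timestamps, gap):
    """Group event timestamps into sessions split by more than `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the gap: extend current session
        else:
            sessions.append([ts])     # gap exceeded (or first event): new session
    return [(s[0], s[-1]) for s in sessions]

print(sessionize([0, 5, 25, 28], gap=10))  # [(0, 5), (25, 28)]
```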

**4. Global Windows** (Single window, triggered by conditions):
```
All events in one window
Trigger: Every 100 events OR every 1 minute
```

---

### **Apache Flink: True Stream Processing**

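**Concept**: Processes events one at a time rather than in micro-batches, which keeps latency low. Checkpointing gives exactly-once guarantees for Flink's internal state; end-to-end exactly-once additionally requires replayable sources and transactional or idempotent sinks.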

**Architecture**:
```
Data Sources (Kafka, Kinesis, Pulsar)
    │
    ▼
┌─────────────────────────────────────────────┐
│              Job Manager                     │
│  (Coordinates distributed execution)         │
└──────────────┬──────────────────────────────┘
               │
       ┌───────┼───────┐
       ▼       ▼       ▼
┌─────────────────────────────────────────────┐
│            Task Managers                     │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐       │
│  │ Slot 1  │ │ Slot 2  │ │ Slot 3  │       │
│  │ Map     │ │ KeyBy   │ │ Window  │       │
│  └─────────┘ └─────────┘ └─────────┘       │
│  ┌─────────┐ ┌─────────┐                   │
│  │ Sink    │ │ Sink    │                   │
│  │ (Kafka) │ │ (DB)    │                   │
│  └─────────┘ └─────────┘                   │
└─────────────────────────────────────────────┘

Checkpointing: Periodic snapshots of state for fault tolerance
```
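The reason checkpointing yields exactly-once state is that operator state and the source read position are snapshotted together, so a restore rewinds both. The toy below illustrates that idea only; it is not Flink's distributed snapshot algorithm, and `CountingOperator` and `run` are hypothetical names.

```python
import copy
from collections import Counter

class CountingOperator:
    """Toy keyed counter whose state and input offset are snapshotted together."""

    def __init__(self):
        self.counts = Counter()
        self.offset = 0  # index of the next event to read from the source

    def process(self, event):
        self.counts[event] += 1
        self.offset += 1

    def snapshot(self):
        return {"counts": copy.deepcopy(self.counts), "offset": self.offset}

    def restore(self, checkpoint):
        self.counts = copy.deepcopy(checkpoint["counts"])
        self.offset = checkpoint["offset"]

def run(source, checkpoint_every, crash_at=None):
    """Process a replayable source, optionally simulating one crash."""
    op = CountingOperator()
    checkpoint = op.snapshot()
    crashed = False
    while op.offset < len(source):
        if crash_at is not None and op.offset == crash_at and not crashed:
            crashed = True
            op.restore(checkpoint)  # roll back state AND rewind the source
            continue
        op.process(source[op.offset])
        if op.offset % checkpoint_every == 0:
            checkpoint = op.snapshot()
    return op.counts

source = ["a", "b", "a", "c", "a", "b", "c", "a"]
with_crash = run(source, checkpoint_every=3, crash_at=5)
print(with_crash == Counter(source))  # True
```

Events between the last checkpoint and the crash are processed twice, but because the state was rolled back first, each one is reflected in the final counts exactly once.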

**Implementation** (Flink Python - PyFlink):
```python
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode
from pyflink.table import StreamTableEnvironment, EnvironmentSettings
from pyflink.table.window import Tumble

# Initialize environment
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(4)

# Configure checkpointing (exactly-once semantics)
env.enable_checkpointing(5000)  # 5 seconds
env.get_checkpoint_config().set_checkpointing_mode(
    CheckpointingMode.EXACTLY_ONCE
)

# Create table environment
settings = EnvironmentSettings.new_instance() \
    .in_streaming_mode() \
    .build()
t_env = StreamTableEnvironment.create(env, settings)

# Define source (Kafka)
t_env.execute_sql("""
    CREATE TABLE user_events (
        user_id STRING,
        event_type STRING,
        amount DOUBLE,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'user-events',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# Define sink
t_env.execute_sql("""
    CREATE TABLE fraud_alerts (
        user_id STRING,
        transaction_count BIGINT,
        total_amount DOUBLE,
        window_start TIMESTAMP(3),
        window_end TIMESTAMP(3)
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://db:5432/alerts',
        'table-name' = 'fraud_alerts',
        'username' = 'user',
        'password' = 'pass'
    )
""")

# Process: Detect high-frequency transactions (potential fraud)
result = t_env.sql_query("""
    SELECT 
        user_id,
        COUNT(*) as transaction_count,
        SUM(amount) as total_amount,
        TUMBLE_START(event_time, INTERVAL '1' MINUTE) as window_start,
        TUMBLE_END(event_time, INTERVAL '1' MINUTE) as window_end
    FROM user_events
    WHERE event_type = 'purchase'
    GROUP BY 
        user_id,
        TUMBLE(event_time, INTERVAL '1' MINUTE)
    HAVING COUNT(*) > 10  -- More than 10 transactions per minute
""")

# Write results
result.execute_insert("fraud_alerts")
```
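The `WATERMARK` clause in the source table tells Flink how long to wait for out-of-order events before a window may fire. The following plain-Python sketch mimics that behavior under simplifying assumptions: integer timestamps in seconds, one watermark update per event, and late events dropped. It is an illustration of the concept, not Flink's implementation; `run_windows` is a made-up name.

```python
from collections import defaultdict

WINDOW = 60     # seconds, mirrors TUMBLE(..., INTERVAL '1' MINUTE)
MAX_DELAY = 5   # seconds, mirrors the WATERMARK clause

def run_windows(events):
    """events: iterable of (event_time_seconds, user_id) in arrival order."""
    open_windows = defaultdict(lambda: defaultdict(int))  # window_start -> user -> count
    emitted, dropped = [], []
    watermark = float("-inf")
    for event_time, user in events:
        window_start = (event_time // WINDOW) * WINDOW
        if window_start + WINDOW <= watermark:
            dropped.append((event_time, user))  # its window already fired: late event
            continue
        open_windows[window_start][user] += 1
        # The watermark trails the largest event time seen so far.
        watermark = max(watermark, event_time - MAX_DELAY)
        # Fire every window whose end is at or before the watermark.
        for start in sorted(w for w in open_windows if w + WINDOW <= watermark):
            emitted.append((start, dict(open_windows.pop(start))))
    return emitted, dropped

# Arrival order differs from event-time order: 57 and 59 arrive after 62.
arrivals = [(10, "u1"), (58, "u1"), (62, "u2"), (57, "u1"), (66, "u2"), (59, "u1")]
emitted, dropped = run_windows(arrivals)
print(emitted)  # [(0, {'u1': 3})]
print(dropped)  # [(59, 'u1')]
```

The event at t=57 is out of order but still counted because the watermark had only reached 57; the event at t=59 arrives after the watermark passed 60 and is discarded. A larger delay tolerates more disorder at the cost of later results.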

**Flink State Management**:
```python
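from pyflink.common.typeinfo import Types
from pyflink.datastream.functions import KeyedProcessFunction
from pyflink.datastream.state import ValueStateDescriptor

# Keyed State (maintains state per key)
class CountFunction(KeyedProcessFunction):
    def __init__(self):
        self.state = None  # Initialized in open(), once the runtime context exists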
    
    def open(self, runtime_context):
        state_descriptor = ValueStateDescriptor("count", Types.LONG())
        self.state = runtime_context.get_state(state_descriptor)
    
    def process_element(self, value, ctx):
        current = self.state.value() or 0
        current += 1
        self.state.update(current)
        
        if current > 100:
            yield value  # Alert threshold exceeded

# Use in pipeline (stream is an existing DataStream whose elements expose user_id)
stream.key_by(lambda x: x.user_id) \
      .process(CountFunction())
```

---

### **Kafka Streams: Embedded Stream Processing**

**Concept**: Library for building stream processing applications on top of Kafka. Runs inside your application (no separate cluster needed).

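**Implementation**: Kafka Streams itself is a Java library with no official Python client. The sketch below imitates its consume-process-produce pattern using the `kafka-python` client and an in-memory state store; unlike Kafka Streams, this state is not fault-tolerant and is lost on restart.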
```python
from kafka import KafkaConsumer, KafkaProducer
import json
from collections import defaultdict
import threading
import time

class KafkaStreamsApp:
    def __init__(self):
        self.consumer = KafkaConsumer(
            'user-events',
            bootstrap_servers=['kafka:9092'],
            value_deserializer=lambda m: json.loads(m.decode('utf-8')),
            group_id='stream-processor',
            auto_offset_reset='latest'
        )
        
        self.producer = KafkaProducer(
            bootstrap_servers=['kafka:9092'],
            value_serializer=lambda v: json.dumps(v).encode('utf-8')
        )
        
        # Local state store (in-memory)
        self.user_activity = defaultdict(lambda: {
            'count': 0,
            'last_seen': None,
            'window_start': time.time()
        })
    
    def process(self):
        """Process stream with tumbling windows"""
        window_size = 60  # 60 second windows
        
        for message in self.consumer:
            event = message.value
            user_id = event['user_id']
            current_time = time.time()
            
            user_state = self.user_activity[user_id]
            
            # Check if window expired
            if current_time - user_state['window_start'] > window_size:
                # Emit windowed result
                self.emit_result(user_id, user_state)
                
                # Reset window
                user_state['count'] = 0
                user_state['window_start'] = current_time
            
            # Update state
            user_state['count'] += 1
            user_state['last_seen'] = current_time
            
            # Check for anomaly (real-time); alert once per window, when the threshold is first crossed
            if user_state['count'] == 101:
                self.send_alert(user_id, user_state)
    
    def emit_result(self, user_id, state):
        """Send aggregated metrics to output topic"""
        result = {
            'user_id': user_id,
            'event_count': state['count'],
            'window_duration': 60,
            'timestamp': time.time()
        }
        self.producer.send('user-metrics', result)
    
    def send_alert(self, user_id, state):
        """Send real-time alert"""
        alert = {
            'user_id': user_id,
            'alert_type': 'high_frequency',
            'count': state['count'],
            'timestamp': time.time()
        }
        self.producer.send('alerts', alert)

# Run
app = KafkaStreamsApp()
app.process()
```

---

## **9.4 Architecture Patterns: Lambda vs. Kappa**

### **Lambda Architecture**

**Concept**: Maintain two processing paths—batch layer (accuracy) and speed layer (latency), merged at serving layer.

```
┌─────────────────────────────────────────────────────────────────────┐
│                        Lambda Architecture                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Raw Data                                                            │
│     │                                                                │
│     ├──────────────────────────────────────┐                         │
│     │                                      │                         │
│     ▼                                      ▼                         │
│  ┌──────────────┐                   ┌──────────────┐                │
│  │ Batch Layer  │                   │ Speed Layer  │                │
│  │              │                   │              │                │
│  │ • Hadoop     │                   │ • Storm      │                │
│  │ • Spark      │                   │ • Flink      │                │
│  │ • MapReduce  │                   │ • Kafka      │                │
│  │              │                   │   Streams    │                │
│  │              │                   │              │                │
│  │ Process ALL  │                   │ Process      │                │
│  │ data         │                   │ RECENT data  │                │
│  │ (High        │                   │ (Low latency)│                │
│  │  latency OK) │                   │              │                │
│  └──────┬───────┘                   └──────┬───────┘                │
│         │                                  │                         │
│         ▼                                  ▼                         │
│  ┌──────────────┐                   ┌──────────────┐                │
│  │ Batch Views  │                   │ Real-time    │                │
│  │ (Accurate)   │                   │ Views        │                │
│  │              │                   │ (Approximate)│                │
│  └──────┬───────┘                   └──────┬───────┘                │
│         │                                  │                         │
│         └──────────┬───────────────────────┘                         │
│                    │                                                 │
│                    ▼                                                 │
│           ┌──────────────┐                                          │
│           │ Serving Layer│  <-- Query interface                     │
│           │              │      (Merge batch + real-time)           │
│           │ • Presto     │                                          │
│           │ • Druid      │                                          │
│           │ • Pinot      │                                          │
│           └──────────────┘                                          │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Complexity: HIGH (maintain two codebases for same logic)
Use case: When exact accuracy is required AND low latency needed
```

**Challenges**:
- **Code Duplication**: Same business logic in batch and streaming code
- **Reconciliation**: Merging batch and speed results is complex
- **Operational Complexity**: Two separate systems to maintain
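The reconciliation step can be pictured as a query-time merge: the batch view holds exact totals up to the last batch cutoff, and the real-time view holds increments for everything after it. A minimal sketch, assuming additive metrics and hypothetical per-user purchase counts:

```python
def merge_views(batch_view, realtime_view):
    """Serving-layer query: batch totals plus increments seen since the last batch run."""
    merged = dict(batch_view)
    for key, delta in realtime_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

# Batch view: exact purchase counts up to the last batch cutoff (made-up data).
batch_view = {"user_1": 120, "user_2": 45}
# Real-time view: counts for events that arrived after that cutoff.
realtime_view = {"user_2": 3, "user_3": 1}

print(merge_views(batch_view, realtime_view))
# {'user_1': 120, 'user_2': 48, 'user_3': 1}
```

The hard part in practice is the cutoff: once a new batch run completes, the matching portion of the real-time view has to be discarded, or events near the boundary are counted twice. Non-additive metrics such as distinct counts need a different merge strategy.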

---

### **Kappa Architecture**

**Concept**: Single processing path using stream processing for everything. Reprocess historical data by replaying from log.

```
┌─────────────────────────────────────────────────────────────────────┐
│                        Kappa Architecture                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Raw Data                                                            │
│     │                                                                │
│     ▼                                                                │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                     Event Log (Kafka)                        │    │
│  │  (Immutable, append-only, replayable)                        │    │
│  │                                                              │    │
│  │  Offset 0: Event 1                                           │    │
│  │  Offset 1: Event 2                                           │    │
│  │  Offset 2: Event 3                                           │    │
│  │  ...                                                         │    │
│  │  Offset N: Event N                                           │    │
│  └─────────────────────────────────────────────────────────────┘    │
│     │                                                                │
│     │ (Stream Processing)                                            │
│     ▼                                                                │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │              Stream Processing Layer                         │    │
│  │                                                              │    │
│  │  • Real-time processing (latest offset)                      │    │
│  │  • Historical reprocessing (from offset 0)                   │    │
│  │                                                              │    │
│  │  Same code for both!                                         │    │
│  └─────────────────────────────────────────────────────────────┘    │
│     │                                                                │
│     ▼                                                                │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                    Serving Layer                             │    │
│  │                                                              │    │
│  │  Materialized Views (updated by stream processor)           │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                      │
│  Benefits:                                                           │
│  - Single codebase                                                   │
│  - Simplified operations                                             │
│  - Replay capability (reprocess with new logic)                      │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

**Implementation** (Reprocessing with Kafka):
```python
# To reprocess historical data with new logic:
# 1. Reset consumer group to earliest offset
# 2. Deploy new version of stream processor
# 3. Process all events from beginning

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    'events',
    bootstrap_servers=['kafka:9092'],
    group_id='processor-v2',  # New consumer group
    auto_offset_reset='earliest'  # Start from beginning
)

# Or reset existing group
# kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
#   --group processor-v1 --reset-offsets --to-earliest --execute --topic events

for message in consumer:
    # Process with NEW business logic
    # (process_v2 and save_to_database are placeholders for application code)
    new_result = process_v2(message.value)
    save_to_database(new_result)
```
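The replay idea works without any Kafka dependency, which makes it easy to see why one codebase is enough. In the sketch below, `EventLog` is an in-memory stand-in for a Kafka topic, and `revenue_v1` and `revenue_v2` are hypothetical versions of the business logic; the same `materialize` function serves both live processing and full reprocessing.

```python
class EventLog:
    """Append-only, replayable log: an in-memory stand-in for a Kafka topic."""

    def __init__(self):
        self._events = []

    def append(self, event) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read_from(self, offset: int = 0):
        """Replay events starting at the given offset."""
        yield from self._events[offset:]

def materialize(log, apply_event, from_offset=0):
    """Build a materialized view by folding events through one processing function."""
    view = {}
    for event in log.read_from(from_offset):
        apply_event(view, event)
    return view

def revenue_v1(view, event):
    # Original logic: sum every amount per user.
    view[event["user"]] = view.get(event["user"], 0) + event["amount"]

def revenue_v2(view, event):
    # New logic: ignore refunds (negative amounts) when computing revenue.
    if event["amount"] > 0:
        revenue_v1(view, event)

log = EventLog()
for user, amount in [("u1", 30), ("u2", 20), ("u1", -10), ("u2", 5)]:
    log.append({"user": user, "amount": amount})

view_v1 = materialize(log, revenue_v1)
view_v2 = materialize(log, revenue_v2)  # replay from offset 0 with the new logic

print(view_v1)  # {'u1': 20, 'u2': 25}
print(view_v2)  # {'u1': 30, 'u2': 25}
```

A common rollout pattern is to build the new view alongside the old one and switch readers over once the replay has caught up with the head of the log.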

**Comparison**:
```
┌───────────────────────┬────────────────────────┬────────────────────────┐
│ Aspect                │ Lambda                 │ Kappa                  │
├───────────────────────┼────────────────────────┼────────────────────────┤
│ Code Complexity       │ High (duplicate logic) │ Low (single codebase)  │
├───────────────────────┼────────────────────────┼────────────────────────┤
│ Latency               │ Low (speed layer)      │ Low (streaming)        │
├───────────────────────┼────────────────────────┼────────────────────────┤
│ Accuracy              │ Exact (batch) + Approx │ Depends on windowing   │
├───────────────────────┼────────────────────────┼────────────────────────┤
│ Reprocessing          │ Rerun batch jobs       │ Replay from log        │
├───────────────────────┼────────────────────────┼────────────────────────┤
│ Use Case              │ Complex analytics      │ Event-driven, IoT,     │
│                       │ requiring exactness    │ real-time monitoring   │
└───────────────────────┴────────────────────────┴────────────────────────┘
```

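**Modern Trend**: Many teams now favor a Kappa-style architecture built on tools like Flink or Kafka Streams, using the event log as the single source of truth. Lambda remains a reasonable choice when replaying the full history through the streaming path would be too slow or too expensive.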

---

## **9.5 Data Pipeline Orchestration**

Complex data workflows require orchestration—managing dependencies, scheduling, retries, and monitoring.

### **Apache Airflow**

**Concept**: Define workflows as Directed Acyclic Graphs (DAGs) in Python. Tasks are executed based on dependencies.
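Dependency-driven execution comes down to running tasks in a topological order of the graph and retrying the ones that fail. The sketch below uses the standard-library `graphlib` module (Python 3.9+) rather than Airflow, and runs everything sequentially in one process; `run_dag`, the task names, and the flaky task are all made up for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def run_dag(dependencies, callables, retries=0):
    """Run each task after all of its upstream tasks, retrying failures."""
    results = {}
    for task in TopologicalSorter(dependencies).static_order():
        for attempt in range(retries + 1):
            try:
                results[task] = callables[task]()
                break
            except Exception:
                if attempt == retries:
                    raise  # out of retries: fail the whole run

    return results

# Each task maps to the set of tasks it depends on (its upstream tasks).
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "quality_check": {"load"},
}

attempts = {"transform": 0}

def flaky_transform():
    """Fails on the first attempt to show the retry path."""
    attempts["transform"] += 1
    if attempts["transform"] == 1:
        raise RuntimeError("transient failure")
    return "transformed"

callables = {
    "extract": lambda: "extracted",
    "transform": flaky_transform,
    "load": lambda: "loaded",
    "quality_check": lambda: "ok",
}

results = run_dag(dependencies, callables, retries=3)
print(list(results))  # ['extract', 'transform', 'load', 'quality_check']
```

A cycle in `dependencies` makes `TopologicalSorter` raise `CycleError`, which is the same reason an orchestrator rejects workflows that are not acyclic.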

**Architecture**:
```
┌─────────────────────────────────────────────────────────────┐
│                    Airflow Architecture                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              Web Server (Flask)                      │   │
│  │         (UI, REST API, DAG management)               │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              Scheduler                               │   │
│  │    (Parses DAGs, schedules tasks, queues execution)  │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              Executor                                │   │
│  │  (Local, Celery, Kubernetes)                         │   │
│  │                                                      │   │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐               │   │
│  │  │ Worker  │ │ Worker  │ │ Worker  │               │   │
│  │  │ (Task)  │ │ (Task)  │ │ (Task)  │               │   │
│  │  └─────────┘ └─────────┘ └─────────┘               │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              Metadata Database (PostgreSQL)          │   │
│  │         (DAG runs, task instances, logs, etc.)       │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

**Implementation** (Airflow DAG):
```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator
from datetime import datetime, timedelta

# Default arguments
default_args = {
    'owner': 'data-engineering',
    'depends_on_past': False,
    'email_on_failure': True,
    'email': ['alerts@company.com'],
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'execution_timeout': timedelta(hours=2)
}

# Define DAG
with DAG(
    'etl_daily_sales',
    default_args=default_args,
    description='Daily sales ETL pipeline',
    schedule_interval='0 2 * * *',  # Daily at 2 AM (Airflow 2.4+ prefers the parameter name `schedule`)
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=['sales', 'etl'],
    max_active_runs=1
) as dag:

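    # Task 1: Extract from OLTP database
    # NOTE: PostgresOperator runs the statement but does not capture its output;
    # a real pipeline would export to staging, e.g. with PostgresHook.copy_expert()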
    extract_task = PostgresOperator(
        task_id='extract_sales_data',
        postgres_conn_id='oltp_db',
        sql="""
            COPY (
                SELECT * FROM sales 
                WHERE date = '{{ ds }}'
            ) TO STDOUT WITH CSV HEADER;
        """
    )
    
    # Task 2: Transform (Python function)
    def transform_data(**context):
        """Clean and aggregate sales data"""
        execution_date = context['ds']
        
        # Read extracted data
        raw_data = read_from_staging(execution_date)
        
        # Transformations
        clean_data = clean_and_validate(raw_data)
        aggregated = aggregate_by_region(clean_data)
        
        # Write to S3
        write_to_s3(aggregated, f"processed/{execution_date}/")
        
        return f"Processed {len(raw_data)} records"
    
    transform_task = PythonOperator(
        task_id='transform_data',
        python_callable=transform_data  # Airflow 2 passes the context to **context automatically
    )
    
    # Task 3: Load to Data Warehouse
    load_task = S3ToRedshiftOperator(
        task_id='load_to_warehouse',
        schema='analytics',
        table='daily_sales',
        s3_bucket='data-lake',
        s3_key="processed/{{ ds }}/",
        redshift_conn_id='redshift_default',
        copy_options=["CSV", "IGNOREHEADER 1"]
    )
    
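    # Task 4: Data Quality Check
    # NOTE: this only runs the query; to fail the run on bad rows, use a check
    # operator such as SQLCheckOperator (common.sql provider) instead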
    quality_check = PostgresOperator(
        task_id='data_quality_check',
        postgres_conn_id='redshift_default',
        sql="""
            SELECT COUNT(*) FROM analytics.daily_sales
            WHERE date = '{{ ds }}' AND amount < 0;
        """
    )
    
    # Define dependencies (DAG structure)
    extract_task >> transform_task >> load_task >> quality_check
    
    # Alternative: fan-out to several downstream tasks that all run
    # transform_task >> [load_task, error_handler]
    # Choosing only one path based on a condition requires a BranchPythonOperator
```

**Modern Alternatives**:
```python
# Prefect (Simpler, Python-native)
from prefect import flow, task
from prefect.tasks import task_input_hash

@task(cache_key_fn=task_input_hash, retries=3)
def extract_data(date: str):
    return fetch_from_api(date)

@task
def transform_data(raw_data):
    return [clean_record(r) for r in raw_data]

@task
def load_data(clean_data):
    insert_to_warehouse(clean_data)

@flow(name="ETL Pipeline")
def etl_flow(date: str):
    raw = extract_data(date)
    transformed = transform_data(raw)
    load_data(transformed)

# Run
etl_flow("2024-01-15")
```

---

## **9.6 Data Warehousing and OLAP**

Data warehouses are optimized for analytical queries (OLAP) rather than transactional processing (OLTP).

### **OLTP vs. OLAP**

```
┌─────────────────────────────────────────────────────────────────────┐
│                    OLTP vs OLAP Comparison                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  OLTP (Online Transaction Processing)                                │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │ • Optimized for INSERT/UPDATE/DELETE                        │    │
│  │ • Normalized schema (3NF)                                   │    │
│  │ • Row-oriented storage                                      │    │
│  │ • High concurrency, short transactions                      │    │
│  │ • Current data only                                         │    │
│  │ • Examples: PostgreSQL, MySQL, Oracle                       │    │
│  │ • Use Case: Order processing, user registration             │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                      │
│  OLAP (Online Analytical Processing)                                 │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │ • Optimized for SELECT (aggregations)                       │    │
│  │ • Denormalized schema (star/snowflake schema)               │    │
│  │ • Column-oriented storage                                   │    │
│  │ • Batch loads, complex queries                              │    │
│  │ • Historical data (years)                                   │    │
│  │ • Examples: Snowflake, BigQuery, Redshift                   │    │
│  │ • Use Case: Sales reports, trend analysis                   │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

### **Columnar Storage**

**Why Columnar?** Analytical queries typically access few columns but many rows.

```
Row-Oriented (OLTP):
┌────┬────────┬────────┬────────┐
│ ID │ Name   │ Amount │ Date   │
├────┼────────┼────────┼────────┤
│ 1  │ Alice  │ 100    │ 01-15  │
│ 2  │ Bob    │ 200    │ 01-15  │
│ 3  │ Carol  │ 150    │ 01-16  │
└────┴────────┴────────┴────────┘
Storage: [1,Alice,100,01-15,2,Bob,200,01-15,3,Carol,150,01-16]
Query: SUM(Amount) -> Must read every row, including the unused columns (inefficient)

Column-Oriented (OLAP):
Column "ID":     [1, 2, 3]
Column "Name":   [Alice, Bob, Carol]
Column "Amount": [100, 200, 150]
Column "Date":   [01-15, 01-15, 01-16]

Query: SUM(Amount) -> Read only the Amount column (efficient)
Compression: Values in a column share a type and often repeat, so they compress well (e.g., run-length encoding)
```
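
The same three rows can be used to sketch both layouts in plain Python. This is only an in-memory illustration of the access pattern and of run-length encoding, not how a warehouse lays out data on disk; the helper name `run_length_encode` is made up for this example.

```python
from itertools import groupby

# Row layout: one tuple per record (ID, Name, Amount, Date).
rows = [
    (1, "Alice", 100, "01-15"),
    (2, "Bob", 200, "01-15"),
    (3, "Carol", 150, "01-16"),
]

# Column layout: one contiguous list per column.
names = ("id", "name", "amount", "date")
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}

# Row store: every tuple is visited even though only one field is needed.
total_from_rows = sum(row[2] for row in rows)

# Column store: only the "amount" list is touched.
total_from_columns = sum(columns["amount"])


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


encoded_dates = run_length_encode(columns["date"])
print(total_from_rows, total_from_columns, encoded_dates)
```

Run-length encoding pays off most when the column is sorted or naturally clustered, as the date column is here.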

**Star Schema Example**:
```sql
-- Fact Table (measurements)
CREATE TABLE sales_fact (
    sale_id BIGINT,
    date_key INT,          -- Foreign key
    product_key INT,       -- Foreign key
    customer_key INT,      -- Foreign key
    store_key INT,         -- Foreign key
    quantity INT,
    amount DECIMAL(10,2),
    discount DECIMAL(5,2)
);

-- Dimension Tables (descriptions)
CREATE TABLE date_dim (
    date_key INT PRIMARY KEY,
    full_date DATE,
    day_of_week VARCHAR(10),
    month VARCHAR(10),
    quarter INT,
    year INT,
    is_holiday BOOLEAN
);

CREATE TABLE product_dim (
    product_key INT PRIMARY KEY,
    sku VARCHAR(50),
    name VARCHAR(200),
    category VARCHAR(100),
    brand VARCHAR(100),
    cost DECIMAL(10,2)
);

-- Query: Sales by category, Q4 2023
SELECT 
    p.category,
    SUM(f.amount) as total_sales
FROM sales_fact f
JOIN date_dim d ON f.date_key = d.date_key
JOIN product_dim p ON f.product_key = p.product_key
WHERE d.year = 2023 AND d.quarter = 4
GROUP BY p.category;
```
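
To try the star-schema query without a warehouse, the sketch below loads a trimmed-down version of the tables into an in-memory SQLite database. The sample rows and the reduced column set are invented for illustration, and SQLite is a row-oriented engine, so this shows the join shape only, not warehouse performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE sales_fact (sale_id INTEGER, date_key INTEGER, product_key INTEGER, amount REAL);

INSERT INTO date_dim VALUES (20231115, 2023, 4), (20230710, 2023, 3);
INSERT INTO product_dim VALUES (1, 'Books'), (2, 'Toys');
INSERT INTO sales_fact VALUES
    (1, 20231115, 1, 20.0),
    (2, 20231115, 1, 5.0),
    (3, 20231115, 2, 12.5),
    (4, 20230710, 2, 99.0);   -- Q3 sale, excluded by the filter below
""")

# Same shape as the query above: filter on a dimension, aggregate the fact.
result = conn.execute("""
    SELECT p.category, SUM(f.amount) AS total_sales
    FROM sales_fact f
    JOIN date_dim d ON f.date_key = d.date_key
    JOIN product_dim p ON f.product_key = p.product_key
    WHERE d.year = 2023 AND d.quarter = 4
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

print(result)
```

The fact table holds only keys and measures, so filters are expressed against the small dimension tables and the large table is touched only through joins.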

### **Modern Data Warehouses**

**Snowflake** (Cloud-native, separation of compute and storage):
```sql
-- Snowflake architecture: Storage + Compute (Virtual Warehouses) + Services
-- Scale compute independently of storage

-- Create warehouse (compute)
CREATE WAREHOUSE etl_wh WITH
    WAREHOUSE_SIZE = 'X-SMALL'
    AUTO_SUSPEND = 300  -- Suspend after 5 min idle
    AUTO_RESUME = TRUE;

-- Create database (storage)
CREATE DATABASE analytics_db;

-- Use warehouse
USE WAREHOUSE etl_wh;

-- Query (runs on the current warehouse, which auto-resumes if suspended)
SELECT 
    customer_segment,
    AVG(order_value) as avg_order,
    COUNT(*) as order_count
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_segment;

-- Scale up for heavy query
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Zero-copy cloning (instant dev/test environments)
CREATE DATABASE analytics_dev CLONE analytics_db;
```

**Google BigQuery** (Serverless, pay-per-query):
```python
from google.cloud import bigquery

client = bigquery.Client()

# Query (serverless - no infrastructure to manage)
query = """
    SELECT 
        user_id,
        COUNT(*) as session_count,
        AVG(session_duration) as avg_duration
    FROM `project.dataset.events`
    WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
    GROUP BY user_id
    HAVING COUNT(*) > 10
"""

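# Run query (on-demand pricing bills for bytes processed, not uptime)
job = client.query(query)
results = job.result()  # Blocks until the query finishes

# Partitioning and clustering for cost optimization:
# queries that filter on the partition column scan only matching partitions.
# Run the DDL once, e.g. client.query(ddl).result()
ddl = """
CREATE TABLE `project.dataset.events` (
    event_id STRING,
    user_id STRING,
    event_timestamp TIMESTAMP
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id;
"""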
```
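
Why partitioning lowers cost can be sketched without a BigQuery account. The toy model below groups rows by day and counts the bytes a date-range filter would read with and without partition pruning. The fixed `ROW_BYTES` size and the sample events are assumptions made for illustration; real billing depends on column types and the columns a query selects.

```python
from collections import defaultdict
from datetime import date

ROW_BYTES = 100  # hypothetical fixed row size, for illustration only

events = [
    {"user_id": "u1", "event_date": date(2024, 1, 1)},
    {"user_id": "u2", "event_date": date(2024, 1, 1)},
    {"user_id": "u1", "event_date": date(2024, 1, 2)},
    {"user_id": "u3", "event_date": date(2024, 2, 1)},
]

# Unpartitioned table: a date filter still has to read every row.
full_scan_bytes = len(events) * ROW_BYTES

# Partitioned by day: rows are grouped up front, so whole partitions can be skipped.
partitions = defaultdict(list)
for event in events:
    partitions[event["event_date"]].append(event)


def scan_partitioned(start, end):
    """Return (matching rows, bytes read) when only in-range partitions are opened."""
    matched = [
        event
        for day, day_rows in partitions.items()
        if start <= day <= end
        for event in day_rows
    ]
    return matched, len(matched) * ROW_BYTES


january, pruned_bytes = scan_partitioned(date(2024, 1, 1), date(2024, 1, 31))
print(full_scan_bytes, pruned_bytes)
```

Pruning only helps when the query filters on the partitioning column; a filter on any other column still reads every partition.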

---

## **9.7 Real-World Data Architecture Example**

**Clickstream Analytics Platform**:

```
User Events (Website/App)
    │
    ▼
┌─────────────────────────────────────────────────┐
│             Kafka (Event Streaming)             │
│ (Durable buffer, decouples producers/consumers) │
└──────┬─────────────────┬─────────────────┬──────┘
       │                 │                 │
       ▼                 ▼                 ▼
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│ Flink       │   │ Spark       │   │ S3          │
│ (Real-time) │   │ (Batch ETL) │   │ (Data Lake) │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       ▼                 ▼                 ▼
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│ Redis       │   │ Snowflake   │   │ Athena      │
│ (Cache/     │   │ (Data       │   │ (Ad-hoc     │
│  Hot)       │   │  Warehouse) │   │  queries)   │
└─────────────┘   └──────┬──────┘   └─────────────┘
                         │
                         ▼
                ┌─────────────────┐
                │ Tableau/Looker  │
                │ (BI/Dashboards) │
                └─────────────────┘
```
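
The decoupling that Kafka provides in this architecture comes from each consumer tracking its own position in a shared, append-only log. The sketch below models a single partition with a Python list. `EventLog` and `Consumer` are hypothetical classes written for this illustration and do not mirror the Kafka client API.

```python
class EventLog:
    """Append-only list standing in for a single topic partition."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records=100):
        return self._records[offset:offset + max_records]


class Consumer:
    """Tracks its own offset, so consumers never interfere with each other."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=100):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch

    def rewind(self, offset=0):
        """Move back to an earlier offset to reprocess records."""
        self.offset = offset


log = EventLog()
for page in ["/home", "/cart", "/checkout"]:
    log.append({"page": page})

realtime = Consumer(log)  # e.g. the stream-processing path
batch = Consumer(log)     # e.g. the batch ETL path

seen_realtime = realtime.poll()
seen_batch_first = batch.poll(max_records=2)

batch.rewind()            # replay from the start, e.g. after a logic change
replayed = batch.poll()
```

Because reading never removes records, a slow batch consumer does not hold back the real-time path, and either path can be rebuilt by replaying the log.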

---

## **9.8 Key Takeaways**

1. **Choose batch for throughput, stream for latency**: Batch processing (Spark) handles petabytes efficiently. Stream processing (Flink) provides sub-second latency for real-time use cases.

2. **Event logs are the source of truth**: In a Kappa architecture, the Kafka log is the single system of record. Because the log can be replayed, historical data can be reprocessed whenever the processing logic changes.

3. **Windowing is essential for streams**: Use tumbling windows for fixed, non-overlapping intervals, sliding windows for overlapping intervals, and session windows for bursts of user activity. Watermarks bound how long a window waits for late-arriving data.

4. **Columnar storage for analytics**: Data warehouses (Snowflake, BigQuery) use columnar storage and massively parallel processing for fast aggregations.

5. **Orchestrate complex workflows**: Airflow/Prefect manage dependencies, retries, and scheduling for data pipelines. Treat data pipelines as code (version control, CI/CD).

6. **Schema evolution matters**: Use Avro/Protobuf with schema registries for backward/forward compatibility in streaming systems.

7. **Handle data skew**: Skewed keys overload some nodes while others sit idle. Mitigate skew by salting hot keys in Spark and by choosing Kafka partition keys that spread load evenly.
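
Key salting, mentioned in the last takeaway, can be sketched in plain Python as a two-stage count. Stage one appends a random salt to each key so that a hot key is spread over several partial aggregates; stage two removes the salt and merges the partials. `NUM_SALTS`, the function names, and the sample events are assumptions for illustration, and in Spark the same idea would be expressed with DataFrame operations.

```python
import random
from collections import Counter

NUM_SALTS = 4  # hypothetical fan-out per key


def salted_counts(keys, num_salts=NUM_SALTS, seed=0):
    """Stage 1: count per (key, salt), spreading a hot key over several groups."""
    rng = random.Random(seed)
    stage_one = Counter()
    for key in keys:
        stage_one[(key, rng.randrange(num_salts))] += 1
    return stage_one


def merge_salts(stage_one):
    """Stage 2: drop the salt and combine the partial counts per original key."""
    totals = Counter()
    for (key, _salt), count in stage_one.items():
        totals[key] += count
    return totals


# One power user produces 90% of the events.
events = ["power_user"] * 900 + ["user_a"] * 50 + ["user_b"] * 50

partials = salted_counts(events)
totals = merge_salts(partials)
largest_partial = max(partials.values())

print(totals["power_user"], largest_partial)
```

The final totals match an unsalted count, but no single group in stage one has to process all of the hot key's events. The cost is an extra aggregation step, so salting is usually applied only to keys known to be hot.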

---

## **Chapter Summary**

In this chapter, we explored data-intensive systems—the technologies that power modern analytics and real-time processing. We compared batch processing (MapReduce, Spark) with stream processing (Flink, Kafka Streams), understanding when each paradigm is appropriate.

We examined windowing strategies for unbounded streams and the architectural patterns that unify batch and stream processing: Lambda architecture (dual paths) and Kappa architecture (single stream-based path).

Data pipeline orchestration tools (Airflow, Prefect) enable complex workflow management, while modern data warehouses (Snowflake, BigQuery) provide scalable analytics through columnar storage and separation of compute and storage.

The chapter concluded with practical architectural guidance for building robust data platforms that balance latency, throughput, and cost.

**Coming up next**: In Chapter 10, we'll explore Reliability & Fault Tolerance—strategies for building systems that survive component failures, including redundancy patterns, disaster recovery, and chaos engineering.

---

**Exercises**:

1. **Architecture Selection**: Design a data pipeline for a ride-sharing app that needs:
   - Real-time driver matching (sub-second latency)
   - Daily fare calculation and driver payments (batch)
   - Real-time fraud detection
   - Monthly business intelligence reports
   
   Which technologies would you use for each requirement? Draw the architecture diagram.

2. **Windowing Strategy**: You're building a sessionization pipeline for website analytics. Users are considered "active" if they have events within 30 minutes of each other. Which windowing strategy would you use? Implement a simple version using your preferred stream processing framework.

3. **Data Skew Handling**: You have a Spark job processing user events, but 10% of users generate 90% of events (power users). How would you handle this data skew to prevent some executors from being overwhelmed?

4. **Cost Optimization**: Your BigQuery bill is unexpectedly high. The table has 1TB of data, but queries are scanning 500GB each time. What optimizations would you implement (partitioning, clustering, materialized views)?

5. **Exactly-Once Semantics**: Design a system that transfers money between accounts using Kafka Streams. How would you ensure exactly-once processing (no double counting) even if the stream processor crashes and restarts?

---
