# Chapter 15: Data-Intensive Systems

---

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Design batch processing pipelines using MapReduce and modern frameworks
- Build real-time stream processing applications with Apache Flink and Kafka Streams
- Choose between Lambda and Kappa architectures for big data systems
- Orchestrate complex data pipelines using Apache Airflow
- Differentiate between OLTP and OLAP systems and select appropriate technologies
- Implement data warehousing solutions with columnar storage
- Handle the challenges of processing petabyte-scale data

---

## **Introduction: The Data Explosion**

We live in the age of data. Every click, swipe, purchase, and sensor reading generates data. Modern systems must process terabytes to petabytes of it efficiently, whether they are analyzing historical trends or reacting to events in real time.

### **The Two Worlds of Data Processing**

```
┌─────────────────────────────────────────────────────────────────┐
│              BATCH VS STREAM PROCESSING                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  BATCH PROCESSING                                               │
│  ───────────────────────────────────────────────────────────   │
│                                                                 │
│  Characteristics:                                               │
│  • Process large volumes of historical data                     │
│  • High latency acceptable (minutes to hours)                   │
│  • High throughput                                              │
│  • Complex aggregations and joins                               │
│  • Scheduled or triggered                                       │
│                                                                 │
│  Examples:                                                      │
│  • Daily sales reports                                          │
│  • Monthly billing calculations                                 │
│  • Machine learning model training                              │
│  • Data warehouse ETL                                           │
│                                                                 │
│  Tools: Hadoop, Spark, MapReduce, Airflow                       │
│                                                                 │
│  ───────────────────────────────────────────────────────────   │
│                                                                 │
│  STREAM PROCESSING                                              │
│  ───────────────────────────────────────────────────────────   │
│                                                                 │
│  Characteristics:                                               │
│  • Process data in real-time as it arrives                      │
│  • Low latency required (milliseconds to seconds)               │
│  • Continuous processing                                          │
│  • Event-by-event or micro-batches                              │
│  • Stateful operations (windows, aggregations)                  │
│                                                                 │
│  Examples:                                                      │
│  • Real-time fraud detection                                    │
│  • Stock price monitoring                                       │
│  • IoT sensor processing                                        │
│  • Live dashboard updates                                       │
│  • Clickstream analysis                                         │
│                                                                 │
│  Tools: Flink, Kafka Streams, Spark Streaming, Storm            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

---

## **Batch Processing with MapReduce**

MapReduce is a foundational programming model for distributed batch processing, introduced by Google in 2004 and popularized by the open-source Hadoop project.

### **The MapReduce Paradigm**

```
┌─────────────────────────────────────────────────────────────────┐
│                    MAPREDUCE WORKFLOW                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Input Data (Distributed across cluster)                        │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐               │
│  │ Split 1 │ │ Split 2 │ │ Split 3 │ │ Split 4 │               │
│  │ (64MB)  │ │ (64MB)  │ │ (64MB)  │ │ (64MB)  │               │
│  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘               │
│       │           │           │           │                      │
│       ↓           ↓           ↓           ↓                      │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐               │
│  │  Map    │ │  Map    │ │  Map    │ │  Map    │               │
│  │ Task 1  │ │ Task 2  │ │ Task 3  │ │ Task 4  │               │
│  │         │ │         │ │         │ │         │               │
│  │ Input:  │ │ Input:  │ │ Input:  │ │ Input:  │               │
│  │ K1,V1   │ │ K2,V2   │ │ K3,V3   │ │ K4,V4   │               │
│  │         │ │         │ │         │ │         │               │
│  │ Output: │ │ Output: │ │ Output: │ │ Output: │               │
│  │ K1,[V1] │ │ K2,[V2] │ │ K1,[V3] │ │ K3,[V4] │               │
│  │         │ │         │ │ K2,[V5] │ │         │               │
│  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘               │
│       │           │           │           │                      │
│       └───────────┴─────┬─────┴───────────┘                      │
│                         │                                        │
│                         ↓                                        │
│                  ┌─────────────┐                                 │
│                  │   Shuffle   │ ← Sort and group by key         │
│                  │   & Sort    │                                 │
│                  └──────┬──────┘                                 │
│                         │                                        │
│       ┌─────────────────┼─────────────────┐                    │
│       ↓                 ↓                 ↓                    │
│  ┌─────────┐      ┌─────────┐      ┌─────────┐                 │
│  │ Reduce  │      │ Reduce  │      │ Reduce  │                 │
│  │ Task 1  │      │ Task 2  │      │ Task 3  │                 │
│  │         │      │         │      │         │                 │
│  │ Input:  │      │ Input:  │      │ Input:  │                 │
│  │ K1,[V1, │      │ K2,[V2, │      │ K3,[V4] │                 │
│  │    V3]  │      │    V5]  │      │         │                 │
│  │         │      │         │      │         │                 │
│  │ Output: │      │ Output: │      │ Output: │                 │
│  │ K1,Sum  │      │ K2,Sum  │      │ K3,Sum  │                 │
│  └────┬────┘      └────┬────┘      └────┬────┘                 │
│       │                │                │                        │
│       └────────────────┼────────────────┘                        │
│                        ↓                                        │
│                 ┌─────────────┐                                 │
│                 │   Output    │                                 │
│                 │   Files     │                                 │
│                 └─────────────┘                                 │
│                                                                 │
│  MapReduce Word Count Example:                                  │
│                                                                 │
│  Map:   Input  → (word, 1) for each word                        │
│  Shuffle: Group by word                                          │
│  Reduce: (word, [1,1,1,...]) → (word, sum)                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
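The word count flow in the diagram can be sketched in plain Python. This is a single-process illustration of the three phases, not how Hadoop distributes the work; the function names and the sample splits are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain

def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reduce: collapse the list of counts for one word into a sum."""
    return key, sum(values)

def word_count(splits):
    # Each split would be handled by a separate map task on a real cluster
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

splits = ["the quick brown fox", "the lazy dog", "the fox"]
counts = word_count(splits)
print(counts)
```

Because each call to `map_phase` depends only on its own split, and each call to `reduce_phase` only on one key, both phases can run in parallel across machines. The shuffle is the only step that moves data between them.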

**Modern Alternative: Apache Spark**

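Spark has largely replaced Hadoop MapReduce for batch workloads. It keeps intermediate data in memory where possible instead of writing it to disk between stages, and its DataFrame API is higher-level than hand-written map and reduce functions.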

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as spark_sum, count

# Initialize Spark
spark = SparkSession.builder \
    .appName("OrderAnalytics") \
    .getOrCreate()

# Read data (from S3, HDFS, etc.)
orders_df = spark.read.parquet("s3://data-lake/orders/")

# Process with DataFrame API (optimized)
result = orders_df \
    .groupBy("customer_id") \
    .agg(
        spark_sum("total").alias("lifetime_value"),
        count("*").alias("order_count")
    ) \
    .filter(col("lifetime_value") > 1000)

# Write results
result.write.mode("overwrite").parquet("s3://analytics/high-value-customers/")

# Stop Spark
spark.stop()
```

---

## **Stream Processing**

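Real-time processing requires different tools than batch because the input is unbounded: a stream processor must keep state and emit results continuously rather than run to completion over a fixed dataset.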

### **Apache Flink**

Flink is a widely used stream processing framework that handles events one at a time as they arrive, rather than grouping them into micro-batches.

```java
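// Flink Java API example (imports omitted for brevity).
// OrderEvent, OrderStats, OrderAggregator, OrderDeserializationSchema,
// DynamoDBSink, and StatsSerializer are assumed to be application-defined classes.
StreamExecutionEnvironment env = 
    StreamExecutionEnvironment.getExecutionEnvironment();

// Source: Kafka
// Note: FlinkKafkaConsumer is deprecated in recent Flink releases in favor of KafkaSource.
DataStream<OrderEvent> orders = env
    .addSource(new FlinkKafkaConsumer<>(
        "orders-topic",
        new OrderDeserializationSchema(),
        properties
    ))
    // Event-time windows only fire once watermarks advance, so assign
    // timestamps and watermarks. getTimestamp() is an assumed accessor
    // returning epoch milliseconds; the 10-second bound is illustrative.
    .assignTimestampsAndWatermarks(
        WatermarkStrategy
            .<OrderEvent>forBoundedOutOfOrderness(Duration.ofSeconds(10))
            .withTimestampAssigner((event, ts) -> event.getTimestamp())
    );

// Processing: Windowed aggregation (one result per product per 5-minute window)
DataStream<OrderStats> stats = orders
    .keyBy(OrderEvent::getProductId)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(new OrderAggregator());

// Sink: DynamoDB
stats.addSink(new DynamoDBSink<>(
    "order-stats-table",
    new StatsSerializer()
));

env.execute("Order Analytics Job");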
```
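To make the windowing step concrete, here is a minimal Python sketch of how a tumbling event-time window assigns each event to exactly one bucket. It deliberately ignores watermarks and late data, which a real stream processor has to handle; the event tuples and function names are assumptions for illustration.

```python
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000  # 5-minute tumbling windows

def window_start(timestamp_ms, size_ms=WINDOW_MS):
    """Return the start of the tumbling window that contains the timestamp."""
    return timestamp_ms - (timestamp_ms % size_ms)

def aggregate_by_window(events):
    """Sum amounts per (product_id, window_start).

    Each event is a (product_id, event_time_ms, amount) tuple.
    """
    totals = defaultdict(float)
    for product_id, ts_ms, amount in events:
        totals[(product_id, window_start(ts_ms))] += amount
    return dict(totals)

events = [
    ("p1", 0, 10.0),
    ("p1", 299_999, 5.0),   # last millisecond of the first window
    ("p1", 300_000, 7.0),   # first millisecond of the second window
    ("p2", 120_000, 3.0),
]
window_totals = aggregate_by_window(events)
print(window_totals)
```

Windows are aligned to the epoch and never overlap, so the bucket depends only on the event's own timestamp, not on when it arrives. That property is what lets event-time results stay the same when a stream is replayed.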

### **Kafka Streams**

For simpler stream processing, Kafka Streams is a lightweight library that runs inside your application, with no separate processing cluster to operate.

```java
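// Kafka Streams DSL (imports omitted). Order, Customer, and EnrichedOrder are
// assumed application classes whose serdes are configured as defaults in props.
StreamsBuilder builder = new StreamsBuilder();

// Assumes both topics are keyed by customer ID so the join can match records
KStream<String, Order> orders = builder.stream("orders");

// Enrich with customer data
KTable<String, Customer> customers = builder.table("customers");

// Left join: customer is null when no matching record exists yet
KStream<String, EnrichedOrder> enriched = orders
    .leftJoin(customers, (order, customer) -> 
        new EnrichedOrder(order, customer)
    );

// Filter and aggregate
enriched
    .filter((key, value) -> value.getAmount() > 100)
    // Re-keying by category triggers a repartition through an internal topic
    .groupBy((key, value) -> value.getCategory())
    // TimeWindows.of(...) is deprecated since Kafka 3.0
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
    .aggregate(
        () -> 0.0,
        (key, value, aggregate) -> aggregate + value.getAmount(),
        // The aggregate is a Double, so set its serde explicitly
        Materialized.<String, Double, WindowStore<Bytes, byte[]>>as("category-revenue")
            .withValueSerde(Serdes.Double())
    );

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();

// Close the topology cleanly when the JVM shuts down
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));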
```
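The stream-table left join used for enrichment can be sketched without Kafka. In this simplified Python model, the table is the latest value per key and each stream record is joined against the table as it stands when the record arrives. The event shape and the function name are assumptions for illustration.

```python
def enrich_orders(events):
    """Process an interleaved sequence of table updates and stream records.

    Each event is (kind, key, value), where kind is "customer" (a table
    update) or "order" (a stream record).
    """
    customers = {}   # the table: latest value per key
    enriched = []
    for kind, key, value in events:
        if kind == "customer":
            customers[key] = value
        elif kind == "order":
            # Left join: keep the order even when no customer is known yet
            enriched.append({
                "customer_id": key,
                "order": value,
                "customer": customers.get(key),
            })
    return enriched

events = [
    ("order", "c1", {"amount": 50}),        # arrives before the customer record
    ("customer", "c1", {"name": "Alice"}),
    ("order", "c1", {"amount": 120}),
]
result = enrich_orders(events)
print(result)
```

The first order is emitted with no customer attached because the table had no entry for `c1` at that moment. Code that consumes a left join therefore has to tolerate a missing right-hand side.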

---

## **Lambda vs Kappa Architecture**

Lambda and Kappa are two competing architectures for systems that need both real-time and historical views of the same data.

```
┌─────────────────────────────────────────────────────────────────┐
│              LAMBDA VS KAPPA ARCHITECTURE                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  LAMBDA ARCHITECTURE                                            │
│  ───────────────────────────────────────────────────────────   │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                    Data Source                           │   │
│  │                   (Kafka, Kinesis, etc.)                  │   │
│  └───────────────────────┬─────────────────────────────────┘   │
│                          │                                      │
│          ┌───────────────┼───────────────┐                     │
│          ↓               ↓               ↓                     │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐          │
│  │ Speed Layer  │ │ Batch Layer  │ │ Serving Layer│          │
│  │ (Stream)     │ │ (Batch)      │ │ (Query)      │          │
│  │              │ │              │ │              │          │
│  │ • Real-time  │ │ • Hadoop/    │ │ • Presto/    │          │
│  │   views      │ │   Spark      │ │   Druid      │          │
│  │ • Low        │ │ • Accurate   │ │ • Merged     │          │
│  │   latency    │ │   historical │ │   results    │          │
│  │ • Approximate│ │   data       │ │              │          │
│  └──────────────┘ └──────────────┘ └──────────────┘          │
│                                                                 │
│  Problems:                                                      │
│  • Complex: Two codebases (batch + stream)                    │
│  • Expensive: Duplicate processing                              │
│  • Inconsistent: Different results from batch vs speed        │
│                                                                 │
│  ───────────────────────────────────────────────────────────   │
│                                                                 │
│  KAPPA ARCHITECTURE                                             │
│  ───────────────────────────────────────────────────────────   │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                    Data Source                           │   │
│  │                   (Kafka, Kinesis)                        │   │
│  │                   (Immutable Log)                        │   │
│  └───────────────────────┬─────────────────────────────────┘   │
│                          │                                      │
│                          ↓                                      │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │              Stream Processing Layer                     │   │
│  │                                                          │   │
│  │  • Single codebase for all processing                   │   │
│  │  • Reprocess entire log when needed (rebuild state)     │   │
│  │  • Kafka Streams, Flink, Spark Streaming                │   │
│  │                                                          │   │
│  └───────────────────────┬─────────────────────────────────┘   │
│                          │                                      │
│                          ↓                                      │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │              Serving Layer (Materialized Views)          │   │
│  │                                                          │   │
│  │  • Queryable state stores                                │   │
│  │  • Real-time and historical from same pipeline          │   │
│  │                                                          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Advantages:                                                    │
│  • Simpler: Single processing pipeline                          │
│  • Consistent: Same code for real-time and batch                │
│  • Replayable: Can reprocess history by rewinding log            │
│  • Scalable: Log-centric architecture scales well               │
│                                                                 │
│  When to Use Kappa:                                             │
│  • Event-sourced systems                                          │
│  • Real-time analytics with historical replay needs               │
│  • Audit trails and compliance                                    │
│  • When you want to avoid Lambda complexity                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
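The "merged results" step of a Lambda serving layer can be sketched as follows. This assumes an additive metric (page view counts) and assumes the speed view contains only events that arrived after the last batch run; the view contents and the function name are made up for illustration.

```python
def merge_views(batch_view, speed_view):
    """Serving-layer query: batch totals plus increments seen since the batch cutoff."""
    merged = dict(batch_view)  # copy so the batch view is left untouched
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

batch_view = {"page_a": 1000, "page_b": 250}   # computed by the last batch job
speed_view = {"page_a": 12, "page_c": 3}       # events since the batch cutoff
serving = merge_views(batch_view, speed_view)
print(serving)
```

The merge is only correct if the two layers agree on the cutoff and compute the metric the same way. Keeping that agreement across two codebases is the main source of the complexity and inconsistency listed under "Problems" above.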

**Kappa Architecture Implementation:**

```python
# Kappa architecture sketch using the kafka-python client (not Kafka Streams)
from kafka import KafkaConsumer, KafkaProducer
import json

class KappaProcessor:
    """
    Kappa architecture processor using Kafka.
    Single pipeline for both real-time and historical processing.
    """
    
    def __init__(self, bootstrap_servers):
        self.consumer = KafkaConsumer(
            'events-topic',
            bootstrap_servers=bootstrap_servers,
            auto_offset_reset='earliest',  # Can rewind to beginning
            enable_auto_commit=False,       # Commit manually after processing (at-least-once)
            group_id='kappa-processor'
        )
        
        self.producer = KafkaProducer(
            bootstrap_servers=bootstrap_servers,
            value_serializer=lambda v: json.dumps(v).encode('utf-8')
        )
        
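        # Materialized view (state store). This dict lives in memory only:
        # after a restart it is empty even though offsets were committed,
        # so rebuild it with replay_from_beginning() or persist it externally.
        self.state_store = {}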
    
    def process_event(self, event):
        """
        Process single event and update materialized view.
        Same code runs for both real-time and replay.
        """
        event_type = event.get('type')
        user_id = event.get('user_id')
        
        if event_type == 'page_view':
            # Update view count
            if user_id not in self.state_store:
                self.state_store[user_id] = {'views': 0, 'purchases': 0}
            self.state_store[user_id]['views'] += 1
            
        elif event_type == 'purchase':
            # Update purchase count
            if user_id not in self.state_store:
                self.state_store[user_id] = {'views': 0, 'purchases': 0}
            self.state_store[user_id]['purchases'] += 1
            
            # Emit derived event
            self.producer.send('user-metrics', {
                'user_id': user_id,
                'conversion_rate': self.calculate_conversion(user_id)
            })
    
    def calculate_conversion(self, user_id):
        """Calculate conversion rate for user."""
        data = self.state_store.get(user_id, {})
        views = data.get('views', 0)
        purchases = data.get('purchases', 0)
        return purchases / views if views > 0 else 0
    
    def run(self):
        """Main processing loop."""
        print("Starting Kappa processor...")
        print("Can replay from beginning by resetting offset")
        
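        for message in self.consumer:
            try:
                event = json.loads(message.value.decode('utf-8'))
                self.process_event(event)

                # Commit only after processing succeeds. A crash before this
                # line means the event is redelivered (at-least-once).
                # Committing per message is simple but slow; batch commits in practice.
                self.consumer.commit()

            except Exception as e:
                # The failed event is skipped once a later commit advances the offset
                print(f"Error processing event: {e}")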
    
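    def replay_from_beginning(self):
        """Replay all events from beginning (Kappa advantage)."""
        print("Replaying all events from beginning...")
        # Partitions are assigned lazily on the first poll, and
        # seek_to_beginning() fails if nothing is assigned yet.
        while not self.consumer.assignment():
            self.consumer.poll(timeout_ms=100)
        self.consumer.seek_to_beginning()
        self.state_store.clear()  # Reset state
        self.run()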

# Usage
if __name__ == "__main__":
    processor = KappaProcessor(['localhost:9092'])
    
    # Normal real-time processing
    # processor.run()
    
    # Replay historical data (Kappa advantage)
    # processor.replay_from_beginning()
```

---

## **Data Pipeline Orchestration**

Apache Airflow is one of the most widely used tools for orchestrating data pipelines.

### **Airflow Architecture**

```
┌─────────────────────────────────────────────────────────────────┐
│                    APACHE AIRFLOW ARCHITECTURE                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                    Web Server (Flask)                    │   │
│  │  • UI for monitoring and manual triggers                │   │
│  │  • REST API                                             │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                   │
│                              ↓                                   │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                    Scheduler                              │   │
│  │  • Parses DAGs (Directed Acyclic Graphs)                  │   │
│  │  • Determines task execution order                        │   │
│  │  • Queues tasks to Executor                               │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                   │
│                              ↓                                   │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                    Executor                             │   │
│  │  • Executes tasks                                       │   │
│  │  Types:                                                 │   │
│  │  • SequentialExecutor (single process, dev only)        │   │
│  │  • LocalExecutor (parallel on single machine)          │   │
│  │  • CeleryExecutor (distributed, production)             │   │
│  │  • KubernetesExecutor (K8s native)                     │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                   │
│                              ↓                                   │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                    Workers                              │   │
│  │  (Celery workers or Kubernetes pods)                    │   │
│  │  • Execute actual tasks                                 │   │
│  │  • Report status back to metadata DB                    │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                    Metadata Database                    │   │
│  │  (PostgreSQL, MySQL)                                    │   │
│  │  • DAG definitions                                      │   │
│  │  • Task execution history                               │   │
│  │  • Task states (queued, running, success, failed)       │   │
│  │  • XComs (cross-communication between tasks)            │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
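The scheduler's job of turning a DAG into an execution order can be illustrated with Python's standard `graphlib` module (Python 3.9+). This is not Airflow's scheduler code, only a sketch of the underlying idea; the task names and the `execution_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (hypothetical pipeline)
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
    "notify": {"load"},
}

def execution_waves(deps):
    """Group tasks into waves; every task in a wave has all dependencies met."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock dependents
    return waves

waves = execution_waves(dependencies)
print(waves)
```

Tasks in the same wave have no dependency on each other, so an executor with enough workers can run them in parallel. The cycle check is the reason pipelines must be acyclic: a cycle would leave no task that is ever ready to run.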

**Airflow DAG Example:**

```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

# Default arguments for all tasks
default_args = {
    'owner': 'data-engineering',
    'depends_on_past': False,
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'execution_timeout': timedelta(hours=2),
}

# Define the DAG
dag = DAG(
    'daily_sales_etl',
    default_args=default_args,
    description='Daily sales data ETL pipeline',
    schedule_interval='0 2 * * *',  # Run at 2 AM daily (named `schedule` from Airflow 2.4)
    start_date=datetime(2024, 1, 1),  # Fixed, illustrative date; days_ago() is deprecated
    catchup=False,
    tags=['sales', 'etl', 'daily'],
    max_active_runs=1,  # Don't run multiple instances simultaneously
)

# Task 1: Extract from source database
def extract_sales_data(ds, **kwargs):
    """
    Extract yesterday's sales data.
    ds is the execution date (YYYY-MM-DD).
    """
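    from sqlalchemy import create_engine, text
    import pandas as pd
    
    # Connect to source database.
    # In production, load credentials from an Airflow Connection or a
    # secrets backend instead of hard-coding them in the DAG file.
    engine = create_engine('postgresql://user:pass@source-db:5432/sales')
    
    # Extract data for execution date (bound parameter, not string formatting)
    query = text("""
    SELECT 
        order_id,
        customer_id,
        product_id,
        quantity,
        price,
        order_date,
        status
    FROM orders
    WHERE DATE(order_date) = :ds
    """)
    
    df = pd.read_sql(query, engine, params={'ds': ds})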
    
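    # Save to temporary location for next task.
    # A local /tmp path only works when all tasks run on the same machine;
    # with Celery or Kubernetes executors, write to shared storage instead.
    output_path = f'/tmp/sales_raw_{ds}.csv'
    df.to_csv(output_path, index=False)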
    
    # Push to XCom for downstream tasks
    kwargs['ti'].xcom_push(key='raw_data_path', value=output_path)
    
    return f"Extracted {len(df)} records for {ds}"

# Airflow 2 passes the task context automatically; provide_context is deprecated
extract_task = PythonOperator(
    task_id='extract_sales',
    python_callable=extract_sales_data,
    dag=dag,
)

# Task 2: Transform data
def transform_sales_data(ds, **kwargs):
    """Clean and transform sales data."""
    import pandas as pd
    
    # Get input from previous task
    ti = kwargs['ti']
    input_path = ti.xcom_pull(task_ids='extract_sales', key='raw_data_path')
    
    # Read data
    df = pd.read_csv(input_path)
    
    # Transformations
    df['total_amount'] = df['quantity'] * df['price']
    df['discount'] = df.apply(calculate_discount, axis=1)
    df['final_amount'] = df['total_amount'] - df['discount']
    
    # Data quality checks
    if df['final_amount'].min() < 0:
        raise ValueError("Negative amounts detected!")
    
    # Save transformed data
    output_path = f'/tmp/sales_transformed_{ds}.parquet'
    df.to_parquet(output_path, index=False)
    
    # Push to XCom
    ti.xcom_push(key='transformed_path', value=output_path)
    
    return f"Transformed {len(df)} records"

def calculate_discount(row):
    """Calculate discount based on rules."""
    if row['quantity'] > 10:
        return row['total_amount'] * 0.1
    # customer_id may be parsed as a number when read back from CSV
    elif str(row['customer_id']).startswith('VIP'):
        return row['total_amount'] * 0.05
    return 0

transform_task = PythonOperator(
    task_id='transform_sales',
    python_callable=transform_sales_data,
    dag=dag,
)

# Task 3: Load to data warehouse (placeholder that only logs the date)
load_task = PythonOperator(
    task_id='load_to_warehouse',
    python_callable=lambda ds, **kwargs: print(f"Loading data for {ds}"),
    dag=dag,
)

# Task 4: Send notification
def send_success_notification(ds, **kwargs):
    """Send Slack notification on success."""
    import requests
    
    webhook_url = "https://hooks.slack.com/services/..."
    message = {
        "text": f"✅ Daily sales ETL completed successfully for {ds}"
    }
    
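    # Time out instead of hanging, and fail the task on a non-2xx response
    requests.post(webhook_url, json=message, timeout=10).raise_for_status()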
    return "Notification sent"

notify_task = PythonOperator(
    task_id='send_notification',
    python_callable=send_success_notification,
    trigger_rule='all_success',  # Default rule: run only if all upstream tasks succeed
    dag=dag,
)

# Define dependencies
extract_task >> transform_task >> load_task >> notify_task

# Alternative: Parallel tasks after extract
# extract_task >> [transform_task, another_task] >> load_task
```

---

## **OLTP vs OLAP**

Transactional and analytical workloads access data in very different ways, which is why they are usually served by different systems.

```
┌─────────────────────────────────────────────────────────────────┐
│                    OLTP VS OLAP                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  OLTP (Online Transaction Processing)                           │
│  ───────────────────────────────────────────────────────────   │
│                                                                 │
│  Purpose: Handle day-to-day transactions                        │
│  Examples: Order processing, banking transactions, user updates   │
│                                                                 │
│  Characteristics:                                               │
│  • High write volume                                            │
│  • Simple queries (single row lookups)                          │
│  • ACID compliance required                                     │
│  • Low latency for individual operations                        │
│  • Row-oriented storage                                         │
│                                                                 │
│  Technologies:                                                  │
│  • PostgreSQL, MySQL (relational)                               │
│  • DynamoDB, Cassandra (NoSQL)                                  │
│  • Spanner, CockroachDB (distributed SQL)                       │
│                                                                 │
│  ───────────────────────────────────────────────────────────   │
│                                                                 │
│  OLAP (Online Analytical Processing)                            │
│  ───────────────────────────────────────────────────────────   │
│                                                                 │
│  Purpose: Analyze historical data for business intelligence     │
│  Examples: Sales trends, user behavior analysis, reporting        │
│                                                                 │
│  Characteristics:                                               │
│  • High read volume                                             │
│  • Complex queries (aggregations, joins across tables)          │
│  • Eventual consistency acceptable                              │
│  • Higher latency acceptable (seconds to minutes)               │
│  • Column-oriented storage (efficient for aggregations)         │
│                                                                 │
│  Technologies:                                                  │
│  • Snowflake, BigQuery, Redshift (cloud warehouses)             │
│  • Apache Druid, ClickHouse (real-time analytics)             │
│  • Presto/Trino (federated query engine)                        │
│  • Apache Pinot (low-latency OLAP)                                │
│                                                                 │
│  ───────────────────────────────────────────────────────────   │
│                                                                 │
│  HYBRID APPROACHES:                                             │
│                                                                 │
│  HTAP (Hybrid Transactional/Analytical Processing):             │
│  • Single system for both OLTP and OLAP                         │
│  • Examples: TiDB, SingleStore, AlloyDB                           │
│  • Trade-off: Neither optimized as pure OLTP or OLAP              │
│                                                                 │
│  Lambda Architecture (from earlier):                              │
│  • Speed layer (stream) + Batch layer (historical)              │
│  • Separate systems optimized for each                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

**Columnar Storage Example (Why OLAP is Fast):**

```
Row-Oriented (OLTP - PostgreSQL):
┌─────────────────────────────────────────────────────────┐
│ ID │ Name  │ Age │ City     │ Salary │ Dept            │
├─────────────────────────────────────────────────────────┤
│ 1  │ Alice │ 30  │ NYC      │ 100000 │ Engineering     │
│ 2  │ Bob   │ 25  │ LA       │ 80000  │ Sales           │
│ 3  │ Carol │ 35  │ Chicago  │ 120000 │ Engineering     │
└─────────────────────────────────────────────────────────┘

Query: SELECT AVG(salary) FROM employees WHERE dept = 'Engineering'
Must read: All rows, then filter, then aggregate
I/O: Read entire table (inefficient for analytics)

Column-Oriented (OLAP - Parquet/ClickHouse):
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ ID            │  │ Name          │  │ Age           │
│ [1,2,3]       │  │ [A,B,C]       │  │ [30,25,35]    │
└───────────────┘  └───────────────┘  └───────────────┘
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ City          │  │ Salary (k)    │  │ Dept          │
│ [NYC,LA,Chi]  │  │ [100,80,120]  │  │ [Eng,Sal,Eng] │
└───────────────┘  └───────────────┘  └───────────────┘

Query: SELECT AVG(salary) FROM employees WHERE dept = 'Engineering'
1. Read only 'Dept' column (compressed, fast)
2. Find positions where Dept = 'Engineering' (positions 0, 2)
3. Read only 'Salary' column at positions 0, 2
4. Calculate average

I/O: Read 2 columns only (much faster!)
Compression: Columnar data compresses better (similar values together)
Vectorization: Values of one column sit contiguously, so the CPU can process many at once (SIMD)
```
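The two access patterns can be compared in a few lines of Python using the same three employees. An in-memory sketch like this cannot show the disk I/O savings; it only illustrates which fields each layout has to touch. The function names are assumptions for the example.

```python
rows = [
    {"id": 1, "name": "Alice", "age": 30, "city": "NYC", "salary": 100000, "dept": "Engineering"},
    {"id": 2, "name": "Bob", "age": 25, "city": "LA", "salary": 80000, "dept": "Sales"},
    {"id": 3, "name": "Carol", "age": 35, "city": "Chicago", "salary": 120000, "dept": "Engineering"},
]

def avg_salary_row_store(rows, dept):
    """Row layout: whole records are loaded even though two fields are needed."""
    matching = [r["salary"] for r in rows if r["dept"] == dept]
    return sum(matching) / len(matching)

# Pivot the same data into one list per column
columns = {name: [r[name] for r in rows] for name in rows[0]}

def avg_salary_column_store(columns, dept):
    """Column layout: scan 'dept' for positions, then read 'salary' at those positions."""
    positions = [i for i, d in enumerate(columns["dept"]) if d == dept]
    salaries = [columns["salary"][i] for i in positions]
    return sum(salaries) / len(salaries)

print(avg_salary_row_store(rows, "Engineering"))
print(avg_salary_column_store(columns, "Engineering"))
```

Both functions return the same average. The column version never reads `name`, `age`, or `city`, which is where the savings come from once those columns live in separate files or pages on disk.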

---

## **Chapter Summary**

### **Key Takeaways**

| Concept | Summary |
|---------|---------|
| **Batch Processing** | MapReduce is foundational; Spark is modern standard; process historical data |
| **Stream Processing** | Flink for complex event processing; Kafka Streams for simpler cases; real-time |
| **Lambda Architecture** | Separate speed (stream) and batch layers; complex but accurate |
| **Kappa Architecture** | Single stream pipeline; simpler; replay log for reprocessing |
| **Data Orchestration** | Airflow for workflow management; DAGs define dependencies |
| **OLTP vs OLAP** | OLTP for transactions (row-oriented); OLAP for analytics (column-oriented) |
| **Data Warehouses** | Snowflake, BigQuery, Redshift for large-scale analytics |

### **Technology Selection Guide**

| Use Case | Recommended Technology |
|----------|------------------------|
| Batch ETL | Apache Spark, AWS Glue |
| Real-time streaming | Apache Flink, Kafka Streams |
| Simple workflows | AWS Step Functions |
| Complex DAGs | Apache Airflow, Prefect |
| Ad-hoc analytics | Presto/Trino, BigQuery |
| Real-time OLAP | Apache Druid, ClickHouse |
| Data warehouse | Snowflake, Redshift, BigQuery |
| Event storage | Kafka, AWS Kinesis |

---

**Next:** In Chapter 16, we'll explore **Real-World System Design Case Studies**, applying everything we've learned to design systems like a URL Shortener, Twitter News Feed, Chat Application, Video Streaming Service, and more.


