# Lesson 2: ETL Pipelines

**Module 3: Data & Pipeline Engineering**  
**Estimated Time**: 2 hours  
**Difficulty**: Intermediate

---

## 🎯 Learning Objectives

By the end of this lesson, you will:

✅ Understand ETL (Extract, Transform, Load) vs ELT  
✅ Master **Idempotency** (the most important pipeline concept)  
✅ Build a resilient Batch Pipeline in Python  
✅ Answer interview questions on pipeline design  

---

## 📚 Table of Contents

1. [ETL vs ELT](#1-etl-elt)
2. [The Golden Rule: Idempotency](#2-idempotency)
3. [Hands-On: Robust Pipeline](#3-hands-on)
4. [Interview Preparation](#4-interview-questions)

---

## 1. ETL vs ELT

### ETL (Extract, Transform, Load)
- **Order**: Read data → Process in memory (Python/Spark) → Write to DB.
- **Use case**: Complex transformations, privacy filtering (PII) before storage.

### ELT (Extract, Load, Transform)
- **Order**: Dump raw data to DB/Warehouse → Transform using SQL (dbt).
- **Use case**: Modern Data Stack (Snowflake/BigQuery). Compute is cheap inside the warehouse.

**ML Context**: We mostly do **ETL** because complex feature engineering is easier in Python/Spark than SQL.
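
To make the difference in ordering concrete, here is a minimal sketch that runs both patterns against an in-memory SQLite database. SQLite stands in for a real warehouse, and the table names, `RAW_ROWS`, `run_etl`, and `run_elt` are all hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical raw records, standing in for an API or log source.
RAW_ROWS = [(1, 10), (2, 20), (3, 30)]


def run_etl(conn):
    """ETL: transform in Python first, then load only the finished rows."""
    transformed = [(row_id, val, val / 100.0) for row_id, val in RAW_ROWS]
    conn.execute(
        "CREATE TABLE features_etl (id INTEGER, val INTEGER, val_norm REAL)"
    )
    conn.executemany("INSERT INTO features_etl VALUES (?, ?, ?)", transformed)


def run_elt(conn):
    """ELT: load the raw rows as-is, then transform inside the database."""
    conn.execute("CREATE TABLE raw_events (id INTEGER, val INTEGER)")
    conn.executemany("INSERT INTO raw_events VALUES (?, ?)", RAW_ROWS)
    conn.execute(
        "CREATE TABLE features_elt AS "
        "SELECT id, val, val / 100.0 AS val_norm FROM raw_events"
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)

etl_rows = conn.execute("SELECT * FROM features_etl ORDER BY id").fetchall()
elt_rows = conn.execute("SELECT * FROM features_elt ORDER BY id").fetchall()
print(etl_rows)
print(elt_rows)
```

Both paths produce the same feature table. The difference is where the transformation runs and whether the raw data is kept in the database afterwards (`raw_events` exists only in the ELT path).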

## 2. The Golden Rule: Idempotency

**Definition**: An operation is *idempotent* if running it multiple times yields the **same result** as running it once.

**Why it matters**:
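Pipelines fail, and you will need to retry them. If re-running a partially completed job changes the outcome (for example, by writing the same rows twice), every failure turns into a manual cleanup task. An idempotent pipeline can simply be run again.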

### Example: Bad (Not Idempotent)
```python
# The source holds 10 records.
# The failing run appends 5 of them, then crashes.
# The retry appends all 10 again.
# Total = 15 records, 5 of which are duplicates!
def process_data():
    new_data = read()
    database.append(new_data) 
```

### Example: Good (Idempotent)
```python
# Failing run writes partition '2023-01-01', then crashes.
# Retry OVERWRITES partition '2023-01-01'.
# Result is correct.
def process_data(date):
    new_data = read(date)
    database.overwrite_partition(date, new_data)
```
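
The two snippets above are pseudocode (`read` and `database` are not defined). As a runnable sketch of the same contrast, the version below uses plain Python containers as hypothetical stand-ins for the source and the storage layer, then simulates a first run followed by a retry.

```python
# Hypothetical in-memory stand-ins for a source system and two storage layouts.
SOURCE = {"2023-01-01": [{"id": 1, "val": 10}, {"id": 2, "val": 20}]}

append_table = []        # one flat table; rows are appended
partitioned_table = {}   # one entry per date; replaced wholesale


def append_load(date):
    """Not idempotent: every call adds the day's rows again."""
    append_table.extend(SOURCE[date])


def overwrite_load(date):
    """Idempotent: every call replaces the day's partition with the same rows."""
    partitioned_table[date] = list(SOURCE[date])


# Simulate a first run followed by a retry of the same date.
for _ in range(2):
    append_load("2023-01-01")
    overwrite_load("2023-01-01")

print(len(append_table))                     # 4 rows: each record appears twice
print(len(partitioned_table["2023-01-01"]))  # 2 rows: same as a single run
```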

## 3. Hands-On: Robust Pipeline

Let's build a mini-pipeline that handles failures gracefully.

```python
import pandas as pd
import os
import shutil

# Simulation: Source Data (Daily Logs)
source_data = {
    '2023-01-01': pd.DataFrame({'id': [1, 2], 'val': [10, 20]}),
    '2023-01-02': pd.DataFrame({'id': [3, 4], 'val': [30, 40]})
}

OUTPUT_DIR = "data_lake/processed"

def extract(date):
    print(f"[Extract] Reading source for {date}...")
    # Simulate API call or DB read
    return source_data.get(date)

def transform(df):
    if df is None: return None
    print("[Transform] Normalizing values...")
    df = df.copy()
    df['val_norm'] = df['val'] / 100.0
    return df

def load(df, date):
    if df is None: return

    # IDEMPOTENCY KEY: Partition by Date
    # Instead of appending to one big file, we write specific files per day.
    # If we re-run this function, we just overwrite the file.

    target_path = f"{OUTPUT_DIR}/date={date}"

    # Ensure clean slate for this partition
    if os.path.exists(target_path):
        shutil.rmtree(target_path)
    os.makedirs(target_path)

    file_path = f"{target_path}/data.parquet"
    print(f"[Load] Writing to {file_path}...")
    df.to_parquet(file_path)

def run_pipeline(date):
    try:
        print(f"\n--- Starting Pipeline for {date} ---")
        raw = extract(date)
        processed = transform(raw)
        load(processed, date)
        print("✅ Success!")
    except Exception as e:
        print(f"❌ Failed: {e}")

# Run for Day 1
run_pipeline('2023-01-01')

# Run for Day 1 AGAIN (Should be safe!)
run_pipeline('2023-01-01')

# Run for Day 2
run_pipeline('2023-01-02')
```
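
The pipeline above reports a failure but does not retry it. One possible way to add retries is a small wrapper with exponential backoff, sketched below. `with_retries`, `flaky_load`, and the in-memory `store` are hypothetical names for illustration, and the wrapper assumes the wrapped task raises on failure rather than swallowing the exception. Retrying is only safe here because the load step overwrites its partition.

```python
import time


def with_retries(task, max_attempts=3, base_delay=0.01):
    """Call task() until it succeeds or max_attempts is reached.

    Hypothetical helper for illustration. Waits base_delay * 2**attempt
    seconds between attempts (exponential backoff).
    """
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            print(f"Attempt {attempt + 1} failed ({exc}); retrying...")
            time.sleep(base_delay * 2 ** attempt)


# Simulated partition store and a load step that crashes on its first call.
store = {}
calls = {"count": 0}


def flaky_load():
    calls["count"] += 1
    store["2023-01-01"] = [10, 20]  # overwrite, so repeating it is harmless
    if calls["count"] == 1:
        raise ConnectionError("simulated network failure")
    return len(store["2023-01-01"])


rows_written = with_retries(flaky_load)
print(rows_written, calls["count"])
```

The first call fails after writing, the second succeeds, and the store still holds a single copy of the partition.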

## 4. Interview Preparation

### Common Questions

#### Q1: "How do you handle pipeline failures?"
**Answer**: 
1. **Idempotency**: Ensure retries don't duplicate data.
2. **Atomic Writes**: Write to a temp folder, then swap/rename at the end.
3. **Checkpointing**: In streaming, commit offsets only after processing.
4. **Alerting**: PagerDuty/Slack alerts on failure.
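
The atomic-write point (item 2) can be sketched with the standard library alone. This is a minimal illustration, not a production implementation: `atomic_write_json` is a hypothetical helper, JSON stands in for whatever file format the pipeline uses, and the atomicity of `os.replace` relies on the temp file and the target being on the same filesystem.

```python
import json
import os
import tempfile


def atomic_write_json(records, final_path):
    """Write records to a temp file, then rename it over final_path.

    Hypothetical helper. The temp file is created next to the target so
    that os.replace stays on one filesystem.
    """
    target_dir = os.path.dirname(final_path) or "."
    os.makedirs(target_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            json.dump(records, tmp_file)
        os.replace(tmp_path, final_path)  # the swap: readers see old or new
    except BaseException:
        # Clean up the partial temp file, then re-raise the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as lake_dir:
    out_path = os.path.join(lake_dir, "date=2023-01-01", "data.json")
    atomic_write_json([{"id": 1, "val": 10}], out_path)
    atomic_write_json([{"id": 1, "val": 10}], out_path)  # retry is safe
    with open(out_path) as f:
        result = json.load(f)
    leftovers = [
        name for name in os.listdir(os.path.dirname(out_path))
        if name.endswith(".tmp")
    ]

print(result, leftovers)
```

A crash before `os.replace` leaves the previous file untouched, so downstream readers never see a half-written partition.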

#### Q2: "What is a DAG?"
**Answer**: "Directed Acyclic Graph. It models a workflow as tasks (nodes) and dependencies (edges): Task A (Extract) must finish before Task B (Transform) starts. Because the graph has no cycles, a valid execution order always exists. Airflow uses DAGs to manage dependencies and execution order."
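
To show how dependencies determine execution order, here is a small sketch using `graphlib` from the Python standard library (3.9+). It is not Airflow's API. The task names are hypothetical and mirror the extract/transform/load steps from this lesson.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

order = list(TopologicalSorter(pipeline).static_order())
print(order)  # ['extract', 'transform', 'load']

# Making extract depend on load creates a cycle,
# so no valid execution order exists.
cyclic = dict(pipeline, extract={"load"})
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)  # True
```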