# Module 05: Building ETL Pipelines

**Estimated Time:** 60-75 minutes

## Learning Objectives

By the end of this module, you will be able to:
- Design modular ETL pipeline components
- Implement configuration-driven pipelines
- Add comprehensive logging and error handling
- Build reusable pipeline patterns
- Test and validate pipelines
- Apply best practices for production pipelines

---

## 1. ETL Pipeline Design Principles

### Core Principles

1. **Modularity**: Separate concerns (Extract, Transform, Load)
2. **Reusability**: Write components that can be reused
3. **Configurability**: Use config files, not hardcoded values
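4. **Idempotency**: Re-running the pipeline on the same input produces the same output, with no duplicated data
5. **Observability**: Log each stage's inputs, record counts, and failures so runs can be debugged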
6. **Error Handling**: Fail gracefully, retry when appropriate
7. **Testing**: Validate inputs and outputs
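
Of these principles, idempotency is the one that most often needs a concrete pattern. Below is a minimal sketch, assuming the destination is a single local file: it writes to a temporary file and then swaps it into place with `os.replace` (an atomic rename on POSIX systems), so a rerun overwrites the output instead of appending to it or leaving it half-written. The helper name `write_atomically` and the file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(path, text):
    """Write text to path so reruns overwrite rather than append (hypothetical helper)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)

    # Write to a temporary file in the same directory, then swap it into place.
    # os.replace overwrites an existing target, so a rerun leaves exactly one copy.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the partial temporary file if anything went wrong
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return path


# Usage: running the same load twice leaves the same content, not duplicated rows
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "out" / "result.csv"
    write_atomically(target, "id,amount\n1,75.0\n")
    write_atomically(target, "id,amount\n1,75.0\n")
    final_content = target.read_text(encoding="utf-8")
    leftover_files = sorted(p.name for p in target.parent.iterdir())
```

The same idea applies to other destinations: prefer "replace" or "upsert by key" semantics over blind appends.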

### Pipeline Architecture

```
Config → Extract → Validate → Transform → Validate → Load → Log
           ↓          ↓           ↓          ↓        ↓      ↓
        Logging   Error      Logging    Error    Logging  Metrics
                  Handling               Handling
```

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime
import logging
import json
import yaml
from pathlib import Path

# Setup logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

print("[OK] Libraries loaded and logging configured")

---

## 2. Building Modular Pipeline Components

In [None]:
class Extractor:
    """
    Handles data extraction from various sources
    """

    def __init__(self, source_config):
        self.config = source_config
        self.logger = logging.getLogger(f"{__name__}.Extractor")

    def extract(self):
        """
        Extract data based on configuration
        """
        source_type = self.config.get("type")
        self.logger.info(f"Extracting from {source_type}")

        try:
            if source_type == "csv":
                df = pd.read_csv(self.config["path"])
            elif source_type == "json":
                df = pd.read_json(self.config["path"])
            elif source_type == "api":
                # Simulated API extraction
                df = self._extract_from_api()
            else:
                raise ValueError(f"Unsupported source type: {source_type}")

            self.logger.info(f"Extracted {len(df):,} records")
            return df

        except Exception as e:
            self.logger.error(f"Extraction failed: {e}")
            raise

    def _extract_from_api(self):
        """Simulated API extraction"""
        return pd.DataFrame({"id": range(100), "value": np.random.randn(100)})


class Transformer:
    """
    Handles data transformation
    """

    def __init__(self, transform_config):
        self.config = transform_config
        self.logger = logging.getLogger(f"{__name__}.Transformer")

    def transform(self, df):
        """
        Apply transformations based on configuration
        """
        self.logger.info("Starting transformation")
        df_transformed = df.copy()

        try:
            # Apply configured transformations
            for transformation in self.config.get("steps", []):
                operation = transformation["operation"]

                if operation == "drop_nulls":
                    df_transformed = df_transformed.dropna()
                elif operation == "add_column":
                    col_name = transformation["name"]
                    col_value = transformation["value"]
                    df_transformed[col_name] = col_value
                elif operation == "filter":
                    column = transformation["column"]
                    condition = transformation["condition"]
                    value = transformation["value"]

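                    if condition == ">":
                        df_transformed = df_transformed[df_transformed[column] > value]
                    elif condition == "<":
                        df_transformed = df_transformed[df_transformed[column] < value]
                    else:
                        # Fail loudly instead of silently skipping the filter
                        raise ValueError(f"Unsupported filter condition: {condition}")
                else:
                    raise ValueError(f"Unsupported operation: {operation}")

                self.logger.info(f"Applied {operation}")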

            self.logger.info(f"Transformation complete: {len(df_transformed):,} records")
            return df_transformed

        except Exception as e:
            self.logger.error(f"Transformation failed: {e}")
            raise


class Loader:
    """
    Handles data loading to destinations
    """

    def __init__(self, destination_config):
        self.config = destination_config
        self.logger = logging.getLogger(f"{__name__}.Loader")

    def load(self, df):
        """
        Load data based on configuration
        """
        dest_type = self.config.get("type")
        self.logger.info(f"Loading to {dest_type}")

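        try:
            # Create the destination directory if it does not exist yet
            Path(self.config["path"]).parent.mkdir(parents=True, exist_ok=True)

            if dest_type == "csv":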
                df.to_csv(self.config["path"], index=False)
            elif dest_type == "parquet":
                df.to_parquet(self.config["path"], compression="snappy", index=False)
            elif dest_type == "json":
                df.to_json(self.config["path"], orient="records", indent=2)
            else:
                raise ValueError(f"Unsupported destination type: {dest_type}")

            self.logger.info(f"Loaded {len(df):,} records to {self.config['path']}")
            return True

        except Exception as e:
            self.logger.error(f"Loading failed: {e}")
            raise


print("[OK] Pipeline components defined")
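
The `filter` step in `Transformer` handles only `>` and `<`. One way to support more conditions without a growing `if`/`elif` chain is a lookup table built on the standard `operator` module. This is a sketch and not part of the classes above; `CONDITIONS` and `apply_condition` are hypothetical names.

```python
import operator

# Map the condition strings used in the config to comparison functions.
# The extra entries are illustrative; the Transformer above only knows ">" and "<".
CONDITIONS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def apply_condition(left, condition, value):
    """Compare left against value using the operator named by condition."""
    try:
        compare = CONDITIONS[condition]
    except KeyError:
        raise ValueError(f"Unsupported filter condition: {condition}") from None
    return compare(left, value)


# Usage with plain numbers
amounts = [12.5, 50.0, 87.3]
kept = [a for a in amounts if apply_condition(a, ">=", 50)]
```

Because pandas Series overload the comparison operators, `apply_condition(df[column], condition, value)` returns a boolean mask that can be used to index the DataFrame.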

---

## 3. Configuration-Driven Pipeline

In [None]:
# Create sample data source
sample_data = pd.DataFrame(
    {
        "id": range(100),
        "value": np.random.randn(100),
        "category": np.random.choice(["A", "B", "C"], 100),
        "amount": np.random.uniform(10, 100, 100),
    }
)

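# Make sure the target directory exists before writing
Path("../data/raw").mkdir(parents=True, exist_ok=True)
sample_data.to_csv("../data/raw/pipeline_source.csv", index=False)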
print("[OK] Sample source data created")

In [None]:
# Define pipeline configuration
pipeline_config = {
    "name": "sample_etl_pipeline",
    "version": "1.0.0",
    "source": {"type": "csv", "path": "../data/raw/pipeline_source.csv"},
    "transformations": {
        "steps": [
            {"operation": "drop_nulls"},
            {"operation": "filter", "column": "amount", "condition": ">", "value": 50},
            {
                "operation": "add_column",
                "name": "processed_at",
                # Evaluated once, when this config is defined, not when the pipeline runs
                "value": datetime.now().isoformat(),
            },
        ]
    },
    "destination": {"type": "parquet", "path": "../data/processed/pipeline_output.parquet"},
}

print("Pipeline Configuration:")
print(json.dumps(pipeline_config, indent=2, default=str))
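
The configuration above lives in a Python dict, but the same structure can be stored in a file and loaded at run time, in line with the Configurability principle. A minimal sketch using JSON from the standard library follows. The function name `load_pipeline_config` and the file name `pipeline.json` are assumptions; a YAML file could be read the same way with PyYAML's `yaml.safe_load`.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("name", "source", "transformations", "destination")


def load_pipeline_config(path):
    """Read a JSON config file and check that the top-level keys are present."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)

    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"Config {path} is missing keys: {missing}")
    return config


# Usage: write a small config to a temporary file, then load it back
example_config = {
    "name": "sample_etl_pipeline",
    "source": {"type": "csv", "path": "../data/raw/pipeline_source.csv"},
    "transformations": {"steps": [{"operation": "drop_nulls"}]},
    "destination": {"type": "parquet", "path": "../data/processed/pipeline_output.parquet"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "pipeline.json"
    config_path.write_text(json.dumps(example_config, indent=2), encoding="utf-8")
    loaded = load_pipeline_config(config_path)
```

Run-specific values such as a `processed_at` timestamp are usually better computed when the pipeline runs than stored in the file, where they would go stale.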

---

## 4. Complete ETL Pipeline Class

In [None]:
class ETLPipeline:
    """
    Complete ETL Pipeline orchestrator
    """

    def __init__(self, config):
        self.config = config
        self.logger = logging.getLogger(f"{__name__}.ETLPipeline")
        self.metrics = {
            "start_time": None,
            "end_time": None,
            "duration_seconds": None,
            "records_extracted": 0,
            "records_transformed": 0,
            "records_loaded": 0,
            "status": "pending",
        }

    def run(self):
        """
        Execute the complete ETL pipeline
        """
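        self.metrics["start_time"] = datetime.now()
        # Clear any error left over from a previous attempt on this instance
        self.metrics.pop("error", None)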
        self.logger.info(f"Starting pipeline: {self.config['name']}")

        try:
            # Extract
            extractor = Extractor(self.config["source"])
            df = extractor.extract()
            self.metrics["records_extracted"] = len(df)

            # Validate extraction
            self._validate_data(df, stage="extract")

            # Transform
            transformer = Transformer(self.config["transformations"])
            df_transformed = transformer.transform(df)
            self.metrics["records_transformed"] = len(df_transformed)

            # Validate transformation
            self._validate_data(df_transformed, stage="transform")

            # Load
            loader = Loader(self.config["destination"])
            loader.load(df_transformed)
            self.metrics["records_loaded"] = len(df_transformed)

            # Success
            self.metrics["status"] = "success"
            self.metrics["end_time"] = datetime.now()
            self.metrics["duration_seconds"] = (
                self.metrics["end_time"] - self.metrics["start_time"]
            ).total_seconds()

            self.logger.info("Pipeline completed successfully")
            self._log_metrics()

            return True, self.metrics

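        except Exception as e:
            self.metrics["status"] = "failed"
            self.metrics["error"] = str(e)
            self.metrics["end_time"] = datetime.now()
            # Record the duration for failed runs as well
            self.metrics["duration_seconds"] = (
                self.metrics["end_time"] - self.metrics["start_time"]
            ).total_seconds()

            self.logger.error(f"Pipeline failed: {e}")
            return False, self.metrics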

    def _validate_data(self, df, stage):
        """
        Validate data at each stage
        """
        if df is None or len(df) == 0:
            raise ValueError(f"{stage}: No data to process")

        self.logger.info(f"{stage}: Validation passed ({len(df):,} records)")

    def _log_metrics(self):
        """
        Log pipeline execution metrics
        """
        self.logger.info("=" * 60)
        self.logger.info("PIPELINE METRICS")
        self.logger.info("=" * 60)
        self.logger.info(f"Pipeline: {self.config['name']}")
        self.logger.info(f"Status: {self.metrics['status']}")
        self.logger.info(f"Duration: {self.metrics['duration_seconds']:.2f}s")
        self.logger.info(f"Records Extracted: {self.metrics['records_extracted']:,}")
        self.logger.info(f"Records Transformed: {self.metrics['records_transformed']:,}")
        self.logger.info(f"Records Loaded: {self.metrics['records_loaded']:,}")
        self.logger.info("=" * 60)


print("[OK] ETL Pipeline class defined")

In [None]:
# Run the pipeline
pipeline = ETLPipeline(pipeline_config)
success, metrics = pipeline.run()

if success:
    print("\n[OK] Pipeline executed successfully!")
else:
    print("\n[FAIL] Pipeline execution failed")

print("\nMetrics:")
for key, value in metrics.items():
    print(f"  {key}: {value}")
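
`_validate_data` only checks that the DataFrame is non-empty. A stage-level check can also confirm that the expected columns are present. The sketch below works on any iterable of column names (for example `df.columns`); `check_required_columns` and the required-column lists are hypothetical.

```python
def check_required_columns(columns, required, stage):
    """Raise ValueError if any required column is missing at the given stage."""
    present = set(columns)
    missing = [name for name in required if name not in present]
    if missing:
        raise ValueError(f"{stage}: missing required columns: {missing}")
    return True


# Usage: the column names mirror the sample data used in this module
extracted_columns = ["id", "value", "category", "amount"]
ok = check_required_columns(extracted_columns, ["id", "amount"], stage="extract")

try:
    check_required_columns(extracted_columns, ["id", "processed_at"], stage="extract")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Raising `ValueError` keeps the check compatible with the `try`/`except` in `ETLPipeline.run()`, which turns any exception into a failed run with an `error` entry in the metrics.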

---

## 5. Error Handling and Recovery

In [None]:
class RobustETLPipeline(ETLPipeline):
    """
    ETL Pipeline with advanced error handling and retry logic
    """

    def __init__(self, config, max_retries=3):
        super().__init__(config)
        self.max_retries = max_retries

    def run_with_retry(self):
        """
        Run pipeline with retry logic
        """
        for attempt in range(1, self.max_retries + 1):
            self.logger.info(f"Attempt {attempt}/{self.max_retries}")

            success, metrics = self.run()

            if success:
                return True, metrics

            if attempt < self.max_retries:
                self.logger.warning(f"Attempt {attempt} failed, retrying...")
                # Could add exponential backoff here

        self.logger.error(f"Pipeline failed after {self.max_retries} attempts")
        # Use self.metrics so this also works when max_retries < 1 and the loop never runs
        return False, self.metrics


print("[OK] Robust pipeline with retry logic defined")
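
The retry loop above retries immediately. For transient failures, such as a briefly unavailable source, waiting longer between attempts is usually gentler on the upstream system. Here is a sketch of exponential backoff as a standalone function. `retry_with_backoff`, `base_delay`, and the injectable `sleep` parameter are assumptions made for illustration, and the task is expected to return a `(success, result)` pair like `ETLPipeline.run()` does.

```python
import time


def retry_with_backoff(task, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call task() until it reports success, doubling the wait after each failure.

    task must return a (success, result) tuple. Hypothetical helper.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")

    result = None
    for attempt in range(1, max_retries + 1):
        success, result = task()
        if success:
            return True, result
        if attempt < max_retries:
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return False, result


# Usage: a fake task that fails twice, then succeeds. sleep is replaced by a
# recorder so the example runs instantly.
calls = {"count": 0}
waits = []


def flaky_task():
    calls["count"] += 1
    return calls["count"] >= 3, {"attempt": calls["count"]}


succeeded, outcome = retry_with_backoff(flaky_task, base_delay=0.5, sleep=waits.append)
```

With a pipeline instance this could be called as `retry_with_backoff(pipeline.run)`. Retrying only makes sense for failures that may resolve on their own; a missing column or bad config will fail the same way on every attempt.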

---

## 6. Pipeline Testing

In [None]:
def test_pipeline(pipeline_config):
    """
    Test pipeline with sample data
    """
    print("Testing Pipeline...")
    print("=" * 60)

    # Test 1: Configuration validation
    required_keys = ["name", "source", "transformations", "destination"]
    for key in required_keys:
        assert key in pipeline_config, f"Missing required config key: {key}"
    print("[OK] Test 1: Configuration valid")

    # Test 2: Source file exists
    if pipeline_config["source"]["type"] in ["csv", "json"]:
        source_path = Path(pipeline_config["source"]["path"])
        assert source_path.exists(), f"Source file not found: {source_path}"
    print("[OK] Test 2: Source accessible")

    # Test 3: Run pipeline
    pipeline = ETLPipeline(pipeline_config)
    success, metrics = pipeline.run()
    assert success, "Pipeline execution failed"
    print("[OK] Test 3: Pipeline executed successfully")

    # Test 4: Output file created
    output_path = Path(pipeline_config["destination"]["path"])
    assert output_path.exists(), "Output file not created"
    print("[OK] Test 4: Output file created")

    # Test 5: Validate record count
    assert metrics["records_loaded"] > 0, "No records loaded"
    print(f"[OK] Test 5: Loaded {metrics['records_loaded']:,} records")

    print("=" * 60)
    print("All tests passed!")


# Run tests
test_pipeline(pipeline_config)
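
`test_pipeline` is an integration test: it runs the whole pipeline against real paths. Component-level unit tests are usually written against small in-memory inputs instead. A sketch using the standard `unittest` module follows. The `filter_rows` function is a hypothetical stand-in for a single transformation step, written on plain dicts so the example has no dependencies.

```python
import unittest


def filter_rows(rows, column, minimum):
    """Keep rows whose column value is strictly greater than minimum (stand-in step)."""
    return [row for row in rows if row[column] > minimum]


class FilterRowsTest(unittest.TestCase):
    def setUp(self):
        self.rows = [
            {"id": 1, "amount": 25.0},
            {"id": 2, "amount": 50.0},
            {"id": 3, "amount": 75.0},
        ]

    def test_keeps_only_rows_above_threshold(self):
        result = filter_rows(self.rows, "amount", 50)
        self.assertEqual([row["id"] for row in result], [3])

    def test_does_not_modify_input(self):
        filter_rows(self.rows, "amount", 50)
        self.assertEqual(len(self.rows), 3)

    def test_missing_column_raises(self):
        with self.assertRaises(KeyError):
            filter_rows(self.rows, "price", 50)


# Run the tests without exiting the interpreter (useful inside a notebook)
suite = unittest.defaultTestLoader.loadTestsFromTestCase(FilterRowsTest)
test_result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The same structure carries over to the real components: build a small DataFrame in `setUp`, call `Transformer(config).transform(df)`, and assert on the result.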

---

## 7. Best Practices Summary

### Design
- [OK] Modular components (Extract, Transform, Load)
- [OK] Configuration-driven (not hardcoded)
- [OK] Reusable and maintainable code

### Reliability
- [OK] Comprehensive error handling
- [OK] Retry logic for transient failures
- [OK] Data validation at each stage

### Observability
- [OK] Detailed logging throughout
- [OK] Metrics collection and reporting
- [OK] Clear success/failure indicators

### Testing
- [OK] Unit tests for components
- [OK] Integration tests for full pipeline
- [OK] Data quality validation

### Documentation
- [OK] Clear code comments
- [OK] Configuration documentation
- [OK] README with setup instructions

---

## Next Steps

In **Module 06: Introduction to Apache Spark**, we'll:
- Learn distributed data processing
- Work with PySpark DataFrames
- Understand when to use Spark vs pandas
- Build Spark-based data pipelines

---

**Ready for big data processing?** Open `06_introduction_to_apache_spark.ipynb`!