# Module 09: End-to-End Pipeline Project

**Estimated Time:** 90-120 minutes

## Learning Objectives

By the end of this module, you will:
- Build a complete production-ready data pipeline
- Apply all concepts from previous modules
- Implement extraction, transformation, validation, and loading
- Add comprehensive logging and error handling
- Test and document the pipeline
- Understand deployment considerations

---

## Project: E-Commerce Sales Analytics Pipeline

### Business Requirements

Build a daily ETL pipeline that:
1. **Extracts** sales data from multiple sources (CSV files, API)
2. **Transforms** and enriches the data
3. **Validates** data quality
4. **Loads** to a data warehouse (Parquet files)
5. **Generates** summary reports

### Data Sources

- **Orders**: Customer orders (CSV)
- **Products**: Product catalog (CSV)
- **Customers**: Customer information (API simulation)

### Success Criteria

- [OK] All data sources successfully extracted
- [OK] Data quality checks pass (>95% quality score)
- [OK] Output matches expected schema
- [OK] Pipeline completes in < 5 minutes
- [OK] Comprehensive logging
- [OK] Error handling and recovery
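
Several of these criteria can be checked programmatically once a run has finished. The sketch below is one possible way to do that, assuming a metrics dictionary with the same keys the pipeline class in this module records (`records_extracted`, `quality_score`, `duration_seconds`, `status`). The helper name `check_success_criteria` and its default thresholds are illustrative, not part of the course code.

```python
def check_success_criteria(metrics, min_quality=95.0, max_seconds=300.0):
    """Map each success criterion to True/False for one pipeline run.

    `metrics` is assumed to follow the layout of the pipeline's metrics dict.
    """
    expected_sources = ("orders", "products", "customers")
    extracted = metrics.get("records_extracted", {})
    duration = metrics.get("duration_seconds")
    return {
        # Every expected source produced at least one record
        "all_sources_extracted": all(extracted.get(s, 0) > 0 for s in expected_sources),
        # Quality score strictly above the 95% threshold
        "quality_above_threshold": metrics.get("quality_score", 0.0) > min_quality,
        # Run finished, and in under 5 minutes (300 seconds)
        "finished_in_time": duration is not None and duration < max_seconds,
        "status_success": metrics.get("status") == "success",
    }


example_metrics = {
    "records_extracted": {"orders": 1000, "products": 50, "customers": 200},
    "quality_score": 100.0,
    "duration_seconds": 1.2,
    "status": "success",
}
print(check_success_criteria(example_metrics))
```

Schema conformance, logging, and error handling are not covered by this check; they need separate tests.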

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import logging
import json
from pathlib import Path
import time

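# Create the data directories if they do not exist yet
# (logging.FileHandler fails if the log file's parent directory is missing)
Path("../data/raw").mkdir(parents=True, exist_ok=True)
Path("../data/processed").mkdir(parents=True, exist_ok=True)

# Setup logging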
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    handlers=[logging.FileHandler("../data/processed/pipeline.log"), logging.StreamHandler()],
)
logger = logging.getLogger("SalesPipeline")

print("[OK] Libraries loaded and logging configured")

---

## 1. Generate Sample Data Sources

In [None]:
# Generate sample orders data
np.random.seed(42)

n_orders = 1000
orders = pd.DataFrame(
    {
        "order_id": range(1, n_orders + 1),
        "customer_id": np.random.randint(1, 201, n_orders),
        "product_id": np.random.randint(1, 51, n_orders),
        "quantity": np.random.randint(1, 10, n_orders),
        "order_date": [
            datetime(2024, 1, 1) + timedelta(days=np.random.randint(0, 90)) for _ in range(n_orders)
        ],
        "status": np.random.choice(
            ["pending", "processing", "shipped", "delivered", "cancelled"], n_orders
        ),
    }
)

orders.to_csv("../data/raw/orders.csv", index=False)
print(f"[OK] Generated {len(orders):,} orders")

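# Generate sample products data
# Note: cost is drawn independently of price, so a product's cost can exceed
# its price and some orders can show a negative profit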
products = pd.DataFrame(
    {
        "product_id": range(1, 51),
        "product_name": [f"Product {i}" for i in range(1, 51)],
        "category": np.random.choice(["Electronics", "Clothing", "Home", "Books", "Sports"], 50),
        "price": np.random.uniform(10, 500, 50).round(2),
        "cost": np.random.uniform(5, 300, 50).round(2),
    }
)

products.to_csv("../data/raw/products.csv", index=False)
print(f"[OK] Generated {len(products)} products")

# Generate sample customers (simulating API data)
customers = pd.DataFrame(
    {
        "customer_id": range(1, 201),
        "name": [f"Customer {i}" for i in range(1, 201)],
        "email": [f"customer{i}@example.com" for i in range(1, 201)],
        "country": np.random.choice(["USA", "UK", "Canada", "Australia", "Germany"], 200),
        "signup_date": [
            datetime(2023, 1, 1) + timedelta(days=np.random.randint(0, 365)) for _ in range(200)
        ],
    }
)

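# Save as JSON (simulating API response)
# date_format="iso" writes signup_date as ISO 8601 strings; the default is epoch
# milliseconds, which pd.to_datetime would later misread as nanoseconds
customers.to_json("../data/raw/customers.json", orient="records", indent=2, date_format="iso")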
print(f"[OK] Generated {len(customers)} customers")

---

## 2. Build the Pipeline

In [None]:
class SalesDataPipeline:
    """
    End-to-end sales data pipeline
    """

    def __init__(self, config):
        self.config = config
        self.logger = logging.getLogger(f"{__name__}.Pipeline")
        self.metrics = {
            "start_time": None,
            "end_time": None,
            "duration_seconds": None,
            "records_extracted": {},
            "records_transformed": 0,
            "records_loaded": 0,
            "quality_score": 0.0,
            "status": "pending",
            "errors": [],
        }

    def run(self):
        """Execute complete pipeline"""
        self.metrics["start_time"] = datetime.now()
        self.logger.info("=" * 80)
        self.logger.info("SALES DATA PIPELINE STARTED")
        self.logger.info("=" * 80)

        try:
            # Phase 1: Extract
            orders_df, products_df, customers_df = self._extract()

            # Phase 2: Transform
            enriched_df = self._transform(orders_df, products_df, customers_df)

            # Phase 3: Validate
            self._validate(enriched_df)

            # Phase 4: Load
            self._load(enriched_df)

            # Phase 5: Generate Reports
            self._generate_reports(enriched_df)

            # Success
            self.metrics["status"] = "success"

        except Exception as e:
            self.metrics["status"] = "failed"
            self.metrics["errors"].append(str(e))
            self.logger.error(f"Pipeline failed: {e}", exc_info=True)
            raise

        finally:
            self.metrics["end_time"] = datetime.now()
            self.metrics["duration_seconds"] = (
                self.metrics["end_time"] - self.metrics["start_time"]
            ).total_seconds()
            self._log_metrics()

    def _extract(self):
        """Extract data from all sources"""
        self.logger.info("[EXTRACT] Starting data extraction")

        # Extract orders (CSV)
        orders_df = pd.read_csv("../data/raw/orders.csv")
        orders_df["order_date"] = pd.to_datetime(orders_df["order_date"])
        self.metrics["records_extracted"]["orders"] = len(orders_df)
        self.logger.info(f"[EXTRACT] Orders: {len(orders_df):,} records")

        # Extract products (CSV)
        products_df = pd.read_csv("../data/raw/products.csv")
        self.metrics["records_extracted"]["products"] = len(products_df)
        self.logger.info(f"[EXTRACT] Products: {len(products_df):,} records")

        # Extract customers (JSON - simulated API)
        customers_df = pd.read_json("../data/raw/customers.json")
        customers_df["signup_date"] = pd.to_datetime(customers_df["signup_date"])
        self.metrics["records_extracted"]["customers"] = len(customers_df)
        self.logger.info(f"[EXTRACT] Customers: {len(customers_df):,} records")

        self.logger.info("[EXTRACT] [OK] Extraction complete")
        return orders_df, products_df, customers_df

    def _transform(self, orders_df, products_df, customers_df):
        """Transform and enrich data"""
        self.logger.info("[TRANSFORM] Starting data transformation")

        # Join orders with products
        enriched = orders_df.merge(products_df, on="product_id", how="left")
        self.logger.info("[TRANSFORM] Joined orders with products")

        # Join with customers
        enriched = enriched.merge(customers_df, on="customer_id", how="left")
        self.logger.info("[TRANSFORM] Joined with customers")

        # Calculate revenue and profit
        enriched["revenue"] = (enriched["quantity"] * enriched["price"]).round(2)
        enriched["cost_total"] = (enriched["quantity"] * enriched["cost"]).round(2)
        enriched["profit"] = (enriched["revenue"] - enriched["cost_total"]).round(2)
        self.logger.info("[TRANSFORM] Calculated revenue and profit")

        # Add derived columns
        enriched["year"] = enriched["order_date"].dt.year
        enriched["month"] = enriched["order_date"].dt.month
        enriched["quarter"] = enriched["order_date"].dt.quarter
        enriched["is_delivered"] = enriched["status"] == "delivered"
        self.logger.info("[TRANSFORM] Added derived columns")

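        # Clean data: drop rows that contain a null in any column
        rows_before = len(enriched)
        enriched = enriched.dropna()
        self.logger.info(
            f"[TRANSFORM] Dropped {rows_before - len(enriched):,} rows containing null values"
        )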

        self.metrics["records_transformed"] = len(enriched)
        self.logger.info(f"[TRANSFORM] [OK] Transformation complete: {len(enriched):,} records")

        return enriched

    def _validate(self, df):
        """Validate data quality"""
        self.logger.info("[VALIDATE] Starting data validation")

        validation_checks = []

        # Check 1: No nulls in critical columns
        critical_cols = ["order_id", "customer_id", "product_id", "revenue"]
        for col in critical_cols:
            null_count = df[col].isnull().sum()
            passed = null_count == 0
            validation_checks.append(passed)
            status = "[OK]" if passed else "[FAIL]"
            self.logger.info(f"[VALIDATE] {status} No nulls in {col}: {passed}")

        # Check 2: Revenue is non-negative
        revenue_valid = (df["revenue"] >= 0).all()
        validation_checks.append(revenue_valid)
        status = "[OK]" if revenue_valid else "[FAIL]"
        self.logger.info(f"[VALIDATE] {status} All revenue >= 0: {revenue_valid}")

        # Check 3: Quantity is positive
        qty_valid = (df["quantity"] > 0).all()
        validation_checks.append(qty_valid)
        status = "[OK]" if qty_valid else "[FAIL]"
        self.logger.info(f"[VALIDATE] {status} All quantity > 0: {qty_valid}")

        # Calculate quality score
        quality_score = sum(validation_checks) / len(validation_checks) * 100
        self.metrics["quality_score"] = quality_score

        self.logger.info(f"[VALIDATE] Quality Score: {quality_score:.1f}%")

        if quality_score < 95:
            raise ValueError(f"Data quality check failed: {quality_score:.1f}% < 95%")

        self.logger.info("[VALIDATE] [OK] Validation passed")

    def _load(self, df):
        """Load data to warehouse"""
        self.logger.info("[LOAD] Starting data load")

        # Load to a single Parquet file (this version does not partition by year/month)
        output_path = "../data/processed/sales_warehouse.parquet"
        df.to_parquet(output_path, compression="snappy", index=False)

        self.metrics["records_loaded"] = len(df)
        self.logger.info(f"[LOAD] [OK] Loaded {len(df):,} records to {output_path}")

    def _generate_reports(self, df):
        """Generate summary reports"""
        self.logger.info("[REPORT] Generating summary reports")

        # Summary by category
        category_summary = (
            df.groupby("category")
            .agg({"order_id": "count", "revenue": "sum", "profit": "sum"})
            .round(2)
        )
        category_summary.columns = ["orders", "revenue", "profit"]
        category_summary.to_csv("../data/processed/category_summary.csv")
        self.logger.info("[REPORT] Category summary saved")

        # Summary by country
        country_summary = (
            df.groupby("country")
            .agg({"order_id": "count", "revenue": "sum", "customer_id": "nunique"})
            .round(2)
        )
        country_summary.columns = ["orders", "revenue", "unique_customers"]
        country_summary.to_csv("../data/processed/country_summary.csv")
        self.logger.info("[REPORT] Country summary saved")

        self.logger.info("[REPORT] [OK] Reports generated")

    def _log_metrics(self):
        """Log pipeline execution metrics"""
        self.logger.info("=" * 80)
        self.logger.info("PIPELINE EXECUTION SUMMARY")
        self.logger.info("=" * 80)
        self.logger.info(f"Status: {self.metrics['status']}")
        self.logger.info(f"Duration: {self.metrics['duration_seconds']:.2f}s")
        self.logger.info(f"Records Extracted: {self.metrics['records_extracted']}")
        self.logger.info(f"Records Transformed: {self.metrics['records_transformed']:,}")
        self.logger.info(f"Records Loaded: {self.metrics['records_loaded']:,}")
        self.logger.info(f"Quality Score: {self.metrics['quality_score']:.1f}%")
        if self.metrics["errors"]:
            self.logger.error(f"Errors: {self.metrics['errors']}")
        self.logger.info("=" * 80)


print("[OK] Pipeline class defined")

---

## 3. Run the Pipeline

In [None]:
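# Configure pipeline (metadata only: the class stores this dict but does not read it,
# and the file paths are hard-coded in the pipeline methods)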
config = {"name": "sales_analytics_pipeline", "version": "1.0.0", "schedule": "daily at 2:00 AM"}

# Create and run pipeline
pipeline = SalesDataPipeline(config)
pipeline.run()

print("\n" + "=" * 80)
print("[OK] PIPELINE COMPLETED SUCCESSFULLY!")
print("=" * 80)

---

## 4. Verify Results

In [None]:
# Load and verify warehouse data
warehouse_df = pd.read_parquet("../data/processed/sales_warehouse.parquet")

print("Warehouse Data:")
print(f"  Total Records: {len(warehouse_df):,}")
print(f"  Columns: {list(warehouse_df.columns)}")
print(f"  Date Range: {warehouse_df['order_date'].min()} to {warehouse_df['order_date'].max()}")
print(f"  Total Revenue: ${warehouse_df['revenue'].sum():,.2f}")
print(f"  Total Profit: ${warehouse_df['profit'].sum():,.2f}")

print("\nSample Records:")
warehouse_df.head()

In [None]:
# View summary reports
category_summary = pd.read_csv("../data/processed/category_summary.csv")
country_summary = pd.read_csv("../data/processed/country_summary.csv")

print("Category Summary:")
print(category_summary.sort_values("revenue", ascending=False))

print("\nCountry Summary:")
print(country_summary.sort_values("revenue", ascending=False))
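
The success criteria ask that the output match an expected schema, but the cells above only print the columns. One possible way to make that check explicit is sketched below. It compares two plain `{column: dtype string}` mappings, so it does not depend on pandas itself; the helper name `find_schema_mismatches` and the expected dtypes shown are assumptions for illustration.

```python
def find_schema_mismatches(actual, expected):
    """Compare {column: dtype string} mappings and return a list of problems.

    Columns present in `actual` but absent from `expected` are ignored.
    """
    problems = []
    for column, dtype in expected.items():
        if column not in actual:
            problems.append(f"missing column: {column}")
        elif actual[column] != dtype:
            problems.append(f"{column}: expected {dtype}, got {actual[column]}")
    return problems


# Hypothetical expected schema for a few of the warehouse columns
expected_schema = {"order_id": "int64", "revenue": "float64", "is_delivered": "bool"}

# Stand-in for the dtypes of a loaded DataFrame
actual_schema = {"order_id": "int64", "revenue": "float64", "is_delivered": "bool"}

print(find_schema_mismatches(actual_schema, expected_schema))
```

With the warehouse data loaded, the `actual` mapping could be built as `{col: str(dtype) for col, dtype in warehouse_df.dtypes.items()}`.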

---

## 5. Deployment Considerations

### Production Deployment Checklist

#### Infrastructure
- [ ] Set up production environment (cloud or on-premise)
- [ ] Configure database/data warehouse
- [ ] Set up orchestration tool (Airflow, Prefect)
- [ ] Configure monitoring and alerting

#### Security
- [ ] Store credentials in secret manager
- [ ] Implement access controls (RBAC)
- [ ] Enable encryption at rest and in transit
- [ ] Set up audit logging

#### Reliability
- [ ] Implement comprehensive error handling
- [ ] Set up retry logic with exponential backoff
- [ ] Create alerting for failures
- [ ] Implement circuit breakers
- [ ] Set up data backups
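
Retry logic with exponential backoff can be sketched in a few lines. The helper below is illustrative rather than a production implementation: `retry_with_backoff` is a hypothetical name, it retries on any exception, and it omits the random jitter that real systems often add. The `sleep` parameter exists only so the waiting can be replaced in tests.

```python
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); after each failure wait base_delay * 2**attempt, then retry.

    The last failure is re-raised once max_attempts calls have been made.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Example: a call that fails twice before succeeding
state = {"calls": 0}


def flaky_extract():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "data"


recorded_delays = []
result = retry_with_backoff(flaky_extract, sleep=recorded_delays.append)
print(result, recorded_delays)
```

In a real pipeline it is usually safer to retry only on errors known to be transient (timeouts, connection resets) and to let validation errors fail immediately.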

#### Monitoring
- [ ] Track execution metrics
- [ ] Monitor data quality scores
- [ ] Set up dashboards
- [ ] Configure alerts for anomalies
- [ ] Track resource usage
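
Tracking execution metrics over time requires storing them somewhere a dashboard or alerting tool can read. As one possible starting point, the sketch below serializes a metrics dictionary shaped like the pipeline's to JSON. The helper name `metrics_to_json` is an assumption; where the result is written (file, database, monitoring service) is left open.

```python
import json
from datetime import datetime


def metrics_to_json(metrics):
    """Serialize a metrics dict to JSON, writing datetimes as ISO 8601 strings."""

    def encode(value):
        # json.dumps calls this only for objects it cannot serialize itself
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError(f"Cannot serialize {type(value).__name__}")

    return json.dumps(metrics, default=encode, indent=2)


example = {
    "start_time": datetime(2024, 1, 1, 2, 0, 0),
    "status": "success",
    "quality_score": 100.0,
    "records_extracted": {"orders": 1000, "products": 50, "customers": 200},
}
print(metrics_to_json(example))
```

Writing one such record per run makes it possible to chart duration and quality score across days and to alert when either drifts.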

#### Testing
- [ ] Unit tests for all transformations
- [ ] Integration tests for full pipeline
- [ ] Data quality tests
- [ ] Performance tests
- [ ] Disaster recovery tests
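
Unit tests are easiest to write when a transformation is a small pure function. The sketch below assumes the revenue and profit calculation from the transform step has been factored out into a hypothetical helper, `compute_financials`, written in plain Python so the test runs without pandas. Functions prefixed with `test_` are also picked up by pytest if it is available.

```python
import math


def compute_financials(quantity, price, cost):
    """Mirror the pipeline's revenue, cost_total, and profit arithmetic for one order."""
    revenue = round(quantity * price, 2)
    cost_total = round(quantity * cost, 2)
    profit = round(revenue - cost_total, 2)
    return {"revenue": revenue, "cost_total": cost_total, "profit": profit}


def test_profitable_order():
    result = compute_financials(quantity=3, price=19.99, cost=12.50)
    assert math.isclose(result["revenue"], 59.97)
    assert math.isclose(result["cost_total"], 37.50)
    assert math.isclose(result["profit"], 22.47)


def test_cost_above_price_gives_negative_profit():
    result = compute_financials(quantity=2, price=10.0, cost=15.0)
    assert math.isclose(result["profit"], -10.0)


# Run the tests directly when pytest is not being used
test_profitable_order()
test_cost_above_price_gives_negative_profit()
print("all tests passed")
```

The second test documents an edge case in the sample data: cost is generated independently of price, so negative profit is a valid outcome rather than a data error.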

#### Documentation
- [ ] Document pipeline architecture
- [ ] Create runbooks for common issues
- [ ] Document data lineage
- [ ] Write onboarding guide
- [ ] Maintain change log

### Scaling Considerations

1. **Horizontal Scaling**: Add more workers
2. **Vertical Scaling**: Increase resources per worker
3. **Partitioning**: Split data by date/region
4. **Caching**: Cache frequently accessed data
5. **Incremental Processing**: Process only new/changed data
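
Incremental processing is commonly implemented with a watermark: the pipeline remembers the latest timestamp it has processed and picks up only newer records on the next run. The sketch below illustrates the idea on plain dictionaries; `select_new_records` is a hypothetical helper, and persisting the watermark between runs is not shown.

```python
from datetime import datetime


def select_new_records(records, last_watermark, date_key="order_date"):
    """Return the records newer than the watermark, plus the updated watermark."""
    new_records = [r for r in records if r[date_key] > last_watermark]
    # If nothing is new, keep the old watermark
    new_watermark = max((r[date_key] for r in new_records), default=last_watermark)
    return new_records, new_watermark


orders = [
    {"order_id": 1, "order_date": datetime(2024, 1, 1)},
    {"order_id": 2, "order_date": datetime(2024, 1, 2)},
    {"order_id": 3, "order_date": datetime(2024, 1, 3)},
]

batch, watermark = select_new_records(orders, last_watermark=datetime(2024, 1, 1))
print([r["order_id"] for r in batch], watermark)
```

The strict `>` comparison skips late-arriving records that share the watermark's timestamp, so real pipelines often reprocess a small overlap window and deduplicate on a key such as `order_id`.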

---

## 6. Course Conclusion

### What You've Learned

Throughout this course, you've covered:

1. **Module 00**: Setup and introduction to data engineering
2. **Module 01**: Data engineering concepts, ETL vs ELT, architectures
3. **Module 02**: Data extraction from files, databases, and APIs
4. **Module 03**: Data transformation and cleaning with pandas
5. **Module 04**: Data loading strategies and file formats
6. **Module 05**: Building modular ETL pipelines
7. **Module 06**: Apache Spark for distributed data processing
8. **Module 07**: Workflow orchestration with Airflow concepts
9. **Module 08**: Data quality and validation frameworks
10. **Module 09**: Complete end-to-end pipeline project

### Key Takeaways

[OK] **Data Engineering Fundamentals**: ETL, data pipelines, data quality

[OK] **Tools & Technologies**: pandas, Spark, Airflow, Parquet, SQL

[OK] **Best Practices**: Modular design, logging, testing, error handling

[OK] **Production Skills**: Validation, monitoring, deployment

### Next Steps in Your Journey

1. **Build More Pipelines**: Practice with different data sources
2. **Learn Cloud Platforms**: AWS, GCP, or Azure data services
3. **Master Orchestration**: Set up real Airflow or Prefect
4. **Deep Dive into Spark**: Learn advanced Spark optimization
5. **Data Modeling**: Study Kimball, Data Vault methodologies
6. **Streaming**: Learn Kafka, Flink for real-time pipelines
7. **Contribute**: Open source data engineering projects
8. **Stay Current**: Follow data engineering blogs and communities

### Recommended Resources

**Books**:
- Fundamentals of Data Engineering (Joe Reis & Matt Housley)
- Designing Data-Intensive Applications (Martin Kleppmann)
- The Data Warehouse Toolkit (Ralph Kimball)

**Online**:
- DataCamp Data Engineering Track
- r/dataengineering (Reddit)
- Data Engineering Weekly Newsletter
- Seattle Data Guy (Blog)

**Practice**:
- Kaggle datasets
- Public APIs
- Personal projects

---

## Congratulations!

You've completed the Data Engineering Fundamentals course!

You now have the skills to:
- Design and build data pipelines
- Extract, transform, and load data at scale
- Ensure data quality and reliability
- Deploy production data systems

**Keep building, keep learning, and welcome to the world of data engineering!**

---

*Questions or feedback? Check the main README.md or create an issue in the repository.*