# Module 00: Setup and Introduction to Data Engineering

**Estimated Time:** 20-30 minutes

## Learning Objectives

By the end of this module, you will:
- Verify your development environment is set up correctly
- Understand what data engineering is and why it matters
- Learn the course structure and learning path
- Run your first simple data pipeline example
- Get familiar with the tools we'll be using

---

## 1. Environment Verification

Let's make sure all required packages are installed correctly.

In [None]:
# Check Python version
import sys

print(f"Python version: {sys.version}")
print(f"Python version info: {sys.version_info}")

if sys.version_info < (3, 8):
    print("\n[WARNING] Python 3.8 or higher is recommended")
else:
    print("\n[OK] Python version is compatible")

In [None]:
# Verify core data manipulation libraries
try:
    import pandas as pd
    import numpy as np

    print(f"[OK] pandas version: {pd.__version__}")
    print(f"[OK] numpy version: {np.__version__}")
except ImportError as e:
    print(f"[FAIL] Error importing libraries: {e}")
    print("\nPlease run: pip install pandas numpy")

In [None]:
# Verify database connectivity libraries
try:
    import sqlalchemy

    print(f"[OK] SQLAlchemy version: {sqlalchemy.__version__}")
except ImportError as e:
    print(f"[FAIL] Error importing SQLAlchemy: {e}")
    print("\nPlease run: pip install sqlalchemy")

In [None]:
# Verify PySpark (Big Data library)
try:
    import pyspark

    print(f"[OK] PySpark version: {pyspark.__version__}")
except ImportError as e:
    print(f"[WARNING] PySpark not found: {e}")
    print("\nPySpark will be needed in Module 06")
    print("Install with: pip install pyspark")

In [None]:
# Verify additional utilities
try:
    import requests
    import yaml

    print(f"[OK] requests version: {requests.__version__}")
    print("[OK] PyYAML is installed")
except ImportError as e:
    print(f"[WARNING] Some utilities not found: {e}")
    print("Install with: pip install requests pyyaml")

### Environment Check Summary

If you see [OK] for pandas, numpy, and SQLAlchemy, you're ready to start!

If you see [FAIL] or [WARNING] (a PySpark warning can wait until Module 06):
1. Make sure you've activated your virtual environment
2. Run: `pip install -r requirements.txt`
3. Restart the Jupyter kernel (Kernel → Restart)
4. Re-run the cells above
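If you prefer a single summary over re-running each cell, the checks above can be condensed into one report. This is a minimal sketch, not part of the course materials: it assumes Python 3.8+ (for `importlib.metadata`), and the helper names `installed_versions` and `report` are made up for illustration. It looks packages up by their PyPI distribution name, which can differ from the import name.

```python
from importlib import metadata

# Distribution names as published on PyPI (these can differ from import names,
# e.g. the PyYAML distribution is imported as `yaml`).
REQUIRED = ["pandas", "numpy", "SQLAlchemy"]
OPTIONAL = ["pyspark", "requests", "PyYAML"]


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def report(distributions, missing_label):
    """Print one status line per distribution and return the missing names."""
    versions = installed_versions(distributions)
    for name, version in versions.items():
        status = "[OK]" if version else missing_label
        print(f"{status} {name}: {version or 'not installed'}")
    return [name for name, version in versions.items() if version is None]


missing_required = report(REQUIRED, "[FAIL]")
report(OPTIONAL, "[WARNING]")
if missing_required:
    print(f"\nMissing required packages: {', '.join(missing_required)}")
```

An empty `missing_required` list corresponds to the "ready to start" condition described above.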

---

## 2. What is Data Engineering?

### Definition

**Data Engineering** is the practice of designing, building, and maintaining systems that collect, store, process, and deliver data at scale. Data engineers create the infrastructure and pipelines that enable data scientists, analysts, and business users to access reliable, high-quality data.

### The Data Engineering Workflow

```
┌─────────────┐       ┌─────────────┐       ┌─────────────┐       ┌─────────────┐
│   EXTRACT   │  -->  │  TRANSFORM  │  -->  │    LOAD     │  -->  │   ANALYZE   │
│             │       │             │       │             │       │             │
│ Data from   │       │ Clean &     │       │ Store in    │       │ BI, ML,     │
│ various     │       │ process     │       │ warehouse/  │       │ Analytics   │
│ sources     │       │ data        │       │ lake        │       │             │
└─────────────┘       └─────────────┘       └─────────────┘       └─────────────┘
```

### Why Data Engineering Matters

1. **Data Volume is Exploding**: Large organizations can generate terabytes of data daily
2. **Data Quality is Critical**: Bad data leads to bad decisions
3. **Speed Matters**: Real-time insights require real-time pipelines
4. **Complexity is Increasing**: Multiple data sources, formats, and systems
5. **Foundation for AI/ML**: Machine learning models need quality data

### Key Responsibilities of Data Engineers

- Design and build data pipelines (ETL/ELT)
- Ensure data quality and reliability
- Optimize data storage and retrieval
- Implement data security and governance
- Monitor and maintain data systems
- Enable data accessibility for stakeholders

---

## 3. Data Engineer vs. Other Data Roles

| Role | Primary Focus | Key Skills | Example Task |
|------|---------------|------------|-------------|
| **Data Engineer** | Infrastructure & Pipelines | Python, SQL, ETL, Spark, Airflow | Build a pipeline to ingest and transform daily sales data |
| **Data Scientist** | Insights & Predictions | Statistics, ML, Python, R | Build a model to predict customer churn |
| **Data Analyst** | Reporting & Analysis | SQL, BI Tools, Excel, Statistics | Create a dashboard showing monthly sales trends |
| **Analytics Engineer** | Data Transformation | SQL, dbt, modeling | Transform raw data into analysis-ready tables |
| **ML Engineer** | Production ML Systems | ML frameworks, deployment, scaling | Deploy and scale a recommendation model |

### The Relationship

```
Data Engineers build the infrastructure
        ↓
Analytics Engineers model the data
        ↓
Data Scientists & Analysts use the data
        ↓
ML Engineers productionize the models
```

---

## 4. Your First Data Pipeline

Let's build a simple ETL pipeline to understand the core concept!

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

print("Building your first data pipeline...\n")

### Step 1: EXTRACT - Get Raw Data

In this example, we'll simulate extracting sales data from a source system.

In [None]:
# Simulate extracting data from a source system
def extract_sales_data():
    """
    Simulates extracting sales data from a database or API.
    In real scenarios, this would connect to actual data sources.
    """
    np.random.seed(42)

    # Generate sample data
    n_records = 100
    dates = [datetime.now() - timedelta(days=x) for x in range(n_records)]

    raw_data = {
        "date": dates,
        "product_id": np.random.choice(["P001", "P002", "P003", "P004"], n_records),
        "quantity": np.random.randint(1, 20, n_records),
        "price": np.random.uniform(10.0, 100.0, n_records),
        "customer_id": np.random.choice(["C001", "C002", "C003", "C004", "C005"], n_records),
        "region": np.random.choice(["North", "South", "East", "West"], n_records),
    }

    df = pd.DataFrame(raw_data)
    print(f"[OK] EXTRACT: Retrieved {len(df)} records from source system")
    return df


# Extract the data
raw_sales_data = extract_sales_data()
print("\nFirst 5 records:")
raw_sales_data.head()
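For comparison, a real extract step usually reads from a file, database, or API rather than generating random values. The sketch below shows the file case under stated assumptions: the function name `extract_sales_from_csv` and the inline sample data are hypothetical, and an in-memory `io.StringIO` stands in for an actual file so the cell runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for a real export file; in practice this would be a
# path on disk (no such file ships with this module).
SAMPLE_CSV = io.StringIO(
    "date,product_id,quantity,price,customer_id,region\n"
    "2024-01-01,P001,3,19.99,C001,North\n"
    "2024-01-02,P002,7,45.50,C002,South\n"
)


def extract_sales_from_csv(source):
    """Read sales records from a CSV path or file-like object."""
    # parse_dates converts the column to datetime64 so `.dt` accessors work later
    df = pd.read_csv(source, parse_dates=["date"])
    print(f"[OK] EXTRACT: Retrieved {len(df)} records from CSV")
    return df


csv_sales_data = extract_sales_from_csv(SAMPLE_CSV)
print(csv_sales_data.dtypes)
```

Parsing the `date` column at read time matters because the transform step relies on `.dt.year`, `.dt.month`, and `.dt.day_name()`, which only work on datetime columns.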

### Step 2: TRANSFORM - Clean and Process Data

Now we'll clean, enhance, and aggregate the data.

In [None]:
def transform_sales_data(df):
    """
    Transform raw sales data:
    - Calculate revenue per record
    - Round price and revenue to 2 decimals
    - Add year, month, and day-of-week columns
    (Aggregation happens separately in create_product_summary.)
    """
    # Create a copy to avoid modifying original
    df_transformed = df.copy()

    # Round price first so that revenue matches quantity * price in the output
    df_transformed["price"] = df_transformed["price"].round(2)

    # Calculate revenue from the rounded price
    df_transformed["revenue"] = (
        df_transformed["quantity"] * df_transformed["price"]
    ).round(2)

    # Extract date components
    df_transformed["year"] = df_transformed["date"].dt.year
    df_transformed["month"] = df_transformed["date"].dt.month
    df_transformed["day_of_week"] = df_transformed["date"].dt.day_name()

    print(f"[OK] TRANSFORM: Processed {len(df_transformed)} records")
    print("   - Added revenue calculation")
    print("   - Extracted date components")
    print(f"   - Total revenue: ${df_transformed['revenue'].sum():,.2f}")

    return df_transformed


# Transform the data
transformed_sales_data = transform_sales_data(raw_sales_data)
print("\nTransformed data sample:")
transformed_sales_data.head()

In [None]:
# Create aggregated summary for reporting
def create_product_summary(df):
    """
    Create a summary report by product
    """
    # Named aggregation sets the output column names directly
    summary = (
        df.groupby("product_id")
        .agg(
            total_quantity=("quantity", "sum"),
            total_revenue=("revenue", "sum"),
            unique_customers=("customer_id", "nunique"),
        )
        .round(2)
        .sort_values("total_revenue", ascending=False)
    )

    print("[OK] TRANSFORM: Created product summary")
    return summary


product_summary = create_product_summary(transformed_sales_data)
print("\nProduct Summary:")
product_summary

### Step 3: LOAD - Save to Destination

Finally, we'll save the processed data to CSV files (in production, this would be a database or data warehouse).

In [None]:
import os


def load_data(df, summary, output_dir="outputs"):
    """
    Load transformed data to destination (CSV files in this example)
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    # Save detailed data
    detail_file = os.path.join(output_dir, "sales_detail.csv")
    df.to_csv(detail_file, index=False)
    print(f"[OK] LOAD: Saved detailed data to {detail_file}")

    # Save summary data
    summary_file = os.path.join(output_dir, "product_summary.csv")
    summary.to_csv(summary_file)
    print(f"[OK] LOAD: Saved summary data to {summary_file}")

    print("\n[DATA] Pipeline complete! Data loaded successfully.")


# Load the data
load_data(transformed_sales_data, product_summary)
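To make the "database in production" remark a little more concrete, here is a minimal sketch of the same load step targeting SQLite instead of CSV. It is an illustration only: the helper `load_to_sqlite`, the table name `sales_detail`, and the small sample frame are assumptions, and an in-memory database is used so nothing is written to disk.

```python
import sqlite3

import pandas as pd

# Small stand-in frame; in the notebook you could pass transformed_sales_data.
sample_sales = pd.DataFrame(
    {
        "product_id": ["P001", "P002", "P001"],
        "quantity": [3, 7, 2],
        "revenue": [59.97, 318.50, 39.98],
    }
)


def load_to_sqlite(df, table_name, connection):
    """Write a DataFrame to a SQLite table, replacing any existing table."""
    df.to_sql(table_name, connection, if_exists="replace", index=False)
    # table_name is hard-coded by the caller here; never build SQL like this
    # from untrusted input.
    query = f"SELECT COUNT(*) FROM {table_name}"
    row_count = connection.execute(query).fetchone()[0]
    print(f"[OK] LOAD: Wrote {row_count} rows to table '{table_name}'")
    return row_count


# ":memory:" keeps the example self-contained; a file path would persist the data.
conn = sqlite3.connect(":memory:")
try:
    rows_written = load_to_sqlite(sample_sales, "sales_detail", conn)
finally:
    conn.close()
```

The `if_exists="replace"` choice makes the load repeatable for a demo. A real pipeline would more likely append or upsert, which later modules on loading and storage are better placed to cover.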

### Verify the Output

In [None]:
# Verify files were created
import os

output_files = os.listdir("outputs")
print("Files created:")
for file in output_files:
    file_path = os.path.join("outputs", file)
    file_size = os.path.getsize(file_path)
    print(f"  - {file} ({file_size:,} bytes)")
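Listing the files confirms that they exist, but not that they contain what you expect. One further check is to count the rows that were written. The sketch below uses only the standard library; the helper name `count_csv_rows` is hypothetical, and the demo runs against a temporary file so it does not depend on the `outputs/` folder.

```python
import csv
import os
import tempfile


def count_csv_rows(path):
    """Return the number of data rows in a CSV file (header excluded)."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


# Self-contained demo on a temporary file with one header and two data rows.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["product_id", "quantity"])
        writer.writerows([["P001", 3], ["P002", 7]])
    demo_rows = count_csv_rows(demo_path)

print(f"Rows found: {demo_rows}")
```

In the notebook, you could compare `count_csv_rows("outputs/sales_detail.csv")` with `len(transformed_sales_data)`; the two numbers should match if the load step wrote every record.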

### Complete Pipeline Function

Let's wrap everything into a single pipeline function:

In [None]:
def run_sales_etl_pipeline():
    """
    Complete ETL pipeline for sales data
    """
    print("=" * 60)
    print("SALES DATA ETL PIPELINE")
    print("=" * 60)
    print()

    # Extract
    print("[1/3] EXTRACT Phase")
    raw_data = extract_sales_data()
    print()

    # Transform
    print("[2/3] TRANSFORM Phase")
    transformed_data = transform_sales_data(raw_data)
    summary = create_product_summary(transformed_data)
    print()

    # Load
    print("[3/3] LOAD Phase")
    load_data(transformed_data, summary)
    print()

    print("=" * 60)
    print("ETL Pipeline Completed Successfully!")
    print("=" * 60)

    return transformed_data, summary


# Run the complete pipeline
final_data, final_summary = run_sales_etl_pipeline()

---

## 5. Course Overview

### What You'll Learn

This course consists of 10 modules covering:

1. **Module 00** (This Module): Setup and Introduction ✓
2. **Module 01**: Introduction to Data Engineering - Concepts and architecture
3. **Module 02**: Data Sources and Extraction - APIs, databases, files
4. **Module 03**: Data Transformation and Cleaning - pandas mastery
5. **Module 04**: Data Loading and Storage - Databases, files, formats
6. **Module 05**: Building ETL Pipelines - Design patterns and best practices
7. **Module 06**: Introduction to Apache Spark - Distributed data processing
8. **Module 07**: Workflow Orchestration - Apache Airflow concepts
9. **Module 08**: Data Quality and Validation - Testing and validation
10. **Module 09**: End-to-End Pipeline Project - Capstone project

### Learning Approach

- **Theory First**: Understand concepts before implementation
- **Hands-On**: Code examples you can run and modify
- **Progressive**: Each module builds on previous ones
- **Practical**: Real-world scenarios and best practices

### Estimated Time

- **Total**: 8-10 hours
- **Per Module**: 45-75 minutes
- **Recommended Pace**: 2-3 modules per week

---

## 6. Key Concepts from This Module

### What You Learned

[OK] **Environment Setup**: Your Python environment is ready for data engineering

[OK] **Data Engineering Definition**: Building systems to collect, store, and process data

[OK] **ETL Process**:
- **Extract**: Get data from sources
- **Transform**: Clean and process data
- **Load**: Store data in destination

[OK] **First Pipeline**: Built a complete working data pipeline

[OK] **Course Structure**: Know what's coming in the next modules

### Key Takeaways

1. Data engineering is about building reliable data infrastructure
2. ETL pipelines are the core of data engineering
3. Python + pandas is powerful for data processing
4. Real-world pipelines follow the same Extract-Transform-Load pattern
5. Code modularity makes pipelines maintainable
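As one illustration of takeaway 5, keeping each stage as its own function makes it easy to add cross-cutting behavior such as timing and error logging in a single place. This is a sketch under assumptions, not the course's implementation: `run_stage` and the toy `extract`, `transform`, and `load` functions are hypothetical stand-ins for the functions built in section 4.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("pipeline")


def run_stage(name, func, *args):
    """Run one pipeline stage, log its duration, and re-raise any failure."""
    start = time.perf_counter()
    try:
        result = func(*args)
    except Exception:
        logger.exception("Stage %s failed", name)
        raise
    logger.info("Stage %s finished in %.3fs", name, time.perf_counter() - start)
    return result


# Toy stages standing in for extract_sales_data, transform_sales_data, load_data.
def extract():
    return [{"quantity": 2, "price": 10.0}, {"quantity": 1, "price": 5.5}]


def transform(rows):
    return [{**row, "revenue": row["quantity"] * row["price"]} for row in rows]


def load(rows):
    return len(rows)


raw = run_stage("extract", extract)
clean = run_stage("transform", transform, raw)
loaded = run_stage("load", load, clean)
```

Because the wrapper knows nothing about what each stage does, swapping a toy stage for a real one requires no change to `run_stage` itself.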

---

## 7. Practice Exercise

### Challenge: Modify the Pipeline

Try these modifications to reinforce your learning:

1. **Add a new column**: Calculate `discount` as 10% of revenue
2. **Filter data**: Only include sales where quantity > 5
3. **New aggregation**: Create a summary by region instead of product
4. **Add validation**: Check that all revenues are positive
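If you are unsure where to start with item 4, here is one possible shape for a validation check. It is only a hint, built on a tiny hand-made frame rather than the pipeline's data, and the function name `find_non_positive_revenue` is an assumption for illustration.

```python
import pandas as pd


def find_non_positive_revenue(df):
    """Return the rows whose revenue is missing, zero, or negative."""
    invalid_mask = df["revenue"].isna() | (df["revenue"] <= 0)
    return df[invalid_mask]


# Tiny hand-made frame for illustration (not the notebook's data).
sample_revenue = pd.DataFrame(
    {"product_id": ["P001", "P002", "P003"], "revenue": [59.97, -5.00, 0.0]}
)

bad_rows = find_non_positive_revenue(sample_revenue)
if bad_rows.empty:
    print("[OK] All revenues are positive")
else:
    print(f"[FAIL] {len(bad_rows)} rows have non-positive revenue")
```

Returning the offending rows, rather than just `True` or `False`, makes it easier to see what went wrong when a check fails.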

Use the cells below to experiment:

In [None]:
# Your code here: Add discount column

In [None]:
# Your code here: Filter by quantity

In [None]:
# Your code here: Aggregate by region

In [None]:
# Your code here: Validate that all revenues are positive

---

## 8. Next Steps

Congratulations on completing Module 00!

### Ready to Continue?

In **Module 01: Introduction to Data Engineering**, you'll learn:
- Deep dive into data engineering concepts
- The modern data stack
- ETL vs. ELT patterns
- Data pipeline architectures
- Common challenges and solutions

### Before Moving On

Make sure you:
- [OK] Have all packages installed (no errors in section 1)
- [OK] Successfully ran the complete ETL pipeline
- [OK] Understand the basic ETL concept
- [OK] Can see the output files in the `outputs/` folder

### Resources

- [pandas documentation](https://pandas.pydata.org/docs/)
- [Python data structures](https://docs.python.org/3/tutorial/datastructures.html)
- Main README.md in the project root

---

**Ready?** Open `01_introduction_to_data_engineering.ipynb` to continue your learning journey!