# Module 01: Introduction to Data Engineering

**Estimated Time:** 45-60 minutes

## Learning Objectives

By the end of this module, you will:
- Understand the comprehensive role of a data engineer
- Learn about the modern data stack and its components
- Understand ETL vs. ELT patterns and when to use each
- Explore different data pipeline architectures
- Recognize common data engineering challenges and solutions
- Understand the data engineering lifecycle

---

## 1. What is Data Engineering?

### Definition Expanded

Data engineering is a multifaceted discipline that encompasses:

1. **Data Architecture**: Designing how data flows through systems
2. **Data Infrastructure**: Building and maintaining data platforms
3. **Data Pipeline Development**: Creating automated data workflows
4. **Data Quality**: Ensuring accuracy, completeness, and reliability
5. **Data Integration**: Combining data from multiple sources
6. **Performance Optimization**: Making data systems fast and efficient

### The Data Engineering Spectrum

```
Operational Data          →    Data Engineering    →    Analytical Data
(Live transactions)            (Transformation)         (Ready for analysis)

Examples:                      Processes:               Examples:
- User clicks                 - Extract                 - User behavior
- Sales records               - Transform               - Sales trends
- Sensor data                 - Load                    - Predictive models
- App logs                    - Validate                - Business dashboards
```

In [None]:
# Let's visualize a typical day of data flowing through a pipeline
import pandas as pd

# Simulate hourly data volume through a pipeline (synthetic numbers)
hours = list(range(24))
source_records = [10000 + (i * 500) + ((i % 8) * 2000) for i in hours]
data_flow = {
    "hour": hours,
    "source_records": source_records,
    # Assume 95% of source records survive validation
    "after_validation": [int(x * 0.95) for x in source_records],
    # Assume 90% of source records remain after transformation
    "after_transformation": [int(x * 0.90) for x in source_records],
}

df_flow = pd.DataFrame(data_flow)
total_source = df_flow["source_records"].sum()
total_final = df_flow["after_transformation"].sum()

print("Typical Data Flow Through a Pipeline (24 hours)")
print("=" * 60)
print(f"Total source records: {total_source:,}")
print(f"Records after validation: {df_flow['after_validation'].sum():,}")
print(f"Final transformed records: {total_final:,}")
print(f"Records dropped: {(total_source - total_final) / total_source * 100:.2f}%")
print("\nNote: Records are usually dropped because of duplicates, invalid records, or filtering")

---

## 2. The Modern Data Stack

The modern data stack consists of specialized tools for each stage of the data lifecycle:

### Layer 1: Data Sources
- **Transactional Databases**: PostgreSQL, MySQL, Oracle
- **SaaS Applications**: Salesforce, HubSpot, Google Analytics
- **APIs**: REST, GraphQL
- **Streaming**: Kafka, Kinesis
- **Files**: CSV, JSON, Parquet, logs

### Layer 2: Data Ingestion
- **Batch**: Scheduled extract jobs, typically run by an orchestrator such as Apache Airflow, Prefect, or Dagster (see Layer 5)
- **Streaming**: Apache Kafka, Apache Flink, AWS Kinesis
- **ELT Tools**: Fivetran, Airbyte, Stitch

### Layer 3: Data Storage
- **Data Warehouses**: Snowflake, BigQuery, Redshift
- **Data Lakes**: S3, Azure Data Lake, GCS
- **Lakehouses**: Databricks, Delta Lake

### Layer 4: Data Transformation
- **SQL-based**: dbt (data build tool)
- **Python-based**: pandas, PySpark
- **Notebooks**: Jupyter, Databricks

### Layer 5: Data Orchestration
- **Workflow Management**: Apache Airflow, Prefect, Dagster
- **Job Scheduling**: Cron, Cloud Scheduler

### Layer 6: Data Quality & Governance
- **Quality**: Great Expectations, deequ, Soda
- **Cataloging**: Amundsen, DataHub, Alation
- **Lineage**: OpenLineage, Marquez

### Layer 7: Data Consumption
- **BI Tools**: Tableau, Looker, Power BI, Metabase
- **ML Platforms**: MLflow, Kubeflow, SageMaker
- **APIs**: REST, GraphQL for data access

In [None]:
# Example: Simulating the modern data stack layers
import json

data_stack = {
    "sources": ["PostgreSQL", "Salesforce API", "S3 Logs"],
    "ingestion": "Apache Airflow",
    "storage": "Snowflake Data Warehouse",
    "transformation": "dbt + Python",
    "quality": "Great Expectations",
    "consumption": ["Tableau Dashboard", "ML Model API"],
}

print("Example Modern Data Stack Configuration:")
print(json.dumps(data_stack, indent=2))

print("\n[DATA] Data Flow:")
print("Sources → Airflow → Snowflake → dbt → Quality Checks → Tableau/ML")

---

## 3. ETL vs. ELT: Understanding the Difference

### ETL (Extract, Transform, Load)

**Traditional approach**: Transform data BEFORE loading into the warehouse

```
Source → Extract → Transform → Load → Data Warehouse
         (Raw)     (Cleaned)   (Ready)
```

**When to use ETL:**
- Limited warehouse compute capacity
- Data privacy requirements (mask before storing)
- Complex transformations best done in specialized tools
- On-premise systems with legacy constraints

**Pros:**
- [OK] Reduced warehouse storage
- [OK] Data already cleaned before storage
- [OK] Compliance: sensitive data can be masked early

**Cons:**
- [FAIL] Slower to load - transformation must finish first
- [FAIL] Raw data is not kept in the warehouse unless staged separately
- [FAIL] Less flexible for ad-hoc analysis
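
The mask-before-storing case can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production pattern: an in-memory SQLite database stands in for the warehouse, and the `customers` table, the sample rows, and the `mask_email` helper are all hypothetical.

```python
import hashlib
import sqlite3

# Hypothetical raw rows as they might arrive from a source system
raw_rows = [
    {"id": 1, "email": "ana@example.com", "spend": "120.50"},
    {"id": 2, "email": "bo@example.com", "spend": "80.00"},
]


def mask_email(email: str) -> str:
    """Replace an email with a short one-way hash (illustrative masking only)."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()[:12]


def transform(row: dict) -> tuple:
    """Transform step: mask the sensitive field and cast spend to float."""
    return (row["id"], mask_email(row["email"]), float(row["spend"]))


# In-memory SQLite stands in for the warehouse (an assumption for this sketch)
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (id INTEGER, email_hash TEXT, spend REAL)")

# ETL order: transform first, then load only the cleaned, masked rows
cleaned = [transform(row) for row in raw_rows]
warehouse.executemany("INSERT INTO customers VALUES (?, ?, ?)", cleaned)
warehouse.commit()

stored = warehouse.execute(
    "SELECT id, email_hash, spend FROM customers ORDER BY id"
).fetchall()
print(stored)
```

Because the transformation runs before the load, the plain email addresses never reach the warehouse, which is also why the raw values cannot be recovered from it later.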

---

### ELT (Extract, Load, Transform)

**Modern approach**: Load raw data first, transform in the warehouse

```
Source → Extract → Load → Data Warehouse → Transform (inside the warehouse)
         (Raw)     (Raw)   (Raw tables)     (Cleaned, multiple views)
```

**When to use ELT:**
- Cloud data warehouses (Snowflake, BigQuery)
- Need for data flexibility
- Multiple transformation use cases
- Fast-changing requirements

**Pros:**
- [OK] Faster initial load
- [OK] Raw data always available
- [OK] Leverage warehouse compute power
- [OK] More flexible transformations

**Cons:**
- [FAIL] Requires powerful warehouse
- [FAIL] More storage needed
- [FAIL] Potential for messy data if not governed
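
A rough sketch of the load-then-transform order, again with in-memory SQLite standing in for a cloud warehouse. The `raw_orders` table, the `clean_orders` view, and the sample rows are hypothetical, and the `GLOB` filter is just one simple way to skip non-numeric amounts in SQLite.

```python
import sqlite3

# In-memory SQLite stands in for a cloud warehouse (an assumption for this sketch)
warehouse = sqlite3.connect(":memory:")

# Load step: land the data as-is, including the messy row
warehouse.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, status TEXT)")
raw_orders = [
    ("A1", "19.99", "complete"),
    ("A2", "5.00", "COMPLETE"),
    ("A3", "not-a-number", "cancelled"),
]
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

# Transform step: runs inside the warehouse as SQL and leaves raw_orders untouched
warehouse.execute(
    """
    CREATE VIEW clean_orders AS
    SELECT order_id, CAST(amount AS REAL) AS amount, LOWER(status) AS status
    FROM raw_orders
    WHERE amount GLOB '[0-9]*.[0-9]*'
    """
)

raw_count = warehouse.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
clean_rows = warehouse.execute("SELECT * FROM clean_orders ORDER BY order_id").fetchall()
print(f"raw rows kept: {raw_count}, clean rows: {len(clean_rows)}")
```

The malformed row stays in `raw_orders`, so a different view with different rules can be added later without re-extracting from the source.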

---

### Comparison Table

| Aspect | ETL | ELT |
|--------|-----|-----|
| **Transform Location** | External processing | Inside warehouse |
| **Speed** | Often slower to load (transformation runs first) | Often faster to load (transformation deferred to the warehouse) |
| **Flexibility** | Limited (fixed transformations) | High (ad-hoc SQL queries) |
| **Cost** | Lower warehouse storage, separate transformation compute | Higher warehouse storage, transformation billed as warehouse compute |
| **Best For** | On-premise, legacy systems | Cloud, modern data warehouses |
| **Raw Data Access** | Not in the warehouse unless staged separately | Always available |
| **Tools** | Informatica, Talend, SSIS | dbt, Snowflake, BigQuery |

### The Trend

**Today's Standard**: ELT is becoming dominant due to:
- Cheap cloud storage
- Powerful cloud data warehouses
- Need for data flexibility
- Faster time-to-insight

In [None]:
# Simulating ETL vs ELT performance difference
# Note: the sleep() durations below are hard-coded illustrative values, not measurements
import time


def simulate_etl_approach(records):
    """Simulates traditional ETL - transform before load"""
    print("ETL Approach:")
    start = time.time()

    # Extract
    print("  [1/3] Extracting data...")
    time.sleep(0.1)

    # Transform (bottleneck in ETL)
    print("  [2/3] Transforming data (sequential processing)...")
    time.sleep(0.5)  # Simulate longer transformation

    # Load
    print("  [3/3] Loading transformed data...")
    time.sleep(0.1)

    elapsed = time.time() - start
    print(f"  [OK] Complete in {elapsed:.2f}s")
    return elapsed


def simulate_elt_approach(records):
    """Simulates modern ELT - load then transform"""
    print("\nELT Approach:")
    start = time.time()

    # Extract
    print("  [1/3] Extracting data...")
    time.sleep(0.1)

    # Load (faster)
    print("  [2/3] Loading raw data...")
    time.sleep(0.1)

    # Transform (leverages warehouse parallelism)
    print("  [3/3] Transforming in warehouse (parallel)...")
    time.sleep(0.2)  # Faster due to warehouse compute

    elapsed = time.time() - start
    print(f"  [OK] Complete in {elapsed:.2f}s")
    return elapsed


# Compare
print("Comparing ETL vs ELT Performance")
print("=" * 50)

etl_time = simulate_etl_approach(10000)
elt_time = simulate_elt_approach(10000)

print("\n" + "=" * 50)
print(f"Simulated difference: ELT finished {((etl_time - elt_time) / etl_time * 100):.1f}% sooner (hard-coded timings)")

---

## 4. Data Pipeline Architectures

### 1. Batch Processing Architecture

Process data in scheduled batches (hourly, daily, weekly)

```
Source DB → Batch Extract (Airflow) → Transform → Load → Data Warehouse
           (Every 6 hours)            (pandas)          (Snowflake)
```

**Use Cases:**
- Daily reports and analytics
- Historical data processing
- Non-time-sensitive transformations

---

### 2. Streaming Architecture

Process data in real-time as it arrives

```
Events → Kafka → Stream Processor (Flink) → Real-time DB → Dashboard
        (Queue)   (Continuous)               (Redis)       (Live)
```

**Use Cases:**
- Fraud detection
- Real-time recommendations
- Live dashboards
- IoT data processing
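
One common building block of stream processing is a tumbling window, where each event is assigned to a fixed-size time bucket as it arrives. The sketch below is a plain-Python illustration of that idea, not Flink or Kafka code; the event tuples and the 10-second window size are assumptions.

```python
from collections import defaultdict

# Hypothetical events: (event_time_in_seconds, user_id)
events = [(1, "u1"), (4, "u2"), (11, "u1"), (12, "u3"), (25, "u2")]

WINDOW_SECONDS = 10  # assumed window size for this sketch


def window_start(event_time: int) -> int:
    """Return the start of the tumbling window that contains event_time."""
    return (event_time // WINDOW_SECONDS) * WINDOW_SECONDS


counts = defaultdict(int)
for event_time, _user in events:
    # Each event updates its window's running count as soon as it arrives
    counts[window_start(event_time)] += 1

for start in sorted(counts):
    print(f"window [{start}, {start + WINDOW_SECONDS}): {counts[start]} events")
```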

---

### 3. Lambda Architecture

Combines batch and streaming (complex but comprehensive)

```
              ┌─→ Batch Layer (Accurate, slow) ────┐
Data Source ──┤                                    ├─→ Serving Layer → Query
              └─→ Speed Layer (Fast, approximate) ─┘
```

**Use Cases:**
- Need both real-time AND accurate historical data
- Complex analytics requirements
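
The serving layer's job can be sketched as merging two views at query time. This is a simplified illustration with hypothetical page-view counts; `batch_view`, `speed_view`, and `serve` are names invented for the example.

```python
# Batch view: accurate totals recomputed up to the last batch run (hypothetical data)
batch_view = {"page_a": 1000, "page_b": 250}

# Speed view: counts for events that arrived after the last batch run
speed_view = {"page_a": 12, "page_c": 3}


def serve(page: str) -> int:
    """Serving layer: combine the batch total with the recent real-time delta."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)


for page in ("page_a", "page_b", "page_c"):
    print(page, serve(page))
```

When the next batch run completes, the batch view absorbs the recent events and the speed view is reset. Keeping those two code paths consistent is where much of Lambda's complexity comes from.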

---

### 4. Kappa Architecture

Simplified - everything is a stream

```
Data Source → Kafka → Stream Processing → Storage → Query
             (All data as streams)
```

**Use Cases:**
- When batch processing can be replaced with streaming
- Teams that want a single processing path, which is simpler to maintain than Lambda's two
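
In a Kappa-style design, reprocessing is done by replaying the log through the same stream code instead of running a separate batch job. A minimal sketch, assuming a Python list stands in for a Kafka topic and using a hypothetical `build_balances` function:

```python
# An append-only event log stands in for a Kafka topic (hypothetical data)
event_log = [
    {"account": "acc1", "delta": 100},
    {"account": "acc2", "delta": 50},
    {"account": "acc1", "delta": -30},
]


def build_balances(log, transform=lambda delta: delta):
    """Fold the whole log into current state with one stream-style code path."""
    balances = {}
    for event in log:
        account = event["account"]
        balances[account] = balances.get(account, 0) + transform(event["delta"])
    return balances


# Normal operation: process the stream once
balances_v1 = build_balances(event_log)

# Logic change: replay the same log with the new logic (here: store cents)
balances_v2 = build_balances(event_log, transform=lambda delta: delta * 100)

print(balances_v1)
print(balances_v2)
```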

---

### Which Architecture to Choose?

| Factor | Batch | Streaming | Lambda | Kappa |
|--------|-------|-----------|--------|-------|
| **Latency Requirement** | Hours/Days | Seconds | Both | Seconds |
| **Complexity** | Low | Medium | High | Medium |
| **Cost** | Low | Medium | High | Medium |
| **Accuracy** | High | Good | High | Good |
| **Relative Adoption (rough, unsourced estimate)** | ~80% | ~15% | ~4% | ~1% |

In [None]:
# Simulating different pipeline patterns


class BatchPipeline:
    def __init__(self, batch_size=1000):
        self.batch_size = batch_size
        self.buffer = []

    def process(self, records):
        """Process records in batches"""
        print(f"\n[Batch Pipeline] Processing {len(records)} records")
        batches = [
            records[i : i + self.batch_size] for i in range(0, len(records), self.batch_size)
        ]
        print(f"  Split into {len(batches)} batches of up to {self.batch_size} records")
        return f"Processed {len(batches)} batches"


class StreamPipeline:
    def __init__(self):
        self.processed_count = 0

    def process(self, record):
        """Process records one at a time (or micro-batches)"""
        self.processed_count += 1
        if self.processed_count % 100 == 0:
            print(f"  [Stream] Processed {self.processed_count} records in real-time...")
        return record


# Demo
print("=" * 60)
print("Pipeline Architecture Demo")
print("=" * 60)

# Create sample data
sample_records = list(range(5000))

# Batch processing
batch_pipeline = BatchPipeline(batch_size=1000)
result = batch_pipeline.process(sample_records)
print(f"  Result: {result}")

# Stream processing simulation
print("\n[Stream Pipeline] Processing records as they arrive")
stream_pipeline = StreamPipeline()
demo_limit = 300  # Only stream the first 300 records to keep the demo short
for record in sample_records[:demo_limit]:
    stream_pipeline.process(record)
print("  ... (a real stream would keep processing as records arrive)")

print(f"\n  Total streamed in demo: {stream_pipeline.processed_count} records")

print("\n" + "=" * 60)
print("Key Difference:")
print("  Batch: Collects data over an interval, then processes it in chunks")
print("  Stream: Processes each record immediately as it arrives")
print("=" * 60)

---

## 5. Common Data Engineering Challenges

### Challenge 1: Data Quality Issues

**Problems:**
- Missing values
- Duplicate records
- Incorrect data types
- Inconsistent formats
- Schema changes

**Solutions:**
- Data validation frameworks (Great Expectations)
- Schema enforcement
- Automated quality checks
- Data contracts between teams

---

### Challenge 2: Scalability

**Problems:**
- Data volume growing exponentially
- Pipelines becoming slower
- Infrastructure costs rising

**Solutions:**
- Distributed processing (Spark, Flink)
- Incremental loading strategies
- Data partitioning
- Caching and optimization
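
Incremental loading is often implemented with a high-water mark: each run extracts only the rows that changed since the previous run. The sketch below assumes every source row carries an increasing `updated_at` value; the row layout and the `extract_incremental` helper are hypothetical.

```python
# Hypothetical source rows with an increasing updated_at value
source_rows = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 105},
    {"id": 3, "updated_at": 110},
]


def extract_incremental(rows, last_watermark):
    """Return rows changed since last_watermark plus the new watermark."""
    new_rows = [row for row in rows if row["updated_at"] > last_watermark]
    new_watermark = max((row["updated_at"] for row in new_rows), default=last_watermark)
    return new_rows, new_watermark


# First run: nothing loaded yet, so everything is new
batch_1, watermark = extract_incremental(source_rows, last_watermark=0)

# A new row arrives at the source before the second run
source_rows.append({"id": 4, "updated_at": 120})
batch_2, watermark = extract_incremental(source_rows, last_watermark=watermark)

print(len(batch_1), len(batch_2), watermark)
```

The second run reads one row instead of four, so the work per run tracks the rate of change and not the total table size.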

---

### Challenge 3: Pipeline Failures

**Problems:**
- Source systems unavailable
- Network issues
- Code bugs
- Resource exhaustion

**Solutions:**
- Retry logic with exponential backoff
- Circuit breakers
- Dead letter queues
- Comprehensive monitoring and alerting
- Idempotent operations
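
Retry logic with exponential backoff can be sketched as follows. The `retry_with_backoff` helper and the flaky source are hypothetical, and the sleep function is injectable so the demo can record delays without actually waiting. Production implementations commonly also add random jitter and a maximum delay, which this sketch omits.

```python
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))


# Hypothetical flaky source: fails twice, then succeeds
calls = {"count": 0}


def flaky_extract():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source unavailable")
    return ["row1", "row2"]


delays = []  # Record the waits instead of sleeping so the demo runs instantly
rows = retry_with_backoff(flaky_extract, sleep=delays.append)
print(rows, delays)
```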

---

### Challenge 4: Changing Requirements

**Problems:**
- Business needs evolve
- New data sources added
- Schema changes

**Solutions:**
- Modular pipeline design
- Configuration-driven pipelines
- Version control and CI/CD
- Backward compatibility
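
A configuration-driven pipeline keeps the list of steps in data, so many requirement changes become config edits. The sketch below is one possible shape, with a hypothetical `pipeline_config`, step functions, and `STEP_REGISTRY`; in practice the config might live in a YAML or JSON file.

```python
# Hypothetical pipeline config; in practice this might live in a YAML or JSON file
pipeline_config = {
    "steps": [
        {"name": "drop_nulls", "column": "amount"},
        {"name": "rename", "old": "amt_currency", "new": "currency"},
    ]
}


def drop_nulls(rows, column):
    """Keep only rows where the given column is present and not None."""
    return [row for row in rows if row.get(column) is not None]


def rename(rows, old, new):
    """Rename a key in every row, leaving other keys unchanged."""
    return [{(new if key == old else key): value for key, value in row.items()} for row in rows]


# Registry maps step names in the config to functions
STEP_REGISTRY = {"drop_nulls": drop_nulls, "rename": rename}


def run_pipeline(rows, config):
    """Apply each configured step in order; adding a step needs no change here."""
    for step in config["steps"]:
        params = {key: value for key, value in step.items() if key != "name"}
        rows = STEP_REGISTRY[step["name"]](rows, **params)
    return rows


input_rows = [
    {"amount": 10, "amt_currency": "USD"},
    {"amount": None, "amt_currency": "EUR"},
]
output_rows = run_pipeline(input_rows, pipeline_config)
print(output_rows)
```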

---

### Challenge 5: Data Governance

**Problems:**
- Who owns what data?
- Privacy regulations (GDPR, CCPA)
- Data lineage tracking
- Access control

**Solutions:**
- Data catalog (Amundsen, DataHub)
- Lineage tracking tools
- Role-based access control
- Data classification
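
Data classification and role-based access control can work together: columns are tagged with a sensitivity class, and each role is allowed to read only certain classes. The sketch below is a toy illustration; the classifications, the roles, and the `read_row` helper are all hypothetical, and real systems usually enforce this in the warehouse or catalog, not in application code.

```python
# Hypothetical column classifications and role permissions
COLUMN_CLASSIFICATION = {"user_id": "internal", "email": "pii", "country": "public"}
ROLE_ALLOWED_CLASSES = {
    "analyst": {"public", "internal"},
    "privacy_officer": {"public", "internal", "pii"},
}


def read_row(row: dict, role: str) -> dict:
    """Return the row with any column the role may not see replaced by a mask."""
    allowed = ROLE_ALLOWED_CLASSES.get(role, set())
    return {
        # Unclassified columns default to the most restrictive class
        column: (value if COLUMN_CLASSIFICATION.get(column, "pii") in allowed else "***")
        for column, value in row.items()
    }


row = {"user_id": 42, "email": "ana@example.com", "country": "PT"}
print(read_row(row, "analyst"))
print(read_row(row, "privacy_officer"))
```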

---

### Challenge 6: Monitoring and Debugging

**Problems:**
- Pipeline failures hard to diagnose
- No visibility into data flow
- Performance degradation over time

**Solutions:**
- Comprehensive logging
- Metrics and dashboards
- Data quality metrics
- Alerting systems
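
Logging and metrics are easier to keep consistent when every step is wrapped the same way. Below is a minimal sketch using Python's standard `logging` module; the `monitored` decorator, the `step_metrics` list, and the example step are hypothetical, and a real system would send the metrics to a monitoring backend instead of a list.

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("pipeline")

step_metrics = []  # In a real system these would go to a metrics backend


def monitored(step_name):
    """Decorator that records row counts and duration for a pipeline step."""

    def decorator(func):
        @wraps(func)
        def wrapper(rows):
            start = time.perf_counter()
            result = func(rows)
            metric = {
                "step": step_name,
                "rows_in": len(rows),
                "rows_out": len(result),
                "seconds": time.perf_counter() - start,
            }
            step_metrics.append(metric)
            logger.info("step=%s rows_in=%d rows_out=%d", step_name, len(rows), len(result))
            return result

        return wrapper

    return decorator


@monitored("filter_positive")
def filter_positive(rows):
    return [row for row in rows if row > 0]


filtered = filter_positive([5, -1, 3, 0])
print(step_metrics[0]["rows_in"], step_metrics[0]["rows_out"])
```

A sudden change in the ratio of `rows_out` to `rows_in` for a step is a simple signal worth alerting on.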

In [None]:
# Example: Implementing basic data quality checks
import pandas as pd


def validate_data_quality(df, column_rules):
    """
    Validate data quality based on rules

    column_rules: dict of {column_name: {rule_type: rule_value}}
    """
    validation_results = []

    for column, rules in column_rules.items():
        if column not in df.columns:
            validation_results.append(
                {
                    "column": column,
                    "check": "column_exists",
                    "passed": False,
                    "message": f"Column {column} not found",
                }
            )
            continue

        # Check for null values
        if "allow_null" in rules and not rules["allow_null"]:
            null_count = df[column].isnull().sum()
            validation_results.append(
                {
                    "column": column,
                    "check": "no_nulls",
                    "passed": null_count == 0,
                    "message": (
                        f"Found {null_count} null values" if null_count > 0 else "No nulls found"
                    ),
                }
            )

        # Check data type
        if "dtype" in rules:
            expected_type = rules["dtype"]
            actual_type = str(df[column].dtype)
            validation_results.append(
                {
                    "column": column,
                    "check": "data_type",
                    "passed": actual_type.startswith(expected_type),
                    "message": f"Expected {expected_type}, got {actual_type}",
                }
            )

        # Check value range
        if "min_value" in rules:
            min_val = df[column].min()
            passed = min_val >= rules["min_value"]
            validation_results.append(
                {
                    "column": column,
                    "check": "min_value",
                    "passed": passed,
                    "message": (
                        f'Min value {min_val} >= {rules["min_value"]}'
                        if passed
                        else f'Min value {min_val} < {rules["min_value"]}'
                    ),
                }
            )

    return validation_results


# Create sample data with quality issues
data = {
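    "user_id": [1, 2, 3, None, 5],  # Has a null, which also makes pandas store the column as float64
    "age": [25, 30, -5, 40, 150],  # -5 is invalid; 150 is implausible but no max rule is defined below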
    "revenue": [100.5, 200.0, 150.0, 300.0, 250.0],
}

df = pd.DataFrame(data)

# Define quality rules
quality_rules = {
    "user_id": {"allow_null": False, "dtype": "int"},
    "age": {"allow_null": False, "min_value": 0, "dtype": "int"},
    "revenue": {"allow_null": False, "min_value": 0, "dtype": "float"},
}

# Run validation
results = validate_data_quality(df, quality_rules)

# Display results
print("Data Quality Validation Results:")
print("=" * 70)
for result in results:
    status = "[OK] PASS" if result["passed"] else "[FAIL] FAIL"
    print(f"{status} | {result['column']:15} | {result['check']:15} | {result['message']}")

print("\n" + "=" * 70)
total = len(results)
passed = sum(1 for r in results if r["passed"])
print(f"Overall: {passed}/{total} checks passed ({(passed/total*100):.1f}%)")

---

## 6. The Data Engineering Lifecycle

Understanding the full lifecycle helps you see where each task fits:

### Phase 1: Requirements Gathering
- Understand business needs
- Identify data sources
- Define success metrics
- Determine latency requirements

### Phase 2: Design
- Choose architecture (batch/stream)
- Select tech stack
- Design data models
- Plan for scalability

### Phase 3: Implementation
- Build extraction logic
- Implement transformations
- Set up storage layer
- Create data quality checks

### Phase 4: Testing
- Unit tests for transformations
- Integration tests
- Data quality validation
- Performance testing
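
A unit test for a transformation can be very small. The sketch below uses plain `assert` statements and a hypothetical `normalize_country` function; a test runner such as pytest would normally discover the `test_` functions by name, but here they are called directly so the example is self-contained.

```python
def normalize_country(code):
    """Hypothetical transformation: trim and upper-case a country code, None if empty."""
    if code is None:
        return None
    cleaned = code.strip().upper()
    return cleaned or None


def test_normalize_country_trims_and_uppercases():
    assert normalize_country(" pt ") == "PT"


def test_normalize_country_handles_missing_values():
    assert normalize_country(None) is None
    assert normalize_country("   ") is None


# Call the tests directly; with pytest installed they would be discovered by name
test_normalize_country_trims_and_uppercases()
test_normalize_country_handles_missing_values()
print("all transformation tests passed")
```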

### Phase 5: Deployment
- CI/CD pipelines
- Monitoring setup
- Alerting configuration
- Documentation

### Phase 6: Maintenance
- Monitor performance
- Handle incidents
- Optimize queries
- Scale infrastructure

### Phase 7: Evolution
- Add new data sources
- Implement new features
- Refactor for performance
- Adapt to changing needs

---

## 7. Key Takeaways

### What You Learned

[OK] **Data Engineering Role**: Building and maintaining data infrastructure

[OK] **Modern Data Stack**: 7 layers from source to consumption

[OK] **ETL vs ELT**: 
- ETL: Transform before loading (traditional)
- ELT: Load then transform (modern)
- ELT is increasingly the default on cloud warehouses

[OK] **Architectures**:
- Batch: Scheduled processing (most common)
- Streaming: Real-time processing
- Lambda: Hybrid of batch and streaming layers
- Kappa: Everything treated as a stream

[OK] **Challenges**: Quality, scalability, failures, governance

[OK] **Solutions**: Validation, monitoring, modular design, automation

### Important Principles

1. **Start simple, scale as needed** - Don't over-engineer
2. **Data quality is paramount** - Garbage in, garbage out
3. **Design for failure** - Systems will fail, plan for it
4. **Automate everything** - Manual processes don't scale
5. **Monitor and measure** - You can't improve what you don't measure

---

## 8. Practice Questions

Test your understanding:

1. **When would you choose ETL over ELT?**
   - Consider: data privacy, warehouse costs, transformation complexity

2. **What architecture would you use for:**
   - Daily sales reports?
   - Real-time fraud detection?
   - Monthly financial analysis?

3. **How would you handle:**
   - A source system that goes down during extraction?
   - Schema changes in source data?
   - 10x increase in data volume?

4. **What's the difference between:**
   - Data Engineer vs. Data Scientist?
   - Data Warehouse vs. Data Lake?
   - Batch vs. Stream processing?

Think about these - we'll apply the answers in upcoming modules!

---

## 9. Next Steps

Congratulations! You now understand the fundamental concepts of data engineering.

### Coming Up in Module 02

In **Module 02: Data Sources and Extraction**, you'll learn:
- How to extract data from different sources
- Working with files (CSV, JSON, Parquet)
- Connecting to databases
- Calling REST APIs
- Error handling and retries
- Hands-on extraction examples

### Resources for Further Learning

- [The Data Engineering Cookbook](https://github.com/andkret/Cookbook)
- [Fundamentals of Data Engineering (Book)](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/)
- [Data Engineering Weekly Newsletter](https://www.dataengineeringweekly.com/)

---

**Ready to start extracting data?**

Open `02_data_sources_and_extraction.ipynb` to continue!