# Data Engineering Core Concepts

## Overview
This notebook introduces fundamental data engineering concepts essential for working with Databricks. We'll cover architecture patterns, data processing paradigms, and best practices.

## Learning Objectives
- Understand batch vs streaming processing
- Learn data storage architectures (warehouse, lake, lakehouse)
- Master partitioning strategies
- Implement Slowly Changing Dimensions (SCD)
- Understand Medallion architecture

---

## 1. Batch vs Streaming Processing

Understanding when to use batch or streaming is critical for data engineering.

### Batch Processing

**Definition**: Processing large volumes of data at scheduled intervals.

**Use Cases**:
- Daily/weekly/monthly reports
- Historical data analysis
- Data warehouse ETL
- ML model training on historical data

**Example Workflow**:
```
Extract (source) → Transform (business logic) → Load (target)
Runs: Every night at 2 AM
```

**Advantages**:
- Simple to implement and debug
- Cost-effective for large volumes
- Can process entire dataset
- Easier to maintain

**Disadvantages**:
- High latency (results lag by the batch interval, often hours)
- Not suitable for real-time needs
- Resource intensive during batch window

In [None]:
# Batch Processing Example (Pseudo-code)

batch_example = """
from datetime import date, timedelta
from pyspark.sql import functions as F

# Daily batch job - runs at 2 AM
def daily_sales_etl():
    # Extract: read yesterday's partition
    yesterday = (date.today() - timedelta(days=1)).isoformat()
    raw_sales = spark.read.parquet(f"/bronze/sales/date={yesterday}")

    # Transform: apply business logic
    daily_sales = (
        raw_sales
        .filter("amount > 0")
        .withColumn("revenue", F.col("quantity") * F.col("price"))
        .groupBy("product_id", "region")
        .agg(
            F.sum("revenue").alias("total_revenue"),
            F.count("*").alias("order_count"),
        )
        # groupBy keeps only its keys and aggregates, so add the partition column back
        .withColumn("date", F.lit(yesterday))
    )

    # Load: write to gold layer
    daily_sales.write.mode("append").partitionBy("date").save(
        "/gold/daily_sales"
    )

    print(f"Processed {raw_sales.count()} records")
"""

print(batch_example)

### Streaming Processing

**Definition**: Continuous processing of data as it arrives.

**Use Cases**:
- Real-time dashboards
- Fraud detection
- IoT sensor monitoring
- Clickstream analysis
- Real-time alerts

**Example Workflow**:
```
Stream Source → Micro-batch Processing → Stream Sink
Runs: Continuously (every few seconds)
```

**Advantages**:
- Low latency (seconds to minutes)
- Immediate insights
- Handles continuous data
- Steadier resource usage (no large batch-window spikes)

**Disadvantages**:
- More complex to implement
- Harder to debug
- Requires careful state management
- Higher operational overhead
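
The state-management point above can be illustrated without Spark. The following is a minimal sketch, assuming an in-memory list stands in for the stream source and plain dicts stand in for the checkpoint and the running state. `run_micro_batch` and all variable names are hypothetical and not part of any streaming API.

```python
from typing import Dict, List


def run_micro_batch(source: List[int], checkpoint: Dict[str, int],
                    state: Dict[str, int], batch_size: int = 3) -> int:
    """Process the next micro-batch and advance the checkpoint.

    Returns the number of records processed (0 when caught up).
    """
    start = checkpoint.get("offset", 0)
    batch = source[start:start + batch_size]
    # Update the running state (here: a running total) from new records only.
    state["total"] = state.get("total", 0) + sum(batch)
    # Record how far we have read, so a restart resumes after this batch.
    checkpoint["offset"] = start + len(batch)
    return len(batch)


events = [5, 10, 15, 20, 25]  # stand-in for an unbounded stream
checkpoint: Dict[str, int] = {}
state: Dict[str, int] = {}

# Keep pulling micro-batches until the source is drained.
while run_micro_batch(events, checkpoint, state):
    pass

print(state["total"], checkpoint["offset"])
```

In this toy version the state and the offset are updated separately. A real engine has to persist both together, which is a large part of why streaming state is harder to manage than a batch job.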

In [None]:
# Streaming Processing Example (Pseudo-code)

streaming_example = """
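from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# Assumed JSON payload (illustrative); adjust to the real message format
transaction_schema = StructType([
    StructField("transaction_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("location_change_speed", DoubleType()),
])

# Real-time fraud detection stream
def fraud_detection_stream():
    # Read stream from Kafka/Event Hub
    transactions_stream = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # placeholder address
        .option("subscribe", "transactions")
        .load()
        # Kafka delivers the payload as bytes in "value"; parse it into columns
        .select(F.from_json(F.col("value").cast("string"), transaction_schema).alias("t"))
        .select("t.*")
    )

    # Transform: apply fraud rules in real time
    fraud_checks = (
        transactions_stream
        .withColumn(
            "is_suspicious",
            (F.col("amount") > 10000) | (F.col("location_change_speed") > 500),
        )
        .filter(F.col("is_suspicious"))
    )

    # Write stream to Delta table
    query = (
        fraud_checks.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "/checkpoints/fraud")
        .start("/alerts/fraud_detected")
    )

    return query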
"""

print(streaming_example)

### Choosing Batch vs Streaming

| Factor | Batch | Streaming |
|--------|-------|----------|
| **Latency** | Hours to days | Seconds to minutes |
| **Complexity** | Low | High |
| **Cost** | Lower (scheduled) | Higher (always on) |
| **Use Case** | Historical analysis | Real-time insights |
| **Data Volume** | Large volumes | Continuous flow |
| **Debugging** | Easy | Challenging |

**Hybrid Approach**: Lambda Architecture
- Batch layer: Complete, accurate historical data
- Speed layer: Real-time approximate data
- Serving layer: Combines both views
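
The three Lambda layers can be sketched in plain Python. This is a minimal illustration, assuming events are `(product_id, amount)` tuples held in memory. The function names `aggregate` and `serve` are hypothetical, and a real speed layer would update its view incrementally rather than re-aggregating.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # (product_id, amount)


def aggregate(events: List[Event]) -> Dict[str, int]:
    """Sum amounts per product; used here for both the batch and speed views."""
    totals: Dict[str, int] = {}
    for product_id, amount in events:
        totals[product_id] = totals.get(product_id, 0) + amount
    return totals


def serve(batch_view: Dict[str, int], speed_view: Dict[str, int]) -> Dict[str, int]:
    """Serving layer: combine the batch view with the speed view."""
    merged = dict(batch_view)
    for product_id, amount in speed_view.items():
        merged[product_id] = merged.get(product_id, 0) + amount
    return merged


history = [("A", 100), ("B", 50), ("A", 25)]  # covered by the last batch run
recent = [("A", 10), ("C", 5)]                # arrived since the last batch run

merged = serve(aggregate(history), aggregate(recent))
print(merged)
```

When the next batch run absorbs the recent events, the speed view is reset, so the serving layer never counts an event twice.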

## 2. Data Storage Architectures

Evolution from data warehouses to modern lakehouses.

### Data Warehouse

**Traditional structured data storage**

**Characteristics**:
- Primarily structured data (tables, schemas)
- SQL-based queries
- ETL before storage (schema-on-write)
- Expensive scaling
- ACID transactions

**Examples**: Snowflake, Redshift, BigQuery

**Pros**:
- ✅ Fast queries on structured data
- ✅ Strong consistency
- ✅ Well-established tools

**Cons**:
- ❌ Poor fit for unstructured data (images, audio, free text)
- ❌ Expensive for large volumes
- ❌ Limited flexibility

```
Data Warehouse Architecture:

Source Systems → ETL → Data Warehouse → BI Tools
                       (Structured only)
```

### Data Lake

**Store all data in raw format**

**Characteristics**:
- All data types (structured, semi-structured, unstructured)
- Schema-on-read (define structure when reading)
- Cost-effective storage (object storage)
- No ACID guarantees on plain files (table formats such as Delta Lake add them later)

**Examples**: S3, ADLS, GCS with Parquet/CSV/JSON

**Pros**:
- ✅ Handles any data type
- ✅ Very cost-effective
- ✅ Scalable storage
- ✅ Good for ML/AI

**Cons**:
- ❌ No ACID transactions (traditional lakes)
- ❌ Can become "data swamp"
- ❌ Slower queries
- ❌ Data quality issues

```
Data Lake Architecture:

All Sources → Data Lake → Multiple Tools
              (All types)  (Spark, Python, SQL, ML)
```
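
The difference between schema-on-write (warehouse) and schema-on-read (lake) can be shown with a small sketch. It assumes a two-column target schema and JSON lines as the raw format. `SCHEMA`, `write_with_schema`, and `read_with_schema` are hypothetical names chosen for illustration.

```python
import json
from typing import Any, Dict, List

SCHEMA = {"order_id": int, "amount": float}  # hypothetical target schema


def write_with_schema(table: List[Dict[str, Any]], record: Dict[str, Any]) -> None:
    """Schema-on-write: reject records that do not match before storing them."""
    for column, expected_type in SCHEMA.items():
        if not isinstance(record.get(column), expected_type):
            raise ValueError(f"column {column!r} must be {expected_type.__name__}")
    table.append(record)


def read_with_schema(raw_lines: List[str]) -> List[Dict[str, Any]]:
    """Schema-on-read: store anything, apply types only when reading."""
    rows = []
    for line in raw_lines:
        data = json.loads(line)
        rows.append({"order_id": int(data["order_id"]), "amount": float(data["amount"])})
    return rows


# Warehouse style: the record is validated at write time.
warehouse_table: List[Dict[str, Any]] = []
write_with_schema(warehouse_table, {"order_id": 1, "amount": 9.5})

# Lake style: the raw line is stored as-is and typed when it is read.
lake_files = ['{"order_id": "2", "amount": "12.0", "note": "extra field"}']
print(read_with_schema(lake_files))
```

The lake accepts the string-typed record with its extra field. The cost shows up at read time, where every consumer has to agree on how to interpret it.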

### Lakehouse (Databricks Architecture)

**Best of both worlds: Warehouse + Lake**

**Characteristics**:
- Built on data lake (cost-effective storage)
- ACID transactions (Delta Lake)
- Schema enforcement + evolution
- Unified for BI, ML, and streaming
- Time travel, versioning

**Technology**: Delta Lake on cloud storage

**Pros**:
- ✅ Cost of data lake
- ✅ Performance of warehouse
- ✅ ACID transactions
- ✅ Handles all data types
- ✅ Single platform for all workloads

```
Lakehouse Architecture:

                    Lakehouse (Delta Lake)
                           |
                    Cloud Storage
                  (S3, ADLS, GCS)
                           |
         ┌─────────────────┼─────────────────┐
      BI Tools         ML/AI            Streaming
```
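
Versioning and time travel follow from keeping an ordered log of commits. The sketch below is a toy model of that idea only, not Delta Lake's actual implementation. `VersionedTable` and its methods are hypothetical names, and each commit simply appends rows.

```python
from typing import Dict, List


class VersionedTable:
    """Toy append-only commit log; each commit creates one table version."""

    def __init__(self) -> None:
        self._commits: List[List[Dict]] = []

    def commit(self, rows: List[Dict]) -> int:
        """Add a batch of rows as a single commit and return its version number."""
        self._commits.append(list(rows))
        return len(self._commits) - 1

    def read(self, version: int = -1) -> List[Dict]:
        """Return the table as of `version` (default: latest)."""
        if version < 0:
            version = len(self._commits) - 1
        snapshot: List[Dict] = []
        for rows in self._commits[: version + 1]:
            snapshot.extend(rows)
        return snapshot


table = VersionedTable()
v0 = table.commit([{"id": 1}])
v1 = table.commit([{"id": 2}, {"id": 3}])

# Latest version has three rows; version 0 still has one.
print(len(table.read()), len(table.read(version=v0)))
```

Because readers replay the log up to a chosen version, earlier snapshots stay queryable after new data arrives.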

In [None]:
# Comparison table
comparison = """
| Feature | Warehouse | Lake | Lakehouse |
|---------|-----------|------|------------|
| Data Types | Structured | All | All |
| ACID | Yes | No | Yes |
| Cost | High | Low | Low |
| Performance | High | Medium | High |
| Schema | On-write | On-read | Both |
| BI | Excellent | Poor | Excellent |
| ML/AI | Limited | Excellent | Excellent |
| Governance | Strong | Weak | Strong |
| Examples | Snowflake | S3+Parquet | Delta Lake |
"""

print(comparison)

## 3. Partitioning Strategies

Partitioning organizes data for efficient querying and processing.

### Why Partition?

**Benefits**:
1. **Query Performance**: Skip irrelevant data
2. **Parallel Processing**: Process partitions in parallel
3. **Data Management**: Easy to update/delete specific partitions
4. **Cost**: Read less data = lower cost

### Common Partitioning Strategies

#### 1. Date/Time Partitioning (Most Common)
```
# Partition by date
/data/transactions/
    date=2024-01-01/
    date=2024-01-02/
    date=2024-01-03/

# Hierarchical date partitioning
/data/transactions/
    year=2024/
        month=01/
            day=01/
```

**Best for**: Time-series data, logs, events

**Query example**: `WHERE date >= '2024-01-01'`

#### 2. Category Partitioning
```
/data/sales/
    region=US/
    region=EU/
    region=APAC/
```

**Best for**: Geographic data, product categories

**Query example**: `WHERE region = 'US'`

#### 3. Hybrid Partitioning
```
/data/orders/
    year=2024/
        month=01/
            region=US/
            region=EU/
```

**Best for**: Multiple filter dimensions

**Query example**: `WHERE year=2024 AND region='US'`

In [None]:
# Partitioning examples in PySpark

partitioning_code = """
# Write data with date partitioning
(
    df.write
    .format("delta")
    .mode("append")
    .partitionBy("date")
    .save("/mnt/data/transactions")
)

# Multiple partition columns
(
    df.write
    .format("delta")
    .mode("append")
    .partitionBy("year", "month", "region")
    .save("/mnt/data/sales")
)

# Query leveraging partitions
df = (
    spark.read.format("delta")
    .load("/mnt/data/transactions")
    .filter("date >= '2024-01-01' AND date < '2024-02-01'")
)
# Partition pruning: only the January 2024 date partitions are read
"""

print(partitioning_code)
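
Partition pruning itself is simple to picture: the engine compares the filter against the `key=value` directory names and never opens the directories that cannot match. The sketch below imitates that with plain strings. `parse_partition` and `prune` are hypothetical helpers, and the paths are the illustrative layouts used earlier in this section.

```python
from typing import Dict, List


def parse_partition(path: str) -> Dict[str, str]:
    """Extract key=value segments from a Hive-style partition path."""
    return dict(segment.split("=", 1) for segment in path.split("/") if "=" in segment)


def prune(paths: List[str], **filters: str) -> List[str]:
    """Keep only the paths whose partition values match every filter."""
    return [
        path for path in paths
        if all(parse_partition(path).get(key) == value for key, value in filters.items())
    ]


paths = [
    "/data/orders/year=2024/month=01/region=US",
    "/data/orders/year=2024/month=01/region=EU",
    "/data/orders/year=2023/month=12/region=US",
]

# Equivalent in spirit to: WHERE year = 2024 AND region = 'US'
print(prune(paths, year="2024", region="US"))
```

A filter on a column that is not part of the path cannot be pruned this way, which is why partition columns should match the most common filters.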

### Partitioning Best Practices

✅ **DO**:
- Partition by commonly filtered columns
- Use date for time-series data
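- Keep files in the 100 MB to 1 GB range, and aim for at least about 1 GB of data per partition (Databricks' guidance for Delta tables)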
- Limit partition columns (1-3 usually)

❌ **DON'T**:
- Over-partition (too many small files)
- Partition by high-cardinality columns (e.g., user_id)
- Use too many partition levels (>3)
- Partition small datasets (<1GB)
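
A quick back-of-the-envelope check makes the high-cardinality warning concrete. The sketch below assumes evenly distributed data. The table size and column cardinalities are made-up numbers for illustration, and `average_partition_mb` is a hypothetical helper.

```python
def average_partition_mb(table_size_gb: float, distinct_values: int) -> float:
    """Average partition size in MB, assuming evenly distributed data."""
    return table_size_gb * 1024 / distinct_values


table_size_gb = 500.0  # hypothetical table size

# Hypothetical cardinalities for two candidate partition columns.
candidates = {"date": 365, "user_id": 2_000_000}

for column, distinct_values in candidates.items():
    size_mb = average_partition_mb(table_size_gb, distinct_values)
    print(f"{column}: {distinct_values} partitions, ~{size_mb:.2f} MB each")
```

Under these assumptions the date column yields partitions of more than a gigabyte, while `user_id` yields millions of partitions well under a megabyte each, which is the small-files problem listed above.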

## 4. Slowly Changing Dimensions (SCD)

Handling changes in dimensional data over time.

### SCD Type 1: Overwrite

**Strategy**: Overwrite old values, no history

**Use Case**: Corrections, unimportant changes

**Example**: Fix customer email typo

```
Before: customer_id=1, email='old@email.com'
After:  customer_id=1, email='new@email.com'
```

In [None]:
# SCD Type 1 implementation

scd_type1 = """
-- Upsert: overwrite matching customers in place, insert new ones
MERGE INTO customers target
USING updates source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET target.email = source.email,
             target.phone = source.phone,
             target.updated_at = current_timestamp()
WHEN NOT MATCHED THEN
  INSERT *;
"""

print(scd_type1)

### SCD Type 2: Track History

**Strategy**: Keep full history with effective dates

**Use Case**: Track changes over time (price history, address changes)

**Implementation**: Add surrogate key, effective dates, current flag

```
Table: customer_history
| surrogate_key | customer_id | address | effective_from | effective_to | is_current |
|---------------|-------------|---------|----------------|--------------|------------|
| 1 | 101 | "123 Old St" | 2023-01-01 | 2024-03-01 | False |
| 2 | 101 | "456 New Ave" | 2024-03-01 | 9999-12-31 | True |
```
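
The close-and-insert pattern behind the table above can be sketched in plain Python before looking at the SQL. This is a minimal illustration that assumes the history fits in a list of dicts. `apply_scd2` and `OPEN_END` are hypothetical names, and the surrogate key is simply the next integer.

```python
from datetime import date
from typing import Dict, List

OPEN_END = date(9999, 12, 31)  # sentinel effective_to for the current row


def apply_scd2(history: List[Dict], customer_id: int, new_address: str,
               change_date: date) -> None:
    """Close the customer's current row and append a new current row."""
    for row in history:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["address"] == new_address:
                return  # nothing changed, keep history as is
            row["effective_to"] = change_date
            row["is_current"] = False
    history.append({
        "surrogate_key": max((r["surrogate_key"] for r in history), default=0) + 1,
        "customer_id": customer_id,
        "address": new_address,
        "effective_from": change_date,
        "effective_to": OPEN_END,
        "is_current": True,
    })


history = [{
    "surrogate_key": 1, "customer_id": 101, "address": "123 Old St",
    "effective_from": date(2023, 1, 1), "effective_to": OPEN_END, "is_current": True,
}]
apply_scd2(history, 101, "456 New Ave", date(2024, 3, 1))

for row in history:
    print(row["surrogate_key"], row["address"], row["effective_to"], row["is_current"])
```

The early return makes the operation safe to repeat: applying the same address twice leaves the history unchanged.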

In [None]:
# SCD Type 2 implementation

scd_type2 = """
-- Close current record and insert new one

-- Step 1: Close expired records
UPDATE customer_history
SET effective_to = '2024-03-01',
    is_current = False
WHERE customer_id = 101
  AND is_current = True;

-- Step 2: Insert new record
INSERT INTO customer_history VALUES (
  2,  -- new surrogate key
  101,  -- same customer_id
  '456 New Ave',  -- new address
  '2024-03-01',  -- effective_from
  '9999-12-31',  -- effective_to (future)
  True  -- is_current
);

-- Query current state
SELECT * FROM customer_history WHERE is_current = True;

-- Query historical state
-- (half-open interval, so the change date itself matches only the new row)
SELECT * FROM customer_history
WHERE customer_id = 101
  AND effective_from <= '2023-06-15'
  AND '2023-06-15' < effective_to;
"""

print(scd_type2)

### SCD Type 3: Track Limited History

**Strategy**: Keep previous value in separate column

**Use Case**: Track one previous value only

```
| customer_id | current_address | previous_address | change_date |
|-------------|----------------|------------------|-------------|
| 101 | "456 New Ave" | "123 Old St" | 2024-03-01 |
```
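
A Type 3 update shifts the current value into the "previous" column before overwriting it. The sketch below shows this on a single dict row. `apply_scd3` is a hypothetical helper, and the column names follow the table above.

```python
from datetime import date
from typing import Dict


def apply_scd3(row: Dict, new_address: str, change_date: date) -> Dict:
    """Shift the current value into previous_address and store the new one."""
    if row["current_address"] == new_address:
        return row  # no change
    return {
        **row,
        "previous_address": row["current_address"],
        "current_address": new_address,
        "change_date": change_date,
    }


customer = {"customer_id": 101, "current_address": "123 Old St",
            "previous_address": None, "change_date": None}
customer = apply_scd3(customer, "456 New Ave", date(2024, 3, 1))
print(customer)
```

A second change would overwrite `previous_address` again, so anything older than one step back is lost. That is the trade-off against Type 2.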

## 5. Medallion Architecture (Bronze-Silver-Gold)

Databricks' recommended pattern for organizing data in a lakehouse.

### Architecture Overview

```
Source Systems → BRONZE → SILVER → GOLD → Analytics/ML
                 (Raw)   (Cleaned) (Aggregated)
```

### Bronze Layer (Raw)

**Purpose**: Store raw, unprocessed data

**Characteristics**:
- Exact copy of source data
- All data types (structured, semi-structured, unstructured)
- Append-only (immutable history)
- Includes ingestion metadata
- No business logic applied

**Example Structure**:
```
/bronze/
  /orders_raw/
  /customers_raw/
  /clickstream_raw/
```

**Schema**:
```
- source_file (string)
- ingestion_timestamp (timestamp)
- raw_data (string/binary)
```

In [None]:
# Bronze layer example

bronze_code = """
# Ingest raw data to bronze
from pyspark.sql.functions import current_date, current_timestamp, input_file_name

bronze_df = (
    spark.read
    .format("json")
    .load("/source/orders/*.json")
    .withColumn("ingestion_timestamp", current_timestamp())
    .withColumn("ingestion_date", current_date())  # partition column used below
    .withColumn("source_file", input_file_name())
)

# Write to bronze (append-only)
(
    bronze_df.write
    .format("delta")
    .mode("append")
    .partitionBy("ingestion_date")
    .save("/bronze/orders_raw")
)
"""

print(bronze_code)

### Silver Layer (Cleaned/Conformed)

**Purpose**: Cleaned, validated, enriched data

**Characteristics**:
- Data quality checks applied
- Standardized formats
- Deduplication
- Schema enforcement
- Joins with dimension tables
- Still detailed/granular

**Transformations**:
- Remove duplicates
- Fix data types
- Handle nulls
- Standardize formats (dates, phone numbers)
- Add business keys
- Apply data quality rules

**Example Structure**:
```
/silver/
  /orders/
  /customers/
  /products/
```

In [None]:
# Silver layer example

silver_code = """
# Transform bronze to silver
from pyspark.sql.functions import col, month, to_date, trim, upper, year

# Read from bronze
bronze_df = spark.read.format("delta").load("/bronze/orders_raw")

# Clean and transform
silver_df = (
    bronze_df
    # Remove duplicates
    .dropDuplicates(["order_id"])
    
    # Fix data types
    .withColumn("order_date", to_date(col("order_date")))
    .withColumn("amount", col("amount").cast("decimal(10,2)"))
    
    # Data quality: filter invalid records
    .filter(col("order_id").isNotNull())
    .filter(col("amount") > 0)
    
    # Standardize
    .withColumn("country_code", upper(trim(col("country_code"))))
    
    # Add derived columns
    .withColumn("year", year(col("order_date")))
    .withColumn("month", month(col("order_date")))
    
    # Select final columns
    .select(
        "order_id", "customer_id", "product_id",
        "order_date", "amount", "country_code",
        "year", "month"
    )
)

# Write to silver
(
    silver_df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("year", "month")
    .save("/silver/orders")
)
"""

print(silver_code)

### Gold Layer (Business-Level Aggregates)

**Purpose**: Business-ready aggregated data

**Characteristics**:
- Aggregated metrics
- Business logic applied
- Optimized for reporting/analytics
- Denormalized for performance
- Feature tables for ML

**Examples**:
- Daily/Monthly sales summaries
- Customer lifetime value
- Product analytics
- KPI dashboards
- ML feature stores

**Example Structure**:
```
/gold/
  /daily_sales_by_region/
  /customer_ltv/
  /product_performance/
  /ml_features/customer_features/
```

In [None]:
# Gold layer example

gold_code = """
# Aggregate silver to gold
from pyspark.sql import functions as F  # avoids shadowing built-in sum/max/min

# Read from silver
orders_df = spark.read.format("delta").load("/silver/orders")
customers_df = spark.read.format("delta").load("/silver/customers")

# Create business-level aggregate: daily sales by region
# (assumes /silver/customers carries a "region" column)
daily_sales = (
    orders_df
    .join(customers_df, "customer_id")
    .groupBy("order_date", "region")
    .agg(
        F.count("*").alias("total_orders"),
        F.sum("amount").alias("total_revenue"),
        F.avg("amount").alias("avg_order_value"),
        # countDistinct, not count: count would equal total_orders
        F.countDistinct("customer_id").alias("unique_customers"),
    )
    .withColumn("revenue_per_customer",
                F.col("total_revenue") / F.col("unique_customers"))
)

# Write to gold
(
    daily_sales.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .save("/gold/daily_sales_by_region")
)

# ML feature table
customer_features = (
    orders_df
    .groupBy("customer_id")
    .agg(
        F.count("*").alias("order_count"),
        F.sum("amount").alias("total_spent"),
        F.avg("amount").alias("avg_order_value"),
        F.max("order_date").alias("last_order_date"),
        F.min("order_date").alias("first_order_date"),
    )
    .withColumn("customer_lifetime_days",
                F.datediff(F.col("last_order_date"), F.col("first_order_date")))
)

(
    customer_features.write
    .format("delta")
    .mode("overwrite")
    .save("/gold/ml_features/customer_features")
)
"""

print(gold_code)

### Medallion Architecture Benefits

- ✅ **Clear Data Lineage**: Easy to trace data flow
- ✅ **Incremental Processing**: Process only new data
- ✅ **Data Quality**: Progressive refinement
- ✅ **Flexibility**: Support multiple use cases
- ✅ **Recovery**: Can rebuild downstream from upstream
- ✅ **Governance**: Clear ownership and policies per layer

### Best Practices

1. **Bronze**: Keep forever (immutable history)
2. **Silver**: Apply quality checks, not business logic
3. **Gold**: Optimize for specific use cases
4. **Idempotency**: Re-running should produce same results
5. **Documentation**: Document transformations at each layer
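
The idempotency point is easiest to see by running the same load twice. The sketch below models a table as a dict of partitions and compares a blind append with a partition overwrite. `append_load` and `overwrite_partition_load` are hypothetical helpers, not Spark or Delta APIs.

```python
from typing import Dict, List

Table = Dict[str, List[dict]]  # partition value -> rows


def append_load(table: Table, partition: str, rows: List[dict]) -> None:
    """Not idempotent: a re-run duplicates the partition's rows."""
    table.setdefault(partition, []).extend(rows)


def overwrite_partition_load(table: Table, partition: str, rows: List[dict]) -> None:
    """Idempotent: a re-run replaces the partition with the same rows."""
    table[partition] = list(rows)


rows = [{"order_id": 1}, {"order_id": 2}]

appended: Table = {}
replaced: Table = {}
for _ in range(2):  # simulate running the same job twice
    append_load(appended, "2024-01-01", rows)
    overwrite_partition_load(replaced, "2024-01-01", rows)

print(len(appended["2024-01-01"]), len(replaced["2024-01-01"]))
```

On Delta tables, comparable effects are usually achieved with a keyed `MERGE` or an overwrite restricted by the `replaceWhere` option. Which one fits depends on the table, so treat this as a starting point rather than a recipe.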

## Summary

In this notebook, you learned:

- ✅ Batch vs Streaming processing paradigms
- ✅ Evolution from Data Warehouse → Data Lake → Lakehouse
- ✅ Partitioning strategies for performance
- ✅ Slowly Changing Dimensions (SCD Types 1, 2, 3)
- ✅ Medallion Architecture (Bronze-Silver-Gold)

## Next Steps

1. Review the concepts and examples
2. Think about how to apply these to your use cases
3. Move to [04-Environment-Setup.ipynb](./04-Environment-Setup.ipynb)
4. Then proceed to Module 2 for hands-on Databricks

## Additional Resources

- [Databricks Lakehouse Platform](https://www.databricks.com/product/data-lakehouse)
- [Medallion Architecture Guide](https://www.databricks.com/glossary/medallion-architecture)
- [Delta Lake Documentation](https://docs.delta.io/)