
---

# **CHAPTER 20: DATA ENGINEERING FOR ML**

*Building Robust Data Pipelines for Production AI*

## **Chapter Overview**

Machine learning models are only as reliable as the data pipelines feeding them. This chapter bridges the gap between raw data and ML-ready datasets, covering the engineering practices required to build scalable, maintainable, and observable data infrastructure. You will learn to design ETL/ELT pipelines, implement feature stores, validate data quality, and handle both batch and streaming data at scale.

**Estimated Time:** 35-45 hours (3 weeks)  
**Prerequisites:** Chapter 19 (ML System Design), Chapter 5 (Data Preprocessing), familiarity with SQL and Python

---

## **20.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Architect batch and streaming data pipelines using modern orchestration tools (Airflow, Dagster, Prefect)
2. Design data storage strategies across data lakes, warehouses, and lakehouses (Delta Lake, Iceberg)
3. Implement production-grade feature stores ensuring training-serving consistency
4. Validate data quality through automated schema validation, anomaly detection, and drift monitoring
5. Handle streaming data with exactly-once semantics for real-time feature computation
6. Optimize data pipelines for cost, latency, and fault tolerance

---

## **20.1 Data Pipelines**

#### **20.1.1 ETL vs. ELT Architectures**

**ETL (Extract-Transform-Load):** Transforms data before loading it into the warehouse. Best for heavy transformations, data privacy (PII masking), and strict schema enforcement.

**ELT (Extract-Load-Transform):** Loads raw data first, then transforms it inside the warehouse. This leverages the scalability of cloud warehouses (BigQuery, Snowflake) and keeps raw data available for replay when new features are needed.

```
ETL Flow:  Source → [Transform] → Warehouse → Analytics
ELT Flow:  Source → Warehouse → [Transform] → Analytics
```

**Hybrid Approach (Modern ML):**
- Raw ingestion: ELT to data lake (S3/Data Lakehouse)
- Feature engineering: ETL to feature store (validated, aggregated)
- Training datasets: ELT from feature store (point-in-time joins)
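
The difference between the two flows can be sketched with an in-memory SQLite database standing in for the warehouse. Everything here is illustrative: the `mask` helper, the table names, and the sample events are assumptions, and truncated hashing is only a stand-in for a real PII-masking policy.

```python
import hashlib
import sqlite3

raw_events = [
    {"email": "ana@example.com", "amount": 30.0},
    {"email": "ana@example.com", "amount": 12.5},
    {"email": "bo@example.com", "amount": 99.0},
]

def mask(email: str) -> str:
    """Replace an email with a short, stable pseudonym (illustrative only)."""
    return hashlib.sha256(email.encode()).hexdigest()[:12]

conn = sqlite3.connect(":memory:")  # stands in for the warehouse

# ETL: transform (mask PII) in the pipeline, then load the cleaned rows.
conn.execute("CREATE TABLE etl_events (user_key TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO etl_events VALUES (?, ?)",
    [(mask(e["email"]), e["amount"]) for e in raw_events],
)

# ELT: load raw rows untouched, then transform with SQL inside the warehouse.
conn.execute("CREATE TABLE raw_events (email TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [(e["email"], e["amount"]) for e in raw_events],
)
conn.execute(
    """
    CREATE TABLE elt_user_spend AS
    SELECT email, SUM(amount) AS total_spend, COUNT(*) AS txn_count
    FROM raw_events
    GROUP BY email
    """
)

etl_columns = [row[1] for row in conn.execute("PRAGMA table_info(etl_events)")]
elt_rows = conn.execute(
    "SELECT email, total_spend, txn_count FROM elt_user_spend ORDER BY email"
).fetchall()
```

In the ETL path the warehouse never sees the raw email, while the ELT path keeps it, so new features can later be derived from `raw_events` without re-ingesting the source.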

#### **20.1.2 Orchestration with Apache Airflow**

Airflow models a pipeline as a Directed Acyclic Graph (DAG) that defines task dependencies and an execution schedule.

```python
# airflow/dags/ml_feature_pipeline.py
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator
from datetime import datetime, timedelta

# extract_from_kafka, compute_user_features, and validate_feature_schema are
# project-specific callables assumed to be defined or imported elsewhere.

default_args = {
    'owner': 'ml-engineering',
    'depends_on_past': True,  # A task runs only if its previous scheduled run succeeded
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'sla': timedelta(hours=2)
}

with DAG(
    'user_features_daily',
    default_args=default_args,
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=['features', 'production']
) as dag:
    
    extract = PythonOperator(
        task_id='extract_raw_events',
        python_callable=extract_from_kafka,
        op_kwargs={'date': '{{ ds }}'}
    )
    
    transform = PythonOperator(
        task_id='compute_aggregates',
        python_callable=compute_user_features,
        op_kwargs={'execution_date': '{{ ds }}'}
    )
    
    load = S3ToRedshiftOperator(
        task_id='load_to_feature_store',
        s3_bucket='ml-features',
        s3_key='user_features/{{ ds }}/',
        schema='features',
        table='user_aggregates',
        copy_options=["FORMAT AS PARQUET"]
    )
    
    validate = PythonOperator(
        task_id='validate_schema',
        python_callable=validate_feature_schema
    )
    
    extract >> transform >> load >> validate
```

**Key Concepts:**
- **Idempotency:** Running the same task twice produces identical results (crucial for backfills)
- **SLA Monitoring:** Alerts when pipelines exceed time limits
- **Backfills:** Reprocessing historical data after code changes or schema updates
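
Idempotency is easiest to see with a partition-overwrite write. The sketch below is a minimal illustration using local JSON files; `write_partition`, the `ds=` directory layout, and the file name are assumptions, not an Airflow API.

```python
import json
import tempfile
from pathlib import Path

def write_partition(base_dir: Path, ds: str, rows: list) -> Path:
    """Overwrite the partition for date `ds` so reruns leave the same state."""
    partition = base_dir / f"ds={ds}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / "part-0000.json"
    tmp = partition / "part-0000.json.tmp"
    # Sort rows so the output does not depend on input order.
    tmp.write_text(json.dumps(sorted(rows, key=lambda r: r["user_id"])))
    tmp.replace(target)  # rename over the old file instead of appending to it
    return target

base = Path(tempfile.mkdtemp())
rows = [{"user_id": 2, "clicks": 5}, {"user_id": 1, "clicks": 3}]

first = write_partition(base, "2024-01-01", rows).read_text()
second = write_partition(base, "2024-01-01", rows).read_text()  # simulated backfill
files = sorted(p.name for p in (base / "ds=2024-01-01").iterdir())
```

Because each run replaces the whole partition rather than appending to it, a retry or backfill for the same date cannot create duplicate rows.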

#### **20.1.3 Modern Orchestrators: Dagster & Prefect**

**Dagster (Asset-based):**
```python
# dagster/assets.py
from dagster import asset, Definitions
import pandas as pd

@asset(group_name="user_features")
def raw_clickstream() -> pd.DataFrame:
    """Raw click events from Kafka"""
    return pd.read_parquet("s3://raw-data/clicks/")

@asset(group_name="user_features")
def user_session_features(raw_clickstream) -> pd.DataFrame:
    """Aggregated session metrics"""
    return raw_clickstream.groupby('user_id').agg({
        'session_duration': 'sum',
        'page_views': 'count'
    })

defs = Definitions(assets=[raw_clickstream, user_session_features])
```
- **Data Awareness:** Understands data lineage (which assets depend on which)
- **Partitioning:** Handle incremental updates (daily/hourly partitions)
- **Type Safety:** Runtime type checking of data assets
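
Lineage awareness amounts to keeping a dependency graph of assets. The following standard-library sketch shows the idea; it is not Dagster's implementation, and `training_dataset` is a hypothetical downstream asset added for illustration.

```python
from graphlib import TopologicalSorter

# asset -> set of upstream assets it depends on
lineage = {
    "raw_clickstream": set(),
    "user_session_features": {"raw_clickstream"},
    "training_dataset": {"user_session_features"},  # hypothetical downstream asset
}

def materialization_order(graph):
    """Return assets in an order where every upstream asset comes first."""
    return list(TopologicalSorter(graph).static_order())

def downstream_of(graph, asset):
    """All assets that must be recomputed when `asset` changes."""
    affected = set()
    frontier = [asset]
    while frontier:
        current = frontier.pop()
        for name, upstream in graph.items():
            if current in upstream and name not in affected:
                affected.add(name)
                frontier.append(name)
    return affected

order = materialization_order(lineage)
stale = downstream_of(lineage, "raw_clickstream")
```

With the graph in hand, an orchestrator can decide both what to run first and which assets are stale after an upstream change.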

**Prefect (Modern Python):**
- Dynamic DAG generation (tasks created at runtime based on data)
- Async-native (flows and tasks can be written with `async`/`await`)
- Hybrid mode (local debugging with cloud scaling)

---

## **20.2 Data Storage Architectures**

#### **20.2.1 Data Lakes vs. Warehouses vs. Lakehouses**

| Feature | Data Lake (S3/GCS) | Data Warehouse (Snowflake/BQ) | Lakehouse (Delta/Iceberg) |
|---------|-------------------|------------------------------|---------------------------|
| **Format** | Raw (JSON, CSV, Parquet) | Optimized proprietary | Open formats (Parquet) + Metadata layer |
| **Schema** | Schema-on-read | Schema-on-write | Schema enforcement + evolution |
| **ACID** | No | Yes | Yes (transactions) |
| **ML Workloads** | Direct access | Export required | Direct access from Spark and ML frameworks |
| **Time Travel** | No | Limited | Full version history |

#### **20.2.2 Delta Lake for ML**

Open-source storage layer bringing ACID transactions to data lakes.

```python
# Writing features to Delta Lake (schema is enforced on write by default)
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = SparkSession.builder \
    .appName("ML-Features") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

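# Full refresh: overwrite the table and replace its schema. overwriteSchema bypasses
# schema enforcement; for additive changes, prefer mode("append") with mergeSchema.
# df_features is assumed to be an existing Spark DataFrame of computed features.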
df_features.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("s3://ml-features/user-features")

# Time travel: read the table as of a fixed timestamp for reproducible training data
df_yesterday = spark.read \
    .format("delta") \
    .option("timestampAsOf", "2024-01-01T00:00:00Z") \
    .load("s3://ml-features/user-features")

# Optimize for ML reads (Z-ordering by common join keys)
spark.sql("""
    OPTIMIZE delta.`s3://ml-features/user-features` 
    ZORDER BY (user_id)
""")
```

**Key Features:**
- **ACID:** Concurrent writes from multiple pipelines without corruption
- **Time Travel:** Reproduce exact training dataset state from 3 months ago
- **Schema Evolution:** Add new features without breaking existing pipelines
- **Optimize:** Compaction of small files for efficient ML reads

#### **20.2.3 Feature Stores: Feast & Tecton**

Centralized storage for feature vectors with offline/online duality.

```python
# feast/feature_repo/features.py
# Written against the Field/schema API of recent Feast releases; older
# releases used Feature objects and entity names instead.
from feast import Entity, FeatureView, Field, FileSource, ValueType
from feast.types import Float32, Int64
from datetime import timedelta

# Define entity (primary key)
user = Entity(name="user_id", join_keys=["user_id"], value_type=ValueType.INT64)

# Source of raw data (offline store)
user_stats_source = FileSource(
    path="s3://feast-data/user_stats.parquet",
    timestamp_field="event_timestamp"
)

# Feature definition
user_stats_view = FeatureView(
    name="user_transaction_stats",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="total_spend_30d", dtype=Float32),
        Field(name="transaction_count_7d", dtype=Int64),
        Field(name="avg_transaction_amount", dtype=Float32)
    ],
    source=user_stats_source,
    online=True  # Materialize to Redis/DynamoDB
)

# materialize.py - Sync offline to online store
from feast import FeatureStore
from datetime import datetime

store = FeatureStore(repo_path=".")
store.materialize(
    start_date=datetime(2024, 1, 1),
    end_date=datetime.now()
)
```

**Training-Serving Consistency:**
- **Point-in-Time Joins:** Retrieve feature values as they existed at prediction time (preventing data leakage)
- **Online Store:** Redis/DynamoDB for <10ms retrieval
- **Offline Store:** Data warehouse for batch training data generation

---

## **20.3 Data Validation & Quality**

#### **20.3.1 Great Expectations Framework**

Declarative data validation with automated documentation.

```python
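# validate_data.py
# NOTE: Great Expectations APIs differ substantially between 0.x and 1.x releases;
# pin a version and adapt the suite and checkpoint calls below to match it.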
import great_expectations as gx

context = gx.get_context()

# Create expectation suite
suite = context.add_expectation_suite(expectation_suite_name="user_features_validation")

# Add expectations
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="user_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="total_spend_30d", 
        min_value=0, 
        max_value=1000000
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnPairValuesAToBeGreaterThanB(
        column_A="transaction_count_30d",
        column_B="transaction_count_7d",
        or_equal=True  # 30d count >= 7d count
    )
)

# Validate batch
checkpoint = context.add_checkpoint(
    name="feature_pipeline_checkpoint",
    validations=[{
        "batch_request": {
            "datasource_name": "features",
            "data_asset_name": "user_stats"
        },
        "expectation_suite_name": "user_features_validation"
    }],
    action_list=[{
        "name": "send_slack_notification",
        "action": {"class_name": "SlackNotificationAction"}
    }]
)

result = checkpoint.run()
if not result.success:
    raise ValueError("Data validation failed - halting pipeline")
```

**Expectation Types:**
- **Schema:** Column existence, types, ordering
- **Distribution:** Value ranges, uniqueness, null rates
- **Relational:** Foreign key validation, referential integrity
- **Temporal:** No future timestamps, sequential ordering

#### **20.3.2 Drift Detection**

- **Data Drift:** Input feature distributions change (P(X) changes)
- **Concept Drift:** Relationship between features and target changes (P(Y|X) changes)

```python
# drift_detection.py
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
import pandas as pd

def detect_drift(reference_data: pd.DataFrame, current_data: pd.DataFrame):
    """Compare production data against training baseline"""
    
    column_mapping = ColumnMapping(
        numerical_features=['age', 'income', 'transaction_amount'],
        categorical_features=['country', 'device_type']
    )
    
    report = Report(metrics=[DataDriftPreset()])
    report.run(
        reference_data=reference_data,
        current_data=current_data,
        column_mapping=column_mapping
    )
    
    results = report.as_dict()
    
    # DataDriftPreset reports a dataset-level summary as its first metric,
    # including the share of columns where drift was detected.
    # (Key names follow Evidently's legacy Report API; verify them for your version.)
    drift_ratio = results['metrics'][0]['result']['share_of_drifted_columns']
    
    # Alert and retrain if drift is detected in >30% of features.
    # alert_mlops_team and trigger_model_retraining are project-specific hooks.
    if drift_ratio > 0.3:
        alert_mlops_team(f"Data drift detected: {drift_ratio:.1%} of features")
        trigger_model_retraining()
    
    return report
```

**Statistical Tests:**
- **Numerical:** Kolmogorov-Smirnov test, Wasserstein distance
- **Categorical:** Chi-squared test, Jensen-Shannon divergence
- **Embeddings:** Maximum Mean Discrepancy (MMD) for high-dimensional data
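
To make the numerical test concrete, here is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical CDFs. It computes only the statistic, not a p-value, and the sample data is made up; a statistics library would normally be used for the full test.

```python
from bisect import bisect_right

def ks_statistic(reference, current):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in sorted(set(ref) | set(cur)):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap

baseline = [10, 12, 14, 16, 18, 20]
same = list(baseline)
shifted = [x + 100 for x in baseline]  # simulated drift: every value moves up

no_drift = ks_statistic(baseline, same)
full_drift = ks_statistic(baseline, shifted)
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap), so a threshold on it, or on the associated p-value, can drive the per-feature drift flag.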

---

## **20.4 Streaming Data Processing**

#### **20.4.1 Kafka Fundamentals for ML**

Distributed event streaming platform for real-time feature computation.

```
Producer → Kafka Topic → Consumer Group (Feature Engineering) → Feature Store
                ↓
           Partitions (parallelism based on key, e.g., hash(user_id) % 12)
```
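
Key-based partitioning can be sketched as a stable hash of the key modulo the partition count. This is an illustration only: it uses CRC32 from the standard library, whereas Kafka's default partitioner uses a murmur2 hash, and `partition_for` and the sample keys are hypothetical.

```python
import zlib

NUM_PARTITIONS = 12

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition with a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

events = [("user-42", "click"), ("user-7", "view"), ("user-42", "purchase")]
assignments = [(key, partition_for(key)) for key, _ in events]
```

Because the hash is deterministic, every event for `user-42` lands on the same partition, which preserves per-user ordering and lets a single consumer maintain that user's running aggregates.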

**Exactly-Once Semantics (EOS):**
```python
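# kafka_streams_features.py
# Uses the confluent-kafka client, which exposes the transactional producer API.
from confluent_kafka import Consumer, Producer
import json

consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'feature-engineering-group',
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False,          # Offsets are committed via the transaction
    'isolation.level': 'read_committed'   # Skip records from aborted transactions
})
consumer.subscribe(['transactions'])

producer = Producer({
    'bootstrap.servers': 'kafka:9092',
    'transactional.id': 'feature-producer-1'  # Enables transactions
})

producer.init_transactions()

while True:
    message = consumer.poll(1.0)
    if message is None or message.error():
        continue

    try:
        producer.begin_transaction()
        
        # Process event (compute_real_time_features is a project-specific function)
        event = json.loads(message.value())
        features = compute_real_time_features(event)
        
        # Send to feature topic
        producer.produce('computed-features', json.dumps(features).encode())
        
        # Commit offsets and features atomically
        producer.send_offsets_to_transaction(
            consumer.position(consumer.assignment()),
            consumer.consumer_group_metadata()
        )
        producer.commit_transaction()
        
    except Exception as e:
        # A production loop would also rewind the consumer to the last
        # committed offsets before retrying. log_error is a project-specific helper.
        producer.abort_transaction()
        log_error(e)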

#### **20.4.2 Spark Structured Streaming**

Micro-batch processing with MLlib integration.

```python
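# streaming_features.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col, count, avg, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Expected shape of each JSON transaction event (field names are assumed)
schema = StructType([
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("timestamp", TimestampType()),
])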

spark = SparkSession.builder \
    .appName("StreamingFeatures") \
    .getOrCreate()

# Read from Kafka
transactions = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "transactions") \
    .load()

# Parse JSON
parsed = transactions.select(
    from_json(col("value").cast("string"), schema).alias("data")
).select("data.*")

# Windowed aggregations (tumbling window)
windowed_stats = parsed \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window(col("timestamp"), "5 minutes"),
        col("user_id")
    ) \
    .agg(
        count("*").alias("txn_count_5m"),
        avg("amount").alias("avg_amount_5m")
    )

# Write to feature store (Delta Lake)
query = windowed_stats \
    .writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://checkpoints/streaming") \
    .start("s3://features/streaming/user-stats")

query.awaitTermination()
```

---

## **20.5 Workbook Labs**

### **Lab 1: End-to-End Batch Pipeline**
Build a data pipeline for e-commerce user behavior features:

1. **Extract:** Read raw clickstream data (CSV/JSON) from S3
2. **Transform:** Compute aggregates (session duration, pages per session, conversion flags)
3. **Load:** Write to Delta Lake with schema enforcement
4. **Validate:** Implement Great Expectations (null checks, range validation)
5. **Orchestrate:** Deploy Airflow DAG with SLAs and retry logic

**Deliverable:** Running DAG with documentation showing lineage from raw data to features.

### **Lab 2: Feature Store Implementation**
Implement a Feast feature store:

1. **Define:** Create feature definitions for fraud detection (transaction velocity, merchant risk)
2. **Materialize:** Set up offline store (BigQuery/Snowflake) and online store (Redis)
3. **Retrieve:** Python script fetching features for both training (point-in-time) and serving (online)
4. **Consistency Check:** Verify that training data matches serving features for identical timestamps

**Deliverable:** Git repo with `feature_repo/` directory and retrieval demo notebook.

### **Lab 3: Streaming Pipeline**
Build real-time feature computation:

1. **Setup:** Local Kafka cluster (Docker Compose) with transaction events
2. **Process:** Flink or Spark Streaming job computing 5-minute windowed aggregates
3. **Validate:** Exactly-once semantics verification (simulate failures, verify no duplicates)
4. **Monitor:** Lag monitoring (consumer offset vs. producer offset)

**Deliverable:** Streaming application with metrics dashboard showing throughput and lag.

### **Lab 4: Data Quality Monitoring**
Implement drift detection:

1. **Baseline:** Capture training data distribution (Evidently or custom)
2. **Simulation:** Deploy model, simulate data drift (shift numerical distributions)
3. **Detection:** Automated alerts when drift exceeds threshold
4. **Remediation:** Trigger retraining pipeline automatically

**Deliverable:** Drift detection service with Slack/email alerts and runbook.

---

## **20.6 Common Pitfalls**

1. **Schema Evolution Without Backfills:** Adding new features but not computing them for historical data prevents training on older time periods. **Solution:** Always backfill new features to the beginning of your data retention window.

2. **Data Leakage in Feature Engineering:** Computing aggregates that include the current row (e.g., daily count including the transaction being predicted). **Solution:** Use only data available strictly before the timestamp of the prediction.

3. **Small Files Problem:** Writing thousands of tiny Parquet files to S3 kills read performance. **Solution:** Use `OPTIMIZE` (Delta Lake) to compact files, aiming for 128MB-1GB per file, then run `VACUUM` to delete the stale files that compaction leaves behind.

4. **Ignoring Late-Arriving Data:** Setting the watermark delay too short drops events that arrive later than the threshold. **Solution:** Tune `withWatermark()` based on business requirements (e.g., accept data up to 24 hours late).

5. **Silent Schema Changes:** Upstream producers that add columns break pipelines expecting specific column positions. **Solution:** Use schema validation at ingestion; reject or quarantine non-conforming records.
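
The leakage described in pitfall 2 comes down to one comparison operator. The sketch below contrasts a leaky 24-hour count, which includes the transaction being predicted, with a safe one that only looks strictly before it; the timestamps and the `prior_count_24h` helper are made up for illustration.

```python
from datetime import datetime, timedelta

transactions = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 18, 0),
    datetime(2024, 1, 2, 13, 0),
]

def prior_count_24h(timestamps, index):
    """Count transactions in the 24 hours strictly before timestamps[index]."""
    now = timestamps[index]
    window_start = now - timedelta(hours=24)
    return sum(1 for ts in timestamps if window_start <= ts < now)

# Leaky version: `<= t` lets the current transaction count itself.
leaky = [
    sum(1 for ts in transactions if t - timedelta(hours=24) <= ts <= t)
    for t in transactions
]
# Safe version: only transactions strictly before the prediction time.
safe = [prior_count_24h(transactions, i) for i in range(len(transactions))]
```

Every leaky value is exactly one higher than its safe counterpart here, which is the kind of small, systematic offset that inflates offline metrics and then disappears in production.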

---

## **20.7 Interview Questions**

**Q1:** How do you handle late-arriving data in streaming feature pipelines?
*A: Use watermarks in Spark/Flink to specify how long to wait for late data. For example, `withWatermark("timestamp", "1 hour")` keeps window state for one hour past the latest event time seen, so aggregates still absorb events that arrive up to an hour late; anything later may be dropped. For critical late data (e.g., financial transactions), do not rely on the stream alone: land every raw event in durable storage and run a batch reconciliation job that recomputes the affected windows and corrects the stored features.*
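
The effect of the watermark delay can be sketched with a simplified simulation. This is not Spark's or Flink's actual algorithm: it assumes the watermark is the largest event time seen minus the allowed lateness, and that an event is dropped once its window has closed. Times are in seconds and all names are hypothetical.

```python
from collections import defaultdict

WINDOW = 300  # 5-minute tumbling windows, in seconds

def windowed_counts(event_times, allowed_lateness):
    """Count events per window, dropping events whose window already closed."""
    counts = defaultdict(int)
    dropped = []
    max_seen = 0
    for t in event_times:  # iterate in arrival order, not event-time order
        max_seen = max(max_seen, t)
        watermark = max_seen - allowed_lateness
        window_start = (t // WINDOW) * WINDOW
        if window_start + WINDOW <= watermark:
            dropped.append(t)  # window closed before this event arrived
        else:
            counts[window_start] += 1
    return dict(counts), dropped

# The last event (t=250) arrives after an event with t=1300 was already seen.
arrivals = [10, 200, 650, 1300, 250]

strict_counts, strict_dropped = windowed_counts(arrivals, allowed_lateness=600)
lenient_counts, lenient_dropped = windowed_counts(arrivals, allowed_lateness=3600)
```

With a 10-minute delay the late event is lost; with a 1-hour delay it is counted, at the cost of holding window state longer.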

**Q2:** Explain the difference between ETL and ELT, and when to use each for ML.
*A: ETL transforms before loading—best for heavy PII masking, strict governance, or complex joins that reduce data volume. ELT loads raw first, transforming in warehouse—best for exploratory feature engineering, preserving raw history for new features, and leveraging cloud warehouse scalability. Modern ML uses hybrid: ELT to data lake (cheap storage), ETL to feature store (validated, serveable features).*

**Q3:** What is training-serving skew in feature stores, and how do you prevent it?
*A: Skew occurs when training uses offline features computed differently than serving uses online features (e.g., training uses batch SQL aggregations, serving uses real-time Redis lookups). Prevention: (1) Shared transformation code (same UDFs in both paths), (2) A feature store such as Feast or Tecton, so both paths read from a single feature definition, (3) Point-in-time correctness (training retrieves features as they existed historically, not current values), (4) Integration tests comparing offline/online feature values.*

**Q4:** How do you ensure exactly-once semantics in Kafka streaming for features?
*A: Use Kafka transactions: (1) Enable idempotent producers (`enable.idempotence=true`), (2) Use transactional producer with `transactional.id`, (3) Consume-process-produce loop within transaction boundaries, (4) Commit offsets atomically with output using `send_offsets_to_transaction()`, (5) Set `isolation.level=read_committed` on downstream consumers so they never read output from aborted transactions. This ensures that even if the consumer restarts, duplicate features are not published.*

**Q5:** Design a data pipeline for 1TB/day of clickstream data supporting both real-time recommendations and batch analytics.
*A: Architecture: Kafka for ingestion (high throughput, durability). Dual path: (1) Real-time: Flink/Spark Streaming → Redis (online features for recommendations) with 5-minute windowed aggregates. (2) Batch: Kafka → S3 (landing zone) → Spark (hourly ETL) → Delta Lake (feature store offline). Use Delta Lake's CDF (Change Data Feed) to sync batch corrections to online store. Partition by date/user for efficient reads.*
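
A quick sizing calculation helps ground the design. Only the 1TB/day figure comes from the question; the average event size, peak factor, and per-partition throughput below are assumptions chosen for illustration and should be replaced with measured values.

```python
import math

BYTES_PER_DAY = 1_000_000_000_000  # 1 TB/day from the question (decimal TB)
SECONDS_PER_DAY = 24 * 60 * 60

AVG_EVENT_BYTES = 1_000    # assumption: ~1 KB per clickstream event
PEAK_FACTOR = 3            # assumption: peak traffic is 3x the daily average
PER_PARTITION_MB_S = 10    # assumption: sustained throughput per partition

avg_mb_per_sec = BYTES_PER_DAY / SECONDS_PER_DAY / 1_000_000
avg_events_per_sec = BYTES_PER_DAY / AVG_EVENT_BYTES / SECONDS_PER_DAY
peak_mb_per_sec = avg_mb_per_sec * PEAK_FACTOR
partitions_needed = math.ceil(peak_mb_per_sec / PER_PARTITION_MB_S)

print(f"average: {avg_mb_per_sec:.2f} MB/s, {avg_events_per_sec:,.0f} events/s")
print(f"peak: {peak_mb_per_sec:.2f} MB/s -> at least {partitions_needed} partitions")
```

Under these assumptions the average load is modest (about 11.6 MB/s), so partition count is more likely to be driven by consumer parallelism and headroom for growth than by raw throughput.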

---

## **20.8 Further Reading**

**Books:**
- *Data Engineering with Python* (Paul Crickard) - Pipeline patterns
- *Streaming Systems* (Tyler Akidau et al.) - Beam model, watermarks
- *Building Analytics Teams* (John K. Thompson) - Data strategy

**Papers:**
- "Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores" (Databricks)
- "Feast: A Feature Store for Machine Learning" (Gojek/Google)

**Tools:**
- **dbt (data build tool):** SQL-based transformations with testing
- **Pandera:** Statistical data validation for Pandas
- **Soda Core:** Data quality monitoring

---

## **20.9 Checkpoint Project: Real-Time Feature Platform**

Build a production-grade feature platform for a ride-sharing company.

**Requirements:**

1. **Data Sources:**
   - Kafka stream: GPS location updates (5M events/minute)
   - Database CDC: Ride transactions (MySQL binlog)
   - Batch files: Driver onboarding documents (daily)

2. **Features:**
   - **Real-time:** Driver availability (last GPS timestamp), demand heatmap (aggregated by 1km grid)
   - **Batch:** Driver lifetime rating, vehicle inspection status
   - **Streaming:** Average wait time per city (5-minute tumbling window)

3. **Architecture:**
   - **Ingestion:** Kafka Connect for CDC, Spark Streaming for GPS
   - **Storage:** Delta Lake (bronze/silver/gold layers), Redis (online features)
   - **Orchestration:** Dagster with partitioned assets (hourly/daily)
   - **Quality:** Great Expectations on gold layer features

4. **SLAs:**
   - Online feature retrieval: p99 < 10ms
   - Feature freshness: Real-time features lag < 30 seconds
   - Data quality: 99.9% of batches pass validation

**Deliverables:**
- `data_platform/` with Docker Compose for local dev
- Dagster repository with partitioned assets
- Feature definitions (Feast or Tecton YAML)
- Data quality reports (Great Expectations docs)
- Performance benchmark showing throughput/latency

**Success Criteria:**
- Successfully backfill 30 days of synthetic data
- Real-time feature computation handles at least 10k events/sec (for reference, the 5M events/minute GPS source averages about 83k events/sec)
- Automated rollback on data quality failures
- Point-in-time correctness verified for training data

---

**End of Chapter 20**

*You now understand how to build the data foundations of ML systems. Chapter 21 covers Model Training & Experimentation—scaling from notebooks to distributed clusters.*

---


