# 🏆 Capstone Project: Production E-Commerce ML Data Pipeline

**Module 3: Data & Pipeline Engineering (Comprehensive Exercise)**

---

## Project Overview

You are a **Senior Data Scientist at ShopStream**, a mid-size e-commerce company.
The business wants a **churn prediction model** that runs daily. Your job: build the
**entire data pipeline** from raw multi-source data to model-ready features.

### What You'll Build

```
 STAGE 1        STAGE 2       STAGE 3        STAGE 4       STAGE 5       STAGE 6
 ┌───────┐  ┌────────┐  ┌─────────┐  ┌────────┐  ┌────────┐  ┌───────┐
 │Multi- │─▶│Validate│─▶│Leak-Free│─▶│Feature │─▶│Quality │─▶│ Store │
 │Source │  │& Clean │  │Sampling │  │Engineer│  │ Check  │  │Parquet│
 │Ingest │  │        │  │& Split  │  │        │  │        │  │       │
 └───────┘  └────────┘  └─────────┘  └────────┘  └────────┘  └───────┘
 L1: Sources  L2: ETL     L3: Sampling  L5: Feast   L4: Leakage  L1: Parquet
              L7: E2E     L4: Leakage   L6: Spark   L7: E2E
```

### Concepts Exercised

| Stage | Lessons Applied |
|-------|----------------|
| Multi-Source Ingest | L1 (Data Sources & Formats) |
| Validate & Clean | L2 (ETL Pipelines), L7 (E2E Pipeline) |
| Leak-Free Sampling | L3 (Sampling Strategies), L4 (Data Leakage) |
| Feature Engineering | L5 (Feature Stores), L6 (Spark/PySpark) |
| Quality Check | L4 (Leakage Detection), L7 (Validation) |
| Store as Parquet | L1 (Formats), L7 (Production Storage) |

---

In [2]:
# ============================================================
# SETUP & IMPORTS
# ============================================================
import pandas as pd
import numpy as np
import os, time, json, warnings, shutil
from datetime import datetime, timedelta
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score, accuracy_score
warnings.filterwarnings('ignore')
np.random.seed(42)

# Pipeline output directory
OUTPUT_DIR = 'pipeline_output'
os.makedirs(OUTPUT_DIR, exist_ok=True)
print("✅ Setup complete")

✅ Setup complete


---
## Stage 1: Multi-Source Data Ingestion

**📝 Concept Revision (Lesson 1 & 2):**

In production ML systems, data never comes from a single clean CSV. You're pulling from
multiple sources with wildly different reliability, formats, and schemas. Understanding
the data source taxonomy is critical for designing robust pipelines:

```
  DATA SOURCE RELIABILITY SPECTRUM:

  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
  │ Databases   │  │ Internal    │  │ APIs        │  │ Logs/Events │
  │ (Postgres)  │  │ Services    │  │ (3rd party) │  │ (Clickstr.) │
  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘
  Most reliable ──────────────────────────────────────▷ Least reliable
  Structured         Semi-struct.    Variable        Noisy
  Schema enforced    Rate-limited    May change      Duplicates
```

**Key principles:**
- **Schema-on-write** (databases): Schema is enforced when data is written. Most reliable.
- **Schema-on-read** (logs, APIs): You define the schema at read time. Risky, because the schema can change without warning.
- Always assume the worst: nulls, duplicates, wrong types, outliers, late-arriving data.
- Design your ingestion to handle ALL these issues gracefully.
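
As a rough illustration of schema-on-read, the sketch below applies a declared schema to a freshly read batch and fails fast when a column is missing. `EXPECTED_SCHEMA` and `enforce_schema` are hypothetical names invented for this example, not part of the pipeline that follows.

```python
import io
import pandas as pd

# Hypothetical schema for a log-style source; column names and dtypes are illustrative.
EXPECTED_SCHEMA = {"user_id": "int64", "amount": "float64", "device": "string"}

def enforce_schema(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Fail fast on missing columns, then cast to the declared dtypes."""
    missing = set(schema) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    # Keep only the declared columns, in declared order, with declared types.
    return df[list(schema)].astype(schema)

# A tiny in-memory "file" with one unexpected extra column.
raw_csv = io.StringIO("user_id,amount,device,extra\n1,9.99,mobile,x\n2,15.5,desktop,y\n")
raw = pd.read_csv(raw_csv)
typed = enforce_schema(raw, EXPECTED_SCHEMA)
print(typed.dtypes)
```

Dropping undeclared columns is one possible policy; a stricter variant could raise on them instead.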

**Data format choice matters:** CSV is human-readable but slow to parse and carries no schema.
Parquet is binary, columnar, compressed, and embeds its schema, which makes it a common default for ML workloads.
We'll demonstrate the size difference at the end of this pipeline.

We simulate 4 realistic sources below: **transactions** (API: semi-structured, unreliable),
**user profiles** (DB: structured, reliable), **product catalog** (internal DB), and
**web sessions** (logs: noisy, with outliers).

---

In [1]:
# ============================================================
# SOURCE 1: Transaction API (semi-structured, unreliable)
# ============================================================
n_txn = 200_000
dates = pd.date_range('2023-06-01', '2024-03-31', periods=n_txn)

transactions = pd.DataFrame({
    'txn_id': np.arange(n_txn),
    'user_id': np.random.randint(1, 10001, n_txn),
    'product_id': np.random.randint(1, 2001, n_txn),
    'amount': np.round(np.random.exponential(45, n_txn), 2),
    'quantity': np.random.randint(1, 8, n_txn),
    'payment_method': np.random.choice(['credit_card', 'debit_card', 'upi', 'wallet', 'cod'], n_txn,
                                       p=[0.35, 0.25, 0.20, 0.10, 0.10]),
    'timestamp': dates,
})

# Inject realistic issues: nulls, negatives, duplicates
transactions.loc[np.random.choice(n_txn, 3000, replace=False), 'amount'] = np.nan
transactions.loc[np.random.choice(n_txn, 500, replace=False), 'amount'] *= -1
duplicates = transactions.sample(500, random_state=42)
transactions = pd.concat([transactions, duplicates]).reset_index(drop=True)

print(f"✅ Source 1 - Transactions: {len(transactions):,} rows | Issues: nulls, negatives, dupes")

In [None]:
# ============================================================
# SOURCE 2: User Profiles DB (structured, reliable)
# ============================================================
n_users = 10_000
user_profiles = pd.DataFrame({
    'user_id': np.arange(1, n_users + 1),
    'signup_date': pd.date_range('2020-01-01', '2024-01-01', periods=n_users),
    'age': np.random.randint(18, 72, n_users),
    'gender': np.random.choice(['M', 'F', 'Other'], n_users, p=[0.48, 0.48, 0.04]),
    'region': np.random.choice(['North', 'South', 'East', 'West', 'Central'], n_users),
    'is_premium': np.random.choice([0, 1], n_users, p=[0.75, 0.25]),
})

# Generate churn labels: ~18% expected churn rate given the probabilities below (imbalanced!)
churn_prob = 0.05 + 0.15 * (user_profiles['age'] > 50).astype(float) + \
             0.10 * (1 - user_profiles['is_premium']) + \
             np.random.uniform(-0.05, 0.05, n_users)
user_profiles['churned'] = (np.random.random(n_users) < churn_prob.clip(0, 0.5)).astype(int)

print(f"✅ Source 2 - User Profiles: {len(user_profiles):,} rows | Churn rate: {user_profiles['churned'].mean():.1%}")

In [None]:
# ============================================================
# SOURCE 3: Product Catalog (internal DB)
# ============================================================
n_products = 2000
product_catalog = pd.DataFrame({
    'product_id': np.arange(1, n_products + 1),
    'category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Books', 'Sports',
                                  'Beauty', 'Food', 'Toys'], n_products),
    'base_price': np.round(np.random.uniform(5, 500, n_products), 2),
    'avg_rating': np.round(np.random.uniform(1.5, 5.0, n_products), 1),
})

print(f"✅ Source 3 - Product Catalog: {len(product_catalog):,} rows")

In [None]:
# ============================================================
# SOURCE 4: Web Session Logs (semi-structured, noisy)
# ============================================================
n_sessions = 500_000
sessions = pd.DataFrame({
    'session_id': np.arange(n_sessions),
    'user_id': np.random.randint(1, n_users + 1, n_sessions),
    'timestamp': pd.date_range('2023-06-01', '2024-03-31', periods=n_sessions),
    'pages_viewed': np.random.poisson(5, n_sessions),
    'time_on_site_sec': np.random.exponential(180, n_sessions).astype(int),
    'device': np.random.choice(['mobile', 'desktop', 'tablet'], n_sessions, p=[0.55, 0.35, 0.10]),
    'bounce': np.random.choice([0, 1], n_sessions, p=[0.65, 0.35]),
})

# Inject noise: some extreme outliers
sessions.loc[np.random.choice(n_sessions, 200, replace=False), 'time_on_site_sec'] = \
    np.random.randint(50000, 200000, 200)

print(f"✅ Source 4 - Web Sessions: {len(sessions):,} rows | With outliers injected")
print(f"\n{'='*60}")
print(f"INGESTION SUMMARY")
print(f"{'='*60}")
print(f"  Total data points: {len(transactions)+len(user_profiles)+len(product_catalog)+len(sessions):,}")
print(f"  Sources: 4 (API, DB, Internal, Logs)")

---
## Stage 2: Validation & Cleaning (ETL)

**📝 Concept Revision (Lesson 2 & 7):**

The ETL (Extract-Transform-Load) pattern is the backbone of every data pipeline.
We've already Extracted (Stage 1). Now we Transform (validate + clean). Load comes later.

**The cardinal rule: NEVER trust raw data.** Even data from "reliable" databases can have:
- **Schema changes**: A column was renamed upstream and no one told you
- **Null spikes**: A service outage caused 50% of rows to miss a field
- **Duplicates**: Retry logic in the API created double-writes
- **Out-of-range values**: Negative prices from refunds mixed with sales

```
  VALIDATION LAYERS (defense in depth):

  Layer 1: SCHEMA       →  Are the right columns present with correct types?
  Layer 2: NULL CHECK   →  Are null rates within acceptable thresholds?
  Layer 3: RANGE CHECK  →  Are values within business-valid ranges?
  Layer 4: UNIQUENESS   →  Are primary keys actually unique?
  Layer 5: STATISTICAL  →  Do distributions match historical baselines?

  Each layer catches different failure modes.
  Validate BEFORE processing to fail fast (don't waste compute on bad data).
```
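
Layer 5 is the one layer the validator in this notebook does not implement. One way it could look, as a sketch only, is a Population Stability Index (PSI) comparison between a historical baseline and the current batch. The `psi` helper and the 0.1 / 0.25 thresholds are assumptions for illustration (they are a common rule of thumb, not something defined in this module).

```python
import numpy as np

def psi(baseline, current, bins=10):
    """Population Stability Index between two 1-D numeric samples."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges come from baseline quantiles so both samples share buckets.
    inner_edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))[1:-1]
    base_counts = np.bincount(np.searchsorted(inner_edges, baseline, side="right"), minlength=bins)
    curr_counts = np.bincount(np.searchsorted(inner_edges, current, side="right"), minlength=bins)
    # Floor tiny proportions to avoid log(0) and division by zero.
    base_pct = np.clip(base_counts / len(baseline), 1e-6, None)
    curr_pct = np.clip(curr_counts / len(current), 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

rng = np.random.default_rng(0)
baseline = rng.exponential(45, 10_000)   # e.g. last month's order amounts
same = rng.exponential(45, 10_000)       # new batch, same distribution
shifted = rng.exponential(90, 10_000)    # new batch, mean doubled

print(f"PSI same:    {psi(baseline, same):.4f}")
print(f"PSI shifted: {psi(baseline, shifted):.4f}")
```

A check like this could be added to a validator as a `warning`-severity result, since distribution drift is often worth a human look rather than a hard failure.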

**ETL vs ELT:** In traditional ETL, you transform before loading. In ELT (used with
cloud data warehouses like BigQuery/Snowflake), you load raw data first, then transform
in-place using SQL. The validation step is critical in BOTH patterns.

The **DataValidator pattern** below is a lightweight, hand-rolled take on what dedicated
data validation libraries such as Great Expectations provide: declarative checks that run
before any processing and produce a pass/fail report.

---

In [None]:
# ============================================================
# PRODUCTION DATA VALIDATOR
# ============================================================
@dataclass
class ValidationResult:
    check: str
    passed: bool
    detail: str
    severity: str  # 'error', 'warning', or 'info'

class DataValidator:
    def __init__(self):
        self.results: List[ValidationResult] = []
    
    def check_schema(self, df, expected_cols):
        missing = set(expected_cols) - set(df.columns)
        self.results.append(ValidationResult(
            'schema', len(missing)==0,
            f'Missing: {missing}' if missing else 'OK', 'error' if missing else 'info'))
    
    def check_nulls(self, df, max_pct=0.05):
        violations = {c: f'{v:.1%}' for c, v in df.isnull().mean().items() if v > max_pct}
        self.results.append(ValidationResult(
            'null_rate', len(violations)==0,
            f'High nulls: {violations}' if violations else f'All <{max_pct:.0%}', 
            'error' if violations else 'info'))
    
    def check_duplicates(self, df, key_col):
        n_dupes = df[key_col].duplicated().sum()
        self.results.append(ValidationResult(
            f'duplicates({key_col})', n_dupes==0,
            f'{n_dupes} duplicates' if n_dupes else 'No duplicates',
            'error' if n_dupes else 'info'))
    
    def check_range(self, df, col, lo, hi):
        oob = ((df[col].dropna() < lo) | (df[col].dropna() > hi)).sum()
        self.results.append(ValidationResult(
            f'range({col})', oob==0,
            f'{oob} out of [{lo},{hi}]' if oob else 'In range',
            'warning' if oob else 'info'))
    
    def report(self):
        print(f"{'='*55}")
        print(f"VALIDATION REPORT  ({datetime.now().strftime('%H:%M:%S')})")
        print(f"{'='*55}")
        for r in self.results:
            icon = '✅' if r.passed else ('⚠️' if r.severity=='warning' else '❌')
            print(f"  {icon} {r.check}: {r.detail}")
        ok = all(r.passed or r.severity!='error' for r in self.results)
        print(f"\n  {'PASSED' if ok else 'FAILED'}")
        self.results = []
        return ok

print("✅ DataValidator defined")

In [None]:
# ============================================================
# VALIDATE RAW TRANSACTIONS
# ============================================================
v = DataValidator()
v.check_schema(transactions, ['txn_id','user_id','product_id','amount','quantity','payment_method','timestamp'])
v.check_nulls(transactions)
v.check_duplicates(transactions, 'txn_id')
v.check_range(transactions, 'amount', 0.01, 5000)
v.check_range(transactions, 'quantity', 1, 50)
v.report()

In [None]:
# ============================================================
# CLEAN TRANSACTIONS
# ============================================================
txn = transactions.copy()
before = len(txn)

# 1. Deduplicate
txn = txn.drop_duplicates(subset='txn_id', keep='first')
print(f"  Deduped: {before - len(txn)} removed")

# 2. Fix negative amounts
neg = (txn['amount'] < 0).sum()
txn['amount'] = txn['amount'].abs()
print(f"  Fixed {neg} negative amounts")

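# 3. Impute nulls with category-aware median (join product info first).
#    Note: this median and the p99 cap in step 4 are computed over the full date range,
#    including the later test window. A strictly leak-free pipeline would compute them
#    on the training window only.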
txn = txn.merge(product_catalog[['product_id','category']], on='product_id', how='left')
null_ct = txn['amount'].isna().sum()
txn['amount'] = txn.groupby('category')['amount'].transform(lambda x: x.fillna(x.median()))
txn['amount'] = txn['amount'].fillna(txn['amount'].median())  # fallback
print(f"  Imputed {null_ct} null amounts (category-median)")

# 4. Cap outliers at 99th percentile
cap = txn['amount'].quantile(0.99)
capped = (txn['amount'] > cap).sum()
txn['amount'] = txn['amount'].clip(upper=cap)
print(f"  Capped {capped} amounts at p99={cap:.2f}")

# 5. Add derived columns
txn['total_value'] = txn['amount'] * txn['quantity']
txn['hour'] = txn['timestamp'].dt.hour
txn['dow'] = txn['timestamp'].dt.dayofweek
txn['month'] = txn['timestamp'].dt.month

print(f"\n✅ Clean transactions: {len(txn):,} rows, {len(txn.columns)} cols")

In [None]:
# ============================================================
# CLEAN WEB SESSIONS (cap outlier durations)
# ============================================================
sess = sessions.copy()
p99 = sess['time_on_site_sec'].quantile(0.99)
outliers = (sess['time_on_site_sec'] > p99).sum()
sess['time_on_site_sec'] = sess['time_on_site_sec'].clip(upper=p99)
print(f"✅ Clean sessions: capped {outliers} outliers at {p99:.0f}s")

---
## Stage 3: Leak-Free Sampling & Temporal Split

**📝 Concept Revision (Lesson 3 & 4):**

This is where most ML projects silently fail. **Data leakage** is the #1 reason models
look amazing in development and crash in production.

**Three types of leakage to watch for:**
1. **Target leakage**: A feature that directly encodes the label (e.g., 'cancellation_date' for churn)
2. **Train-test contamination**: Information from test set leaks into training (e.g., fitting
   a scaler on ALL data before splitting)
3. **Temporal leakage**: Using future data to predict the past (most insidious with time-series)

```
  WRONG (random split: temporal leakage!):
  Jun 2023  Jul  Aug  Sep  Oct  Nov  Dec  Jan 2024  Feb  Mar
    [T]  [V]  [T]  [T]  [V]  [T]  [V]  [T]     [V]  [T]
    Future data leaks into training! ❌

  CORRECT (temporal split: no leakage!):
  Jun 2023  Jul  Aug  Sep  Oct  Nov  Dec  Jan 2024 | Feb  Mar
    [T]  [T]  [T]  [T]  [T]  [T]  [T]  [T]      | [V]  [V]
    Only past data in training ✅                   | Test
```

**Class imbalance** (Lesson 3): The simulated churn rate is roughly 18%, so a naive model predicting "not churned" for
everyone gets about 82% accuracy while catching zero churners. That's useless. Common remedies (this project uses the first):
- `class_weight='balanced'` in the model (reweights samples so that misses on the minority class cost more)
- Stratified sampling to preserve class ratios in splits
- SMOTE (Synthetic Minority Oversampling Technique), applied ONLY to training data, never before splitting!
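
To make the stratified-sampling bullet concrete, here is a small sketch on synthetic labels (the data and the 18% rate are made up for the example). It uses scikit-learn's `train_test_split` with `stratify=`, which applies to a random split; the project itself splits by time instead, and SMOTE is not shown because it needs the separate imbalanced-learn package.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
y = (rng.random(5_000) < 0.18).astype(int)   # synthetic imbalanced labels
X = rng.normal(size=(5_000, 3))              # placeholder features

# stratify=y keeps the class ratio (almost) identical in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"overall: {y.mean():.3f} | train: {y_tr.mean():.3f} | test: {y_te.mean():.3f}")
```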

**Why temporal > random for this problem:** Customer behavior is time-dependent. Spending patterns
in December (holidays) differ from January. A random split would put some January rows in training while
December rows sit in the test set, so the model would be scored on a period whose future it has already seen,
an advantage it never gets in production.
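
The difference between the two splits can be checked mechanically. The sketch below uses a synthetic event table and a hypothetical helper, `has_temporal_overlap`, to show that a random split mixes time periods while a cutoff-based split does not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
events = pd.DataFrame({
    "timestamp": pd.date_range("2023-06-01", "2024-03-31", periods=1_000),
    "value": rng.normal(size=1_000),
})
split_date = pd.Timestamp("2024-02-01")

# Temporal split: everything before the cutoff trains, everything after tests.
train_t = events[events["timestamp"] < split_date]
test_t = events[events["timestamp"] >= split_date]

# Random split of the same sizes, for contrast.
shuffled = events.sample(frac=1.0, random_state=0)
train_r, test_r = shuffled.iloc[:len(train_t)], shuffled.iloc[len(train_t):]

def has_temporal_overlap(train, test):
    """True if any training row is at or after the earliest test row."""
    return bool(train["timestamp"].max() >= test["timestamp"].min())

print("temporal split overlaps:", has_temporal_overlap(train_t, test_t))
print("random split overlaps:  ", has_temporal_overlap(train_r, test_r))
```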

---

In [None]:
# ============================================================
# TEMPORAL SPLIT: train on past, test on future
# This prevents temporal leakage (Lesson 4)
# ============================================================
SPLIT_DATE = pd.Timestamp('2024-02-01')

txn_train = txn[txn['timestamp'] < SPLIT_DATE]
txn_test = txn[txn['timestamp'] >= SPLIT_DATE]
sess_train = sess[sess['timestamp'] < SPLIT_DATE]
sess_test = sess[sess['timestamp'] >= SPLIT_DATE]

print(f"Temporal Split at {SPLIT_DATE.date()}:")
print(f"  Train transactions: {len(txn_train):,}  ({txn_train['timestamp'].min().date()} to {txn_train['timestamp'].max().date()})")
print(f"  Test  transactions: {len(txn_test):,}  ({txn_test['timestamp'].min().date()} to {txn_test['timestamp'].max().date()})")
print(f"\n>>> No future data leaks into training features! (Lesson 4: Temporal Leakage Prevention)")

---
## Stage 4: Feature Engineering

**📝 Concept Revision (Lesson 5 & 6):**

Feature engineering is where much of the ML value is created. Models rarely learn well from raw event rows;
the features you compute from them often matter more for performance than the choice of algorithm.

**The Training-Serving Skew Problem (Lesson 5):**
If you compute features differently during training vs serving, the model sees inputs it was never trained on.
Example: during training you compute `avg_order_value` over all historical data, but during
serving you only use the last 30 days. The distributions differ, so the model degrades.

This is why **feature stores** (like Feast) exist: they serve the SAME feature computation
logic for both training and inference. In this project, we simulate this by using a
**single reusable function** (`build_user_features`) for both train and test.

```
  FEATURE STORE PATTERN (what we're implementing):

  build_user_features()  ───┬──▶  Training Features  ──▶ Model.fit()
  (single function)        │
  Same logic for both!     └──▶  Serving Features   ──▶ Model.predict()
                                No skew! ✅
```
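
A toy version of the skew described above, as a sketch: the same four orders produce different `avg_order_value` numbers depending on the lookback window. The helper below and its `window_days` parameter are hypothetical; the point is only that routing both paths through one function with one setting removes the mismatch.

```python
import pandas as pd

def avg_order_value(orders, as_of, window_days):
    """Mean order amount in the `window_days` before `as_of` (exclusive)."""
    start = as_of - pd.Timedelta(days=window_days)
    mask = (orders["timestamp"] >= start) & (orders["timestamp"] < as_of)
    return float(orders.loc[mask, "amount"].mean())

orders = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-10-01", "2023-11-15", "2024-01-10", "2024-01-20"]),
    "amount": [100.0, 80.0, 20.0, 30.0],
})
as_of = pd.Timestamp("2024-02-01")

train_value = avg_order_value(orders, as_of, window_days=365)         # all four orders
skewed_serving = avg_order_value(orders, as_of, window_days=30)       # only the January orders
consistent_serving = avg_order_value(orders, as_of, window_days=365)  # same definition as training

print(train_value, skewed_serving, consistent_serving)
```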

**Performance (Lesson 6):** We use vectorized Pandas operations (groupby + agg) instead of
row-by-row iteration. In Spark, these same operations would distribute across a cluster
for TB-scale data. The pattern is identical: think in columns, not rows.
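
As a small illustration of "think in columns, not rows", the sketch below computes the same per-user totals twice on synthetic data: once with a Python loop over rows and once with a single grouped aggregation. The results match; the grouped form is the one that carries over to Spark-style engines.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "user_id": rng.integers(1, 51, 2_000),
    "amount": rng.exponential(45, 2_000),
})

# Row-by-row: a Python-level loop over every record.
totals_loop = {}
for uid, amt in zip(df["user_id"], df["amount"]):
    totals_loop[uid] = totals_loop.get(uid, 0.0) + amt

# Vectorized: one grouped aggregation over whole columns.
totals_vec = df.groupby("user_id")["amount"].sum()

print(totals_vec.head())
```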

**RFM Features:** Recency (days since last purchase), Frequency (purchases per day),
Monetary (total spend) form the classic customer behavior feature set, widely used in retail and
subscription businesses. These three are often among the most predictive features for churn.
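
A minimal RFM computation on a five-row toy table, using the same recency, tenure, and frequency definitions as the feature function in this stage (the table and column values are invented for the example):

```python
import pandas as pd

orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-11", "2024-01-21", "2024-01-05", "2024-01-06"]
    ),
    "total_value": [20.0, 30.0, 50.0, 10.0, 15.0],
})
ref_date = orders["timestamp"].max()  # reference point for recency

rfm = orders.groupby("user_id").agg(
    first_txn=("timestamp", "min"),
    last_txn=("timestamp", "max"),
    n_orders=("timestamp", "count"),
    monetary=("total_value", "sum"),
)
# Recency: days between the reference date and the user's last order.
rfm["recency_days"] = (ref_date - rfm["last_txn"]).dt.days
# Frequency: orders per day of observed tenure (first to last order, inclusive).
tenure_days = (rfm["last_txn"] - rfm["first_txn"]).dt.days + 1
rfm["frequency_per_day"] = rfm["n_orders"] / tenure_days

print(rfm[["recency_days", "frequency_per_day", "monetary"]])
```

User 1 ends up with recency 0, three orders over 21 days, and a spend of 100; user 2 has recency 15, two orders over 2 days, and a spend of 25.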

**Feature categories we build:**
- **Transaction features**: spend, frequency, recency, product diversity
- **Session features**: engagement, bounce rate, device preference
- **Profile features**: demographics, account age, premium status

---

In [None]:
# ============================================================
# FEATURE ENGINEERING FUNCTION (reusable for train AND test)
# This is the "feature definition" that a Feature Store would manage
# ============================================================
def build_user_features(txn_df, sess_df, user_df, product_df, label='_train'):
    """Build user-level features from transaction and session data.
    Uses only vectorized operations (Lesson 6: Spark-style thinking).
    """
    start = time.time()
    
    # --- Transaction features ---
    txn_feats = txn_df.groupby('user_id').agg(
        total_transactions=('txn_id', 'count'),
        total_spend=('total_value', 'sum'),
        avg_order_value=('amount', 'mean'),
        median_order_value=('amount', 'median'),
        max_order_value=('amount', 'max'),
        std_order_value=('amount', 'std'),
        avg_quantity=('quantity', 'mean'),
        unique_products=('product_id', 'nunique'),
        unique_categories=('category', 'nunique'),
        unique_payment_methods=('payment_method', 'nunique'),
        pct_weekend=('dow', lambda x: (x >= 5).mean()),
        pct_evening=('hour', lambda x: ((x >= 18) & (x <= 23)).mean()),
        first_txn=('timestamp', 'min'),
        last_txn=('timestamp', 'max'),
    ).reset_index()
    
    # Recency & frequency (RFM-style)
    ref_date = txn_df['timestamp'].max()
    txn_feats['days_since_last_txn'] = (ref_date - txn_feats['last_txn']).dt.days
    txn_feats['customer_tenure_days'] = (txn_feats['last_txn'] - txn_feats['first_txn']).dt.days + 1
    txn_feats['purchase_frequency'] = txn_feats['total_transactions'] / txn_feats['customer_tenure_days'].clip(lower=1)
    txn_feats = txn_feats.drop(columns=['first_txn', 'last_txn'])
    txn_feats['std_order_value'] = txn_feats['std_order_value'].fillna(0)
    
    # Favorite category
    fav_cat = txn_df.groupby('user_id')['category'].agg(lambda x: x.mode().iloc[0]).reset_index()
    fav_cat.columns = ['user_id', 'fav_category']
    txn_feats = txn_feats.merge(fav_cat, on='user_id', how='left')
    
    # --- Session features ---
    sess_feats = sess_df.groupby('user_id').agg(
        total_sessions=('session_id', 'count'),
        avg_pages_viewed=('pages_viewed', 'mean'),
        avg_time_on_site=('time_on_site_sec', 'mean'),
        total_time_on_site=('time_on_site_sec', 'sum'),
        bounce_rate=('bounce', 'mean'),
        pct_mobile=('device', lambda x: (x == 'mobile').mean()),
    ).reset_index()
    
    # --- Merge all features ---
    features = user_df[['user_id', 'age', 'gender', 'region', 'is_premium', 'churned']].copy()
    features['account_age_days'] = (ref_date - user_df['signup_date']).dt.days
    features = features.merge(txn_feats, on='user_id', how='left')
    features = features.merge(sess_feats, on='user_id', how='left')
    
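    # Fill NaN for users with no transactions/sessions.
    # Caveat: 0 is a weak default for recency-style columns such as days_since_last_txn
    # (it reads as "purchased today"); a large sentinel value may be more faithful.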
    numeric_cols = features.select_dtypes(include=[np.number]).columns
    features[numeric_cols] = features[numeric_cols].fillna(0)
    
    elapsed = time.time() - start
    print(f"  ✅ Built {len(features.columns)-2} features for {len(features):,} users ({elapsed:.2f}s) [{label}]")
    return features

print("✅ Feature engineering function defined (reusable for train/test)")

In [None]:
# ============================================================
# BUILD FEATURES (separately for train and test to prevent leakage!)
# ============================================================
print("Building features...")
train_features = build_user_features(txn_train, sess_train, user_profiles, product_catalog, 'TRAIN')
test_features = build_user_features(txn_test, sess_test, user_profiles, product_catalog, 'TEST')

print(f"\n  Train features shape: {train_features.shape}")
print(f"  Test features shape:  {test_features.shape}")
print(f"  Train churn rate:     {train_features['churned'].mean():.1%}")
print(f"  Test churn rate:      {test_features['churned'].mean():.1%}")

---
## Stage 5: Leakage Detection & Quality Check

**📝 Concept Revision (Lesson 4 & 7):**

Even after careful pipeline design, you need to **verify** that no leakage slipped through.
This is your last line of defense before the model sees the data.

**Leakage Detection Checklist:**

```
  ┌──────────────────────────────────────────────────┐
  │              LEAKAGE RED FLAGS                   │
  ├──────────────────────────────────────────────────┤
  │ 🚩 Any feature with >0.9 corr to target        │
  │ 🚩 Accuracy >95% on first try (too good!)      │
  │ 🚩 Single feature dominates importance (>50%)   │
  │ 🚩 Performance drops hugely in production       │
  │ 🚩 Future timestamps in feature columns         │
  └──────────────────────────────────────────────────┘
```

We run two checks here:
1. **Correlation analysis**: Compute correlation of every feature with the target. Anything >0.9
   is almost certainly leakage (no real-world feature predicts churn that perfectly).
2. **Output quality validation**: Even if there's no leakage, are the features valid? No nulls?
   Values in reasonable ranges? User IDs unique? This is the "quality gate" from Lesson 7.

**Post-model sanity check:** After training, examine feature importances. If a single feature
dominates (>50% importance), it's likely leakage. Real-world churn prediction should rely on
a mix of behavioral, demographic, and engagement signals.
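
To see what the correlation check catches, the sketch below builds a synthetic frame with one deliberately leaky column (`cancel_flag_noisy`, a hypothetical name: the label plus a little noise) next to an unrelated one, then applies the same >0.9 rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 2_000
churned = (rng.random(n) < 0.18).astype(int)

demo = pd.DataFrame({
    "age": rng.integers(18, 72, n),                          # independent of the label here
    "cancel_flag_noisy": churned + rng.normal(0, 0.05, n),   # deliberately leaky
    "churned": churned,
})

# Absolute Pearson correlation of every feature with the target.
corr = demo.drop(columns="churned").corrwith(demo["churned"]).abs()
leaky = corr[corr > 0.9].index.tolist()

print(corr.round(3))
print("Flagged as leaky:", leaky)
```

Pearson correlation only detects linear relationships on numeric columns, so a clean result here is a necessary check rather than proof that no leakage exists.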

---

In [None]:
# ============================================================
# LEAKAGE DETECTION (Lesson 4)
# ============================================================
print("LEAKAGE DETECTION")
print("="*55)

numeric_feats = train_features.select_dtypes(include=[np.number]).columns.drop('churned')
correlations = train_features[numeric_feats].corrwith(train_features['churned']).abs().sort_values(ascending=False)

print("\nTop 10 feature-target correlations:")
for feat, corr in correlations.head(10).items():
    flag = ' üö© SUSPICIOUS!' if corr > 0.5 else ''
    print(f"  {feat:<30} {corr:.3f}{flag}")

leaky = correlations[correlations > 0.9]
if len(leaky) > 0:
    print(f"\n❌ LEAKAGE DETECTED in: {list(leaky.index)}")
    print("  These features likely encode the target. REMOVE THEM.")
else:
    print(f"\n✅ No target leakage detected (all correlations < 0.9)")

In [None]:
# ============================================================
# OUTPUT QUALITY VALIDATION (Lesson 7)
# ============================================================
v2 = DataValidator()
v2.check_nulls(train_features, max_pct=0.0)
v2.check_range(train_features, 'total_spend', 0, 500_000)
v2.check_range(train_features, 'age', 18, 100)
v2.check_range(train_features, 'bounce_rate', 0, 1)
v2.check_duplicates(train_features, 'user_id')
v2.report()

---
## Stage 6: Model-Ready Pipeline & Storage

**📝 Concept Revision (Lesson 1, 3, 4, 7):**

**Why `sklearn.pipeline.Pipeline` matters (Lesson 4):**
The most common source of train-test contamination is fitting a scaler or imputer on the
full dataset before splitting. `Pipeline` guards against this: calling `pipeline.fit(X_train, y_train)`
fits every step on the training data only, and `predict()` applies those train-fitted transforms to test data unchanged.

```
  WRONG (leaky):                    RIGHT (Pipeline):
  scaler.fit(ALL_DATA)               pipeline.fit(X_train)
  X_train = scaler.transform(train)   ├─ imputer.fit(X_train)
  X_test = scaler.transform(test)     ├─ scaler.fit(X_train)
  ❌ test stats leaked into scaler     ├─ model.fit(X_train)
                                     pipeline.predict(X_test)
                                     └─ uses train-fitted transforms ✅
```
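
The contrast in the diagram can be verified directly. In this sketch (synthetic data, with `LogisticRegression` standing in for the model purely for brevity), the scaler inside the pipeline ends up with the training mean, while a scaler fitted on train and test together absorbs the shifted test distribution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(200, 2))
X_test = rng.normal(5, 1, size=(50, 2))       # shifted on purpose
y_train = (X_train[:, 0] > 0).astype(int)

# Leak-free: every step is fitted on X_train only.
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_train, y_train)
train_only_mean = pipe.named_steps["scaler"].mean_

# Anti-pattern: scaler statistics computed on train and test together.
leaky_scaler = StandardScaler().fit(np.vstack([X_train, X_test]))

print("pipeline scaler mean:", train_only_mean.round(3))
print("leaky scaler mean:   ", leaky_scaler.mean_.round(3))
preds = pipe.predict(X_test)   # test data only ever passes through transform()
```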

**Why Parquet for storage (Lesson 1):**
- **Columnar format**: Read only the columns you need (column pruning)
- **Schema embedded**: No guessing types, no silent type coercion
- **Compression**: Snappy compression typically makes files several times smaller than the equivalent CSV (the exact ratio depends on the data)
- **Predicate pushdown**: Spark/Presto can filter without reading entire files

**Class imbalance (Lesson 3):**
`class_weight='balanced'` weights each class by
`n_samples / (n_classes * n_samples_in_class)`. For example, at a 12% churn rate the churned class gets
a weight of about 4.2 and the active class about 0.57, so each churned sample counts roughly 7.3x as much.
This pushes the model to pay comparable attention to both classes instead of
learning the lazy "predict majority class" shortcut.
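
The formula is easy to check by hand. The sketch below reimplements it with NumPy on a toy label vector with a 12% minority class; `balanced_class_weights` is a hypothetical helper written for this example, not a scikit-learn function.

```python
import numpy as np

def balanced_class_weights(y):
    """n_samples / (n_classes * count_per_class), i.e. the 'balanced' heuristic."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

y = np.array([0] * 88 + [1] * 12)    # 12% minority class
w = balanced_class_weights(y)
ratio = w[1] / w[0]                  # how much more a churned sample counts

print(f"active: {w[0]:.3f} | churned: {w[1]:.3f} | ratio: {ratio:.2f}x")
```

The absolute weights are about 0.57 and 4.17; the 7.3x figure is their ratio, which equals the class-count ratio 88/12.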

---

In [None]:
# ============================================================
# PREPARE FEATURES FOR MODELING
# ============================================================
# Encode categoricals
cat_cols = ['gender', 'region', 'fav_category']

# One-hot encode train and test independently (pd.get_dummies has no fit step)
train_encoded = pd.get_dummies(train_features, columns=cat_cols, drop_first=True)
test_encoded = pd.get_dummies(test_features, columns=cat_cols, drop_first=True)

# Align test to the training columns: categories missing from test are added as 0,
# and categories seen only in test are dropped
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

# Separate X, y
drop_cols = ['user_id', 'churned']
X_train = train_encoded.drop(columns=drop_cols)
y_train = train_encoded['churned']
X_test = test_encoded.drop(columns=drop_cols)
y_test = test_encoded['churned']

print(f"X_train: {X_train.shape} | X_test: {X_test.shape}")
print(f"y_train churn rate: {y_train.mean():.1%} | y_test churn rate: {y_test.mean():.1%}")

In [None]:
# ============================================================
# LEAK-FREE SKLEARN PIPELINE (Lesson 4)
# fit() happens ONLY on X_train
# ============================================================
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(
        n_estimators=200,
        max_depth=12,
        class_weight='balanced',  # Lesson 3: handle imbalance
        random_state=42,
        n_jobs=-1
    ))
])

# Train
start = time.time()
pipeline.fit(X_train, y_train)
train_time = time.time() - start

# Predict
y_pred = pipeline.predict(X_test)

print(f"\nModel trained in {train_time:.2f}s")
print(f"\n{'='*55}")
print("CLASSIFICATION REPORT (Test Set)")
print(f"{'='*55}")
print(classification_report(y_test, y_pred, target_names=['Active', 'Churned']))
print(f"Overall F1: {f1_score(y_test, y_pred, average='weighted'):.4f}")
print(f"Accuracy:   {accuracy_score(y_test, y_pred):.4f}")

In [None]:
# ============================================================
# FEATURE IMPORTANCE (sanity check for leakage)
# ============================================================
importances = pd.Series(
    pipeline.named_steps['model'].feature_importances_,
    index=X_train.columns
).sort_values(ascending=False)

print("\nTop 15 Feature Importances:")
print("-"*40)
for feat, imp in importances.head(15).items():
    bar = '█' * int(imp * 100)
    print(f"  {feat:<30} {imp:.3f} {bar}")

# Leakage red flag: single feature dominance
if importances.iloc[0] > 0.5:
    print(f"\n⚠️ WARNING: '{importances.index[0]}' dominates ({importances.iloc[0]:.1%}). Check for leakage!")
else:
    print(f"\n✅ No single feature dominates. Importance is well-distributed.")

In [None]:
# ============================================================
# STORE AS PARQUET (Lesson 1: binary columnar > CSV)
# ============================================================
print("\nSTORING ARTIFACTS")
print("="*55)

# Save train features
train_path = os.path.join(OUTPUT_DIR, 'train_features.parquet')
train_encoded.to_parquet(train_path, index=False, compression='snappy')
train_size = os.path.getsize(train_path) / 1024

# Save test features
test_path = os.path.join(OUTPUT_DIR, 'test_features.parquet')
test_encoded.to_parquet(test_path, index=False, compression='snappy')
test_size = os.path.getsize(test_path) / 1024

# Compare with CSV size
csv_path = os.path.join(OUTPUT_DIR, 'train_features.csv')
train_encoded.to_csv(csv_path, index=False)
csv_size = os.path.getsize(csv_path) / 1024

print(f"  Parquet (train): {train_size:.1f} KB")
print(f"  Parquet (test):  {test_size:.1f} KB")
print(f"  CSV equivalent:  {csv_size:.1f} KB")
print(f"  Compression:     {csv_size/train_size:.1f}x smaller with Parquet!")

# Save pipeline metadata
metadata = {
    'pipeline_run_date': datetime.now().isoformat(),
    'split_date': str(SPLIT_DATE.date()),
    'train_rows': len(train_encoded),
    'test_rows': len(test_encoded),
    'n_features': len(X_train.columns),
    'churn_rate_train': float(y_train.mean()),
    'churn_rate_test': float(y_test.mean()),
    'test_f1': float(f1_score(y_test, y_pred, average='weighted')),
    'test_accuracy': float(accuracy_score(y_test, y_pred)),
    'feature_columns': list(X_train.columns),
}
with open(os.path.join(OUTPUT_DIR, 'pipeline_metadata.json'), 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"\n✅ All artifacts saved to {OUTPUT_DIR}/")

# Cleanup (demo only: remove this step in a real pipeline so the artifacts persist)
shutil.rmtree(OUTPUT_DIR)
print("✅ Cleaned up output directory")

---
## 🏆 Pipeline Summary & Production Readiness Checklist

```
 ✅ Multi-source ingestion       (Lesson 1: 4 data source types)
 ✅ Schema + null validation      (Lesson 2 & 7: DataValidator pattern)
 ✅ Data cleaning & dedup         (Lesson 2: ETL Transform)
 ✅ Temporal train/test split     (Lesson 4: no temporal leakage)
 ✅ Feature engineering (20+)     (Lesson 5 & 6: reusable function)
 ✅ Target leakage detection      (Lesson 4: correlation check)
 ✅ Output quality validation     (Lesson 7: post-transform checks)
 ✅ Leak-free sklearn Pipeline    (Lesson 4: fit only on train)
 ✅ Class imbalance handling      (Lesson 3: balanced class weights)
 ✅ Feature importance audit      (Lesson 4: no single-feature dominance)
 ✅ Parquet storage + metadata    (Lesson 1: binary columnar format)
```

### What Would Change in Full Production

| This Project | Production |
|-------------|------------|
| Synthetic data | Real DB/API sources |
| Pandas | Spark for >10GB data (Lesson 6) |
| Manual runs | Airflow/Prefect orchestration (Lesson 7) |
| In-memory features | Feast feature store (Lesson 5) |
| Print statements | Structured logging + Prometheus metrics |
| Local files | S3/GCS data lake with partitioning |

---

**🎓 All 7 lessons of Module 3 applied in one coherent, production-grade pipeline!**