# CHAPTER 5: DATA PREPROCESSING & FEATURE ENGINEERING

*The Foundation of Every Great Model*

## Chapter Overview

In production ML, most of your time (a common rule of thumb says around 80%) goes into data: cleaning it, understanding it, and transforming it into features that models can learn from. This chapter takes you from running `.fit()` on clean CSVs to rescuing messy real-world data, engineering features that capture domain knowledge, and building preprocessing pipelines that scale from laptops to distributed clusters.

**Estimated Time:** 35-45 hours (2-3 weeks)  
**Prerequisites:** Chapters 1-4, pandas proficiency, basic statistics

---

## 5.0 Learning Objectives

By the end of this chapter, you will be able to:

1. Diagnose and fix real-world data quality issues: missing values, outliers, duplicates, and inconsistencies
2. Apply appropriate scaling and encoding techniques based on model requirements and data distributions
3. Engineer domain-aware features that capture temporal patterns, interactions, and aggregates
4. Reduce dimensionality while preserving signal using PCA, t-SNE, and feature selection methods
5. Build production-ready preprocessing pipelines with scikit-learn that transform consistently across train/inference
6. Detect data drift and validate incoming data against training distributions

---

## 5.1 Data Cleaning: The 80% Work

Real-world data is never clean. Your first job is to make it usable without introducing bias or losing signal.

### 5.1.1 Exploratory Data Analysis (EDA) First

Before cleaning, understand what you're working with:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load and inspect
df = pd.read_csv('customer_data.csv')

# First glance
print(df.shape)
df.info()                         # Data types, non-null counts (prints directly; returns None)
print(df.describe(include='all')) # Summary statistics

# Missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
print(pd.DataFrame({'missing': missing, 'percent': missing_pct}).sort_values('missing', ascending=False))

# Data types and unique values
for col in df.columns:
    print(f"{col}: {df[col].dtype} - {df[col].nunique()} unique values")
    
# Visual distributions
df.hist(bins=50, figsize=(20,15))
plt.tight_layout()
plt.show()

# Correlation matrix (numerical features only)
plt.figure(figsize=(12,10))
sns.heatmap(df.select_dtypes(include=[np.number]).corr(), annot=True, fmt='.2f')
plt.show()
```

**Key Questions to Answer:**
- What's the data type? (Should it be numeric, categorical, datetime?)
- What's the range/domain? (Negative ages? Future birth dates?)
- How much data is missing? Is it random or systematic?
- Are there obvious outliers? (Visualize with box plots)
- Are categorical variables balanced?
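
The range/domain question above can be turned into an automated check. The following is a minimal sketch on a toy frame; the column names, the 0-120 age bound, and the `domain_violations` helper are assumptions made for illustration, not part of the chapter's dataset.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
toy = pd.DataFrame({
    "age": [34, -2, 51, 130],
    "birth_date": pd.to_datetime(
        ["1990-05-01", "2090-01-01", "1973-11-12", "1960-02-29"]
    ),
})

def domain_violations(frame, today):
    """Return a boolean frame marking values outside their plausible domain."""
    return pd.DataFrame({
        "age": ~frame["age"].between(0, 120),       # negative or implausibly old
        "birth_date": frame["birth_date"] > today,  # born in the future
    })

violations = domain_violations(toy, today=pd.Timestamp("2024-01-01"))
print(violations.sum())  # number of violations per column
```

Counting violations per column before cleaning gives you a baseline, so you can tell whether a later fix removed the bad rows or silently removed good ones too.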

### 5.1.2 Missing Data: The Silent Model Killer

Missing values aren't just empty cells — they're information about your data collection process.

**Types of Missing Data:**
- **MCAR (Missing Completely at Random):** Missingness is unrelated to any data, observed or not (e.g., a sensor fails at random)
- **MAR (Missing at Random):** Missingness depends only on observed data (e.g., women are less likely to report income, and gender is recorded)
- **MNAR (Missing Not at Random):** Missingness depends on the unobserved value itself (e.g., high-income people hide their income)

**Detection:**
```python
# Visualize missing patterns
import missingno as msno

msno.matrix(df)              # Missing pattern visualization
msno.heatmap(df)             # Correlation between missing columns
msno.dendrogram(df)          # Hierarchical clustering of missingness

# Test if missing is random (simplified)
# Group by a categorical column and check missing rates
for col in df.select_dtypes(include=['object']).columns:
    missing_by_group = df.groupby(col)['target_column'].apply(lambda x: x.isnull().mean())
    print(f"\nMissing rates by {col}:")
    print(missing_by_group)
```

**Handling Strategies:**

| Strategy | When to Use | Code | Pros/Cons |
|----------|------------|------|-----------|
| **Delete rows** | <5% missing, MCAR | `df.dropna(subset=['important_col'])` | Simple, loses data |
| **Delete columns** | >70% missing, low importance | `df.drop(columns=['bad_col'])` | Preserves rows, loses feature |
| **Mean/Median** | Numerical, low missing, no relationship | `df['age'].fillna(df['age'].median())` | Fast, reduces variance |
| **Mode** | Categorical, low missing | `df['gender'].fillna(df['gender'].mode()[0])` | Simple for categories |
| **Forward/Backward fill** | Time series | `df['sales'].ffill()` (or `.bfill()`) | Respects temporal order |
| **Interpolation** | Time series with trend | `df['temp'].interpolate(method='linear')` | More accurate than ffill |
| **KNN Imputation** | MAR, relationships exist | `from sklearn.impute import KNNImputer` | Captures patterns, slow |
| **Model-based** | Complex patterns | `IterativeImputer` (run `from sklearn.experimental import enable_iterative_imputer` first) | Often most accurate, expensive |
| **Add "missing" indicator** | Missingness itself is informative | `df['age_missing'] = df['age'].isnull().astype(int)` | Preserves signal |
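
The median and "missing indicator" rows of the table can be combined in one step. This is a small sketch using `SimpleImputer` with `add_indicator=True`; the toy array (an age column and an income column) is an assumption for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: column 0 = age, column 1 = income (values are made up)
X_toy = np.array([
    [25.0, 50_000.0],
    [np.nan, 62_000.0],
    [40.0, np.nan],
    [31.0, 58_000.0],
])

# Median imputation plus one 0/1 indicator column per feature that had gaps
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X_toy)

# Columns: [age_imputed, income_imputed, age_was_missing, income_was_missing]
print(X_out)
```

The indicator columns let a model learn from the missingness itself, which matters most when the data is MAR or MNAR rather than MCAR.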

**Production-Grade Imputation:**
```python
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.compose import ColumnTransformer

# Numerical imputation
num_imputer = SimpleImputer(strategy='median')  # Robust to outliers

# Categorical imputation
cat_imputer = SimpleImputer(strategy='constant', fill_value='MISSING')

# KNN imputation for complex patterns
knn_imputer = KNNImputer(n_neighbors=5, weights='distance')

# Column transformer for mixed types
preprocessor = ColumnTransformer([
    ('num_impute', num_imputer, numerical_cols),
    ('cat_impute', cat_imputer, categorical_cols)
])

# Fit on training data only (never on validation/test!)
X_train_imputed = preprocessor.fit_transform(X_train)
X_test_imputed = preprocessor.transform(X_test)
```

### 5.1.3 Outlier Detection and Treatment

Outliers can be genuine rare events or data errors. Your treatment depends on which.

**Detection Methods:**

```python
# 1. Z-Score (assumes normal distribution)
from scipy import stats
z_scores = np.abs(stats.zscore(df['sales']))
outliers_z = df[z_scores > 3]  # Beyond 3 std deviations

# 2. IQR Method (distribution-agnostic)
Q1 = df['sales'].quantile(0.25)
Q3 = df['sales'].quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = df[(df['sales'] < Q1 - 1.5 * IQR) | (df['sales'] > Q3 + 1.5 * IQR)]

# 3. Isolation Forest (multivariate)
from sklearn.ensemble import IsolationForest
iso_forest = IsolationForest(contamination=0.05, random_state=42)
outliers_if = iso_forest.fit_predict(df[numerical_cols]) == -1

# 4. DBSCAN (density-based; eps is measured in feature units, so scale features first)
from sklearn.cluster import DBSCAN
clustering = DBSCAN(eps=0.5, min_samples=10).fit(df[numerical_cols])
outliers_db = clustering.labels_ == -1

# 5. Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
df.boxplot(column='sales', ax=axes[0])
axes[0].set_title('Box Plot')
df['sales'].hist(bins=50, ax=axes[1])
axes[1].set_title('Histogram')
axes[2].scatter(df.index, df['sales'])
axes[2].set_title('Scatter Plot')
plt.tight_layout()
```

**Treatment Options:**

```python
# 1. Cap/Clip (Winsorization)
upper_limit = df['sales'].quantile(0.99)
lower_limit = df['sales'].quantile(0.01)
df['sales_capped'] = df['sales'].clip(lower_limit, upper_limit)

# 2. Log transform (pulls in long tails)
df['sales_log'] = np.log1p(df['sales'])  # log(1+x) handles zeros

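# 3. Box-Cox transform (estimates a power transform that makes the data more Gaussian)
from scipy import stats
# Box-Cox needs strictly positive, non-missing input; the +1 shift only helps when sales >= 0
df['sales_boxcox'], lambda_opt = stats.boxcox(df['sales'] + 1)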
print(f"Optimal lambda: {lambda_opt}")

# 4. Remove if clearly erroneous
df = df[df['age'].between(0, 120)]  # Remove impossible ages

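# 5. Treat as separate class (for anomalies)
# Recompute z-scores here: step 4 removed rows, so the earlier z_scores array no longer aligns
df['is_outlier'] = (np.abs(stats.zscore(df['sales'])) > 3).astype(int)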
```

### 5.1.4 Duplicates and Inconsistencies

```python
# Exact duplicates
duplicates = df[df.duplicated(keep=False)]  # Show all duplicates
df = df.drop_duplicates()  # Keep first occurrence

# Near duplicates (fuzzy matching for text)
from rapidfuzz import fuzz
def find_duplicate_strings(series, threshold=90):
    """Find near-duplicate strings using rapidfuzz's similarity ratio (0-100).

    Compares every pair, so cost grows as O(n^2); run it on unique values only.
    """
    values = [str(v) for v in series]  # positional access, independent of the index
    duplicates = []
    for i, val1 in enumerate(values):
        for j in range(i + 1, len(values)):
            if fuzz.ratio(val1, values[j]) > threshold:
                duplicates.append((i, j, val1, values[j]))
    return duplicates

# Inconsistent categorical values
df['country'] = df['country'].str.lower().str.strip()
country_mapping = {
    'usa': 'US', 'united states': 'US', 'america': 'US',
    'uk': 'UK', 'united kingdom': 'UK', 'england': 'UK'
}
df['country_clean'] = df['country'].map(country_mapping).fillna(df['country'])

# Date inconsistencies
df['date'] = pd.to_datetime(df['date'], errors='coerce')  # Invalid dates become NaT
df = df.dropna(subset=['date'])  # Drop rows with invalid dates
```

---

## 5.2 Feature Scaling: Giving All Features Equal Voice

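Models that rely on distances (KNN, SVM), gradient-based optimization (neural nets), or regularization (Ridge, Lasso) need features on comparable scales. Tree-based models split on one feature at a time, so they are largely unaffected by scaling.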
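
To see why scale matters for distance-based models, here is a toy sketch (the ages and incomes are invented for illustration). It finds the nearest neighbor of the first row before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age (years), income (dollars); values are illustrative only
X_toy = np.array([
    [25.0, 50_000.0],   # query row
    [60.0, 50_500.0],   # very different age, almost the same income
    [26.0, 54_000.0],   # almost the same age, somewhat different income
    [45.0, 90_000.0],
    [35.0, 30_000.0],
])

def nearest_to_first(X):
    """Row index of the Euclidean nearest neighbor of row 0."""
    distances = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(distances)) + 1

raw_neighbor = nearest_to_first(X_toy)
scaled_neighbor = nearest_to_first(StandardScaler().fit_transform(X_toy))
print(raw_neighbor, scaled_neighbor)
```

On the raw data, income dominates the distance simply because its units are larger, so the 60-year-old looks closest. After standardization both features contribute, and the 26-year-old becomes the nearest neighbor.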

### 5.2.1 Scaling Techniques Compared

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler
import numpy as np

# Generate sample data with outliers
np.random.seed(42)
data = np.random.normal(100, 20, 1000)  # Normal distribution
data = np.append(data, [500, -200])     # Add outliers

# Create DataFrame for comparison
df_scaling = pd.DataFrame({'original': data})

# Apply different scalers
scalers = {
    'Standard (Z-score)': StandardScaler(),
    'Min-Max': MinMaxScaler(),
    'Robust (IQR)': RobustScaler(),
    'MaxAbs': MaxAbsScaler()
}

for name, scaler in scalers.items():
    df_scaling[name] = scaler.fit_transform(data.reshape(-1, 1)).ravel()  # flatten (n, 1) -> (n,)

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()

for i, (name, scaler) in enumerate(scalers.items()):
    axes[i].hist(df_scaling[name], bins=50, edgecolor='black')
    axes[i].set_title(f'{name}\nMean: {df_scaling[name].mean():.2f}, Std: {df_scaling[name].std():.2f}')
    axes[i].axvline(x=0, color='red', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()
```

| Scaler | Formula | Range | Best For | Sensitivity |
|--------|---------|-------|----------|-------------|
| **StandardScaler** | (x - μ) / σ | Unbounded; mostly within [-3, 3] for roughly normal data | Normally distributed data | Sensitive to outliers |
| **MinMaxScaler** | (x - min) / (max - min) | [0,1] | Bounded data, neural nets | Very sensitive to outliers |
| **RobustScaler** | (x - median) / IQR | Unbounded; outliers stay far from 0 | Data with outliers | Robust to outliers |
| **MaxAbsScaler** | x / max\|x\| | [-1,1] | Sparse data (preserves zeros) | Sensitive to outliers |
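
The formulas in the table can be checked by hand against scikit-learn. The sketch below uses a tiny made-up array with one outlier and compares manual results with each scaler's default settings.

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler
)

# Five toy values, the last one an outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0]).reshape(-1, 1)
q1, median, q3 = np.percentile(x, [25, 50, 75])

manual = {
    "StandardScaler": (x - x.mean()) / x.std(),           # population std (ddof=0)
    "MinMaxScaler": (x - x.min()) / (x.max() - x.min()),
    "RobustScaler": (x - median) / (q3 - q1),             # IQR = Q3 - Q1
    "MaxAbsScaler": x / np.abs(x).max(),
}
scalers = {
    "StandardScaler": StandardScaler(),
    "MinMaxScaler": MinMaxScaler(),
    "RobustScaler": RobustScaler(),
    "MaxAbsScaler": MaxAbsScaler(),
}

matches = {
    name: bool(np.allclose(scalers[name].fit_transform(x), manual[name]))
    for name in manual
}
print(matches)
print(manual["RobustScaler"].ravel())  # the outlier stays large after robust scaling
```

Note that `RobustScaler` does not shrink the outlier into a fixed range. It keeps the bulk of the data on a sensible scale and leaves the outlier visibly far from zero.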

### 5.2.2 When to Scale What

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Mixed scaling based on feature characteristics
preprocessor = ColumnTransformer([
    # Features with outliers -> RobustScaler
    ('robust', RobustScaler(), ['income', 'transaction_amount']),
    
    # Features bounded by nature -> MinMaxScaler
    ('minmax', MinMaxScaler(), ['age', 'satisfaction_score']),
    
    # Normally distributed features -> StandardScaler
    ('standard', StandardScaler(), ['height', 'weight']),
    
    # Leave binary features as-is
    ('passthrough', 'passthrough', ['is_male', 'has_children'])
])

# Fit on training
X_train_scaled = preprocessor.fit_transform(X_train)
X_test_scaled = preprocessor.transform(X_test)
```

### 5.2.3 Critical Pitfall: Data Leakage

Never fit scalers on the entire dataset before splitting!

```python
from sklearn.model_selection import train_test_split

# WRONG - leaks test information into training
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Uses test data in fit!
X_train, X_test = train_test_split(X_scaled, test_size=0.2, random_state=42)

# RIGHT - fit only on training (split arguments here are illustrative)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Uses training statistics
```

---

## 5.3 Encoding Categorical Variables: Speaking the Model's Language

Models understand numbers, not strings. Encoding transforms categories into numbers while preserving information.

### 5.3.1 Nominal vs Ordinal Categories

```python
# Ordinal (has order) - preserve ordering
education_order = {'High School': 1, 'Bachelor': 2, 'Master': 3, 'PhD': 4}
df['education_encoded'] = df['education'].map(education_order)

# Nominal (no order) - don't imply relationships
# WRONG: df['color'] = {'red':1, 'green':2, 'blue':3}  # Implies green > red
```
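
For pipelines, the same ordinal mapping can be expressed with scikit-learn's `OrdinalEncoder`. This is a sketch on a toy column; note that it produces zero-based codes (0 to 3) rather than the 1 to 4 used in the dictionary above, and the `-1` code for unseen levels is a choice made here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame(
    {"education": ["Bachelor", "PhD", "High School", "Master", "Bachelor"]}
)

# Explicit order, lowest to highest; without it, categories are sorted alphabetically
levels = ["High School", "Bachelor", "Master", "PhD"]

encoder = OrdinalEncoder(
    categories=[levels],
    handle_unknown="use_encoded_value",  # unseen levels get a sentinel...
    unknown_value=-1,                    # ...instead of raising at inference time
)
codes = encoder.fit_transform(toy[["education"]]).ravel()
unseen = encoder.transform(pd.DataFrame({"education": ["Diploma"]}))
print(codes, unseen)
```

Passing `categories` explicitly is the important part: the default alphabetical order would put "Bachelor" below "High School".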

### 5.3.2 One-Hot Encoding (The Workhorse)

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Pandas way
df_encoded = pd.get_dummies(df, columns=['color', 'city'], drop_first=True)

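# Scikit-learn way (better for pipelines; `sparse_output` needs scikit-learn >= 1.2)
# Note: with drop='first', an unseen category encodes as all zeros, same as the dropped one
encoder = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')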
encoded = encoder.fit_transform(df[['color', 'city']])

# Create DataFrame with feature names
feature_names = encoder.get_feature_names_out(['color', 'city'])
df_encoded = pd.DataFrame(encoded, columns=feature_names, index=df.index)

# Handling high cardinality (many categories)
# Option 1: Keep top K categories, group rest as 'other'
top_cities = df['city'].value_counts().nlargest(10).index
df['city_grouped'] = df['city'].where(df['city'].isin(top_cities), 'Other')

# Option 2: Frequency encoding (replace with count)
city_counts = df['city'].value_counts()
df['city_freq'] = df['city'].map(city_counts) / len(df)  # Normalized frequency
```

### 5.3.3 Target Encoding (Powerful but Dangerous)

Replace each category with the mean of the target for that category. This works well for tree models, but it leaks the target if the means are computed on the same rows they encode.

```python
from sklearn.model_selection import KFold
import numpy as np
import pandas as pd

def target_encode(X, y, cat_col, n_folds=5):
    """
    Out-of-fold target encoding: each row is encoded with category means
    computed on the other folds, so its own target never leaks into its feature.
    Assumes X and y share the same index.
    """
    encoded = pd.Series(np.nan, index=X.index, dtype=float)

    kfold = KFold(n_splits=n_folds, shuffle=True, random_state=42)

    for train_idx, val_idx in kfold.split(X):
        # Compute mean on training fold only (KFold yields positions, so use iloc)
        cat_means = y.iloc[train_idx].groupby(X[cat_col].iloc[train_idx]).mean()

        # Apply to validation fold
        encoded.iloc[val_idx] = X[cat_col].iloc[val_idx].map(cat_means).to_numpy()

    # Categories unseen in a training fold fall back to the global mean
    return encoded.fillna(y.mean()).rename('target_encoded')

# Usage (for binary classification)
df['city_target_encoded'] = target_encode(df, df['churned'], 'city')
```
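
The leakage risk is easiest to see on pure noise. In the sketch below the target is random, so no honest feature should correlate with it. The synthetic data, the 300-category cardinality, and the seeds are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "city": rng.integers(0, 300, size=n).astype(str),  # high cardinality
    "churned": rng.integers(0, 2, size=n),             # random target: no real signal
})

# Naive encoding: each row's own target is part of its category mean
naive = toy.groupby("city")["churned"].transform("mean")

# Out-of-fold encoding: means come from the other folds only
oof = pd.Series(np.nan, index=toy.index, dtype=float)
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(toy):
    fold_means = toy.iloc[train_idx].groupby("city")["churned"].mean()
    oof.iloc[val_idx] = toy["city"].iloc[val_idx].map(fold_means).to_numpy()
oof = oof.fillna(toy["churned"].mean())

naive_corr = np.corrcoef(naive, toy["churned"])[0, 1]
oof_corr = np.corrcoef(oof, toy["churned"])[0, 1]
print(f"naive: {naive_corr:.2f}, out-of-fold: {oof_corr:.2f}")
```

The naive feature correlates strongly with a target that contains no signal at all, which is exactly the leakage a model would happily overfit. The out-of-fold version stays close to zero.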

### 5.3.4 Advanced: Embeddings for High Cardinality

For categories with hundreds/thousands of values (user IDs, ZIP codes), learn embeddings during model training.

```python
import torch
import torch.nn as nn

class CategoryEmbedding(nn.Module):
    def __init__(self, num_categories, embedding_dim):
        super().__init__()
        self.embedding = nn.Embedding(num_categories, embedding_dim)
        
    def forward(self, x):
        # x: category indices
        return self.embedding(x)

# In practice, this is integrated into your neural network
# embedding_dim = min(50, num_categories // 2) is a common heuristic
```

### 5.3.5 Encoding Decision Matrix

| Encoding | Cardinality | Model Type | Pros | Cons |
|----------|-------------|------------|------|------|
| **One-Hot** | Low (<10) | Any | No false ordinality | Creates many columns |
| **Label** | Low, ordinal | Tree-based | Simple | Implies order |
| **Frequency** | Medium | Tree-based | Single column, captures popularity | Collisions possible |
| **Target** | Medium | Tree-based | Powerful signal | Leakage risk, overfitting |
| **Embeddings** | High | Neural nets | Learns semantic similarity | Complex, needs data |
| **Binary** | Medium | Any | Fewer cols than one-hot | Less interpretable |
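
Binary encoding appears in the matrix but has no example elsewhere in this chapter. The following is a hand-rolled sketch with pandas and NumPy: the `binary_encode` helper is hypothetical, it assumes there are no missing values, and dedicated encoder libraries may number categories differently.

```python
import numpy as np
import pandas as pd

def binary_encode(series):
    """Encode categories as the binary digits of an integer code.

    Assumes no missing values (pd.factorize would code them as -1).
    """
    codes, uniques = pd.factorize(series)  # codes 0..k-1 in order of appearance
    n_bits = max(1, int(np.ceil(np.log2(len(uniques)))))
    # Extract bits from most significant to least significant
    shifts = np.arange(n_bits)[::-1]
    bits = (codes[:, None] >> shifts) & 1
    columns = [f"{series.name}_bit{i}" for i in range(n_bits)]
    return pd.DataFrame(bits, columns=columns, index=series.index), uniques

colors = pd.Series(
    ["red", "green", "blue", "green", "yellow", "purple"], name="color"
)
encoded, uniques = binary_encode(colors)
print(encoded)
```

Five categories fit in three columns here, where one-hot encoding would need five. The price is that individual bit columns have no standalone meaning.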

---

## 5.4 Feature Engineering: Where Domain Expertise Shines

The best features come from understanding the problem, not just the data.

### 5.4.1 Temporal Features

```python
# From timestamp
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Time components
df['hour'] = df['timestamp'].dt.hour
df['day'] = df['timestamp'].dt.day
df['dayofweek'] = df['timestamp'].dt.dayofweek  # Monday=0, Sunday=6
df['month'] = df['timestamp'].dt.month
df['quarter'] = df['timestamp'].dt.quarter
df['year'] = df['timestamp'].dt.year
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)

# Cyclical encoding (preserves circular nature)
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

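# Lag features (past values); sort first so shift() really looks backward in time
df = df.sort_values(['store_id', 'timestamp'])
df['sales_lag_1'] = df.groupby('store_id')['sales'].shift(1)
df['sales_lag_7'] = df.groupby('store_id')['sales'].shift(7)

# Rolling statistics over the previous 7 rows; shift(1) keeps the current row's
# sales out of its own feature (needed when sales is the prediction target)
df['sales_rolling_mean_7'] = df.groupby('store_id')['sales'].transform(
    lambda x: x.shift(1).rolling(7, min_periods=1).mean()
)
df['sales_rolling_std_7'] = df.groupby('store_id')['sales'].transform(
    lambda x: x.shift(1).rolling(7, min_periods=1).std()
)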

# Time since last event (diff() needs rows in date order within each customer)
df['days_since_last_purchase'] = (
    df.sort_values('date').groupby('customer_id')['date'].diff().dt.days
)
```
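
The benefit of cyclical encoding shows up when you measure distances. In this small sketch (the `encode_hour` helper is an illustrative name), 23:00 and 00:00 are 23 units apart as raw integers but end up as neighbors on the sin/cos circle.

```python
import numpy as np

def encode_hour(hour):
    """Map hour-of-day onto the unit circle as (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(hour) / 24
    return np.column_stack([np.sin(angle), np.cos(angle)])

points = encode_hour([23, 0, 12])

d_23_to_0 = np.linalg.norm(points[0] - points[1])   # one hour apart
d_0_to_12 = np.linalg.norm(points[1] - points[2])   # opposite sides of the clock
print(f"23h vs 0h: {d_23_to_0:.3f}, 0h vs 12h: {d_0_to_12:.3f}")
```

Both the sine and the cosine column are needed: with only one of them, two different hours would map to the same value.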

### 5.4.2 Interaction Features

```python
from sklearn.preprocessing import PolynomialFeatures

# Manual interactions
df['income_per_capita'] = df['household_income'] / df['household_size']
df['price_per_unit'] = df['total_price'] / df['quantity']
df['age_squared'] = df['age'] ** 2  # Capture non-linear effects

# Polynomial features (auto-generate interactions)
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
interactions = poly.fit_transform(df[['income', 'age', 'education']])
interaction_df = pd.DataFrame(
    interactions, 
    columns=poly.get_feature_names_out(['income', 'age', 'education'])
)

# Domain-specific interactions
# e.g., in e-commerce: user browsing behavior × item category
df['user_cat_click_rate'] = df.groupby(['user_id', 'category'])['clicked'].transform('mean')
```

### 5.4.3 Aggregation Features

```python
# Customer-level aggregations
customer_features = df.groupby('customer_id').agg({
    'transaction_amount': ['mean', 'std', 'max', 'min', 'sum', 'count'],
    'timestamp': lambda x: (x.max() - x.min()).days,  # customer lifetime
    'product_id': 'nunique',  # distinct products bought
    'is_return': 'mean'  # return rate
}).reset_index()

# Flatten MultiIndex column names (strip the trailing '_' left on 'customer_id')
customer_features.columns = ['_'.join(col).strip('_') for col in customer_features.columns.values]

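# Window aggregations (trailing 7 days, including the current row)
# Series.rolling has no `on` argument, so index by date and use groupby(...).rolling.
# The result comes back grouped by customer in sorted order, matching the sort below.
df = df.sort_values(['customer_id', 'date'])
df['last_7d_spend'] = (
    df.set_index('date').groupby('customer_id')['amount'].rolling('7D').sum().to_numpy()
)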

# Ratio features
df['return_rate'] = df.groupby('customer_id')['is_return'].transform('mean')
df['avg_order_value'] = df['amount'] / df['order_count']
```

### 5.4.4 Text Features

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Basic text features
df['text_length'] = df['description'].str.len()
df['word_count'] = df['description'].str.split().str.len()
df['avg_word_length'] = df['text_length'] / df['word_count']

# Bag of words (for ML, not deep learning)
vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
text_features = vectorizer.fit_transform(df['description'])
text_df = pd.DataFrame(
    text_features.toarray(),
    columns=vectorizer.get_feature_names_out()
)

# Sentiment scores (TextBlob's default analyzer is lexicon-based, not a trained model)
from textblob import TextBlob
df['sentiment'] = df['review'].apply(lambda x: TextBlob(str(x)).sentiment.polarity)
df['subjectivity'] = df['review'].apply(lambda x: TextBlob(str(x)).sentiment.subjectivity)
```

### 5.4.5 Feature Engineering Checklist

- [ ] Did I create features that capture domain knowledge?
- [ ] Did I handle temporal patterns (seasonality, trends, lags)?
- [ ] Did I create interaction features that might matter?
- [ ] Did I aggregate to appropriate levels (customer, product, store)?
- [ ] Did I avoid leakage (using future information)?
- [ ] Did I validate that new features improve model performance?

---

## 5.5 Dimensionality Reduction: When Less is More

Too many features cause overfitting, slow training, and the curse of dimensionality.
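
One face of the curse of dimensionality is that distances lose contrast: in high dimensions the nearest and farthest points are almost equally far away. The sketch below illustrates this on uniform random data; the point counts, dimensions, and `distance_contrast` helper are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(n_features, n_points=500):
    """(max - min) / min of distances from one random query to random points."""
    points = rng.random((n_points, n_features))
    query = rng.random(n_features)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

contrast_2d = distance_contrast(2)
contrast_1000d = distance_contrast(1000)
print(f"2 features: {contrast_2d:.1f}, 1000 features: {contrast_1000d:.2f}")
```

When the contrast collapses like this, "nearest neighbor" stops being informative, which is one reason distance-based models tend to benefit from feature selection or projection.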

### 5.5.1 Feature Selection Methods

```python
from sklearn.feature_selection import (
    SelectKBest, SelectPercentile,
    f_classif, mutual_info_classif,
    RFE, SelectFromModel
)
from sklearn.ensemble import RandomForestClassifier

# 1. Filter Methods (univariate); fit selectors on training data only to avoid leakage
selector_kbest = SelectKBest(score_func=mutual_info_classif, k=20)
X_selected = selector_kbest.fit_transform(X, y)

# Get selected features
selected_features = X.columns[selector_kbest.get_support()].tolist()

# 2. Wrapper Methods (RFE - Recursive Feature Elimination)
estimator = RandomForestClassifier(n_estimators=100, random_state=42)
selector_rfe = RFE(estimator, n_features_to_select=20, step=5)
selector_rfe.fit(X, y)
selected_rfe = X.columns[selector_rfe.support_].tolist()

# 3. Embedded Methods (from model)
selector_model = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold='median',  # Keep features above median importance
    max_features=20
)
selector_model.fit(X, y)
selected_model = X.columns[selector_model.get_support()].tolist()

# Feature importance visualization
importances = pd.DataFrame({
    'feature': X.columns,
    'importance': selector_model.estimator_.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(importances.head(20)['feature'], importances.head(20)['importance'])
plt.xlabel('Importance')
plt.title('Top 20 Feature Importances')
plt.gca().invert_yaxis()
plt.tight_layout()
```

### 5.5.2 PCA (Principal Component Analysis)

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Always scale before PCA (in a real workflow, fit the scaler and PCA on training data only)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Explained variance ratio
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)

# Find number of components for 95% variance
n_components = np.argmax(cumulative_variance >= 0.95) + 1
print(f"Components needed for 95% variance: {n_components}")

# Plot elbow curve
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(range(1, len(explained_variance) + 1), explained_variance, 'bo-')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Scree Plot')

plt.subplot(1, 2, 2)
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, 'ro-')
plt.axhline(y=0.95, color='k', linestyle='--', label='95% threshold')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Variance')
plt.legend()

plt.tight_layout()
plt.show()

# Transform with optimal components
pca_optimal = PCA(n_components=n_components)
X_pca_optimal = pca_optimal.fit_transform(X_scaled)

# Interpret components (what do they represent?)
components_df = pd.DataFrame(
    pca_optimal.components_,
    columns=X.columns,
    index=[f'PC{i+1}' for i in range(n_components)]
)
print("Top features in PC1:")
print(components_df.loc['PC1'].abs().sort_values(ascending=False).head(10))
```

### 5.5.3 t-SNE and UMAP for Visualization

```python
from sklearn.manifold import TSNE
import umap  # provided by the umap-learn package

# t-SNE (slow, for visualization only)
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_scaled)

# UMAP (faster, often better)
reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X_scaled)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

axes[0].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', s=10, alpha=0.7)
axes[0].set_title('t-SNE Visualization')
axes[0].set_xlabel('t-SNE 1')
axes[0].set_ylabel('t-SNE 2')

axes[1].scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='viridis', s=10, alpha=0.7)
axes[1].set_title('UMAP Visualization')
axes[1].set_xlabel('UMAP 1')
axes[1].set_ylabel('UMAP 2')

plt.tight_layout()
plt.show()
```

### 5.5.4 Autoencoders for Non-linear Reduction

```python
import torch
import torch.nn as nn
import torch.optim as optim

class Autoencoder(nn.Module):
    def __init__(self, input_dim, encoding_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, encoding_dim)
        )
        
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim)
        )
    
    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
    
    def encode(self, x):
        return self.encoder(x)

# Training (simplified)
autoencoder = Autoencoder(input_dim=X.shape[1], encoding_dim=32)
optimizer = optim.Adam(autoencoder.parameters())
criterion = nn.MSELoss()

# Convert to tensors
X_tensor = torch.FloatTensor(X_scaled)

for epoch in range(100):
    reconstructed = autoencoder(X_tensor)
    loss = criterion(reconstructed, X_tensor)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item():.4f}')

# Get reduced features
with torch.no_grad():
    X_encoded = autoencoder.encode(X_tensor).numpy()
```

---

## 5.6 Building Production Preprocessing Pipelines

### 5.6.1 Scikit-learn Pipelines

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# Define feature types
numeric_features = ['age', 'income', 'transaction_count']
categorical_features = ['gender', 'city', 'education']
binary_features = ['is_active', 'has_children']

# Numeric pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Binary pipeline (just impute)
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))
])

# Combine all
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('bin', binary_transformer, binary_features)
    ])

# Full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train
pipeline.fit(X_train, y_train)

# Predict
predictions = pipeline.predict(X_test)

# Save pipeline
import joblib
joblib.dump(pipeline, 'model_pipeline.pkl')

# Load and use
loaded_pipeline = joblib.load('model_pipeline.pkl')
new_predictions = loaded_pipeline.predict(new_data)
```
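
A practical payoff of wrapping preprocessing and model in one `Pipeline` is leakage-free cross-validation: the imputer, scaler, and encoder are refit on each training fold. Here is a self-contained sketch on synthetic data; the columns, the `LogisticRegression` model, and the toy target are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: the target depends on city only; age is noise with gaps
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "city": rng.choice(["A", "B", "C"], n),
})
toy.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
y_toy = (toy["city"] == "A").astype(int)

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("num", numeric, ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Every fold refits the imputer, scaler, and encoder on that fold's training rows
scores = cross_val_score(pipe, toy, y_toy, cv=5)
print(scores.mean())
```

If you preprocess the whole dataset first and cross-validate only the model, the validation folds have already influenced the medians and scaling statistics, and the scores come out optimistic.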

### 5.6.2 Custom Transformers

```python
from sklearn.base import BaseEstimator, TransformerMixin

class DateFeatureExtractor(BaseEstimator, TransformerMixin):
    """Extract features from datetime column"""
    
    def __init__(self, date_column):
        self.date_column = date_column
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X = X.copy()
        dates = pd.to_datetime(X[self.date_column])
        
        X['hour'] = dates.dt.hour
        X['dayofweek'] = dates.dt.dayofweek
        X['month'] = dates.dt.month
        X['is_weekend'] = (dates.dt.dayofweek >= 5).astype(int)
        
        # Cyclical encoding
        X['hour_sin'] = np.sin(2 * np.pi * X['hour'] / 24)
        X['hour_cos'] = np.cos(2 * np.pi * X['hour'] / 24)
        
        # Drop original if desired
        X = X.drop(columns=[self.date_column])
        
        return X

# Usage in pipeline
pipeline = Pipeline([
    ('date_features', DateFeatureExtractor('transaction_date')),
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])
```

### 5.6.3 Handling Inference vs Training Consistency

```python
# Critical: Save all fitted transformers
import pickle
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class SafePreprocessor:
    def __init__(self):
        self.pipeline = None
        self.feature_names = None
        
    def fit(self, X, y=None):
        # Build pipeline with fitted statistics
        self.pipeline = Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ])
        self.pipeline.fit(X)
        
        # Save feature names for validation
        self.feature_names = X.columns.tolist()
        return self
    
    def transform(self, X):
        # Validate input has same columns
        missing_cols = set(self.feature_names) - set(X.columns)
        if missing_cols:
            raise ValueError(f"Missing columns: {missing_cols}")
        
        # Ensure column order matches training
        X = X[self.feature_names]
        
        return self.pipeline.transform(X)
    
    def save(self, path):
        with open(path, 'wb') as f:
            pickle.dump({'pipeline': self.pipeline, 
                        'feature_names': self.feature_names}, f)
    
    def load(self, path):
        with open(path, 'rb') as f:
            data = pickle.load(f)
            self.pipeline = data['pipeline']
            self.feature_names = data['feature_names']
        return self
```

---

## 5.7 Workbook Labs

### Lab 1: Rescue the Titanic Dataset
The Titanic dataset is famously messy. Your job: clean it for modeling.

**Tasks:**
1. Handle missing values in Age, Cabin, Embarked
2. Extract titles from Name (Mr., Mrs., etc.) and create feature
3. Create family size feature from SibSp + Parch + 1
4. Engineer "is_alone" feature
5. Create age groups (child, adult, elderly)
6. Build a preprocessing pipeline and compare model performance before/after

**Deliverable:** Jupyter notebook with EDA, preprocessing steps, and model comparison.

### Lab 2: Time Series Feature Engineering for Retail Sales
Given daily sales data for multiple stores:

**Tasks:**
1. Create lag features (1, 7, 30 days)
2. Create rolling statistics (7-day mean, 30-day std)
3. Add calendar features (day of week, month, holiday indicator)
4. Create price-sensitivity features (e.g., sales/price ratio; a true elasticity compares % change in sales with % change in price)
5. Handle outliers in sales (spikes during promotions vs genuine anomalies)
6. Build pipeline that handles temporal ordering correctly (no future leakage)

**Deliverable:** Python script with functions for each feature type, plus validation that features improve forecast accuracy.

### Lab 3: High-Cardinality Categorical Encoding
Dataset with 10,000+ product categories and 1M users.

**Tasks:**
1. Compare one-hot (impossible), frequency encoding, target encoding, and embeddings
2. Implement safe target encoding with 5-fold cross-validation
3. Measure impact on model performance vs baseline
4. Create embedding layer in PyTorch and compare to traditional methods

**Deliverable:** Performance comparison table, code for each method, recommendation for production.

### Lab 4: Production Pipeline with Validation
Build a complete preprocessing pipeline that:

**Tasks:**
1. Reads raw CSV
2. Performs cleaning (missing values, outliers)
3. Engineers features (temporal, interactions, aggregates)
4. Scales appropriately
5. Saves fitted preprocessor to disk
6. Includes validation script that checks new data against training distribution
7. Handles errors gracefully (missing columns, new categories)

**Deliverable:** `preprocess.py` with classes, `test_preprocess.py` with unit tests, and example usage.

---

## 5.8 Common Pitfalls

1. **Data Leakage in Feature Engineering:**
   - Using future information (e.g., a customer's total spend that includes transactions made after the prediction date)
   - Using target to create features (e.g., mean target encoding without CV)
   - Fitting scalers or imputers on the full dataset before the train/test split

2. **Assuming Missing is Random:**
   - Always investigate *why* data is missing
   - Consider adding "missing indicator" for important features

3. **One-Hot Encoding High Cardinality:**
   - Creates thousands of sparse columns
   - Memory explosion, slow training
   - Solution: frequency encoding, embeddings, or grouping

4. **Ignoring Domain Constraints:**
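   - Negative predictions for inherently positive values (prices, counts)
   - Probabilities outside [0,1] without proper transformation
   - Prefer a target transform or link function that respects the constraint (e.g., log-transform a positive target); clip predictions to the valid range post-model only as a final safeguard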

5. **Inconsistent Transformations:**
   - Different preprocessing between training and inference
   - Forgetting to save fitted scalers/encoders
   - Solution: Always use pipelines and save entire fitted pipeline

6. **Outlier Removal Without Investigation:**
   - "Outliers" might be the most interesting cases (fraud, rare diseases)
   - Consider separate modeling for outliers or treat as separate class
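
Pitfalls 1, 2, and 5 share one remedy: keep every fitted preprocessing step inside a pipeline that only ever sees training data. The sketch below uses synthetic data and is an illustration under those assumptions; `add_indicator=True` appends a missing-value flag column for each feature that had gaps during fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.random(200) < 0.1, 2] = np.nan  # knock out roughly 10% of one feature

# Split first, so no statistic is ever computed on test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = Pipeline([
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)  # medians, means, and stds come from X_train only
test_accuracy = model.score(X_test, y_test)

print(f"Imputation medians: {model.named_steps['impute'].statistics_}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

Because the imputer, scaler, and classifier travel together, saving `model` with `joblib.dump` preserves exactly the transformations used in training.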

---

## 5.9 Interview Questions

**Q1:** How do you handle missing values in a production system where new data arrives with different missing patterns?

*A: I'd use a pipeline with saved imputation parameters. For new missing patterns, I'd log alerts for data drift, have fallback strategies (global median if category missing), and regularly retrain the imputer on recent data. Critical: never impute with statistics from future data.*

**Q2:** Explain the difference between normalization and standardization. When would you use each?

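*A: Standardization (z-score) rescales each feature to mean 0 and standard deviation 1. It does not require normally distributed data and is a sensible default for scale-sensitive algorithms such as SVM, KNN, PCA, and regularized linear models. Normalization (min-max) maps values to a fixed range, usually [0,1]. It suits naturally bounded data and models that expect bounded inputs, but a single outlier can compress every other value. When outliers are present, a robust scaler (median and IQR) is usually the better choice.*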
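
To make the comparison concrete, the following sketch applies the three scikit-learn scalers to a small made-up column that contains one large outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one large outlier

z = StandardScaler().fit_transform(x)   # (x - mean) / std
mm = MinMaxScaler().fit_transform(x)    # (x - min) / (max - min)
rb = RobustScaler().fit_transform(x)    # (x - median) / IQR

for name, scaled in [("standard", z), ("min-max", mm), ("robust", rb)]:
    print(f"{name:>8}: {np.round(scaled.ravel(), 3)}")
```

With min-max scaling the four ordinary values are squeezed into roughly the bottom 3% of the range, while the robust scaler keeps them evenly spaced and pushes only the outlier far out.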

**Q3:** Your categorical feature has 10,000 unique values. How do you encode it?

*A: One-hot is impractical here (10k sparse columns). I'd try: (1) Frequency encoding (count or ratio), (2) Target encoding with proper CV to avoid leakage, (3) Embedding layer in neural net, (4) Hash encoding for memory efficiency, (5) Group rare categories as "other" then one-hot. Choice depends on model type and data size.*
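
Options (1) and (5) need nothing beyond pandas. This is a minimal sketch on made-up data; the `MIN_COUNT` threshold and the choice to give unseen categories a frequency of 0 are assumptions you would tune for your own dataset.

```python
import pandas as pd

train = pd.Series(
    ["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["d"] * 3 + ["e"] * 2,
    name="category",
)
new = pd.Series(["a", "d", "z"], name="category")  # "z" never appeared in training

# Frequency encoding: replace each category with its share of the training rows.
freq = train.value_counts(normalize=True)
train_freq = train.map(freq)
new_freq = new.map(freq).fillna(0.0)  # unseen categories get frequency 0

# Rare grouping: keep categories above a minimum count, lump the rest into "other".
MIN_COUNT = 10  # illustrative threshold
counts = train.value_counts()
keep = set(counts[counts >= MIN_COUNT].index)
train_grouped = train.where(train.isin(keep), "other")
new_grouped = new.where(new.isin(keep), "other")

print(new_freq.tolist())
print(new_grouped.tolist())
```

Both mappings are learned from training data only, so they can be stored and reused at inference time.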

**Q4:** What is the curse of dimensionality and how do you combat it?

*A: As dimensions increase, data becomes sparse, distances between points become less informative, and models overfit more easily. Combat with: (1) Feature selection (filter/wrapper/embedded), (2) Dimensionality reduction (PCA for modeling; t-SNE/UMAP mainly for visualization), (3) Regularization, (4) More data, (5) Domain knowledge to remove irrelevant features.*
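
The claim about distances can be checked with a small simulation. The sketch below draws uniform random points and measures the relative contrast, (max - min) / min, of distances from one query point; the helper name and sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=500):
    """(max - min) / min distance from one query point to uniform random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()


contrast = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrast.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

As the dimension grows, the nearest and farthest neighbours end up at nearly the same distance, which is why distance-based methods such as KNN degrade on wide data.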

**Q5:** How do you detect and handle data drift in preprocessing?

*A: Monitor distributions of input features vs training: (1) Statistical tests (KS-test, chi-square) for each feature, (2) Population Stability Index (PSI), (3) Feature importance drift. If drift detected: alert, investigate root cause, potentially retrain model on recent data with updated preprocessing statistics.*
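
A minimal sketch of checks (1) and (2) for one numeric feature follows. The `population_stability_index` helper is written here for illustration (quantile bins taken from the training sample, with a small `eps` to avoid log of zero); `ks_2samp` is SciPy's two-sample Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy.stats import ks_2samp


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a training sample (expected) and a new sample (actual)."""
    # Bin edges come from training quantiles, so each bin holds ~equal training mass.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    inner = edges[1:-1]  # values beyond the training range land in the end bins
    exp_counts = np.bincount(np.searchsorted(inner, expected, side="right"), minlength=n_bins)
    act_counts = np.bincount(np.searchsorted(inner, actual, side="right"), minlength=n_bins)
    exp_pct = np.clip(exp_counts / len(expected), eps, None)
    act_pct = np.clip(act_counts / len(actual), eps, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))


rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=5000)
same = rng.normal(0.0, 1.0, size=5000)      # new data, same distribution
shifted = rng.normal(0.5, 1.0, size=5000)   # new data, mean shifted by 0.5

psi_same = population_stability_index(train, same)
psi_shifted = population_stability_index(train, shifted)
ks_same = ks_2samp(train, same)
ks_shifted = ks_2samp(train, shifted)

print(f"PSI same={psi_same:.4f}  shifted={psi_shifted:.4f}")
print(f"KS  same={ks_same.statistic:.4f}  shifted={ks_shifted.statistic:.4f}")
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but thresholds should be calibrated per feature. With large samples the KS test flags even tiny shifts, so it is worth pairing the p-value with an effect-size measure such as PSI.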

**Q6:** You notice that after one-hot encoding, your model performance drops. Why?

*A: Possible reasons: (1) Too many sparse features causing overfitting, (2) Loss of ordinal information if categories had natural order, (3) Curse of dimensionality, (4) Insufficient data for rare categories. Solution: try different encoding (frequency, target) or add regularization.*

---

## 5.10 Further Reading

**Classic Papers:**
- *Regularized Target Encoding Outperforms Traditional Methods in Supervised Machine Learning with High Cardinality Features* (Pargent et al., 2022)
- *Random Search for Hyper-Parameter Optimization* (Bergstra & Bengio, 2012) - useful when tuning preprocessing and feature selection settings
- *Visualizing Data using t-SNE* (Van der Maaten & Hinton, 2008)

**Books:**
- *Feature Engineering for Machine Learning* (Alice Zheng, O'Reilly)
- *The Kaggle Book* (Konrad Banachewicz, Luca Massaron) - practical feature engineering examples
- *Python Feature Engineering Cookbook* (Soledad Galli)

**Tools to Explore:**
- **Feature-engine:** Library with advanced feature engineering transformers
- **Category Encoders:** scikit-learn-contrib package with many encoding methods
- **ydata-profiling (formerly Pandas Profiling):** Automated EDA report generation
- **Great Expectations:** Data validation and testing

---

## 5.11 Checkpoint Project: End-to-End Data Pipeline for Customer Churn

Build a complete, production-ready data preprocessing system for a telecom customer churn dataset.

### Dataset
Use a publicly available churn dataset (e.g., Telco Customer Churn from Kaggle) or simulate one with:
- Customer demographics (age, gender, income)
- Account information (tenure, contract type, payment method)
- Service usage (monthly charges, total charges, number of calls)
- Support interactions (tickets, complaints)
- Target: churned (yes/no)

### Requirements

**Phase 1: EDA and Cleaning**
- Profile data with automated and manual EDA
- Document missing data patterns and handling strategy
- Identify and decide on outliers
- Create data quality report with issues found and fixes applied

**Phase 2: Feature Engineering**
- Create at least 5 new features from existing data
- Document rationale for each (why it should predict churn)
- Handle temporal aspects correctly (no future leakage)
- Create interaction features between demographics and usage

**Phase 3: Encoding and Scaling**
- Apply appropriate encoding to all categorical features
- Scale numerical features appropriately
- Compare at least 3 encoding methods for high-cardinality features
- Justify your final choices

**Phase 4: Dimensionality Reduction**
- Apply feature selection to reduce features by 30% while maintaining performance
- Try PCA and compare interpretability vs performance
- Document which features were kept and why
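
One way to approach the 30% reduction is a univariate filter placed inside the pipeline, so that selection is refit within each cross-validation fold. The sketch below uses synthetic data from `make_classification` and the ANOVA F-score (`f_classif`) as an assumed scoring function; on the churn data you might prefer mutual information or a model-based selector.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=2, random_state=0
)

full = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
reduced = make_pipeline(
    StandardScaler(),
    SelectPercentile(f_classif, percentile=70),  # keep the top 70% of features
    LogisticRegression(max_iter=1000),
)

auc_full = cross_val_score(full, X, y, cv=5, scoring="roc_auc").mean()
auc_reduced = cross_val_score(reduced, X, y, cv=5, scoring="roc_auc").mean()

# Refit on all data to report which features survive.
reduced.fit(X, y)
kept = reduced.named_steps["selectpercentile"].get_support()

print(f"AUC with all features: {auc_full:.3f}")
print(f"AUC with {kept.sum()} of {kept.size} features: {auc_reduced:.3f}")
```

The boolean mask in `kept` gives you the list of retained features to document.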

**Phase 5: Production Pipeline**
- Build a complete sklearn pipeline with all preprocessing steps
- Include custom transformers for your engineered features
- Add validation to check new data matches training schema
- Save fitted pipeline to disk
- Create inference script that loads pipeline and scores new customers
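
A custom transformer for an engineered feature, with a basic schema check, might look like the sketch below. The class name `ChargesPerMonth`, the column names, and the decision to clip tenure at 1 are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ChargesPerMonth(TransformerMixin, BaseEstimator):
    """Adds average charge per month of tenure (hypothetical feature)."""

    def __init__(self, total_col="total_charges", tenure_col="tenure"):
        self.total_col = total_col
        self.tenure_col = tenure_col

    def fit(self, X, y=None):
        # Nothing to learn for the feature itself; record the training schema.
        self.columns_ = list(X.columns)
        return self

    def transform(self, X):
        missing = set(self.columns_) - set(X.columns)
        if missing:
            raise ValueError(f"Missing columns at transform time: {sorted(missing)}")
        X = X[self.columns_].copy()  # enforce training column order
        # Clip tenure at 1 so brand-new customers do not cause division by zero.
        X["charges_per_month"] = X[self.total_col] / X[self.tenure_col].clip(lower=1)
        return X


train = pd.DataFrame({
    "tenure": [0, 10, 20],
    "total_charges": [50.0, 500.0, 1500.0],
})

features = ChargesPerMonth().fit(train).transform(train)
pipeline = Pipeline([("ratio", ChargesPerMonth()), ("scale", StandardScaler())])
X_scaled = pipeline.fit_transform(train)

print(features)
print(X_scaled.shape)
```

Because the transformer follows the `fit`/`transform` contract, it can be saved with the rest of the pipeline and applied unchanged in the inference script.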

**Phase 6: Monitoring Setup**
- Define data drift metrics for each feature
- Create a simple dashboard (even static HTML) showing training vs current distributions
- Implement alerting if drift exceeds thresholds

### Deliverables

1. **Code Repository:**
   - `eda.py` - EDA and visualization
   - `features.py` - feature engineering functions
   - `pipeline.py` - complete preprocessing pipeline
   - `validate.py` - data validation checks
   - `inference.py` - scoring new data
   - `tests/` - unit tests for critical functions

2. **Documentation:**
   - `FEATURE_CATALOG.md` - description of each feature
   - `PIPELINE.md` - how to use the pipeline
   - `DATA_DICTIONARY.md` - expected schema

3. **Visualization:**
   - Before/after distributions for key features
   - Feature importance plot
   - Drift detection dashboard (static or interactive)

4. **Model Comparison:**
   - Train 3 models (logistic regression, random forest, XGBoost) using your pipeline
   - Compare performance with baseline (minimal preprocessing)
   - Show that your feature engineering improves performance

### Success Criteria
- Pipeline runs end-to-end in <5 minutes on a laptop
- All transformations are documented, and reversible where applicable (e.g., scaling)
- New unseen data can be scored without errors
- Code includes error handling for edge cases
- Clear demonstration that engineered features improve model performance (AUC improvement >0.05)

---

**End of Chapter 5**

*You've now mastered the art and science of turning raw data into model-ready features. Chapter 6 begins Supervised Learning with Regression — where you'll finally train models on the beautiful, clean data you've prepared.*

