# Module 3: Data Preprocessing and Feature Engineering

---

Real-world data is rarely clean. Missing values, inconsistent formats, and varied scales can significantly degrade model performance. This module covers the essential techniques for preparing data before feeding it into a machine learning algorithm.

**What you will learn:**
- Loading and inspecting datasets
- Handling missing values
- Encoding categorical variables
- Feature scaling and normalization
- Feature selection
- Train-test splitting strategies

---

## Table of Contents

1. [Loading and Inspecting Data](#1.-Loading-and-Inspecting-Data)
2. [Handling Missing Values](#2.-Handling-Missing-Values)
3. [Encoding Categorical Variables](#3.-Encoding-Categorical-Variables)
4. [Feature Scaling](#4.-Feature-Scaling)
5. [Feature Selection](#5.-Feature-Selection)
6. [Train-Test Splitting](#6.-Train-Test-Splitting)
7. [End-to-End Preprocessing Pipeline](#7.-End-to-End-Preprocessing-Pipeline)
8. [Exercises](#8.-Exercises)
9. [Summary and Further Reading](#9.-Summary-and-Further-Reading)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

---

## 1. Loading and Inspecting Data

The first step in any ML project is to load the data and understand its structure. We will create a realistic synthetic dataset that contains the types of issues you will encounter in practice.

In [None]:
# Create a synthetic dataset simulating housing data with realistic issues
np.random.seed(42)
n = 200

data = {
    'area_sqft': np.random.randint(500, 5000, n).astype(float),
    'bedrooms': np.random.choice([1, 2, 3, 4, 5], n),
    'age_years': np.random.randint(0, 50, n).astype(float),
    'location': np.random.choice(['Urban', 'Suburban', 'Rural'], n),
    'garage': np.random.choice(['Yes', 'No'], n),
    'condition': np.random.choice(['Poor', 'Fair', 'Good', 'Excellent'], n),
}

# Generate price as a function of features plus noise
price = (data['area_sqft'] * 100 +
         data['bedrooms'] * 15000 +
         (50 - data['age_years']) * 1000 +
         np.random.normal(0, 20000, n))
data['price'] = price

df = pd.DataFrame(data)

# Introduce missing values deliberately (to simulate real-world data)
missing_indices_area = np.random.choice(n, 15, replace=False)
missing_indices_age = np.random.choice(n, 10, replace=False)
missing_indices_garage = np.random.choice(n, 20, replace=False)

df.loc[missing_indices_area, 'area_sqft'] = np.nan
df.loc[missing_indices_age, 'age_years'] = np.nan
df.loc[missing_indices_garage, 'garage'] = np.nan

print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print()
df.head(10)

In [None]:
# Essential inspection steps
print("=" * 60)
print("DATA TYPES")
print("=" * 60)
print(df.dtypes)

print("\n" + "=" * 60)
print("BASIC STATISTICS (Numerical Columns)")
print("=" * 60)
print(df.describe().round(2))

print("\n" + "=" * 60)
print("MISSING VALUES")
print("=" * 60)
missing = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df) * 100).round(1)
missing_summary = pd.DataFrame({'Missing Count': missing, 'Percentage': missing_pct})
print(missing_summary[missing_summary['Missing Count'] > 0])
print(f"\nTotal missing values: {df.isnull().sum().sum()} out of {df.size} entries")

In [None]:
# Visualize missing values
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart of missing values
missing_cols = df.isnull().sum()
missing_cols = missing_cols[missing_cols > 0]
axes[0].barh(missing_cols.index, missing_cols.values, color='#FF5722', edgecolor='white')
axes[0].set_xlabel('Number of Missing Values')
axes[0].set_title('Missing Values by Column', fontsize=14, fontweight='bold')
for i, v in enumerate(missing_cols.values):
    axes[0].text(v + 0.3, i, str(v), va='center', fontsize=11)

# Heatmap of missing values
sns.heatmap(df.isnull(), cbar=True, yticklabels=False, cmap='YlOrRd', ax=axes[1])
axes[1].set_title('Missing Value Heatmap', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()
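
Beyond data types, summary statistics, and missing-value counts, two other quick checks are often worth running at this stage: duplicated rows and the number of distinct values per column. The sketch below uses a small hypothetical frame named `toy` (not the housing data) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only to illustrate the checks
toy = pd.DataFrame({
    'area_sqft': [800.0, 1200.0, np.nan, 1200.0],
    'location': ['Urban', 'Rural', 'Urban', 'Rural'],
    'garage': ['Yes', 'No', None, 'No'],
})

# Number of rows that exactly repeat an earlier row
n_duplicates = int(toy.duplicated().sum())

# Distinct non-missing values per column
cardinality = toy.nunique()

# Frequency of each category, counting missing entries as their own group
garage_counts = toy['garage'].value_counts(dropna=False)

print(f"Duplicate rows: {n_duplicates}")
print(cardinality)
print(garage_counts)
```

A column with very high cardinality (for example, an ID) or a large share of duplicated rows usually deserves a closer look before any imputation or encoding.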

---

## 2. Handling Missing Values

Missing data must be addressed before training. The main strategies are:

| Strategy | When to Use | Method |
|----------|------------|--------|
| **Drop rows** | Very few missing values, large dataset | `df.dropna()` |
| **Drop columns** | Column has > 50% missing values | `df.drop(columns=...)` |
| **Mean/Median imputation** | Numerical features (mean if roughly symmetric, median if skewed or with outliers) | `SimpleImputer(strategy='mean')` or `SimpleImputer(strategy='median')` |
| **Mode imputation** | Categorical features | `SimpleImputer(strategy='most_frequent')` |
| **Forward/Backward fill** | Time series data | `df.ffill()` / `df.bfill()` |

In [None]:
from sklearn.impute import SimpleImputer

# Work on a copy
df_clean = df.copy()

# Strategy 1: Impute numerical columns with the median
# (Median is preferred over mean when data may have outliers)
num_imputer = SimpleImputer(strategy='median')
numerical_cols = ['area_sqft', 'age_years']
df_clean[numerical_cols] = num_imputer.fit_transform(df_clean[numerical_cols])

# Strategy 2: Impute categorical columns with the mode (most frequent value)
cat_imputer = SimpleImputer(strategy='most_frequent')
df_clean[['garage']] = cat_imputer.fit_transform(df_clean[['garage']])

# Verify no missing values remain
print("Missing values after imputation:")
print(df_clean.isnull().sum())
print(f"\nTotal missing: {df_clean.isnull().sum().sum()}")

In [None]:
# Compare distributions before and after imputation
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for idx, col in enumerate(numerical_cols):
    ax = axes[idx]
    ax.hist(df[col].dropna(), bins=25, alpha=0.6, color='#2196F3',
            label='Before (with NaN dropped)', density=True, edgecolor='white')
    ax.hist(df_clean[col], bins=25, alpha=0.6, color='#FF5722',
            label='After imputation', density=True, edgecolor='white')
    ax.set_title(f'{col}: Before vs After Imputation', fontsize=13, fontweight='bold')
    ax.legend(fontsize=10)
    ax.set_xlabel(col)
    ax.set_ylabel('Density')

plt.tight_layout()
plt.show()
print("Median imputation keeps the center of the distribution but adds a spike at the median and slightly shrinks the variance.")
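
The other strategies from the table (dropping rows, dropping sparse columns, and forward fill) can be sketched in a few lines. The frame `log` below is a hypothetical time-ordered sensor log, and the 50% cutoff simply mirrors the rule of thumb in the table; neither is part of the housing example.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor log, ordered in time
log = pd.DataFrame({
    'temperature': [20.1, np.nan, 20.7, np.nan, 21.3],
    'humidity': [0.41, 0.43, 0.40, 0.44, 0.42],
    'notes': [np.nan, np.nan, np.nan, 'recalibrated', np.nan],
})

# Drop columns where more than half of the values are missing
missing_share = log.isnull().mean()
log_trimmed = log.drop(columns=missing_share[missing_share > 0.5].index)

# Drop rows: keep only rows with no missing values
rows_complete = log_trimmed.dropna()

# Forward fill: carry the last observed value forward (time-ordered data only)
log_filled = log_trimmed.ffill()

print(list(log_trimmed.columns))
print(len(rows_complete))
print(log_filled['temperature'].tolist())
```

Forward fill only makes sense when row order carries meaning. On shuffled tabular data such as the housing set it would copy values between unrelated rows.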

---

## 3. Encoding Categorical Variables

Most ML algorithms, including nearly all scikit-learn estimators, expect numeric input rather than strings. We need to convert categorical variables into numerical representations.

### Three Common Encoding Methods

| Method | Use Case | Example |
|--------|---------|--------|
| **Label Encoding** | Target labels or binary categories (codes follow sorted class order) | No=0, Yes=1 |
| **One-Hot Encoding** | Nominal categories (no order) | Urban=[1,0,0], Suburban=[0,1,0], Rural=[0,0,1] |
| **Ordinal Encoding** | Ordered categories, with an explicit order mapping | Poor=0, Fair=1, Good=2, Excellent=3 |
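
One detail worth checking when choosing between these encoders: `LabelEncoder` assigns integer codes in sorted (alphabetical) class order, while `OrdinalEncoder` accepts an explicit order through its `categories` argument. The short sketch below shows the difference on a hypothetical `conditions` array.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

conditions = np.array(['Poor', 'Fair', 'Good', 'Excellent', 'Good'])

# LabelEncoder assigns codes by sorted (alphabetical) class order
le_demo = LabelEncoder()
label_codes = le_demo.fit_transform(conditions)

# OrdinalEncoder accepts an explicit category order per column
oe_demo = OrdinalEncoder(categories=[['Poor', 'Fair', 'Good', 'Excellent']])
ordinal_codes = oe_demo.fit_transform(conditions.reshape(-1, 1)).ravel()

print(list(le_demo.classes_))   # alphabetical: Excellent, Fair, Good, Poor
print(label_codes.tolist())     # 'Poor' receives the highest code here
print(ordinal_codes.tolist())   # codes follow the Poor -> Excellent order
```

For an ordered feature such as `condition`, the alphabetical codes would scramble the ranking, which is why the explicit mapping matters.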

In [None]:
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df_encoded = df_clean.copy()

# --- Label Encoding for binary variables (Yes/No) ---
le = LabelEncoder()
df_encoded['garage_encoded'] = le.fit_transform(df_encoded['garage'])
print("Label Encoding for 'garage':")
print(f"  Classes: {list(le.classes_)}")
print(f"  Mapping: {dict(zip(le.classes_, le.transform(le.classes_)))}")
print()

# --- Ordinal Encoding for ordered categories ---
condition_order = [['Poor', 'Fair', 'Good', 'Excellent']]
oe = OrdinalEncoder(categories=condition_order)
df_encoded['condition_encoded'] = oe.fit_transform(df_encoded[['condition']]).ravel()  # flatten (n, 1) output to 1-D
print("Ordinal Encoding for 'condition':")
print(f"  Order: Poor=0, Fair=1, Good=2, Excellent=3")
print()

# --- One-Hot Encoding for nominal categories (no inherent order) ---
location_dummies = pd.get_dummies(df_encoded['location'], prefix='location', dtype=int)
df_encoded = pd.concat([df_encoded, location_dummies], axis=1)
print("One-Hot Encoding for 'location':")
print(location_dummies.head())

# Drop original categorical columns
df_encoded = df_encoded.drop(columns=['garage', 'condition', 'location'])

print("\n--- Encoded DataFrame (first 5 rows) ---")
df_encoded.head()

---

## 4. Feature Scaling

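Features with different scales can cause problems for many algorithms (e.g., KNN, SVM, gradient descent-based models), because a feature measured in large units can dominate distances and gradients. Scaling puts features on comparable ranges so that none dominates purely because of its units.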

### Common Scaling Methods

| Method | Formula | Range | Best For |
|--------|---------|-------|----------|
| **StandardScaler** | (x - mean) / std | Centered at 0 | Most ML algorithms |
| **MinMaxScaler** | (x - min) / (max - min) | [0, 1] | Neural networks, bounded algorithms |
| **RobustScaler** | (x - median) / IQR | Varies | Data with outliers |
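
The formulas in the table can be checked by hand against the scikit-learn scalers. The sketch below does this on a single hypothetical feature `x` containing one outlier; it assumes the scalers' default settings (population standard deviation for `StandardScaler`, 25th to 75th percentile range for `RobustScaler`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one feature with an outlier

# StandardScaler: (x - mean) / std, using the population std (ddof=0)
manual_standard = (x - x.mean()) / x.std()

# MinMaxScaler: (x - min) / (max - min)
manual_minmax = (x - x.min()) / (x.max() - x.min())

# RobustScaler: (x - median) / IQR, with IQR = 75th - 25th percentile
q25, q75 = np.percentile(x, [25, 75])
manual_robust = (x - np.median(x)) / (q75 - q25)

print(np.allclose(manual_standard, StandardScaler().fit_transform(x)))
print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(x)))
print(np.allclose(manual_robust, RobustScaler().fit_transform(x)))
```

Printing `manual_minmax` shows the effect of the outlier: the four ordinary values are squeezed close to 0, while `manual_robust` keeps them spread out around 0.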

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Select numerical features for scaling
features_to_scale = ['area_sqft', 'bedrooms', 'age_years']
original_data = df_encoded[features_to_scale].copy()

# Apply each scaler
scalers = {
    'Original (unscaled)': original_data.values,
    'StandardScaler': StandardScaler().fit_transform(original_data),
    'MinMaxScaler': MinMaxScaler().fit_transform(original_data),
    'RobustScaler': RobustScaler().fit_transform(original_data),
}

# Compare statistics
print(f"{'':>20} {'area_sqft':>12} {'bedrooms':>12} {'age_years':>12}")
print("-" * 58)
for name, values in scalers.items():
    means = values.mean(axis=0)
    print(f"{name:>20}: mean = [{means[0]:>8.2f}, {means[1]:>8.2f}, {means[2]:>8.2f}]")

In [None]:
# Visualize the effect of different scalers
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
colors = ['#2196F3', '#FF5722', '#4CAF50']

for idx, (name, values) in enumerate(scalers.items()):
    ax = axes[idx // 2, idx % 2]
    df_temp = pd.DataFrame(values, columns=features_to_scale)
    for j, col in enumerate(features_to_scale):
        ax.hist(df_temp[col], bins=25, alpha=0.6, color=colors[j],
                label=col, edgecolor='white')
    ax.set_title(name, fontsize=13, fontweight='bold')
    ax.legend(fontsize=9)
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')

plt.suptitle('Effect of Different Scaling Methods', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

print("Observations:")
print("  - Original: features are on vastly different scales (area_sqft vs bedrooms).")
print("  - StandardScaler: centers each feature at mean=0, std=1.")
print("  - MinMaxScaler: compresses all features into the [0, 1] range.")
print("  - RobustScaler: uses median and IQR — less affected by outliers.")
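
The comparison above fits each scaler on the full dataset, which is fine for illustration. When a model is evaluated on held-out data, the usual practice is to learn the scaling statistics from the training rows only and reuse them on the test rows. A minimal sketch, using hypothetical random features `X_demo`:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 2))  # hypothetical features

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # learn mean/std from training rows only
X_te_scaled = scaler.transform(X_te)       # reuse the training mean/std

print(X_tr_scaled.mean(axis=0).round(6))   # zero up to floating-point error
print(X_te_scaled.mean(axis=0).round(3))   # close to, but generally not exactly, zero
```

The test-set mean is not exactly zero, and that is expected: the test rows are treated as unseen data, scaled with statistics they did not contribute to.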

---

## 5. Feature Selection

Not all features are useful. Irrelevant or redundant features can hurt performance and slow training. Feature selection identifies the most informative features.

### Common Approaches

1. **Correlation analysis**: when two features are highly correlated with each other, keep one and drop the other
2. **Variance threshold**: remove features with very low variance (near-constant)
3. **Statistical tests**: select features based on their relationship with the target

In [None]:
# Correlation analysis
fig, ax = plt.subplots(figsize=(10, 8))
corr_matrix = df_encoded.corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))  # show only lower triangle
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f', cmap='RdYlBu_r',
            center=0, square=True, linewidths=0.5, ax=ax,
            cbar_kws={'label': 'Correlation'})
ax.set_title('Feature Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Which features correlate most with the target (price)?
print("\nCorrelation with target variable (price):")
print("-" * 40)
target_corr = corr_matrix['price'].drop('price').abs().sort_values(ascending=False)
for feat, corr in target_corr.items():
    print(f"  {feat:>25s}: {corr:.3f}")

In [None]:
# Variance Threshold — remove near-constant features
from sklearn.feature_selection import VarianceThreshold

# Apply to numerical features only
selector = VarianceThreshold(threshold=0.01)  # drop features whose variance is at or below 0.01
numerical_features = df_encoded.select_dtypes(include=[np.number])
selector.fit(numerical_features)  # computes the variance of every column

print("Feature variances:")
for col, var in zip(numerical_features.columns, selector.variances_):
    print(f"  {col:>25s}: {var:.2f}")

# Columns that did not pass the threshold (empty list if all pass)
removed = numerical_features.columns[~selector.get_support()]
print(f"\nFeatures below the threshold: {list(removed)}")
print("Near-constant features carry almost no information and are candidates for removal.")
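
The third approach, statistical tests, is not demonstrated above. One possible sketch uses scikit-learn's `SelectKBest` with the univariate `f_regression` score. The data here is hypothetical: four random features of which only the first and third influence the target.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n_samples = 300
X_sel = rng.normal(size=(n_samples, 4))  # hypothetical features f0..f3
# Only f0 and f2 drive the target; f1 and f3 are pure noise
y_sel = 3.0 * X_sel[:, 0] - 2.0 * X_sel[:, 2] + rng.normal(scale=0.5, size=n_samples)

# Score each feature with a univariate F-test and keep the top k
selector_kbest = SelectKBest(score_func=f_regression, k=2)
X_selected = selector_kbest.fit_transform(X_sel, y_sel)

print(selector_kbest.scores_.round(1))
print(selector_kbest.get_support())
```

Because the F-test scores each feature on its own, it can miss features that matter only in combination with others. It is a quick filter rather than a complete selection method.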

---

## 6. Train-Test Splitting

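Splitting the data correctly is critical for reliable model evaluation: the test set must stay unseen during training and preprocessing so that its score estimates performance on new data.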

In [None]:
from sklearn.model_selection import train_test_split

# Prepare features and target
X = df_encoded.drop(columns=['price'])
y = df_encoded['price']

# Standard 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Total samples:    {len(X)}")
print(f"Training samples: {len(X_train)} ({len(X_train)/len(X):.0%})")
print(f"Test samples:     {len(X_test)} ({len(X_test)/len(X):.0%})")

print(f"\nFeature columns ({X.shape[1]}): {list(X.columns)}")
print(f"\nTraining target statistics:")
print(f"  Mean:  {y_train.mean():,.0f}")
print(f"  Std:   {y_train.std():,.0f}")
print(f"\nTest target statistics:")
print(f"  Mean:  {y_test.mean():,.0f}")
print(f"  Std:   {y_test.std():,.0f}")
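
A single random hold-out split is only one strategy. Two common variations are a stratified split, which preserves class proportions for classification targets, and K-fold cross-validation, which rotates the validation set. The sketch below uses hypothetical imbalanced labels `y_cls` rather than the housing prices, since stratification applies to class labels.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Hypothetical imbalanced labels: 90 negatives, 10 positives
X_cls = np.arange(100).reshape(-1, 1)
y_cls = np.array([0] * 90 + [1] * 10)

# Stratified hold-out split: class proportions are preserved in both parts
X_a, X_b, y_a, y_b = train_test_split(
    X_cls, y_cls, test_size=0.2, stratify=y_cls, random_state=42
)
print(f"Positive share: train={y_a.mean():.2f}, test={y_b.mean():.2f}")

# K-fold cross-validation: every sample lands in a validation fold exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [len(val_idx) for _, val_idx in kf.split(X_cls)]
print(f"Validation fold sizes: {fold_sizes}")
```

Without `stratify`, a small test set drawn from imbalanced data can end up with very few positives, or none, which makes the resulting metrics unreliable.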

---

## 7. End-to-End Preprocessing Pipeline

Scikit-learn provides `Pipeline` and `ColumnTransformer` to chain preprocessing steps together. This is best practice because it:
- Prevents data leakage (imputers, scalers, and encoders are fitted only on training data)
- Makes the workflow reproducible
- Simplifies deployment

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Start from the original (dirty) data
X_raw = df.drop(columns=['price'])
y_raw = df['price']

# Define column groups
numeric_features = ['area_sqft', 'bedrooms', 'age_years']
categorical_features = ['location', 'garage', 'condition']

# Numeric pipeline: impute missing values with median, then standardize
numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline: impute missing with mode, then one-hot encode
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False))  # drop='first' to avoid multicollinearity
])

# Combine into a single ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipeline, numeric_features),
        ('cat', categorical_pipeline, categorical_features)
    ]
)

# Split data FIRST (before fitting the preprocessor)
X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# Fit on training data, transform both training and test
X_train_processed = preprocessor.fit_transform(X_train_raw)
X_test_processed = preprocessor.transform(X_test_raw)  # only transform, no fit!

# Get feature names after transformation
cat_feature_names = preprocessor.named_transformers_['cat']['encoder'].get_feature_names_out(categorical_features)
all_feature_names = list(numeric_features) + list(cat_feature_names)

print("Pipeline Steps:")
print("  1. Numerical: Median Imputation -> Standard Scaling")
print("  2. Categorical: Mode Imputation -> One-Hot Encoding")
print(f"\nOriginal features: {X_raw.shape[1]}")
print(f"Processed features: {X_train_processed.shape[1]}")
print(f"Feature names: {all_feature_names}")
print(f"\nTraining set shape: {X_train_processed.shape}")
print(f"Test set shape:     {X_test_processed.shape}")

# Quick sanity check: processed training data
processed_df = pd.DataFrame(X_train_processed, columns=all_feature_names)
print("\n--- Processed Training Data (first 5 rows) ---")
processed_df.head()

In [None]:
# Quick model to verify the pipeline works end-to-end
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

model = LinearRegression()
model.fit(X_train_processed, y_train_raw)
y_pred = model.predict(X_test_processed)

rmse = np.sqrt(mean_squared_error(y_test_raw, y_pred))
r2 = r2_score(y_test_raw, y_pred)

print("Quick Linear Regression Test (to validate the pipeline):")
print(f"  RMSE: {rmse:,.0f}")
print(f"  R2:   {r2:.4f}")
print("\nThe pipeline successfully preprocesses raw data for model consumption.")
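
A natural next step is to place the model inside the same `Pipeline` as the preprocessing, so that cross-validation refits the imputer, scaler, and encoder on each training fold. The sketch below is one way to do this under simplifying assumptions: it builds a small hypothetical frame `houses` with two feature columns instead of reusing the full dataset, and the step names `prep` and `regressor` are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical small dataset with the same kinds of columns as above
rng = np.random.default_rng(0)
n_rows = 120
houses = pd.DataFrame({
    'area_sqft': rng.integers(500, 5000, n_rows).astype(float),
    'location': rng.choice(['Urban', 'Suburban', 'Rural'], n_rows),
})
houses['price'] = houses['area_sqft'] * 100 + rng.normal(0, 20000, n_rows)
houses.loc[houses.index % 10 == 0, 'area_sqft'] = np.nan  # inject missing values

prep = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), ['area_sqft']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['location']),
])

# Preprocessing and model in one estimator: each CV fold refits both
full_model = Pipeline([('prep', prep), ('regressor', LinearRegression())])

scores = cross_val_score(full_model, houses[['area_sqft', 'location']],
                         houses['price'], cv=5, scoring='r2')
print(scores.round(3))
```

Because the whole estimator is a single object, it can also be fitted once on the training data and then called with `predict` on raw, unprocessed rows.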

---

## 8. Exercises

### Exercise 1: Handle a Messy Dataset

In [None]:
# Exercise 1: Clean the following messy dataset

np.random.seed(99)
messy_data = pd.DataFrame({
    'age': [25, np.nan, 30, 45, np.nan, 35, 28, np.nan, 50, 22],
    'salary': [50000, 60000, np.nan, 80000, 70000, np.nan, 55000, 65000, np.nan, 45000],
    'department': ['Sales', 'IT', 'IT', np.nan, 'HR', 'Sales', np.nan, 'IT', 'HR', 'Sales'],
    'performance': ['Good', 'Excellent', 'Fair', 'Good', 'Poor', np.nan, 'Good', 'Excellent', 'Fair', 'Good']
})

print("Messy Dataset:")
print(messy_data)
print(f"\nMissing values:\n{messy_data.isnull().sum()}")

# TODO: 
# 1. Impute 'age' and 'salary' with the median
# 2. Impute 'department' with the most frequent value
# 3. Encode 'performance' with ordinal encoding (Poor=0, Fair=1, Good=2, Excellent=3)
# 4. One-hot encode 'department'
# 5. Print the cleaned dataframe

# Your code here:


### Exercise 2: Compare Scaling Effects on KNN

In [None]:
# Exercise 2: Train a KNN classifier on the Iris dataset with and without scaling.
# Compare the test accuracy.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# TODO:
# 1. Split data into train/test (80/20)
# 2. Train KNN (K=5) WITHOUT scaling and record accuracy
# 3. Scale the features using StandardScaler (fit on train, transform both)
# 4. Train KNN (K=5) WITH scaling and record accuracy
# 5. Print both accuracies and comment on the difference

# Your code here:


### Exercise 3: Build a Complete Preprocessing Pipeline

In [None]:
# Exercise 3: Using sklearn Pipeline and ColumnTransformer,
# build a preprocessing pipeline for the Titanic-like dataset below.

titanic_data = pd.DataFrame({
    'pclass': [1, 3, 2, 1, 3, 2, 1, 3, 2, 3],
    'sex': ['male', 'female', 'female', 'male', 'male', 'female', 'female', 'male', 'male', 'female'],
    'age': [22, np.nan, 35, 45, np.nan, 28, 58, 19, np.nan, 30],
    'fare': [7.25, 71.28, 8.05, 52.0, np.nan, 13.0, 26.55, 8.05, 11.5, np.nan],
    'embarked': ['S', 'C', 'S', np.nan, 'S', 'Q', 'S', 'S', 'C', 'Q'],
    'survived': [0, 1, 1, 1, 0, 1, 1, 0, 0, 1]
})

print("Titanic-like Dataset:")
print(titanic_data)

# TODO:
# 1. Separate features (X) and target (y = 'survived')
# 2. Identify numerical and categorical columns
# 3. Build a ColumnTransformer with appropriate pipelines for each
# 4. Fit and transform the data
# 5. Print the processed feature matrix

# Your code here:


---

## 9. Summary and Further Reading

### What We Covered

- **Data Inspection**: Always start by understanding shape, types, distributions, and missing values.
- **Missing Values**: Impute using mean/median (numerical) or mode (categorical). Drop only when appropriate.
- **Encoding**: Use ordinal encoding (with an explicit order) for ordered categories, one-hot encoding for nominal categories, and label encoding for binary categories or target labels.
- **Scaling**: StandardScaler (most common), MinMaxScaler (for bounded ranges), RobustScaler (for outlier-heavy data).
- **Feature Selection**: Use correlation analysis to spot redundant or weakly related features and variance thresholds to remove near-constant ones.
- **Pipelines**: Use `Pipeline` and `ColumnTransformer` to build reproducible, leak-free preprocessing workflows.

### Recommended Reading

- [Scikit-learn Preprocessing Guide](https://scikit-learn.org/stable/modules/preprocessing.html)
- [Scikit-learn Pipelines](https://scikit-learn.org/stable/modules/compose.html)
- Chapter 2 of Aurélien Géron, *Hands-On Machine Learning* (end-to-end ML project with full preprocessing)

### Next Module

In **Module 4: Supervised Learning — Regression**, we will apply these preprocessing techniques to build regression models, covering linear regression, polynomial regression, regularization, and more.

---