# Module 02: Data Preparation and Train/Test Split

**Difficulty**: ⭐ Beginner  
**Estimated Time**: 90 minutes  
**Prerequisites**: 
- [Module 00: Introduction to Machine Learning](00_introduction_to_machine_learning.ipynb)
- [Module 01: Supervised vs Unsupervised Learning](01_supervised_vs_unsupervised_learning.ipynb)
- Pandas and NumPy proficiency

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand why data preparation is critical for ML success
2. Handle missing data using various imputation strategies
3. Encode categorical variables (one-hot, label encoding)
4. Scale and normalize features properly
5. Split data correctly to avoid data leakage
6. Create training, validation, and test sets
7. Recognize and prevent common data preparation mistakes

## 1. Setup and Imports

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Scikit-learn preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    LabelEncoder, OneHotEncoder
)
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Example dataset
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("All libraries imported successfully!")

## 2. Why Data Preparation Matters

### The 80/20 Rule of Machine Learning

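A common rule of thumb says that roughly **80% of ML work is data preparation** and only about 20% is modeling. The exact ratio varies by project, but preparation usually dominates.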

Real-world data is messy:
- **Missing values**: Incomplete records
- **Different scales**: Age (0-100) vs Income (0-1,000,000)
- **Categorical data**: Text labels that need encoding
- **Outliers**: Extreme values that can skew models
- **Inconsistent formats**: Dates, currencies, text

### Consequences of Poor Data Preparation

❌ **Models fail to converge** (the optimizer never settles on a solution)  
❌ **Poor performance** due to biased or noisy data  
❌ **Data leakage** - accidentally using test information in training  
❌ **Overfitting** on training quirks  

### Our Goal

Transform raw data into **clean, ML-ready format** where:
- No missing values (or handled appropriately)
- All features are numerical
- Features are on similar scales
- Training and test data are properly separated

## 3. Creating a Messy Dataset for Practice

Let's create a realistic messy dataset to practice data preparation.

In [None]:
# Create a synthetic messy dataset
np.random.seed(42)
n_samples = 200

# Create data with various issues
data = {
    'age': np.random.randint(18, 70, n_samples),
    'income': np.random.normal(50000, 20000, n_samples),
    'credit_score': np.random.randint(300, 850, n_samples),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_samples),
    'employment': np.random.choice(['Employed', 'Self-Employed', 'Unemployed'], n_samples),
    'loan_approved': np.random.choice([0, 1], n_samples, p=[0.3, 0.7])
}

df = pd.DataFrame(data)

# Introduce missing values (realistic scenario)
missing_indices_income = np.random.choice(df.index, size=20, replace=False)
missing_indices_education = np.random.choice(df.index, size=15, replace=False)
df.loc[missing_indices_income, 'income'] = np.nan
df.loc[missing_indices_education, 'education'] = np.nan

# Add some outliers (replace=False so that three distinct rows are changed)
df.loc[np.random.choice(df.index, 3, replace=False), 'income'] = np.random.uniform(200000, 300000, 3)

print("Created messy dataset:")
print(df.head(10))
print(f"\nDataset shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")

In [None]:
# Check for data quality issues
print("Data Quality Report:")
print("=" * 60)
print(f"\nMissing Values:")
print(df.isnull().sum())
print(f"\nMissing Value Percentages:")
print((df.isnull().sum() / len(df) * 100).round(2))
print(f"\nData Type Summary:")
print(df.dtypes.value_counts())
print(f"\nBasic Statistics:")
print(df.describe())

## 4. Handling Missing Values

### Strategies for Missing Data

1. **Remove rows/columns**: If a large share of the values is missing (see the rule of thumb below)
2. **Imputation**: Fill missing values with:
   - Mean/Median/Mode (for numerical)
   - Most frequent category (for categorical)
   - Forward/backward fill (for time series)
   - Advanced: KNN or model-based imputation
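
As a minimal sketch of the "advanced" option above, scikit-learn's `KNNImputer` fills a missing entry with the average of that feature over the most similar rows. The toy matrix and its values are made up for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix (hypothetical values): columns are [age, income]; one income is missing.
X_toy = np.array([
    [25.0, 40000.0],
    [27.0, 42000.0],
    [52.0, 90000.0],
    [26.0, np.nan],
])

# Each missing entry is replaced by the mean of that feature over the
# n_neighbors rows that are closest on the features both rows have observed.
knn_imputer = KNNImputer(n_neighbors=2)
X_filled = knn_imputer.fit_transform(X_toy)

# The two rows closest in age to the last row are the first two,
# so the imputed income is the average of 40000 and 42000.
print(X_filled)
```

Because neighbors are found by distance, features on very different scales should usually be scaled before KNN imputation on real data.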

### Rule of Thumb

- **< 5% missing**: Safe to impute
- **5-25% missing**: Impute carefully, consider impact
- **> 25% missing**: Consider dropping feature or collecting more data
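
The thresholds above can be turned into a quick per-column report. This is only a sketch: `suggest_action` and `missing_report` are hypothetical helper names, and the cut-offs are the rule-of-thumb values from this section, not hard limits.

```python
import numpy as np
import pandas as pd

def suggest_action(missing_pct: float) -> str:
    """Map a missing-value percentage to the rule-of-thumb suggestion."""
    if missing_pct < 5:
        return "safe to impute"
    if missing_pct <= 25:
        return "impute carefully"
    return "consider dropping"

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the missing percentage and a suggested action for each column."""
    pct = frame.isnull().mean() * 100
    return pd.DataFrame({
        "missing_pct": pct.round(2),
        "suggestion": pct.map(suggest_action),
    })

# Toy frame with 0%, 10%, and 40% missing values
toy = pd.DataFrame({
    "a": np.arange(20, dtype=float),
    "b": [np.nan] * 2 + list(range(18)),
    "c": [np.nan] * 8 + list(range(12)),
})
report = missing_report(toy)
print(report)
```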

In [None]:
# Visualize missing data patterns
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.title('Missing Data Pattern (Yellow = Missing)')
plt.xlabel('Features')
plt.ylabel('Samples')
plt.tight_layout()
plt.show()

print(f"Total missing values: {df.isnull().sum().sum()}")
print(f"Complete rows: {df.dropna().shape[0]} out of {len(df)}")

In [None]:
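# Handle missing values
# Note: to keep the demonstration short, the imputers are fit on the full dataset here.
# In a real project, split first and fit imputers on the training set only (see Section 7).
df_clean = df.copy()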

# For numerical: Impute with median (robust to outliers)
income_imputer = SimpleImputer(strategy='median')
df_clean['income'] = income_imputer.fit_transform(df_clean[['income']])

# For categorical: Impute with most frequent
education_imputer = SimpleImputer(strategy='most_frequent')
df_clean['education'] = education_imputer.fit_transform(
    df_clean[['education']]
).ravel()

print("After imputation:")
print(df_clean.isnull().sum())
print("\nNo more missing values!")

## 5. Encoding Categorical Variables

Most ML algorithms work with numbers, not text, so categorical data must be converted first.

### Two Main Approaches

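#### 1. Label Encoding (Ordinal)
- Converts categories to integers: 0, 1, 2, ...
- **Use when**: Categories have a natural order (Low < Medium < High)
- **Problem**: Implies an order even when there isn't one
- **Note**: In scikit-learn, `OrdinalEncoder` is the tool for feature columns; `LabelEncoder` is intended for the target column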

#### 2. One-Hot Encoding (Nominal)
- Creates binary column for each category
- **Use when**: No natural order (Red, Blue, Green)
- **Problem**: Can create many features (curse of dimensionality)
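
Both approaches are also available as scikit-learn transformers, which can be fit on training data and reused on new data. The sketch below uses made-up `sizes` and `colors` columns; the category order passed to `OrdinalEncoder` is an assumption you would set from domain knowledge.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical columns: one ordinal, one nominal
sizes = np.array([["Low"], ["High"], ["Medium"], ["Low"]])
colors = np.array([["Red"], ["Blue"], ["Green"], ["Red"]])

# Ordinal: pass the order explicitly; otherwise categories are sorted alphabetically.
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
sizes_encoded = ordinal.fit_transform(sizes)

# Nominal: one binary column per category. With handle_unknown="ignore",
# a category not seen during fit is encoded as all zeros instead of raising.
onehot = OneHotEncoder(handle_unknown="ignore")
colors_encoded = onehot.fit_transform(colors).toarray()

print("Ordinal codes:", sizes_encoded.ravel())
print("One-hot columns:", onehot.categories_[0])
print(colors_encoded)
```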

In [None]:
# Example: Label Encoding (when order matters)
# Let's say education has natural order
education_order = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}
df_clean['education_encoded'] = df_clean['education'].map(education_order)

print("Label Encoding Example:")
print(df_clean[['education', 'education_encoded']].head(10))

In [None]:
# Example: One-Hot Encoding (when no natural order)
# Employment status has no order
employment_dummies = pd.get_dummies(
    df_clean['employment'],
    prefix='employment',
    drop_first=True,  # Avoid dummy variable trap
    dtype=int  # 0/1 integers; newer pandas versions default to booleans
)

print("One-Hot Encoding Example:")
print("Original employment column:")
print(df_clean['employment'].head())
print("\nOne-hot encoded:")
print(employment_dummies.head())
print("\nNote: We dropped first category to avoid multicollinearity")

In [None]:
# Combine all features
df_encoded = df_clean.copy()
df_encoded = pd.concat([df_encoded, employment_dummies], axis=1)

# Drop original categorical columns
df_encoded = df_encoded.drop(['employment', 'education'], axis=1)

print("Fully encoded dataset:")
print(df_encoded.head())
print(f"\nShape: {df_encoded.shape}")
print(f"All numerical: {df_encoded.select_dtypes(include=[np.number]).shape[1] == df_encoded.shape[1]}")

## 6. Feature Scaling

### Why Scale Features?

Many ML algorithms are sensitive to feature scales:
- Distance-based: KNN, SVM, K-Means
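- Gradient-based: models trained with gradient descent, such as Logistic Regression and Neural Networks (unscaled features slow down or destabilize convergence)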

Example problem:
- Age: 18-70 (range ~50)
- Income: 20,000-200,000 (range ~180,000)

Income will dominate distance calculations!
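
A small numeric sketch of this effect, using four hypothetical applicants. On raw values, person B (40 years older than A) looks closer to A than person C (1 year older), simply because a 500-unit income gap is smaller than a 1,000-unit one. After standardization the ordering flips.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical applicants as [age, income] rows
people = np.array([
    [25.0, 50000.0],  # A
    [65.0, 50500.0],  # B: 40 years older than A, income differs by 500
    [26.0, 51000.0],  # C: 1 year older than A, income differs by 1,000
    [45.0, 90000.0],  # D: included so income has a realistic spread
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

raw_ab = euclidean(people[0], people[1])
raw_ac = euclidean(people[0], people[2])

scaled = StandardScaler().fit_transform(people)
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print(f"Raw:    A-B = {raw_ab:.1f}, A-C = {raw_ac:.1f}")
print(f"Scaled: A-B = {scaled_ab:.2f}, A-C = {scaled_ac:.2f}")
```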

### Scaling Methods

#### 1. StandardScaler (Z-score normalization)
- Formula: (x - mean) / std
- Result: Mean=0, Std=1
- **Use when**: Features are roughly bell-shaped and free of extreme outliers (a sensible default for many models)

#### 2. MinMaxScaler
- Formula: (x - min) / (max - min)
- Result: Range [0, 1]
- **Use when**: Bounded range needed

#### 3. RobustScaler
- Formula: (x - median) / IQR, where IQR is the interquartile range (75th minus 25th percentile)
- Result: Median=0, IQR=1
- **Use when**: Data has outliers
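
To connect the formulas to the library, the sketch below applies each one by hand with NumPy to a made-up column containing one outlier and compares the result with the corresponding scikit-learn scaler (default settings assumed).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Single feature with one outlier (hypothetical values)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# StandardScaler: (x - mean) / std, using the population std (ddof=0)
z_manual = (x - x.mean()) / x.std()

# MinMaxScaler: (x - min) / (max - min)
mm_manual = (x - x.min()) / (x.max() - x.min())

# RobustScaler: (x - median) / IQR
q1, median, q3 = np.percentile(x, [25, 50, 75])
rb_manual = (x - median) / (q3 - q1)

print("Standard matches:", np.allclose(z_manual, StandardScaler().fit_transform(x)))
print("MinMax matches:  ", np.allclose(mm_manual, MinMaxScaler().fit_transform(x)))
print("Robust matches:  ", np.allclose(rb_manual, RobustScaler().fit_transform(x)))

# The outlier squeezes the other MinMax values toward 0,
# while RobustScaler keeps them spread around the median.
print("MinMax:", mm_manual.ravel().round(3))
print("Robust:", rb_manual.ravel().round(3))
```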

In [None]:
# Compare different scaling methods
feature_to_scale = df_encoded[['age', 'income', 'credit_score']].copy()

# Original data
print("Original Data Statistics:")
print(feature_to_scale.describe())

In [None]:
# StandardScaler
standard_scaler = StandardScaler()
scaled_standard = pd.DataFrame(
    standard_scaler.fit_transform(feature_to_scale),
    columns=feature_to_scale.columns
)

# MinMaxScaler
minmax_scaler = MinMaxScaler()
scaled_minmax = pd.DataFrame(
    minmax_scaler.fit_transform(feature_to_scale),
    columns=feature_to_scale.columns
)

# RobustScaler
robust_scaler = RobustScaler()
scaled_robust = pd.DataFrame(
    robust_scaler.fit_transform(feature_to_scale),
    columns=feature_to_scale.columns
)

print("\nStandardScaler (mean≈0, std≈1):")
print(scaled_standard.describe().round(2))
print("\nMinMaxScaler (range [0,1]):")
print(scaled_minmax.describe().round(2))
print("\nRobustScaler (uses median & IQR):")
print(scaled_robust.describe().round(2))

In [None]:
# Visualize the effect of scaling
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Original
feature_to_scale.boxplot(ax=axes[0, 0])
axes[0, 0].set_title('Original Data')
axes[0, 0].set_ylabel('Value')

# StandardScaler
scaled_standard.boxplot(ax=axes[0, 1])
axes[0, 1].set_title('StandardScaler')
axes[0, 1].set_ylabel('Scaled Value')

# MinMaxScaler
scaled_minmax.boxplot(ax=axes[1, 0])
axes[1, 0].set_title('MinMaxScaler')
axes[1, 0].set_ylabel('Scaled Value')

# RobustScaler
scaled_robust.boxplot(ax=axes[1, 1])
axes[1, 1].set_title('RobustScaler')
axes[1, 1].set_ylabel('Scaled Value')

plt.tight_layout()
plt.show()

print("Notice how all features are on similar scales after transformation!")

## 7. Train/Test Split: The Right Way

### The Golden Rule

**ALWAYS split BEFORE fitting any preprocessing step that learns from the data** (imputers, encoders, scalers)!

### Why?

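To avoid **data leakage** - when information from the test set influences training. Sections 4 to 6 fit their transformers on the full dataset only to keep the demonstrations short.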

### Wrong Way (Data Leakage)

```python
# WRONG! Don't do this
scaler.fit(X)  # Fit on ALL data
X_scaled = scaler.transform(X)
X_train, X_test = train_test_split(X_scaled)  # Then split
```

Problem: The scaler's mean and standard deviation were computed from the test rows too, so test-set information leaked into training!

### Right Way

```python
# RIGHT! Do this
X_train, X_test = train_test_split(X)  # Split first
scaler.fit(X_train)  # Fit only on training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Apply same transformation
```
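
One way to make the fit-on-train, transform-both pattern hard to get wrong is to wrap every preprocessing step in a `Pipeline` (with a `ColumnTransformer` for mixed column types), both of which are imported in the setup cell. The sketch below is one possible arrangement on a made-up dataset, with a hypothetical target tied to age; it is not the only valid design.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (hypothetical): two numeric columns, one categorical column
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 70, n).astype(float),
    "income": rng.normal(50000, 20000, n),
    "employment": rng.choice(["Employed", "Self-Employed", "Unemployed"], n),
})
toy.loc[rng.choice(n, 20, replace=False), "income"] = np.nan
target = (toy["age"] > 40).astype(int)  # made-up label for illustration

# Numeric columns: impute, then scale. Categorical column: one-hot encode.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["employment"]),
])
clf = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, target, test_size=0.2, random_state=42, stratify=target
)

# fit() learns imputation, scaling, and encoding statistics from X_tr only;
# score() applies those fitted steps to X_te without refitting.
clf.fit(X_tr, y_tr)
test_accuracy = clf.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.2%}")
```

Because the whole pipeline is a single estimator, it can also be passed to cross-validation utilities, which then refit the preprocessing inside each fold.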

In [None]:
# Prepare features and target
X = df_encoded.drop('loan_approved', axis=1)
y = df_encoded['loan_approved']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature columns: {list(X.columns)}")
print(f"\nTarget distribution:")
print(y.value_counts())

In [None]:
# CORRECT: Split first, then preprocess
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,  # 20% for testing
    random_state=42,  # Reproducibility
    stratify=y  # Maintain class proportions
)

print("Data Split:")
print(f"Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X):.1%})")
print(f"Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X):.1%})")

print(f"\nClass distribution in training set:")
print(y_train.value_counts(normalize=True))
print(f"\nClass distribution in test set:")
print(y_test.value_counts(normalize=True))
print("\nNote: Proportions are similar due to stratify=y")

In [None]:
# Now scale features (AFTER splitting)
scaler = StandardScaler()

# Fit scaler ONLY on training data
scaler.fit(X_train)

# Transform both sets using the same scaler
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Scaling completed correctly!")
print(f"\nTraining set mean: {X_train_scaled.mean():.6f}")
print(f"Training set std: {X_train_scaled.std():.6f}")
print(f"\nTest set mean: {X_test_scaled.mean():.6f}")
print(f"Test set std: {X_test_scaled.std():.6f}")
print("\nNote: Test stats differ slightly - this is expected and correct!")

## 8. Train/Validation/Test Split

For hyperparameter tuning, we need three sets:

- **Training Set (60-70%)**: Train the model
- **Validation Set (10-20%)**: Tune hyperparameters
- **Test Set (10-20%)**: Final evaluation (never touch until the end!)

### Why Three Sets?

- Train on the training set
- Compare different hyperparameters on the validation set
- Once you pick the best model, evaluate it ONCE on the test set
- The test score stays an honest estimate, because tuning can overfit to the validation set
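
The steps above can be sketched as a small tuning loop. The data comes from `make_classification` as a stand-in, and the candidate values for the regularization strength `C` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# 60% train, 20% validation, 20% test
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0, stratify=y_tmp
)

# Train one candidate per hyperparameter value and score it on the validation set
val_scores = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    candidate = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    candidate.fit(X_tr, y_tr)
    val_scores[C] = candidate.score(X_va, y_va)

best_C = max(val_scores, key=val_scores.get)

# Only now touch the test set, once, with the chosen setting
final_model = make_pipeline(StandardScaler(), LogisticRegression(C=best_C, max_iter=1000))
final_model.fit(X_tr, y_tr)
test_score = final_model.score(X_te, y_te)

print("Validation accuracy per C:", val_scores)
print(f"Chosen C = {best_C}, test accuracy = {test_score:.2%}")
```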

In [None]:
# Create three-way split
# First split: Separate test set (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: Divide remaining into train (75%) and validation (25%)
# This gives us 60% train, 20% val, 20% test overall
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print("Three-way split:")
print(f"Training: {len(X_train)} samples ({len(X_train)/len(X):.1%})")
print(f"Validation: {len(X_val)} samples ({len(X_val)/len(X):.1%})")
print(f"Test: {len(X_test)} samples ({len(X_test)/len(X):.1%})")
print(f"Total: {len(X_train) + len(X_val) + len(X_test)} samples")

# Visualize the split
sizes = [len(X_train), len(X_val), len(X_test)]
labels = ['Training', 'Validation', 'Test']
colors = ['#66c2a5', '#fc8d62', '#8da0cb']

plt.figure(figsize=(8, 6))
# autopct prints each wedge's computed percentage, rounded, with a % sign
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.0f%%', startangle=90)
plt.title('Train/Validation/Test Split', fontsize=14)
plt.axis('equal')
plt.show()

## 9. Demonstrating Data Leakage Impact

Let's see why data leakage matters by comparing correct vs incorrect approaches.

In [None]:
# WRONG: Scale before split (data leakage)
scaler_wrong = StandardScaler()
X_scaled_wrong = scaler_wrong.fit_transform(X)  # Fit on ALL data
X_train_wrong, X_test_wrong, y_train_wrong, y_test_wrong = train_test_split(
    X_scaled_wrong, y, test_size=0.2, random_state=42
)

# Train model with leaked data
model_wrong = LogisticRegression(max_iter=1000)
model_wrong.fit(X_train_wrong, y_train_wrong)
acc_wrong = accuracy_score(y_test_wrong, model_wrong.predict(X_test_wrong))

# RIGHT: Split first, then scale
X_train_right, X_test_right, y_train_right, y_test_right = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler_right = StandardScaler()
X_train_right_scaled = scaler_right.fit_transform(X_train_right)
X_test_right_scaled = scaler_right.transform(X_test_right)

# Train model correctly
model_right = LogisticRegression(max_iter=1000)
model_right.fit(X_train_right_scaled, y_train_right)
acc_right = accuracy_score(y_test_right, model_right.predict(X_test_right_scaled))

print("Comparison: Data Leakage Impact")
print("=" * 60)
print(f"WRONG (scale before split): Accuracy = {acc_wrong:.2%}")
print(f"RIGHT (split before scale): Accuracy = {acc_right:.2%}")
print(f"\nDifference: {abs(acc_wrong - acc_right):.2%}")
print("\nNote: The labels in this dataset are random, so both accuracies should")
print("stay near the majority-class rate and the difference is small. With smaller")
print("datasets or more complex preprocessing, it can be significant!")

## 10. Practice Exercises

### Exercise 1: Handle Missing Data

Create a dataset with 30% missing values in one column. Compare imputation strategies:
- Mean imputation
- Median imputation
- Mode imputation

Which works best for your data?

In [None]:
# Your code here


### Exercise 2: One-Hot Encoding

Create a DataFrame with a categorical column that has 5 unique values. Apply one-hot encoding with and without `drop_first=True`. How many columns do you get in each case? Why?

In [None]:
# Your code here


### Exercise 3: Scaling Comparison

Load the California housing dataset with `sklearn.datasets.fetch_california_housing` (the Boston housing dataset was removed in scikit-learn 1.2). Apply StandardScaler, MinMaxScaler, and RobustScaler. Train a simple model with each. Which scaler gives the best performance?

In [None]:
# Your code here


### Exercise 4: Stratified Splitting

Create an imbalanced dataset (90% class 0, 10% class 1). Split it with and without `stratify`. Compare the class distributions in test sets. What do you notice?

In [None]:
# Your code here


## 11. Summary

### Key Concepts

1. **Data Preparation is Critical**:
   - Data preparation often takes most of the project time (the oft-quoted figure is 80%)
   - Clean data = better models

2. **Handling Missing Values**:
   - Impute with mean/median for numerical
   - Impute with mode for categorical
   - Consider dropping if >25% missing

3. **Encoding Categorical Variables**:
   - Label encoding for ordinal data
   - One-hot encoding for nominal data
   - Use drop_first=True with linear models to avoid multicollinearity (tree-based models do not need it)

4. **Feature Scaling**:
   - StandardScaler: For roughly bell-shaped data without extreme outliers
   - MinMaxScaler: For bounded ranges
   - RobustScaler: For data with outliers

5. **Train/Test Split**:
   - ALWAYS split BEFORE fitting any preprocessing step
   - Fit transformers on training data only
   - Use stratify for imbalanced datasets
   - Consider 3-way split for hyperparameter tuning

### The Correct Workflow

```python
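# 1. Load data (load_data is a placeholder for your own loading code)
X, y = load_data()

# 2. Split FIRST (prevent data leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# 3. Fit preprocessing on training data only, one step after the other,
#    so the scaler learns its statistics from the imputed training data
X_train_imputed = imputer.fit_transform(X_train)
X_train_processed = scaler.fit_transform(X_train_imputed)

# 4. Apply the already-fitted transformers to the test set (transform only)
X_test_processed = scaler.transform(imputer.transform(X_test))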

# 5. Train model
model.fit(X_train_processed, y_train)

# 6. Evaluate on test set
score = model.score(X_test_processed, y_test)
```

### Common Pitfalls to Avoid

❌ Preprocessing before splitting (data leakage)  
❌ Using mean imputation when data has outliers (prefer the median)  
❌ Forgetting to scale when using distance-based algorithms  
❌ Not using stratify with imbalanced data  
❌ Touching test set before final evaluation  

### Next Steps

In the next module, we'll dive into:
- **Linear Regression** in depth
- Mathematical foundations
- Interpretation of coefficients
- Evaluating regression models

### Additional Resources

- [Scikit-learn Preprocessing Guide](https://scikit-learn.org/stable/modules/preprocessing.html)
- [Pandas Missing Data Handling](https://pandas.pydata.org/docs/user_guide/missing_data.html)
- [Feature Engineering for Machine Learning by Alice Zheng and Amanda Casari](https://www.oreilly.com/library/view/feature-engineering-for/9781491953235/)