# Module 02: Data Preparation and Train/Test Split

**Difficulty**: ⭐ Beginner  
**Estimated Time**: 55 minutes  
**Prerequisites**: 
- [Module 00: Introduction to ML and scikit-learn](00_introduction_to_ml_and_sklearn.ipynb)
- [Module 01: Supervised vs Unsupervised Learning](01_supervised_vs_unsupervised_learning.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand why we split data into training and testing sets
2. Properly split data using train_test_split()
3. Handle missing values and categorical data
4. Scale and normalize features appropriately
5. Avoid data leakage - one of the most common ML mistakes
6. Prepare data following best practices

## 1. Why Split Data?

### The Golden Rule of Machine Learning
**Never test your model on the same data you used to train it!**

### Analogy: Studying for an Exam
- **Training data** = Practice problems you study
- **Testing data** = Actual exam questions (different from practice)
- **Goal** = Perform well on new, unseen questions

If the exam had the exact same questions you practiced, you'd get 100% but wouldn't prove you learned the concepts!

### The Problem: Overfitting
A model can score well on its training data simply by **memorizing** the answers instead of learning general patterns. This is called **overfitting**, and evaluating on the training data cannot reveal it.

**Training Data Performance** ≠ **Real-World Performance**

We need to simulate real-world conditions by holding out some data for testing.
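
To make the gap between training and real-world performance concrete, here is a minimal sketch. It assumes a synthetic dataset from `make_classification` with deliberately noisy labels and an unpruned decision tree; neither is part of this module's sample data, and the variable names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): flip_y=0.2 assigns a
# random label to about 20% of samples, so part of the signal is pure noise
X_noisy, y_noisy = make_classification(
    n_samples=500, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)
X_fit, X_held, y_fit, y_held = train_test_split(
    X_noisy, y_noisy, test_size=0.3, random_state=0
)

# An unpruned tree keeps splitting until it reproduces the training labels
tree = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)

train_acc = tree.score(X_fit, y_fit)
held_out_acc = tree.score(X_held, y_held)
print(f"Accuracy on training data: {train_acc:.2f}")
print(f"Accuracy on held-out data: {held_out_acc:.2f}")
```

The training score looks nearly perfect because the tree has memorized the noise; only the held-out score says anything about new data.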

In [None]:
# Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Visualization settings
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("✓ Setup complete!")

## 2. The Train/Test Split

### Common Split Ratios
- **70/30 split**: 70% training, 30% testing (common)
- **80/20 split**: 80% training, 20% testing (also common)
- **60/20/20 split**: 60% train, 20% validation, 20% test (for hyperparameter tuning)

### Choosing the Right Ratio
- **More training data** → Better model learning
- **More testing data** → More reliable evaluation
- **Large datasets (>10,000 samples)** → Can use 90/10 or 95/5
- **Small datasets (<1,000 samples)** → Use 70/30 or cross-validation
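
`train_test_split()` produces two parts per call, so one way to get the 60/20/20 split listed above is to call it twice. The sketch below uses a toy array of 100 rows; the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 2 features
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# First call: hold out 20% of all samples as the final test set
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Second call: 25% of the remaining 80% equals 20% of the original data
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # prints: 60 20 20
```

The validation set is used for choosing hyperparameters, which leaves the test set untouched until the final evaluation.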

In [None]:
# Load a dataset
from sklearn.model_selection import train_test_split

iris_df = pd.read_csv('data/sample/iris.csv')

# Separate features and target
feature_cols = ['sepal length (cm)', 'sepal width (cm)', 
                'petal length (cm)', 'petal width (cm)']
X = iris_df[feature_cols]
y = iris_df['species']

print(f"Total dataset size: {len(X)} samples")
print(f"Number of features: {X.shape[1]}")
print(f"Number of classes: {y.nunique()}")

In [None]:
# Perform a basic train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.3,  # 30% for testing
    random_state=42  # For reproducibility
)

print("Data Split Results:")
print(f"Training set: {len(X_train)} samples ({len(X_train)/len(X)*100:.1f}%)")
print(f"Testing set: {len(X_test)} samples ({len(X_test)/len(X)*100:.1f}%)")

print(f"\nTraining features shape: {X_train.shape}")
print(f"Testing features shape: {X_test.shape}")
print(f"Training target shape: {y_train.shape}")
print(f"Testing target shape: {y_test.shape}")

## 3. Stratified Split for Classification

When dealing with classification, especially with **imbalanced classes**, we want to ensure each split has the same proportion of each class.

**Problem**: Random split might give different class distributions in train/test

**Solution**: Use `stratify` parameter to maintain class proportions
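
The effect is easiest to see on imbalanced labels. The following sketch assumes a toy target with 90 samples of class 0 and 10 of class 1 (not one of the module's datasets) and compares the minority share in the test set with and without `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 90% class 0, 10% class 1
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(100).reshape(-1, 1)

# Plain random split: the minority share in the test set depends on luck
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=0
)

# Stratified split: each class is divided 80/20 separately
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=0, stratify=y_imb
)

print("Minority share, plain test set:     ", y_te_plain.mean())
print("Minority share, stratified test set:", y_te_strat.mean())
```

With stratification the 20-sample test set receives exactly 2 minority samples, matching the 10% share of the full data.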

In [None]:
# Check class distribution in original data
print("Original class distribution:")
print(y.value_counts(normalize=True).sort_index())

# Regular split (without stratification)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print("\nRegular split - Training set distribution:")
print(y_train_reg.value_counts(normalize=True).sort_index())
print("\nRegular split - Testing set distribution:")
print(y_test_reg.value_counts(normalize=True).sort_index())

In [None]:
# Stratified split (maintaining class proportions)
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    X, y, 
    test_size=0.3, 
    random_state=42,
    stratify=y  # This ensures proportional class distribution
)

print("Stratified split - Training set distribution:")
print(y_train_strat.value_counts(normalize=True).sort_index())
print("\nStratified split - Testing set distribution:")
print(y_test_strat.value_counts(normalize=True).sort_index())

print("\n✓ Notice: Stratified split maintains the same proportions in both sets!")

## 4. Handling Missing Values

Real-world data often has missing values. We need to handle them before training.

### Common Strategies
1. **Drop rows** with missing values (if few)
2. **Impute** with mean, median, or mode
3. **Forward/backward fill** for time series
4. **Use algorithms** that handle missing values (e.g., XGBoost)

### WARNING: The Order Matters!
**ALWAYS split data BEFORE imputing to avoid data leakage!**
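
A tiny worked example shows what is at stake. It assumes a single made-up feature whose test set contains an extreme value; the `_toy` names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One feature; the test set happens to contain an extreme value
X_train_toy = np.array([[1.0], [3.0], [np.nan]])
X_test_toy = np.array([[np.nan], [100.0]])

# Correct: the fill value is learned from the training rows only
imputer_toy = SimpleImputer(strategy='mean').fit(X_train_toy)
train_mean = imputer_toy.statistics_[0]              # (1 + 3) / 2 = 2.0
X_test_filled = imputer_toy.transform(X_test_toy)    # NaN becomes 2.0

# Leaky: the mean over ALL rows is pulled toward the test value
leaky_mean = np.nanmean(np.vstack([X_train_toy, X_test_toy]))  # (1 + 3 + 100) / 3

print(f"Fill value from training data only: {train_mean:.2f}")
print(f"Fill value if test data leaks in:   {leaky_mean:.2f}")
```

In the leaky version the training rows would be filled with a value that already encodes information about the test set.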

In [None]:
# Create a dataset with missing values
# Load diabetes dataset and artificially introduce missing values
diabetes_df = pd.read_csv('data/sample/diabetes.csv')

# Randomly set about 10% of values to NaN (draws can repeat, so the actual share may be slightly lower)
diabetes_missing = diabetes_df.copy()
n_missing = int(0.1 * diabetes_missing.shape[0] * diabetes_missing.shape[1])
np.random.seed(42)
for _ in range(n_missing):
    row = np.random.randint(0, len(diabetes_missing))
    col = np.random.choice(diabetes_missing.columns[:-1])  # Don't make target missing
    diabetes_missing.loc[row, col] = np.nan

print(f"Missing values per column:")
print(diabetes_missing.isnull().sum())
print(f"\nTotal missing values: {diabetes_missing.isnull().sum().sum()}")

In [None]:
# CORRECT WAY: Split THEN impute
from sklearn.impute import SimpleImputer

# Separate features and target
X_missing = diabetes_missing.drop('progression', axis=1)
y_missing = diabetes_missing['progression']

# Step 1: Split the data FIRST
X_train_miss, X_test_miss, y_train_miss, y_test_miss = train_test_split(
    X_missing, y_missing, test_size=0.3, random_state=42
)

print("Step 1: Split complete")
print(f"Training set missing values: {X_train_miss.isnull().sum().sum()}")
print(f"Testing set missing values: {X_test_miss.isnull().sum().sum()}")

# Step 2: Fit imputer on training data ONLY
imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train_miss)

print("\nStep 2: Imputer fitted on training data")
print(f"Learned means: {imputer.statistics_[:3]}...")  # Show first 3

# Step 3: Transform both sets using the same imputer
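X_train_imputed = pd.DataFrame(
    imputer.transform(X_train_miss),
    columns=X_train_miss.columns,
    index=X_train_miss.index  # Keep row labels aligned with y_train_miss
)
X_test_imputed = pd.DataFrame(
    imputer.transform(X_test_miss),
    columns=X_test_miss.columns,
    index=X_test_miss.index  # Keep row labels aligned with y_test_miss
)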

print("\nStep 3: Imputation complete")
print(f"Training set missing values: {X_train_imputed.isnull().sum().sum()}")
print(f"Testing set missing values: {X_test_imputed.isnull().sum().sum()}")
print("\n✓ All missing values handled correctly!")

## 5. Feature Scaling

Many ML algorithms perform better when features are on similar scales.

### Why Scale?
- Features with large ranges can dominate the model
- Distance-based algorithms (KNN, SVM) are sensitive to scale
- Gradient descent converges faster with scaled features

### Two Common Methods
1. **Standardization (Z-score)**: Mean=0, Std=1
   - Formula: (x - mean) / std
   - Use when: Features are unbounded or roughly bell-shaped (strict normality is not required)
   
2. **Normalization (Min-Max)**: Scale to [0, 1]
   - Formula: (x - min) / (max - min)
   - Use when: Data has bounds or you need [0, 1] range

### CRITICAL: Fit on Training, Transform on Both
Calculate scaling parameters from **training data only** to avoid data leakage!
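
As a quick check of the two formulas, the sketch below applies them by hand to a small made-up column and compares the results with `StandardScaler` and `MinMaxScaler`. Note that `StandardScaler` divides by the population standard deviation, which is also NumPy's default for `std()`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# A single toy feature as a column vector
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Standardization by hand: (x - mean) / std
z_manual = (x - x.mean()) / x.std()

# Min-max normalization by hand: (x - min) / (max - min)
mm_manual = (x - x.min()) / (x.max() - x.min())

# The same transformations through scikit-learn
z_sklearn = StandardScaler().fit_transform(x)
mm_sklearn = MinMaxScaler().fit_transform(x)

print("Standardization matches:", np.allclose(z_manual, z_sklearn))
print("Min-max matches:        ", np.allclose(mm_manual, mm_sklearn))
print("Min-max values:", mm_manual.ravel())  # 0, 0.25, 0.5, 0.75, 1
```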

In [None]:
# Demonstrate why scaling matters
# Load housing data with features of different scales
housing_df = pd.read_csv('data/sample/california_housing.csv')

print("Feature Statistics (Different Scales):")
print(housing_df.describe()[['MedInc', 'HouseAge', 'Population']].T)

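print("\nNotice: Features have vastly different ranges!")
for col in ['MedInc', 'HouseAge', 'Population']:
    # Report the actual range from the data instead of hard-coded numbers
    print(f"- {col}: {housing_df[col].min():g} to {housing_df[col].max():g}")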

In [None]:
# CORRECT WAY: Scale after splitting
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Prepare data
X_housing = housing_df.drop('median_house_value', axis=1)
y_housing = housing_df['median_house_value']

# Step 1: Split first
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_housing, y_housing, test_size=0.3, random_state=42
)

# Step 2: Fit scaler on training data
scaler = StandardScaler()
scaler.fit(X_train_h)

# Step 3: Transform both sets
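X_train_scaled = pd.DataFrame(
    scaler.transform(X_train_h),
    columns=X_train_h.columns,
    index=X_train_h.index  # Keep row labels aligned with y_train_h
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test_h),
    columns=X_test_h.columns,
    index=X_test_h.index  # Keep row labels aligned with y_test_h
)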

print("After Standardization:")
print(X_train_scaled[['MedInc', 'HouseAge', 'Population']].describe().T)
print("\n✓ All features now have mean ≈ 0 and std ≈ 1!")

In [None]:
# Visualize the effect of scaling
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Before scaling
X_train_h[['MedInc', 'HouseAge', 'Population']].boxplot(ax=axes[0])
axes[0].set_title('Before Scaling\n(Different ranges)', fontsize=13, fontweight='bold')
axes[0].set_ylabel('Value', fontsize=11)
axes[0].grid(True, alpha=0.3)

# After scaling
X_train_scaled[['MedInc', 'HouseAge', 'Population']].boxplot(ax=axes[1])
axes[1].set_title('After Standardization\n(Same scale)', fontsize=13, fontweight='bold')
axes[1].set_ylabel('Standardized Value', fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key Insight: After standardization, the features share a comparable range, so no single feature dominates!")

## 6. Data Leakage - The Silent Killer

**Data leakage** occurs when information that would not be available at prediction time, such as statistics computed from the test set, influences the model during training.

### Common Causes
1. **Scaling before splitting** - Test data influences the scaler
2. **Imputing before splitting** - Test data influences the imputation
3. **Feature engineering using all data** - Creates unrealistic features
4. **Using future information** - Including data not available at prediction time

### The Golden Rule
**ANY transformation that "learns" from data must be fit ONLY on training data!**

This includes:
- Scalers (StandardScaler, MinMaxScaler)
- Imputers (SimpleImputer)
- Encoders (OneHotEncoder, OrdinalEncoder)
- Feature selectors
- Dimensionality reducers (PCA)
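
The same rule applies to categorical features. The sketch below assumes a made-up `color` column in which the test set contains a category that never appears in training; `handle_unknown='ignore'` is one way to deal with that case, not the only one.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical feature; 'green' appears only in the test set
train_toy = pd.DataFrame({'color': ['red', 'blue', 'red']})
test_toy = pd.DataFrame({'color': ['green', 'blue']})

# Fit on training data only: the encoder learns the categories it has seen
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_toy)
categories = list(encoder.categories_[0])  # ['blue', 'red']

# Transform the test data with the fitted encoder.
# The unseen category 'green' becomes a row of zeros.
test_encoded = encoder.transform(test_toy).toarray()

print("Learned categories:", categories)
print(test_encoded)
```

Fitting the encoder on all the data would quietly add a `green` column, which is information a deployed model would not have had at training time.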

In [None]:
# WRONG WAY - Causes data leakage!
print("❌ WRONG: Scaling before splitting\n")

# DON'T DO THIS!
scaler_wrong = StandardScaler()
X_scaled_wrong = scaler_wrong.fit_transform(X_housing)  # Uses ALL data

# Then split
X_train_wrong, X_test_wrong, y_train_wrong, y_test_wrong = train_test_split(
    X_scaled_wrong, y_housing, test_size=0.3, random_state=42
)

print("Problem: The scaler saw the test data!")
print("This can make test scores look better than the model will perform on truly new data.")
print("")

# CORRECT WAY
print("\n✓ CORRECT: Split first, then scale\n")

# Split first
X_train_right, X_test_right, y_train_right, y_test_right = train_test_split(
    X_housing, y_housing, test_size=0.3, random_state=42
)

# Fit scaler on training only
scaler_right = StandardScaler()
X_train_scaled_right = scaler_right.fit_transform(X_train_right)
X_test_scaled_right = scaler_right.transform(X_test_right)  # Only transform test

print("Correct: The scaler was fit on training data only!")
print("Test data is transformed using training statistics.")
print("This simulates real-world deployment.")
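
One way to make the fit-on-training-only rule hard to break is scikit-learn's `Pipeline`, which refits every preprocessing step on whatever data is passed to `fit()`. The sketch below is an illustration under assumptions: it uses `load_diabetes` from `sklearn.datasets` rather than this module's CSV files, and the step names are arbitrary.

```python
from sklearn.datasets import load_diabetes
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Built-in dataset, used here only so the example is self-contained
X_d, y_d = load_diabetes(return_X_y=True)
X_d_train, X_d_test, y_d_train, y_d_test = train_test_split(
    X_d, y_d, test_size=0.3, random_state=42
)

# Imputer, scaler and model are fitted together, always on the same rows
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
    ('model', LinearRegression()),
])

# During cross-validation each fold refits the imputer and scaler on its own
# training portion, so the validation portion never influences them
cv_scores = cross_val_score(pipe, X_d_train, y_d_train, cv=5)

# Final fit on the full training set, then a single evaluation on the test set
pipe.fit(X_d_train, y_d_train)
test_r2 = pipe.score(X_d_test, y_d_test)

print(f"Cross-validated R² (mean of 5 folds): {cv_scores.mean():.3f}")
print(f"Test R²: {test_r2:.3f}")
```

Calling `pipe.score()` or `pipe.predict()` only applies `transform()` in the preprocessing steps, so the test data is never used for fitting.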

## 7. Complete Data Preparation Pipeline

Let's put it all together in the correct order:

1. **Load data**
2. **Explore and understand** (EDA)
3. **Separate features and target**
4. **Split into train/test sets**
5. **Handle missing values** (fit on train, transform both)
6. **Scale features** (fit on train, transform both)
7. **Train model**
8. **Evaluate on test set**

In [None]:
# Complete pipeline example
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# 1. Load data
data = pd.read_csv('data/sample/california_housing.csv')

# 2. Separate features and target
X = data.drop('median_house_value', axis=1)
y = data['median_house_value']

# 3. Split data (70/30)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print("Step 3: Data split")
print(f"  Training: {len(X_train)} samples")
print(f"  Testing: {len(X_test)} samples")

# 4. Handle missing values (if any)
imputer = SimpleImputer(strategy='mean')
X_train_clean = imputer.fit_transform(X_train)
X_test_clean = imputer.transform(X_test)
print("\nStep 4: Missing values handled")

# 5. Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_clean)
X_test_scaled = scaler.transform(X_test_clean)
print("Step 5: Features scaled")

# 6. Train model
model = LinearRegression()
model.fit(X_train_scaled, y_train)
print("Step 6: Model trained")

# 7. Evaluate
train_score = model.score(X_train_scaled, y_train)
test_score = model.score(X_test_scaled, y_test)
y_pred = model.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("\nStep 7: Evaluation")
print(f"  Training R²: {train_score:.3f}")
print(f"  Testing R²: {test_score:.3f}")
print(f"  RMSE: ${rmse:,.2f}")
print("\n✓ Complete pipeline executed correctly!")

## Exercises

### Exercise 1: Identify Data Leakage

Which of these code snippets cause data leakage? Mark them as CORRECT or WRONG:

```python
# Snippet A
X_scaled = scaler.fit_transform(X)
X_train, X_test = train_test_split(X_scaled)

# Snippet B
X_train, X_test = train_test_split(X)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Snippet C
X_train, X_test = train_test_split(X)
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.fit_transform(X_test)
```

In [None]:
# Your answers:
# Snippet A: 
# Snippet B: 
# Snippet C: 


### Exercise 2: Proper Train/Test Split

Load the breast cancer dataset and perform a stratified 80/20 train/test split.

Steps:
1. Load data from `data/sample/breast_cancer.csv`
2. Separate features (all columns except 'target' and 'diagnosis') and target ('target')
3. Perform stratified split (80% train, 20% test)
4. Verify that class proportions are maintained

In [None]:
# Your code here



### Exercise 3: Complete Preprocessing Pipeline

Create a complete preprocessing pipeline for the diabetes dataset:

1. Load the diabetes dataset
2. Create 15% artificial missing values (use the code from Section 4)
3. Split data (70/30)
4. Impute missing values (mean strategy)
5. Scale features using StandardScaler
6. Verify no missing values remain and features are scaled

Print the shape and statistics of your final training set.

In [None]:
# Your code here



### Exercise 4: Impact of Proper Splitting

Demonstrate the importance of proper train/test splitting:

1. Use the Iris dataset
2. Build two models:
   - Model A: Evaluate on training data (no split)
   - Model B: Evaluate on test data (proper split)
3. Compare the accuracies
4. Explain which one is more realistic

In [None]:
# Your code here



## Summary

Congratulations! You now understand proper data preparation - a critical skill in machine learning.

### Key Concepts

1. **Train/Test Split**:
   - Never test on training data - it hides overfitting and inflates scores
   - Common splits: 70/30, 80/20, 60/20/20
   - Use stratified split for classification
   - Always set random_state for reproducibility

2. **Data Preparation Order**:
   ```
   Load → Explore → Separate X/y → SPLIT → Preprocess → Train → Evaluate
   ```

3. **Handling Missing Values**:
   - Common strategies: drop, impute (mean/median/mode)
   - Fit imputer on training data only
   - Transform both train and test with same imputer

4. **Feature Scaling**:
   - Standardization: mean=0, std=1 (preferred for most cases)
   - Normalization: scale to [0, 1]
   - Important for distance-based algorithms (KNN, SVM)
   - Fit scaler on training data only

5. **Avoiding Data Leakage**:
   - Golden Rule: Split FIRST, preprocess SECOND
   - Fit transformers on training data only
   - Transform both sets with the same fitted transformer
   - Never let test data influence the model

### What's Next?

In **Module 03: Linear Regression**, you'll learn:
- The theory behind linear regression
- Simple and multiple linear regression
- Interpreting coefficients and predictions
- Evaluating regression models
- Making predictions on new data

### Additional Resources

- [Train/Test Split - scikit-learn](https://scikit-learn.org/stable/modules/cross_validation.html)
- [Data Leakage in Machine Learning](https://machinelearningmastery.com/data-leakage-machine-learning/)
- [Feature Scaling - Why and How](https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/)