Data splitting is the fundamental practice for estimating how well a model will perform on unseen data. Portions of the data are reserved exclusively for validation or testing, so that the evaluation is not biased by the data the model was trained on.

## Data Splitting Strategies for Model Evaluation

This document covers:

* **Why Split:** The fundamental need to evaluate models on unseen data to estimate generalization performance.
* **Train/Test Split:** Demonstrates the basic two-way split using `train_test_split` and explains key parameters like `test_size`, `random_state`, `shuffle`, and `stratify`.
* **Train/Validation/Test Split:** Explains the rationale for a three-way split (separating data for tuning/model selection from the final test data) and shows how to implement it using two calls to `train_test_split`.
* **Stratification:** Highlights the importance of using the `stratify` parameter in classification tasks to maintain representative class distributions in all splits.
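
As a rough illustration of the first point, the sketch below fits an unpruned decision tree on Iris and compares its accuracy on the training data with its accuracy on a held-out test set. The choice of `DecisionTreeClassifier` and the variable names `tree`, `train_acc`, and `test_acc` are assumptions made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples; the model never sees them during fit()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# An unpruned tree can memorize its training data
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # evaluated on data seen in training
test_acc = tree.score(X_test, y_test)     # evaluated on unseen data

print(f"Training accuracy: {train_acc:.3f}")
print(f"Test accuracy:     {test_acc:.3f}")
```

The training score is expected to be at or near 1.0 because the tree has already seen those samples; the test score is the more honest estimate of generalization.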

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris # Example dataset for classification

# --- 1. The Need for Splitting ---
# Evaluating a model on the same data it was trained on gives an overly
# optimistic performance estimate (training accuracy/error).
# We need to test the model on data it has *never* seen during training
# to estimate its ability to generalize to new, real-world data.

# --- 2. Strategy 1: Train/Test Split ---
# The most basic split. Divide the data into two sets:
# - Training Set: Used to train the model (fit the estimator).
# - Test Set: Used ONLY at the very end to evaluate the final, trained model's performance.

print("--- Strategy 1: Train/Test Split ---")

# Load Iris dataset (classification)
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

print(f"Original data shape: X={X.shape}, y={y.shape}")

# Perform the train/test split using train_test_split
# Common split ratios: 80/20, 70/30, 75/25
# Key parameters:
# - test_size: Proportion (float between 0.0 and 1.0) or absolute number of samples (int) for the test set.
# - train_size: Same, but for the training set. If omitted, it is the complement of test_size.
# - random_state: Seed for the random number generator used for shuffling. Ensures reproducibility.
#                 Use the same integer value to get the same split every time.
# - shuffle: Whether to shuffle the data before splitting (default=True). Recommended,
#            and required when stratify is used.
# - stratify: Array of class labels (usually y). Ensures class proportions are maintained
#             in both splits. Crucial for classification.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,       # 30% of data for the test set
    random_state=42,     # For reproducible results
    shuffle=True,        # Shuffle the data before splitting
    stratify=y           # Maintain class proportions based on 'y'
)

print(f"\nTraining set shape: X={X_train.shape}, y={y_train.shape}")
print(f"Test set shape: X={X_test.shape}, y={y_test.shape}")

# Verify stratification (optional check)
print("\nOriginal class distribution (%):\n", pd.Series(y).value_counts(normalize=True).sort_index() * 100)
print("\nTraining set class distribution (%):\n", pd.Series(y_train).value_counts(normalize=True).sort_index() * 100)
print("\nTest set class distribution (%):\n", pd.Series(y_test).value_counts(normalize=True).sort_index() * 100)
print("Note: Proportions should be very similar due to stratify=y.")
print("-" * 30)
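
# A minimal sketch, assuming the Iris data again: an integer test_size is
# treated as a number of samples, and reusing the same random_state reproduces
# the same split. X_demo, y_demo, split_a and split_b are names made up here.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# Two identical calls; test_size=30 requests exactly 30 test samples
split_a = train_test_split(
    X_demo, y_demo, test_size=30, random_state=42, stratify=y_demo
)
split_b = train_test_split(
    X_demo, y_demo, test_size=30, random_state=42, stratify=y_demo
)

# Return order is X_train, X_test, y_train, y_test
n_test_samples = len(split_a[1])
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
assert n_test_samples == 30 and same_split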


# --- 3. Strategy 2: Train/Validation/Test Split ---
# Often needed during model development to tune hyperparameters or compare models
# without "contaminating" the final test set.
# Workflow:
# 1. Split data into Train+Validation and Test sets.
# 2. Split Train+Validation into Train and Validation sets.
# 3. Train models on the Train set.
# 4. Evaluate/tune models using the Validation set.
# 5. Select the best model/hyperparameters based on validation performance.
# 6. Train the final chosen model on the *entire* Train+Validation set.
# 7. Perform a final, single evaluation on the held-out Test set.

print("--- Strategy 2: Train/Validation/Test Split ---")

# Use the same original data (X, y)
# Step 1: Split into a combined train+validation set (e.g., 80%) and a final test set (e.g., 20%)
X_train_val, X_final_test, y_train_val, y_final_test = train_test_split(
    X, y,
    test_size=0.20,      # 20% for the final test set
    random_state=42,
    stratify=y
)

# Step 2: Split the train+validation set into the actual training and validation sets
# Example: Use 25% of the train_val set for validation (0.25 * 80% = 20% of total)
X_train_final, X_val, y_train_final, y_val = train_test_split(
    X_train_val, y_train_val, # Split the 80% portion
    test_size=0.25,      # 25% of the train_val set -> 20% of original data
    random_state=42,
    stratify=y_train_val # Stratify based on the y of this subset
)

print(f"\nOriginal data size: {len(X)}")
print(f"Final Training set size: {len(X_train_final)} ({len(X_train_final)/len(X)*100:.1f}%)")
print(f"Validation set size: {len(X_val)} ({len(X_val)/len(X)*100:.1f}%)")
print(f"Final Test set size: {len(X_final_test)} ({len(X_final_test)/len(X)*100:.1f}%)")

# Now you would:
# - Train models on X_train_final, y_train_final
# - Tune/compare using X_val, y_val
# - Retrain best model on X_train_val, y_train_val
# - Get final performance estimate using X_final_test, y_final_test
print("\nThis creates three distinct sets for robust model development and evaluation.")
print("-" * 30)
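
# A minimal sketch of workflow steps 3-7, under stated assumptions: the model
# (KNeighborsClassifier), the candidate n_neighbors values, and the short
# *_tv / *_tr / *_va / *_te names are choices made for this example only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Steps 1-2: the same 60/20/20 recipe as above
X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.25, random_state=42, stratify=y_tv
)

# Steps 3-5: fit each candidate on the training set, score it on validation,
# and keep the setting with the highest validation accuracy
val_scores = {}
for k in (1, 3, 5, 7):
    candidate = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    val_scores[k] = candidate.score(X_va, y_va)
best_k = max(val_scores, key=val_scores.get)

# Step 6: refit the chosen configuration on the full Train+Validation set
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_tv, y_tv)

# Step 7: one single evaluation on the held-out test set
test_accuracy = final_model.score(X_te, y_te)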


# --- 4. Stratification Importance ---
# Stratification preserves the class distribution of the original dataset
# in each split. This matters for:
# - Classification tasks in general.
# - Imbalanced datasets in particular, where one class is much rarer.
#   Without stratification, a split might accidentally put very few (or even zero)
#   samples of the minority class into the test or validation set, making
#   evaluation unreliable or impossible.
# How it works: `stratify=y` uses the labels in `y` to guide the split, so each
# class appears in every split in roughly its original proportion.

print("--- Stratification Importance ---")
print("Stratification (using stratify=y) maintains class proportions across splits.")
print("This is crucial for reliable evaluation in classification tasks,")
print("especially when dealing with imbalanced datasets.")
print("-" * 30)
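
# A minimal sketch of the imbalanced case, using synthetic, made-up labels
# (90 majority vs. 10 minority samples). The *_imb names and the range of
# seeds are assumptions for this example only.
import numpy as np
from sklearn.model_selection import train_test_split

y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(len(y_imb)).reshape(-1, 1)  # placeholder feature column

stratified_counts = []
unstratified_counts = []
for seed in range(20):
    # With stratify: the test set receives 20% of each class
    _, _, _, y_te_s = train_test_split(
        X_imb, y_imb, test_size=0.2, random_state=seed, stratify=y_imb
    )
    # Without stratify: the minority count in the test set is left to chance
    _, _, _, y_te_u = train_test_split(
        X_imb, y_imb, test_size=0.2, random_state=seed
    )
    stratified_counts.append(int((y_te_s == 1).sum()))
    unstratified_counts.append(int((y_te_u == 1).sum()))

# Every stratified test set holds 2 minority samples (10% of 20), while the
# unstratified counts typically differ from seed to seed.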

--- Strategy 1: Train/Test Split ---
Original data shape: X=(150, 4), y=(150,)

Training set shape: X=(105, 4), y=(105,)
Test set shape: X=(45, 4), y=(45,)

Original class distribution (%):
 0    33.333333
1    33.333333
2    33.333333
Name: proportion, dtype: float64

Training set class distribution (%):
 0    33.333333
1    33.333333
2    33.333333
Name: proportion, dtype: float64

Test set class distribution (%):
 0    33.333333
1    33.333333
2    33.333333
Name: proportion, dtype: float64
Note: Proportions should be very similar due to stratify=y.
------------------------------
--- Strategy 2: Train/Validation/Test Split ---

Original data size: 150
Final Training set size: 90 (60.0%)
Validation set size: 30 (20.0%)
Final Test set size: 30 (20.0%)

This creates three distinct sets for robust model development and evaluation.
------------------------------
--- Stratification Importance ---
Stratification (using stratify=y) maintains class proportions across splits.
This is crucial 