# Train-Test Splitting

In this notebook, we will cover:
- Why we split data into training and testing sets
- Using `train_test_split` in Scikit-learn
- Stratified splits (for classification)
- Train/validation/test sets

## 1. Why Train-Test Split?
- **Training set** → Used to fit the model
- **Testing set** → Used to evaluate model performance on unseen data
- Evaluating on held-out data helps detect **overfitting** and estimates how well the model generalizes. The split itself does not prevent overfitting.
- A common split is 70–80% training and 20–30% testing.
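
To make the overfitting point concrete, here is a minimal sketch. It fits an unpruned decision tree and compares accuracy on the training rows with accuracy on held-out rows. The choice of `DecisionTreeClassifier`, the 70/30 split, and the `*_demo` names are assumptions made for illustration. A large gap between the two scores is the usual sign of overfitting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load features and labels as NumPy arrays
X_demo, y_demo = load_iris(return_X_y=True)

# Hold out 30% of the rows; the model never sees them during fitting
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# An unpruned tree is flexible enough to memorize its training data
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)  # accuracy on rows the model was fit on
test_acc = tree.score(X_te, y_te)   # accuracy on unseen rows

print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
```

The training score alone is optimistic because the model has already seen those rows. The test score is the one to report.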

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

X.head()

## 2. Basic Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
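
The sketch below shows what the two main arguments do. `random_state` makes the shuffle reproducible, and `test_size` accepts either a fraction or an absolute row count. The variable names (`X_demo`, `test_a`, and so on) are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load Iris as a DataFrame/Series pair so row indices are easy to compare
X_demo, y_demo = load_iris(return_X_y=True, as_frame=True)

# Same random_state -> the same rows land in the test set on every run
_, test_a, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
_, test_b, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
same_seed_match = test_a.index.equals(test_b.index)

# An integer test_size is read as a number of rows, not a fraction
_, test_n, _, _ = train_test_split(X_demo, y_demo, test_size=30, random_state=42)

print("Identical test rows with the same seed:", same_seed_match)
print("Rows with test_size=0.2:", len(test_a))
print("Rows with test_size=30: ", len(test_n))
```

Fixing `random_state` is mainly useful for making results repeatable. Leaving it unset gives a different split on each run.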

## 3. Stratified Splits
- In classification, we want **class proportions** to be maintained in train and test sets.
- Use `stratify=y` in `train_test_split`.

In [None]:
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print("Class distribution in full dataset:")
print(y.value_counts(normalize=True))
print("\nClass distribution in stratified training set:")
print(y_train_strat.value_counts(normalize=True))
print("\nClass distribution in stratified test set:")
print(y_test_strat.value_counts(normalize=True))
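
Iris is perfectly balanced, so stratification changes little there. The effect is easier to see on imbalanced labels. The sketch below uses a made-up label vector (90 samples of class 0, 10 of class 1) and a placeholder feature column; both are hypothetical data built only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(100).reshape(-1, 1)  # placeholder feature column

# Plain split: the minority count in the test set depends on the shuffle
_, _, _, y_te_plain = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=0
)

# Stratified split: each class keeps its share of the data in both parts
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=0, stratify=y_imb
)

print("Minority count, plain test set:     ", int(y_te_plain.sum()))
print("Minority count, stratified test set:", int(y_te_strat.sum()))
print("Minority count, stratified train set:", int(y_tr_strat.sum()))
```

With stratification, the 20-row test set holds 2 of the 10 minority samples and the training set holds the other 8, matching the 10% rate in the full data.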

## 4. Train / Validation / Test Split
- Often we need **3 sets**:
  - Training → Fit model
  - Validation → Tune hyperparameters
  - Testing → Final unbiased evaluation

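- One approach: split the data twice, first setting aside the test set, then dividing the remainder into training and validation sets.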

In [None]:
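# First split: hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Second split: 25% of the remaining 80% becomes validation (20% of the total),
# leaving 60% of the total for training
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)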

print("Training set:", X_train.shape)
print("Validation set:", X_val.shape)
print("Testing set:", X_test.shape)
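
The second `test_size` in a two-step split is a fraction of the remaining data, not of the full dataset, which is easy to get wrong. One way to avoid the mental arithmetic is a small wrapper. `train_val_test_split` below is a hypothetical helper, not a Scikit-learn function, and it assumes a classification target so that `stratify` applies.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def train_val_test_split(X, y, val_size=0.2, test_size=0.2, random_state=None):
    """Hypothetical helper: val_size and test_size are fractions of the full data."""
    # First split: set aside the test set
    X_temp, X_test, y_temp, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    # Rescale val_size, because the second split only sees
    # the remaining (1 - test_size) share of the data
    relative_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp, y_temp, test_size=relative_val,
        random_state=random_state, stratify=y_temp
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_va, X_te, y_tr, y_va, y_te = train_val_test_split(
    X_demo, y_demo, val_size=0.2, test_size=0.2, random_state=42
)
print("Train / validation / test rows:", len(X_tr), len(X_va), len(X_te))
```

For the 150-row Iris data, the defaults give a 60/20/20 split of 90, 30, and 30 rows.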

## ✅ Summary
- A train-test split helps detect overfitting and gives a less biased estimate of performance on unseen data.
- Use **stratification** for classification tasks to preserve class ratios.
- A third **validation set** helps with model selection and tuning.
- Typical ratios:
  - 70/30 or 80/20 (train/test)
  - 60/20/20 (train/val/test)