# Train, Validation and Test Split

## Introduction

After training a machine learning model, a fundamental question arises: **how do we know whether the model truly works well?** Evaluating a model on the same data it was trained on is not sufficient, because a good score there does not guarantee good performance on new data it has never seen.

Splitting data into **training**, **validation**, and **test** sets is an essential practice for properly evaluating the performance of machine learning models.

---

## The Problem: Overfitting vs Underfitting

### Overfitting

**Overfitting** occurs when the model "memorizes" the training data, capturing not only the real patterns but also the noise and specific peculiarities of that data.

**Characteristics:**
- ✅ **Excellent** performance on training data
- ❌ **Poor** performance on new data
- The model is too specific and does not generalize

**Analogy:** It's like a student who memorizes answers from old exams but doesn't understand the concepts. They do well on old exams but fail on new questions.

### Underfitting

**Underfitting** occurs when the model is too simple and cannot capture the patterns present in the data.

**Characteristics:**
- ❌ **Poor** performance on training data
- ❌ **Poor** performance on new data
- The model is too simple

**Analogy:** It's like a student who didn't study enough and can't answer even basic questions.

### The Ideal Balance

The goal is to find the **sweet spot** where the model:
- Learns the real patterns in the data
- Generalizes well to new data
- Does not memorize specific peculiarities
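
The contrast between underfitting, a reasonable fit, and overfitting can be sketched by fitting polynomials of increasing degree to the same data. This is a minimal illustration under stated assumptions: the synthetic sine-shaped dataset, the chosen degrees (1, 4, 15), and names such as `X_demo` and `fit_scores` are hypothetical and not part of the original material.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic nonlinear data (assumption for illustration): y = sin(2*pi*x) + noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(40, 1))
y_demo = np.sin(2 * np.pi * X_demo.ravel()) + rng.normal(0, 0.3, size=40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# degree 1: likely too simple, degree 4: moderate, degree 15: very flexible
fit_scores = {}
for degree in (1, 4, 15):
    poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    poly_model.fit(X_tr, y_tr)
    r2_tr = r2_score(y_tr, poly_model.predict(X_tr))
    r2_te = r2_score(y_te, poly_model.predict(X_te))
    fit_scores[degree] = (r2_tr, r2_te)
    print(f"degree={degree:2d} | R² train={r2_tr:.3f} | R² test={r2_te:.3f}")
```

A low score on both sets suggests underfitting, while a high training score paired with a much lower test score suggests overfitting. The exact numbers depend on the random seed and noise level.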

---

## Train/Test Split (Basic Division)

### Concept

The simplest division separates the data into **two sets**:

1. **Training Set**: ~70-80% of the data
   - Used to train the model
   - The model learns patterns here

2. **Test Set**: ~20-30% of the data
   - Used **only** to evaluate the final model
   - Simulates "real-world" data the model has never seen

### Why Do This?

**Without split:**
```
Train on dataset → Test on same dataset → R² = 0.99 ✅
```
Looks great, but it's **misleading**! The model might just be memorizing.

**With split:**
```
Train on training set → Test on test set → R² = 0.95 ✅
```
Now we have a **real** measure of how the model generalizes.

### Typical Proportions

| Dataset Size | Train | Test |
|-------------------|-------|------|
| Small (< 1000) | 70% | 30% |
| Medium (1k-100k) | 80% | 20% |
| Large (> 100k) | 90% | 10% |

**General rule:** The more data you have, the smaller the percentage needed for testing.

---

## Train/Validation/Test Split (Complete Division)

### Concept

For more robust projects, we divide the data into **three sets**:

1. **Training Set**: ~60-70% of the data
   - Used to train the model
   - Adjusts parameters (w, b)

2. **Validation Set**: ~15-20% of the data
   - Used to **tune hyperparameters**
   - Compare different models
   - Detect overfitting during training
   - **Not** used for training

3. **Test Set**: ~15-20% of the data
   - Used **only once** at the end
   - Final and unbiased evaluation
   - Simulates production performance

### Why Three Sets?

**Problem with only Train/Test:**
- If we use the test set to adjust hyperparameters, information about it "leaks" into our modeling choices (data leakage)
- The test set ceases to be unbiased
- We can overfit on the test set!

**Solution with Train/Validation/Test:**
- **Validation** is used for experimentation and adjustments
- **Test** remains "untouched" until the end
- We have a truly unbiased evaluation
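
One common way to obtain the three sets is to call `train_test_split` twice. The sketch below assumes a 60/20/20 split and uses `Ridge` with a small grid of `alpha` values purely as an example of a model with a hyperparameter to tune; the synthetic data and all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption for illustration)
rng = np.random.default_rng(1)
X_all = rng.uniform(0, 10, size=(200, 3))
y_all = X_all @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, size=200)

# Step 1: hold out 20% of the data as the test set
X_rest, X_te, y_rest, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=1
)
# Step 2: split the remaining 80% into train and validation.
# 0.25 of the remaining 80% equals 20% of the full dataset.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1
)

# Tune the hyperparameter using the validation set only
best_alpha, best_val_r2 = None, -np.inf
for alpha in (0.01, 1.0, 100.0):
    candidate = Ridge(alpha=alpha).fit(X_tr, y_tr)
    val_r2 = r2_score(y_val, candidate.predict(X_val))
    if val_r2 > best_val_r2:
        best_alpha, best_val_r2 = alpha, val_r2

# Evaluate on the test set once, after all choices are made
final_model = Ridge(alpha=best_alpha).fit(X_tr, y_tr)
test_r2 = r2_score(y_te, final_model.predict(X_te))
print(f"best alpha={best_alpha} | R² val={best_val_r2:.3f} | R² test={test_r2:.3f}")
```

The test score is computed only after `best_alpha` has been fixed, so it plays no part in model selection.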

---

## Random vs Stratified Split

### Random Split

Randomly selects examples for each set.

```python
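import numpy as np

data = np.arange(100)  # placeholder dataset, for illustration only
np.random.shuffle(data)  # shuffles in place
n = len(data)
train = data[:int(0.6 * n)]  # first 60%
val = data[int(0.6 * n):int(0.8 * n)]  # next 20%
test = data[int(0.8 * n):]  # last 20%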
```

**When to use:**
- Balanced data
- Regression (continuous values)
- Large datasets

### Stratified Split

Maintains the **same proportion of classes** in all sets.

**When to use:**
- Classification with imbalanced classes
- Ex: 95% class A, 5% class B
- Ensures train, val, and test have ~95%/5%
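
In scikit-learn, a stratified split can be requested through the `stratify` argument of `train_test_split`. The sketch below builds a hypothetical 95%/5% label array to mirror the example above; the `features` and `labels` arrays are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 95% class 0, 5% class 1
labels = np.array([0] * 950 + [1] * 50)
features = np.arange(1000).reshape(-1, 1)

# stratify=labels asks for the class proportions to be preserved in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42
)

# Fraction of the minority class in each set
print(f"train minority fraction: {y_tr.mean():.3f}")
print(f"test minority fraction:  {y_te.mean():.3f}")
```

Without `stratify`, a purely random split of such data could leave the test set with very few minority examples, or none at all on smaller datasets.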

---

## Best Practices

### ✅ What to Do

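1. **Split BEFORE fitting any preprocessing**
   - Fit normalization statistics (mean, standard deviation) on the training set only, then apply them to validation and test
   - Avoids "data leakage"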

2. **Never train with validation/test data**
   - Validation is only for evaluation
   - Test is only for final evaluation

3. **Keep test set "sacred"**
   - Use only ONCE at the end
   - Don't adjust anything based on it

4. **Shuffle the data**
   - Avoids bias if data is ordered

### ❌ What NOT to Do

1. Normalize using statistics from the full dataset, before splitting (causes data leakage)
2. Use test set to adjust hyperparameters
3. Train with validation/test set
4. Evaluate multiple times on test set
5. Randomly split temporal data (use temporal split)
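
The "split first, normalize after" rule can be sketched with scikit-learn's `StandardScaler`: the scaler is fitted on the training set only and then reused for the test set. The random data and variable names here are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical raw features with a non-zero mean and non-unit variance
rng = np.random.default_rng(7)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y_raw = rng.normal(size=100)

# 1) Split first
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=7
)

# 2) Fit the scaler on training data only
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)

# 3) Apply the training statistics to the test data (no refitting)
X_te_scaled = scaler.transform(X_te)
```

Calling `fit_transform` on the full dataset before splitting would let the test set influence the mean and standard deviation, which is the leakage described above.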

---

## Temporal Data (Time Series)

For data with a temporal component (stock prices, sales over time), **DO NOT shuffle**!

**Example:**
- Train: January - August (8 months)
- Validation: September - October (2 months)
- Test: November - December (2 months)

**Why?**
- In production, you always predict the future based on the past
- Shuffling creates "temporal leakage" (model sees the future)
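
A chronological split like the one in the example can be sketched with plain slicing, as long as the data is already sorted by time. The monthly `sales` values below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical monthly series, already in chronological order
months = np.array([
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
])
sales = np.arange(12, dtype=float)  # placeholder values

# No shuffling: slice by position so every later set lies in the future
train_months, train_sales = months[:8], sales[:8]      # January to August
val_months, val_sales = months[8:10], sales[8:10]      # September to October
test_months, test_sales = months[10:], sales[10:]      # November to December

print("Train:", train_months[0], "to", train_months[-1])
print("Validation:", val_months[0], "to", val_months[-1])
print("Test:", test_months[0], "to", test_months[-1])
```

When using `train_test_split` on ordered data, passing `shuffle=False` gives the same kind of positional split.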

---

## Summary

| Aspect | Train Set | Validation Set | Test Set |
|---------|-----------|----------------|----------|
| **Usage** | Train model | Adjust hyperparameters | Final evaluation |
| **Frequency** | Multiple times | Multiple times | **Once** |
| **Size** | 60-80% | 10-20% | 10-20% |
| **Can train?** | ✅ Yes | ❌ No | ❌ No |

In [None]:
# ============================================================
# IMPORTS
# ============================================================

# Library for numerical computation
import numpy as np

# Library for visualization
import matplotlib.pyplot as plt

# Function to split data into training and test sets
from sklearn.model_selection import train_test_split

# Linear regression model
from sklearn.linear_model import LinearRegression

# R² metric (coefficient of determination)
from sklearn.metrics import r2_score

In [None]:
# ============================================================
# DATA GENERATION (SYNTHETIC DATASET)
# ============================================================

# Fix the random seed to ensure reproducibility
np.random.seed(42)

# Independent variable X (50 points between 0 and 10)
X = np.random.rand(50, 1) * 10

# Dependent variable y
# Linear relationship: y = 2.5x + 5 + noise
y = 2.5 * X.ravel() + 5 + np.random.randn(50) * 3

In [None]:
# ============================================================
# SCENARIO 1 — WITHOUT TRAIN/TEST SPLIT (WRONG)
# ============================================================

print("\n❌ SCENARIO 1: WITHOUT TRAIN/TEST SPLIT")
print("-" * 70)

# Create the model
model_no_split = LinearRegression()

# Train the model using ALL data
model_no_split.fit(X, y)

# Make predictions on the same data used for training
y_pred_no_split = model_no_split.predict(X)

# Compute R² using training = test (problem!)
r2_no_split = r2_score(y, y_pred_no_split)

print(f"R² = {r2_no_split:.4f}")
print("⚠️  Misleading evaluation: the model was tested on the same data it was trained on!\n")

In [None]:
# ============================================================
# SCENARIO 2 — WITH TRAIN/TEST SPLIT (CORRECT)
# ============================================================

print("=" * 70)
print("✅ SCENARIO 2: WITH TRAIN/TEST SPLIT")
print("-" * 70)

# Split the data:
# 70% training | 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"Training: {len(X_train)} samples | Test: {len(X_test)} samples\n")

# Create the model
model_good = LinearRegression()

# Train ONLY using training data
model_good.fit(X_train, y_train)

# Predictions on training data
y_train_pred = model_good.predict(X_train)

# Predictions on data NEVER seen before (test)
y_test_pred = model_good.predict(X_test)

# Proper evaluation
r2_train_good = r2_score(y_train, y_train_pred)
r2_test_good = r2_score(y_test, y_test_pred)

print(f"R² Training: {r2_train_good:.4f}")
print(f"R² Test:     {r2_test_good:.4f}")
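gap = abs(r2_train_good - r2_test_good)
print(f"Difference:  {gap:.4f}")

# Only claim good generalization when the gap is small
# (0.1 is an illustrative threshold, not a universal rule)
if gap < 0.1:
    print("✨ The model generalizes well!\n")
else:
    print("⚠️  Large train/test gap: the model may not generalize well.\n")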

In [None]:
# ============================================================
# VISUALIZATION — TRAIN vs TEST (CLEAN VERSION)
# ============================================================

fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Continuous line for regression
X_line = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)

# Blue color palette
blue_light = "#7EC8E3"
blue_dark = "#1F4E79"
blue_mid = "#3A7CA5"


# ------------------------------------------------------------
# PLOT 1 — WITHOUT SPLIT
# ------------------------------------------------------------

axes[0].scatter(
    X, y,
    s=90,
    color=blue_light,
    alpha=0.8
)

axes[0].plot(
    X_line,
    model_no_split.predict(X_line),
    color=blue_dark,
    linewidth=3
)

axes[0].set_title(
    f"Without Train/Test Split\nR² = {r2_no_split:.3f}",
    fontsize=13,
    fontweight="bold",
    color=blue_dark
)

axes[0].set_xlabel("X")
axes[0].set_ylabel("y")
axes[0].grid(True, alpha=0.25)


# ------------------------------------------------------------
# PLOT 2 — WITH SPLIT
# ------------------------------------------------------------

# Training data
axes[1].scatter(
    X_train, y_train,
    s=90,
    color=blue_mid,
    alpha=0.75,
    label="Training"
)

# Test data
axes[1].scatter(
    X_test, y_test,
    s=90,
    color=blue_light,
    edgecolor=blue_dark,
    linewidth=1.5,
    label="Test"
)

# Model trained on training data
axes[1].plot(
    X_line,
    model_good.predict(X_line),
    color=blue_dark,
    linewidth=3
)

axes[1].set_title(
    f"With Train/Test Split\nTrain={r2_train_good:.2f} | Test={r2_test_good:.2f}",
    fontsize=13,
    fontweight="bold",
    color=blue_dark
)

axes[1].set_xlabel("X")
axes[1].set_ylabel("y")
axes[1].legend(frameon=False)
axes[1].grid(True, alpha=0.25)


plt.tight_layout()
plt.show()