# Day 22: Cross Validation & Hyperparameter Tuning

As part of my machine learning journey, Day 22 focuses on understanding **model generalization**,  
the **bias–variance tradeoff**, and how **cross validation** helps build robust and reliable models.

This notebook covers:
- Generalization in Machine Learning
- Bias–Variance Tradeoff
- Underfitting vs Overfitting
- Validation Set & Data Leakage
- K-Fold Cross Validation
- Hyperparameter Tuning using Cross Validation

## Generalization in Machine Learning

**Generalization** refers to how well a trained model performs on **unseen data**.

A good ML model should:
- Perform well on training data
- Perform similarly well on test data

The real objective of Machine Learning is **not just high training accuracy**,  
but **stable and consistent performance on unseen data**.
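
One way to make this concrete is to compare accuracy on the data a model was trained on with accuracy on data it never saw. The sketch below is only an illustration: it assumes the Iris dataset and an unconstrained `DecisionTreeClassifier`, and the variable names are chosen here for readability.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples so the model never sees them during training
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained tree is free to memorize the training set
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
holdout_acc = model.score(X_holdout, y_holdout)

print("Training accuracy:", train_acc)
print("Held-out accuracy:", holdout_acc)
print("Generalization gap:", train_acc - holdout_acc)
```

A small gap suggests the model generalizes; a large gap is a warning sign of overfitting.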


## Bias–Variance Tradeoff

### Bias
- Error due to overly simplistic assumptions about the data
- High bias → overly simple models
- Leads to **underfitting**

Examples:
- KNN with very large `k`
- Decision Tree with very small `max_depth`

### Variance
- Error due to sensitivity to small fluctuations in the training data
- High variance → overly complex models
- Leads to **overfitting**

Examples:
- KNN with `k = 1`
- Deep Decision Trees

### Tradeoff
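Reducing bias typically increases variance, and vice versa: a more flexible model fits the training data more closely but also becomes more sensitive to noise in it.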

**Goal:** Find a balance that minimizes total error.
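
For squared-error loss, the tradeoff can be written down explicitly. The notation is introduced here for illustration and is not part of the original notes: assume the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and let $\hat{f}$ be a model learned from a random training set. The expected error at a point $x$ then decomposes as:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

Only the first two terms depend on the model, so "minimizing total error" means minimizing the sum of squared bias and variance.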


## Underfitting vs Overfitting

| Scenario | Bias | Variance | Performance |
|--------|------|----------|-------------|
| Underfitting | High | Low | Poor on train & test |
| Overfitting | Low | High | Good on train, poor on test |
| Good Fit | Balanced | Balanced | Good on both |
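
The three scenarios in the table can be observed by sweeping a complexity hyperparameter and comparing training and validation scores. The sketch below is an illustration only: it assumes the Iris dataset, a decision tree, and a hand-picked list of depths, and uses scikit-learn's `validation_curve`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical range of tree depths, from very simple to very flexible
depths = [1, 2, 3, 5, 10]

# Each row of the result holds the per-fold scores for one depth
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42),
    X,
    y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)

mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)

for depth, tr, va in zip(depths, mean_train, mean_val):
    print(f"max_depth={depth:>2}  train={tr:.3f}  validation={va:.3f}")
```

Low scores on both sides point to underfitting, while a training score well above the validation score points to overfitting.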

## Why Is a Train–Test Split Not Enough?

If we repeatedly:
- Train a model
- Check test performance
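- Adjust hyperparameters

👉 The model indirectly learns from the test data, because every hyperparameter choice is guided by test scores.

This is a form of **data leakage**: the test score becomes optimistically biased and is no longer a reliable estimate of performance on unseen data.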

**Rule:**  
Test data must be used **only once** at the final evaluation stage.

## Validation Set (Hold-Out Method)

To avoid data leakage:
- Split data into Training and Testing sets
- Further split Training into:
  - Training data
  - Validation data

### Purpose of Validation Set
- Tune hyperparameters
- Select best model
- Keep test data untouched

### Train / Validation / Test Split

```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load dataset
X, y = load_iris(return_X_y=True)

# Train-Test split: hold out 20% of the data as the final test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train-Validation split: 25% of the remaining 80% = 20% of the full data
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

print("Training size:", X_train.shape)
print("Validation size:", X_val.shape)
print("Test size:", X_test.shape)
```

```
Training size: (90, 4)
Validation size: (30, 4)
Test size: (30, 4)
```
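
Once the three sets exist, the validation set is what candidate models are compared on. The sketch below shows one way to do that, repeating the same split so it runs on its own. The list of candidate depths is a hypothetical choice made for this example, and the test set is deliberately never scored.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Same 60/20/20 split as above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

# Hypothetical candidate values for max_depth
candidate_depths = [1, 2, 3, 4, 5]

val_scores = {}
for depth in candidate_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    # Score on the validation set only; the test set stays untouched
    val_scores[depth] = model.score(X_val, y_val)

# On ties, max() keeps the first (shallowest) depth
best_depth = max(val_scores, key=val_scores.get)

print("Validation accuracy per depth:", val_scores)
print("Selected max_depth:", best_depth)
```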


## Cross Validation

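Cross Validation provides a **more reliable estimate** of model performance than a single hold-out split, because the score no longer depends on which samples happened to land in one validation set.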

### K-Fold Cross Validation
- Split the dataset into `K` folds of (roughly) equal size
- Train on `K-1` folds
- Validate on the remaining fold
- Repeat `K` times, so that each fold serves as the validation set exactly once
- Final score = mean of the `K` validation scores

Common values:
- K = 5
- K = 10
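
The steps above can be written out as an explicit loop to see what happens in each fold. This is a minimal sketch, assuming a decision tree as the model. For brevity it runs on the whole Iris dataset; in the workflow of this notebook only the training portion would be passed in.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # Fit a fresh model on the K-1 training folds
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X[train_idx], y[train_idx])

    # Validate on the one fold that was left out
    score = model.score(X[val_idx], y[val_idx])
    fold_scores.append(score)
    print(f"Fold {fold}: train={len(train_idx)}, "
          f"validation={len(val_idx)}, accuracy={score:.3f}")

# Final score is the mean of the K validation scores
mean_score = np.mean(fold_scores)
print("Mean CV score:", mean_score)
```

`cross_val_score` wraps this same loop in a single call.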


### K-Fold Cross Validation

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X_train, y_train, cv=kf)

print("Cross-validation scores:", scores)
print("Mean CV score:", scores.mean())
```

```
Cross-validation scores: [1.         0.94444444 1.         0.88888889 0.88888889]
Mean CV score: 0.9444444444444444
```


## Types of Cross Validation

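1. **K-Fold Cross Validation**
   - Each of the `K` folds serves as the validation set once
2. **Repeated K-Fold Cross Validation**
   - Runs K-Fold several times with different random splits and averages the scores
3. **Stratified K-Fold Cross Validation**
   - Preserves class distribution in every fold
   - Preferred for classification
4. **Leave-One-Out Cross Validation (LOOCV)**
   - K-Fold with `K` equal to the number of samples
   - Computationally expensive
5. **Leave-P-Out Cross Validation**
   - Validates on every possible subset of `P` samples
   - Extremely expensive for large datasets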
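
The cost differences between these strategies show up in how many model fits each one requires. The sketch below counts the splits produced by the corresponding scikit-learn splitter classes on the 150-sample Iris dataset. The settings (`n_splits=5`, `n_repeats=3`, `p=2`) are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    LeavePOut,
    RepeatedKFold,
    StratifiedKFold,
)

X, y = load_iris(return_X_y=True)

splitters = {
    "KFold": KFold(n_splits=5, shuffle=True, random_state=42),
    "RepeatedKFold": RepeatedKFold(n_splits=5, n_repeats=3, random_state=42),
    "StratifiedKFold": StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    "LeaveOneOut": LeaveOneOut(),
    "LeavePOut(p=2)": LeavePOut(p=2),
}

# get_n_splits reports how many train/validation splits (model fits) are needed
n_fits = {name: cv.get_n_splits(X, y) for name, cv in splitters.items()}

for name, n in n_fits.items():
    print(f"{name:<16} -> {n} fits")
```

Any of these splitter objects can be passed as the `cv` argument of `cross_val_score`.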

## Hyperparameter Tuning

Hyperparameters are settings chosen before training rather than learned from the data. Many of them control model complexity.

Examples:
- KNN → `k`
- Decision Tree → `max_depth`
- Random Forest → `n_estimators`

Cross Validation helps select hyperparameters that:
- Avoid overfitting
- Improve generalization
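
As a minimal sketch of the idea, the loop below scores several values of `k` for KNN with 5-fold cross validation on the training data and keeps the one with the highest mean score. The candidate list is a hypothetical choice for this example, and the test set is split off first and never used during tuning.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Keep a test set aside; only the training portion is used for tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical candidate values for k
candidate_k = [1, 3, 5, 7, 9, 15]

mean_scores = {}
for k in candidate_k:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy across the 5 validation folds
    mean_scores[k] = cross_val_score(knn, X_train, y_train, cv=5).mean()

# On ties, max() keeps the first (smallest) k
best_k = max(mean_scores, key=mean_scores.get)

for k, score in mean_scores.items():
    print(f"k={k:>2}  mean CV accuracy={score:.3f}")
print("Selected k:", best_k)
```

`GridSearchCV` automates this loop and extends it to combinations of several hyperparameters.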


### GridSearchCV (Hyperparameter Tuning)

```python
from sklearn.model_selection import GridSearchCV

# Every combination of these values is evaluated (5 x 3 = 15 candidates)
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 5, 10]
}

# For a classifier, an integer cv uses stratified K-fold splits
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy"
)

grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best CV Score:", grid.best_score_)
```

```
Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Best CV Score: 0.9222222222222222
```


## Final Model Evaluation

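After selecting the best hyperparameters:
- Retrain the model on the full training data (`GridSearchCV` does this automatically with its default `refit=True`)
- Evaluate **once** on the test data

Because the test data played no part in training or tuning, this gives an unbiased estimate of model performance.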


### Test Set Evaluation

```python
from sklearn.metrics import accuracy_score

# best_estimator_ was already refit on X_train with the best parameters
best_model = grid.best_estimator_
y_test_pred = best_model.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))
```

```
Test Accuracy: 0.9666666666666667
```
