# Module 07: Cross-Validation and Hyperparameter Tuning

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 70 minutes  
**Prerequisites**: 
- [Module 02: Data Preparation and Train-Test Split](02_data_preparation_train_test_split.ipynb)
- [Module 06: Model Evaluation Metrics](06_model_evaluation_metrics.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand why cross-validation is needed and how it works
2. Implement K-fold and Stratified K-fold cross-validation
3. Use GridSearchCV for exhaustive hyperparameter search
4. Use RandomizedSearchCV for faster hyperparameter tuning
5. Avoid common pitfalls in model selection
6. Choose optimal hyperparameters for your models

## 1. Why Cross-Validation?

### The Problem with Single Train-Test Split

**Scenario**: You split data 70-30 and get 85% accuracy.

**Questions**:
- Was it luck? What if we chose a different split?
- Did we happen to get an "easy" test set?
- How confident can we be in this 85%?
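
A quick way to see the issue is to repeat the same 70-30 split with different seeds and watch the accuracy move. The sketch below is illustrative only: it assumes scikit-learn's bundled iris data (`load_iris`) rather than this notebook's CSV file, and the `_demo` names are hypothetical.

```python
# Sketch: how much does a single 70-30 split's accuracy depend on the split?
# Assumption: scikit-learn's bundled iris data stands in for the notebook's CSV.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

split_scores = []
for seed in range(10):
    # Same model, same data, only the random split changes
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed
    )
    clf = LogisticRegression(max_iter=10000)
    clf.fit(X_tr, y_tr)
    split_scores.append(clf.score(X_te, y_te))

split_scores = np.array(split_scores)
print(f"Accuracy over 10 different splits: "
      f"min={split_scores.min():.3f}, max={split_scores.max():.3f}")
```

Any difference between the minimum and maximum comes from the split alone, which is the uncertainty a single score hides.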

### The Solution: Cross-Validation

**Key Idea**: Test the model on multiple different splits!

**Benefits**:
1. **More reliable estimate** of model performance
2. **Uses all data** for both training and validation
3. **Reduces variance** in performance estimates
4. **Detects overfitting** more reliably

### Real-World Analogy
Instead of taking one practice exam, take five different practice exams. Your average score is a better indicator of your true ability!

In [None]:
# Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Visualization settings
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("✓ Setup complete!")

## 2. K-Fold Cross-Validation

### How It Works

**K-Fold Process** (typically K=5 or K=10):

```
Fold 1: [TEST][TRAIN][TRAIN][TRAIN][TRAIN]
Fold 2: [TRAIN][TEST][TRAIN][TRAIN][TRAIN]
Fold 3: [TRAIN][TRAIN][TEST][TRAIN][TRAIN]
Fold 4: [TRAIN][TRAIN][TRAIN][TEST][TRAIN]
Fold 5: [TRAIN][TRAIN][TRAIN][TRAIN][TEST]
```

**Steps**:
1. Split data into K roughly equal parts ("folds")
2. For each fold:
   - Use that fold as test set
   - Use other K-1 folds as training set
   - Train model and evaluate
3. Average the K scores

**Result**: K different scores → Mean ± Standard Deviation
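
The fold layout above can be inspected directly by printing the indices that `KFold` produces. This is a minimal sketch on a hypothetical 10-sample toy array (`X_toy` is not part of the notebook's data); it only shows the splitting and trains no model.

```python
# Sketch: which samples land in the test fold on each of the 5 rounds?
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # 10 toy samples, one feature

kf = KFold(n_splits=5, shuffle=False)
test_counts = np.zeros(len(X_toy), dtype=int)

for fold, (train_idx, test_idx) in enumerate(kf.split(X_toy), 1):
    test_counts[test_idx] += 1  # track how often each sample is tested
    print(f"Fold {fold}: test={test_idx.tolist()} train={train_idx.tolist()}")

# Every sample is used for testing exactly once across the K rounds
print("Times each sample was in a test fold:", test_counts.tolist())
```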

In [None]:
# Load iris dataset
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

iris_df = pd.read_csv('data/sample/iris.csv')

# Prepare data
feature_cols = ['sepal length (cm)', 'sepal width (cm)', 
                'petal length (cm)', 'petal width (cm)']
X = iris_df[feature_cols]
y = iris_df['species']

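# Scale features (fitting the scaler on all rows lets held-out folds influence it;
# kept here for simplicity, Section 7 shows the Pipeline-based fix)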
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Dataset shape: {X.shape}")
print(f"Classes: {y.unique()}")

In [None]:
# Compare single split vs cross-validation
from sklearn.model_selection import train_test_split

# Method 1: Single train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

model = LogisticRegression(random_state=42, max_iter=10000)
model.fit(X_train, y_train)
single_score = model.score(X_test, y_test)

print("Method 1: Single Train-Test Split")
print(f"Accuracy: {single_score:.3f}")
print("Problem: Only one score - could be lucky or unlucky!\n")

# Method 2: 5-Fold Cross-Validation
model_cv = LogisticRegression(random_state=42, max_iter=10000)
cv_scores = cross_val_score(model_cv, X_scaled, y, cv=5)

print("Method 2: 5-Fold Cross-Validation")
print(f"Fold scores: {cv_scores}")
print(f"Mean accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
print("Benefit: More reliable estimate, plus a measure of spread across folds!")

In [None]:
# Visualize cross-validation scores
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot of fold scores
fold_numbers = [f'Fold {i+1}' for i in range(len(cv_scores))]
axes[0].bar(fold_numbers, cv_scores, color='steelblue', alpha=0.7, edgecolor='black')
axes[0].axhline(y=cv_scores.mean(), color='red', linestyle='--', 
               linewidth=2, label=f'Mean: {cv_scores.mean():.3f}')
axes[0].set_xlabel('Fold', fontsize=12)
axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].set_title('Cross-Validation Scores by Fold', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3, axis='y')

# Box plot showing distribution
axes[1].boxplot(cv_scores, vert=True, widths=0.5)
axes[1].scatter([1]*len(cv_scores), cv_scores, color='steelblue', 
               s=100, alpha=0.6, zorder=3)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('Cross-Validation Score Distribution', fontsize=13, fontweight='bold')
axes[1].set_xticks([1])
axes[1].set_xticklabels(['5-Fold CV'])
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print(f"Consistency check: std across folds = {cv_scores.std():.3f} (lower = more stable across splits)")

## 3. Stratified K-Fold for Classification

### The Problem with Regular K-Fold

**Scenario**: Dataset with 90% Class A, 10% Class B

**Risk**: One fold might have:
- All Class B samples (leaving none in the training folds)
- No Class B samples (can't evaluate properly)

### The Solution: Stratified K-Fold

**Key Feature**: Preserves class proportions in each fold

- If original data: 90% A, 10% B
- Each fold will have: ~90% A, ~10% B

**Rule of Thumb**: Always use Stratified K-Fold for classification!
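
To make the 90/10 scenario concrete, the sketch below counts minority-class samples per test fold for both splitters. It uses hypothetical toy labels (`y_toy`), sorted on purpose and split without shuffling, which is a worst case for regular K-Fold.

```python
# Sketch: minority-class count per test fold, KFold vs StratifiedKFold
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 100 toy labels: 90 of class 0 followed by 10 of class 1 (sorted on purpose)
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((100, 1))  # features are irrelevant for splitting

plain = KFold(n_splits=5)            # no shuffling, ignores labels
strat = StratifiedKFold(n_splits=5)  # balances labels within each fold

plain_minority = [int(y_toy[test].sum()) for _, test in plain.split(X_toy)]
strat_minority = [int(y_toy[test].sum()) for _, test in strat.split(X_toy, y_toy)]

print("Class-1 samples per test fold (KFold):          ", plain_minority)
print("Class-1 samples per test fold (StratifiedKFold):", strat_minority)
```

With unshuffled K-Fold, all ten minority samples fall into the last fold, so four rounds cannot evaluate Class 1 at all and the fifth trains without it. Shuffling reduces this risk but does not remove it.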

In [None]:
# Load wine dataset (more imbalanced)
from sklearn.model_selection import StratifiedKFold

wine_df = pd.read_csv('data/sample/wine.csv')

X_wine = wine_df.drop('target', axis=1)
y_wine = wine_df['target']

# Check class distribution
print("Original class distribution:")
print(y_wine.value_counts(normalize=True))
print()

In [None]:
# Compare regular vs stratified K-Fold
print("=" * 60)
print("REGULAR K-FOLD (may have imbalanced folds)")
print("=" * 60)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for fold_idx, (train_idx, test_idx) in enumerate(kfold.split(X_wine), 1):
    y_fold = y_wine.iloc[test_idx]
    distribution = y_fold.value_counts(normalize=True).sort_index()
    print(f"Fold {fold_idx}: {dict(distribution)}")

print("\n" + "=" * 60)
print("STRATIFIED K-FOLD (maintains class proportions)")
print("=" * 60)

stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold_idx, (train_idx, test_idx) in enumerate(stratified_kfold.split(X_wine, y_wine), 1):
    y_fold = y_wine.iloc[test_idx]
    distribution = y_fold.value_counts(normalize=True).sort_index()
    print(f"Fold {fold_idx}: {dict(distribution)}")

print("\nNotice: Stratified K-Fold keeps similar proportions across all folds!")

In [None]:
# Compare performance
from sklearn.tree import DecisionTreeClassifier

model_wine = DecisionTreeClassifier(random_state=42)

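# Regular K-Fold (pass a KFold splitter explicitly; labels are ignored when splitting)
regular_scores = cross_val_score(
    model_wine, X_wine, y_wine,
    cv=KFold(n_splits=5, shuffle=True, random_state=42)
)

# Stratified K-Fold (what an integer cv selects for classifiers)
stratified_scores = cross_val_score(model_wine, X_wine, y_wine, cv=5)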

print("Regular K-Fold:")
print(f"  Mean: {regular_scores.mean():.3f} ± {regular_scores.std():.3f}")
print(f"  Scores: {regular_scores}\n")

print("Stratified K-Fold:")
print(f"  Mean: {stratified_scores.mean():.3f} ± {stratified_scores.std():.3f}")
print(f"  Scores: {stratified_scores}")

print("\nNote: with an integer cv, cross_val_score uses Stratified K-Fold for classifiers!")

## 4. Hyperparameter Tuning Basics

### What are Hyperparameters?

**Parameters**: Learned from data (e.g., weights in linear regression)
**Hyperparameters**: Set before training (e.g., tree depth, learning rate)
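
The distinction shows up directly in scikit-learn's API: hyperparameters are constructor arguments you can read before fitting, while learned parameters appear as attributes only after `fit()`. A minimal sketch, assuming the bundled iris data and an arbitrary illustrative value of `C=0.5`:

```python
# Sketch: hyperparameter (set by us) vs parameter (learned by fit)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = load_iris(return_X_y=True)

clf = LogisticRegression(C=0.5, max_iter=10000)

# Hyperparameter: chosen before training and readable immediately
print("Hyperparameter C:", clf.get_params()["C"])

# Parameters: the coefficient matrix does not exist until fit() runs
print("Has coef_ before fit:", hasattr(clf, "coef_"))
clf.fit(X_demo, y_demo)
print("Learned coef_ shape:", clf.coef_.shape)  # (n_classes, n_features)
```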

### Common Hyperparameters

**Decision Tree**:
- `max_depth`: Maximum tree depth
- `min_samples_split`: Minimum samples to split a node
- `min_samples_leaf`: Minimum samples in leaf node

**Random Forest**:
- `n_estimators`: Number of trees
- `max_features`: Features to consider for split
- Plus all decision tree hyperparameters

### Why Tune Hyperparameters?

**Default values** may not be optimal for your specific dataset!

- Too simple → Underfitting (high bias)
- Too complex → Overfitting (high variance)
- Just right → Best generalization

In [None]:
# Demonstrate impact of hyperparameters
from sklearn.tree import DecisionTreeClassifier

# Test different max_depth values
depths = [1, 2, 3, 5, 10, 20, None]
train_scores = []
cv_scores_list = []

for depth in depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    
    # Training score (on full data)
    model.fit(X_scaled, y)
    train_score = model.score(X_scaled, y)
    train_scores.append(train_score)
    
    # Cross-validation score
    # (separate name so the cv_scores array from Section 2 is not overwritten)
    depth_cv = cross_val_score(model, X_scaled, y, cv=5)
    cv_scores_list.append(depth_cv.mean())

# Create results dataframe
depth_labels = [str(d) if d is not None else 'None' for d in depths]
results_df = pd.DataFrame({
    'max_depth': depth_labels,
    'Training Score': train_scores,
    'CV Score': cv_scores_list,
    'Gap (Overfit)': [train - cv for train, cv in zip(train_scores, cv_scores_list)]
})

print("Impact of max_depth hyperparameter:")
print(results_df.to_string(index=False))
print("\nWhat to look for:")
print("- Very shallow trees (depth=1): low training AND CV scores suggest underfitting")
print("- Very deep trees (depth=None): training score near 1.0 with a larger gap suggests overfitting")
print("- The depth with the highest CV score is the one to prefer")

In [None]:
# Visualize training vs CV scores
plt.figure(figsize=(10, 6))
x_pos = range(len(depth_labels))

plt.plot(x_pos, train_scores, marker='o', linewidth=2, markersize=8,
        label='Training Score', color='blue')
plt.plot(x_pos, cv_scores_list, marker='s', linewidth=2, markersize=8,
        label='CV Score', color='red')

# Highlight best CV score
best_idx = np.argmax(cv_scores_list)
plt.scatter([best_idx], [cv_scores_list[best_idx]], 
           s=300, c='green', marker='*', 
           label=f'Best: depth={depth_labels[best_idx]}', zorder=5)

plt.xticks(x_pos, depth_labels)
plt.xlabel('max_depth', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Training vs Cross-Validation Scores\n(Large gap = overfitting)', 
         fontsize=13, fontweight='bold')
plt.legend(fontsize=11, loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 5. GridSearchCV - Exhaustive Search

### How It Works

**GridSearchCV** tries every combination of hyperparameters:

```python
param_grid = {
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10]
}
# Tests: 3 × 3 = 9 combinations
# With 5-fold CV: 9 × 5 = 45 model fits (plus one final refit of the best combination)!
```

**Process**:
1. Define parameter grid
2. For each combination:
   - Perform K-fold cross-validation
   - Record mean score
3. Select best combination
4. Retrain on full training data

**Pros**: Evaluates every combination, so it finds the best mean CV score within the grid  
**Cons**: Slow for large grids (the number of fits multiplies with every added hyperparameter)
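
The combination count can be checked before launching a search by expanding the grid with `ParameterGrid`, which enumerates the same candidates an exhaustive search would try. A small sketch using the two-parameter grid from above (the name `param_grid_demo` is hypothetical):

```python
# Sketch: enumerate the grid to see how many fits a search would need
from sklearn.model_selection import ParameterGrid

param_grid_demo = {
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
}

combos = list(ParameterGrid(param_grid_demo))
n_folds = 5

print("Combinations:", len(combos))
print("Cross-validation fits:", len(combos) * n_folds)
print("One candidate:", combos[0])
```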

In [None]:
# GridSearchCV example
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10]
}

print("Parameter grid:")
for param, values in param_grid.items():
    print(f"  {param}: {values}")

total_combinations = np.prod([len(v) for v in param_grid.values()])
print(f"\nTotal combinations: {total_combinations}")
print(f"With 5-fold CV: {total_combinations * 5} model fits!")
print("\nSearching for best hyperparameters...")

In [None]:
# Perform grid search
rf_model = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(
    estimator=rf_model,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,  # Use all CPU cores
    verbose=0
)

# Fit on iris data
grid_search.fit(X_scaled, y)

print("Grid Search Complete!\n")
print("Best hyperparameters:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")

print(f"\nBest cross-validation score: {grid_search.best_score_:.3f}")
print(f"\nThe best model is already fitted and ready to use!")

In [None]:
# Analyze grid search results
results_df = pd.DataFrame(grid_search.cv_results_)

# Select relevant columns
relevant_cols = ['param_n_estimators', 'param_max_depth', 'param_min_samples_split',
                'mean_test_score', 'std_test_score', 'rank_test_score']
results_summary = results_df[relevant_cols].sort_values('rank_test_score')

print("Top 10 hyperparameter combinations:\n")
print(results_summary.head(10).to_string(index=False))

print("\nKey Insights:")
print(f"- Best score: {results_summary['mean_test_score'].max():.3f}")
print(f"- Worst score: {results_summary['mean_test_score'].min():.3f}")
print(f"- Score range: {results_summary['mean_test_score'].max() - results_summary['mean_test_score'].min():.3f}")

In [None]:
# Visualize hyperparameter impact
# Focus on two hyperparameters for visualization
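# Cast max_depth to str so that None becomes the label 'None'; otherwise
# pivot_table treats it as a missing value and drops that row
pivot_data = results_df.assign(
    param_max_depth=results_df['param_max_depth'].astype(str)
).pivot_table(
    values='mean_test_score',
    index='param_max_depth',
    columns='param_n_estimators',
    aggfunc='mean'
)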

plt.figure(figsize=(10, 7))
sns.heatmap(pivot_data, annot=True, fmt='.3f', cmap='YlGnBu', 
           cbar_kws={'label': 'Mean CV Score'})
plt.xlabel('n_estimators', fontsize=12)
plt.ylabel('max_depth', fontsize=12)
plt.title('GridSearchCV: Hyperparameter Impact on Performance', 
         fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print("Interpretation: Darker colors = better performance")

## 6. RandomizedSearchCV - Faster Alternative

### The Problem with Grid Search

**Large grids are slow!**

Example:
- 5 parameters
- 10 values each
- Total combinations: 10^5 = 100,000
- With 5-fold CV: 500,000 fits!

### The Solution: Randomized Search

**Key Idea**: Don't test everything - sample randomly!

**Process**:
1. Define parameter distributions (not fixed grids)
2. Randomly sample N combinations
3. Evaluate each with cross-validation
4. Select best combination

**Pros**: Much faster, often finds good solutions  
**Cons**: Not guaranteed to find absolute best

**Rule of Thumb**: Use RandomizedSearchCV first to narrow down range, then GridSearchCV for fine-tuning
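
Sampling from distributions can be previewed with `ParameterSampler`, which draws candidates from a dictionary of lists and distributions like the one a randomized search takes. The sketch below uses an illustrative, hypothetical set of distributions (`dists`) and draws five candidates without fitting any model.

```python
# Sketch: draw a few random hyperparameter combinations
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

dists = {
    'n_estimators': randint(50, 300),   # integers in [50, 300), upper bound exclusive
    'max_features': uniform(0.3, 0.7),  # floats in [0.3, 1.0]: uniform(loc, scale)
    'max_depth': [3, 5, None],          # lists are sampled uniformly
}

samples = list(ParameterSampler(dists, n_iter=5, random_state=42))
for s in samples:
    print(s)
```

The number of candidates is set by `n_iter`, not by the size of the search space, which is what keeps the cost predictable.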

In [None]:
# RandomizedSearchCV example
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define parameter distributions
param_distributions = {
    'n_estimators': randint(50, 300),  # Random integers from 50 to 299 (upper bound is exclusive)
    'max_depth': [3, 5, 7, 10, None],  # Can still use lists
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.3, 0.7)  # Random float from 0.3 to 1.0
}

print("Parameter distributions:")
for param, dist in param_distributions.items():
    print(f"  {param}: {dist}")

n_iterations = 50
print(f"\nWill try {n_iterations} random combinations")
print(f"With 5-fold CV: {n_iterations * 5} model fits (a fixed budget, however large the search space is)")
print("\nSearching...")

In [None]:
# Perform randomized search
rf_random = RandomForestClassifier(random_state=42)

random_search = RandomizedSearchCV(
    estimator=rf_random,
    param_distributions=param_distributions,
    n_iter=50,  # Number of random combinations to try
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=0
)

random_search.fit(X_scaled, y)

print("Randomized Search Complete!\n")
print("Best hyperparameters:")
for param, value in random_search.best_params_.items():
    if isinstance(value, float):
        print(f"  {param}: {value:.3f}")
    else:
        print(f"  {param}: {value}")

print(f"\nBest cross-validation score: {random_search.best_score_:.3f}")

In [None]:
# Compare GridSearchCV vs RandomizedSearchCV
comparison = pd.DataFrame({
    'Method': ['GridSearchCV', 'RandomizedSearchCV'],
    'Best Score': [grid_search.best_score_, random_search.best_score_],
    'Combinations Tried': [len(grid_search.cv_results_['params']), 
                          len(random_search.cv_results_['params'])]
})

print("Comparison of Search Methods:\n")
print(comparison.to_string(index=False))

print("\nConclusion:")
print(f"- RandomizedSearch tried {comparison.iloc[1, 2]} combinations")
print(f"- GridSearch tried {comparison.iloc[0, 2]} combinations")
print(f"- Fit-count ratio (Randomized / Grid): {comparison.iloc[1, 2] / comparison.iloc[0, 2]:.2f}x")
print(f"- Performance difference: {abs(comparison.iloc[0, 1] - comparison.iloc[1, 1]):.4f}")
print("\n→ RandomizedSearch explored 5 hyperparameters (some continuous) on a fixed budget;")
print("  an exhaustive grid over that same space would need far more fits.")

## 7. Best Practices and Common Pitfalls

### Best Practices

1. **Always use cross-validation** for model evaluation
   - Never tune on test set!
   - Use cross-validation on training data only

2. **Use Stratified K-Fold for classification**
   - Maintains class proportions
   - More reliable estimates

3. **Choose appropriate K**
   - K=5: Good balance (common default)
   - K=10: More reliable but slower
   - Small datasets: Use larger K

4. **Hyperparameter tuning strategy**:
   - Start with RandomizedSearchCV (broad search)
   - Narrow down promising ranges
   - Use GridSearchCV for fine-tuning

5. **Scale features inside CV, not before it**
   - Fit scaler on training folds only
   - Use Pipeline to avoid data leakage

### Common Pitfalls

❌ **Data Leakage**: Fitting preprocessor on entire dataset
```python
# WRONG - leakage!
X_scaled = scaler.fit_transform(X)  # Uses test data!
cross_val_score(model, X_scaled, y, cv=5)

# CORRECT - use Pipeline
from sklearn.pipeline import Pipeline
pipe = Pipeline([('scaler', StandardScaler()), ('model', model)])
cross_val_score(pipe, X, y, cv=5)
```

❌ **Overfitting on CV**: Trying too many configurations
- More combinations = higher chance of luck
- Always evaluate best model on held-out test set

❌ **Wrong metric**: Using accuracy on imbalanced data
- Specify the `scoring` parameter: `scoring='f1'` or `scoring='roc_auc'` for binary targets, `scoring='f1_macro'` for multiclass targets

❌ **Not using random_state**: Results not reproducible
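
Several of these pitfalls can be avoided in one workflow: hold out a test set first, tune with cross-validation on the remaining data only, and score the held-out set once at the end. The sketch below is one possible arrangement, not the only one. It assumes the bundled iris data, a hypothetical small grid, and `f1_macro` as an illustrative multiclass metric.

```python
# Sketch: tune on the development split only, then score the hold-out once
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# 1. Set the hold-out aside before any tuning happens
X_dev, X_hold, y_dev, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# 2. Preprocessing lives inside the pipeline, so it is refit on each training fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('tree', DecisionTreeClassifier(random_state=42)),
])

# 3. Cross-validated search sees the development split only
search = GridSearchCV(
    pipe,
    {'tree__max_depth': [2, 3, 5]},  # hypothetical small grid
    cv=5,
    scoring='f1_macro',
)
search.fit(X_dev, y_dev)

# 4. A single final check on data the search never touched
hold_score = search.score(X_hold, y_hold)
print(f"Best CV f1_macro:  {search.best_score_:.3f}")
print(f"Hold-out f1_macro: {hold_score:.3f}")
```

If the hold-out score is clearly below the best CV score, that may indicate the search has overfit to the cross-validation folds.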

In [None]:
# Demonstrate correct way: Pipeline to prevent data leakage
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Scaling step
    ('classifier', SVC(random_state=42))  # Model step
])

# Define parameter grid for pipeline
param_grid_pipeline = {
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['linear', 'rbf'],
    'classifier__gamma': ['scale', 'auto']  # ignored by the linear kernel, so those fits are redundant
}

# Grid search on pipeline
grid_pipeline = GridSearchCV(
    pipeline, 
    param_grid_pipeline, 
    cv=5, 
    n_jobs=-1
)

# Fit on RAW data (pipeline handles scaling internally for each fold)
grid_pipeline.fit(X, y)  # Note: X not X_scaled!

print("✓ Pipeline approach prevents data leakage!")
print(f"\nBest parameters: {grid_pipeline.best_params_}")
print(f"Best CV score: {grid_pipeline.best_score_:.3f}")
print("\nBenefit: Scaler is fit on training folds only (no test data leakage)")

## Exercises

### Exercise 1: Manual K-Fold Implementation

Implement 5-fold cross-validation manually:
1. Use `KFold` to generate train/test indices
2. For each fold, train a LogisticRegression model
3. Calculate and store accuracy for each fold
4. Compute mean and standard deviation
5. Compare with `cross_val_score` result

Use wine dataset.

In [None]:
# Your code here


### Exercise 2: Optimal K Selection

Determine the optimal value of K for cross-validation:
1. Try K values from 3 to 10
2. For each K, perform cross-validation with LogisticRegression
3. Record mean score and standard deviation
4. Plot mean score and std vs K
5. Which K gives the best balance of performance and stability?

Use iris dataset.

In [None]:
# Your code here


### Exercise 3: Custom Grid Search

Perform grid search for DecisionTreeClassifier on wine dataset:
1. Define parameter grid:
   - `max_depth`: [2, 4, 6, 8, 10]
   - `min_samples_split`: [2, 5, 10, 20]
   - `criterion`: ['gini', 'entropy']
2. Use GridSearchCV with 5-fold CV
3. Print best parameters and score
4. Create a visualization comparing top 5 combinations
5. Test best model on held-out test set

In [None]:
# Your code here


### Exercise 4: RandomizedSearchCV Exploration

Compare different numbers of iterations in RandomizedSearchCV:
1. Use RandomForestClassifier on iris dataset
2. Try n_iter values: [10, 25, 50, 100]
3. For each n_iter:
   - Run RandomizedSearchCV
   - Record best score and time taken
4. Plot best score vs n_iter
5. What's the point of diminishing returns?

In [None]:
# Your code here


## Summary

### Key Concepts

1. **Cross-Validation**:
   - More reliable than single train-test split
   - Tests model on multiple different data splits
   - Provides mean ± std as a measure of spread across folds (not a formal confidence interval)
   - **K-Fold**: Split data into K equal parts
   - **Stratified K-Fold**: Maintains class proportions (use for classification!)

2. **Hyperparameter Tuning**:
   - Hyperparameters are set before training (not learned)
   - Default values are rarely optimal
   - Must tune on training data only (never test data)

3. **GridSearchCV**:
   - Exhaustive search over parameter grid
   - Tests every combination
   - Guaranteed to find best in grid
   - Can be slow for large grids

4. **RandomizedSearchCV**:
   - Samples random combinations
   - Much faster than grid search
   - Often finds near-optimal solutions
   - Good for initial broad search

5. **Best Practices**:
   - Use Stratified K-Fold for classification
   - Use Pipeline to prevent data leakage
   - Start with RandomizedSearchCV, then GridSearchCV
   - Always validate final model on held-out test set
   - Set random_state for reproducibility

6. **Common Pitfalls to Avoid**:
   - ❌ Fitting preprocessors on full dataset (data leakage)
   - ❌ Tuning on test set
   - ❌ Using regular K-Fold for classification (use Stratified)
   - ❌ Not using appropriate scoring metric

### What's Next?

In **Module 08: Regularization (L1, L2, Elastic Net)**, you'll learn:
- Understanding overfitting vs underfitting
- Ridge regression (L2 regularization)
- Lasso regression (L1 regularization and feature selection)
- Elastic Net (combining L1 and L2)
- Bias-variance tradeoff
- Choosing regularization strength

### Additional Resources

- [Cross-Validation Explained - StatQuest](https://www.youtube.com/watch?v=fSytzGwwBVw)
- [Hyperparameter Tuning - Andrew Ng](https://www.coursera.org/lecture/machine-learning/model-selection-and-train-validation-test-sets-QGKbr)
- [scikit-learn Cross-Validation Guide](https://scikit-learn.org/stable/modules/cross_validation.html)
- [Grid Search vs Random Search](https://scikit-learn.org/stable/modules/grid_search.html)