# Module 13: Model Selection & Hyperparameter Tuning

**Estimated Time**: 75 minutes

## Learning Objectives

By the end of this module, you will:
- **Understand** how to choose the right model for your problem
- **Master** cross-validation techniques to prevent overfitting
- **Apply** Grid Search and Random Search for hyperparameter tuning
- **Use** Bayesian Optimization for efficient parameter search
- **Build** robust model comparison frameworks
- **Interpret** learning curves to diagnose model performance
- **Implement** advanced validation strategies

## Prerequisites

- Modules 00-12 completed
- Understanding of ML algorithms (classification, regression)
- Familiarity with scikit-learn

## What is Model Selection & Hyperparameter Tuning?

Every ML algorithm has two types of parameters:

1. **Model Parameters** - Learned from data (e.g., weights in linear regression)
2. **Hyperparameters** - Set before training (e.g., learning rate, tree depth)

**Model selection** = Choosing the right algorithm  
**Hyperparameter tuning** = Optimizing the algorithm's settings
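
A minimal sketch of the distinction, assuming a synthetic dataset from `make_classification` (used only for illustration, not the module's loan data). Hyperparameters are passed to the constructor; model parameters such as `coef_` only exist after `fit`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used here only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Hyperparameters: chosen by us before training
clf = LogisticRegression(C=0.5, max_iter=500)
hyperparams = clf.get_params()
print("C =", hyperparams["C"], "| max_iter =", hyperparams["max_iter"])

# Model parameters: learned from the data during fit()
clf.fit(X_demo, y_demo)
print("Learned weights shape:", clf.coef_.shape)
print("Learned intercept shape:", clf.intercept_.shape)
```

Tuning means searching over values like `C`; the weights in `coef_` are then re-learned for each candidate setting.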

### Why It Matters

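- Default hyperparameters are rarely the best choice for a given dataset
- Tuning can improve accuracy noticeably (gains of **5-20%** are possible, though the size depends heavily on the model and data)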
- Critical for winning Kaggle competitions
- Essential for production ML systems

Let's master these skills!

---

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    GridSearchCV,
    RandomizedSearchCV,
    learning_curve,
    validation_curve,
    KFold,
    StratifiedKFold,
)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import time
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

# Set style
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)

print("✓ All libraries loaded successfully!")
print("✓ Ready for model selection and tuning!")

## 1. Model Selection Strategies

Choosing the right algorithm is crucial. Different algorithms have different strengths and weaknesses.

### Common ML Algorithms & When to Use Them

| Algorithm | Best For | Pros | Cons |
|-----------|----------|------|------|
| **Logistic Regression** | Binary classification, baseline | Fast, interpretable, little tuning | Assumes a linear decision boundary |
| **Decision Trees** | Non-linear patterns, interpretability | Easy to understand | Overfits easily |
| **Random Forest** | General purpose, robust | Handles non-linearity, robust | Black box, slower |
| **SVM** | Small datasets, high dimensions | Effective in high dims | Slow on large data |
| **KNN** | Small datasets, simple patterns | Simple, no training | Slow prediction, sensitive to scale |
| **XGBoost/LightGBM** | Kaggle, tabular data | State-of-the-art performance | Requires tuning |
| **Neural Networks** | Images, text, complex patterns | Very flexible | Needs lots of data, hard to tune |

### Decision Framework

```
Start here:
│
├─ Linearly separable? → Logistic Regression / SVM
│
├─ Small dataset (<1000 samples)? → SVM / KNN
│
├─ Need interpretability? → Decision Tree / Logistic Regression
│
├─ Tabular data? → Random Forest / XGBoost
│
├─ Images/Text? → Neural Networks (CNNs/RNNs)
│
└─ Not sure? → Try Random Forest (good baseline)
```

### The No Free Lunch Theorem

> "No single algorithm works best for all problems."

**Implication**: Always try multiple algorithms and compare!

Let's load data and compare different models.

In [None]:
# Model Selection - Compare Multiple Algorithms

print("=" * 60)
print("COMPARING MULTIPLE ML ALGORITHMS")
print("=" * 60)

# Load data
df = pd.read_csv("../../data_advanced/feature_engineering.csv")

# Prepare features
from sklearn.preprocessing import LabelEncoder

le_city = LabelEncoder()
le_job = LabelEncoder()

df["city_encoded"] = le_city.fit_transform(df["city"])
df["job_encoded"] = le_job.fit_transform(df["job_category"])

# Select features
features = [
    "age",
    "income",
    "education_years",
    "experience_years",
    "num_dependents",
    "city_encoded",
    "job_encoded",
]
X = df[features]
y = df["loan_approved"]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features (important for some algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\nDataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Train set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Class distribution: {y.value_counts().to_dict()}")

# Define models to compare
models = {
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42, n_estimators=100),
    "SVM": SVC(random_state=42),
    "KNN": KNeighborsClassifier(),
}

# Compare models
results = []
print("\n" + "=" * 60)
print("TRAINING AND EVALUATING MODELS")
print("=" * 60)

for name, model in models.items():
    # Time the training
    start_time = time.time()

    # Train model (use scaled data for all)
    model.fit(X_train_scaled, y_train)
    training_time = time.time() - start_time  # stop the clock before predicting

    # Predictions
    y_pred_train = model.predict(X_train_scaled)
    y_pred_test = model.predict(X_test_scaled)

    # Metrics
    train_acc = accuracy_score(y_train, y_pred_train)
    test_acc = accuracy_score(y_test, y_pred_test)

    results.append(
        {
            "Model": name,
            "Train Accuracy": train_acc,
            "Test Accuracy": test_acc,
            "Overfitting": train_acc - test_acc,
            "Time (s)": training_time,
        }
    )

    print(f"\n{name}:")
    print(f"  Train Accuracy: {train_acc:.4f}")
    print(f"  Test Accuracy:  {test_acc:.4f}")
    print(f"  Overfitting:    {train_acc - test_acc:.4f}")
    print(f"  Training Time:  {training_time:.4f}s")

# Create results DataFrame
results_df = pd.DataFrame(results).sort_values("Test Accuracy", ascending=False)

print("\n" + "=" * 60)
print("SUMMARY COMPARISON")
print("=" * 60)
display(results_df)

# Visualize comparison
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Train vs Test Accuracy
x = np.arange(len(results_df))
width = 0.35

axes[0, 0].bar(
    x - width / 2, results_df["Train Accuracy"], width, label="Train", alpha=0.8, color="skyblue"
)
axes[0, 0].bar(
    x + width / 2, results_df["Test Accuracy"], width, label="Test", alpha=0.8, color="coral"
)
axes[0, 0].set_xlabel("Model")
axes[0, 0].set_ylabel("Accuracy")
axes[0, 0].set_title("Train vs Test Accuracy", fontweight="bold")
axes[0, 0].set_xticks(x)
axes[0, 0].set_xticklabels(results_df["Model"], rotation=45, ha="right")
axes[0, 0].legend()
axes[0, 0].grid(axis="y", alpha=0.3)

# 2. Overfitting Analysis
colors = ["red" if x > 0.05 else "green" for x in results_df["Overfitting"]]
axes[0, 1].barh(results_df["Model"], results_df["Overfitting"], color=colors, alpha=0.7)
axes[0, 1].axvline(x=0.05, color="orange", linestyle="--", label="5% threshold")
axes[0, 1].set_xlabel("Overfitting (Train - Test)")
axes[0, 1].set_title("Overfitting Analysis", fontweight="bold")
axes[0, 1].legend()
axes[0, 1].grid(axis="x", alpha=0.3)

# 3. Training Time
axes[1, 0].bar(
    results_df["Model"], results_df["Time (s)"], color="lightgreen", alpha=0.8, edgecolor="black"
)
axes[1, 0].set_xlabel("Model")
axes[1, 0].set_ylabel("Time (seconds)")
axes[1, 0].set_title("Training Time Comparison", fontweight="bold")
axes[1, 0].tick_params(axis="x", rotation=45)
plt.setp(axes[1, 0].xaxis.get_majorticklabels(), rotation=45, ha="right")
axes[1, 0].grid(axis="y", alpha=0.3)

# 4. Test Accuracy Ranking
sorted_by_acc = results_df.sort_values("Test Accuracy")
colors_gradient = plt.cm.RdYlGn(sorted_by_acc["Test Accuracy"])
axes[1, 1].barh(
    sorted_by_acc["Model"], sorted_by_acc["Test Accuracy"], color=colors_gradient, alpha=0.8
)
axes[1, 1].set_xlabel("Test Accuracy")
axes[1, 1].set_title("Model Ranking by Test Accuracy", fontweight="bold")
axes[1, 1].grid(axis="x", alpha=0.3)

for i, (model, acc) in enumerate(zip(sorted_by_acc["Model"], sorted_by_acc["Test Accuracy"])):
    axes[1, 1].text(acc, i, f" {acc:.4f}", va="center")

plt.tight_layout()
plt.show()

# Winner
best_model = results_df.iloc[0]
print("\n🏆 Best Model:", best_model["Model"])
print(f"   Test Accuracy: {best_model['Test Accuracy']:.4f}")
print(f"   Overfitting: {best_model['Overfitting']:.4f}")
print("\n✓ Model comparison complete!")

## 2. Cross-Validation Techniques

A single train/test split can be misleading. Cross-validation provides more robust evaluation.

### Why Cross-Validation?

**Problem with single split:**
- Results depend on which samples are in train vs test
- Could get lucky (or unlucky) with the split
- Wastes data (can't use test set for training)

**Cross-validation solution:**
- Use all data for both training and testing
- Average performance across multiple splits
- More reliable performance estimate
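
To make the "lucky or unlucky split" point concrete, here is a small sketch on synthetic data (an assumption for illustration; `X_demo` and `y_demo` are not part of the module's dataset). It refits the same model on ten different random splits, then contrasts the spread with a single 5-fold CV run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data, used here only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Same model, same data: only the random split changes
split_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed, stratify=y_demo
    )
    split_scores.append(clf.fit(X_tr, y_tr).score(X_te, y_te))
split_scores = np.array(split_scores)

print(f"Single-split accuracy ranges from {split_scores.min():.3f} to {split_scores.max():.3f}")

# Cross-validation averages over several splits in one call
cv_scores = cross_val_score(clf, X_demo, y_demo, cv=5)
print(f"5-fold CV accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

Any one of the ten single-split numbers could have been the score you reported; the CV mean is less sensitive to that choice.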

### Common CV Methods

1. **K-Fold Cross-Validation**
   - Split data into K folds
   - Train on K-1 folds, test on 1 fold
   - Repeat K times
   - Average the results

2. **Stratified K-Fold**
   - Like K-Fold but preserves class distribution
   - **Use this for classification!**

3. **Leave-One-Out (LOO)**
   - K = number of samples
   - Expensive but uses maximum data

4. **Time Series Split**
   - Respects temporal order
   - Use for time series data
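
The difference between K-Fold and Stratified K-Fold is easiest to see on imbalanced labels. The sketch below uses a hypothetical toy array (80 samples of class 0 followed by 20 of class 1) and counts how many class-1 samples land in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels: 80 samples of class 0 followed by 20 of class 1 (imbalanced)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

splitters = {
    "K-Fold": KFold(n_splits=5),
    "Stratified K-Fold": StratifiedKFold(n_splits=5),
}

positives_per_fold = {}
for name, splitter in splitters.items():
    counts = []
    for train_idx, test_idx in splitter.split(X_toy, y_toy):
        # Each sample appears in exactly one test fold
        counts.append(int(y_toy[test_idx].sum()))
    positives_per_fold[name] = counts
    print(f"{name}: class-1 samples per test fold = {counts}")
```

Without shuffling, plain K-Fold puts all 20 positives in the last fold (`[0, 0, 0, 0, 20]`), while Stratified K-Fold spreads them evenly (`[4, 4, 4, 4, 4]`).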

### Typical K Values

- K=5: Fast, good for large datasets
- K=10: Standard choice, good balance
- K=number of samples (Leave-One-Out): uses the most training data per fit, but is slow and the estimate can have high variance

Let's compare single split vs cross-validation!

In [None]:
# Cross-Validation in Action

print("=" * 60)
print("CROSS-VALIDATION DEMONSTRATION")
print("=" * 60)

# Use a Random Forest (swap in whichever model ranked best in your comparison)
model = RandomForestClassifier(random_state=42, n_estimators=100)

# Method 1: Single train/test split (what we did before)
print("\nMethod 1: Single Train/Test Split")
print("-" * 60)

model.fit(X_train_scaled, y_train)
single_score = model.score(X_test_scaled, y_test)
print(f"Accuracy: {single_score:.4f}")

# Method 2: 5-Fold Cross-Validation
print("\nMethod 2: 5-Fold Cross-Validation")
print("-" * 60)

cv_scores_5 = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring="accuracy")
print(f"Fold scores: {cv_scores_5}")
print(f"Mean CV Accuracy: {cv_scores_5.mean():.4f} (+/- {cv_scores_5.std() * 2:.4f})")

# Method 3: 10-Fold Cross-Validation
print("\nMethod 3: 10-Fold Cross-Validation")
print("-" * 60)

cv_scores_10 = cross_val_score(model, X_train_scaled, y_train, cv=10, scoring="accuracy")
print(f"Fold scores: {cv_scores_10}")
print(f"Mean CV Accuracy: {cv_scores_10.mean():.4f} (+/- {cv_scores_10.std() * 2:.4f})")

# Method 4: Explicit Stratified K-Fold with shuffling
# (note: for classifiers, cv=5 above already uses stratified folds, just without shuffling)
print("\nMethod 4: Stratified 5-Fold Cross-Validation")
print("-" * 60)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores_stratified = cross_val_score(model, X_train_scaled, y_train, cv=skf, scoring="accuracy")
print(f"Fold scores: {cv_scores_stratified}")
print(
    f"Mean CV Accuracy: {cv_scores_stratified.mean():.4f} (+/- {cv_scores_stratified.std() * 2:.4f})"
)

# Visualize CV results
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# 1. Box plot of CV scores
cv_data = {
    "5-Fold": cv_scores_5,
    "10-Fold": cv_scores_10,
    "Stratified\n5-Fold": cv_scores_stratified,
}

axes[0].boxplot(list(cv_data.values()))
axes[0].set_xticklabels(list(cv_data.keys()))
axes[0].axhline(y=single_score, color="red", linestyle="--", label="Single Split")
axes[0].set_ylabel("Accuracy")
axes[0].set_title("Cross-Validation Score Distribution", fontweight="bold")
axes[0].legend()
axes[0].grid(axis="y", alpha=0.3)

# 2. Fold-by-fold comparison
x = np.arange(5)
width = 0.25

axes[1].bar(x - width, cv_scores_5, width, label="5-Fold", alpha=0.8)
axes[1].bar(x, cv_scores_stratified, width, label="Stratified 5-Fold", alpha=0.8)
axes[1].bar(x + width, cv_scores_10[:5], width, label="10-Fold (first 5)", alpha=0.8)

axes[1].axhline(y=single_score, color="red", linestyle="--", label="Single Split", linewidth=2)
axes[1].set_xlabel("Fold Number")
axes[1].set_ylabel("Accuracy")
axes[1].set_title("Fold-by-Fold Accuracy Comparison", fontweight="bold")
axes[1].set_xticks(x)
axes[1].set_xticklabels([f"Fold {i+1}" for i in range(5)])
axes[1].legend()
axes[1].grid(axis="y", alpha=0.3)

plt.tight_layout()
plt.show()

# Summary comparison
print("\n" + "=" * 60)
print("COMPARISON SUMMARY")
print("=" * 60)

comparison_df = pd.DataFrame(
    {
        "Method": ["Single Split", "5-Fold CV", "10-Fold CV", "Stratified 5-Fold CV"],
        "Mean Accuracy": [
            single_score,
            cv_scores_5.mean(),
            cv_scores_10.mean(),
            cv_scores_stratified.mean(),
        ],
        "Std Dev": [0, cv_scores_5.std(), cv_scores_10.std(), cv_scores_stratified.std()],
        "+/- 2 Std": [
            0,
            cv_scores_5.std() * 2,
            cv_scores_10.std() * 2,
            cv_scores_stratified.std() * 2,
        ],
    }
)

display(comparison_df)

print("\n💡 Key Insights:")
print("   • CV gives a spread estimate (mean +/- 2*std): a rough guide, not a formal confidence interval")
print("   • More folds = more training data per fit, but more fits (slower)")
print("   • Stratified K-Fold preserves class balance")
print("   • Single split can be misleading!")
print("\n✓ Cross-validation demonstrated!")

## 3. Grid Search

Grid Search exhaustively tries all combinations of hyperparameters.

### How It Works

1. Define a grid of hyperparameter values
2. Try every possible combination  
3. Evaluate each using cross-validation
4. Return the best combination

### Pros & Cons

**Pros:**
- Guaranteed to find best in grid
- Easy to implement
- Parallelizable

**Cons:**
- Cost grows exponentially with the number of hyperparameters
- Wastes time on bad combos
- Limited to the discrete values you list in the grid

### Example

With 3 params (3, 4, and 5 candidate values) and 5-fold CV: **3 × 4 × 5 = 60 combinations × 5 folds = 300 fits!**
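
The arithmetic above can be checked with scikit-learn's `ParameterGrid`, which enumerates every combination in a grid. The grid values below are hypothetical, chosen only to give 3, 4, and 5 options per hyperparameter.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid with 3, 4, and 5 values per hyperparameter
example_grid = {
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10, 20],
    "n_estimators": [50, 100, 150, 200, 250],
}

n_combinations = len(ParameterGrid(example_grid))
n_folds = 5
n_fits = n_combinations * n_folds

print(f"{n_combinations} combinations x {n_folds} folds = {n_fits} fits")
```

Counting fits this way before launching a search is a cheap sanity check. Note that `GridSearchCV` with its default `refit=True` trains one additional model on the full training set at the end.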

Let's tune a Random Forest!

In [None]:
# Grid Search Implementation

print("=" * 60)
print("GRID SEARCH HYPERPARAMETER TUNING")
print("=" * 60)

# Smaller grid for demonstration
param_grid = {"n_estimators": [50, 100], "max_depth": [5, 10, None], "min_samples_split": [2, 5]}

print("\nParameter Grid:")
for param, values in param_grid.items():
    print(f"  {param}: {values}")

total_combos = 2 * 3 * 2
print(f"\nCombinations: {total_combos}")
print(f"With 5-fold CV: {total_combos * 5} fits")

# Grid Search
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
    estimator=rf, param_grid=param_grid, cv=5, scoring="accuracy", n_jobs=-1, verbose=0
)

print("\nRunning Grid Search...")
start = time.time()
grid_search.fit(X_train_scaled, y_train)
elapsed = time.time() - start

print(f"\n✓ Complete in {elapsed:.2f}s")
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Test score: {grid_search.score(X_test_scaled, y_test):.4f}")

# Show all results
# Use a separate name so the Section 1 comparison table is not overwritten
grid_results_df = pd.DataFrame(grid_search.cv_results_)
summary = grid_results_df[["params", "mean_test_score", "rank_test_score"]].sort_values(
    "rank_test_score"
)
print("\nAll combinations ranked:")
display(summary.head(10))

print("\n✓ Grid Search complete!")

## 4. Random Search

Random Search samples random combinations instead of trying all.

### Grid Search vs Random Search

| Aspect | Grid Search | Random Search |
|--------|-------------|---------------|
| Coverage | All combinations | Random sample |
| Cost | Grows with grid size | Fixed by `n_iter` |
| Best for | Small grids | Large search spaces |
| Guarantee | Finds best in grid | May miss best |

### Why Random Search Works

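- Often only a few hyperparameters strongly affect performance
- Random sampling tries more distinct values of each important hyperparameter than a grid with the same budget
- Often finds good solutions with fewer fits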

### When to Use

- **Grid Search**: ≤3 parameters, small grids
- **Random Search**: Many parameters, large ranges
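
To see what "sampling random combinations" looks like, the sketch below draws a handful of candidates with `ParameterSampler`. The search space `demo_space` is a hypothetical example; `RandomizedSearchCV` draws its candidates in a similar way.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Distributions for integer ranges, lists for categorical choices
demo_space = {
    "n_estimators": randint(50, 200),  # integers 50..199 (upper bound excluded)
    "max_depth": [5, 10, 15, 20, None],
    "min_samples_split": randint(2, 11),  # integers 2..10
}

# Draw 5 candidate settings from the space
samples = list(ParameterSampler(demo_space, n_iter=5, random_state=42))
for params in samples:
    print(params)
```

Each draw picks every hyperparameter independently, so five candidates can cover up to five different values of `n_estimators`, whereas a five-point grid over three hyperparameters could not.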

Let's compare!

In [None]:
# Random Search Implementation

from scipy.stats import randint

print("=" * 60)
print("RANDOM SEARCH vs GRID SEARCH")
print("=" * 60)

# Random Search parameter distributions
param_dist = {
    "n_estimators": randint(50, 200),
    "max_depth": [5, 10, 15, 20, None],
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 5),
}

# Random Search
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,  # Try 20 random combinations
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
    random_state=42,
    verbose=0,
)

print(f"\nRandom Search: {20} random combinations")
print(f"Grid Search over the same ranges would try: {150*5*9*4} combinations!\n")

start = time.time()
random_search.fit(X_train_scaled, y_train)
rs_time = time.time() - start

print(f"Random Search time: {rs_time:.2f}s")
print(f"Best params: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.4f}")
print(f"Test score: {random_search.score(X_test_scaled, y_test):.4f}")

print(f"\n✓ Random Search: {20*5} fits in {rs_time:.2f}s")
print(f"✓ Grid Search: {total_combos*5} fits in {elapsed:.2f}s")
print(f"✓ Time ratio (grid / random): {elapsed/rs_time:.1f}x")
print("\n✓ Random Search complete!")

## 5. Bayesian Optimization with Optuna

Bayesian Optimization uses past trials to inform future searches - smarter than random!

### How It Works

1. Try a few random combinations
2. Build a model of performance
3. Use model to pick promising next trial
4. Repeat

### Advantages

- Often needs fewer trials than Grid/Random Search to reach a good result
- Explores intelligently
- Works with continuous parameters

### When to Use

- Expensive models (deep learning)
- Many hyperparameters
- Limited time budget

**Note**: Requires `optuna` library. Install with: `pip install optuna`

For now, Grid Search and Random Search are sufficient for most tabular data problems!
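
The four-step loop described above can be sketched without extra dependencies. This is an illustrative toy, not Optuna's implementation: it assumes a Gaussian Process surrogate with an expected-improvement acquisition function, whereas Optuna's default sampler is TPE. The `objective` function is a hypothetical stand-in for a validation score that depends on one hyperparameter.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def objective(x):
    """Hypothetical validation score as a function of one hyperparameter."""
    return -(x - 2.0) ** 2 + 4.0  # maximum of 4.0 at x = 2.0


def expected_improvement(candidates, gp, best_so_far, xi=0.01):
    """Expected improvement of each candidate over the best score seen so far."""
    mu, sigma = gp.predict(candidates.reshape(-1, 1), return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # avoid division by zero
    improvement = mu - best_so_far - xi
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)


rng = np.random.default_rng(0)
low, high = 0.0, 5.0
candidates = np.linspace(low, high, 501)

# Step 1: try a few random settings
X_tried = rng.uniform(low, high, size=3)
y_tried = objective(X_tried)

for _ in range(10):
    # Step 2: fit a surrogate model of score vs. hyperparameter
    gp = GaussianProcessRegressor(
        kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True, random_state=0
    )
    gp.fit(X_tried.reshape(-1, 1), y_tried)

    # Step 3: pick the candidate with the highest expected improvement
    ei = expected_improvement(candidates, gp, y_tried.max())
    x_next = candidates[np.argmax(ei)]

    # Step 4: evaluate it and add the result to the history
    X_tried = np.append(X_tried, x_next)
    y_tried = np.append(y_tried, objective(x_next))

best_x = X_tried[np.argmax(y_tried)]
best_y = y_tried.max()
print(f"Best x found: {best_x:.3f} with score {best_y:.3f} after {len(X_tried)} trials")
```

In practice the objective would train a model and return a cross-validated score, and a library such as Optuna would manage the sampling loop for you.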

In [None]:
# Bayesian Optimization - Conceptual Example

print("=" * 60)
print("BAYESIAN OPTIMIZATION (Conceptual)")
print("=" * 60)

print("\nBayesian Optimization Workflow:")
print("1. Start with random trials")
print("2. Fit surrogate model (e.g., Gaussian Process; Optuna defaults to TPE)")
print("3. Use acquisition function to pick next trial")
print("4. Evaluate and update model")
print("5. Repeat until budget exhausted\n")

print("Example libraries:")
print("  • Optuna (recommended)")
print("  • Hyperopt")
print("  • Scikit-Optimize")

print("\n💡 For deep learning, Bayesian optimization can save days of tuning!")
print("✓ Concept explained!")

## 6. Model Comparison Framework

Build a systematic framework to compare multiple models fairly.

### Best Practices

1. **Use same train/test split** for all models
2. **Use cross-validation** for robust estimates
3. **Track multiple metrics** (accuracy, precision, recall, F1)
4. **Measure training time**
5. **Test on unseen data**

### Comparison Checklist

✓ Baseline model (simplest)  
✓ Multiple algorithms  
✓ Default hyperparameters first  
✓ Tune best performing models  
✓ Statistical significance tests  
✓ Document all results  

We've already built this framework in Section 1!
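
Practices 1 to 4 can be combined in one loop with `cross_validate`, which accepts several metrics and records fit time. The sketch below is one possible shape for such a framework; the synthetic data and the two candidate models are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic data, used here only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# One shared splitter so every model is scored on identical folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "precision", "recall", "f1"]

candidates = {
    "Logistic Regression (baseline)": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

rows = []
for name, estimator in candidates.items():
    scores = cross_validate(estimator, X_demo, y_demo, cv=cv, scoring=metrics)
    row = {"Model": name, "Fit time (s)": scores["fit_time"].mean()}
    for metric in metrics:
        row[metric] = scores[f"test_{metric}"].mean()
    rows.append(row)

comparison = pd.DataFrame(rows).set_index("Model")
print(comparison.round(3))
```

Passing the same `cv` object to every model keeps the comparison fair, since differences in score then come from the models rather than from different folds.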

In [None]:
# Model Comparison Summary

print("=" * 60)
print("COMPLETE MODEL COMPARISON SUMMARY")
print("=" * 60)

# Compare: Baseline vs Tuned
baseline_rf = RandomForestClassifier(random_state=42)
tuned_rf = grid_search.best_estimator_

models_final = {
    "Baseline Random Forest": baseline_rf,
    "Grid Search Tuned RF": tuned_rf,
    "Random Search Tuned RF": random_search.best_estimator_,
}

print("\nFinal Comparison:")
for name, model in models_final.items():
    if "Baseline" in name:
        model.fit(X_train_scaled, y_train)
    cv_score = cross_val_score(model, X_train_scaled, y_train, cv=5).mean()
    test_score = model.score(X_test_scaled, y_test)
    print(f"\n{name}:")
    print(f"  CV Score: {cv_score:.4f}")
    print(f"  Test Score: {test_score:.4f}")

print("\n✓ Model comparison framework complete!")

## 7. Learning Curves

Learning curves show how performance changes with training data size.

### What They Tell Us

1. **Underfitting**: Both curves low, converged
   - Solution: More complex model

2. **Overfitting**: Large gap between curves
   - Solution: More data, regularization

3. **Good Fit**: Small gap, high performance
   - Solution: You're done!

### How to Use

- Plot train/validation score vs training size
- Diagnose model issues
- Decide if more data would help

Let's visualize!

In [None]:
# Learning Curves Visualization

print("=" * 60)
print("LEARNING CURVES")
print("=" * 60)

# Calculate learning curve
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=42, n_estimators=100),
    X_train_scaled,
    y_train,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    n_jobs=-1,
    scoring="accuracy",
)

# Calculate mean and std
train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, label="Training score", marker="o", linewidth=2)
plt.plot(train_sizes, val_mean, label="Validation score", marker="s", linewidth=2)

plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.15)
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.15)

plt.xlabel("Training Size", fontweight="bold")
plt.ylabel("Accuracy", fontweight="bold")
plt.title("Learning Curves", fontweight="bold", fontsize=14)
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nDiagnosis:")
print(f"  Final training score: {train_mean[-1]:.4f}")
print(f"  Final validation score: {val_mean[-1]:.4f}")
print(f"  Gap: {train_mean[-1] - val_mean[-1]:.4f}")

if train_mean[-1] - val_mean[-1] > 0.05:
    print("  ⚠️  Overfitting detected - consider more data or regularization")
elif val_mean[-1] < 0.7:
    print("  ⚠️  Underfitting - consider more complex model")
else:
    print("  ✓ Good fit!")

print("\n✓ Learning curves complete!")

## 8. Validation Strategies

Choose the right validation strategy for your data type.

### Common Strategies

| Data Type | Strategy | Why |
|-----------|----------|-----|
| **Standard** | Stratified K-Fold | Preserves class balance |
| **Time Series** | Time Series Split | Respects temporal order |
| **Small dataset** | Leave-One-Out | Uses maximum data |
| **Imbalanced** | Stratified K-Fold | Maintains class ratio |
| **Grouped data** | Group K-Fold | Keeps groups together |
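
For the grouped-data row, a short sketch of `GroupKFold` shows what "keeps groups together" means. The customer IDs below are hypothetical; the point is that no group appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: 12 samples from 4 customers, 3 samples each
X_toy = np.arange(12).reshape(-1, 1)
y_toy = np.array([0, 1] * 6)
groups = np.repeat(["cust_a", "cust_b", "cust_c", "cust_d"], 3)

gkf = GroupKFold(n_splits=4)
overlaps = []
for train_idx, test_idx in gkf.split(X_toy, y_toy, groups=groups):
    # Groups present in both train and test would indicate leakage
    shared = set(groups[train_idx]) & set(groups[test_idx])
    overlaps.append(shared)
    print("test groups:", sorted(set(groups[test_idx])), "| shared with train:", shared)
```

With four equally sized groups and four splits, each test fold should contain exactly one customer, so the model is always evaluated on a customer it has not seen.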

### Time Series Split Example

```python
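import numpy as np
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
# Each test fold starts strictly after its training fold ends
for train_idx, test_idx in tscv.split(np.arange(12)):
    print("train ends at", train_idx[-1], "| test starts at", test_idx[0])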
```

### Key Rule

> **Never use future data to predict the past!**

✓ Validation strategies covered!

In [None]:
# Validation Strategies Example

print("=" * 60)
print("VALIDATION STRATEGY COMPARISON")
print("=" * 60)

# Compare different validation strategies
strategies = {
    "Stratified 5-Fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    "Regular 5-Fold": KFold(n_splits=5, shuffle=True, random_state=42),
    "Stratified 10-Fold": StratifiedKFold(n_splits=10, shuffle=True, random_state=42),
}

model = RandomForestClassifier(random_state=42, n_estimators=100)

for name, strategy in strategies.items():
    scores = cross_val_score(model, X_train_scaled, y_train, cv=strategy, scoring="accuracy")
    print(f"\n{name}:")
    print(f"  Scores: {scores}")
    print(f"  Mean: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

print("\n✓ For classification, use Stratified K-Fold!")
print("✓ Validation strategies complete!")

## 9. Hands-On Exercises

Practice model selection and tuning!

### Exercise 1: Model Selection
Compare Logistic Regression, SVM, and KNN on the customer_reviews dataset. Which performs best?

### Exercise 2: Grid Search
Tune a Decision Tree using Grid Search. Parameters: max_depth, min_samples_split, min_samples_leaf.

### Exercise 3: Random Search
Use Random Search to tune a Random Forest with 5+ parameters. Compare time vs Grid Search.

### Exercise 4: Learning Curves
Generate learning curves for an overfitting model. Diagnose the problem.

### Exercise 5: Complete Pipeline
Build an end-to-end pipeline:
1. Load data
2. Compare 3+ algorithms
3. Tune best model with Grid/Random Search
4. Evaluate with cross-validation
5. Plot learning curves

Ready to practice!

In [None]:
# Exercise Workspace

print("=" * 60)
print("EXERCISES - Practice Your Skills!")
print("=" * 60)

# Exercise 1: Model Selection
print("\nExercise 1: Model Selection")
print("-" * 60)
print("TODO: Load customer_reviews.csv")
print("TODO: Compare 3+ algorithms")
print("TODO: Report best model and scores\n")

# Exercise 2: Grid Search
print("Exercise 2: Grid Search")
print("-" * 60)
print("TODO: Define parameter grid for Decision Tree")
print("TODO: Run GridSearchCV")
print("TODO: Report best parameters\n")

# Exercise 3: Random Search
print("Exercise 3: Random Search")
print("-" * 60)
print("TODO: Define parameter distributions")
print("TODO: Run RandomizedSearchCV")
print("TODO: Compare time with Grid Search\n")

# Your code here for exercises

## 10. Key Takeaways & Next Steps

Excellent work mastering model selection and hyperparameter tuning!

### What You've Learned

#### 1. **Model Selection**
- ✓ How to choose the right algorithm
- ✓ Comparing multiple models systematically
- ✓ Baseline → Multiple algorithms → Tune best
- ✓ No Free Lunch Theorem

#### 2. **Cross-Validation**
- ✓ K-Fold and Stratified K-Fold
- ✓ Why single split can be misleading
- ✓ Confidence intervals from CV scores
- ✓ When to use which strategy

#### 3. **Hyperparameter Tuning**
- ✓ Grid Search - exhaustive but slow
- ✓ Random Search - fast and effective
- ✓ Bayesian Optimization - intelligent search
- ✓ When to use each method

#### 4. **Model Diagnostics**
- ✓ Learning curves for overfitting detection
- ✓ Validation strategies for different data types
- ✓ Model comparison frameworks
- ✓ Performance metrics tracking

### Key Insights

> **"Hyperparameter tuning can improve performance by 5-20%, but good features can improve it by 50%+"**

**Tuning Hierarchy:**
1. **Feature engineering** (biggest impact)
2. **Algorithm selection** (moderate impact)
3. **Hyperparameter tuning** (smaller but important)

### Best Practices Checklist

✓ Always start with baseline model  
✓ Use cross-validation (not single split)  
✓ Try multiple algorithms before tuning  
✓ Use Stratified K-Fold for classification  
✓ Grid Search for ≤3 parameters, Random Search for more  
✓ Plot learning curves to diagnose issues  
✓ Test final model on held-out test set  
✓ Document all experiments  

### Quick Reference Table

| Task | Method | Code |
|------|--------|------|
| Compare models | cross_val_score | `cross_val_score(model, X, y, cv=5)` |
| Tune parameters | GridSearchCV | `GridSearchCV(model, param_grid, cv=5)` |
| Fast tuning | RandomizedSearchCV | `RandomizedSearchCV(model, params, n_iter=20)` |
| Learning curves | learning_curve | `learning_curve(model, X, y, cv=5)` |

### Common Pitfalls

1. **Tuning on test set** - Always use CV on training data only!
2. **Data leakage** - Fit scalers on training data, transform test data
3. **Ignoring baseline** - Always compare against simple model
4. **Over-tuning** - Diminishing returns, focus on features instead
5. **Wrong CV strategy** - Use Stratified for classification, Time Series Split for temporal data
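
Pitfalls 1 and 2 can both be avoided by wrapping preprocessing and the model in a `Pipeline` and tuning only on the training split. The sketch below assumes synthetic data and an SVM with a small hypothetical grid; it is one way to structure this, not the only one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, used here only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Inside the search, the scaler is refit on the training part of every CV fold
pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

# Pipeline hyperparameters are addressed as <step name>__<parameter>
search = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10]}, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)  # tuning touches training data only

print("Best C:", search.best_params_["svm__C"])
print(f"CV accuracy: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {search.score(X_te, y_te):.3f}")
```

Scaling the full training set once and then cross-validating on the scaled array lets each validation fold influence the scaler's statistics; the pipeline removes that small leak and keeps the test set untouched until the final line.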

### Real-World Applications

Model selection and tuning is essential in:
- **Healthcare**: Optimizing disease prediction models
- **Finance**: Credit scoring and fraud detection
- **E-commerce**: Recommendation system optimization
- **Manufacturing**: Predictive maintenance tuning
- **Competitions**: Kaggle leaderboard climbing

### Next Steps

#### Module 14: Ensemble Methods
- XGBoost, LightGBM, CatBoost
- Stacking and blending
- Kaggle-winning techniques

#### Practice Projects
1. Titanic competition (Kaggle)
2. House Prices competition
3. Your own dataset

#### Advanced Topics
- Nested cross-validation
- Time series cross-validation
- Bayesian optimization deep dive
- AutoML tools (Auto-sklearn, H2O)

### Recommended Practice

Spend **2-3 hours**:
1. Complete all exercises
2. Apply to your own dataset
3. Participate in a Kaggle competition
4. Build a model comparison template

### Resources

**Documentation:**
- [scikit-learn Model Selection](https://scikit-learn.org/stable/model_selection.html)
- [GridSearchCV Guide](https://scikit-learn.org/stable/modules/grid_search.html)

**Books:**
- "Hands-On Machine Learning" by Aurélien Géron (Chapter 2)
- "Python Machine Learning" by Sebastian Raschka

**Tools:**
- Optuna for Bayesian Optimization
- Weights & Biases for experiment tracking
- MLflow for model management

---

### Module Complete! 🎉

**Skills Gained:**
- Model selection strategies
- Cross-validation mastery
- Hyperparameter tuning (Grid, Random, Bayesian)
- Learning curve interpretation
- Production-ready validation pipelines

**Next Module**: `14_ensemble_methods.ipynb` - Learn XGBoost and win Kaggle!

---

*Built with Claude Code | Module 13: Model Selection & Hyperparameter Tuning*