# Module 6: Model Evaluation and Validation

---

Building a model is only half the work. This module focuses on rigorously **evaluating and validating** models to ensure they generalize well to unseen data. We cover evaluation metrics, cross-validation, the bias-variance tradeoff, and hyperparameter tuning.

---

## Table of Contents

1. [Confusion Matrix and Derived Metrics](#1.-Confusion-Matrix-and-Derived-Metrics)
2. [ROC Curve and AUC](#2.-ROC-Curve-and-AUC)
3. [Cross-Validation](#3.-Cross-Validation)
4. [Bias-Variance Tradeoff](#4.-Bias-Variance-Tradeoff)
5. [Hyperparameter Tuning](#5.-Hyperparameter-Tuning)
6. [Exercises](#6.-Exercises)
7. [Summary and Further Reading](#7.-Summary-and-Further-Reading)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     StratifiedKFold, GridSearchCV,
                                     RandomizedSearchCV, learning_curve, validation_curve)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                              recall_score, f1_score, roc_curve, roc_auc_score,
                              classification_report, precision_recall_curve)

plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

# Load and prepare data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

---

## 1. Confusion Matrix and Derived Metrics

The confusion matrix is the foundation of classification evaluation. It shows how predictions map to actual labels:

```
                     Predicted
                  Neg        Pos
Actual  Neg   [ TN      |   FP ]
        Pos   [ FN      |   TP ]
```

| Metric | Formula | Interpretation |
|--------|---------|---------------|
| **Accuracy** | (TP + TN) / Total | Overall correct predictions |
| **Precision** | TP / (TP + FP) | Of predicted positives, how many are correct? |
| **Recall (Sensitivity)** | TP / (TP + FN) | Of actual positives, how many did we find? |
| **F1 Score** | 2 * P * R / (P + R) | Harmonic mean of precision and recall |
| **Specificity** | TN / (TN + FP) | Of actual negatives, how many did we identify? |
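
As a quick check of the formulas in the table, the sketch below evaluates each one on a set of made-up counts. The values of `tp`, `fp`, `fn`, and `tn` are illustrative assumptions, not results from the dataset used in this module.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration.
tp, fp, fn, tn = 40, 10, 5, 45
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F1 Score", f1),
                    ("Specificity", specificity)]:
    print(f"{name:<12} {value:.4f}")
```

Because F1 is a harmonic mean, it never exceeds the arithmetic average of precision and recall, and it drops sharply when either of the two is low.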

In [None]:
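# Train a model for evaluation
# Note: load_breast_cancer encodes malignant as 0 and benign as 1, so the
# "positive" class in the metrics below is benign (scikit-learn's default pos_label=1).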
lr = LogisticRegression(max_iter=5000, random_state=42)
lr.fit(X_train_s, y_train)
y_pred = lr.predict(X_test_s)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

print("CONFUSION MATRIX")
print("=" * 40)
print(f"                Predicted")
print(f"             Neg     Pos")
print(f"Actual Neg   {tn:>4}    {fp:>4}")
print(f"       Pos   {fn:>4}    {tp:>4}")

print(f"\nDERIVED METRICS")
print("=" * 40)
print(f"Accuracy:    {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision:   {precision_score(y_test, y_pred):.4f}")
print(f"Recall:      {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score:    {f1_score(y_test, y_pred):.4f}")
print(f"Specificity: {tn / (tn + fp):.4f}")

In [None]:
# Visualize the confusion matrix
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Raw counts
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=cancer.target_names, yticklabels=cancer.target_names,
            square=True, linewidths=1, annot_kws={'size': 16})
axes[0].set_xlabel('Predicted', fontsize=13)
axes[0].set_ylabel('Actual', fontsize=13)
axes[0].set_title('Confusion Matrix (Counts)', fontsize=14, fontweight='bold')

# Row-normalized: each row sums to 100%, so the diagonal shows the recall of each class
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_norm, annot=True, fmt='.2%', cmap='Blues', ax=axes[1],
            xticklabels=cancer.target_names, yticklabels=cancer.target_names,
            square=True, linewidths=1, annot_kws={'size': 14})
axes[1].set_xlabel('Predicted', fontsize=13)
axes[1].set_ylabel('Actual', fontsize=13)
axes[1].set_title('Confusion Matrix (Normalized)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

### When to Use Which Metric?

- **Accuracy** is appropriate when classes are balanced.
- **Precision** matters when false positives are costly (e.g., spam detection — do not flag legitimate email).
- **Recall** matters when false negatives are costly (e.g., cancer detection — do not miss a positive case).
- **F1 Score** provides a balanced measure when both false positives and false negatives are important.
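
To make the first point concrete, here is a minimal sketch of why accuracy can mislead on imbalanced data. The label counts (950 negatives, 50 positives) and the always-negative "model" are assumptions made up for this illustration.

```python
# Hypothetical labels: 950 negatives (0) and 50 positives (1).
y_true = [0] * 950 + [1] * 50
# A trivial "model" that always predicts the majority class.
y_hat = [0] * len(y_true)

tp = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_hat) if t == p)

baseline_accuracy = correct / len(y_true)
baseline_recall = tp / (tp + fn)

print(f"Accuracy: {baseline_accuracy:.2f}")
print(f"Recall:   {baseline_recall:.2f}")
```

The baseline looks strong on accuracy yet finds none of the positive cases, which is the situation recall and F1 are meant to expose.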

---

## 2. ROC Curve and AUC

The **ROC (Receiver Operating Characteristic) curve** plots the True Positive Rate (recall) against the False Positive Rate at various classification thresholds.
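
The threshold sweep behind the curve can be sketched in a few lines of plain Python. The helper `roc_points` and the four toy samples are hypothetical and exist only for this illustration; `roc_curve` from scikit-learn is the function to use in practice.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per distinct score used as a threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical labels and predicted scores for four samples.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
for fpr_i, tpr_i in roc_points(toy_labels, toy_scores):
    print(f"FPR = {fpr_i:.2f}, TPR = {tpr_i:.2f}")
```

Lowering the threshold can only add predicted positives, so both rates move monotonically from (0, 0) toward (1, 1).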

The **AUC (Area Under the Curve)** summarizes the curve into a single number:
- AUC = 1.0: perfect classifier
- AUC = 0.5: random classifier (no discrimination)
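
One way to read these values: AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way on toy data; `pairwise_auc` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores: 3 of the 4 positive/negative pairs are ordered correctly.
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A model that assigns the same score to every sample ties on all pairs and lands at 0.5, matching the random-classifier case above.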

In [None]:
# Get probability scores (not just binary predictions)
y_proba = lr.predict_proba(X_test_s)[:, 1]

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc_score = roc_auc_score(y_test, y_proba)

# Precision-Recall curve
precision_vals, recall_vals, pr_thresholds = precision_recall_curve(y_test, y_proba)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# ROC Curve
ax = axes[0]
ax.plot(fpr, tpr, linewidth=2.5, color='#2196F3', label=f'Logistic Regression (AUC = {auc_score:.3f})')
ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random classifier (AUC = 0.5)')
ax.fill_between(fpr, tpr, alpha=0.1, color='#2196F3')
ax.set_xlabel('False Positive Rate', fontsize=13)
ax.set_ylabel('True Positive Rate (Recall)', fontsize=13)
ax.set_title('ROC Curve', fontsize=14, fontweight='bold')
ax.legend(fontsize=10, loc='lower right')

# Precision-Recall Curve
ax = axes[1]
ax.plot(recall_vals, precision_vals, linewidth=2.5, color='#FF5722')
ax.fill_between(recall_vals, precision_vals, alpha=0.1, color='#FF5722')
ax.set_xlabel('Recall', fontsize=13)
ax.set_ylabel('Precision', fontsize=13)
ax.set_title('Precision-Recall Curve', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Compare ROC curves of multiple classifiers
classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=5000, random_state=42),
    'KNN (K=5)': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'SVM (RBF)': SVC(kernel='rbf', probability=True, random_state=42),
}

fig, ax = plt.subplots(figsize=(10, 8))
colors = ['#2196F3', '#FF5722', '#4CAF50', '#9C27B0']

for (name, clf), color in zip(classifiers.items(), colors):
    clf.fit(X_train_s, y_train)
    y_prob = clf.predict_proba(X_test_s)[:, 1]
    fpr_c, tpr_c, _ = roc_curve(y_test, y_prob)
    auc_c = roc_auc_score(y_test, y_prob)
    ax.plot(fpr_c, tpr_c, linewidth=2, color=color, label=f'{name} (AUC = {auc_c:.3f})')

ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random (AUC = 0.5)')
ax.set_xlabel('False Positive Rate', fontsize=13)
ax.set_ylabel('True Positive Rate', fontsize=13)
ax.set_title('ROC Curves — Classifier Comparison', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
plt.tight_layout()
plt.show()

---

## 3. Cross-Validation

A single train-test split can give unreliable results because it depends on which samples end up in each set. **Cross-validation** addresses this by evaluating the model on multiple different splits.

**K-Fold Cross-Validation:**
1. Split the data into K folds of roughly equal size.
2. For each fold, train on K-1 folds and evaluate on the remaining fold.
3. Average the K results.
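
The splitting step can be sketched without any library. The generator `k_fold_indices` below is a hypothetical helper for illustration; it uses contiguous, unshuffled folds, whereas real workflows usually shuffle or stratify (for example with `StratifiedKFold`).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

for fold, (train, val) in enumerate(k_fold_indices(10, 3), start=1):
    print(f"Fold {fold}: validate on {val}, train on {len(train)} samples")
```

Every sample is used for validation exactly once and for training in the other K-1 rounds, which is why the averaged score is less dependent on any single split.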

In [None]:
# 5-fold stratified cross-validation for several classifiers
models = {
    'Logistic Regression': LogisticRegression(max_iter=5000, random_state=42),
    'KNN (K=5)': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'SVM (RBF)': SVC(kernel='rbf', random_state=42),
}

print("5-Fold Stratified Cross-Validation Results")
print("=" * 60)

cv_results = []
for name, model in models.items():
    scores = cross_val_score(model, X_train_s, y_train,
                            cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
                            scoring='accuracy')
    cv_results.append({
        'Model': name,
        'Mean Accuracy': scores.mean(),
        'Std': scores.std(),
        'Fold Scores': scores
    })
    print(f"  {name:>25s}: {scores.mean():.4f} (+/- {scores.std():.4f})  Folds: {np.round(scores, 3)}")

# Visualize as boxplot
fig, ax = plt.subplots(figsize=(10, 6))
fold_data = [r['Fold Scores'] for r in cv_results]
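# Tick labels are set separately; the boxplot `labels` keyword is deprecated in recent Matplotlib releases.
bp = ax.boxplot(fold_data, patch_artist=True)
ax.set_xticklabels([r['Model'] for r in cv_results])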
colors = ['#2196F3', '#FF5722', '#4CAF50', '#9C27B0']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.6)
ax.set_ylabel('Accuracy', fontsize=13)
ax.set_title('Cross-Validation Accuracy Distribution', fontsize=14, fontweight='bold')
plt.xticks(rotation=15)
plt.tight_layout()
plt.show()
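
One caveat about the cell above: the features were scaled once, using statistics from the whole training set, before cross-validation, so each validation fold has already influenced the scaler. The effect is usually small, but a common way to avoid it is to put the scaler inside a `Pipeline`. The sketch below is one possible setup, not the only one; the variable names (`X_raw`, `pipe`, `pipe_scores`, and so on) are chosen here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

# The scaler is part of the estimator, so it is re-fit on the training
# portion of every fold and never sees that fold's validation samples.
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(max_iter=5000, random_state=42))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe_scores = cross_val_score(pipe, X_tr, y_tr, cv=cv, scoring='accuracy')

print(f"Pipeline CV accuracy: {pipe_scores.mean():.4f} (+/- {pipe_scores.std():.4f})")
```

The same pipeline object can be passed to `GridSearchCV` or `learning_curve`, so the preprocessing stays inside the resampling loop there as well.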

---

## 4. Bias-Variance Tradeoff

Understanding the bias-variance tradeoff is crucial for building models that generalize well.

- **Bias**: Error from incorrect assumptions. High bias leads to underfitting.
- **Variance**: Error from sensitivity to fluctuations in training data. High variance leads to overfitting.
- **Goal**: Find the sweet spot that minimizes total expected error, which for squared-error loss decomposes into squared bias, variance, and irreducible noise.
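
A small simulation can make the two error sources tangible. For squared-error loss, expected error splits into squared bias, variance, and irreducible noise; the sketch below estimates the first two for a rigid and a flexible predictor at a single query point. Everything in it (the target function `true_f`, the constants, and the two toy predictors) is an assumption chosen for illustration.

```python
import random

random.seed(0)

def true_f(x):
    return x ** 2

X0 = 0.9          # query point where predictions are compared
N_TRAIN = 20      # samples per simulated training set
N_TRIALS = 2000   # number of simulated training sets
NOISE_SD = 0.1    # standard deviation of the label noise

mean_preds, nn_preds = [], []
for _ in range(N_TRIALS):
    xs = [random.random() for _ in range(N_TRAIN)]
    ys = [true_f(x) + random.gauss(0, NOISE_SD) for x in xs]
    # Rigid model: ignore x and predict the mean of the training targets.
    mean_preds.append(sum(ys) / N_TRAIN)
    # Flexible model: copy the target of the nearest training point (1-NN).
    nearest = min(range(N_TRAIN), key=lambda i: abs(xs[i] - X0))
    nn_preds.append(ys[nearest])

def bias_sq_and_variance(preds, target):
    """Estimate squared bias and variance of predictions across training sets."""
    avg = sum(preds) / len(preds)
    variance = sum((p - avg) ** 2 for p in preds) / len(preds)
    return (avg - target) ** 2, variance

for label, preds in [("Mean predictor", mean_preds), ("1-NN predictor", nn_preds)]:
    b2, var = bias_sq_and_variance(preds, true_f(X0))
    print(f"{label}: bias^2 = {b2:.4f}, variance = {var:.4f}")
```

The mean predictor barely changes between training sets but is systematically off at `X0`, while the 1-NN predictor is centered near the truth but fluctuates with each sample.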

**Learning curves** help diagnose whether a model suffers from high bias or high variance.

In [None]:
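# Learning curves for three models of increasing flexibility.
# The panel titles name the regime each model is meant to illustrate; on this
# dataset the actual curves may differ (e.g. logistic regression can score highly).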
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

models_lc = [
    ('High Bias (Logistic Reg)', LogisticRegression(max_iter=5000, random_state=42)),
    ('Good Fit (SVM RBF)', SVC(kernel='rbf', random_state=42)),
    ('High Variance (Deep Tree)', DecisionTreeClassifier(max_depth=None, random_state=42)),
]

for idx, (name, model) in enumerate(models_lc):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X_train_s, y_train,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5, scoring='accuracy', n_jobs=-1
    )
    
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)
    
    ax = axes[idx]
    ax.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='#2196F3')
    ax.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1, color='#FF5722')
    ax.plot(train_sizes, train_mean, 'o-', color='#2196F3', linewidth=2, label='Training score')
    ax.plot(train_sizes, val_mean, 's-', color='#FF5722', linewidth=2, label='Validation score')
    ax.set_xlabel('Training Set Size', fontsize=12)
    ax.set_ylabel('Accuracy', fontsize=12)
    ax.set_title(name, fontsize=13, fontweight='bold')
    ax.legend(fontsize=10)
    ax.set_ylim(0.85, 1.02)

plt.suptitle('Learning Curves — Diagnosing Bias vs Variance', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

print("How to read learning curves:")
print("  - High Bias: both curves converge at a low score — model is too simple.")
print("  - Good Fit: both curves converge at a high score — model is appropriate.")
print("  - High Variance: large gap between curves — model memorizes training data.")

In [None]:
# Validation curve: how a hyperparameter affects performance
param_range = np.arange(1, 25)
train_scores_vc, val_scores_vc = validation_curve(
    DecisionTreeClassifier(random_state=42), X_train_s, y_train,
    param_name='max_depth', param_range=param_range,
    cv=5, scoring='accuracy', n_jobs=-1
)

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(param_range, train_scores_vc.mean(axis=1), 'o-', color='#2196F3',
        linewidth=2, label='Training score')
ax.plot(param_range, val_scores_vc.mean(axis=1), 's-', color='#FF5722',
        linewidth=2, label='Validation score')
ax.fill_between(param_range,
                train_scores_vc.mean(axis=1) - train_scores_vc.std(axis=1),
                train_scores_vc.mean(axis=1) + train_scores_vc.std(axis=1),
                alpha=0.1, color='#2196F3')
ax.fill_between(param_range,
                val_scores_vc.mean(axis=1) - val_scores_vc.std(axis=1),
                val_scores_vc.mean(axis=1) + val_scores_vc.std(axis=1),
                alpha=0.1, color='#FF5722')
ax.set_xlabel('max_depth', fontsize=13)
ax.set_ylabel('Accuracy', fontsize=13)
ax.set_title('Validation Curve — Decision Tree max_depth', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
plt.tight_layout()
plt.show()

best_depth = param_range[np.argmax(val_scores_vc.mean(axis=1))]
print(f"Optimal max_depth: {best_depth}")
print("Beyond this depth, validation accuracy plateaus or drops while training accuracy keeps rising — overfitting.")

---

## 5. Hyperparameter Tuning

Most ML algorithms have hyperparameters that must be set before training. Finding the best combination is called **hyperparameter tuning**.

| Method | How It Works | Pros | Cons |
|--------|-------------|------|------|
| **GridSearchCV** | Tries every combination | Thorough | Slow for many parameters |
| **RandomizedSearchCV** | Samples random combinations | Faster | May miss the best |

Both use cross-validation internally to evaluate each combination.
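
To see how the cost of the two strategies scales, the sketch below counts model fits for a small search space. The grid `toy_grid`, the fold count, and the `n_iter` budget are made-up values used only for counting; no model is trained.

```python
from itertools import product

# Hypothetical search space, used only to count model fits.
toy_grid = {
    'C': [0.1, 1, 10],
    'gamma': [0.01, 0.1],
    'kernel': ['rbf', 'linear'],
}
n_folds = 5

# Grid search: the Cartesian product of all listed values.
combinations = [dict(zip(toy_grid, values)) for values in product(*toy_grid.values())]
grid_fits = len(combinations) * n_folds

# Randomized search: the budget is fixed by n_iter, whatever the size of the space.
n_iter = 8
random_fits = n_iter * n_folds

print(f"Grid search:       {len(combinations)} combinations x {n_folds} folds = {grid_fits} fits")
print(f"Randomized search: {n_iter} samples x {n_folds} folds = {random_fits} fits")
```

Adding one more hyperparameter with four candidate values would multiply the grid-search cost by four, while the randomized budget would stay the same. With the default `refit=True`, both classes also perform one final fit on the full training data.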

In [None]:
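# GridSearchCV for SVM
# gamma is ignored by the linear kernel, so the grid is split per kernel
# to avoid fitting duplicate linear configurations.
param_grid = [
    {'kernel': ['rbf'], 'C': [0.01, 0.1, 1, 10, 100],
     'gamma': ['scale', 'auto', 0.01, 0.1, 1]},
    {'kernel': ['linear'], 'C': [0.01, 0.1, 1, 10, 100]},
]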

grid_search = GridSearchCV(
    SVC(random_state=42),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy',
    n_jobs=-1,
    verbose=0
)

grid_search.fit(X_train_s, y_train)

print("GRID SEARCH RESULTS — SVM")
print("=" * 50)
print(f"Best parameters:     {grid_search.best_params_}")
print(f"Best CV accuracy:    {grid_search.best_score_:.4f}")
print(f"Test set accuracy:   {grid_search.score(X_test_s, y_test):.4f}")

# Show top 5 parameter combinations
results = pd.DataFrame(grid_search.cv_results_)
top5 = results.nsmallest(5, 'rank_test_score')[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print("\nTop 5 Configurations:")
for _, row in top5.iterrows():
    print(f"  Rank {int(row['rank_test_score'])}:  {row['mean_test_score']:.4f} +/- {row['std_test_score']:.4f}  {row['params']}")

In [None]:
# RandomizedSearchCV for Random Forest
from scipy.stats import randint  # randint(low, high) samples integers from [low, high)

param_distributions = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 20),
    'min_samples_split': randint(2, 15),
    'min_samples_leaf': randint(1, 10),
    'max_features': ['sqrt', 'log2', None]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=50,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=0
)

random_search.fit(X_train_s, y_train)

print("RANDOMIZED SEARCH RESULTS — Random Forest")
print("=" * 50)
print(f"Best parameters:     {random_search.best_params_}")
print(f"Best CV accuracy:    {random_search.best_score_:.4f}")
print(f"Test set accuracy:   {random_search.score(X_test_s, y_test):.4f}")

---

## 6. Exercises

### Exercise 1: Evaluate on an Imbalanced Dataset

In [None]:
# Exercise 1: Create an imbalanced dataset and evaluate a classifier
#
# 1. Generate imbalanced data using make_classification with weights=[0.9, 0.1]
# 2. Split into train/test
# 3. Train a LogisticRegression
# 4. Compute accuracy, precision, recall, and F1
# 5. Why is accuracy misleading here? Which metric is more informative?
# 6. Plot the confusion matrix and ROC curve

X_imb, y_imb = make_classification(n_samples=1000, weights=[0.9, 0.1],
                                    random_state=42, n_features=10)

# Your code here:


### Exercise 2: Cross-Validation Comparison

In [None]:
# Exercise 2: Using the Breast Cancer dataset:
# 1. Run 10-fold cross-validation for at least 3 different classifiers
# 2. Use 'f1' as the scoring metric instead of 'accuracy'
# 3. Print mean and std for each model
# 4. Create a bar chart comparing mean F1 scores with error bars (std)

# Your code here:


### Exercise 3: Hyperparameter Tuning for KNN

In [None]:
# Exercise 3: Use GridSearchCV to find the best hyperparameters for KNN:
# Search over:
#   - n_neighbors: [1, 3, 5, 7, 9, 11, 15, 20]
#   - weights: ['uniform', 'distance']
#   - metric: ['euclidean', 'manhattan']
# 
# Report the best parameters, best CV score, and test accuracy.

# Your code here:


---

## 7. Summary and Further Reading

### What We Covered

- **Confusion Matrix**: The foundation of classification evaluation (TP, TN, FP, FN).
- **Precision, Recall, F1**: Metrics that go beyond accuracy, especially for imbalanced data.
- **ROC Curve and AUC**: Threshold-independent evaluation of classifier performance.
- **Cross-Validation**: Multiple train/test splits for more reliable evaluation.
- **Bias-Variance Tradeoff**: Diagnosing underfitting and overfitting with learning curves.
- **Hyperparameter Tuning**: GridSearchCV and RandomizedSearchCV to find optimal model settings.

### Recommended Reading

- [Scikit-learn Model Evaluation Guide](https://scikit-learn.org/stable/modules/model_evaluation.html)
- [Scikit-learn Cross-Validation Guide](https://scikit-learn.org/stable/modules/cross_validation.html)
- Chapter 2 of Aurélien Géron, *Hands-On Machine Learning* (end-to-end project with evaluation)
- Chapter 5 of *ISLR* — Resampling Methods

### Next Module

In **Module 7: Unsupervised Learning**, we will explore algorithms that work with unlabeled data: K-Means clustering, hierarchical clustering, DBSCAN, and dimensionality reduction with PCA and t-SNE.

---