# Model Evaluation

## Week 6: Comprehensive Guide to Model Assessment

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Understand** the importance of proper train/validation/test splits
2. **Calculate and interpret** regression metrics (MAE, MSE, RMSE, R²)
3. **Calculate and interpret** classification metrics (accuracy, precision, recall, F1)
4. **Implement** cross-validation techniques
5. **Plot and interpret** learning curves and validation curves
6. **Diagnose** overfitting and underfitting using evaluation techniques
7. **Use** confusion matrices and ROC curves for classification problems

## 1. Introduction to Model Evaluation

**Model evaluation** is crucial for:
- Assessing how well a model generalizes to unseen data
- Comparing different models
- Tuning hyperparameters
- Detecting overfitting/underfitting

### The Fundamental Trade-off

| Problem | Symptoms | Solutions |
|---------|----------|------------|
| **Underfitting** (High Bias) | Low train & test scores | More features, more complex model |
| **Overfitting** (High Variance) | High train, low test score | More data, regularization, simpler model |

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import (
    train_test_split, cross_val_score, KFold, 
    learning_curve, validation_curve, StratifiedKFold
)
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, roc_auc_score,
    precision_recall_curve, ConfusionMatrixDisplay, RocCurveDisplay
)
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification, load_iris

# Set random seed
np.random.seed(42)

# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

---

## 2. Data Splitting Strategies

### 2.1 Train/Validation/Test Split

```
        Full Dataset
             |
    +--------+--------+
    |                 |
 Training        Test Set (15-20%)
   Set               [Final evaluation only]
    |
    +--------+
    |        |
 Train    Validation
(60-70%)  (15-20%)
           [Hyperparameter tuning]
```

### 2.2 Rationale

| Split | Purpose | Usage |
|-------|---------|-------|
| **Training** | Learn model parameters | Fit the model |
| **Validation** | Tune hyperparameters | Model selection |
| **Test** | Final evaluation | Report performance |
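
The `test_size` of the second split in a two-step procedure is a fraction of the data left after the first split, not of the full dataset, which makes it easy to get wrong. The sketch below works through that arithmetic. `two_stage_fractions` is a hypothetical helper written for illustration; it is not part of scikit-learn.

```python
def two_stage_fractions(test_frac, val_frac):
    """Return the test_size values for two successive splits.

    test_frac and val_frac are fractions of the FULL dataset.
    """
    if not 0 < test_frac < 1 or not 0 < val_frac < 1 - test_frac:
        raise ValueError("fractions must be positive and sum to less than 1")
    first = test_frac                    # split 1: full data -> train_full / test
    second = val_frac / (1 - test_frac)  # split 2: train_full -> train / validation
    return first, second


first, second = two_stage_fractions(test_frac=0.2, val_frac=0.2)
train_frac = (1 - first) * (1 - second)  # share of the full data left for training
print(f"first test_size={first}, second test_size={second:.2f}, "
      f"training share={train_frac:.2f}")
```

For a 60/20/20 split this gives a second-stage `test_size` of 0.25, because 0.25 of the remaining 80% is 20% of the full dataset.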

In [None]:
# Load data
df = pd.read_csv('../data/house_prices.csv')
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']

print(f"Total samples: {len(X)}")

# Method 1: Simple train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"\nSimple split:")
print(f"  Training: {len(X_train)} ({len(X_train)/len(X)*100:.0f}%)")
print(f"  Test: {len(X_test)} ({len(X_test)/len(X)*100:.0f}%)")

# Method 2: Train/validation/test split
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=42  # 0.25 of 0.8 = 0.2
)
print(f"\nThree-way split:")
print(f"  Training: {len(X_train)} ({len(X_train)/len(X)*100:.0f}%)")
print(f"  Validation: {len(X_val)} ({len(X_val)/len(X)*100:.0f}%)")
print(f"  Test: {len(X_test)} ({len(X_test)/len(X)*100:.0f}%)")

---

## 3. Regression Metrics

### 3.1 Mean Absolute Error (MAE)

Average of absolute differences between predictions and actual values:

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

**Interpretation**: Average error magnitude in original units.

### 3.2 Mean Squared Error (MSE)

Average of squared differences:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

**Interpretation**: Squaring penalizes large errors more heavily than MAE does; the result is in squared units of the target.

### 3.3 Root Mean Squared Error (RMSE)

Square root of MSE:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

**Interpretation**: Expressed in the original units like MAE, but still penalizes large errors more heavily because of the squaring.
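
As a small worked example of the MAE, MSE, and RMSE formulas, the sketch below implements them in plain Python and applies them to two prediction sets with the same total absolute error. The helper names and all numbers are made up for illustration.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


y_true = [100.0, 200.0, 300.0, 400.0]
even = [110.0, 190.0, 310.0, 390.0]   # four errors of size 10
spiky = [100.0, 200.0, 300.0, 440.0]  # a single error of size 40

for name, y_pred in [("even", even), ("spiky", spiky)]:
    print(f"{name:>5}: MAE={mae(y_true, y_pred):.1f}  "
          f"MSE={mse(y_true, y_pred):.1f}  RMSE={rmse(y_true, y_pred):.1f}")
```

Both prediction sets have an MAE of 10, but the set with one large miss has an RMSE of 20 instead of 10. This is the sense in which squared-error metrics penalize large errors.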

### 3.4 Coefficient of Determination (R²)

Proportion of variance explained by the model:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

where:
- $SS_{res}$ = Residual sum of squares
- $SS_{tot}$ = Total sum of squares
- $\bar{y}$ = Mean of actual values

**Interpretation**: 
- R² = 1: Perfect predictions
- R² = 0: Model does no better than always predicting the mean
- R² < 0: Model is worse than predicting the mean
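
The three interpretation cases can be checked directly against the formula. The sketch below computes R² from $SS_{res}$ and $SS_{tot}$ in plain Python; the function name `r2` and the toy values are assumptions made for illustration.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
perfect = list(y_true)                     # every prediction is exact
mean_only = [3.0] * 5                      # always predict the mean of y_true
reversed_pred = [5.0, 4.0, 3.0, 2.0, 1.0]  # systematically wrong direction

print(r2(y_true, perfect))        # 1.0
print(r2(y_true, mean_only))      # 0.0
print(r2(y_true, reversed_pred))  # -3.0
```

The last case shows that R² has no lower bound: a model whose residual sum of squares exceeds the total sum of squares scores below zero.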

In [None]:
# Train a regression model
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Make predictions
y_train_pred = model.predict(X_train_scaled)
y_val_pred = model.predict(X_val_scaled)
y_test_pred = model.predict(X_test_scaled)

# Calculate metrics for each set
def calculate_regression_metrics(y_true, y_pred, set_name):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    
    print(f"\n{set_name} Metrics:")
    print(f"  MAE:  {mae:.4f}")
    print(f"  MSE:  {mse:.4f}")
    print(f"  RMSE: {rmse:.4f}")
    print(f"  R²:   {r2:.4f}")
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2}

train_metrics = calculate_regression_metrics(y_train, y_train_pred, "Training")
val_metrics = calculate_regression_metrics(y_val, y_val_pred, "Validation")
test_metrics = calculate_regression_metrics(y_test, y_test_pred, "Test")

In [None]:
# Visualize predictions vs actual
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

datasets = [
    (y_train, y_train_pred, 'Training'),
    (y_val, y_val_pred, 'Validation'),
    (y_test, y_test_pred, 'Test')
]

for ax, (y_true, y_pred, title) in zip(axes, datasets):
    ax.scatter(y_true, y_pred, alpha=0.5, s=30)
    
    # Perfect prediction line
    min_val = min(y_true.min(), y_pred.min())
    max_val = max(y_true.max(), y_pred.max())
    ax.plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect Prediction')
    
    ax.set_xlabel('Actual Values')
    ax.set_ylabel('Predicted Values')
    ax.set_title(f'{title} Set\nR² = {r2_score(y_true, y_pred):.4f}')
    ax.legend()

plt.tight_layout()
plt.show()

In [None]:
# Residual analysis
residuals = y_test - y_test_pred

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Residuals vs Predicted
ax1 = axes[0]
ax1.scatter(y_test_pred, residuals, alpha=0.5, s=30)
ax1.axhline(y=0, color='r', linestyle='--', linewidth=2)
ax1.set_xlabel('Predicted Values')
ax1.set_ylabel('Residuals')
ax1.set_title('Residuals vs Predicted Values')

# Residual distribution
ax2 = axes[1]
ax2.hist(residuals, bins=30, edgecolor='black', alpha=0.7)
ax2.axvline(x=0, color='r', linestyle='--', linewidth=2)
ax2.set_xlabel('Residual')
ax2.set_ylabel('Frequency')
ax2.set_title(f'Distribution of Residuals\nMean: {residuals.mean():.4f}, Std: {residuals.std():.4f}')

plt.tight_layout()
plt.show()

---

## 4. Classification Metrics

### 4.1 Confusion Matrix

```
                    Predicted
                 Positive  Negative
Actual Positive    TP        FN
Actual Negative    FP        TN
```

- **TP** (True Positive): Correctly predicted positive
- **TN** (True Negative): Correctly predicted negative
- **FP** (False Positive): Incorrectly predicted positive (Type I Error)
- **FN** (False Negative): Incorrectly predicted negative (Type II Error)
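
To make the four cells concrete, the following sketch counts them by hand for a small binary example. `confusion_counts` is a hypothetical helper and the labels are invented; in practice `confusion_matrix` from scikit-learn does this work.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted positive, actually positive
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative (Type I error)
        elif actual == positive:
            fn += 1  # predicted negative, actually positive (Type II error)
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```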

### 4.2 Core Metrics

#### Accuracy
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

**Warning**: Can be misleading with imbalanced classes!
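
A minimal illustration of this warning, using an invented dataset with 5 positives among 100 samples and a "classifier" that always predicts the negative class:

```python
# Hypothetical imbalanced labels: 5 positives, 95 negatives
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100  # a model that never predicts the positive class

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
positives_found = sum(t == 1 and p == 1 for t, p in zip(y_true, always_negative))

print(f"Accuracy: {accuracy:.2f}")              # 0.95
print(f"Positives found: {positives_found}/5")  # 0/5
```

The accuracy is 95% even though not a single positive case is detected.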

#### Precision (Positive Predictive Value)
$$\text{Precision} = \frac{TP}{TP + FP}$$

**Question it answers**: Of all positive predictions, how many are correct?

#### Recall (Sensitivity, True Positive Rate)
$$\text{Recall} = \frac{TP}{TP + FN}$$

**Question it answers**: Of all actual positives, how many did we catch?

#### F1 Score
Harmonic mean of precision and recall:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$$
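
The sketch below evaluates precision, recall, and F1 from raw counts and checks that both forms of the F1 formula agree. The counts are invented, and returning 0.0 when a denominator is empty is a convention chosen for this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


tp, fp, fn = 30, 10, 20  # hypothetical counts
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
f1_from_counts = 2 * tp / (2 * tp + fp + fn)  # second form of the formula

print(f"Precision: {precision:.3f}")  # 0.750
print(f"Recall:    {recall:.3f}")     # 0.600
print(f"F1:        {f1:.3f}")         # 0.667
print(f"Arithmetic mean for comparison: {(precision + recall) / 2:.3f}")  # 0.675
```

The harmonic mean (0.667) sits below the arithmetic mean (0.675) because it is pulled toward the smaller of the two values.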

### 4.3 When to Use Which Metric?

| Scenario | Priority | Example |
|----------|----------|----------|
| Spam detection | Precision | Don't want legitimate emails in spam |
| Disease detection | Recall | Don't want to miss sick patients |
| Balanced importance | F1 Score | General classification tasks |
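
Which metric gets priority usually translates into a choice of decision threshold on the predicted scores. The sketch below sweeps three thresholds over invented scores and reports precision, recall (the true positive rate), and the false positive rate. `rates_at_threshold` is a hypothetical helper, and the 0.0 fallback for empty denominators is a convention chosen here.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Return (precision, recall, false_positive_rate) for one threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, recall, fpr


# Hypothetical labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

for threshold in (0.3, 0.5, 0.85):
    p, r, fpr = rates_at_threshold(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}  FPR={fpr:.2f}")
```

On this data, raising the threshold from 0.3 to 0.85 moves precision from about 0.57 to 1.00 while recall drops from 1.00 to 0.25. Precision does not have to rise monotonically with the threshold on every dataset, but recall can only stay the same or fall.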

In [None]:
# Load classification data
iris_df = pd.read_csv('../data/iris.csv')
X_iris = iris_df.drop('class', axis=1)
y_iris = iris_df['class']

# Split data
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

# Train classifier
clf = LogisticRegression(max_iter=200, random_state=42)
clf.fit(X_train_c, y_train_c)

# Predictions
y_pred_c = clf.predict(X_test_c)
y_prob_c = clf.predict_proba(X_test_c)

print("Classification Report:")
print(classification_report(y_test_c, y_pred_c))

In [None]:
# Confusion Matrix Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Without normalization
cm = confusion_matrix(y_test_c, y_pred_c)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
disp.plot(ax=axes[0], cmap='Blues')
axes[0].set_title('Confusion Matrix (Counts)')

# With normalization
cm_norm = confusion_matrix(y_test_c, y_pred_c, normalize='true')
disp_norm = ConfusionMatrixDisplay(confusion_matrix=cm_norm, display_labels=clf.classes_)
disp_norm.plot(ax=axes[1], cmap='Blues', values_format='.2f')
axes[1].set_title('Confusion Matrix (Normalized)')

plt.tight_layout()
plt.show()

In [None]:
# Binary classification example for ROC curve
X_binary, y_binary = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_redundant=5, n_classes=2, random_state=42
)

X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_binary, y_binary, test_size=0.3, random_state=42
)

# Train multiple classifiers for comparison
classifiers = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42)
}

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# ROC Curves
ax1 = axes[0]
for name, clf in classifiers.items():
    clf.fit(X_train_b, y_train_b)
    y_prob = clf.predict_proba(X_test_b)[:, 1]
    fpr, tpr, _ = roc_curve(y_test_b, y_prob)
    auc = roc_auc_score(y_test_b, y_prob)
    ax1.plot(fpr, tpr, label=f'{name} (AUC = {auc:.3f})', linewidth=2)

ax1.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
ax1.set_title('ROC Curves Comparison')
ax1.legend(loc='lower right')

# Precision-Recall Curves
ax2 = axes[1]
for name, clf in classifiers.items():
    y_prob = clf.predict_proba(X_test_b)[:, 1]
    precision, recall, _ = precision_recall_curve(y_test_b, y_prob)
    ax2.plot(recall, precision, label=name, linewidth=2)

ax2.set_xlabel('Recall')
ax2.set_ylabel('Precision')
ax2.set_title('Precision-Recall Curves')
ax2.legend(loc='lower left')

plt.tight_layout()
plt.show()

---

## 5. Cross-Validation

### 5.1 K-Fold Cross-Validation

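Divides the data into k roughly equal folds. Each fold serves once as the held-out set (labelled `[Test]` in the diagram below) while the remaining k-1 folds are used for training, so the model is trained and scored k times.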

```
Fold 1: [Test] [Train] [Train] [Train] [Train]
Fold 2: [Train] [Test] [Train] [Train] [Train]
Fold 3: [Train] [Train] [Test] [Train] [Train]
Fold 4: [Train] [Train] [Train] [Test] [Train]
Fold 5: [Train] [Train] [Train] [Train] [Test]
```
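
The rotation in the diagram can be sketched as a small index generator. `k_fold_indices` is a hypothetical helper that produces contiguous, unshuffled folds; scikit-learn's `KFold` is what the notebook uses in practice.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) lists for k contiguous folds, no shuffling."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: validate on {val_idx}, train on {len(train_idx)} samples")
```

Each sample appears in exactly one validation fold and in the training set of the other k-1 folds.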

### 5.2 Cross-Validation Score

$$\text{CV Score} = \frac{1}{k}\sum_{i=1}^{k} \text{Score}_i$$
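
As a quick numeric instance of this formula, the sketch below averages five invented fold scores and reports their spread. The scores are hypothetical, not results from the notebook's data.

```python
from statistics import mean, pstdev

# Hypothetical R² values, one per fold
fold_scores = [0.82, 0.79, 0.85, 0.80, 0.84]

cv_score = mean(fold_scores)   # (1/k) * sum of the fold scores
spread = pstdev(fold_scores)   # population std, the default of NumPy's .std()

print(f"CV score: {cv_score:.4f} (+/- {2 * spread:.4f})")
```

Reporting the mean together with the spread across folds indicates how much the estimate depends on the particular split.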

### 5.3 Advantages
- Every sample is used for both training and validation (in different folds)
- Provides a more robust estimate of model performance
- Helps detect overfitting

In [None]:
# Cross-validation demonstration
from sklearn.model_selection import cross_val_score, cross_validate

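# Using our regression data
# Note: fitting the scaler on all of X before cross-validation lets each
# validation fold influence the scaling statistics (mild data leakage).
# A fresh scaler is used so the one fitted on X_train above stays untouched.
X_scaled = StandardScaler().fit_transform(X)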

# Simple cross-validation
cv_scores = cross_val_score(LinearRegression(), X_scaled, y, cv=5, scoring='r2')

print("5-Fold Cross-Validation Results:")
print(f"  Individual fold scores: {cv_scores}")
print(f"  Mean R²: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")

# Multiple metrics at once
scoring = ['r2', 'neg_mean_squared_error', 'neg_mean_absolute_error']
cv_results = cross_validate(
    LinearRegression(), X_scaled, y, cv=5, 
    scoring=scoring, return_train_score=True
)

print("\nCross-Validation with Multiple Metrics:")
results_df = pd.DataFrame({
    'Metric': ['R²', 'MSE', 'MAE'],
    'Train Mean': [
        cv_results['train_r2'].mean(),
        -cv_results['train_neg_mean_squared_error'].mean(),
        -cv_results['train_neg_mean_absolute_error'].mean()
    ],
    'Test Mean': [
        cv_results['test_r2'].mean(),
        -cv_results['test_neg_mean_squared_error'].mean(),
        -cv_results['test_neg_mean_absolute_error'].mean()
    ],
    'Test Std': [
        cv_results['test_r2'].std(),
        cv_results['test_neg_mean_squared_error'].std(),
        cv_results['test_neg_mean_absolute_error'].std()
    ]
})
print(results_df.to_string(index=False))

In [None]:
# Visualize cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fig, ax = plt.subplots(figsize=(12, 6))

# Create a visual representation of the folds
n_samples = len(X)
indices = np.arange(n_samples)

for fold_idx, (train_idx, val_idx) in enumerate(kfold.split(X)):
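    # Create array for this fold (0 = train, 1 = validation)
    fold_array = np.zeros(n_samples)
    fold_array[val_idx] = 1
    
    # Reversed colormap: 0 (train) maps to blue, 1 (validation) to red,
    # which matches the legend colors below
    ax.scatter(indices, [fold_idx] * n_samples, c=fold_array,
               cmap='RdYlBu_r', vmin=0, vmax=1, s=10, alpha=0.7)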

ax.set_xlabel('Sample Index')
ax.set_ylabel('Fold')
ax.set_title('K-Fold Cross-Validation Splits')
ax.set_yticks(range(5))
ax.set_yticklabels([f'Fold {i+1}' for i in range(5)])

# Add legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='#4575b4', label='Training'),
    Patch(facecolor='#d73027', label='Validation')
]
ax.legend(handles=legend_elements, loc='upper right')

plt.tight_layout()
plt.show()

---

## 6. Learning Curves

### 6.1 What are Learning Curves?

Learning curves show how training and validation scores change as the **amount of training data** increases.

### 6.2 Diagnosing Problems

| Pattern | Diagnosis | Solution |
|---------|-----------|----------|
| Both scores low | **High Bias** (Underfitting) | More features, more complex model |
| Train high, Val low | **High Variance** (Overfitting) | More data, regularization |
| Both converge high | **Good Fit** | Model is appropriate |
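
The table can be read as a rough decision rule applied to the scores at the right-hand end of a learning curve. The sketch below encodes it; `diagnose_fit` and its thresholds `good` and `max_gap` are hypothetical choices for illustration, since what counts as "high" or "low" depends on the problem.

```python
def diagnose_fit(train_score, val_score, good=0.8, max_gap=0.1):
    """Map final learning-curve scores to a diagnosis.

    `good` and `max_gap` are illustrative thresholds, not standard values.
    """
    if train_score < good and val_score < good:
        return "high bias (underfitting)"
    if train_score >= good and train_score - val_score > max_gap:
        return "high variance (overfitting)"
    if train_score >= good and val_score >= good:
        return "good fit"
    return "inconclusive"


print(diagnose_fit(0.55, 0.50))  # high bias (underfitting)
print(diagnose_fit(0.99, 0.60))  # high variance (overfitting)
print(diagnose_fit(0.90, 0.87))  # good fit
```

A rule like this is only a starting point; the shape of the curves (whether the gap is still closing as data is added) carries information that two numbers cannot.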

In [None]:
def plot_learning_curve(estimator, X, y, title, ax=None, cv=5):
    """Plot learning curve for a given estimator."""
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='r2'
    )
    
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)
    
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 6))
    
    ax.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, 
                    alpha=0.1, color='blue')
    ax.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, 
                    alpha=0.1, color='orange')
    ax.plot(train_sizes, train_mean, 'o-', color='blue', label='Training Score')
    ax.plot(train_sizes, val_mean, 'o-', color='orange', label='Cross-Val Score')
    
    ax.set_xlabel('Training Set Size')
    ax.set_ylabel('R² Score')
    ax.set_title(title)
    ax.legend(loc='lower right')
    ax.grid(True)
    
    return ax

# Compare different models
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Simple model (potential underfitting)
plot_learning_curve(
    LinearRegression(), X_scaled, y,
    'Linear Regression\n(Potential High Bias)', axes[0]
)

# Moderate complexity
plot_learning_curve(
    Ridge(alpha=1.0), X_scaled, y,
    'Ridge Regression\n(Balanced)', axes[1]
)

# Complex model (potential overfitting)
plot_learning_curve(
    DecisionTreeRegressor(max_depth=None, random_state=42), X_scaled, y,
    'Decision Tree (Unlimited Depth)\n(Potential High Variance)', axes[2]
)

plt.tight_layout()
plt.show()

---

## 7. Validation Curves

### 7.1 What are Validation Curves?

Validation curves show how training and validation scores change as a **hyperparameter** varies.

### 7.2 Uses
- Find optimal hyperparameter values
- Detect underfitting/overfitting regions
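
Finding the optimal value amounts to locating the highest mean validation score and reading off the parameter at the same position. A minimal sketch, with a hypothetical helper `best_param` and invented scores:

```python
def best_param(param_values, mean_val_scores):
    """Return (parameter value, score) at the highest mean validation score."""
    best_index = max(range(len(mean_val_scores)), key=mean_val_scores.__getitem__)
    return param_values[best_index], mean_val_scores[best_index]


depths = [2, 4, 6, 8, 10]                  # hypothetical max_depth grid
mean_val = [0.55, 0.71, 0.78, 0.74, 0.69]  # hypothetical mean CV scores

depth, score = best_param(depths, mean_val)
print(f"Best max_depth: {depth} (mean CV score {score:.2f})")  # 6, 0.78
```

The lookup goes through the position of the best score rather than the parameter value itself, because a parameter grid does not have to start at 1 or use a step of 1.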

In [None]:
# Validation curve for Decision Tree depth
param_range = np.arange(1, 20)

train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=42), X_scaled, y,
    param_name='max_depth', param_range=param_range,
    cv=5, scoring='r2'
)

train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)

fig, ax = plt.subplots(figsize=(10, 6))

ax.fill_between(param_range, train_mean - train_std, train_mean + train_std, 
                alpha=0.1, color='blue')
ax.fill_between(param_range, val_mean - val_std, val_mean + val_std, 
                alpha=0.1, color='orange')
ax.plot(param_range, train_mean, 'o-', color='blue', label='Training Score', linewidth=2)
ax.plot(param_range, val_mean, 'o-', color='orange', label='Cross-Val Score', linewidth=2)

# Find optimal depth
optimal_depth = param_range[np.argmax(val_mean)]
ax.axvline(x=optimal_depth, color='r', linestyle='--', label=f'Optimal Depth = {optimal_depth}')

ax.set_xlabel('Max Depth')
ax.set_ylabel('R² Score')
ax.set_title('Validation Curve for Decision Tree')
ax.legend(loc='lower right')
ax.grid(True)

# Annotate regions
ax.annotate('Underfitting\n(High Bias)', xy=(3, 0.4), fontsize=10, ha='center')
ax.annotate('Overfitting\n(High Variance)', xy=(15, 0.4), fontsize=10, ha='center')

plt.tight_layout()
plt.show()

print(f"Optimal max_depth: {optimal_depth}")
print(f"Best CV Score: {val_mean.max():.4f}")  # index-free: works for any param_range

---

## 8. Summary

### Key Takeaways

1. **Data Splitting**: Always use a separate test set for final evaluation
2. **Regression Metrics**: Choose based on problem requirements (RMSE for penalizing large errors, MAE for robustness)
3. **Classification Metrics**: Consider class imbalance; don't rely on accuracy alone
4. **Cross-Validation**: Provides more robust performance estimates
5. **Learning Curves**: Diagnose bias vs variance problems
6. **Validation Curves**: Find optimal hyperparameters

### Metric Selection Guide

| Problem Type | Metric | When to Use |
|--------------|--------|-------------|
| Regression | RMSE | Penalize large errors |
| Regression | MAE | Robust to outliers |
| Regression | R² | Compare model explanatory power |
| Classification | Precision | Minimize false positives |
| Classification | Recall | Minimize false negatives |
| Classification | F1 | Balance precision and recall |
| Classification | AUC-ROC | Overall discrimination ability |

---

## 9. Exercises

### Exercise 1: Regression Evaluation
Train multiple regression models (Linear, Ridge, Lasso) on the house prices data. Compare them using MAE, RMSE, and R² on a held-out test set.

### Exercise 2: Cross-Validation Comparison
Compare 5-fold and 10-fold cross-validation results for a classifier. How does the variance of the scores differ?

### Exercise 3: Learning Curve Analysis
Generate learning curves for a RandomForestClassifier on the iris dataset. Interpret what the curves tell you about bias and variance.

### Exercise 4: ROC Curve Interpretation
Create a binary classification problem and plot ROC curves for at least 3 different classifiers. Which one performs best?

### Exercise 5: Hyperparameter Tuning
Use validation curves to find the optimal `C` parameter for LogisticRegression on the iris dataset.

### Exercise 6: Imbalanced Data
Create an imbalanced dataset (90% one class, 10% other). Show why accuracy is misleading and which metrics are more appropriate.

### Exercise 7: Stratified Sampling
Compare regular train_test_split with stratified splitting for an imbalanced classification problem. How do the class distributions differ?

---

## 10. Further Reading

- [scikit-learn Model Evaluation](https://scikit-learn.org/stable/modules/model_evaluation.html)
- [scikit-learn Cross-Validation](https://scikit-learn.org/stable/modules/cross_validation.html)
- [Understanding ROC Curves](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)
- [Precision-Recall Trade-off](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html)
- [Learning Curves Guide](https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html)