# Lab 3: Model Evaluation and Validation

Building a model is only half the battle. Evaluating its performance properly is crucial for understanding whether it will work well on new, unseen data.

## Learning Objectives

By the end of this lab, you will:
- Use various evaluation metrics for classification and regression
- Understand and implement cross-validation
- Recognize and address overfitting and underfitting
- Apply regularization techniques
- Tune hyperparameters systematically
- Use learning curves to diagnose model problems
- Build robust, well-validated models

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from typing import Tuple, List
from sklearn.datasets import load_breast_cancer, make_classification, fetch_california_housing
from sklearn.model_selection import (
    train_test_split, cross_val_score, KFold, StratifiedKFold,
    GridSearchCV, RandomizedSearchCV, learning_curve, validation_curve
)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, roc_auc_score,
    mean_squared_error, mean_absolute_error, r2_score
)

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Part 1: Classification Metrics

Accuracy is not always the best metric: on imbalanced data, a model that always predicts the majority class can still score high. Let's explore various classification metrics.

### Confusion Matrix

|                | Predicted Positive | Predicted Negative |
|----------------|-------------------|-------------------|
| **Actual Positive** | True Positive (TP) | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN) |

### Key Metrics

- **Accuracy**: $\frac{TP + TN}{TP + TN + FP + FN}$ - Overall correctness
- **Precision**: $\frac{TP}{TP + FP}$ - Of predicted positives, how many are correct?
- **Recall (Sensitivity)**: $\frac{TP}{TP + FN}$ - Of actual positives, how many did we find?
- **F1-Score**: $2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ - Harmonic mean of precision and recall
- **Specificity**: $\frac{TN}{TN + FP}$ - Of actual negatives, how many did we correctly identify?
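
As a quick check of the formulas above, here is a minimal sketch that computes each metric directly from the four confusion-matrix counts. The helper name `metrics_from_counts` and the counts in the usage lines are made up for illustration; they are not taken from the breast cancer data used in this lab.

```python
def metrics_from_counts(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute the metrics listed above from raw confusion-matrix counts.

    Hypothetical helper for illustration; assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }


# Made-up counts: 50 actual positives and 50 actual negatives
example_metrics = metrics_from_counts(tp=40, fn=10, fp=5, tn=45)
for name, value in example_metrics.items():
    print(f"{name:12s}: {value:.3f}")
```

Note that precision and recall ignore true negatives entirely, which is why they stay informative when negatives dominate the dataset.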

In [None]:
# Load data
data = load_breast_cancer()
X, y = data.data, data.target

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)[:, 1]

# Calculate metrics
print("Classification Metrics:")
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1-Score:  {f1_score(y_test, y_pred):.3f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_proba):.3f}")

print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

In [None]:
# Visualize confusion matrix
cm = confusion_matrix(y_test, y_pred)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
           xticklabels=data.target_names, yticklabels=data.target_names)
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix')

# Normalized confusion matrix
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_norm, annot=True, fmt='.2f', cmap='Blues', ax=axes[1],
           xticklabels=data.target_names, yticklabels=data.target_names)
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
axes[1].set_title('Normalized Confusion Matrix')

plt.tight_layout()
plt.show()

### ROC Curve and AUC

The **ROC (Receiver Operating Characteristic)** curve plots:
- True Positive Rate (Recall) on y-axis
- False Positive Rate on x-axis

**AUC (Area Under Curve)** measures the entire area under the ROC curve:
- AUC = 1.0: Perfect classifier
- AUC = 0.5: Random guessing
- AUC < 0.5: Worse than random
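
One way to build intuition for AUC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by brute force over all positive/negative pairs. `pairwise_auc` is a hypothetical helper and the four scores are toy values.

```python
from typing import Sequence


def pairwise_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Hypothetical teaching helper: O(n_pos * n_neg), so only for small inputs.
    """
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Toy data: 3 of the 4 positive/negative pairs are ordered correctly
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(f"Pairwise AUC: {toy_auc:.2f}")
```

Under this view, an AUC below 0.5 means the model ranks negatives above positives more often than not.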

In [None]:
# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, linewidth=2, label=f'ROC Curve (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("The closer the curve is to the top-left corner, the better the model.")
print(f"AUC = {auc:.3f} (1.0 is a perfect ranking, 0.5 is random guessing).")

### Precision-Recall Trade-off

By adjusting the classification threshold, we can trade precision for recall:
- **High threshold** → Usually higher precision, lower recall (fewer false positives, more false negatives)
- **Low threshold** → Higher recall, usually lower precision (fewer false negatives, more false positives)

Choose based on the cost of false positives vs false negatives.
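
To make the cost argument concrete, the sketch below picks the threshold that minimizes a total misclassification cost. Everything here is an assumption for illustration: the helper names, the toy labels and scores, the candidate thresholds, and the cost values.

```python
from typing import Sequence


def total_cost(y_true: Sequence[int], y_score: Sequence[float],
               threshold: float, fp_cost: float, fn_cost: float) -> float:
    """Total cost of errors when predicting positive for score >= threshold."""
    fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, y_score) if s < threshold and t == 1)
    return fp * fp_cost + fn * fn_cost


def best_threshold(y_true: Sequence[int], y_score: Sequence[float],
                   candidates: Sequence[float],
                   fp_cost: float, fn_cost: float) -> float:
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates,
               key=lambda th: total_cost(y_true, y_score, th, fp_cost, fn_cost))


# Toy data (made up): labels with the model's predicted probabilities
toy_labels = [0, 0, 1, 0, 1, 0, 1, 1]
toy_scores = [0.05, 0.20, 0.35, 0.45, 0.55, 0.65, 0.80, 0.95]
candidate_thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]

# Missing a positive is 5x worse than a false alarm: a low threshold wins
recall_first = best_threshold(toy_labels, toy_scores, candidate_thresholds,
                              fp_cost=1.0, fn_cost=5.0)
# A false alarm is 5x worse than a miss: a high threshold wins
precision_first = best_threshold(toy_labels, toy_scores, candidate_thresholds,
                                 fp_cost=5.0, fn_cost=1.0)

print(f"Costly false negatives -> threshold {recall_first}")
print(f"Costly false positives -> threshold {precision_first}")
```

In a medical screening setting the first cost profile is the more plausible one; in spam filtering the second usually is.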

In [None]:
# Precision-Recall trade-off
thresholds_test = np.linspace(0, 1, 100)
precisions = []
recalls = []

for threshold in thresholds_test:
    y_pred_thresh = (y_proba >= threshold).astype(int)
    # zero_division=1: when no sample is predicted positive, precision is
    # undefined; report it as 1.0, a common convention for PR curves.
    precisions.append(precision_score(y_test, y_pred_thresh, zero_division=1))
    recalls.append(recall_score(y_test, y_pred_thresh))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Precision and Recall vs Threshold
axes[0].plot(thresholds_test, precisions, label='Precision', linewidth=2)
axes[0].plot(thresholds_test, recalls, label='Recall', linewidth=2)
axes[0].axvline(x=0.5, color='k', linestyle='--', alpha=0.5, label='Default threshold')
axes[0].set_xlabel('Threshold')
axes[0].set_ylabel('Score')
axes[0].set_title('Precision and Recall vs Threshold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Precision-Recall Curve
axes[1].plot(recalls, precisions, linewidth=2)
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Part 2: Cross-Validation

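A single train-test split can be misleading because the score depends on which samples happen to land in the test set. **Cross-validation** averages over several splits and provides a more robust estimate of model performance.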

### k-Fold Cross-Validation

1. Split data into k folds
2. For each fold:
   - Train on k-1 folds
   - Validate on the remaining fold
3. Average the k validation scores

### Stratified k-Fold

Maintains class distribution in each fold (important for imbalanced datasets).
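
The idea behind stratification can be sketched in a few lines: deal out the samples of each class to the folds in round-robin order, so every fold ends up with a similar class mix. This is a simplified illustration under stated assumptions (no shuffling, hypothetical helper name `stratified_fold_ids`, made-up labels); it is not how scikit-learn's `StratifiedKFold` is implemented.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def stratified_fold_ids(labels: Sequence[int], n_splits: int) -> List[int]:
    """Assign each sample a fold id so that every fold gets a similar class mix."""
    fold_of = [0] * len(labels)
    seen_per_class: Dict[int, int] = defaultdict(int)
    for i, label in enumerate(labels):
        # The k-th sample of a class goes to fold k mod n_splits
        fold_of[i] = seen_per_class[label] % n_splits
        seen_per_class[label] += 1
    return fold_of


# Imbalanced toy labels: 16 samples of class 0 and 4 of class 1
toy_y = [0] * 16 + [1] * 4
fold_ids = stratified_fold_ids(toy_y, n_splits=4)

for fold in range(4):
    counts = Counter(toy_y[i] for i, f in enumerate(fold_ids) if f == fold)
    print(f"Fold {fold}: {dict(counts)}")
```

With plain k-fold on the same labels, an unlucky split could leave a fold with no minority-class samples at all, making recall on that fold undefined.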

In [None]:
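from typing import Iterator, Optional

class CrossValidator:
    """
    Custom k-Fold Cross-Validation implementation.
    """
    
    def __init__(self, n_splits: int = 5, random_state: Optional[int] = None):
        self.n_splits = n_splits
        self.random_state = random_state  # seed for reproducible shuffling
    
    def split(self, X: np.ndarray) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
        """
        Yield (train_indices, test_indices) pairs for k-fold CV.
        """
        n_samples = len(X)
        # Shuffle once with a seeded generator so the folds are reproducible
        indices = np.random.default_rng(self.random_state).permutation(n_samples)
        
        fold_size = n_samples // self.n_splits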
        
        for i in range(self.n_splits):
            start = i * fold_size
            end = start + fold_size if i < self.n_splits - 1 else n_samples
            
            test_indices = indices[start:end]
            train_indices = np.concatenate([indices[:start], indices[end:]])
            
            yield train_indices, test_indices
    
    def cross_val_score(self, model, X: np.ndarray, y: np.ndarray) -> List[float]:
        """
        Perform cross-validation and return scores.
        """
        scores = []
        
        for train_idx, test_idx in self.split(X):
            X_train_fold, X_test_fold = X[train_idx], X[test_idx]
            y_train_fold, y_test_fold = y[train_idx], y[test_idx]
            
            # Train model
            model.fit(X_train_fold, y_train_fold)
            
            # Evaluate
            y_pred = model.predict(X_test_fold)
            score = accuracy_score(y_test_fold, y_pred)
            scores.append(score)
        
        return scores

In [None]:
# Test custom cross-validation
cv = CrossValidator(n_splits=5)
model = LogisticRegression(max_iter=1000, random_state=42)

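# Scale the entire dataset up front to keep the example short.
# Caveat: the scaler then sees every fold's validation data (mild leakage);
# in practice, wrap the scaler and model in a Pipeline so scaling is fit per fold.
X_scaled = StandardScaler().fit_transform(X)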

scores = cv.cross_val_score(model, X_scaled, y)

print("Custom 5-Fold Cross-Validation Scores:")
for i, score in enumerate(scores, 1):
    print(f"Fold {i}: {score:.3f}")
print(f"\nMean: {np.mean(scores):.3f} (+/- {np.std(scores):.3f})")
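
Scaling the full dataset before cross-validation, as in the cell above, lets the scaler see each fold's validation samples. A leakage-free alternative is to put the scaler and the model in a scikit-learn `Pipeline`, so the scaler is refit on the training portion of every fold. The following is a minimal sketch, not the lab's reference solution; the variable names `X_raw`, `y_raw`, `pipe`, and `pipe_scores` are chosen here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Unscaled features: the pipeline handles scaling inside each fold
X_raw, y_raw = load_breast_cancer(return_X_y=True)

# StandardScaler is fit on the training folds only, then applied to the held-out fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv_splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe_scores = cross_val_score(pipe, X_raw, y_raw, cv=cv_splitter, scoring="accuracy")

print(f"Pipeline CV accuracy: {pipe_scores.mean():.3f} (+/- {pipe_scores.std():.3f})")
```

On a dataset like this one the difference from pre-scaling is likely small, but the habit matters for preprocessing steps that depend more strongly on the data, such as feature selection.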

In [None]:
# Compare with scikit-learn
from sklearn.model_selection import cross_val_score as sk_cv_score

# Regular k-fold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores_kfold = sk_cv_score(model, X_scaled, y, cv=kfold, scoring='accuracy')

# Stratified k-fold
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_stratified = sk_cv_score(model, X_scaled, y, cv=stratified_kfold, scoring='accuracy')

print("Scikit-learn Cross-Validation:")
print(f"\nk-Fold CV:")
print(f"Mean: {scores_kfold.mean():.3f} (+/- {scores_kfold.std():.3f})")
print(f"\nStratified k-Fold CV:")
print(f"Mean: {scores_stratified.mean():.3f} (+/- {scores_stratified.std():.3f})")

# Visualize
plt.figure(figsize=(10, 6))
positions = [1, 2]
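bp = plt.boxplot([scores_kfold, scores_stratified], positions=positions,
                 patch_artist=True)
# Label the boxes via xticks; boxplot's `labels` argument is deprecated in recent Matplotlib releases
plt.xticks(positions, ['k-Fold', 'Stratified k-Fold'])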
for patch in bp['boxes']:
    patch.set_facecolor('lightblue')
plt.ylabel('Accuracy')
plt.title('Cross-Validation Comparison')
plt.grid(True, alpha=0.3)
plt.show()

## Part 3: Overfitting and Underfitting

- **Underfitting**: Model is too simple, performs poorly on both training and test data
- **Overfitting**: Model is too complex, performs well on training but poorly on test data
- **Good fit**: Model generalizes well to unseen data

### Bias-Variance Trade-off

- **High Bias** (Underfitting): Model makes strong assumptions, misses relevant patterns
- **High Variance** (Overfitting): Model is too sensitive to training data, captures noise
- **Optimal**: Balance between bias and variance

In [None]:
# Demonstrate overfitting with decision trees
X, y = make_classification(n_samples=500, n_features=2, n_informative=2, 
                          n_redundant=0, n_clusters_per_class=1, 
                          random_state=42, flip_y=0.1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

max_depths = [1, 2, 3, 5, 10, 20]
train_scores = []
test_scores = []

for depth in max_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    
    train_scores.append(model.score(X_train, y_train))
    test_scores.append(model.score(X_test, y_test))

plt.figure(figsize=(10, 6))
plt.plot(max_depths, train_scores, 'o-', label='Training Score', linewidth=2)
plt.plot(max_depths, test_scores, 'o-', label='Test Score', linewidth=2)
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.title('Training vs Test Scores (Overfitting Demonstration)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("Observations:")
print("- Shallow trees (depth 1-2): Underfit - low scores on both sets")
print("- Medium trees (depth 3-5): Good fit - balanced performance")
print("- Deep trees (depth 10+): Overfit - high training score, lower test score")

## Part 4: Regularization

Regularization adds a penalty term to prevent overfitting by constraining model complexity.

### Ridge Regression (L2 Regularization)

$$Loss = MSE + \alpha \sum_{i=1}^{n} w_i^2$$

- Penalizes large weights
- Shrinks coefficients toward zero
- All features retained

### Lasso Regression (L1 Regularization)

$$Loss = MSE + \alpha \sum_{i=1}^{n} |w_i|$$

- Can set coefficients exactly to zero
- Performs feature selection
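
To see why the L1 penalty produces exact zeros while the L2 penalty only shrinks, it helps to look at a single coefficient in isolation. The sketch below is a one-dimensional simplification (an assumption, not the full multi-feature problem): `z` stands for the unpenalized estimate, `lam` for the penalty strength, and both function names are made up for this example.

```python
def soft_threshold(z: float, lam: float) -> float:
    """Minimizer of 0.5 * (w - z)**2 + lam * abs(w): the L1 (Lasso-style) case."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small estimates are set exactly to zero


def ridge_shrink(z: float, lam: float) -> float:
    """Minimizer of 0.5 * (w - z)**2 + lam * w**2: the L2 (Ridge-style) case."""
    return z / (1.0 + 2.0 * lam)  # scaled toward zero, never exactly zero unless z is


lam = 0.5
for z in [2.0, 0.3, -0.4, -1.5]:
    print(f"z = {z:5.2f}  L1 -> {soft_threshold(z, lam):5.2f}   "
          f"L2 -> {ridge_shrink(z, lam):5.2f}")
```

The L1 solution subtracts a fixed amount and clips at zero, so weak coefficients vanish; the L2 solution divides by a constant, so every non-zero estimate stays non-zero.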

### Elastic Net

$$Loss = MSE + \alpha_1 \sum_{i=1}^{n} |w_i| + \alpha_2 \sum_{i=1}^{n} w_i^2$$

- Combines L1 and L2 penalties
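
scikit-learn's `ElasticNet` does not take $\alpha_1$ and $\alpha_2$ directly; it uses a single strength `alpha` and a mixing weight `l1_ratio`. The sketch below converts between the two parametrizations. It assumes that MSE in the formula above means the sum of squared residuals divided by $n$, and relies on scikit-learn's documented objective, which uses $\frac{1}{2n}$ on the data term. The helper name `to_sklearn_elastic_net` is hypothetical.

```python
from typing import Tuple


def to_sklearn_elastic_net(alpha_1: float, alpha_2: float) -> Tuple[float, float]:
    """Map (alpha_1, alpha_2) from the formula above to (alpha, l1_ratio).

    scikit-learn's ElasticNet minimizes
        1/(2n) * RSS + alpha * l1_ratio * ||w||_1
                     + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2
    Halving the formula above (MSE = RSS / n, an assumption) and matching
    terms gives alpha * l1_ratio = alpha_1 / 2 and alpha * (1 - l1_ratio) = alpha_2.
    """
    if alpha_1 < 0 or alpha_2 < 0 or alpha_1 + alpha_2 == 0:
        raise ValueError("penalties must be non-negative and not both zero")
    alpha = alpha_1 / 2.0 + alpha_2
    l1_ratio = (alpha_1 / 2.0) / alpha
    return alpha, l1_ratio


en_alpha, en_l1_ratio = to_sklearn_elastic_net(alpha_1=0.2, alpha_2=0.4)
print(f"alpha = {en_alpha:.3f}, l1_ratio = {en_l1_ratio:.3f}")
# These values could then be passed as ElasticNet(alpha=en_alpha, l1_ratio=en_l1_ratio)
```

Setting `l1_ratio=1` recovers a pure L1 penalty and `l1_ratio=0` a pure L2 penalty.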

In [None]:
# Load housing data
housing = fetch_california_housing()
X, y = housing.data, housing.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Test different alpha values
alphas = np.logspace(-3, 3, 50)

ridge_train_scores = []
ridge_test_scores = []
lasso_train_scores = []
lasso_test_scores = []

for alpha in alphas:
    # Ridge
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_scaled, y_train)
    ridge_train_scores.append(ridge.score(X_train_scaled, y_train))
    ridge_test_scores.append(ridge.score(X_test_scaled, y_test))
    
    # Lasso
    lasso = Lasso(alpha=alpha, max_iter=5000)
    lasso.fit(X_train_scaled, y_train)
    lasso_train_scores.append(lasso.score(X_train_scaled, y_train))
    lasso_test_scores.append(lasso.score(X_test_scaled, y_test))

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Ridge
axes[0].semilogx(alphas, ridge_train_scores, label='Training', linewidth=2)
axes[0].semilogx(alphas, ridge_test_scores, label='Test', linewidth=2)
axes[0].set_xlabel('Alpha (Regularization Strength)')
axes[0].set_ylabel('R² Score')
axes[0].set_title('Ridge Regression')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Lasso
axes[1].semilogx(alphas, lasso_train_scores, label='Training', linewidth=2)
axes[1].semilogx(alphas, lasso_test_scores, label='Test', linewidth=2)
axes[1].set_xlabel('Alpha (Regularization Strength)')
axes[1].set_ylabel('R² Score')
axes[1].set_title('Lasso Regression')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Observations:")
print("- Low alpha: Little regularization, potential overfitting")
print("- High alpha: Strong regularization, potential underfitting")
print("- Optimal alpha: Best held-out score; select it with cross-validation rather than on the test set")

In [None]:
# Compare feature selection
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1, max_iter=5000)

ridge.fit(X_train_scaled, y_train)
lasso.fit(X_train_scaled, y_train)

# Compare coefficients
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].barh(housing.feature_names, ridge.coef_)
axes[0].set_xlabel('Coefficient Value')
axes[0].set_title('Ridge Coefficients (All Non-Zero)')
axes[0].grid(True, alpha=0.3)

axes[1].barh(housing.feature_names, lasso.coef_)
axes[1].set_xlabel('Coefficient Value')
axes[1].set_title(f'Lasso Coefficients ({np.sum(lasso.coef_ != 0)} Non-Zero)')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Feature Selection with Lasso:")
for name, coef in zip(housing.feature_names, lasso.coef_):
    if coef != 0:
        print(f"{name:15s}: {coef:8.3f}")
    else:
        print(f"{name:15s}: Eliminated")

## Part 5: Hyperparameter Tuning

Hyperparameters are settings chosen before training that control the learning process; unlike model parameters, they are not learned from the data.

### Grid Search
- Exhaustively try all combinations
- Guaranteed to find best combination in grid
- Can be computationally expensive

### Random Search
- Sample random combinations
- Often finds good solutions faster
- Better for large parameter spaces
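
The difference between the two strategies is easy to see without training any model. The sketch below enumerates a small grid with `itertools.product` and then draws random combinations from wider integer ranges using the standard library. The parameter names mirror decision tree hyperparameters; the variable names, the seed, and the sample count of 10 are arbitrary choices for illustration.

```python
import itertools
import random

# Grid search: every combination of the listed values
param_grid_demo = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
names = list(param_grid_demo)
grid_combos = [
    dict(zip(names, values))
    for values in itertools.product(*param_grid_demo.values())
]
print(f"Grid search evaluates {len(grid_combos)} combinations")  # 4 * 3 * 3

# Random search: a fixed budget of draws from wider ranges
# (random.randint includes both endpoints)
rng = random.Random(42)
random_combos = [
    {
        "max_depth": rng.randint(3, 19),
        "min_samples_split": rng.randint(2, 19),
        "min_samples_leaf": rng.randint(1, 9),
    }
    for _ in range(10)
]
print(f"Random search evaluates {len(random_combos)} combinations")
```

The grid's cost multiplies with every added parameter or value, while the random search budget stays whatever you set it to.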

In [None]:
# Load classification data
data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Grid Search
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train_scaled, y_train)

print("Grid Search Results:")
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.3f}")
print(f"Test Score: {grid_search.score(X_test_scaled, y_test):.3f}")
print(f"Total combinations tested: {len(grid_search.cv_results_['params'])}")

In [None]:
# Random Search
from scipy.stats import randint

param_dist = {
    'max_depth': randint(3, 20),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10)
}

random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_dist,
    n_iter=50,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train_scaled, y_train)

print("\nRandom Search Results:")
print(f"Best Parameters: {random_search.best_params_}")
print(f"Best CV Score: {random_search.best_score_:.3f}")
print(f"Test Score: {random_search.score(X_test_scaled, y_test):.3f}")
print(f"Combinations tested: {len(random_search.cv_results_['params'])}")

## Part 6: Learning Curves

Learning curves show how model performance changes with training set size.

**Interpreting Learning Curves:**
- **High bias (underfitting)**: Training and validation scores are both low and converge
- **High variance (overfitting)**: Large gap between training and validation scores
- **Good fit**: Both scores are high and close together
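
The interpretation rules above can be turned into a rough rule of thumb applied to the scores at the largest training size. This is only a sketch: the function name `diagnose_fit` and the cutoffs `low_score=0.85` and `max_gap=0.05` are arbitrary assumptions, and sensible values depend on the problem and the metric.

```python
def diagnose_fit(train_score: float, val_score: float,
                 low_score: float = 0.85, max_gap: float = 0.05) -> str:
    """Rough diagnosis from final learning-curve scores.

    Hypothetical heuristic: the thresholds are illustrative, not standard values.
    """
    gap = train_score - val_score
    if gap > max_gap:
        # Training score is far above validation score
        return "high variance (overfitting)"
    if train_score < low_score:
        # Scores have converged, but at a low level
        return "high bias (underfitting)"
    return "good fit"


print(diagnose_fit(train_score=1.00, val_score=0.90))
print(diagnose_fit(train_score=0.80, val_score=0.78))
print(diagnose_fit(train_score=0.97, val_score=0.95))
```

A single pair of numbers discards the shape of the curve, so treat the output as a prompt to look at the plot rather than as a verdict.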

In [None]:
def plot_learning_curves(model, X, y, title="Learning Curves"):
    """
    Plot learning curves for a model.
    """
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='accuracy'
    )
    
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)
    
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', label='Training Score', linewidth=2)
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
    plt.plot(train_sizes, val_mean, 'o-', label='Validation Score', linewidth=2)
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1)
    plt.xlabel('Training Set Size')
    plt.ylabel('Accuracy')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Compare different models
models = [
    (DecisionTreeClassifier(max_depth=2, random_state=42), "Shallow Tree (Underfit)"),
    (DecisionTreeClassifier(max_depth=5, random_state=42), "Medium Tree (Good Fit)"),
    (DecisionTreeClassifier(max_depth=20, random_state=42), "Deep Tree (Overfit)")
]

for model, title in models:
    plot_learning_curves(model, X_train_scaled, y_train, title)

## Part 7: Validation Curves

Validation curves show how model performance changes with a single hyperparameter.

In [None]:
# Validation curve for max_depth
param_range = range(1, 21)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42),
    X_train_scaled, y_train,
    param_name='max_depth',
    param_range=param_range,
    cv=5,
    scoring='accuracy'
)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
val_mean = np.mean(val_scores, axis=1)
val_std = np.std(val_scores, axis=1)

plt.figure(figsize=(10, 6))
plt.plot(param_range, train_mean, 'o-', label='Training Score', linewidth=2)
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, alpha=0.1)
plt.plot(param_range, val_mean, 'o-', label='Validation Score', linewidth=2)
plt.fill_between(param_range, val_mean - val_std, val_mean + val_std, alpha=0.1)
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.title('Validation Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

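# Index by position rather than by depth value, so this stays correct
# even if param_range does not start at 1
best_idx = int(np.argmax(val_mean))
optimal_depth = param_range[best_idx]
print(f"Optimal max_depth: {optimal_depth}")
print(f"Best validation score: {val_mean[best_idx]:.3f}")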

## Key Takeaways

1. **Choose appropriate metrics** for your problem (accuracy, precision, recall, F1, ROC-AUC)
2. **Use cross-validation** for robust performance estimates
3. **Watch for overfitting** - training score much higher than validation score
4. **Regularization** helps prevent overfitting by constraining model complexity
5. **Hyperparameter tuning** can improve performance; select settings with cross-validation and report the final score on held-out test data
6. **Learning curves** diagnose bias-variance issues
7. **Validation curves** help choose optimal hyperparameters

## Exercises

1. **Imbalanced Dataset**: Create a dataset with 95% class 0 and 5% class 1. Show why accuracy is misleading. Use precision, recall, and F1 instead.

2. **Custom Scoring**: Implement custom scoring functions (e.g., weighted F1, custom cost function).

3. **Nested Cross-Validation**: Implement nested CV for unbiased hyperparameter tuning.

4. **Regularization Path**: Plot how coefficients change with regularization strength (regularization path).

5. **Early Stopping**: Implement early stopping for iterative algorithms to prevent overfitting.

6. **Feature Selection**: Compare different feature selection methods (filter, wrapper, embedded).

## Next Steps

In Lab 4, we'll explore:
- Ensemble methods that combine multiple models
- Random Forests and Gradient Boosting
- Advanced techniques for improving model performance

Excellent work! You now know how to properly evaluate and validate machine learning models.