# Chapter 18: Model Evaluation, Validation, and Quality Control

## Introduction
A model is only useful if it works on **new (unseen) data**. This chapter teaches you how to validate your work so your results are trustworthy.

You will learn how to:
- Split data correctly (train/test and train/validation/test)
- Choose evaluation metrics for classification and regression
- Spot overfitting and underfitting
- Understand the bias–variance trade-off
- Run simple sensitivity analysis
- Perform quality assurance checks (especially avoiding data leakage)

**Tools used:** `numpy`, `pandas`, `matplotlib`. Some parts use `scikit-learn` (recommended).

## Learning goals
By the end of this chapter, you will be able to:

1. Explain why validation matters.
2. Split data into train/test and train/validation/test.
3. Use common evaluation metrics and explain what they mean.
4. Detect overfitting and underfitting.
5. Describe the bias–variance trade-off.
6. Perform basic sensitivity analysis.
7. Apply quality control checks to reduce mistakes.

In [None]:
# Imports used throughout the chapter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('default')
np.random.seed(42)

# Optional: scikit-learn (recommended)
SKLEARN_AVAILABLE = True
try:
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline
    from sklearn.metrics import roc_auc_score, roc_curve, auc, precision_recall_curve
except Exception as e:
    SKLEARN_AVAILABLE = False
    SKLEARN_IMPORT_ERROR = str(e)

print('Ready! scikit-learn available:', SKLEARN_AVAILABLE)
if not SKLEARN_AVAILABLE:
    print('Install scikit-learn to run all sections: pip install scikit-learn')
    print('Import error:', SKLEARN_IMPORT_ERROR)

## 18.1 Importance of validation
Validation answers: **Will my model still perform well on data it has never seen before?**

If you evaluate only on training data, you can get a score that looks great from a model that fails on real-world data.

> **Common mistake:** Reporting training accuracy as if it were real-world accuracy.

Validation helps you:
- Detect overfitting
- Compare models fairly
- Choose the right metric for the goal
- Build confidence and trust

## 18.2 Train–test concepts (conceptual level)
A common split is:
- **Train set**: the model learns from this
- **Test set**: the model never sees this until the final evaluation

Often we add:
- **Validation set**: used to choose model settings (hyperparameters)

> **Rule:** Use the test set once at the end. If you keep looking at test results and changing the model, the test set is no longer a fair test.

**Important:** Split first. Then fit preprocessing (scaling, imputation) on training data only.
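
The "split first, then fit preprocessing" rule can be sketched with plain `numpy`. This is a minimal illustration on made-up data (the names `X_toy`, `X_toy_train`, and `X_toy_test` are hypothetical and not part of the chapter's dataset): the mean and standard deviation are learned from the training rows only and then reused for the test rows.

```python
import numpy as np

rng_toy = np.random.default_rng(0)
X_toy = rng_toy.normal(loc=50.0, scale=10.0, size=(200, 3))  # made-up feature matrix

# 1) Split first (simple shuffled 75/25 split)
order = rng_toy.permutation(len(X_toy))
cut = int(0.75 * len(X_toy))
X_toy_train, X_toy_test = X_toy[order[:cut]], X_toy[order[cut:]]

# 2) Learn the scaling statistics from the training rows only
train_mean = X_toy_train.mean(axis=0)
train_std = X_toy_train.std(axis=0)

# 3) Apply the same training statistics to both sets
X_toy_train_scaled = (X_toy_train - train_mean) / train_std
X_toy_test_scaled = (X_toy_test - train_mean) / train_std

print('Train means after scaling:', X_toy_train_scaled.mean(axis=0).round(3))
print('Test means after scaling: ', X_toy_test_scaled.mean(axis=0).round(3))
```

The scaled test means will usually be close to zero but not exactly zero. That is expected: the test set is treated like new data that the scaler has never seen.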

### Example dataset (binary classification)
We will use a synthetic dataset where the target is 0/1.

We purposely make the classes a bit imbalanced to show why accuracy can be misleading.

In [None]:
# Create a dataset (self-contained)
if SKLEARN_AVAILABLE:
    X, y = make_classification(
        n_samples=1200,
        n_features=8,
        n_informative=4,
        n_redundant=2,
        weights=[0.70, 0.30],
        class_sep=1.2,
        random_state=42,
    )
else:
    rng = np.random.default_rng(42)
    X = rng.normal(size=(1200, 8))
    w = np.array([1.4, -1.0, 0.9, 0.0, 0.0, 0.6, 0.0, -0.5])
    logits = X @ w + rng.normal(scale=0.8, size=1200)
    probs = 1 / (1 + np.exp(-logits))
    # Make about 30% positives
    y = (probs > np.quantile(probs, 0.70)).astype(int)

df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
df['target'] = y
df.head()

In [None]:
# Quick class balance check
counts = df['target'].value_counts().sort_index()
props = df['target'].value_counts(normalize=True).sort_index()
summary = pd.DataFrame({'count': counts, 'proportion': props})
summary.index = ['class 0', 'class 1']
summary

### Train–test split (practice)
We split and keep the test set untouched until evaluation.

> **Tip:** Use `stratify=y` for classification to keep similar class ratios in train and test.
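
If scikit-learn is not available, a stratified split can be approximated by splitting each class separately. The sketch below is one possible way to do it, not how `train_test_split` is implemented; the helper name `stratified_split` and the toy labels are assumptions made for illustration.

```python
import numpy as np

def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with roughly equal class ratios in both parts."""
    rng_local = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_parts, test_parts = [], []
    for label in np.unique(labels):
        label_idx = np.flatnonzero(labels == label)  # row positions of this class
        rng_local.shuffle(label_idx)
        n_test = int(round(test_size * len(label_idx)))
        test_parts.append(label_idx[:n_test])
        train_parts.append(label_idx[n_test:])
    train_idx_s = np.concatenate(train_parts)
    test_idx_s = np.concatenate(test_parts)
    rng_local.shuffle(train_idx_s)  # mix the classes again
    rng_local.shuffle(test_idx_s)
    return train_idx_s, test_idx_s

y_labels_toy = np.array([0] * 70 + [1] * 30)  # 30% positives
tr_rows, te_rows = stratified_split(y_labels_toy, test_size=0.25)
print('Train positive rate:', round(float(y_labels_toy[tr_rows].mean()), 3))
print('Test positive rate: ', round(float(y_labels_toy[te_rows].mean()), 3))
```

Both positive rates stay close to the overall 30%, which a plain random shuffle does not guarantee on small datasets.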

In [None]:
# Train–test split (code)
X_all = df.drop(columns=['target']).values
y_all = df['target'].values

if SKLEARN_AVAILABLE:
    X_train, X_test, y_train, y_test = train_test_split(
        X_all, y_all, test_size=0.25, random_state=42, stratify=y_all
    )
else:
    rng = np.random.default_rng(42)
    idx = np.arange(len(y_all))
    rng.shuffle(idx)
    split = int(0.75 * len(idx))
    train_idx, test_idx = idx[:split], idx[split:]
    X_train, X_test = X_all[train_idx], X_all[test_idx]
    y_train, y_test = y_all[train_idx], y_all[test_idx]

print('Train size:', len(y_train), '| Test size:', len(y_test))
print('Train positive rate:', round(float(y_train.mean()), 3))
print('Test positive rate:', round(float(y_test.mean()), 3))

## 18.3 Evaluation metrics
Metrics turn predictions into numbers you can compare.

### Classification (binary)
- **Accuracy**: overall fraction correct
- **Precision**: of predicted positives, how many were truly positive?
- **Recall**: of all actual positives, how many did we catch?
- **F1-score**: balances precision and recall
- **ROC AUC**: measures ranking performance across thresholds

### Regression (numeric target)
- **MAE**: average absolute error
- **RMSE**: penalizes large errors more
- **$R^2$**: fraction of variance explained

> **Tip:** With imbalanced data, accuracy can hide failures. Always look at precision/recall and a confusion matrix.

In [None]:
# Helper functions for evaluation (binary classification + regression)

def confusion_matrix_2x2(y_true, y_pred):
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    tn = int(((y_true == 0) & (y_pred == 0)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    return np.array([[tn, fp], [fn, tp]], dtype=int)

def metrics_from_cm(cm):
    tn, fp = cm[0]
    fn, tp = cm[1]
    accuracy = (tp + tn) / max(tp + tn + fp + fn, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {
        'accuracy': float(accuracy),
        'precision': float(precision),
        'recall': float(recall),
        'f1': float(f1),
    }

def plot_cm(cm, title='Confusion matrix'):
    fig, ax = plt.subplots(figsize=(4.8, 4))
    ax.imshow(cm, cmap='Blues')
    ax.set_title(title)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
    ax.set_xticks([0, 1])
    ax.set_yticks([0, 1])
    ax.set_xticklabels(['0', '1'])
    ax.set_yticklabels(['0', '1'])
    for (i, j), v in np.ndenumerate(cm):
        ax.text(j, i, str(v), ha='center', va='center', color='black')
    plt.tight_layout()
    plt.show()

def mae(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_manual(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - np.mean(y_true)) ** 2))
    return float(1 - ss_res / max(ss_tot, 1e-12))

print('Helpers ready')

### 18.3.1 Regression metrics in practice
Let's see how MAE, RMSE, and R² work on a concrete regression example.

| Metric | What it measures | When to use |
|--------|------------------|-------------|
| **MAE** | Average absolute error | When all errors matter equally |
| **RMSE** | Root of average squared error | When large errors are especially bad |
| **R²** | Fraction of variance explained | To compare models (1 = perfect, 0 = no better than predicting the mean, negative = worse than that) |

> **Tip:** RMSE will always be ≥ MAE. If RMSE >> MAE, you have some large outlier errors.
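
A tiny worked example makes the tip concrete. The two error vectors below are made up so that they share the same MAE; only the second one concentrates its error in a single large miss.

```python
import numpy as np

errors_even = np.array([2.0, -2.0, 2.0, -2.0, 2.0])   # every error has size 2
errors_spiky = np.array([0.0, 0.0, 0.0, 0.0, 10.0])   # one large miss, same total error

for label, err in [('even', errors_even), ('spiky', errors_spiky)]:
    mae_val = float(np.mean(np.abs(err)))
    rmse_val = float(np.sqrt(np.mean(err ** 2)))
    print(f'{label:>5}: MAE={mae_val:.3f}  RMSE={rmse_val:.3f}')
```

Both cases have MAE = 2, but RMSE rises from 2 to about 4.47 in the spiky case because squaring gives the single large error much more weight.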

In [None]:
# Regression metrics demonstration
# Create a simple regression scenario: predict y from x
rng_demo = np.random.default_rng(99)
n_demo = 100
x_demo = rng_demo.uniform(0, 10, size=n_demo)
y_true_demo = 2 * x_demo + 5 + rng_demo.normal(scale=2, size=n_demo)

# Simulate predictions from three models:
# Model A: Good predictions
# Model B: Moderate predictions (more noise)
# Model C: Predictions with some large outliers

y_pred_A = 2 * x_demo + 5 + rng_demo.normal(scale=1.5, size=n_demo)
y_pred_B = 2 * x_demo + 5 + rng_demo.normal(scale=4, size=n_demo)
y_pred_C = 2 * x_demo + 5 + rng_demo.normal(scale=1.5, size=n_demo)
# Add a few large outliers to Model C
outlier_idx = rng_demo.choice(n_demo, size=5, replace=False)
y_pred_C[outlier_idx] += rng_demo.choice([-15, 15], size=5)

# Calculate metrics for each model
models = {
    'Model A (good)': y_pred_A,
    'Model B (noisy)': y_pred_B,
    'Model C (outliers)': y_pred_C,
}

results = []
for name, y_pred in models.items():
    results.append({
        'Model': name,
        'MAE': round(mae(y_true_demo, y_pred), 3),
        'RMSE': round(rmse(y_true_demo, y_pred), 3),
        'R²': round(r2_manual(y_true_demo, y_pred), 3),
    })

reg_metrics_df = pd.DataFrame(results)
print('Regression Metrics Comparison:')
display(reg_metrics_df)

# Visualize predictions vs actual
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
for ax, (name, y_pred) in zip(axes, models.items()):
    ax.scatter(y_true_demo, y_pred, alpha=0.6, s=30)
    ax.plot([y_true_demo.min(), y_true_demo.max()], 
            [y_true_demo.min(), y_true_demo.max()], 
            'r--', label='Perfect prediction')
    ax.set_xlabel('Actual')
    ax.set_ylabel('Predicted')
    ax.set_title(name)
    ax.legend()
plt.tight_layout()
plt.show()

print('\nNotice: Model C has only a moderately higher MAE than Model A, but a much higher RMSE, because squaring amplifies its few outlier errors.')
print('R² tells you what fraction of variance is explained (higher is better, max=1).')

### Exercise: Interpreting regression metrics
Look at the regression metrics table above and answer these questions:

1. Which model has the best overall performance?
2. Why is the gap between RMSE and MAE much larger for Model C than for the other models?
3. If you could only use one metric to compare models, which would you choose and why?

> **Hint:** Think about what each metric penalizes and how outliers affect squared vs absolute differences.

### Start with a baseline (always!)
Before training a model, create a simple baseline. If your model cannot beat the baseline, something is wrong.

For imbalanced classification, a common baseline is predicting the **most common class** every time.
> **Tip:** Baselines prevent you from celebrating meaningless results.

In [None]:
# Baseline: always predict the majority class from the training set
majority_class = 0 if y_train.mean() < 0.5 else 1
y_pred_base = np.full_like(y_test, fill_value=majority_class)

cm_base = confusion_matrix_2x2(y_test, y_pred_base)
m_base = metrics_from_cm(cm_base)

print('Majority class predicted:', majority_class)
print('Baseline metrics:', m_base)
plot_cm(cm_base, title='Baseline confusion matrix (majority class)')

## 18.3A Build a simple model (Logistic Regression)
Logistic Regression is a strong first model for binary classification because it:
- Is fast and often performs well
- Outputs probabilities (useful for thresholds)
- Is reasonably interpretable

We will use a **pipeline** that scales features then trains the model. This is a best practice because it helps prevent leakage (the scaler is fit only on training data).
> **Common mistake:** Scaling with the full dataset *before* splitting.

In [None]:
# Train and evaluate Logistic Regression (requires scikit-learn)
if not SKLEARN_AVAILABLE:
    print('Skipping Logistic Regression: scikit-learn is not available.')
else:
    model = Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(max_iter=1000, random_state=42)),
    ])
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    cm = confusion_matrix_2x2(y_test, y_pred)
    m = metrics_from_cm(cm)
    auc_value = float(roc_auc_score(y_test, y_proba))

    print('Metrics at threshold=0.5:', m)
    print('ROC AUC:', round(auc_value, 4))
    plot_cm(cm, title='Logistic Regression confusion matrix')

### Visual examples: ROC and Precision–Recall curves
If a model outputs probabilities, you can choose different decision thresholds (not always 0.5).
Curves show the trade-offs across thresholds:
- **ROC curve**: True Positive Rate vs False Positive Rate
- **Precision–Recall curve**: Precision vs Recall (often more informative with imbalanced data)
> **Tip:** If the positive class is rare, focus on Precision–Recall.
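
Each point on these curves corresponds to one threshold. The sketch below computes the ROC coordinates (TPR, FPR) and the precision at three thresholds by hand. The labels and scores are hypothetical values chosen for illustration, and `rates_at_threshold` is a helper defined here, not a library function.

```python
import numpy as np

# Hypothetical labels and predicted probabilities for ten samples
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_score_toy = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.70, 0.90, 0.60])

def rates_at_threshold(y_true, y_score, threshold):
    """Return (TPR, FPR, precision) when predicting 1 for scores >= threshold."""
    y_hat = (y_score >= threshold).astype(int)
    tp = int(((y_true == 1) & (y_hat == 1)).sum())
    fp = int(((y_true == 0) & (y_hat == 1)).sum())
    fn = int(((y_true == 1) & (y_hat == 0)).sum())
    tn = int(((y_true == 0) & (y_hat == 0)).sum())
    tpr_val = tp / max(tp + fn, 1)   # recall: y-axis of ROC, x-axis of PR
    fpr_val = fp / max(fp + tn, 1)   # x-axis of ROC
    prec_val = tp / max(tp + fp, 1)  # y-axis of PR
    return tpr_val, fpr_val, prec_val

for t in (0.3, 0.5, 0.7):
    tpr_t, fpr_t, prec_t = rates_at_threshold(y_true_toy, y_score_toy, t)
    print(f'threshold={t:.1f}  TPR={tpr_t:.2f}  FPR={fpr_t:.2f}  precision={prec_t:.2f}')
```

Raising the threshold in this toy example lowers the false positive rate and raises precision, at the cost of missing one of the three positives.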

In [None]:
if not SKLEARN_AVAILABLE:
    print('Skipping ROC/PR curves: scikit-learn is not available.')
else:
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    roc_auc = auc(fpr, tpr)

    prec, rec, _ = precision_recall_curve(y_test, y_proba)

    fig, axes = plt.subplots(1, 2, figsize=(12, 4.2))

    axes[0].plot(fpr, tpr, label=f'AUC={roc_auc:.3f}')
    axes[0].plot([0, 1], [0, 1], linestyle='--', color='gray', label='Random')
    axes[0].set_title('ROC curve')
    axes[0].set_xlabel('False Positive Rate')
    axes[0].set_ylabel('True Positive Rate')
    axes[0].legend()

    axes[1].plot(rec, prec)
    axes[1].set_title('Precision–Recall curve')
    axes[1].set_xlabel('Recall')
    axes[1].set_ylabel('Precision')

    plt.tight_layout()
    plt.show()

## Exercise 1: Compute metrics from a confusion matrix
Pick a confusion matrix and compute accuracy, precision, recall, and F1 yourself.

Why? If you understand TN/FP/FN/TP, you can explain model performance clearly.

In [None]:
# Use the baseline confusion matrix if it exists; otherwise fall back to a small example matrix.
# (Swap in `cm` to practice on the Logistic Regression matrix instead.)
cm_ex = cm_base if 'cm_base' in globals() else np.array([[50, 10], [8, 32]])
print('Confusion matrix used:\n', cm_ex)

# TODO: compute the metrics manually
tn, fp = cm_ex[0]
fn, tp = cm_ex[1]
accuracy = (tp + tn) / max(tp + tn + fp + fn, 1)
precision = tp / max(tp + fp, 1)
recall = tp / max(tp + fn, 1)
f1 = 2 * precision * recall / max(precision + recall, 1e-12)

print({'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1})

# Check yourself using the helper
print('Helper check:', metrics_from_cm(cm_ex))

## 18.4 Overfitting and underfitting
Two common failure modes:
- **Underfitting**: model is too simple → high error on both train and test
- **Overfitting**: model is too complex → very low train error but worse test error

> **Key idea:** Compare training performance vs testing (or validation) performance. Big gaps are a warning sign.

In [None]:
# Regression example: y = sin(x) + noise
rng = np.random.default_rng(42)
n = 180
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + rng.normal(scale=0.25, size=n)

# Train/test split (manual, to keep it simple)
idx = np.arange(n)
rng.shuffle(idx)
split = int(0.75 * n)
train_idx, test_idx = idx[:split], idx[split:]
x_train_r, x_test_r = x[train_idx], x[test_idx]
y_train_r, y_test_r = y[train_idx], y[test_idx]

plt.figure(figsize=(7.5, 4))
plt.scatter(x_train_r, y_train_r, s=18, alpha=0.8, label='train')
plt.scatter(x_test_r, y_test_r, s=18, alpha=0.8, label='test')
plt.title('Noisy sin(x) regression dataset')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()

In [None]:
# Fit polynomial regression models with different degrees (numpy only)
def fit_poly(x_train, y_train, degree):
    X = np.vander(x_train, N=degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y_train, rcond=None)
    return w

def predict_poly(x_values, w):
    degree = len(w) - 1
    X = np.vander(x_values, N=degree + 1, increasing=True)
    return X @ w

degrees = list(range(1, 16))
train_scores = []
test_scores = []

for d in degrees:
    w = fit_poly(x_train_r, y_train_r, degree=d)
    yhat_train = predict_poly(x_train_r, w)
    yhat_test = predict_poly(x_test_r, w)
    train_scores.append(rmse(y_train_r, yhat_train))
    test_scores.append(rmse(y_test_r, yhat_test))

best_degree = degrees[int(np.argmin(test_scores))]
print('Best degree on test (demo):', best_degree)

plt.figure(figsize=(8, 4))
plt.plot(degrees, train_scores, marker='o', label='Train RMSE')
plt.plot(degrees, test_scores, marker='o', label='Test RMSE')
plt.title('Overfitting vs underfitting (polynomial degree)')
plt.xlabel('Polynomial degree (complexity)')
plt.ylabel('RMSE (lower is better)')
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()

### Interpreting the overfitting plot
In the polynomial plot above:
- **Low degree** (left side) often underfits: the curve is too simple, so both train and test errors are relatively high.
- **High degree** (right side) often overfits: training error keeps dropping, but test error starts to rise.

**What you want:** a complexity level where test (or validation) error is low *and stable*.

> **Warning:** In the demo we printed the “best degree on test.” In real projects, you should choose the degree using a **validation** set (or cross‑validation), then evaluate once on the **test** set.

## 18.5 Bias and variance trade-off
The bias–variance trade-off helps explain why a model can be “too simple” or “too complex.”
- **Bias**: error from simplifying assumptions (model can’t capture the true pattern). High bias → underfitting.
- **Variance**: error from sensitivity to the training data (model changes a lot if the data changes a little). High variance → overfitting.

A useful mental picture:
- High bias models are consistently wrong in a similar way.
- High variance models are sometimes great, sometimes terrible, depending on the sample.

> **Tip:** If your score changes a lot when you change the random split, variance is likely a problem.
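
Bias and variance can be estimated numerically by refitting the same model on many noisy versions of the data. The sketch below is a simplified simulation under stated assumptions: the true curve is `sin(x)`, the inputs are held fixed, only the noise is redrawn, and the names `x_fixed` and `bias_variance` are introduced here for illustration.

```python
import numpy as np

rng_bv = np.random.default_rng(0)
x_fixed = np.linspace(-3, 3, 30)   # fixed inputs; only the noise is redrawn
true_curve = np.sin(x_fixed)

def bias_variance(degree, n_repeats=200, noise_scale=0.25):
    """Estimate squared bias and variance of a polynomial fit over repeated noisy samples."""
    preds = np.empty((n_repeats, len(x_fixed)))
    for r in range(n_repeats):
        y_noisy = true_curve + rng_bv.normal(scale=noise_scale, size=len(x_fixed))
        coefs = np.polynomial.polynomial.polyfit(x_fixed, y_noisy, degree)
        preds[r] = np.polynomial.polynomial.polyval(x_fixed, coefs)
    # Bias: how far the average prediction is from the truth
    bias_sq = float(np.mean((preds.mean(axis=0) - true_curve) ** 2))
    # Variance: how much predictions move from one noisy sample to the next
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

for degree in (1, 9):
    b_sq, var = bias_variance(degree)
    print(f'degree={degree}: bias^2={b_sq:.4f}  variance={var:.4f}')
```

The straight line (degree 1) should show the larger squared bias, while the degree-9 polynomial should show the larger variance, which matches the underfitting and overfitting descriptions above.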

In [None]:
# Demo: variance shows up as instability across different train/test splits
# We'll see how the "best" polynomial degree changes if we resplit the data.
def find_best_degree_for_split(seed, degrees=range(1, 16)):
    rng = np.random.default_rng(seed)
    idx = np.arange(len(x))
    rng.shuffle(idx)
    split = int(0.75 * len(idx))
    train_idx, test_idx = idx[:split], idx[split:]
    x_tr, x_te = x[train_idx], x[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]

    test_rmse = []
    for d in degrees:
        w = fit_poly(x_tr, y_tr, degree=d)
        yhat_te = predict_poly(x_te, w)
        test_rmse.append(rmse(y_te, yhat_te))
    # Index into `degrees` so the result stays correct if the range does not start at 1
    best_d = degrees[int(np.argmin(test_rmse))]
    return best_d, float(min(test_rmse))

rows = []
for seed in range(10):
    best_d, best_rmse = find_best_degree_for_split(seed)
    rows.append({'seed': seed, 'best_degree': best_d, 'best_test_rmse': best_rmse})

stability_df = pd.DataFrame(rows)
stability_df

## Exercise 2: Train/validation/test (choose complexity without peeking)
Update the polynomial regression demo to use **three splits**:
1. **Train** (60%): fit the model
2. **Validation** (20%): choose the best polynomial degree
3. **Test** (20%): final evaluation, used **once** at the very end

### Your task:
1. Split the `x` and `y` arrays into train/validation/test
2. Loop through polynomial degrees 1–15
3. For each degree: fit on train, evaluate RMSE on validation
4. Pick the degree with the lowest validation RMSE
5. Report the test RMSE for that chosen degree (use test set only once!)

**Goal:** pick the polynomial degree using validation RMSE, then report test RMSE for that chosen degree.

> **Why this matters:** Using the test set to choose settings makes performance look better than it truly is. This is a form of "data leakage."

> **Common mistake:** Looking at test set results multiple times and picking the model that does best. This inflates your score!
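
A small simulation shows how much peeking can inflate a score. Everything here is synthetic and hypothetical: the labels are coin flips, and each "candidate model" is just a set of random guesses, so no candidate can truly beat 50% accuracy.

```python
import numpy as np

rng_peek = np.random.default_rng(0)
n_test_rows, n_candidates = 100, 50

# Labels are coin flips, so the true accuracy of any guesser is 50%.
y_coin = rng_peek.integers(0, 2, size=n_test_rows)

# Each candidate "model" is a row of random guesses for the same test rows.
candidate_preds = rng_peek.integers(0, 2, size=(n_candidates, n_test_rows))
test_acc = (candidate_preds == y_coin).mean(axis=1)

best_candidate = int(np.argmax(test_acc))
print('Mean accuracy over candidates:', round(float(test_acc.mean()), 3))
print('Best accuracy after peeking:  ', round(float(test_acc[best_candidate]), 3))

# On fresh data the selected guesser is still guessing, so its accuracy returns to about 50%.
y_fresh = rng_peek.integers(0, 2, size=10_000)
fresh_guesses = rng_peek.integers(0, 2, size=10_000)
fresh_acc = float((fresh_guesses == y_fresh).mean())
print('Selected candidate on fresh data:', round(fresh_acc, 3))
```

The "best" candidate looks clearly better than chance on the test rows it was selected on, even though nothing was learned. The same effect applies, more subtly, whenever the test set is used to pick a model.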

### Starter template (fill in the TODOs)
```python
# Step 1: Create three-way split
rng = np.random.default_rng(123)
idx = np.arange(len(x))
rng.shuffle(idx)

n_total = len(idx)
n_train = int(0.60 * n_total)
n_val = int(0.20 * n_total)
# TODO: Create train_idx, val_idx, test_idx

# Step 2: Extract x and y for each split
# TODO: x_tr, y_tr = ...
# TODO: x_val, y_val = ...
# TODO: x_te, y_te = ...

# Step 3: Find best degree using validation set
degrees = list(range(1, 16))
# TODO: Loop, fit on train, evaluate on validation

# Step 4: Final evaluation on test (once!)
# TODO: Use best degree, fit on train, evaluate on test
```

In [None]:
# One possible solution (you can modify it)
rng = np.random.default_rng(123)
idx = np.arange(len(x))
rng.shuffle(idx)

n_total = len(idx)
n_train = int(0.60 * n_total)
n_val = int(0.20 * n_total)
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

x_tr, y_tr = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]
x_te, y_te = x[test_idx], y[test_idx]

degrees = list(range(1, 16))
val_rmse = []
for d in degrees:
    w = fit_poly(x_tr, y_tr, degree=d)
    yhat_val = predict_poly(x_val, w)
    val_rmse.append(rmse(y_val, yhat_val))

best_d = degrees[int(np.argmin(val_rmse))]
w_final = fit_poly(x_tr, y_tr, degree=best_d)
test_rmse = rmse(y_te, predict_poly(x_te, w_final))

print('Best degree (by validation):', best_d)
print('Test RMSE (final evaluation):', round(test_rmse, 4))

plt.figure(figsize=(8, 4))
plt.plot(degrees, val_rmse, marker='o')
plt.title('Validation RMSE by polynomial degree')
plt.xlabel('Degree')
plt.ylabel('RMSE')
plt.grid(True, alpha=0.3)
plt.show()

## 18.5.1 Cross-validation (conceptual overview)
When you only have limited data, holding out 20–25% for validation can waste valuable training samples. **Cross-validation** solves this by rotating which portion of the data is used for validation.

### How K-Fold Cross-Validation Works
1. Split the data into **K roughly equal parts** (called "folds")
2. Train K models, each time using K-1 folds for training and 1 fold for validation
3. Average the K validation scores

> **Common choice:** K=5 or K=10

### Visual diagram (5-fold example)
```
Fold 1: [VAL] [Train] [Train] [Train] [Train]
Fold 2: [Train] [VAL] [Train] [Train] [Train]
Fold 3: [Train] [Train] [VAL] [Train] [Train]
Fold 4: [Train] [Train] [Train] [VAL] [Train]
Fold 5: [Train] [Train] [Train] [Train] [VAL]
```

Each data point gets used for validation exactly once, and for training K-1 times.
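
The rotation in the diagram can be sketched with `numpy` alone. This is a minimal illustration, not scikit-learn's implementation: `kfold_indices` is a helper written for this example, and the toy data is a made-up straight line plus noise.

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=42):
    """Yield (train_rows, val_rows) pairs for shuffled K-fold cross-validation."""
    rng_local = np.random.default_rng(seed)
    shuffled = rng_local.permutation(n_samples)
    folds = np.array_split(shuffled, k)  # fold sizes differ by at most one sample
    for i in range(k):
        val_rows = folds[i]
        train_rows = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_rows, val_rows

# Toy regression data: a straight line plus noise
rng_cv = np.random.default_rng(0)
x_line = rng_cv.uniform(0, 10, size=53)
y_line = 2 * x_line + 5 + rng_cv.normal(scale=2, size=53)

fold_rmse = []
for train_rows, val_rows in kfold_indices(len(x_line), k=5):
    # Fit on K-1 folds, then score on the fold that was held out
    slope, intercept = np.polyfit(x_line[train_rows], y_line[train_rows], deg=1)
    pred = slope * x_line[val_rows] + intercept
    fold_rmse.append(float(np.sqrt(np.mean((y_line[val_rows] - pred) ** 2))))

print('Fold RMSE:', np.round(fold_rmse, 3))
print('Mean RMSE:', round(float(np.mean(fold_rmse)), 3))
```

Using `np.array_split` means the sample count does not need to be divisible by K; here 53 samples are split into folds of 11, 11, 11, 10, and 10.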

### Why use cross-validation?
- More reliable performance estimate (uses all data)
- Reduces variance in evaluation (less dependent on one lucky/unlucky split)
- Especially useful for small datasets

> **Tip:** Cross-validation is for choosing hyperparameters and estimating performance. You still need a final held-out test set for the last evaluation.

> **Warning:** Cross-validation is slower (trains K models instead of 1).

In [None]:
# Cross-validation example (requires scikit-learn)
if not SKLEARN_AVAILABLE:
    print('Skipping cross-validation demo: scikit-learn is not available.')
else:
    from sklearn.model_selection import cross_val_score
    
    # Use the same classification dataset
    model_cv = Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(max_iter=1000, random_state=42)),
    ])
    
    # 5-fold cross-validation on the training split only (the test set stays held out)
    cv_scores = cross_val_score(model_cv, X_train, y_train, cv=5, scoring='accuracy')
    
    print('5-Fold Cross-Validation Results:')
    print('Individual fold scores:', [round(s, 4) for s in cv_scores])
    print('Mean accuracy:', round(cv_scores.mean(), 4))
    print('Standard deviation:', round(cv_scores.std(), 4))
    
    # Visualize fold scores
    plt.figure(figsize=(7, 4))
    plt.bar(range(1, 6), cv_scores, color='steelblue', edgecolor='black')
    plt.axhline(cv_scores.mean(), color='red', linestyle='--', label=f'Mean = {cv_scores.mean():.4f}')
    plt.xlabel('Fold')
    plt.ylabel('Accuracy')
    plt.title('5-Fold Cross-Validation Scores')
    plt.ylim(0.7, 1.0)
    plt.xticks(range(1, 6))
    plt.legend()
    plt.tight_layout()
    plt.show()
    
    print('\n> Low standard deviation = stable model performance across folds')

## 18.6 Sensitivity analysis
Sensitivity analysis asks: **How sensitive are results to small changes?**

We’ll do two beginner-friendly types:
1. **Threshold sensitivity (classification):** change the probability cutoff and watch precision/recall change.
2. **Input perturbation:** slightly change inputs (noise / shuffle a feature) and see how performance changes.

> **Tip:** Sensitivity analysis is a bridge between “model metrics” and “real-world robustness.”

In [None]:
# 18.6A Threshold sensitivity (requires a probability model)
if not SKLEARN_AVAILABLE or 'y_proba' not in globals():
    print('Run the Logistic Regression section first (and ensure scikit-learn is installed) to generate probabilities.')
else:
    thresholds = np.linspace(0.05, 0.95, 19)
    rows = []
    for t in thresholds:
        y_pred_t = (y_proba >= t).astype(int)
        cm_t = confusion_matrix_2x2(y_test, y_pred_t)
        m_t = metrics_from_cm(cm_t)
        rows.append({'threshold': float(t), **m_t})
    sens_df = pd.DataFrame(rows)
    display(sens_df.head())
    
    plt.figure(figsize=(8, 4))
    plt.plot(sens_df['threshold'], sens_df['precision'], marker='o', label='Precision')
    plt.plot(sens_df['threshold'], sens_df['recall'], marker='o', label='Recall')
    plt.plot(sens_df['threshold'], sens_df['f1'], marker='o', label='F1')
    plt.title('Threshold sensitivity (Logistic Regression)')
    plt.xlabel('Decision threshold')
    plt.ylabel('Metric value')
    plt.grid(True, alpha=0.3)
    plt.legend()
    plt.show()

In [None]:
# 18.6B Input perturbation: add noise and shuffle a feature (robustness check)
if not SKLEARN_AVAILABLE or 'model' not in globals():
    print('Run the Logistic Regression section first (and ensure scikit-learn is installed).')
else:
    rng = np.random.default_rng(7)
    
    # 1) Add small noise to test features
    X_test_noisy = X_test + rng.normal(scale=0.15, size=X_test.shape)
    y_pred_noisy = model.predict(X_test_noisy)
    cm_noisy = confusion_matrix_2x2(y_test, y_pred_noisy)
    m_noisy = metrics_from_cm(cm_noisy)
    
    # 2) Shuffle one feature column (break its relationship to target)
    X_test_shuffled = X_test.copy()
    col_to_shuffle = 0
    X_test_shuffled[:, col_to_shuffle] = rng.permutation(X_test_shuffled[:, col_to_shuffle])
    y_pred_shuffled = model.predict(X_test_shuffled)
    cm_shuf = confusion_matrix_2x2(y_test, y_pred_shuffled)
    m_shuf = metrics_from_cm(cm_shuf)
    
    print('Original metrics:', m)
    print('Noisy-input metrics:', m_noisy)
    print(f'Shuffled feature_{col_to_shuffle} metrics:', m_shuf)

## 18.7 Quality assurance checks
Quality control is not “extra.” It is how you prevent avoidable mistakes and build trust.

Here are practical checks you can apply right away:

### A) Reproducibility
- Set random seeds (splits, models)
- Record library versions
- Keep your preprocessing steps consistent (prefer pipelines)
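
One lightweight way to record library versions is to collect them in a dictionary next to the seed you used. The sketch below relies on the standard-library `importlib.metadata` module; the `run_info` name and the package list are choices made for this example.

```python
import platform
from importlib import metadata

def installed_version(package):
    """Return the installed version string, or a marker if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return 'not installed'

run_info = {'python': platform.python_version(), 'seed': 42}
for package in ('numpy', 'pandas', 'matplotlib', 'scikit-learn'):
    run_info[package] = installed_version(package)

for key, value in run_info.items():
    print(f'{key}: {value}')
```

Saving this output alongside your results makes it easier to explain later why a rerun produced slightly different numbers.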

### B) Leakage prevention (critical!)
- Split before any learned preprocessing (scaling, imputation)
- Use pipelines so transforms are fit on training data only
- Watch out for “target leakage” features that directly encode the answer
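
A rough first screen for target leakage is to look for features that are almost perfectly correlated with the target. The sketch below uses made-up data with one deliberately leaked column; the 0.95 cutoff is an arbitrary assumption, and a flagged feature is only a prompt to investigate, not proof of leakage.

```python
import numpy as np

rng_leak = np.random.default_rng(0)
n_rows = 500
y_leak = rng_leak.integers(0, 2, size=n_rows)

X_leak = np.column_stack([
    rng_leak.normal(size=n_rows),                           # unrelated feature
    y_leak + rng_leak.normal(scale=1.0, size=n_rows),       # honest, noisy signal
    y_leak * 10 + rng_leak.normal(scale=0.01, size=n_rows), # near-copy of the target
])
feature_names = ['noise', 'signal', 'leaky']

correlations = {}
for j, name in enumerate(feature_names):
    corr = float(np.corrcoef(X_leak[:, j], y_leak)[0, 1])
    correlations[name] = corr
    flag = '  <-- inspect for leakage' if abs(corr) > 0.95 else ''
    print(f'{name:>6}: corr with target = {corr:+.3f}{flag}')
```

Correlation only catches simple, linear leaks. Features derived from the target in other ways (for example, IDs or timestamps recorded after the outcome) still need a manual review.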

### C) Sanity checks
- Compare against a baseline
- Look at a confusion matrix (not just one metric)
- Check for impossible values, missing data, and unexpected duplicates
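
The data checks in the last bullet take only a few lines of `pandas`. The table below is a hypothetical example with one missing value, one duplicated row, and one impossible value planted on purpose.

```python
import numpy as np
import pandas as pd

df_check = pd.DataFrame({
    'age': [34, 51, -2, 28, 28, np.nan],
    'income': [42000.0, 58000.0, 39000.0, 61000.0, 61000.0, 47000.0],
    'target': [0, 1, 0, 1, 1, 0],
})

n_missing = df_check.isna().sum()                # missing values per column
n_duplicates = int(df_check.duplicated().sum())  # rows that repeat an earlier row exactly
n_bad_age = int((df_check['age'] < 0).sum())     # impossible values (negative age)

print('Missing values per column:')
print(n_missing)
print('Duplicate rows:', n_duplicates)
print('Negative ages: ', n_bad_age)
```

What counts as "impossible" depends on the column, so the range check has to be written per feature using domain knowledge.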

### D) Stability checks
- Re-run with different random splits
- Check sensitivity to small perturbations

> **Common mistake:** Accidentally using information from the future (or the test set) during training.

In [None]:
# 18.7A Leakage demo: scaling BEFORE split (wrong) vs pipeline (right)
if not SKLEARN_AVAILABLE:
    print('Skipping leakage demo: scikit-learn is not available.')
else:
    from sklearn.linear_model import LogisticRegression
    
    # WRONG: scale using ALL data, then split
    scaler_wrong = StandardScaler()
    X_scaled_wrong = scaler_wrong.fit_transform(X_all)
    X_tr_w, X_te_w, y_tr_w, y_te_w = train_test_split(
        X_scaled_wrong, y_all, test_size=0.25, random_state=42, stratify=y_all
    )
    clf_wrong = LogisticRegression(max_iter=1000, random_state=42)
    clf_wrong.fit(X_tr_w, y_tr_w)
    y_pred_w = clf_wrong.predict(X_te_w)
    cm_w = confusion_matrix_2x2(y_te_w, y_pred_w)
    m_w = metrics_from_cm(cm_w)
    
    # RIGHT: split first, then scale inside a pipeline
    X_tr_r, X_te_r, y_tr_r, y_te_r = train_test_split(
        X_all, y_all, test_size=0.25, random_state=42, stratify=y_all
    )
    clf_right = Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(max_iter=1000, random_state=42)),
    ])
    clf_right.fit(X_tr_r, y_tr_r)
    y_pred_r = clf_right.predict(X_te_r)
    cm_r = confusion_matrix_2x2(y_te_r, y_pred_r)
    m_r = metrics_from_cm(cm_r)
    
    print('WRONG (scale before split):', m_w)
    print('RIGHT (pipeline):', m_r)
    print('\nNote: leakage can inflate scores more on real datasets than on synthetic demos.')

In [None]:
# 18.7B Stability check: metrics across different random splits
if not SKLEARN_AVAILABLE:
    print('Skipping stability check: scikit-learn is not available.')
else:
    scores = []
    for seed in range(6):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_all, y_all, test_size=0.25, random_state=seed, stratify=y_all
        )
        m_split = Pipeline([
            ('scaler', StandardScaler()),
            ('clf', LogisticRegression(max_iter=1000, random_state=seed)),
        ])
        m_split.fit(X_tr, y_tr)
        y_pred_split = m_split.predict(X_te)
        y_proba_split = m_split.predict_proba(X_te)[:, 1]
        cm_split = confusion_matrix_2x2(y_te, y_pred_split)
        met = metrics_from_cm(cm_split)
        scores.append({
            'seed': seed,
            **met,
            'roc_auc': float(roc_auc_score(y_te, y_proba_split)),
        })
    scores_df = pd.DataFrame(scores)
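    # A bare expression inside an if/else block is not rendered by the notebook, so display it explicitly
    display(scores_df)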

## Mini-project: Create a model evaluation report
Goal: produce a short, trustworthy “evaluation report” someone else can read and believe.

### What to include
1. Dataset description (what is predicted? what counts as class 1?)
2. Train/test split method (and seed)
3. Baseline performance
4. Model performance (confusion matrix + metrics + ROC AUC if available)
5. One sensitivity analysis (threshold plot or input perturbation)
6. One quality check (leakage prevention or stability across splits)

> **Deliverable:** A short markdown summary (5–10 lines) stating which metric you prioritize and why.

In [None]:
# Mini-project starter template (edit freely)
if not SKLEARN_AVAILABLE:
    print('Install scikit-learn to run this mini-project: pip install scikit-learn')
else:
    # 1) Split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_all, y_all, test_size=0.25, random_state=101, stratify=y_all
    )

    # 2) Baseline
    maj = 0 if y_tr.mean() < 0.5 else 1
    y_pred_base2 = np.full_like(y_te, maj)  # distinct name: keeps the earlier y_pred_base intact
    cm_base2 = confusion_matrix_2x2(y_te, y_pred_base2)
    base_metrics = metrics_from_cm(cm_base2)

    # 3) Model (pipeline)
    model2 = Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(max_iter=1000, random_state=101)),
    ])
    model2.fit(X_tr, y_tr)
    # Distinct names so the earlier cells' y_pred / y_proba (tied to y_test) are not overwritten
    y_pred2 = model2.predict(X_te)
    y_proba2 = model2.predict_proba(X_te)[:, 1]

    # 4) Report metrics
    cm2 = confusion_matrix_2x2(y_te, y_pred2)
    metrics2 = metrics_from_cm(cm2)
    auc2 = float(roc_auc_score(y_te, y_proba2))

    print('Baseline metrics:', base_metrics)
    print('Model metrics:', metrics2)
    print('ROC AUC:', round(auc2, 4))
    plot_cm(cm2, title='Mini-project confusion matrix')

    # 5) Threshold sensitivity (choose based on your goal)
    thresholds = np.linspace(0.05, 0.95, 19)
    best = None
    for t in thresholds:
        y_pred_t = (y_proba2 >= t).astype(int)
        m_t = metrics_from_cm(confusion_matrix_2x2(y_te, y_pred_t))
        # Example objective: maximize F1 (change this if your goal is different)
        if best is None or m_t['f1'] > best['f1']:
            best = {'threshold': float(t), **m_t}
    print('Best threshold by F1 (demo):', best)
    print('Better practice: pick threshold on a validation set, then evaluate once on test.')

### Report template (copy and fill in)
Use this template to write your evaluation report. Fill in the blanks based on your analysis above.

---

## Model Evaluation Report

**Dataset:** [Describe what the data represents and what class 1 means]

**Split method:** [e.g., 75% train / 25% test, stratified, random_state=101]

**Baseline performance:**
- Accuracy: ___
- Precision: ___
- Recall: ___

**Model performance (Logistic Regression):**
- Accuracy: ___
- Precision: ___
- Recall: ___
- F1-score: ___
- ROC AUC: ___

**Key finding:** [Does the model beat the baseline? By how much?]

**Metric prioritization:** [Which metric matters most for this problem and why?]
- Example: "I prioritize Recall because missing a positive case (class 1) is costly."

**Sensitivity analysis:** [What did you learn from changing the threshold?]

**Quality check performed:** [What did you verify? e.g., leakage prevention, stability]

**Recommendation:** [Is this model ready for use? What should be done next?]

---

## Common Mistakes to Avoid (Beginner Checklist)
Before finalizing your analysis, check that you haven't made these common errors:

### ❌ Mistake 1: Using accuracy on imbalanced data
If 95% of your data is class 0, predicting "always 0" gives 95% accuracy while detecting none of the class 1 cases.
**Fix:** Always check precision, recall, F1, and look at the confusion matrix.
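
The 95% example can be checked with a few lines of NumPy. This is a minimal sketch with made-up labels (95 negatives, 5 positives); the variable names are illustrative and are not the chapter's helper functions.

```python
import numpy as np

# Hypothetical imbalanced labels: 95 negatives (class 0) and 5 positives (class 1)
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Confusion-matrix counts
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))

accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when there are no actual positives
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

print('Accuracy:', accuracy)  # 0.95
print('Recall:', recall)      # 0.0
```

Accuracy looks strong, yet recall shows that every positive case was missed.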

### ❌ Mistake 2: Scaling/preprocessing before splitting
If you scale using the entire dataset, the scaler's mean and standard deviation are computed partly from test rows, so test set statistics "leak" into training.
**Fix:** Use pipelines, or fit preprocessing on training data only.
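
The second option, fitting preprocessing on training data only, can be sketched with plain NumPy. The synthetic data, the 75/25 index split, and the variable names are assumptions made for illustration. A scikit-learn `Pipeline` containing a `StandardScaler` handles the same bookkeeping for you.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features

# Split FIRST (simple shuffled 75/25 split by index)
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:150], idx[150:]
X_train, X_test = X[train_idx], X[test_idx]

# Fit the scaling statistics on the training rows only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply the SAME training statistics to both sets
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma

print('Train means after scaling:', X_train_scaled.mean(axis=0).round(3))
print('Test means after scaling: ', X_test_scaled.mean(axis=0).round(3))
```

The scaled test means are usually close to zero but not exactly zero. That is expected, because the test rows played no part in computing `mu` and `sigma`.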

### ❌ Mistake 3: Peeking at the test set multiple times
If you keep checking test performance and adjusting your model, the test set is no longer a fair evaluation.
**Fix:** Use a separate validation set for tuning; touch the test set **once** at the very end.
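
One way to follow this fix is sketched below, assuming scikit-learn is installed. The synthetic dataset, the 60/20/20 proportions, and F1 as the tuning objective are illustrative choices, not requirements.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.8, 0.2], random_state=0)

# Carve off the test set (20%), then split the rest into train (60%) and validation (20%)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Tune the decision threshold on the VALIDATION set only
val_proba = model.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
val_f1 = [f1_score(y_val, (val_proba >= t).astype(int), zero_division=0)
          for t in thresholds]
best_t = float(thresholds[int(np.argmax(val_f1))])

# Touch the test set once, with the threshold already fixed
test_proba = model.predict_proba(X_test)[:, 1]
test_f1 = f1_score(y_test, (test_proba >= best_t).astype(int), zero_division=0)
print('Chosen threshold:', round(best_t, 2), '| test F1:', round(test_f1, 3))
```

Because `best_t` was chosen without looking at `X_test`, the final F1 is a fairer estimate of how the tuned model might behave on new data.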

### ❌ Mistake 4: Not setting random seeds
Without a fixed seed, splits and other random steps change on every run, so results are hard to reproduce and debug.
**Fix:** Set `random_state` or `np.random.seed()` for reproducibility.
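
A small sketch of what a fixed seed buys you. It uses NumPy's `default_rng` generator as an alternative to `np.random.seed()`; the helper name `shuffled_indices` is hypothetical.

```python
import numpy as np

def shuffled_indices(n, seed):
    """Return a shuffled index array; the same seed gives the same order."""
    rng = np.random.default_rng(seed)
    return rng.permutation(n)

run_a = shuffled_indices(10, seed=42)
run_b = shuffled_indices(10, seed=42)
run_c = shuffled_indices(10, seed=7)

print('Same seed, same order:', np.array_equal(run_a, run_b))  # True
print('Different seed, same order:', np.array_equal(run_a, run_c))
```

Recording the seed next to your results (for example in the evaluation report) lets someone else regenerate the same split.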

### ❌ Mistake 5: Forgetting to compare against a baseline
A model that beats "random guessing" but loses to "always predict majority class" is not useful.
**Fix:** Always compute baseline metrics first.
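
If scikit-learn is available, its `DummyClassifier` is one convenient way to get a majority-class baseline. This sketch uses a synthetic dataset and accuracy only, to keep it short. In practice you would compare the same metrics you plan to report.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=6,
                           weights=[0.85, 0.15], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

base_acc = baseline.score(X_te, y_te)
model_acc = model.score(X_te, y_te)
print('Baseline accuracy:', round(base_acc, 3))
print('Model accuracy:   ', round(model_acc, 3))
print('Improvement:      ', round(model_acc - base_acc, 3))
```

If the improvement is close to zero (or negative), the model has not yet earned its added complexity.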

### ❌ Mistake 6: Ignoring feature importance and interpretability
Knowing *what* the model predicts is not enough; stakeholders also want to know *why*.
**Fix:** Examine coefficients (for linear models, ideally on scaled features so they are comparable) or use feature importance plots.
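
For a logistic regression inside a pipeline, the coefficients can be read from the fitted step. The sketch below assumes scikit-learn and pandas are installed. The dataset is synthetic and the feature names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

feature_names = [f'feature_{i}' for i in range(5)]  # hypothetical names
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# coef_ has shape (1, n_features) for a binary problem; take the first row
coefs = pd.Series(pipe.named_steps['clf'].coef_[0], index=feature_names)

# Order by absolute size: a larger |coefficient| means a stronger effect
# per one standard deviation of the (scaled) feature
coef_table = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coef_table)
```

The sign of each coefficient shows the direction of the effect on the log-odds of class 1. Treat the ranking as a starting point for discussion rather than proof of causation.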

### ❌ Mistake 7: Not checking for data leakage features
Some features directly encode the target (e.g., "purchase_amount" when predicting "did_purchase").
**Fix:** Review feature definitions carefully before modeling.

> **Tip:** When in doubt, ask: "Would I have access to this feature at prediction time in the real world?"
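
Reading feature definitions comes first, but a quick automated scan can flag candidates to review. This sketch scores each feature on its own with ROC AUC and lists any that separate the classes almost perfectly. The data is synthetic, the column `page_views` is invented for contrast, and the 0.95 cut-off is an illustrative assumption rather than a standard value.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
did_purchase = rng.integers(0, 2, size=n)

df = pd.DataFrame({
    # Only known AFTER the outcome: zero for non-buyers, positive for buyers
    'purchase_amount': did_purchase * rng.uniform(10, 100, size=n),
    # Known before the outcome, unrelated to it in this synthetic data
    'page_views': rng.poisson(5, size=n),
})

suspicious = []
for col in df.columns:
    auc_col = roc_auc_score(did_purchase, df[col])
    # AUC near 0 or near 1 both mean this single feature almost separates the classes
    separation = max(auc_col, 1 - auc_col)
    if separation > 0.95:  # illustrative cut-off
        suspicious.append(col)

print('Features to review for leakage:', suspicious)
```

A flagged feature is not automatically leakage, and an unflagged one is not automatically safe. The scan only tells you where to look first.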

## Summary / Key Takeaways

### Core concepts covered in this chapter:

| Concept | What you learned |
|---------|------------------|
| **Validation** | Estimates real-world performance on unseen data |
| **Train/test split** | Training data for learning, test data for final evaluation |
| **Train/validation/test** | Validation set for tuning hyperparameters without touching test |
| **Cross-validation** | Rotates validation folds for more reliable estimates |
| **Classification metrics** | Accuracy, precision, recall, F1, ROC AUC |
| **Regression metrics** | MAE, RMSE, R² |
| **Overfitting** | Great training score, noticeably worse test score |
| **Underfitting** | Bad performance on both train and test |
| **Bias–variance trade-off** | Balance between models that are too simple (high bias) and too flexible (high variance) |
| **Sensitivity analysis** | Test robustness to thresholds and input changes |
| **Quality assurance** | Leakage prevention, reproducibility, sanity checks |

### Remember:
- ✅ Always compare against a baseline
- ✅ Use metrics that match your business goal
- ✅ Split data before fitting any preprocessing (scalers, imputers, encoders)
- ✅ Use the test set only once at the end
- ✅ Check sensitivity to thresholds and perturbations
- ✅ Document your process for reproducibility

> **Final thought:** A model is only as trustworthy as its evaluation. Take the time to validate properly!

## Optional references
- scikit-learn model evaluation: https://scikit-learn.org/stable/modules/model_evaluation.html
- Cross-validation (concepts and tools): https://scikit-learn.org/stable/modules/cross_validation.html
- Precision/Recall trade-offs: https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics
- Data leakage (general explanation): https://scikit-learn.org/stable/common_pitfalls.html#data-leakage