# Module 06: Model Evaluation Metrics

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 70 minutes  
**Prerequisites**: 
- [Module 03: Linear Regression](03_linear_regression.ipynb)
- [Module 04: Logistic Regression](04_logistic_regression.ipynb)
- [Module 05: Decision Trees](05_decision_trees.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand and calculate classification metrics (accuracy, precision, recall, F1-score)
2. Interpret confusion matrices for multiclass problems
3. Use ROC curves and AUC for model comparison
4. Apply regression metrics appropriately (MAE, MSE, RMSE, R²)
5. Choose the right metric for your specific problem
6. Understand the trade-offs between different metrics

## 1. Why Multiple Metrics?

**Accuracy alone is not enough!**

### Example: Cancer Detection
Imagine a dataset with:
- 990 healthy patients (99%)
- 10 cancer patients (1%)

A lazy model that predicts "healthy" for everyone:
- **Accuracy**: 99% (looks great!)
- **Problem**: Misses all 10 cancer cases, so recall is 0% (terrible!)
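
The numbers above can be reproduced in a few lines. This is a minimal sketch on made-up label arrays (`y_true` and `y_lazy` are illustrative names, not part of the notebook's datasets), assuming cancer is encoded as the positive class `1`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 healthy (0) and 10 cancer (1) patients
y_true = np.array([0] * 990 + [1] * 10)

# The "lazy" model predicts healthy (0) for everyone
y_lazy = np.zeros_like(y_true)

lazy_accuracy = accuracy_score(y_true, y_lazy)
lazy_recall = recall_score(y_true, y_lazy)  # recall for the cancer class

print(f"Accuracy: {lazy_accuracy:.1%}")
print(f"Recall:   {lazy_recall:.1%}")
```

Accuracy rewards the model for the 990 easy negatives, while recall looks only at the 10 patients who actually have cancer.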

### The Solution
Use multiple metrics that capture different aspects of performance:
- **Precision**: When model predicts positive, how often is it correct?
- **Recall**: Of all actual positives, how many does model find?
- **F1-Score**: Balance between precision and recall
- **ROC/AUC**: Overall discriminative ability

In [None]:
# Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set random seed
np.random.seed(42)

# Visualization settings
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("✓ Setup complete!")

## 2. Classification Metrics Deep Dive

### Confusion Matrix Components

```
                 Predicted
              Negative  Positive
Actual Negative   TN       FP
       Positive   FN       TP
```

- **TP (True Positive)**: Correctly predicted positive
- **TN (True Negative)**: Correctly predicted negative
- **FP (False Positive)**: Wrongly predicted positive (Type I error)
- **FN (False Negative)**: Wrongly predicted negative (Type II error)
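
As a quick check of this layout, the sketch below builds a confusion matrix from five made-up labels (illustrative values only). For binary labels `0` and `1`, scikit-learn's `confusion_matrix` puts actual classes in rows and predicted classes in columns, so `ravel()` yields the counts in the order TN, FP, FN, TP:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, chosen so that every cell is non-zero
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn_demo, fp_demo, fn_demo, tp_demo = cm_demo.ravel()

print(cm_demo)
print(f"TN={tn_demo}, FP={fp_demo}, FN={fn_demo}, TP={tp_demo}")
```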

### Key Metrics

1. **Accuracy** = (TP + TN) / (TP + TN + FP + FN)
   - Overall correctness
   - **Problem**: Misleading with imbalanced classes

2. **Precision** = TP / (TP + FP)
   - Of predicted positives, how many are actually positive?
   - **Use when**: False positives are costly
   - **Example**: Spam detection (don't want to mark important emails as spam)

3. **Recall (Sensitivity)** = TP / (TP + FN)
   - Of actual positives, how many did we find?
   - **Use when**: False negatives are costly
   - **Example**: Cancer detection (don't want to miss cancer cases)

4. **F1-Score** = 2 × (Precision × Recall) / (Precision + Recall)
   - Harmonic mean of precision and recall
   - **Use when**: Need balance between precision and recall

5. **Specificity** = TN / (TN + FP)
   - Of actual negatives, how many did we correctly identify?
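
To make the formulas concrete, here is a small sketch that evaluates them directly from confusion-matrix counts. The helper `metrics_from_counts` and the counts passed to it are hypothetical, chosen only for illustration. The helper does not guard against division by zero (for example, when nothing is predicted positive).

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Compute the five metrics above from raw confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts, for illustration only
demo_metrics = metrics_from_counts(tp=40, tn=45, fp=10, fn=5)
for name, value in demo_metrics.items():
    print(f"{name:<12}{value:.3f}")
```

With these counts, precision (0.800) is lower than recall (0.889) because there are more false positives (10) than false negatives (5). F1 lies between the two.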

In [None]:
# Load breast cancer dataset
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

cancer_df = pd.read_csv('data/sample/breast_cancer.csv')

# Prepare data
X = cancer_df.drop(['target', 'diagnosis'], axis=1)
y = cancer_df['target']

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = LogisticRegression(random_state=42, max_iter=10000)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)

print("✓ Model trained and predictions made!")

In [None]:
# Calculate all metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                            f1_score, confusion_matrix, classification_report)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Classification Metrics:")
print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-Score:  {f1:.3f}")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

print(f"\nConfusion Matrix Breakdown:")
print(f"True Negatives (TN):  {tn}")
print(f"False Positives (FP): {fp}")
print(f"False Negatives (FN): {fn}")
print(f"True Positives (TP):  {tp}")

# Specificity
specificity = tn / (tn + fp)
print(f"\nSpecificity: {specificity:.3f}")

In [None]:
# Visualize confusion matrix
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Raw counts
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False, ax=axes[0],
           xticklabels=['Malignant', 'Benign'],
           yticklabels=['Malignant', 'Benign'])
axes[0].set_xlabel('Predicted', fontsize=12)
axes[0].set_ylabel('Actual', fontsize=12)
axes[0].set_title('Confusion Matrix (Counts)', fontsize=13, fontweight='bold')

# Normalized by row: each cell is the share of that actual class (rows sum to 100%)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_normalized, annot=True, fmt='.1%', cmap='Greens', cbar=False, ax=axes[1],
           xticklabels=['Malignant', 'Benign'],
           yticklabels=['Malignant', 'Benign'])
axes[1].set_xlabel('Predicted', fontsize=12)
axes[1].set_ylabel('Actual', fontsize=12)
axes[1].set_title('Confusion Matrix (Percentages)', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Detailed classification report
print("Detailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))

print("\nInterpretation:")
print(f"- Support: Number of actual samples in each class")
print(f"- Macro avg: Unweighted mean (treats all classes equally)")
print(f"- Weighted avg: Weighted by number of samples in each class")

## 3. ROC Curve and AUC

### ROC (Receiver Operating Characteristic) Curve
- Plots True Positive Rate (Recall) against False Positive Rate (FP / (FP + TN), which equals 1 − Specificity)
- Shows performance across all classification thresholds
- Closer to top-left corner = better
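
Each point on a ROC curve comes from one threshold. The sketch below computes such points by hand on ten made-up scores. The helper `roc_point` and the toy arrays are hypothetical names used only for illustration, and a sample is treated as positive when its score is at or above the threshold.

```python
import numpy as np

def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting positive for scores >= threshold."""
    y_true = np.asarray(y_true)
    y_hat = np.asarray(scores) >= threshold
    tp = np.sum(y_hat & (y_true == 1))
    fp = np.sum(y_hat & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    return float(fpr), float(tpr)

# Hypothetical labels and predicted scores
toy_labels = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
toy_scores = [0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.65, 0.80, 0.90, 0.95]

for t in [0.0, 0.3, 0.5, 0.7, 1.01]:
    fpr_t, tpr_t = roc_point(toy_labels, toy_scores, t)
    print(f"threshold={t:.2f}  FPR={fpr_t:.2f}  TPR={tpr_t:.2f}")
```

Sweeping the threshold from low to high moves the point from the top-right corner (everything predicted positive) to the bottom-left corner (nothing predicted positive).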

### AUC (Area Under Curve)
- Single number summarizing ROC curve
- Range: 0 to 1
- **AUC = 1.0**: Perfect classifier
- **AUC = 0.5**: Random guessing (diagonal line)
- **AUC < 0.5**: Worse than random (the ranking is inverted, so flipping the scores would give an AUC above 0.5)

### Interpretation
- AUC = probability that model ranks a random positive higher than a random negative
- Useful for comparing models
- Threshold-independent metric
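
The ranking interpretation can be checked numerically. This sketch uses synthetic data (the arrays and the seed are assumptions made for illustration): it compares every positive/negative pair, counts ties as half, and then compares the result with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): positives tend to score higher
rank_labels = rng.integers(0, 2, size=200)
rank_scores = rank_labels + rng.normal(size=200)

pos_scores = rank_scores[rank_labels == 1]
neg_scores = rank_scores[rank_labels == 0]

# Compare every (positive, negative) pair; ties count as half a win
diffs = pos_scores[:, None] - neg_scores[None, :]
pairwise_auc = np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0)

sklearn_auc = roc_auc_score(rank_labels, rank_scores)
print(f"Pairwise ranking estimate: {pairwise_auc:.4f}")
print(f"roc_auc_score:             {sklearn_auc:.4f}")
```

The two numbers should agree up to floating-point error, which is why AUC depends only on how the scores are ordered and not on any particular threshold.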

In [None]:
# Calculate ROC curve
from sklearn.metrics import roc_curve, roc_auc_score

# Get probabilities for positive class
y_proba_pos = y_proba[:, 1]

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba_pos)
auc_score = roc_auc_score(y_test, y_proba_pos)

print(f"AUC Score: {auc_score:.3f}")
print(f"\nInterpretation: The model has {auc_score:.1%} probability of ranking")
print(f"a random benign sample higher than a random malignant sample.")

In [None]:
# Plot ROC curve
plt.figure(figsize=(10, 7))
plt.plot(fpr, tpr, linewidth=3, label=f'Model (AUC = {auc_score:.3f})')
plt.plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random Guessing (AUC = 0.5)')

# Mark the ROC point whose threshold is closest to the default (0.5)
default_idx = np.argmin(np.abs(thresholds - 0.5))
plt.plot(fpr[default_idx], tpr[default_idx], 'ro', markersize=10, 
        label=f'Threshold ≈ {thresholds[default_idx]:.2f}')

plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate (Recall)', fontsize=12)
plt.title('ROC Curve\nCloser to top-left = better', fontsize=14, fontweight='bold')
plt.legend(fontsize=11, loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nKey Points:")
print("- Top-left corner (0,1): Perfect classifier")
print("- Diagonal line: Random guessing")
print("- Our curve is well above diagonal: Good performance!")

## 4. Precision-Recall Trade-off

**Key Insight**: For a given model, precision and recall usually pull in opposite directions: moving the decision threshold to raise one tends to lower the other.

### The Trade-off
- **High threshold** → High precision, Low recall (conservative)
- **Low threshold** → Low precision, High recall (aggressive)
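
A tiny example makes the effect of the threshold visible. The labels and scores below are made up for illustration, and a sample is predicted positive when its score is at or above the threshold:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted scores
toy_labels = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
toy_scores = np.array([0.10, 0.20, 0.35, 0.40, 0.45,
                       0.60, 0.65, 0.80, 0.90, 0.95])

tradeoff = {}
for threshold in (0.3, 0.7):
    toy_pred = (toy_scores >= threshold).astype(int)
    tradeoff[threshold] = (
        precision_score(toy_labels, toy_pred),
        recall_score(toy_labels, toy_pred),
    )
    p, r = tradeoff[threshold]
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

At 0.3 the model finds every positive but raises two false alarms (precision 0.75, recall 1.00). At 0.7 it raises no false alarms but misses half of the positives (precision 1.00, recall 0.50).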

### Which to Prioritize?

**Prioritize Precision when** false positives are costly:
- Spam detection (don't mark important emails as spam)
- Video recommendations (don't show inappropriate content)

**Prioritize Recall when** false negatives are costly:
- Cancer detection (don't miss cancer cases)
- Fraud detection (catch as many frauds as possible)
- Security screening (better safe than sorry)

In [None]:
# Demonstrate precision-recall trade-off
from sklearn.metrics import precision_recall_curve

precisions, recalls, pr_thresholds = precision_recall_curve(y_test, y_proba_pos)

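# Find thresholds for two scenarios.
# precisions/recalls have one more entry than pr_thresholds (the final point,
# precision=1 and recall=0, has no threshold), so drop it to keep indices aligned.
prec_t, rec_t = precisions[:-1], recalls[:-1]

# Scenario 1 (conservative): lowest threshold that reaches precision >= 0.95
# (assumes at least one threshold reaches it; otherwise argmax returns 0)
high_precision_idx = np.argmax(prec_t >= 0.95)
high_prec_threshold = pr_thresholds[high_precision_idx]
high_prec_recall = recalls[high_precision_idx]

# Scenario 2 (aggressive): highest threshold that still keeps recall >= 0.95.
# recalls[0] is always 1.0, so np.argmax(recalls >= 0.95) would always return 0.
high_recall_idx = np.where(rec_t >= 0.95)[0][-1]
high_rec_threshold = pr_thresholds[high_recall_idx]
high_rec_precision = precisions[high_recall_idx]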

print("Scenario 1: Conservative (High Precision)")
print(f"  Threshold: {high_prec_threshold:.3f}")
print(f"  Precision: {precisions[high_precision_idx]:.1%}")
print(f"  Recall: {high_prec_recall:.1%}")
print(f"  Trade-off: Very confident predictions, but miss some cases\n")

print("Scenario 2: Aggressive (High Recall)")
print(f"  Threshold: {high_rec_threshold:.3f}")
print(f"  Precision: {high_rec_precision:.1%}")
print(f"  Recall: {recalls[high_recall_idx]:.1%}")
print(f"  Trade-off: Catch most cases, but more false alarms")

In [None]:
# Plot precision-recall curve
plt.figure(figsize=(10, 7))
plt.plot(recalls, precisions, linewidth=3, label='Precision-Recall Curve')
plt.scatter([high_prec_recall], [precisions[high_precision_idx]], 
           s=200, c='green', marker='*', 
           label='Conservative (High Precision)', zorder=5)
plt.scatter([recalls[high_recall_idx]], [high_rec_precision], 
           s=200, c='red', marker='*', 
           label='Aggressive (High Recall)', zorder=5)

plt.xlabel('Recall', fontsize=12)
plt.ylabel('Precision', fontsize=12)
plt.title('Precision-Recall Curve\nShows trade-off between metrics', 
         fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 5. Multiclass Metrics

For problems with more than 2 classes, metrics are calculated per class and then combined into a single number:

### Averaging Strategies

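1. **Macro Average**: Unweighted mean of the per-class scores
   - Treats all classes equally, so rare classes count as much as common ones
   - On balanced datasets it is close to the weighted average

2. **Weighted Average**: Mean of the per-class scores, weighted by support (number of actual samples per class)
   - Accounts for class imbalance
   - Reported alongside the macro average in `classification_report`

3. **Micro Average**: Aggregates TP, FP, and FN over all classes before computing the metric
   - Gives more weight to larger classes
   - For single-label multiclass problems, micro precision, recall, and F1 all equal accuracy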
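
The three strategies can be compared on a small imbalanced example. The labels below are made up for illustration. Recall is used because it is defined for every class here, and the per-class values are combined by hand to show what each `average` option does:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 6 samples of class 0, 3 of class 1, 1 of class 2
y_true_mc = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred_mc = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

per_class_recall = recall_score(y_true_mc, y_pred_mc, average=None)
support = np.bincount(y_true_mc)  # number of actual samples per class

macro_recall = recall_score(y_true_mc, y_pred_mc, average="macro")
weighted_recall = recall_score(y_true_mc, y_pred_mc, average="weighted")
micro_recall = recall_score(y_true_mc, y_pred_mc, average="micro")

print("Per-class recall:", np.round(per_class_recall, 3))
print(f"Macro:    {macro_recall:.3f}  (plain mean: {per_class_recall.mean():.3f})")
print(f"Weighted: {weighted_recall:.3f}  "
      f"(support-weighted mean: {np.average(per_class_recall, weights=support):.3f})")
print(f"Micro:    {micro_recall:.3f}  "
      f"(accuracy: {accuracy_score(y_true_mc, y_pred_mc):.3f})")
```

The single class-2 sample is misclassified, which pulls the macro average down to 0.5, while the weighted and micro averages stay at 0.7 because that class contributes only one of ten samples.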

In [None]:
# Load Iris dataset for multiclass classification
iris_df = pd.read_csv('data/sample/iris.csv')

feature_cols = ['sepal length (cm)', 'sepal width (cm)', 
                'petal length (cm)', 'petal width (cm)']
X_iris = iris_df[feature_cols]
y_iris = iris_df['species']

# Split and scale
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

scaler_iris = StandardScaler()
X_train_iris_scaled = scaler_iris.fit_transform(X_train_iris)
X_test_iris_scaled = scaler_iris.transform(X_test_iris)

# Train model
model_iris = LogisticRegression(random_state=42, max_iter=10000)
model_iris.fit(X_train_iris_scaled, y_train_iris)

# Predictions
y_pred_iris = model_iris.predict(X_test_iris_scaled)

print("✓ Multiclass model trained!")

In [None]:
# Calculate metrics with different averaging
print("Multiclass Metrics:\n")

for average in ['macro', 'weighted', 'micro']:
    prec = precision_score(y_test_iris, y_pred_iris, average=average)
    rec = recall_score(y_test_iris, y_pred_iris, average=average)
    f1 = f1_score(y_test_iris, y_pred_iris, average=average)
    
    print(f"{average.capitalize()} Average:")
    print(f"  Precision: {prec:.3f}")
    print(f"  Recall:    {rec:.3f}")
    print(f"  F1-Score:  {f1:.3f}\n")

# Detailed report
print("\nPer-Class Report:")
print(classification_report(y_test_iris, y_pred_iris))

In [None]:
# Multiclass confusion matrix
# Use the model's own class order so tick labels always match the matrix rows/columns
class_labels = list(model_iris.classes_)
cm_iris = confusion_matrix(y_test_iris, y_pred_iris, labels=class_labels)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_iris, annot=True, fmt='d', cmap='YlGnBu', cbar=True,
           xticklabels=class_labels,
           yticklabels=class_labels)
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)
plt.title('Multiclass Confusion Matrix\n(Iris Dataset)', 
         fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Diagonal values = correct predictions")
print("Off-diagonal values = misclassifications")

## 6. Regression Metrics

### Common Regression Metrics

1. **MAE (Mean Absolute Error)**
   - Average absolute difference
   - Same units as target
   - Less sensitive to outliers

2. **MSE (Mean Squared Error)**
   - Average squared difference
   - Penalizes large errors more
   - Units are squared

3. **RMSE (Root Mean Squared Error)**
   - Square root of MSE
   - Same units as target
   - Popular and interpretable

4. **R² (Coefficient of Determination)**
   - Proportion of variance explained
   - Range: -∞ to 1 (1 is perfect)
   - 0 means the model is no better than always predicting the mean; negative values mean it is worse
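
The definitions above translate directly into NumPy. This sketch uses four made-up target values (illustrative only) with one deliberately large error, and cross-checks the hand-computed numbers against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions; the last prediction is off by 3
reg_true = np.array([3.0, 5.0, 7.0, 9.0])
reg_pred = np.array([2.5, 5.0, 8.0, 12.0])

errors = reg_true - reg_pred
mae_demo = np.mean(np.abs(errors))
mse_demo = np.mean(errors ** 2)
rmse_demo = np.sqrt(mse_demo)

ss_res = np.sum(errors ** 2)                          # residual sum of squares
ss_tot = np.sum((reg_true - reg_true.mean()) ** 2)    # total sum of squares
r2_demo = 1 - ss_res / ss_tot

# A baseline that always predicts the mean scores exactly R² = 0
r2_baseline = r2_score(reg_true, np.full_like(reg_true, reg_true.mean()))

print(f"MAE:  {mae_demo:.4f}  (sklearn: {mean_absolute_error(reg_true, reg_pred):.4f})")
print(f"MSE:  {mse_demo:.4f}  (sklearn: {mean_squared_error(reg_true, reg_pred):.4f})")
print(f"RMSE: {rmse_demo:.4f}")
print(f"R²:   {r2_demo:.4f}  (sklearn: {r2_score(reg_true, reg_pred):.4f})")
print(f"R² of mean baseline: {r2_baseline:.4f}")
```

RMSE (about 1.60) is noticeably larger than MAE (1.125) because squaring gives the single large error more weight.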

In [None]:
# Load regression dataset
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

housing_df = pd.read_csv('data/sample/california_housing.csv')

X_reg = housing_df.drop('median_house_value', axis=1)
y_reg = housing_df['median_house_value']

# Split data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42
)

# Train model
model_reg = LinearRegression()
model_reg.fit(X_train_reg, y_train_reg)

# Predictions
y_pred_reg = model_reg.predict(X_test_reg)

print("✓ Regression model trained!")

In [None]:
# Calculate regression metrics
mae = mean_absolute_error(y_test_reg, y_pred_reg)
mse = mean_squared_error(y_test_reg, y_pred_reg)
rmse = np.sqrt(mse)
r2 = r2_score(y_test_reg, y_pred_reg)

print("Regression Metrics:")
print(f"MAE:  ${mae:,.2f}")
print(f"MSE:  {mse:,.2f} (squared dollars)")
print(f"RMSE: ${rmse:,.2f}")
print(f"R²:   {r2:.3f}")

print(f"\nInterpretation:")
print(f"- On average, predictions are off by ${mae:,.2f} (MAE)")
print(f"- Typical prediction error is ${rmse:,.2f} (RMSE)")
print(f"- Model explains {r2*100:.1f}% of variance (R²)")

In [None]:
# Visualize residuals (errors)
residuals = y_test_reg - y_pred_reg

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Residual plot
axes[0].scatter(y_pred_reg, residuals, alpha=0.3, s=20)
axes[0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0].set_xlabel('Predicted Values', fontsize=12)
axes[0].set_ylabel('Residuals (Actual - Predicted)', fontsize=12)
axes[0].set_title('Residual Plot\n(Should be randomly scattered around 0)', 
                 fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Residual distribution
axes[1].hist(residuals, bins=50, edgecolor='black', alpha=0.7)
axes[1].axvline(x=0, color='r', linestyle='--', linewidth=2)
axes[1].set_xlabel('Residuals', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Residual Distribution\n(Ideally roughly symmetric and centered at 0)', 
                 fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Residual Statistics:")
print(f"Mean: ${residuals.mean():,.2f} (should be close to 0)")
print(f"Std:  ${residuals.std():,.2f}")

## Exercises

### Exercise 1: Choosing the Right Metric

For each scenario, identify which metric to prioritize and why:

1. Email spam detection
2. Credit card fraud detection
3. Medical diagnosis for rare disease
4. Product recommendation system
5. House price prediction

Write your answers as comments.

In [None]:
# Your answers:
# 1. 
# 2. 
# 3. 
# 4. 
# 5. 


### Exercise 2: ROC Curve Comparison

Compare two models using ROC curves:

1. Train both LogisticRegression and DecisionTreeClassifier on breast cancer data
2. Plot both ROC curves on the same graph
3. Calculate AUC for both
4. Which model performs better?

In [None]:
# Your code here


### Exercise 3: Finding Optimal Threshold

For the breast cancer model:
1. Try different thresholds from 0.1 to 0.9
2. Calculate precision, recall, and F1-score for each
3. Plot all three metrics vs threshold
4. Find the threshold that maximizes F1-score

In [None]:
# Your code here


### Exercise 4: Regression Metrics Comparison

On the diabetes dataset:
1. Train LinearRegression and DecisionTreeRegressor (max_depth=5)
2. Calculate MAE, RMSE, and R² for both
3. Create a comparison table
4. Which model is better according to different metrics?

In [None]:
# Your code here


## Summary

### Key Concepts

1. **Classification Metrics**:
   - **Accuracy**: Overall correctness (misleading with imbalanced data)
   - **Precision**: Of predicted positives, how many are correct? (minimize FP)
   - **Recall**: Of actual positives, how many were found? (minimize FN)
   - **F1-Score**: Balance of precision and recall
   - **ROC-AUC**: Overall discriminative ability (threshold-independent)

2. **Metric Selection**:
   - **High FP cost** → Prioritize precision (spam detection)
   - **High FN cost** → Prioritize recall (cancer detection)
   - **Balance both** → Use F1-score
   - **Compare models** → Use ROC-AUC

3. **Regression Metrics**:
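   - **MAE**: Average absolute error (less sensitive to outliers than RMSE)
   - **RMSE**: Typical error (penalizes large errors)
   - **R²**: Proportion of variance explained (at most 1; negative when the model is worse than predicting the mean)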

4. **Best Practices**:
   - Never rely on accuracy alone
   - Consider class imbalance
   - Understand business costs of errors
   - Use multiple metrics for comprehensive evaluation
   - Visualize confusion matrices and ROC curves

### What's Next?

In **Module 07: Cross-Validation and Hyperparameter Tuning**, you'll learn:
- K-fold cross-validation for robust evaluation
- Grid search and random search
- Hyperparameter optimization
- Avoiding overfitting in model selection

### Additional Resources

- [Precision and Recall - StatQuest](https://www.youtube.com/watch?v=Kdsp6soqA7o)
- [ROC and AUC - StatQuest](https://www.youtube.com/watch?v=4jRBRDbJemM)
- [scikit-learn Metrics Guide](https://scikit-learn.org/stable/modules/model_evaluation.html)