# Module 06: Model Evaluation Metrics

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 90 minutes  
**Prerequisites**: 
- [Module 03: Linear Regression](03_linear_regression.ipynb)
- [Module 04: Logistic Regression](04_logistic_regression.ipynb)
- [Module 05: Decision Trees](05_decision_trees.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand and calculate classification metrics (accuracy, precision, recall, F1)
2. Interpret confusion matrices for multi-class problems
3. Use ROC curves and AUC for model comparison
4. Apply regression metrics (R², MSE, RMSE, MAE, MAPE) appropriately
5. Choose the right metric based on problem context
6. Handle imbalanced datasets with appropriate metrics

## 1. Setup and Imports

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    roc_curve, roc_auc_score, precision_recall_curve, auc,
    mean_squared_error, mean_absolute_error, r2_score,
    mean_absolute_percentage_error
)

# Configuration
%matplotlib inline
# This style name requires Matplotlib >= 3.6 (older releases call it 'seaborn-darkgrid')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print('All libraries imported successfully!')

## 2. Classification Metrics Overview

### The Confusion Matrix

The confusion matrix is the foundation of the classification metrics covered in this module:

```
                 Predicted
               Negative  Positive
Actual Negative   TN        FP
       Positive   FN        TP
```

- **TN** (True Negative): Correctly predicted negative
- **TP** (True Positive): Correctly predicted positive
- **FN** (False Negative): Actually positive, predicted negative (Type II error)
- **FP** (False Positive): Actually negative, predicted positive (Type I error)
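
As a small sketch of how these four cells map onto scikit-learn's output, the toy labels below (invented for illustration, not taken from the dataset used later) are passed to `confusion_matrix`. With labels 0 and 1, rows are actual classes and columns are predicted classes, so flattening the matrix row by row yields TN, FP, FN, TP in that order.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration (1 = positive, 0 = negative)
y_true_toy = [0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 1, 0, 1, 1, 0, 1]

# Rows = actual class, columns = predicted class
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)

# For a binary problem, ravel() flattens row by row: TN, FP, FN, TP
tn, fp, fn, tp = cm_toy.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Unpacking with `ravel()` only makes sense for a 2×2 matrix; for multi-class problems the matrix has to be read per class.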

### Key Metrics

1. **Accuracy** = (TP + TN) / (TP + TN + FP + FN)
   - Overall correctness
   - Good when classes are balanced

2. **Precision** = TP / (TP + FP)
   - Of predicted positives, how many are actually positive?
   - Important when false positives are costly

3. **Recall (Sensitivity)** = TP / (TP + FN)
   - Of actual positives, how many did we find?
   - Important when false negatives are costly

4. **F1-Score** = 2 × (Precision × Recall) / (Precision + Recall)
   - Harmonic mean of precision and recall
   - Usually more informative than accuracy on imbalanced datasets, but it ignores true negatives
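
To make the formulas above concrete, here is a minimal sketch that computes all four metrics by hand. The counts are hypothetical and chosen only so the arithmetic is easy to follow.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 70 / 100
precision = tp / (tp + fp)                          # 40 / 50
recall = tp / (tp + fn)                             # 40 / 60
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Accuracy:  {accuracy:.4f}")   # 0.7000
print(f"Precision: {precision:.4f}")  # 0.8000
print(f"Recall:    {recall:.4f}")     # 0.6667
print(f"F1-Score:  {f1:.4f}")         # 0.7273
```

These hand-written formulas divide by zero when a model never predicts the positive class (TP + FP = 0). The scikit-learn functions used later guard against that case through their `zero_division` parameter.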

In [None]:
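# Load breast cancer dataset
# Note: scikit-learn encodes the target as 0 = malignant, 1 = benign,
# so the default "positive" class (label 1) in the metrics below is benign.
cancer = datasets.load_breast_cancer()
X, y = cancer.data, cancer.target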

# Split and train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=10000, random_state=42)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

print('Breast Cancer Classification Model Trained')
print(f'Classes: {cancer.target_names}')

In [None]:
# Calculate metrics (precision, recall, and F1 default to the class labeled 1)
print('Classification Metrics:')
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred):.4f}")

# Confusion matrix (rows = actual class, columns = predicted class)
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)
tn, fp, fn, tp = cm.ravel()  # binary case: flattened row by row
print(f"\nTN={tn}, FP={fp}")
print(f"FN={fn}, TP={tp}")

## 3. ROC Curve and AUC

The **ROC (Receiver Operating Characteristic)** curve shows how the two rates below trade off as the decision threshold varies:
- X-axis: False Positive Rate (FPR) = FP / (FP + TN)
- Y-axis: True Positive Rate (TPR) = TP / (TP + FN) = Recall

**AUC (Area Under Curve)**:
- Perfect classifier: AUC = 1.0
- Random classifier: AUC = 0.5
- Worse than random: AUC < 0.5 (the scores rank negatives above positives more often than not)
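
One way to build intuition for these values: AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below checks that pairwise count against `roc_auc_score` on four toy scores (the data and the `*_auc` variable names are illustrative, not part of the breast cancer example).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, invented for illustration
y_true_auc = np.array([0, 0, 1, 1])
scores_auc = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores_auc[y_true_auc == 1]  # scores of actual positives
neg = scores_auc[y_true_auc == 0]  # scores of actual negatives

# Compare every positive score with every negative score
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(f"Pairwise AUC:  {pairwise_auc:.2f}")                            # 0.75
print(f"roc_auc_score: {roc_auc_score(y_true_auc, scores_auc):.2f}")  # 0.75
```

Three of the four pairs are ordered correctly (only 0.35 vs. 0.4 is not), which gives 0.75. The pairwise loop is quadratic in the number of samples, so it is meant for understanding rather than for large datasets.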

In [None]:
# Get probability predictions
y_proba = model.predict_proba(X_test_scaled)[:, 1]

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = roc_auc_score(y_test, y_proba)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], 'r--', linewidth=2, label='Random (AUC = 0.5)')
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate (Recall)', fontsize=12)
plt.title('ROC Curve', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f'AUC Score: {roc_auc:.4f}')
print('Interpretation: Higher AUC = Better model discrimination')

## 4. Practice Exercises

### Exercise 1: Imbalanced Dataset

Create an imbalanced dataset (90% class 0, 10% class 1). Calculate accuracy, precision, recall, and F1. Why might accuracy be misleading?

In [None]:
# Your code here


### Exercise 2: Threshold Tuning

Using the breast cancer model, try different probability thresholds (0.3, 0.5, 0.7). How do precision and recall change?

In [None]:
# Your code here


### Exercise 3: Multi-Class Metrics

Train a classifier on the Iris dataset (3 classes). Calculate macro- and micro-averaged precision, recall, and F1. What's the difference between the two averages?

In [None]:
# Your code here


## 5. Summary

### Key Concepts

1. **Classification Metrics**:
   - Accuracy: Overall correctness (good for balanced data)
   - Precision: Minimize false positives
   - Recall: Minimize false negatives
   - F1: Balance precision and recall

2. **Choosing Metrics**:
   - Medical diagnosis: High recall (don't miss diseases)
   - Spam detection: High precision (don't block important emails)
   - Balanced problem: F1-score or accuracy

3. **ROC-AUC**:
   - Threshold-independent metric
   - Compares models at all thresholds
   - Higher AUC = better discrimination

4. **Regression Metrics**:
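   - R²: Proportion of variance explained (at most 1; negative when the model does worse than predicting the mean)
   - RMSE: Square root of the mean squared error (same units as target; penalizes large errors heavily)
   - MAE: Mean absolute error (same units as target; less sensitive to outliers than RMSE)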
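
As a brief sketch of the regression metrics listed above, the snippet below evaluates them on four made-up values (the arrays and the `*_reg` names are illustrative assumptions, not results from a trained model). RMSE is taken as the square root of the MSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up targets and predictions, for illustration only
y_true_reg = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_reg = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true_reg, y_pred_reg)
rmse = np.sqrt(mse)  # back in the units of the target
mae = mean_absolute_error(y_true_reg, y_pred_reg)
r2 = r2_score(y_true_reg, y_pred_reg)

print(f"MSE:  {mse:.4f}")   # 0.3750
print(f"RMSE: {rmse:.4f}")  # 0.6124
print(f"MAE:  {mae:.4f}")   # 0.5000
print(f"R²:   {r2:.4f}")    # 0.9486

# A constant prediction far from the data scores worse than predicting the mean
r2_bad = r2_score(y_true_reg, np.full_like(y_true_reg, 10.0))
print(f"R² for a poor constant prediction: {r2_bad:.2f}")
```

The last line shows that R² is not confined to the 0 to 1 range: a model that fits worse than the mean of the targets produces a negative value.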

### Next Steps

In Module 07, we'll explore:
- Cross-validation techniques
- Hyperparameter tuning (GridSearchCV, RandomizedSearchCV)
- Learning curves and model selection

### Additional Resources

- [Scikit-learn Metrics Guide](https://scikit-learn.org/stable/modules/model_evaluation.html)
- [Understanding ROC Curves](https://www.youtube.com/watch?v=4jRBRDbJemM)
- [Precision vs Recall](https://towardsdatascience.com/precision-vs-recall-386cf9f89488)