
# 🧮 Classification Metrics: Understanding Model Evaluation

This notebook walks through key **classification metrics** used to evaluate machine learning models.  
Each section explains the concept, formula, intuition, and potential pitfalls, with examples and diagrams you can visualize later.

---

## 1️⃣ Classification Metrics Overview

When evaluating a classification model, accuracy alone is rarely enough.  
We use **multiple metrics** to understand different aspects of model performance, especially when dealing with **imbalanced data** or **unequal costs of errors**.

---

## 2️⃣ Accuracy

**Definition:**  
Accuracy measures how often the model correctly predicts the class.

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

- **TP (True Positive)**: Model predicted positive, and it was positive.  
- **TN (True Negative)**: Model predicted negative, and it was negative.  
- **FP (False Positive)**: Model predicted positive, but it was negative.  
- **FN (False Negative)**: Model predicted negative, but it was positive.

✅ **Good for:** balanced datasets  
⚠️ **Not reliable for:** imbalanced datasets (e.g., fraud detection, disease diagnosis)
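
The formula above can be checked by hand with a few lines of Python. This is a minimal sketch; the function name `accuracy` and the counts are hypothetical, chosen only to illustrate the arithmetic.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that were correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total


# Hypothetical confusion-matrix counts for 100 predictions.
print(accuracy(tp=40, tn=45, fp=5, fn=10))  # (40 + 45) / 100 = 0.85
```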

---

## 3️⃣ Issues with Accuracy

Accuracy fails when the dataset is **highly imbalanced**.  
Example: In a dataset with 99% negatives and 1% positives, a model that always predicts "negative" achieves **99% accuracy** but **0% recall**: it never detects a single positive case.

**Better alternatives:** Precision, Recall, F1/F2 Scores, ROC-AUC.
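
The imbalance problem is easy to reproduce. The sketch below builds a hypothetical dataset of 990 negatives and 10 positives and scores a "model" that always predicts negative; the variable names are illustrative only.

```python
# Hypothetical dataset: 990 negatives (0) and 10 positives (1).
y_true = [0] * 990 + [1] * 10
# A "model" that ignores its input and always predicts negative.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
baseline_accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
baseline_recall = true_positives / actual_positives

print(f"accuracy = {baseline_accuracy:.2f}")  # 0.99
print(f"recall   = {baseline_recall:.2f}")    # 0.00
```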

---

## 4️⃣ Confusion Matrix

A **Confusion Matrix** summarizes binary classification performance in a 2×2 grid:

|               | **Predicted Positive** | **Predicted Negative** |
|----------------|------------------------|------------------------|
| **Actual Positive** | TP (True Positive) | FN (False Negative) |
| **Actual Negative** | FP (False Positive) | TN (True Negative) |

It helps identify which types of errors your model is making.
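
As a rough illustration, the four cells can be counted directly from paired labels. The helper `confusion_counts` and the toy labels below are assumptions made for this sketch, not part of any library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, fp, tn) for binary labels."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1  # actual positive, predicted positive
            else:
                fn += 1  # actual positive, predicted negative
        else:
            if predicted == positive:
                fp += 1  # actual negative, predicted positive
            else:
                tn += 1  # actual negative, predicted negative
    return tp, fn, fp, tn


# Toy labels, invented for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FN={fn}")  # TP=3 FN=1
print(f"FP={fp} TN={tn}")  # FP=1 TN=3
```

Note that scikit-learn's `confusion_matrix` sorts labels, so for 0/1 labels it returns `[[TN, FP], [FN, TP]]`, which is the table above with both rows and columns flipped.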

---

## 5️⃣ Type I and Type II Errors

| Error Type | Description | Example |
|-------------|--------------|----------|
| **Type I Error (False Positive)** | Predicting positive when it's actually negative. | Predicting someone has a disease when they don't. |
| **Type II Error (False Negative)** | Predicting negative when it's actually positive. | Predicting someone is healthy when they actually have the disease. |

💡 *Tradeoff:* For a given model, moving the decision threshold to reduce one type of error usually increases the other.
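
The threshold trade-off can be seen on a small example. This sketch uses invented scores and labels and a hypothetical helper `error_counts`; a score at or above the threshold is treated as a positive prediction.

```python
# Invented labels and model scores, for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.30, 0.45, 0.50, 0.60, 0.65, 0.80, 0.90, 0.95]


def error_counts(y_true, scores, threshold):
    """Count Type I (FP) and Type II (FN) errors at a given threshold."""
    fp = sum(t == 0 and s >= threshold for t, s in zip(y_true, scores))
    fn = sum(t == 1 and s < threshold for t, s in zip(y_true, scores))
    return fp, fn


for threshold in (0.20, 0.55, 0.85):
    fp, fn = error_counts(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  Type I (FP)={fp}  Type II (FN)={fn}")
# threshold=0.20  Type I (FP)=3  Type II (FN)=0
# threshold=0.55  Type I (FP)=1  Type II (FN)=1
# threshold=0.85  Type I (FP)=0  Type II (FN)=3
```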

---

## 6️⃣ Precision and Recall

### Precision
$$
\text{Precision} = \frac{TP}{TP + FP}
$$
- Out of all predicted positives, how many were correct?  
- **High precision** → few false positives.

### Recall (Sensitivity / True Positive Rate)
$$
\text{Recall} = \frac{TP}{TP + FN}
$$
- Out of all actual positives, how many were detected?  
- **High recall** → few false negatives.
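
Both formulas are one-liners in code. In this sketch the counts are hypothetical, and returning `0.0` when the denominator is zero is an assumed convention rather than part of the definitions.

```python
def precision(tp: int, fp: int) -> float:
    """Of everything predicted positive, the fraction that was correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0  # 0.0 is an assumed convention


def recall(tp: int, fn: int) -> float:
    """Of everything actually positive, the fraction that was detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0  # 0.0 is an assumed convention


# Hypothetical counts.
tp, fp, fn = 30, 10, 20
print(precision(tp, fp))  # 30 / 40 = 0.75
print(recall(tp, fn))     # 30 / 50 = 0.6
```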

---

## 7️⃣ F1, F2, and Fβ Scores

F-scores combine precision and recall into one metric using a weighted harmonic mean.

$$
F_{\beta} = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{(\beta^2 \cdot \text{Precision}) + \text{Recall}}
$$

- **F1 Score:** balances precision and recall equally (β = 1).  
- **F2 Score:** gives **more weight to recall**; useful in medical or fraud detection contexts.  
- **F0.5 Score:** gives **more weight to precision**; useful in spam or recommendation systems.
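
To see how β shifts the balance, the sketch below evaluates the formula for a hypothetical classifier whose recall (0.8) is higher than its precision (0.5). The function name `f_beta` and the zero-denominator fallback are assumptions for this example.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0  # assumed convention when both inputs are zero
    return (1 + b2) * precision * recall / denominator


# Hypothetical precision and recall values.
p, r = 0.5, 0.8
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(p, r, beta):.3f}")
# F0.5 = 0.541
# F1 = 0.615
# F2 = 0.714
```

Because recall is the stronger of the two here, the recall-weighted F2 is the highest score and the precision-weighted F0.5 is the lowest.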

---

## 8️⃣ ROC Curve and AUC

The **ROC (Receiver Operating Characteristic)** curve plots the following across all decision thresholds:
- **True Positive Rate (Recall)** on Y-axis  
- **False Positive Rate (1 - Specificity)** on X-axis

The **AUC (Area Under Curve)** measures how well the model separates classes.  
- **AUC = 1.0:** perfect model  
- **AUC = 0.5:** random guessing
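
One way to build intuition for AUC: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below computes it that way by brute force over all pairs, counting ties as half. This is an illustration of the idea, not how a library would compute it efficiently; `roc_auc` and the toy data are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy labels and scores, invented for illustration.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 3 of 4 pairs ranked correctly = 0.75
```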

---

## 9️⃣ Precision–Recall Curve

When data is **highly imbalanced**, the **PR curve** is often more informative than ROC.  
It shows the trade-off between **precision and recall** for different thresholds.
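
The points of a PR curve can be sketched by sweeping the threshold over each distinct score and recomputing precision and recall. The helper `pr_points` and the toy data are assumptions for this example; predictions are positive when the score is at or above the threshold.

```python
def pr_points(y_true, scores):
    """Return (threshold, precision, recall) for each distinct score."""
    total_pos = sum(1 for t in y_true if t == 1)
    if total_pos == 0:
        raise ValueError("recall is undefined without any positives")
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
        # tp + fp >= 1 because the threshold is itself one of the scores.
        points.append((threshold, tp / (tp + fp), tp / total_pos))
    return points


# Toy labels and scores, invented for illustration.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
for threshold, precision, recall in pr_points(y_true, scores):
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
# threshold=0.80  precision=1.00  recall=0.50
# threshold=0.40  precision=0.50  recall=0.50
# threshold=0.35  precision=0.67  recall=1.00
# threshold=0.10  precision=0.50  recall=1.00
```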

---

## 🔟 Specificity (True Negative Rate)

$$
\text{Specificity} = \frac{TN}{TN + FP}
$$
Specificity measures the fraction of actual **negatives** that the model correctly identifies.  
Used alongside recall (sensitivity) in medical testing.
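
A short sketch with hypothetical counts shows specificity next to the false positive rate, which is its complement and the x-axis of the ROC curve. The function names are illustrative.

```python
def specificity(tn: int, fp: int) -> float:
    """Fraction of actual negatives correctly predicted as negative."""
    return tn / (tn + fp)


def false_positive_rate(tn: int, fp: int) -> float:
    """Fraction of actual negatives wrongly predicted as positive."""
    return fp / (tn + fp)


# Hypothetical counts: 100 actual negatives, 20 of them flagged as positive.
tn, fp = 80, 20
print(specificity(tn, fp))          # 80 / 100 = 0.8
print(false_positive_rate(tn, fp))  # 20 / 100 = 0.2
```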

---

## 11️⃣ Balanced Accuracy

$$
\text{Balanced Accuracy} = \frac{\text{Recall}_{\text{positive}} + \text{Recall}_{\text{negative}}}{2}
$$

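Recall on the negative class is the same quantity as specificity, so balanced accuracy is the mean of sensitivity and specificity. It is useful when data is imbalanced because each class contributes equally regardless of its size.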
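
The contrast with plain accuracy is clearest on skewed data. In this sketch the counts are hypothetical (990 negatives, 10 positives, only 2 positives found), and the function assumes both classes are present.

```python
def balanced_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Mean of recall on the positive class and recall on the negative class."""
    recall_positive = tp / (tp + fn)  # sensitivity
    recall_negative = tn / (tn + fp)  # specificity
    return (recall_positive + recall_negative) / 2


# Hypothetical counts on an imbalanced dataset.
tp, fn, tn, fp = 2, 8, 985, 5

plain = (tp + tn) / (tp + tn + fp + fn)
print(f"accuracy          = {plain:.3f}")                              # 0.987
print(f"balanced accuracy = {balanced_accuracy(tp, tn, fp, fn):.3f}")  # 0.597
```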

---

## 12️⃣ Matthews Correlation Coefficient (MCC)

MCC is a **single-number summary** that uses all four confusion-matrix counts, so it stays informative even on imbalanced datasets.

$$
\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
$$

Values range from -1 (total disagreement) to +1 (perfect prediction), with 0 indicating performance no better than random guessing.
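
A minimal sketch of the formula follows, with hypothetical counts that hit the two extremes and the always-negative baseline. Returning `0.0` when the denominator is zero is an assumed convention, since the formula is undefined there.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention: a row or column of the matrix is empty
    return numerator / denominator


print(mcc(tp=50, tn=50, fp=0, fn=0))   # perfect prediction: 1.0
print(mcc(tp=0, tn=0, fp=50, fn=50))   # every prediction wrong: -1.0
print(mcc(tp=0, tn=990, fp=0, fn=10))  # always predicts negative: 0.0
```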

---

## 13️⃣ Cohen's Kappa

Cohen's Kappa adjusts accuracy for chance agreement.

$$
\kappa = \frac{p_o - p_e}{1 - p_e}
$$

- $p_o$: observed accuracy  
- $p_e$: expected accuracy by random chance

A higher Kappa indicates more agreement beyond what chance alone would produce: 1 is perfect agreement and 0 is chance-level.
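
For a binary confusion matrix, the chance term can be worked out from the row and column totals: it is the probability that prediction and truth agree if each were drawn independently at their observed rates. The sketch below assumes that standard construction; `cohens_kappa` and the counts are hypothetical.

```python
def cohens_kappa(tp: int, tn: int, fp: int, fn: int) -> float:
    """Cohen's kappa for a binary confusion matrix."""
    n = tp + tn + fp + fn
    p_o = (tp + tn) / n  # observed accuracy
    # Chance agreement on each class: P(predicted class) * P(actual class).
    p_pos = ((tp + fp) / n) * ((tp + fn) / n)
    p_neg = ((tn + fn) / n) * ((tn + fp) / n)
    p_e = p_pos + p_neg
    if p_e == 1:
        raise ValueError("kappa is undefined when only one class appears")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical counts: p_o = 35/50 = 0.7 and p_e = 0.3 + 0.2 = 0.5.
print(f"{cohens_kappa(tp=20, tn=15, fp=10, fn=5):.2f}")  # (0.7 - 0.5) / 0.5 = 0.40
```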

---

## 14️⃣ Summary Table

| Metric | Focus | Best Used When |
|---------|--------|----------------|
| **Accuracy** | Overall correctness | Classes are balanced |
| **Precision** | False positives matter | Spam detection |
| **Recall** | False negatives matter | Medical or fraud detection |
| **F1/F2** | Precision–recall trade-off | Both error types matter (F2 when recall matters more) |
| **ROC-AUC** | Ranking power | Classes are roughly balanced |
| **PR-AUC** | Minority detection | Imbalanced datasets |
| **MCC** | Correlation-based | Imbalanced datasets where all four outcome types matter |
| **Kappa** | Adjusted accuracy | Chance-corrected evaluation |

---

## 🧠 Key Takeaways

- **Always look beyond accuracy.**  
- Choose metrics based on **business impact** of errors.  
- Tune **thresholds** and **monitor multiple metrics** to ensure robust model evaluation.  
- For imbalanced datasets, use **Precision, Recall, F-score, PR-AUC, and MCC** over plain accuracy.

---

📘 *Next Step:*  
Add visuals like ROC and PR curves using scikit-learn's `RocCurveDisplay` and `PrecisionRecallDisplay` (the older `plot_roc_curve` and `plot_precision_recall_curve` helpers were removed in scikit-learn 1.2), and create confusion matrices with `ConfusionMatrixDisplay`.
