
---

## **1. What Are Classification Metrics?**

**Definition:**
Classification metrics are quantitative measures used to evaluate how well a machine learning model predicts **categorical outcomes** (e.g., spam vs not spam, disease vs no disease).

**Why They’re Important:**

* They measure **how well the model distinguishes between different classes**.
* They help identify whether the model is **accurate**, **balanced**, and **reliable**.
* Different metrics focus on **different aspects** of performance — this is crucial when dealing with **imbalanced datasets** or **different costs for errors**.

---

## **2. The Confusion Matrix – The Foundation**

A **confusion matrix** summarizes predictions in a table:

|                     | Predicted Positive      | Predicted Negative      |
| ------------------- | ----------------------- | ----------------------- |
| **Actual Positive** | **True Positive (TP)**  | **False Negative (FN)** |
| **Actual Negative** | **False Positive (FP)** | **True Negative (TN)**  |

* **TP:** Model correctly predicted positive.
* **TN:** Model correctly predicted negative.
* **FP:** Model predicted positive but was wrong (false alarm).
* **FN:** Model predicted negative but missed a positive case.

**Why it’s important:**
Almost all classification metrics are derived from these four numbers.
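To make the four cells concrete, here is a minimal sketch that counts them by hand for binary labels. The helper name `confusion_counts` and the `positive` parameter are illustrative assumptions, not part of any library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for binary labels (hypothetical helper)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative (false alarm)
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive (missed case)
            else:
                tn += 1   # predicted negative, actually negative
    return tp, tn, fp, fn


y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

The four counts always sum to the number of samples, which is a quick sanity check on any confusion matrix.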

---

## **3. Common Classification Metrics**

---

### **3.1 Accuracy**

**Definition:**
The proportion of correct predictions (both positives and negatives) out of all predictions.

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

**Purpose:**

* Quick measure of overall performance.

**Interpretation:**

* High accuracy means most predictions are correct.
* **Limitation:** Misleading when classes are imbalanced. For example, on a dataset with 95% negatives, a model that predicts "Negative" for every case reaches 95% accuracy while detecting no positives at all.

**Example:**
If TP = 50, TN = 40, FP = 5, FN = 5 → Accuracy = (50+40)/(50+40+5+5) = 90/100 = **0.90 (90%)**.

**Best used when:** Classes are **balanced**.
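The imbalance limitation can be reproduced in a few lines. This sketch uses a synthetic dataset (95 negatives, 5 positives, made up for illustration) and a trivial "model" that always predicts the negative class.

```python
# Synthetic, illustrative data: 95 negatives and 5 positives
y_true = [0] * 95 + [1] * 5

# A trivial baseline that predicts "negative" for every sample
y_pred = [0] * len(y_true)

correct = sum(1 for a, p in zip(y_true, y_pred) if a == p)
accuracy = correct / len(y_true)

# How many of the actual positives did the baseline find?
positives_found = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)

print(f"Accuracy: {accuracy:.2f}")
print(f"Positives found: {positives_found} of {sum(y_true)}")
```

The accuracy looks strong even though the baseline is useless for the positive class, which is why accuracy alone is a weak signal on imbalanced data.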

---

### **3.2 Precision**

**Definition:**
The proportion of correctly predicted positives out of all predicted positives.

$$
\text{Precision} = \frac{TP}{TP + FP}
$$

**Purpose:**

* Answers: **"When the model predicts Positive, how often is it correct?"**
* Important in scenarios where **false positives are costly**.

**Interpretation:**

* High precision = low false alarm rate.
* Low precision = many false positives.

**Example:**
If TP = 50, FP = 10 → Precision = 50/(50+10) = **0.83 (83%)**.

**Best used when:** **False positives** are more harmful (e.g., predicting cancer when it’s not there → unnecessary anxiety/tests).
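The formula is undefined when the model predicts no positives at all (TP + FP = 0). The sketch below reproduces the worked example and handles that edge case by returning 0.0. That return value is a convention assumed here, and the function name `precision` is illustrative.

```python
def precision(tp, fp):
    """Precision from raw counts (hypothetical helper)."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        # No positive predictions: the ratio is undefined.
        # Returning 0.0 is an assumed convention, not the only valid choice.
        return 0.0
    return tp / predicted_positive


print(f"Precision: {precision(50, 10):.2f}")
print(f"Precision with no positive predictions: {precision(0, 0):.2f}")
```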

---

### **3.3 Recall (Sensitivity / True Positive Rate)**

**Definition:**
The proportion of actual positives correctly identified.

$$
\text{Recall} = \frac{TP}{TP + FN}
$$

**Purpose:**

* Answers: **"Out of all actual positives, how many did we correctly predict?"**
* Important in scenarios where **missing a positive case is costly**.

**Interpretation:**

* High recall = fewer missed cases.
* Low recall = many false negatives.

**Example:**
If TP = 50, FN = 5 → Recall = 50/(50+5) = **0.91 (91%)**.

**Best used when:** **False negatives** are more harmful (e.g., detecting diseases, fraud detection).
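Recall usually trades off against precision: lowering the decision threshold catches more positives but also lets in more false alarms. The following sketch illustrates this with made-up scores. The helper `precision_recall_at` and the data are assumptions for illustration only.

```python
def precision_recall_at(threshold, y_true, scores):
    """Threshold the scores, then return (precision, recall)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative labels and model scores (not real data)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.3, 0.6, 0.4, 0.35, 0.1]

strict = precision_recall_at(0.5, y_true, scores)
lenient = precision_recall_at(0.25, y_true, scores)

print(f"threshold 0.50: precision={strict[0]:.2f}, recall={strict[1]:.2f}")
print(f"threshold 0.25: precision={lenient[0]:.2f}, recall={lenient[1]:.2f}")
```

With these scores, the lower threshold recovers every positive at the price of a lower precision.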

---

### **3.4 F1 Score**

**Definition:**
The **harmonic mean** of precision and recall.

$$
\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

**Purpose:**

* Balances precision and recall into a single number.
* Useful when there’s **class imbalance** and you need to consider both false positives and false negatives.

**Interpretation:**

* F1 closer to 1 = better model.
* Lower F1 = precision, recall, or both are low. Because it is a harmonic mean, F1 is pulled toward the smaller of the two values.

**Example:**
If Precision = 0.83 and Recall = 0.91 →
F1 = 2 × (0.83 × 0.91) / (0.83 + 0.91) ≈ **0.87**.

**Best used when:** You need **balance** between precision & recall.
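A short sketch shows why the harmonic mean is used rather than a plain average: it penalizes a large gap between precision and recall. The function name `f1` and the lopsided example values (0.99 and 0.10) are illustrative assumptions.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1(0.83, 0.91)          # values from the worked example
lopsided = f1(0.99, 0.10)          # very precise, but misses most positives
arithmetic = (0.99 + 0.10) / 2     # a plain average hides the weak recall

print(f"F1 (balanced):       {balanced:.2f}")
print(f"F1 (lopsided):       {lopsided:.2f}")
print(f"Arithmetic mean:     {arithmetic:.2f}")
```

The lopsided case has a much lower F1 than its arithmetic mean suggests, so a model cannot score well on F1 by excelling at only one of the two.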

---

### **3.5 ROC-AUC (Receiver Operating Characteristic – Area Under Curve)**

**Definition:**

* **ROC curve** plots the True Positive Rate (Recall) against the False Positive Rate, $\text{FPR} = \frac{FP}{FP + TN}$, as the decision threshold varies.
* **AUC** = Area under the ROC curve (ranges from 0 to 1).

**Purpose:**

* Measures the **model’s ability to separate classes** regardless of threshold.
* AUC = 0.5 → Random guessing.
* AUC = 1.0 → Perfect classifier.

**Interpretation:**

* Higher AUC = better model.
* Less sensitive to class imbalance than accuracy, because TPR and FPR are each computed within a single actual class.

**Example:**
If a spam filter’s AUC is 0.95, it’s very good at separating spam from non-spam.

**Best used when:** You want **threshold-independent** evaluation.
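One way to build intuition for AUC is its ranking interpretation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as half. The sketch below computes it by brute force over all positive/negative pairs. The function name `pairwise_auc` and the scores are illustrative assumptions, and this quadratic-time loop is meant for understanding, not for large datasets.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Illustrative labels and scores (not real data)
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

auc = pairwise_auc(y_true, scores)
print(f"AUC: {auc:.3f}")
```

Here 8 of the 9 positive/negative pairs are ordered correctly; the only miss is the positive scored 0.4 against the negative scored 0.7.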

---

## **4. Choosing the Right Metric**

| Scenario                           | Best Metric | Why                     |
| ---------------------------------- | ----------- | ----------------------- |
| Balanced classes                   | Accuracy    | Simple, intuitive       |
| Cost of FP is high                 | Precision   | Avoid false alarms      |
| Cost of FN is high                 | Recall      | Catch as many positives as possible |
| Need balance between FP and FN     | F1 Score    | Harmonic mean           |
| Want threshold-independent measure | ROC-AUC     | Works across thresholds |

---

## **5. Practical Example**

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score

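# Actual labels and hard (0/1) predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Illustrative positive-class probabilities (assumed values, chosen so that
# thresholding at 0.5 reproduces y_pred)
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.95, 0.05]

# Confusion matrix: rows = actual, columns = predicted, labels ordered [0, 1],
# so the layout is [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)

# Threshold-dependent metrics use the hard predictions
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# ROC-AUC evaluates ranking across thresholds, so pass scores, not hard labels
auc = roc_auc_score(y_true, y_score)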

print("Confusion Matrix:\n", cm)
print(f"Accuracy: {acc:.2f}")
print(f"Precision: {prec:.2f}")
print(f"Recall: {rec:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC: {auc:.2f}")
```

---

## **6. Key Takeaways**

* Always start with a **confusion matrix** — it’s the root of most metrics.
* **Accuracy** is not always the best choice (especially with imbalanced data).
* Pick metrics that align with **business goals**:

  * If missing positives is costly → Focus on **Recall**.
  * If false alarms are costly → Focus on **Precision**.
  * If both matter → Use **F1 score**.
  * If you want to compare across thresholds → Use **ROC-AUC**.

---
