# 📊 Performance Metrics for Classification Models

Performance metrics help assess how well a machine learning model performs. The choice of metric depends on the **problem type**—whether it's classification, regression, or clustering. 

## 🔷 **Why is Evaluation Important?** 

After building a model, you need to know:
- Is it accurate?
- Is it reliable?
- Is it overfitting or underfitting?
- How well does it generalize to unseen data?

👉 That’s where evaluation metrics come into play.

# 🔷 Classification Metrics  
> Used to evaluate the performance of **classification models**, where the output is **categorical** (e.g., Spam/Not Spam, Disease/No Disease, Yes/No).

### 🔸 **Confusion Matrix**
|                        | **Predicted Positive** | **Predicted Negative** |
|------------------------|------------------------|------------------------|
| **Actual Positive**    | True Positive (TP)     | False Negative (FN)   |
| **Actual Negative**    | False Positive (FP)    | True Negative (TN)    |


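The confusion matrix helps you see:
- Which types of errors the model makes (false positives vs. false negatives)
- The four counts (TP, FN, FP, TN) from which the metrics below are computed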
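
As a minimal sketch of how the four cells are counted, the snippet below tallies TP, FN, FP and TN from two label lists. The function name `confusion_counts` and the toy labels are illustrative assumptions, not part of any library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for binary labels (hypothetical helper)."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # positive case correctly flagged
        elif actual == positive:
            fn += 1  # positive case missed
        elif predicted == positive:
            fp += 1  # negative case wrongly flagged
        else:
            tn += 1  # negative case correctly left alone
    return tp, fn, fp, tn


# Toy data, made up for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(tp, fn, fp, tn)  # 3 1 1 3
```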


## 🔸 **Comprehensive Classification Metrics Table**

| Metric               | Best Use Case                            | Range       | Formula Complexity | Formula                                                                                  | Advantages                                                  | Disadvantages                                               | Additional Use Cases                                        |
|----------------------|------------------------------------------|-------------|---------------------|------------------------------------------------------------------------------------------|-------------------------------------------------------------|-------------------------------------------------------------|-------------------------------------------------------------|
| **Accuracy**         | Balanced datasets                        | 0 to 1      | Low                 | $ \frac{TP + TN}{TP + TN + FP + FN} $                                                   | Easy to understand and compute                              | Misleading when data is imbalanced                         | Balanced datasets                                           |
| **Precision**        | Minimise False Positives                 | 0 to 1      | Low                 | $ \frac{TP}{TP + FP} $                                                                  | Reduces false positives, useful when false positives are costly | Ignores false negatives                                     | Spam detection, fraud detection                             |
| **Recall** (Sensitivity) | Minimise False Negatives             | 0 to 1      | Low                 | $ \frac{TP}{TP + FN} $                                                                  | Captures all actual positives                               | Ignores false positives                                     | Medical diagnosis, search relevance                         |
| **Specificity** (TNR) | Focus on identifying actual negatives   | 0 to 1      | Low                 | $ \frac{TN}{TN + FP} $                                                                 | Focuses on correct prediction of actual negatives           | Doesn’t consider true positives                            | Crime detection, background screening                       |
| **F1-Score**         | Balance of Precision & Recall            | 0 to 1      | Medium              | $ 2 \times \frac{Precision \times Recall}{Precision + Recall} $                        | Balances precision and recall                               | Harder to interpret intuitively                             | Imbalanced classification tasks                             |
| **Fβ-Score**         | Preference-weighted scenarios            | 0 to 1      | Medium              | $ (1 + \beta^2) \cdot \frac{Precision \cdot Recall}{\beta^2 \cdot Precision + Recall} $ | Adjusts weight between precision and recall                 | Needs tuning of β                                           | β > 1 for recall focus (medical), β < 1 for precision focus |
| **Balanced Accuracy** | Imbalanced datasets                     | 0 to 1      | Medium              | $ \frac{Recall + Specificity}{2} $                                                     | Handles imbalanced datasets better than regular accuracy    | Less popular, can be misunderstood                         | Imbalanced datasets with both FN and FP concern             |
| **MCC**              | Highly imbalanced datasets               | -1 to 1     | High                | $ \frac{(TP \cdot TN - FP \cdot FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} $            | Handles all confusion matrix terms, good for imbalance      | Slightly complex formula                                    | Highly imbalanced binary classification                     |
| **ROC-AUC**          | Ranking ability across thresholds        | 0 to 1      | High                | AUC of ROC curve (TPR vs FPR)                                                           | Threshold-independent, visual tool                          | Misleading in extreme imbalance                             | Classifier ranking and threshold selection                  |
| **PR-AUC**           | Imbalanced classification                | 0 to 1      | High                | AUC of Precision vs Recall curve                                                        | Better than ROC-AUC for imbalance                           | Hard to interpret without plot                             | Imbalanced datasets focusing on positive class              |
| **Log Loss**         | Probabilistic binary classification       | 0 to ∞      | High                | $ -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(p_i) + (1 - y_i) \log(1 - p_i)] $              | Penalises wrong confident predictions                       | Sensitive to outliers, only for probabilistic models        | Probabilistic binary classification                         |

---

### a. **Accuracy**

- **Definition**:  
  Proportion of correct predictions. Measures how many predictions the model got **correct** out of all predictions made.
  
- **Formula**:  
  $$
  \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
  $$

- **When to Use**:  
  - When the dataset has **balanced class distribution**.
  - Simple binary classification problems.

- **Real-life Example**:  
  Predicting whether a transaction is fraudulent when fraud and non-fraud cases are nearly equal.

- **Pros**:
  - Easy to understand and quick to compute.

- **Cons**:
  - **Misleading in imbalanced datasets**. For example, if 95 out of 100 samples are "Not Fraud", predicting all as "Not Fraud" gives 95% accuracy while catching **none of the fraud cases** (0% recall).
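
The imbalance pitfall above can be checked with a few lines. This is a sketch using the counts from the 95/5 example; the helper name `accuracy` is an assumption for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# 100 transactions, 5 of them fraud; the model predicts "Not Fraud" for all
always_negative = accuracy(tp=0, tn=95, fp=0, fn=5)
fraud_caught = 0 / (0 + 5)  # TP / (TP + FN): share of fraud cases detected

print(always_negative)  # 0.95
print(fraud_caught)     # 0.0
```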

---

### b. **Precision (Positive Predictive Value)**

- **Definition**:  
  Out of all predicted **positive cases**, how many were **actually positive**.

- **Formula**:  
  $$
  \text{Precision} = \frac{TP}{TP + FP}
  $$

- **When to Use**:  
  - When **false positives** are **costly or risky**.

- **Real-life Example**:  
  Email spam detection. Marking an important email as spam (FP) is worse than missing one spam email.

- **Pros**:
  - Tells us how **reliable** our positive predictions are.

- **Cons**:
  - Doesn’t consider false negatives (what we missed).
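
A small sketch of the precision formula, with made-up spam-filter counts. Returning `0.0` when nothing was predicted positive is an assumed convention here; precision is strictly undefined in that case.

```python
def precision(tp, fp):
    """Share of positive predictions that were actually positive."""
    predicted_positive = tp + fp
    # Assumption: report 0.0 when the model made no positive predictions
    return tp / predicted_positive if predicted_positive else 0.0


# Hypothetical filter: 50 emails flagged as spam, 45 of them really spam
spam_precision = precision(tp=45, fp=5)
print(spam_precision)  # 0.9
```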

---

### c. **Recall (Sensitivity or True Positive Rate)**

- **Definition**:  
  Out of all **actual positive cases**, how many were **correctly identified** by the model.

- **Formula**:  
  $$
  \text{Recall} = \frac{TP}{TP + FN}
  $$

- **When to Use**:  
  - When **missing positive cases** is **dangerous or expensive**.

- **Real-life Example**:  
  Cancer detection. Missing a cancer case (FN) could be life-threatening.

- **Pros**:
  - Highlights the model’s **completeness** in detecting positives.

- **Cons**:
  - Doesn’t care about false positives.
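
To illustrate why recall alone can be gamed, the sketch below lowers the decision threshold on some made-up scores: recall rises, but so does the number of false positives that recall never sees. The helper name and data are assumptions for illustration.

```python
def recall_and_false_positives(y_true, scores, threshold):
    """Return (recall, number of false positives) at a decision threshold."""
    tp = fn = fp = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return recall, fp


# Toy labels and model scores, made up for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.2, 0.6, 0.3, 0.1, 0.05]

strict = recall_and_false_positives(y_true, scores, threshold=0.5)
loose = recall_and_false_positives(y_true, scores, threshold=0.15)
print(strict)  # (0.5, 1)
print(loose)   # (1.0, 2)
```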

---

### d. **Specificity (True Negative Rate)**

- **Definition**:  
  Out of all **actual negative cases**, how many were **correctly identified** as negative.

- **Formula**:  
  $$
  \text{Specificity} = \frac{TN}{TN + FP}
  $$

- **When to Use**:  
  - When it’s critical to **avoid false positives**.

- **Real-life Example**:  
  Medical tests: You don’t want to falsely diagnose healthy people as sick.

- **Pros**:
  - Complements Recall (which focuses on positives).

- **Cons**:
  - Says nothing about missed positives (false negatives), and F1 does not reflect it because F1 ignores true negatives.
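
A brief sketch of the specificity formula with invented screening counts. It also shows the relationship to the false positive rate, which is simply `1 - specificity`.

```python
def specificity(tn, fp):
    """Share of actual negatives that were correctly predicted negative."""
    actual_negative = tn + fp
    return tn / actual_negative if actual_negative else 0.0


# Hypothetical screening: 1000 healthy people tested, 30 wrongly flagged
spec = specificity(tn=970, fp=30)
false_positive_rate = 1 - spec

print(round(spec, 2))                 # 0.97
print(round(false_positive_rate, 2))  # 0.03
```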

---

### e. **F1-Score**

- **Definition**:  
  **Harmonic mean** of Precision and Recall. It balances both, giving a **single metric**.

- **Formula**:  
  $$
  F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}
  $$

- **When to Use**:  
  - When you need a balance between Precision and Recall.
  - Especially useful in **imbalanced classification problems**.

- **Real-life Example**:  
  In fraud detection, you want to **detect frauds (recall)** and also make sure the detected frauds are **truly fraud (precision)**.

- **Pros**:
  - Balances two important metrics into one score.

- **Cons**:
  - Not intuitive like Accuracy, and harder to explain.
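
One way to build intuition for the harmonic mean is to compare it with a plain average when precision and recall disagree strongly. The values below are made up, and `f1_score` here is a local helper, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


p, r = 0.9, 0.1  # very precise model that misses most positives
arithmetic = (p + r) / 2
harmonic = f1_score(p, r)

print(round(arithmetic, 2))  # 0.5
print(round(harmonic, 2))    # 0.18
```

The harmonic mean stays close to the weaker of the two values, so a model cannot earn a high F1 by excelling at only one of them.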

---

### f. **F<sub>β</sub>-Score**

- **Definition**:  
  Generalized form of F1-score that allows **weighting Recall more than Precision**, or vice versa.

- **Formula**:  
  $$
  F_{\beta} = (1 + \beta^2) \cdot \frac{Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}
  $$

- **Interpretation**:
  - **β = 1** → Equal weight to Precision and Recall (F1).
  - **β > 1** → More weight to **Recall**.
  - **β < 1** → More weight to **Precision**.

- **Real-life Examples**:
  - **β = 2** (Recall Focus): Disease detection, where **missing** cases is worse.
  - **β = 0.5** (Precision Focus): Spam filters, where **wrongly tagging** a mail as spam is worse.

- **Pros**:
  - Flexible metric depending on problem sensitivity.

- **Cons**:
  - Needs domain knowledge to choose the right β.
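
The effect of β can be seen by scoring the same precision and recall with three different values. This is a sketch with invented numbers; `f_beta` is a local helper that follows the formula above.

```python
def f_beta(precision, recall, beta):
    """F-beta score; beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


p, r = 0.5, 0.8  # hypothetical model with better recall than precision
f_half = f_beta(p, r, beta=0.5)
f_one = f_beta(p, r, beta=1)
f_two = f_beta(p, r, beta=2)

print(round(f_half, 3), round(f_one, 3), round(f_two, 3))  # 0.541 0.615 0.714
```

Because recall is the stronger of the two here, the score rises as β shifts weight toward recall.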

---

### g. **Confusion Matrix**

|                             | Predicted Positive | Predicted Negative |
|-----------------------------|--------------------|--------------------|
| **Actual Positive**         | TP (True Positive) | FN (False Negative) |
| **Actual Negative**         | FP (False Positive) | TN (True Negative)  |

- **Definition**:  
  For binary classification, a **2×2 matrix** that breaks predictions down into **correct** (TP, TN) and **incorrect** (FP, FN) outcomes.

- **When to Use**:  
  - Always useful for **diagnosing model errors**.

- **Real-life Example**:  
  Helps understand whether a model **missed actual cases** or **flagged incorrect ones**.

- **Pros**:
  - Provides a **complete breakdown** of model performance.

- **Cons**:
  - Not a single value; needs further analysis for summary.

---

### h. **Balanced Accuracy**

- **Definition**:  
  Average of **Recall** (True Positive Rate) and **Specificity** (True Negative Rate).

- **Formula**:  
  $$
  \text{Balanced Accuracy} = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)
  $$

- **When to Use**:  
  - For **imbalanced datasets** where one class dominates.

- **Pros**:
  - Fairly evaluates both classes.
  
- **Cons**:
  - Less widely reported than F1 or Accuracy, so readers may be unfamiliar with it.
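
As a rough check, the sketch below scores an "always predict negative" model on a 95/5 split. Plain accuracy would be 0.95, while balanced accuracy drops to 0.5. The helper name is an assumption for illustration.

```python
def balanced_accuracy(tp, tn, fp, fn):
    """Mean of recall (TPR) and specificity (TNR)."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (recall + specificity) / 2


# 95 negatives, 5 positives; the model predicts negative for everything
score = balanced_accuracy(tp=0, tn=95, fp=0, fn=5)
print(score)  # 0.5
```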

---

### i. **Matthews Correlation Coefficient (MCC)**

- **Definition**:  
  Measures the **correlation** between actual and predicted classifications.

- **Formula**:  
  $$
  MCC = \frac{(TP \cdot TN) - (FP \cdot FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
  $$

- **Score Range**:
  - +1: Perfect prediction
  - 0: No better than random prediction
  - –1: Complete disagreement

- **When to Use**:  
  - Especially useful for **imbalanced datasets**.

- **Pros**:
  - Takes all confusion matrix values into account.
  - Considered **one of the most reliable binary metrics**.

- **Cons**:
  - Slightly complex and not very intuitive.
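
A minimal sketch of the MCC formula on three made-up confusion matrices. When the denominator is zero (for example, the model never predicts one of the classes), the formula is undefined; returning `0.0` in that case is a common convention and an assumption of this sketch.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention for the undefined case
    return (tp * tn - fp * fn) / denominator


perfect = mcc(tp=5, tn=95, fp=0, fn=0)
always_negative = mcc(tp=0, tn=95, fp=0, fn=5)
inverted = mcc(tp=0, tn=0, fp=95, fn=5)

print(perfect, always_negative, inverted)  # 1.0 0.0 -1.0
```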

---

### j. **ROC Curve & AUC (Receiver Operating Characteristic – Area Under Curve)**

- **Definition**:  
  ROC is a curve of **True Positive Rate (Recall)** vs. **False Positive Rate** at various thresholds.  
  **AUC** is the **area under that curve**.

- **Formula**:  
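  No closed-form formula in terms of a single confusion matrix. AUC is the integral of TPR over FPR, $ \int_0^1 TPR \, d(FPR) $, usually approximated numerically (e.g., with the trapezoidal rule).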

- **Interpretation**:
  - **AUC = 1**: Perfect model
  - **AUC = 0.5**: No better than random guessing

- **When to Use**:  
  - When you want to assess the model’s **ranking ability**, especially with **probabilistic outputs**.

- **Real-life Example**:  
  Credit scoring, disease risk scores.

- **Pros**:
  - Threshold-independent.
  
- **Cons**:
  - Can be **misleading with highly imbalanced data**.
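
ROC-AUC has an equivalent ranking interpretation: it is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as half. The sketch below computes it that way on made-up scores. The pairwise loop is chosen for clarity and is quadratic in the number of samples, so it is not how one would compute AUC at scale.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # pair ranked correctly
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy labels and scores, made up for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.2, 0.6, 0.3, 0.1, 0.05]
auc = roc_auc(y_true, scores)
print(auc)  # 0.8125
```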


---

### k. **PR Curve & PR AUC (Precision-Recall Curve – Area Under Curve)**

- **Definition**:  
  The PR curve is a plot of **Precision vs Recall** for various thresholds, and **PR AUC** is the area under this curve.

- **Formula**:  
  No single formula, but AUC is computed via integration.

- **Interpretation**:
  - **PR AUC = 1**: Perfect model
  - **PR AUC ≈ positive-class prevalence**: No better than random guessing (e.g., about 0.05 when 5% of samples are positive)

- **When to Use**:  
  - Especially useful in **imbalanced datasets** where the positive class is of more interest.

- **Real-life Example**:  
  Medical testing (e.g., detecting a rare disease), fraud detection.

- **Pros**:
  - Focuses on the performance of the model for the positive class.
  
- **Cons**:
  - Can be difficult to interpret without visualizing the full curve.
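
One common way to summarise the PR curve in a single number is average precision: the mean of the precision values measured at the rank of each actual positive. The sketch below implements that summary on made-up scores. It assumes there are no tied scores, and it is a step-wise approximation of the area rather than a trapezoidal integral.

```python
def average_precision(y_true, scores):
    """Average precision: mean precision at the rank of each true positive.

    Assumption: scores contain no ties (ties would need to be grouped).
    """
    total_positives = sum(1 for y in y_true if y == 1)
    if total_positives == 0:
        raise ValueError("average precision needs at least one positive")
    # Rank samples from highest to lowest score
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precision_sum += true_positives / rank  # precision at this cut-off
    return precision_sum / total_positives


# Toy labels and scores, made up for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.2, 0.6, 0.3, 0.1, 0.05]
ap = average_precision(y_true, scores)
print(round(ap, 4))  # 0.8542
```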

---

### l. **Log Loss (Logarithmic Loss or Cross-Entropy Loss)**

- **Definition**:  
  A loss function that measures how far the model’s predicted probabilities are from the actual labels. The more confident the model is in a wrong prediction, the larger the penalty.

- **Formula**:  
  $$ 
  \text{Log Loss} = - \frac{1}{n} \sum_{i=1}^{n} [y_i \log(p_i) + (1 - y_i) \log(1 - p_i)]
  $$  
  where:
  - $ y_i $ = Actual class (0 or 1)
  - $ p_i $ = Predicted probability for the positive class

- **When to Use**:  
  - For **probabilistic classifiers** (e.g., logistic regression, neural networks).

- **Real-life Example**:  
  Binary classification problems with probabilistic outputs, such as predicting whether an email is spam or not, where predictions are given as probabilities.

- **Pros**:
  - Provides a **continuous** measure of performance, not just binary outputs.
  - Penalizes models that are confident but wrong, helping improve the quality of predictions.

- **Cons**:
  - Sensitive to outliers: the loss is unbounded, so a single very confident wrong prediction can dominate the average.
  - Not as interpretable as other metrics like Accuracy or F1-Score.
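
The penalty for confident mistakes can be seen in a short sketch of the binary log loss formula. Clipping probabilities with a small `eps` to avoid `log(0)` is a common practical safeguard and an assumption of this sketch; the data is made up.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Binary cross-entropy averaged over samples."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # assumed safeguard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


y_true = [1, 0, 1, 0]
# Mildly confident and right on every sample
cautious = log_loss(y_true, [0.6, 0.4, 0.6, 0.4])
# Very confident everywhere, but badly wrong on the third sample
overconfident = log_loss(y_true, [0.99, 0.01, 0.01, 0.01])

print(round(cautious, 3))       # 0.511
print(round(overconfident, 3))  # 1.159
```

A single confident error outweighs three near-perfect predictions, which is the behaviour described under Pros and Cons above.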

---