# Understanding Accuracy, Precision & Recall

Evaluating a classification model means going beyond accuracy — each metric tells a **different story** about model performance.  
Here’s a clear breakdown of **when to use each** and **what they mean** in real-world settings.

---

## Accuracy

**Definition:**  
$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

**Meaning:**  
Measures the overall correctness — how often the model is right.

**Use When:**
- Classes are **balanced**.
- All types of mistakes are **equally costly**.

**Avoid When:**
- Dataset is **imbalanced** (e.g., 99% negatives).

**Examples:**
- Handwritten digit recognition (MNIST)  
- Image classification with balanced classes  
- Language model next-word prediction  

⚠️ **Example of misleading accuracy:**  
If 99% of samples are negative, predicting “negative” always gives 99% accuracy — even if the model never detects a true positive.
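
The effect is easy to reproduce. The following is a minimal sketch in plain Python, assuming a made-up dataset of 990 negatives and 10 positives and a "model" that always predicts the negative class; the data and variable names are illustrative only.

```python
# Hypothetical data: 990 negatives and 10 positives (1% positive rate).
y_true = [0] * 990 + [1] * 10

# A degenerate "model" that always predicts the negative class.
y_pred = [0] * len(y_true)

# Count the four confusion-matrix cells by hand.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0  # guard against no positives

print(f"accuracy = {accuracy:.2f}")  # 0.99
print(f"recall   = {recall:.2f}")    # 0.00
```

Accuracy looks excellent while recall shows that not a single positive was found.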

---

## Precision

**Definition:**  
$$
\text{Precision} = \frac{TP}{TP + FP}
$$

**Meaning:**  
When the model predicts **positive**, how often is it actually correct?  
It measures **exactness** and **trustworthiness** of positive predictions.

**Use When:**
- **False Positives (FP)** are costly.
- You need high confidence before labeling something as positive.

**Avoid When:**
- You can tolerate some false alarms but missing positives is worse.

**Examples:**
- Spam filter (don’t mark real emails as spam)  
- Credit card fraud detection (avoid wrongly blocking real transactions)  
- Confirmatory cancer diagnosis before treatment (avoid false diagnoses)  

💡 *High Precision → “When I say something is positive, I’m almost always right.”*

---

## Recall

**Definition:**  
$$
\text{Recall} = \frac{TP}{TP + FN}
$$

**Meaning:**  
Out of all **actual positives**, how many did the model catch?  
It measures **sensitivity** or **completeness**.

**Use When:**
- **False Negatives (FN)** are costly.
- You want to detect **all** positives, even if you get some false alarms.

**Avoid When:**
- You want to be very selective or confident before predicting positives.

**Examples:**
- Disease detection (don’t miss sick patients)  
- Security / anomaly detection (better to over-alert than miss an attack)  
- Search engines (retrieve all relevant documents)  

*High Recall → “I catch most of the true positives, even if I sometimes make mistakes.”*

---

## Precision–Recall Trade-Off

- Increasing **recall** (catching more positives) usually lowers **precision** (more false alarms).  
- Increasing **precision** (being more selective) usually lowers **recall** (misses some positives).
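
The trade-off can be seen by sweeping a decision threshold over scores from two overlapping classes. This is a small sketch on invented scores and labels; `precision_recall` is a hypothetical helper written for this illustration, not a library function.

```python
# Hypothetical scores and labels, chosen so the two classes overlap.
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.50, 0.60, 0.70, 0.75, 0.90]


def precision_recall(scores, labels, thresh):
    """Return (precision, recall) when predicting positive for score >= thresh."""
    preds = [1 if s >= thresh else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


results = []
for thresh in (0.30, 0.55, 0.80):
    p, r = precision_recall(scores, labels, thresh)
    results.append((thresh, p, r))
    print(f"threshold={thresh:.2f}  precision={p:.2f}  recall={r:.2f}")

# threshold=0.30  precision=0.50  recall=1.00
# threshold=0.55  precision=0.75  recall=0.75
# threshold=0.80  precision=1.00  recall=0.25
```

Raising the threshold makes the model more selective: precision climbs from 0.50 to 1.00 while recall falls from 1.00 to 0.25.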

To balance both, we often use the **F1 Score**:

$$
\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$
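
Because F1 is a harmonic mean, it stays low unless both precision and recall are reasonably high. A short sketch of the formula, with `f1` as a hypothetical helper and the input values chosen only for illustration:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1(0.8, 0.8)            # both metrics equal
lopsided = f1(1.0, 0.1)            # perfect precision, poor recall
arithmetic_mean = (1.0 + 0.1) / 2  # for comparison with the lopsided case

print(f"F1 (0.8, 0.8)              = {balanced:.3f}")         # 0.800
print(f"F1 (1.0, 0.1)              = {lopsided:.3f}")         # 0.182
print(f"arithmetic mean (1.0, 0.1) = {arithmetic_mean:.3f}")  # 0.550
```

A plain average would report 0.55 for the lopsided case, whereas F1 drops to about 0.18, much closer to the weaker of the two metrics.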

---

## Summary Table

| Metric | Measures | Best Used When | Avoid When | Example |
|:--|:--|:--|:--|:--|
| **Accuracy** | Overall correctness | Classes are balanced | Dataset is imbalanced | Digit recognition |
| **Precision** | Reliability of positive predictions | False positives are costly | Missing positives is costly | Spam / Fraud detection |
| **Recall** | Ability to detect all positives | False negatives are costly | You need selectivity | Disease / Security detection |

---

**Key takeaway:**  
- Use **Accuracy** when your dataset is balanced.  
- Use **Precision** when you must be *right* about positives.  
- Use **Recall** when you must *catch* all positives.  
- Use **F1 Score** when you want a **balanced view** between the two.


In [1]:
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# True labels (ground truth)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])

# Predicted labels (model outputs)
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 0])


We define two arrays:
- `y_true`: actual class labels (0 = negative, 1 = positive)  
- `y_pred`: model predictions

Next, we’ll build a **confusion matrix** to see TP, TN, FP, FN counts.


In [2]:
cm = confusion_matrix(y_true, y_pred)
cm


array([[4, 1],
       [1, 4]])

The **confusion matrix** shows:

|               | Predicted 0 | Predicted 1 |
|---------------|-------------|-------------|
| **Actual 0**  | TN          | FP          |
| **Actual 1**  | FN          | TP          |

From this matrix we can compute metrics:

- **TP** = true positives → correct positive predictions  
- **TN** = true negatives → correct negative predictions  
- **FP** = false positives → wrongly predicted as positive  
- **FN** = false negatives → missed positives
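
To make the four cells concrete, the same counts can be tallied by hand. This is a plain-Python sketch that reuses the label values from the cells above as ordinary lists; the `counts` dictionary and `matrix` layout are just one illustrative way to arrange them.

```python
# Same labels as above, as plain Python lists.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
for actual, predicted in zip(y_true, y_pred):
    if actual == 1 and predicted == 1:
        counts["TP"] += 1  # correct positive prediction
    elif actual == 0 and predicted == 0:
        counts["TN"] += 1  # correct negative prediction
    elif actual == 0 and predicted == 1:
        counts["FP"] += 1  # wrongly predicted as positive
    else:
        counts["FN"] += 1  # missed positive

# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
matrix = [[counts["TN"], counts["FP"]],
          [counts["FN"], counts["TP"]]]
print(matrix)  # [[4, 1], [1, 4]]
```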


In [3]:
TN, FP, FN, TP = cm.ravel()

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)

accuracy, precision, recall


(np.float64(0.8), np.float64(0.8), np.float64(0.8))

We compute each metric manually:

- **Accuracy** = (TP + TN) / Total  
- **Precision** = TP / (TP + FP)  
- **Recall** = TP / (TP + FN)

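Here all three come out to 0.8 because the model makes exactly one false positive and one false negative (TP = TN = 4, FP = FN = 1), so this example does not yet separate the metrics from one another.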


In [4]:
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))


Accuracy : 0.8
Precision: 0.8
Recall   : 0.8


In [None]:
import torch

In [6]:
from torchmetrics.classification import BinaryAccuracy, BinaryPrecision, BinaryRecall


# Dummy binary classification outputs
y_true = torch.tensor([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = torch.tensor([1, 0, 1, 0, 0, 1, 0, 1, 1, 0])


acc = BinaryAccuracy()
prec = BinaryPrecision()
rec = BinaryRecall()

print("TorchMetrics Accuracy :", acc(y_pred, y_true).item())
print("TorchMetrics Precision:", prec(y_pred, y_true).item())
print("TorchMetrics Recall   :", rec(y_pred, y_true).item())

TorchMetrics Accuracy : 0.800000011920929
TorchMetrics Precision: 0.800000011920929
TorchMetrics Recall   : 0.800000011920929


In [7]:
# Example: medical screening for a rare disease (positives are rare)
torch.manual_seed(0)

# Create imbalanced dummy data (10% positives)
n = 100
y_true = (torch.rand(n) < 0.10).long() # 1 = has disease, 0 = healthy (rare positives)

# positives score in [0.65, 1.00) and negatives in [0.00, 0.35), so the two classes do not overlap
y_scores = 0.65 * y_true + 0.35 * torch.rand(n)

# Helper to compute confusion & metrics for a given threshold 
def metrics_at_threshold(scores, labels, thresh=0.50):
    y_pred = (scores >= thresh).long()

    TP = ((labels == 1) & (y_pred == 1)).sum().item()
    TN = ((labels == 0) & (y_pred == 0)).sum().item()
    FP = ((labels == 0) & (y_pred == 1)).sum().item()
    FN = ((labels == 1) & (y_pred == 0)).sum().item()

    total = TP + TN + FP + FN
    acc = (TP + TN) / total if total else 0.0
    prec = TP / (TP + FP) if (TP + FP) else 0.0
    rec = TP / (TP + FN) if (TP + FN) else 0.0

    return {
        "threshold": thresh,
        "TP": TP, "FP": FP, "FN": FN, "TN": TN,
        "accuracy": acc, "precision": prec, "recall": rec
    }

# Compare two thresholds: conservative vs. sensitive screening
results = [
    metrics_at_threshold(y_scores, y_true, thresh=0.50),  # conservative: sits in the gap between the classes, so no false alarms
    metrics_at_threshold(y_scores, y_true, thresh=0.30),  # sensitive: also flags negatives scoring 0.30 to 0.35 (lower precision, same recall)
]


print(f"Positives in data: {y_true.sum().item()} / {n}")
for r in results:
    print(f"\nThreshold = {r['threshold']:.2f}")
    print(f"Confusion Matrix -> TP:{r['TP']}  FP:{r['FP']}  FN:{r['FN']}  TN:{r['TN']}")
    print(f"Accuracy : {r['accuracy']:.3f}")
    print(f"Precision: {r['precision']:.3f}")
    print(f"Recall   : {r['recall']:.3f}")


Positives in data: 9 / 100

Threshold = 0.50
Confusion Matrix -> TP:9  FP:0  FN:0  TN:91
Accuracy : 1.000
Precision: 1.000
Recall   : 1.000

Threshold = 0.30
Confusion Matrix -> TP:9  FP:9  FN:0  TN:82
Accuracy : 0.910
Precision: 0.500
Recall   : 1.000
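
As a follow-up, F1 can be computed from the counts printed above to see how it reacts compared with accuracy. This is a sketch only: the counts are copied by hand from that output, and `f1_from_counts` is a hypothetical helper using the equivalent count form F1 = 2TP / (2TP + FP + FN).

```python
# Counts copied from the printed output for each threshold.
runs = {
    0.50: {"TP": 9, "FP": 0, "FN": 0, "TN": 91},
    0.30: {"TP": 9, "FP": 9, "FN": 0, "TN": 82},
}


def f1_from_counts(tp, fp, fn):
    """F1 written directly in counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


summary = {}
for thresh, c in runs.items():
    accuracy = (c["TP"] + c["TN"]) / sum(c.values())
    f1 = f1_from_counts(c["TP"], c["FP"], c["FN"])
    summary[thresh] = (accuracy, f1)
    print(f"threshold={thresh:.2f}  accuracy={accuracy:.3f}  F1={f1:.3f}")

# threshold=0.50  accuracy=1.000  F1=1.000
# threshold=0.30  accuracy=0.910  F1=0.667
```

With only 9 positives in 100 samples, nine false alarms cost accuracy just 9 points, but F1 falls to about 0.667 because it ignores the large pool of true negatives.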
