# F1 Score, ROC, and AUC (Complete Overview)

---

## Confusion Matrix & Core Metrics

Let \(TP, FP, TN, FN\) be the counts for a binary classifier.

- **Precision:**  
  $$ \text{Precision} = \frac{TP}{TP + FP} $$
- **Recall / True Positive Rate (TPR):**  
  $$ \text{Recall} = \text{TPR} = \frac{TP}{TP + FN} $$
- **False Positive Rate (FPR):**  
  $$ \text{FPR} = \frac{FP}{FP + TN} $$
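
As a quick sanity check of the three definitions above, here is a minimal sketch that evaluates them on made-up counts. The function name `core_metrics` and the example counts are illustrative assumptions, and returning 0.0 for an undefined ratio is just one possible convention.

```python
def core_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, recall (TPR), and FPR from confusion-matrix counts."""
    # A ratio is undefined when its denominator is zero; returning 0.0
    # in that case is a convention chosen for this sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "fpr": fpr}

# Hypothetical counts: 8 TP, 2 FP, 85 TN, 5 FN
metrics = core_metrics(tp=8, fp=2, tn=85, fn=5)
print(metrics)  # precision = 8/10, recall = 8/13, fpr = 2/87
```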

---

## 1. F1 Score — Balancing Precision and Recall

### In Simple Words:
The F1 score tells you **how good your model is at finding positives** (like detecting spam emails, diseases, or fraud) **without raising too many false alarms**.  
It combines both **precision** (how many predicted positives are correct) and **recall** (how many real positives you caught).

Think of:
- **Precision** → “When I say this is spam, how often am I right?”  
- **Recall** → “Of all the real spam, how many did I catch?”  
- **F1** → “How well am I balancing the two?”

It’s the **harmonic mean** of precision and recall — meaning if either one is low, F1 will also drop.

### Formula:
$$
F_1 = 2\cdot\frac{\text{Precision}\cdot \text{Recall}}
{\text{Precision} + \text{Recall}}
= \frac{2TP}{2TP + FP + FN}.
$$

More generally,
$$
F_\beta = (1+\beta^2)\cdot \frac{\text{Precision}\cdot \text{Recall}}
{\beta^2\cdot \text{Precision} + \text{Recall}},
$$

where \(\beta>1\) emphasizes **recall** and \(\beta<1\) emphasizes **precision**.
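
To see the effect of \(\beta\) numerically, the sketch below evaluates the formula at a fixed precision and recall. The helper `f_beta` and the example values are assumptions made for illustration. Because recall is the higher of the two here, the recall-weighted score is the largest.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta from precision and recall; returns 0.0 when both are zero."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom > 0 else 0.0

# Hypothetical model with precision 0.5 and recall 0.8
# (for example TP=8, FP=8, FN=2)
p_demo, r_demo = 0.5, 0.8
f1_demo = f_beta(p_demo, r_demo, beta=1.0)   # equals 2*TP / (2*TP + FP + FN) = 16/26
f2_demo = f_beta(p_demo, r_demo, beta=2.0)   # weights recall more heavily
f05_demo = f_beta(p_demo, r_demo, beta=0.5)  # weights precision more heavily

print(f"F1={f1_demo:.3f}  F2={f2_demo:.3f}  F0.5={f05_demo:.3f}")
# F1=0.615  F2=0.714  F0.5=0.541
```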

> **Note:** F1 is computed *after* converting probabilities to **hard labels** using a **threshold**. The common default of 0.5 is often not the best choice.  
> You can tune this threshold on validation data to get the best trade-off.

### Applications:
- **Medical tests:** Catch diseases (high recall) without too many false alarms.  
- **Spam or fraud detection:** Deal with imbalanced data; balance recall vs. false positives.  
- **Search systems / recommendation:** Ensure retrieved items are both relevant and complete.

---

## 2. ROC Curve — Visualizing Trade-Offs

### In Simple Words:
The **ROC curve** shows how your model’s performance changes when you move the decision **threshold** up and down.

- Lowering the threshold → **catch more positives** (higher recall / TPR)  
  but also → **more false alarms** (higher FPR).  
- Increasing the threshold → **fewer false alarms**,  
  but also → **miss more real positives.**
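
The trade-off described above can be checked on a handful of points. This is a small sketch with made-up labels and scores; the names `demo_labels`, `demo_scores`, and `tpr_fpr` are assumptions for illustration only.

```python
import numpy as np

# Hypothetical toy data: 4 positives and 4 negatives
demo_labels = np.array([1, 1, 1, 0, 1, 0, 0, 0])
demo_scores = np.array([0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1])

def tpr_fpr(labels, scores, threshold):
    """Return (TPR, FPR) when everything with score >= threshold is predicted positive."""
    pred_pos = scores >= threshold
    tp = np.sum(pred_pos & (labels == 1))
    fp = np.sum(pred_pos & (labels == 0))
    return float(tp / np.sum(labels == 1)), float(fp / np.sum(labels == 0))

# Lowering the threshold raises TPR and FPR together
for thr in (0.7, 0.5, 0.25):
    tpr, fpr = tpr_fpr(demo_labels, demo_scores, thr)
    print(f"threshold={thr:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
# threshold=0.70  TPR=0.50  FPR=0.00
# threshold=0.50  TPR=0.75  FPR=0.25
# threshold=0.25  TPR=1.00  FPR=0.50
```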

The ROC curve plots:
- **Y-axis:** True Positive Rate (TPR = Recall)
- **X-axis:** False Positive Rate (FPR)

This lets you **see how well your model separates the classes** (good vs. bad, positive vs. negative) **independent of any specific threshold**.

### Formula:
- **TPR:**  
  $$ \text{TPR} = \frac{TP}{TP+FN} $$
- **FPR:**  
  $$ \text{FPR} = \frac{FP}{FP+TN} $$

### When to Use ROC:
- To **compare models visually** across thresholds.  
- When you care about **ranking** ability, not just binary accuracy.  
- To check whether the model truly distinguishes between classes.

### Examples:
- Comparing two medical models to see which better separates healthy vs. sick patients.  
- Evaluating credit risk models: a ROC curve closer to the top-left corner (higher AUC) means better separation between safe and risky clients.

---

## 3. AUC — Area Under the ROC Curve

### In Simple Words:
AUC is a **single number summary** of the ROC curve.  
It represents the **probability that your model ranks a random positive higher than a random negative**.

- **AUC ≈ 1.0** → near-perfect model  
- **AUC ≈ 0.5** → random guessing  
- **AUC < 0.5** → model is worse than random (reversed ordering)
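
The ranking interpretation can be verified by brute force on a tiny example: compare every positive with every negative and count how often the positive scores higher. The sketch below uses made-up data, and `auc_bruteforce` is a hypothetical helper name. It is quadratic in the number of samples, so it is meant for illustration rather than large datasets.

```python
import numpy as np

def auc_bruteforce(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.
    Tied pairs count as half a win."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]          # all pairwise score differences
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)

# Hypothetical toy data: 4 positives and 4 negatives
pair_labels = np.array([1, 1, 1, 0, 1, 0, 0, 0])
pair_scores = np.array([0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1])

auc_good = auc_bruteforce(pair_labels, pair_scores)       # 15 of 16 pairs ordered correctly
auc_flipped = auc_bruteforce(pair_labels, -pair_scores)   # reversed ordering
print(auc_good, auc_flipped)  # 0.9375 0.0625
```

Negating the scores reverses the ordering, which is why the second value is exactly one minus the first.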

### Formula:
$$
\text{AUC} = \int_0^1 \text{TPR}(\text{FPR})\, d(\text{FPR})
$$
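
In practice the integral is approximated from a finite set of ROC points. As a minimal sketch, the block below applies the trapezoidal rule by hand to a small, made-up set of `(FPR, TPR)` points; the arrays `roc_fpr` and `roc_tpr` are assumptions for illustration.

```python
import numpy as np

# Hypothetical ROC points, ordered by increasing FPR, from (0, 0) to (1, 1)
roc_fpr = np.array([0.0, 0.0, 0.0, 0.0, 0.25, 0.25, 0.5, 0.75, 1.0])
roc_tpr = np.array([0.0, 0.25, 0.5, 0.75, 0.75, 1.0, 1.0, 1.0, 1.0])

# Trapezoidal rule: segment width times the average height of its two endpoints
widths = np.diff(roc_fpr)
avg_heights = (roc_tpr[1:] + roc_tpr[:-1]) / 2
auc_area = float(np.sum(widths * avg_heights))
print(auc_area)  # 0.9375
```

Vertical segments (where FPR does not change) have zero width and contribute nothing to the area.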

### Applications:
- **Ranking tasks:** Fraud detection, credit scoring, medical triage.  
- **When you want one metric** to summarize how well the model separates classes.  
- Useful for **imbalanced datasets** where accuracy can be misleading, although ROC/AUC can still look optimistic when positives are very rare.

---

## ROC vs. PR Curve (When Positives Are Rare)
- **ROC curves** can look overly optimistic if positives are very rare: FPR divides by the large number of negatives, so even many false positives barely move the curve.  
- In such cases, prefer **Precision–Recall (PR) curves** and **AUPRC**, which focus on positive detection quality.
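
The gap between the two views can be large. The sketch below builds a made-up ranking with 2 positives among 100 items and compares AUROC with average precision (AP), a common step-wise estimate of the area under the PR curve. Both helpers are illustrative assumptions, and `average_precision` assumes distinct scores.

```python
import numpy as np

def auc_pairwise(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def average_precision(labels, scores):
    """Mean of the precision values at each rank where a positive is retrieved.
    Assumes distinct scores (no tie handling)."""
    order = np.argsort(-scores, kind="stable")
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

# Hypothetical data: 2 positives among 100 items, ranked 3rd and 6th
rare_scores = np.linspace(1.0, 0.0, 100)
rare_labels = np.zeros(100, dtype=int)
rare_labels[[2, 5]] = 1

rare_auc = auc_pairwise(rare_labels, rare_scores)
rare_ap = average_precision(rare_labels, rare_scores)
print(f"AUROC = {rare_auc:.3f}")  # AUROC = 0.969
print(f"AP    = {rare_ap:.3f}")   # AP    = 0.333
```

The AUROC looks excellent, yet flagging enough items to catch both positives means only one in three alerts is correct.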

---

## Multiclass Extensions
- **F1:** compute per class (one-vs-rest), then average:  
  - **Macro** (unweighted mean)  
  - **Weighted** (weighted by class frequency)  
  - **Micro** (global TP/FP/FN)  
- **ROC/AUC:** one-vs-rest per class, then macro/micro average.
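
The three F1 averages can give noticeably different numbers on the same predictions. Here is a minimal NumPy sketch on made-up labels; `multiclass_f1` is a hypothetical helper, and per-class F1 is set to 0.0 when a class has no true or predicted samples, which is one possible convention.

```python
import numpy as np

def multiclass_f1(y_true, y_pred, n_classes):
    """Per-class (one-vs-rest) F1 plus macro, weighted, and micro averages."""
    tp = np.zeros(n_classes)
    fp = np.zeros(n_classes)
    fn = np.zeros(n_classes)
    for c in range(n_classes):
        tp[c] = np.sum((y_pred == c) & (y_true == c))
        fp[c] = np.sum((y_pred == c) & (y_true != c))
        fn[c] = np.sum((y_pred != c) & (y_true == c))

    denom = 2 * tp + fp + fn
    per_class = np.divide(2 * tp, denom, out=np.zeros(n_classes), where=denom > 0)

    support = tp + fn                                   # true samples per class
    macro = float(per_class.mean())                     # unweighted mean
    weighted = float(np.sum(per_class * support) / support.sum())
    micro = float(2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum()))  # global counts
    return per_class, macro, weighted, micro

# Hypothetical 3-class example with class frequencies 6 / 3 / 1
mc_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
mc_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2])

per_class, macro, weighted, micro = multiclass_f1(mc_true, mc_pred, n_classes=3)
print(f"macro={macro:.3f}  weighted={weighted:.3f}  micro={micro:.3f}")
# macro=0.747  weighted=0.812  micro=0.800
```

Macro treats the rare class the same as the frequent one, so it is the lowest of the three here. In single-label multiclass problems, micro F1 equals plain accuracy (8 of 10 correct in this example).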

---

## Practical Use Cases
| Problem Type | Goal | Best Metric |
|---------------|------|--------------|
| **Medical screening / safety** | Catch all positives (high recall) | $F_1$, $F_\beta$ with $\beta>1$ |
| **Fraud / spam detection** | Handle rare positives | PR Curve, AUPRC, F1 |
| **Ranking / triage systems** | Measure separability | ROC & AUC |
| **Alert systems (fixed capacity)** | Balance false alarms vs misses | Tune threshold, track F1 |

---

## Key Takeaways
1. Use **scores (not hard labels)** to compute **ROC** and **AUC**.  
2. **Threshold choice** heavily affects **F1** — always tune it on validation data.  
3. For **imbalanced** datasets, prioritize **PR/AUPRC** alongside F1.  
4. For **multiclass**, report **macro/micro/weighted** averages for clarity and fairness.  
5. Use **AUC** when you want a **threshold-free comparison**, and **F1** when you must output binary decisions.

---


### Create a Toy Dataset  
We generate a small synthetic dataset with `y_true` labels and model prediction `scores`.  
This dataset will be used for all our F1, ROC, and AUC calculations.


In [1]:
# Create a small reproducible toy dataset
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Roughly 30% positives
y_true = (rng.random(n) < 0.30).astype(int)

# Positives draw scores from Beta(4, 3), negatives from Beta(2, 5),
# so positives tend to score higher but the two classes overlap
scores = rng.beta(a=2 + 2 * y_true, b=5 - 2 * y_true, size=n)

print("y_true distribution:", np.bincount(y_true))
print("scores range:", float(scores.min()), "to", float(scores.max()))

y_true distribution: [140  60]
scores range: 0.03497046698601967 to 0.8789788976876499


### Compute F1 Score from Scratch  
We define helper functions to calculate `TP, FP, TN, FN` at a given threshold  
and then compute `Precision`, `Recall`, and `F1` manually using NumPy.


In [2]:
# Confusion counts, Precision, Recall, F1 at a given threshold

import numpy as np

def confusion_counts_from_threshold(y_true: np.ndarray, scores: np.ndarray, threshold: float):
    """
    Convert scores to hard labels using threshold, then compute TP, FP, TN, FN.
    """
    y_pred = (scores >= threshold).astype(int)
    TP = int(((y_pred == 1) & (y_true == 1)).sum())
    FP = int(((y_pred == 1) & (y_true == 0)).sum())
    TN = int(((y_pred == 0) & (y_true == 0)).sum())
    FN = int(((y_pred == 0) & (y_true == 1)).sum())
    return TP, FP, TN, FN

def precision_recall_f1_from_counts(TP: int, FP: int, TN: int, FN: int, eps: float = 1e-12):
    """
    Compute precision, recall, F1 from TP/FP/TN/FN. eps avoids division by zero.
    """
    precision = TP / (TP + FP + eps)
    recall    = TP / (TP + FN + eps)
    f1        = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1

# Demo at an arbitrary threshold (e.g., 0.5)
TP, FP, TN, FN = confusion_counts_from_threshold(y_true, scores, threshold=0.5)
P, R, F1 = precision_recall_f1_from_counts(TP, FP, TN, FN)

print(f"Threshold=0.50 -> TP={TP}, FP={FP}, TN={TN}, FN={FN}")
print(f"Precision={P:.3f}, Recall={R:.3f}, F1={F1:.3f}")


Threshold=0.50 -> TP=41, FP=19, TN=121, FN=19
Precision=0.683, Recall=0.683, F1=0.683


### Build ROC Curve from Scratch  
We sweep through all thresholds and compute pairs of `(FPR, TPR)` to draw the ROC curve.  
This shows how the model performance changes with different classification thresholds.


In [3]:
# ROC curve by sweeping thresholds

import numpy as np

def roc_curve_from_scores(y_true: np.ndarray, scores: np.ndarray):
    """
    Compute ROC curve points (FPR, TPR, thresholds) by sweeping unique score values.
    Returns arrays sorted by descending threshold.
    """
    # Sort by score descending for cumulative computations
    order = np.argsort(-scores)
    y_true_sorted = y_true[order]
    scores_sorted = scores[order]

    # Total positives/negatives
    P = (y_true == 1).sum()
    N = (y_true == 0).sum()

    # Sweep thresholds at each unique score value
    TPR_list = []
    FPR_list = []
    thr_list = []

    TP = FP = 0
    # Walk down the sorted scores, accumulating TP/FP. After the last sample of each
    # run of equal scores, record (FPR, TPR) for the threshold equal to that score
    # (everything with score >= threshold is predicted positive).
    for idx in range(len(scores_sorted)):
        # Predict positive for everything up to idx
        if y_true_sorted[idx] == 1:
            TP += 1
        else:
            FP += 1

        # If next score is different
        if (idx + 1 == len(scores_sorted)) or (scores_sorted[idx + 1] != scores_sorted[idx]):
            TPR = TP / (P if P > 0 else 1)
            FPR = FP / (N if N > 0 else 1)
            TPR_list.append(TPR)
            FPR_list.append(FPR)
            thr_list.append(scores_sorted[idx])

    # Add the (0,0) at threshold above max (predict none) and (1,1) at threshold below min (predict all)
    # Ensure proper starting/ending anchors for integration
    FPR_arr = np.array([0.0] + FPR_list + [1.0])
    TPR_arr = np.array([0.0] + TPR_list + [1.0])
    thr_arr = np.array([np.inf] + thr_list + [-np.inf])
    return FPR_arr, TPR_arr, thr_arr

FPR, TPR, THR = roc_curve_from_scores(y_true, scores)
print("ROC points:", len(FPR), "FPR[0..3]=", FPR[:3], "TPR[0..3]=", TPR[:3], "THR[0..3]=", THR[:3])


ROC points: 202 FPR[0..3]= [0. 0. 0.] TPR[0..3]= [0.         0.01666667 0.03333333] THR[0..3]= [       inf 0.8789789  0.85540335]
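
The loop-based sweep above can also be written with cumulative sums, which avoids the Python loop. The following is a sketch under the same conventions (descending thresholds, one point per distinct score, a leading `(0, 0)` anchor). The name `roc_curve_vectorized` and the toy arrays are assumptions for illustration, and the function expects a non-empty input. Unlike the loop version, it does not append a trailing `(1, 1)` anchor, because the last point already reaches `(1, 1)` when both classes are present.

```python
import numpy as np

def roc_curve_vectorized(labels, scores):
    """ROC points via cumulative sums; one point per distinct score value."""
    order = np.argsort(-scores, kind="stable")
    labels_sorted = labels[order]
    scores_sorted = scores[order]

    # Index of the last sample in each run of equal scores
    last_in_group = np.r_[np.nonzero(np.diff(scores_sorted))[0], len(scores_sorted) - 1]

    # Cumulative TP/FP counts when predicting positive down to each group
    tp = np.cumsum(labels_sorted == 1)[last_in_group]
    fp = np.cumsum(labels_sorted == 0)[last_in_group]

    n_pos = max(int(tp[-1]), 1)   # avoid division by zero if a class is missing
    n_neg = max(int(fp[-1]), 1)

    fpr = np.r_[0.0, fp / n_neg]
    tpr = np.r_[0.0, tp / n_pos]
    thresholds = np.r_[np.inf, scores_sorted[last_in_group]]
    return fpr, tpr, thresholds

# Hypothetical toy data with one tied score (0.5 appears for both classes)
vec_labels = np.array([1, 0, 1, 0])
vec_scores = np.array([0.9, 0.5, 0.5, 0.1])

vec_fpr, vec_tpr, vec_thr = roc_curve_vectorized(vec_labels, vec_scores)
print(vec_fpr)  # [0.  0.  0.5 1. ]
print(vec_tpr)  # [0.  0.5 1.  1. ]
```

The tied pair produces a single diagonal segment from `(0, 0.5)` to `(0.5, 1)`, which is how a tie ends up counting as half a correctly ordered pair in the area.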


### Compute AUC from Scratch  
We calculate the **Area Under the ROC Curve (AUC)** using two methods:  
1. **Trapezoidal rule (integration)**  
2. **Pairwise ranking probability** interpretation.


In [4]:
# AUC (AUROC) via trapezoidal rule + pairwise interpretation

import numpy as np

def auc_trapezoid(FPR: np.ndarray, TPR: np.ndarray):
    """
    Compute area under ROC via trapezoidal rule.
    Assumes FPR is monotonically increasing from 0 to 1.
    """
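    # Integrate TPR dFPR. np.trapz was renamed to np.trapezoid in NumPy 2.0,
    # so use whichever one is available.
    trapezoid = getattr(np, "trapezoid", None) or np.trapz
    return float(trapezoid(TPR, FPR))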

def auc_pairwise_probability(y_true: np.ndarray, scores: np.ndarray):
    """
    Pairwise AUC interpretation: probability that a random positive has a higher score
    than a random negative (ties count as 0.5).
    """
    pos_scores = scores[y_true == 1]
    neg_scores = scores[y_true == 0]
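    if len(pos_scores) == 0 or len(neg_scores) == 0:
        # AUC is undefined with a single class; 0.5 is returned as a neutral convention
        return 0.5

    # Efficient rank-based computation
    # Concatenate and rank; AUC = (sum of positive ranks - m*(m+1)/2) / (m*n)
    # where m=#pos, n=#neg; ranks are 1-based.
    all_scores = np.concatenate([pos_scores, neg_scores])
    # Tied scores share the average of the ranks they span, so a tied
    # positive/negative pair contributes 0.5 as the docstring promises.
    _, inverse, counts = np.unique(all_scores, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)               # highest 1-based rank in each tie group
    avg_rank = last_rank - (counts - 1) / 2.0   # mean rank of each tie group
    ranks = avg_rank[inverse]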
    m = len(pos_scores)
    n = len(neg_scores)
    sum_pos_ranks = ranks[:m].sum()
    auc = (sum_pos_ranks - m * (m + 1) / 2) / (m * n)
    return float(auc)

# Compute AUC both ways and compare
auc_trap = auc_trapezoid(FPR, TPR)
auc_pair = auc_pairwise_probability(y_true, scores)

print(f"AUC (trapezoid) = {auc_trap:.4f}")
print(f"AUC (pairwise ) = {auc_pair:.4f}")


AUC (trapezoid) = 0.8615
AUC (pairwise ) = 0.8615




### Find the Best F1 Threshold  
We test multiple thresholds and find which one gives the highest **F1 score**,  
helping us decide the best point for converting probabilities to binary labels.


In [5]:
# Find threshold that maximizes F1

import numpy as np

def best_threshold_for_f1(y_true: np.ndarray, scores: np.ndarray):
    """
    Search thresholds (unique score values + 0 and 1) to maximize F1.
    Returns (best_threshold, best_F1, precision_at_best, recall_at_best).
    """
    uniq = np.unique(scores)
    candidates = np.concatenate([[0.0], uniq, [1.0]])
    best = (-1.0, 0.0, 0.0, 0.0)

    for thr in candidates:
        TP, FP, TN, FN = confusion_counts_from_threshold(y_true, scores, thr)
        P, R, F1 = precision_recall_f1_from_counts(TP, FP, TN, FN)
        if F1 > best[1]:
            best = (thr, F1, P, R)
    return best

thr, f1, P, R = best_threshold_for_f1(y_true, scores)
print(f"Best F1={f1:.3f} at threshold={thr:.3f} (Precision={P:.3f}, Recall={R:.3f})")


Best F1=0.734 at threshold=0.418 (Precision=0.646, Recall=0.850)


### PyTorch Version of F1, ROC, and AUC  
We re-implement the same logic using **PyTorch tensors**,  
demonstrating how to compute F1, ROC, and AUC directly in a deep learning workflow.


In [6]:
# Pure torch implementation (mirrors the NumPy logic)

import torch

torch.manual_seed(0)

# Convert our NumPy arrays to torch tensors
y_true_t = torch.from_numpy(y_true).long()
scores_t = torch.from_numpy(scores).float()

def confusion_counts_torch(y_true_t: torch.Tensor, scores_t: torch.Tensor, threshold: float):
    """
    Torch version: TP, FP, TN, FN at a given threshold.
    """
    y_pred_t = (scores_t >= threshold).long()
    TP = int(((y_pred_t == 1) & (y_true_t == 1)).sum().item())
    FP = int(((y_pred_t == 1) & (y_true_t == 0)).sum().item())
    TN = int(((y_pred_t == 0) & (y_true_t == 0)).sum().item())
    FN = int(((y_pred_t == 0) & (y_true_t == 1)).sum().item())
    return TP, FP, TN, FN

def precision_recall_f1_torch(TP: int, FP: int, TN: int, FN: int, eps: float = 1e-12):
    TP = torch.tensor(TP, dtype=torch.float64)
    FP = torch.tensor(FP, dtype=torch.float64)
    TN = torch.tensor(TN, dtype=torch.float64)
    FN = torch.tensor(FN, dtype=torch.float64)

    precision = TP / (TP + FP + eps)
    recall    = TP / (TP + FN + eps)
    f1        = 2 * precision * recall / (precision + recall + eps)
    return float(precision), float(recall), float(f1)

def roc_curve_torch(y_true_t: torch.Tensor, scores_t: torch.Tensor):
    """
    Compute ROC (FPR, TPR, thresholds) using pure torch ops.
    Returns CPU numpy arrays for convenience.
    """
    # Sort by scores desc
    scores_sorted, order = torch.sort(scores_t, descending=True)
    y_sorted = y_true_t[order]

    P = (y_true_t == 1).sum().item()
    N = (y_true_t == 0).sum().item()
    if P == 0 or N == 0:
        # Degenerate case
        FPR = torch.tensor([0.0, 1.0])
        TPR = torch.tensor([0.0, 1.0])
        THR = torch.tensor([float('inf'), float('-inf')])
        return FPR.numpy(), TPR.numpy(), THR.numpy()

    TPR_list = []
    FPR_list = []
    THR_list = []

    TP = FP = 0
    for i in range(len(scores_sorted)):
        if y_sorted[i] == 1:
            TP += 1
        else:
            FP += 1
        # Check threshold change
        if (i + 1 == len(scores_sorted)) or (scores_sorted[i + 1] != scores_sorted[i]):
            TPR_list.append(TP / P)
            FPR_list.append(FP / N)
            THR_list.append(float(scores_sorted[i].item()))

    # Add anchors
    FPR = torch.tensor([0.0] + FPR_list + [1.0], dtype=torch.float64)
    TPR = torch.tensor([0.0] + TPR_list + [1.0], dtype=torch.float64)
    THR = torch.tensor([float('inf')] + THR_list + [float('-inf')], dtype=torch.float64)

    return FPR.numpy(), TPR.numpy(), THR.numpy()

def auc_trapz_torch(FPR_np, TPR_np):
    """
    Torch trapezoidal AUC (accepts numpy arrays for convenience).
    """
    FPR = torch.from_numpy(FPR_np).double()
    TPR = torch.from_numpy(TPR_np).double()
    # Integrate TPR dFPR
    auc = torch.trapz(TPR, FPR)
    return float(auc.item())

# Demo: F1 at threshold 0.5
TP, FP, TN, FN = confusion_counts_torch(y_true_t, scores_t, threshold=0.5)
P, R, F1 = precision_recall_f1_torch(TP, FP, TN, FN)
print(f"[Torch] Threshold=0.50 -> TP={TP}, FP={FP}, TN={TN}, FN={FN}")
print(f"[Torch] Precision={P:.3f}, Recall={R:.3f}, F1={F1:.3f}")

# Demo: ROC + AUC
FPR_np, TPR_np, THR_np = roc_curve_torch(y_true_t, scores_t)
auc_t = auc_trapz_torch(FPR_np, TPR_np)
print(f"[Torch] AUC (trapz) = {auc_t:.4f}")


[Torch] Threshold=0.50 -> TP=41, FP=19, TN=121, FN=19
[Torch] Precision=0.683, Recall=0.683, F1=0.683
[Torch] AUC (trapz) = 0.8615


### Find Best F1 Threshold (PyTorch Version)  
We search for the threshold that maximizes F1 using PyTorch and NumPy together.  
This step helps us verify consistency between NumPy and PyTorch implementations.


In [7]:
# choose the threshold that maximizes F1 (pure torch/np hybrid)

import numpy as np
import torch

def best_threshold_for_f1_torch(y_true_t: torch.Tensor, scores_t: torch.Tensor):
    """
    Evaluate F1 on a grid of candidate thresholds (unique scores + [0,1]).
    Returns best (threshold, F1, precision, recall).
    """
    uniq = torch.unique(scores_t).cpu().numpy()
    candidates = np.concatenate([[0.0], uniq, [1.0]])
    best = (-1.0, 0.0, 0.0, 0.0)

    y_true_np = y_true_t.cpu().numpy()
    scores_np = scores_t.cpu().numpy()

    for thr in candidates:
        y_pred = (scores_np >= thr).astype(int)
        TP = int(((y_pred == 1) & (y_true_np == 1)).sum())
        FP = int(((y_pred == 1) & (y_true_np == 0)).sum())
        TN = int(((y_pred == 0) & (y_true_np == 0)).sum())
        FN = int(((y_pred == 0) & (y_true_np == 1)).sum())
        P = TP / (TP + FP + 1e-12)
        R = TP / (TP + FN + 1e-12)
        F1 = 2 * P * R / (P + R + 1e-12)
        if F1 > best[1]:
            best = (float(thr), float(F1), float(P), float(R))
    return best

thr_t, f1_t, P_t, R_t = best_threshold_for_f1_torch(y_true_t, scores_t)
print(f"[Torch] Best F1={f1_t:.3f} at threshold={thr_t:.3f} (Precision={P_t:.3f}, Recall={R_t:.3f})")


[Torch] Best F1=0.734 at threshold=0.418 (Precision=0.646, Recall=0.850)
