# Top-k Accuracy & Perplexity

In this notebook, we explore two evaluation metrics widely used for machine learning and deep learning models: top-k accuracy and perplexity.

---

## 1. Top-k Accuracy
Top-k accuracy measures how often the correct label is among the model’s **top-k predicted classes**.
- **Top-1 Accuracy** = standard accuracy.
- **Top-5 Accuracy** = checks if the true label is within top 5 predictions.
- Common in **image classification** tasks like ImageNet or CIFAR-100.

**Formula:**
$$
\text{Top-k Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{y_i \in \text{TopK}(p_i, k)\}
$$
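To make the formula concrete, here is a minimal pure-Python sketch on three made-up samples. The helper name `top_k_hit` and all scores and labels are illustrative assumptions, not part of the implementations further down.

```python
def top_k_hit(scores, label, k):
    """Return 1 if `label` is among the k highest-scoring class indices, else 0."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return int(label in ranked[:k])

# Three hypothetical samples over 4 classes: (scores, true label)
samples = [
    ([0.1, 0.6, 0.2, 0.1], 1),  # true class has the highest score
    ([0.5, 0.1, 0.3, 0.1], 2),  # true class has the second-highest score
    ([0.4, 0.3, 0.2, 0.1], 3),  # true class has the lowest score
]

# Average the indicator over the N samples, exactly as in the formula
top1 = sum(top_k_hit(s, y, 1) for s, y in samples) / len(samples)
top2 = sum(top_k_hit(s, y, 2) for s, y in samples) / len(samples)
print(f"Top-1: {top1:.3f}  Top-2: {top2:.3f}")
```

Only the first sample counts for top-1, while the first two count for top-2, so the metric can only stay the same or grow as k increases.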

---

## 2. Perplexity
Perplexity quantifies **how well a probabilistic model (usually a language model)** predicts a sequence of tokens.  
It is the exponential of the average negative log-likelihood over the $N$ tokens in the sequence:

$$
\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_{<i})\right)
$$

- Lower perplexity → the model assigns higher probability to the tokens that actually occur; the minimum value is 1.
- Common in **language modeling, text generation**, and **speech recognition**.
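A useful way to read the number: a model that guesses uniformly over a vocabulary of $V$ tokens has perplexity $V$, and a model that always gives the correct token probability 1 has perplexity 1. The sketch below checks both cases in plain Python; the function name `perplexity` and the vocabulary size are assumptions chosen for illustration.

```python
import math

def perplexity(true_token_probs):
    """exp of the mean negative log-probability assigned to the observed tokens."""
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))

V = 10  # hypothetical vocabulary size
uniform_ppl = perplexity([1.0 / V] * 7)  # uniform guess at each of 7 positions
perfect_ppl = perplexity([1.0] * 7)      # always certain and always right
print(uniform_ppl, perfect_ppl)
```

Perplexity can therefore be read as an effective number of equally likely choices per token.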

---

## Goals of This Notebook
- Implement top-k accuracy from scratch using PyTorch & NumPy.  
- Compute **Top-1** and **Top-3** accuracy on dummy data.  
- Implement **Perplexity** for a sample language model output.  
- Compare results and interpret their meaning.



# Top-k Implementation - PyTorch Version

In [1]:
import torch

def topk_accuracy_simple(logits, targets, k=5):
    """
    This function computes Top-k Accuracy (default k=5).

    Inputs:
      logits: model outputs shaped (N, C) where each row has the prediction scores for C classes.
      targets: true labels shaped (N,).
      k: how many top predictions to check.

    Output:
    Returns the percentage of samples where the correct label appears in the model’s top-k predictions.
    """
    # Get the indices of the top k predictions for each sample
    topk = torch.topk(logits, k, dim=1).indices

    # Check if the true label is in the top-k predictions
    correct = topk.eq(targets.view(-1, 1))

    # Compute how many are correct
    correct_count = correct.any(dim=1).sum().item()

    # Convert to percentage
    accuracy = 100.0 * correct_count / logits.size(0)
    return accuracy


In [2]:
# Example: 5 samples, 4 classes
logits = torch.tensor([
    [2.5, 1.2, 0.3, 0.7],
    [0.1, 2.1, 0.5, 0.2],
    [0.8, 0.3, 2.9, 0.1],
    [1.0, 0.5, 0.2, 3.0],
    [0.2, 1.5, 2.2, 0.9]
])
targets = torch.tensor([0, 1, 2, 3, 1])

# Compute Top-1 and Top-3 accuracy
top1 = topk_accuracy_simple(logits, targets, k=1)
top3 = topk_accuracy_simple(logits, targets, k=3)

print(f"Top-1 Accuracy: {top1:.2f}%")
print(f"Top-3 Accuracy: {top3:.2f}%")


Top-1 Accuracy: 80.00%
Top-3 Accuracy: 100.00%
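When several values of k are reported together, one `torch.topk` call with the largest k is enough, because the returned indices are sorted by descending score. The following is a sketch under that assumption; the function name `topk_accuracies` is hypothetical and the data repeats the example above.

```python
import torch

def topk_accuracies(logits, targets, ks=(1, 3)):
    """Return {k: top-k accuracy in percent} using a single torch.topk call."""
    max_k = max(ks)
    # Indices come back sorted by descending score, so the first k columns
    # of the top-max_k result are exactly the top-k predictions.
    top = torch.topk(logits, max_k, dim=1).indices
    hits = top.eq(targets.view(-1, 1))  # (N, max_k) boolean matrix
    return {k: 100.0 * hits[:, :k].any(dim=1).float().mean().item() for k in ks}

logits = torch.tensor([
    [2.5, 1.2, 0.3, 0.7],
    [0.1, 2.1, 0.5, 0.2],
    [0.8, 0.3, 2.9, 0.1],
    [1.0, 0.5, 0.2, 3.0],
    [0.2, 1.5, 2.2, 0.9],
])
targets = torch.tensor([0, 1, 2, 3, 1])

acc = topk_accuracies(logits, targets, ks=(1, 3))
print({k: round(v, 2) for k, v in acc.items()})
```

Note that `torch.topk` raises an error if k exceeds the number of classes, so `max(ks)` must not be larger than C.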


# Top-k Implementation - NumPy Version

In [3]:
import numpy as np

def topk_accuracy_simple_numpy(scores, targets, k=5):
    """
    This function calculates Top-k accuracy using NumPy.

    Inputs:
      scores: an array (N, C) of model outputs (probabilities or logits).
      targets: an array (N,) of true class labels.
      k: number of top predictions to consider (default = 5).

    Output:
      Returns a single float accuracy value, showing the percentage of samples where the true label appears in the model’s top-k predictions.
    """
    # Step 1: Find indices of the top k predictions for each sample
    topk = np.argsort(scores, axis=1)[:, -k:]

    # Step 2: Check if each true label is inside its top-k predictions
    correct = np.any(topk == targets[:, None], axis=1)

    # Step 3: Count how many are correct
    accuracy = 100.0 * np.sum(correct) / len(targets)
    return accuracy


In [4]:
# Example: 5 samples, 4 classes
scores = np.array([
    [2.5, 1.2, 0.3, 0.7],
    [0.1, 2.1, 0.5, 0.2],
    [0.8, 0.3, 2.9, 0.1],
    [1.0, 0.5, 0.2, 3.0],
    [0.2, 1.5, 2.2, 0.9]
])
targets = np.array([0, 1, 2, 3, 1])

# Compute Top-1 and Top-3 accuracy
top1 = topk_accuracy_simple_numpy(scores, targets, k=1)
top3 = topk_accuracy_simple_numpy(scores, targets, k=3)

print(f"Top-1 Accuracy: {top1:.2f}%")
print(f"Top-3 Accuracy: {top3:.2f}%")


Top-1 Accuracy: 80.00%
Top-3 Accuracy: 100.00%
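For a large number of classes, a full `argsort` does more work than a membership check needs. A possible alternative, sketched here with the hypothetical name `topk_accuracy_argpartition`, uses `np.argpartition`, which only guarantees that the k largest entries end up in the last k positions.

```python
import numpy as np

def topk_accuracy_argpartition(scores, targets, k=5):
    """Top-k accuracy (in percent) using a partial sort instead of a full argsort."""
    # The last k columns hold the indices of the k largest scores per row,
    # in arbitrary order, which is sufficient for checking membership.
    topk = np.argpartition(scores, -k, axis=1)[:, -k:]
    correct = np.any(topk == targets[:, None], axis=1)
    return 100.0 * correct.mean()

scores = np.array([
    [2.5, 1.2, 0.3, 0.7],
    [0.1, 2.1, 0.5, 0.2],
    [0.8, 0.3, 2.9, 0.1],
    [1.0, 0.5, 0.2, 3.0],
    [0.2, 1.5, 2.2, 0.9],
])
targets = np.array([0, 1, 2, 3, 1])

top1 = topk_accuracy_argpartition(scores, targets, k=1)
top3 = topk_accuracy_argpartition(scores, targets, k=3)
print(f"Top-1 Accuracy: {top1:.2f}%")
print(f"Top-3 Accuracy: {top3:.2f}%")
```

On this small example it should agree with the `argsort` version; any speed benefit would only show up with many classes.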


# Perplexity - PyTorch Version

Perplexity = “how confused is the model on average per token?”

Mathematically: ppl = exp(average negative log-likelihood).

- Lower ppl → better (model is more confident/correct).

- Typically computed for language models on next-token prediction.

In [5]:
import torch
import torch.nn.functional as F

def perplexity_from_logits_torch(logits, targets):
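    """
    This function computes perplexity from model logits.

    Inputs:
      logits: model outputs shaped (B, T, V): batch, sequence length, vocabulary size.
      targets: true token ids shaped (B, T).

    Process:
        Calculates per-token cross-entropy loss and then takes the exponential of the mean loss.

    Output:
        Returns a single float value (perplexity), a measure of how well the model predicts the sequence (lower is better).
    """
    B, T, V = logits.shape
    # reshape (rather than view) also handles non-contiguous inputs such as sliced logits
    loss_per_token = F.cross_entropy(
        logits.reshape(-1, V),
        targets.reshape(-1),
        reduction="none"
    )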
    avg_nll = loss_per_token.mean()   # average negative log-likelihood
    ppl = torch.exp(avg_nll).item()
    return ppl

# Tiny sanity check
torch.manual_seed(0)
B, T, V = 2, 4, 6
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))
print("PyTorch perplexity:", perplexity_from_logits_torch(logits, targets))


PyTorch perplexity: 8.822303771972656


# Tip: If you already have your average cross-entropy loss as a tensor (e.g., `loss` from your training step, before calling `.item()`), perplexity is just:

In [6]:
# ppl = float(torch.exp(loss))
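As a runnable illustration of this shortcut, the sketch below builds a stand-in `loss` from random logits (the shapes and seed are arbitrary assumptions) and exponentiates it both as a tensor and as a Python float.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 6)           # 8 tokens, vocabulary of 6 (illustrative)
targets = torch.randint(0, 6, (8,))

# Default reduction="mean" already gives the average negative log-likelihood
loss = F.cross_entropy(logits, targets)

ppl_from_tensor = torch.exp(loss).item()  # when `loss` is still a tensor
ppl_from_float = math.exp(loss.item())    # when only the float is available
print(ppl_from_tensor, ppl_from_float)
```

This only equals the per-token perplexity if the loss was averaged over tokens; a loss summed over tokens, or averaged per sequence, would need rescaling first.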

# (Optional) Mask padding in PyTorch (one extra line)

If your sequences are padded with pad_token_id, ignore those tokens:

In [7]:
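def perplexity_from_logits_torch_masked(logits, targets, pad_token_id):
    """
    Same as perplexity_from_logits_torch, but averages only over positions
    whose target is not pad_token_id.
    """
    B, T, V = logits.shape
    loss_per_token = F.cross_entropy(
        logits.reshape(-1, V),
        targets.reshape(-1),
        reduction="none"
    )
    # Keep only real (non-padding) tokens when averaging
    mask = (targets.reshape(-1) != pad_token_id)
    avg_nll = (loss_per_token[mask]).mean()
    return float(torch.exp(avg_nll))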
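An alternative to building the mask by hand is the `ignore_index` argument of `F.cross_entropy`, which leaves the ignored positions out of the mean. The sketch below compares the two approaches; the function name `perplexity_ignore_index`, the padding id `0`, and the example targets are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def perplexity_ignore_index(logits, targets, pad_token_id):
    """Perplexity over non-padding tokens, letting cross_entropy skip the pads."""
    V = logits.shape[-1]
    # With the default reduction="mean", targets equal to ignore_index are
    # excluded from both the sum and the count.
    avg_nll = F.cross_entropy(
        logits.reshape(-1, V),
        targets.reshape(-1),
        ignore_index=pad_token_id,
    )
    return float(torch.exp(avg_nll))

torch.manual_seed(0)
B, T, V = 2, 4, 6
pad_token_id = 0  # hypothetical padding id for this example
logits = torch.randn(B, T, V)
targets = torch.tensor([[3, 1, 4, 0], [2, 5, 0, 0]])  # zeros are padding

# Reference: explicit boolean mask over per-token losses
per_token = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1), reduction="none")
keep = targets.reshape(-1) != pad_token_id
ppl_masked = float(torch.exp(per_token[keep].mean()))
ppl_ignore = perplexity_ignore_index(logits, targets, pad_token_id)
print(ppl_masked, ppl_ignore)
```

Both approaches assume the padding id never appears as a genuine target token; if it can, a separate attention or length mask is the safer choice.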


# Perplexity (from probabilities or logits) - NumPy Version

## A) If you have probabilities for the true tokens

In [8]:
import numpy as np

def perplexity_from_probs_numpy(true_token_probs):
    """
    Args:
        true_token_probs: 1D array of length N with the model's probability
                          assigned to the correct token at each step (0 < p <= 1)
    Returns:
        float perplexity
    """
    # avoid log(0)
    eps = 1e-12
    nll = -np.log(true_token_probs + eps)   # per-token negative log-likelihood
    ppl = float(np.exp(nll.mean()))
    return ppl

# Example with fake probs for 6 tokens
p = np.array([0.3, 0.1, 0.25, 0.8, 0.5, 0.2])
print("NumPy ppl (from probs):", perplexity_from_probs_numpy(p))


NumPy ppl (from probs): 3.443299437293505
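Another way to see this number: perplexity is the reciprocal of the geometric mean of the true-token probabilities. The sketch below checks that identity on the same six probabilities; the variable names are illustrative.

```python
import numpy as np

p = np.array([0.3, 0.1, 0.25, 0.8, 0.5, 0.2])

# Definition: exp of the mean negative log-probability
ppl_exp = float(np.exp(-np.log(p).mean()))

# Equivalent form: 1 / (geometric mean of the probabilities)
ppl_geo = float(1.0 / np.prod(p) ** (1.0 / len(p)))

print(ppl_exp, ppl_geo)
```

The log form is preferable in practice, since the product of many small probabilities underflows quickly for long sequences.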


## B) If you have logits (stable softmax inside)

In [9]:
def softmax_numpy(x, axis=-1):
    x_max = np.max(x, axis=axis, keepdims=True)
    e = np.exp(x - x_max)
    return e / np.sum(e, axis=axis, keepdims=True)

def perplexity_from_logits_numpy(logits, targets):
    """
    Args:
        logits : (B, T, V) numpy array of unnormalized scores
        targets: (B, T)   numpy int array of true token ids
    Returns:
        float perplexity
    """
    B, T, V = logits.shape
    probs = softmax_numpy(logits, axis=-1)
    # pick probability of the true token at each position
    rows = np.arange(B)[:, None]
    cols = np.arange(T)[None, :]
    true_probs = probs[rows, cols, targets]
    return perplexity_from_probs_numpy(true_probs.ravel())

# Tiny example
np.random.seed(0)
B, T, V = 2, 4, 6
logits_np = np.random.randn(B, T, V)
targets_np = np.random.randint(0, V, size=(B, T))
print("NumPy ppl (from logits):", perplexity_from_logits_numpy(logits_np, targets_np))


NumPy ppl (from logits): 4.743589836038051
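If a true token's softmax probability underflows to 0, the softmax-then-log route has to rely on the `eps` guard. A possible variant, sketched here under the hypothetical name `perplexity_from_logits_logsumexp`, computes log-probabilities directly with the log-sum-exp trick and needs no epsilon.

```python
import numpy as np

def perplexity_from_logits_logsumexp(logits, targets):
    """Perplexity from (B, T, V) logits using log-softmax directly."""
    # log_softmax(x) = (x - max) - log(sum(exp(x - max)))
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the true token at each (batch, time) position
    true_log_probs = np.take_along_axis(log_probs, targets[..., None], axis=-1)
    return float(np.exp(-true_log_probs.mean()))

# Small hand-checkable example: B=1, T=2, V=3
logits_np = np.array([[[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]])
targets_np = np.array([[0, 1]])
ppl = perplexity_from_logits_logsumexp(logits_np, targets_np)

# Extreme logits stay finite because probabilities are never formed explicitly
big_ppl = perplexity_from_logits_logsumexp(np.array([[[1000.0, 0.0]]]), np.array([[0]]))
print(ppl, big_ppl)
```

For ordinary inputs this should match `perplexity_from_logits_numpy` up to the tiny effect of `eps`.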
