# ML - LLMs - Evaluation - Perplexity

Perplexity measures how well a probability model predicts a sample. It is the exponential of the cross-entropy between the observed data and the model's predictions, so it reflects the model's uncertainty when predicting the next token.

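The formula for perplexity, where $H(P)$ denotes the cross-entropy (the average negative log-likelihood, using the natural logarithm) of the model's predictions, is:
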
$ \text{Perplexity} = \exp\left(H(P)\right) $

Substituting the cross-entropy formula, where $N$ is the number of predicted tokens and $P(x_i)$ is the probability the model assigns to the observed token $x_i$, gives:

$ \text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i)\right) $
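
As a quick worked example (the three probabilities are made up for illustration), suppose the model assigns probabilities 0.5, 0.25, and 0.125 to three observed tokens:

$ \text{Perplexity} = \exp\left(-\frac{1}{3}\left(\log 0.5 + \log 0.25 + \log 0.125\right)\right) = \exp\left(\frac{6 \log 2}{3}\right) = \exp\left(2 \log 2\right) = 4 $

This is the inverse of the geometric mean of the three probabilities, and it equals the perplexity of a model that picks uniformly among four options at every step.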

In essence, perplexity is a transformation of the cross-entropy loss, expressing the error in terms of the effective number of choices (or "confusion") the model faces. A lower cross-entropy corresponds to a lower perplexity, indicating better model predictions.
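
The "effective number of choices" reading can be checked with a minimal sketch. The helper name `perplexity` and the probability values are assumptions chosen for illustration; they are not part of the notebook's code.

```python
import math

def perplexity(true_token_probs):
    """Perplexity from the probabilities a model assigned to the observed tokens."""
    avg_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(avg_nll)

# A model that is uniformly unsure among 4 options at every step
# has a perplexity of (approximately, up to float rounding) 4.
uniform_over_4 = perplexity([0.25] * 10)

# A more confident model has a lower perplexity (here 1 / 0.9).
confident = perplexity([0.9] * 10)

print(uniform_over_4, confident)
```

A perplexity of 1 would mean the model assigned probability 1 to every observed token.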

To compute perplexity from a batch of logits:
1. Compute the softmax probabilities.
2. Calculate the cross-entropy loss for each sample in the batch.
3. Average the losses.
4. Exponentiate the average loss to obtain the perplexity over the batch.

Learn more: [Perplexity Explanation](https://chatgpt.com/share/67fad511-d220-8009-a7b1-060df0840166)


In [4]:
import numpy as np

def compute_softmax(logits):
    """
    Compute the softmax probabilities for each row in the logits matrix.

    Parameters:
        logits (np.ndarray): A 2D array of shape (n_samples, n_classes)
    
    Returns:
        np.ndarray: A 2D array of the same shape with softmax probabilities.
    """
    # Subtract the row-wise max before exponentiating for numerical stability.
    # Softmax is shift-invariant, so this does not change the result.
    logits_stable = logits - np.max(logits, axis=1, keepdims=True)
    exp_logits = np.exp(logits_stable)
    # Sum along the classes (axis=1) and maintain dimensions for broadcasting
    sum_exp = np.sum(exp_logits, axis=1, keepdims=True)
    # Divide each exponential by the sum of exponentials for that sample
    softmax_probs = exp_logits / sum_exp
    return softmax_probs

def compute_cross_entropy_loss(softmax_probs, true_labels):
    """
    Compute the mean cross-entropy loss over a batch using one-hot encoded true labels.
    
    In this implementation:
      - `softmax_probs` is a 2D numpy array of shape (n_samples, n_classes),
      - `true_labels` is a 2D numpy array of one-hot encoded labels of shape (n_samples, n_classes).
      
    The per-sample loss is computed as:
         loss = - sum( y * log(probabilities) )
    where the summation is taken over the classes.
    
    Because the true labels are one-hot encoded, all elements except the one corresponding to the true class are zero.
    Summing along the classes picks out the negative log probability for the true class.
    
    Parameters:
        softmax_probs (np.ndarray): Predicted probabilities with shape (n_samples, n_classes)
        true_labels (np.ndarray): One-hot encoded true labels with shape (n_samples, n_classes)
    
    Returns:
        float: The cross-entropy loss averaged over all samples in the batch.
    """
    # Small epsilon to avoid log(0) when a probability underflows to zero.
    epsilon = 1e-12
    # Step 2: Per-sample loss. Multiplying by the one-hot labels zeroes out every
    # class except the true one, so the sum over classes is -log p(true class).
    losses = -np.sum(true_labels * np.log(softmax_probs + epsilon), axis=1)

    # Step 3: Average the per-sample losses over the batch.
    avg_loss = np.mean(losses)
    return avg_loss

def calculate_perplexity(logits, true_labels):
    """
    Calculate the perplexity given logits and one-hot encoded true labels.
    
    Steps:
    1. Compute softmax probabilities from logits.
    2. Calculate the cross-entropy loss for each sample.
    3. Average the loss across the batch.
    4. Compute perplexity as the exponential of the average loss.
    
    Parameters:
        logits (np.ndarray): A 2D array of shape (n_samples, n_classes)
        true_labels (np.ndarray): One-hot encoded true labels of the same shape as logits.
        
    Returns:
        float: The perplexity computed over the batch.
    """
    # Step 1: Compute softmax probabilities
    softmax_probs = compute_softmax(logits)
    
    # Steps 2 and 3: Compute the per-sample cross-entropy loss and average it over the batch
    avg_loss = compute_cross_entropy_loss(softmax_probs, true_labels)
    
    # Step 4: Calculate perplexity as the exponential of the average loss
    perplexity = np.exp(avg_loss)
    
    return perplexity

# -----------------------------
# Example usage of the functions:
# -----------------------------

# Step 1: Define Batch Logits and True Labels
# Assume we have a batch of 3 examples, each with logits for 3 classes.
logits = np.array([
    [2.0, 1.0, 0.1],   # Sample 1
    [1.5, 0.5, 0.0],   # Sample 2
    [0.2, 1.2, 0.5]    # Sample 3
])
print("Logits:\n", logits)

# True labels in one-hot encoded form for each sample.
true_labels = np.array([
    [1, 0, 0],  # Sample 1: true class is Class 0
    [0, 1, 0],  # Sample 2: true class is Class 1
    [0, 0, 1]   # Sample 3: true class is Class 2
])
print("True Labels (one-hot):\n", true_labels)

# Calculate perplexity from logits and true labels
perplexity = calculate_perplexity(logits, true_labels)
print("Perplexity over the Batch:", perplexity)


Logits:
 [[2.  1.  0.1]
 [1.5 0.5 0. ]
 [0.2 1.2 0.5]]
True Labels (one-hot):
 [[1 0 0]
 [0 1 0]
 [0 0 1]]
Perplexity over the Batch: 2.909916162855865
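
As a cross-check on the result above, here is a minimal sketch that uses integer class indices instead of one-hot labels and a log-sum-exp log-softmax instead of adding an epsilon. The function names are hypothetical and only the standard library is used; the batch is the same as in the cell above.

```python
import math

def log_softmax(row):
    """Stable log-softmax of one row of logits via the log-sum-exp trick."""
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_sum_exp for v in row]

def perplexity_from_class_indices(logits, targets):
    """Perplexity from logits and integer class indices (no one-hot, no epsilon)."""
    nll = [-log_softmax(row)[t] for row, t in zip(logits, targets)]
    return math.exp(sum(nll) / len(nll))

# Same batch as above; each target is the index of the 1 in the one-hot row.
batch_logits = [[2.0, 1.0, 0.1], [1.5, 0.5, 0.0], [0.2, 1.2, 0.5]]
batch_targets = [0, 1, 2]

result = perplexity_from_class_indices(batch_logits, batch_targets)
print(result)  # approximately 2.9099, matching the output above
```

Working in log space avoids computing `log(p + epsilon)`, which matters mostly when a probability is small enough to underflow.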


# ML - LLMs - Evaluation - Perplexity (Logits from real model prediction)

In [None]:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

def calculate_perplexity(text, manually=True):
    """
    Compute the perplexity for a given text using the model.
    
    There are two modes controlled by the 'manually' flag:
    - If manually=True: the function computes the perplexity manually by:
        1. Encoding the text and obtaining logits.
        2. Shifting the logits and true labels so that we predict token t from tokens < t.
        3. Manually computing softmax probabilities, token-level cross-entropy loss,
            averaging them, and computing perplexity as exp(average loss).
    - If manually=False: the function computes perplexity using the model's built-in loss,
        which automatically handles shifting.
    
    Parameters:
        text (str): The input text.
        manually (bool): If True, use manual calculation; if False, use the model's loss.
    
    Returns:
        float: The computed perplexity.
    """

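    # Note: the tokenizer and model are reloaded on every call; for repeated use,
    # load them once outside the function.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    pp_tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.2-1B-Instruct')
    pp_model = AutoModelForCausalLM.from_pretrained(
        'meta-llama/Llama-3.2-1B-Instruct',
        torch_dtype=torch.bfloat16,
    )
    pp_model.to(device)
    pp_model.eval()

    # Step 1: Encode the text.
    input_ids = pp_tokenizer.encode(text, return_tensors="pt").to(device)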
    
    if manually:
        ignore_index = -100

        # Step 2: Forward pass to obtain logits.
        with torch.no_grad():
            outputs = pp_model(input_ids)
            logits = outputs.logits  # Shape: [batch, seq_len, vocab_size]
        logits = logits.float()
        vocab_size = logits.size(-1)
        
        # Step 3: Build the next-token targets.
        # Pad the input IDs on the right with ignore_index: shape [batch, seq_len+1].
        padded_labels = F.pad(input_ids, (0, 1), value=ignore_index)
        
        # Drop the first token so that the target at position t is token t+1.
        # The last position has no next token, so its target is ignore_index.
        shifted_true_labels = padded_labels[:, 1:].contiguous()  # Shape: [batch, seq_len]
        
        # Step 4: Align logits with the shifted labels.
        shifted_logits = logits[:, :shifted_true_labels.size(1), :]  # Shape: [batch, seq_len, vocab_size]
        
        # Step 5: Manually compute softmax probabilities.
        # For numerical stability, subtract the max logit in each vocabulary slice.
        max_logits, _ = torch.max(shifted_logits, dim=-1, keepdim=True)
        stable_logits = shifted_logits - max_logits
        exp_logits = torch.exp(stable_logits)
        sum_exp = torch.sum(exp_logits, dim=-1, keepdim=True)
        softmax_probs = exp_logits / sum_exp  # Shape: [batch, seq_len, vocab_size]
        
        # Step 6: Prepare valid mask and adjusted labels.
        # Create a mask of valid tokens (those not equal to ignore_index).
        valid_mask = (shifted_true_labels != ignore_index)
        # Create a copy of shifted_true_labels and replace ignore_index values with a dummy index (e.g., 0).
        adjusted_labels = shifted_true_labels.clone()
        adjusted_labels[~valid_mask] = 0
        
        # Step 7: Gather the probabilities for the true (shifted) labels.
        # Since adjusted_labels now contains valid indices everywhere, gather works without error.
        probs_for_true = softmax_probs.gather(dim=-1, index=adjusted_labels.unsqueeze(-1)).squeeze(-1)
        
        # Step 8: Compute token-level cross-entropy loss.
        epsilon = 1e-12
        token_losses = -torch.log(probs_for_true + epsilon)  # Shape: [batch, seq_len]
        # Zero out the losses for tokens that should be ignored.
        token_losses = token_losses * valid_mask.float()
        
        # Step 9: Average the loss only over valid tokens.
        total_loss = token_losses.sum()
        num_valid_tokens = valid_mask.sum().float()
        avg_loss = total_loss / num_valid_tokens
        
        # Step 10: Compute perplexity as the exponential of the average loss.
        perplexity = torch.exp(avg_loss).item()
        
    else:
        # Use the built-in loss from the model.
        with torch.no_grad():
            outputs = pp_model(input_ids, labels=input_ids)
            loss = outputs.loss
        perplexity = torch.exp(loss).item()

    # Report the perplexity together with the mode that was used.
    print("=============")
    print("Calculated perplexity (manually={0}): {1}".format(manually, perplexity))
    print("=============")
    return perplexity

candidate_2 = "The big blue car drives quickly to the road"
perplexity = calculate_perplexity(candidate_2)
print(perplexity)
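
The shift-and-mask logic in the manual branch above can be seen in isolation on toy data. This is a minimal sketch in plain Python: the function names, the 3-token vocabulary, and the log-probabilities are all hypothetical values chosen for illustration, not model output.

```python
import math

def shifted_targets(token_ids, ignore_index=-100):
    """Targets for next-token prediction: position t is scored against token t+1."""
    return token_ids[1:] + [ignore_index]

def masked_perplexity(log_probs, targets, ignore_index=-100):
    """exp of the mean negative log-probability over non-ignored positions."""
    losses = [-row[t] for row, t in zip(log_probs, targets) if t != ignore_index]
    return math.exp(sum(losses) / len(losses))

# Toy sequence over a 3-token vocabulary (hypothetical values).
token_ids = [2, 0, 1]

# log_probs[t][v]: log-probability that token v follows the prefix token_ids[:t+1].
log_probs = [
    [math.log(0.5), math.log(0.25), math.log(0.25)],   # after token 2
    [math.log(0.25), math.log(0.5), math.log(0.25)],   # after tokens 2, 0
    [math.log(0.25), math.log(0.25), math.log(0.5)],   # after the full sequence: no target
]

targets = shifted_targets(token_ids)          # [0, 1, -100]
ppl = masked_perplexity(log_probs, targets)   # both scored tokens had probability 0.5
print(targets, ppl)
```

A sequence of L tokens therefore contributes L-1 scored positions: the first token is never predicted, and the prediction made after the last token has nothing to be compared against.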

# ML - LLMs - Evaluation - BLEU - ROUGE

In [5]:
from collections import Counter
from fractions import Fraction
import math

def calculate_rouge(references, candidate):
    """
    Compute ROUGE-1, ROUGE-2, and ROUGE-L scores.
    For each metric, compute scores for each reference and select the maximum F-measure.
    
    The parameter `references` can be either a single string or a list of strings.
    """
    # Ensure references is a list.
    if not isinstance(references, list):
        references = [references]
    
    def tokenize(sentence):
        return sentence.lower().split()
    
    def ngrams(tokens, n):
        return [tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
    
    def rouge_n(candidate_tokens, reference_tokens, n):
        cand_ngrams = Counter(ngrams(candidate_tokens, n))
        ref_ngrams = Counter(ngrams(reference_tokens, n))
        overlap = sum(min(cand_ngrams[ngram], ref_ngrams[ngram]) for ngram in cand_ngrams)
        precision = overlap / sum(cand_ngrams.values()) if sum(cand_ngrams.values()) > 0 else 0
        recall = overlap / sum(ref_ngrams.values()) if sum(ref_ngrams.values()) > 0 else 0
        if precision + recall == 0:
            fscore = 0
        else:
            fscore = 2 * precision * recall / (precision + recall)
        return precision, recall, fscore

    # Length of the longest common subsequence (LCS), computed bottom-up with dynamic programming.
    def lcs(text1, text2):
        n1 = len(text1)
        n2 = len(text2)
        dp = [[0] * (n1 + 1) for _ in range(n2 + 1)]
        for i in range(n2 - 1, -1, -1):
            for j in range(n1 - 1, -1, -1):
                if text2[i] != text1[j]:
                    dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
                else:
                    dp[i][j] = 1 + dp[i + 1][j + 1]
        return dp[0][0]

    # Define ROUGE-L based on the LCS computed by our lcs function.
    def rouge_l(candidate_tokens, reference_tokens):
        lcs_length = lcs(candidate_tokens, reference_tokens)
        precision = lcs_length / len(candidate_tokens) if candidate_tokens else 0
        recall = lcs_length / len(reference_tokens) if reference_tokens else 0
        if precision + recall == 0:
            fscore = 0
        else:
            fscore = 2 * precision * recall / (precision + recall)
        return precision, recall, fscore

    # Tokenize candidate.
    candidate_tokens = tokenize(candidate)
    
    # Compute ROUGE scores for each reference and select the best for each metric.
    rouge1_scores = []
    rouge2_scores = []
    rougeL_scores = []
    
    for ref in references:
        ref_tokens = tokenize(ref)
        p1, r1, f1 = rouge_n(candidate_tokens, ref_tokens, 1)
        p2, r2, f2 = rouge_n(candidate_tokens, ref_tokens, 2)
        pL, rL, fL = rouge_l(candidate_tokens, ref_tokens)
        rouge1_scores.append((p1, r1, f1))
        rouge2_scores.append((p2, r2, f2))
        rougeL_scores.append((pL, rL, fL))
    
    rouge1 = max(rouge1_scores, key=lambda x: x[2])
    rouge2 = max(rouge2_scores, key=lambda x: x[2])
    rougeL = max(rougeL_scores, key=lambda x: x[2])
    
    return {
        "rouge1": {"precision": rouge1[0], "recall": rouge1[1], "fmeasure": rouge1[2]},
        "rouge2": {"precision": rouge2[0], "recall": rouge2[1], "fmeasure": rouge2[2]},
        "rougeL": {"precision": rougeL[0], "recall": rougeL[1], "fmeasure": rougeL[2]}
    }

def calculate_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25)):
    """
    Compute a sentence-level BLEU score from scratch.
    Tokenizes on whitespace, computes modified (clipped) n-gram precisions
    against one or more references, and applies a brevity penalty.
    """
    # If references is not a list, wrap it.
    if not isinstance(references, list):
        references = [references]
    # Define helper functions.
    def tokenize(sentence):
        return sentence.lower().split()

    def ngrams(tokens, n):
        return [tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)]

    def modified_precision(candidate_tokens, reference_tokens_list, n):
        candidate_ngrams = Counter(ngrams(candidate_tokens, n))
        max_ref_counts = Counter()
        for ref in reference_tokens_list:
            ref_ngrams = Counter(ngrams(ref, n))
            # For each ngram in candidate, update with maximum count observed among references.
            for ngram in candidate_ngrams:
                max_ref_counts[ngram] = max(max_ref_counts[ngram], ref_ngrams[ngram])
        clipped_counts = {ngram: min(count, max_ref_counts[ngram])
                            for ngram, count in candidate_ngrams.items()}
        numerator = sum(clipped_counts.values())
        denominator = sum(candidate_ngrams.values())
        if denominator == 0:
            return 0
        return Fraction(numerator, denominator)

    def closest_reference_length(candidate_tokens, reference_tokens_list):
        candidate_len = len(candidate_tokens)
        ref_lens = [len(ref) for ref in reference_tokens_list]
        return min(ref_lens, key=lambda ref_len: (abs(ref_len - candidate_len), ref_len))

    def brevity_penalty(candidate_tokens, reference_tokens_list):
        c_len = len(candidate_tokens)
        closest_len = closest_reference_length(candidate_tokens, reference_tokens_list)
        if c_len > closest_len:
            return 1
        else:
            return math.exp(1 - closest_len / c_len) if c_len > 0 else 0

    # Main BLEU computation.
    candidate_tokens = tokenize(candidate)
    reference_tokens_list = [tokenize(ref) for ref in references]
    precisions = []
    Hard_Smoothing = False
    for i in range(len(weights)):
        p = modified_precision(candidate_tokens, reference_tokens_list, i+1)
        if Hard_Smoothing:
            if p == 0:
                p = Fraction(1, 10**9)  # smoothing: tiny value
        precisions.append(float(p))
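    # If every n-gram precision is zero, the score is zero.
    if all(p == 0 for p in precisions):
        return 0
    
    # Weighted geometric mean of the n-gram precisions.
    if Hard_Smoothing:
        geo_mean = math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))
    else:
        # Note: zero precisions are skipped here, without renormalizing the weights.
        # Standard BLEU returns 0 whenever any n-gram precision is 0, so this variant
        # yields a non-zero score for candidates with no higher-order n-gram matches.
        geo_mean = math.exp(sum(w * math.log(float(p)) for w, p in zip(weights, precisions) if p != 0))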
    bp = brevity_penalty(candidate_tokens, reference_tokens_list)
    bleu = bp * geo_mean
    return min(bleu, 1)

# Example 1
candidate_1 = "The quick brown dog jumps over the lazy fox"
references_1 = [
    "The quick brown fox jumps over the lazy dog",
    "The fast brown fox leaps over the lazy dog",
]

# Example 2
candidate_2 = "The big blue car drives quickly to the road"
references_2 = [
    "The small red car races quickly along the road",
    "A small red car speeds rapidly down the avenue",
]
bleu_score_scratch_1 = calculate_bleu(references_1, candidate_1, weights=(0.25, 0.25, 0.25, 0.25))
bleu_score_scratch_2 = calculate_bleu(references_2, candidate_2, weights=(0.25, 0.25, 0.25, 0.25))
print("BLEU score:")
print(f"BLEU score for example 1: {bleu_score_scratch_1:.2f}")
print(f"BLEU score for example 2: {bleu_score_scratch_2:.2f}")

reference = ["The quick brown fox jumps over the lazy dog"]
candidate = "The fox jumps over the dog"
print()
print("ROUGE Score:")
print(calculate_rouge(reference, candidate))

BLEU score:
BLEU score for example 1: 0.46
BLEU score for example 2: 0.51

ROUGE Score:
{'rouge1': {'precision': 1.0, 'recall': 0.6666666666666666, 'fmeasure': 0.8}, 'rouge2': {'precision': 0.6, 'recall': 0.375, 'fmeasure': 0.4615384615384615}, 'rougeL': {'precision': 1.0, 'recall': 0.6666666666666666, 'fmeasure': 0.8}}
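
Two ingredients of the BLEU code above, count clipping and the brevity penalty, are easier to see on their own. The following is a minimal sketch with hypothetical helper names, restricted to unigrams and a single reference, and it assumes a non-empty candidate.

```python
import math
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Unigram precision where each candidate word counts at most as often
    as it appears in the reference (assumes a non-empty candidate)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def brevity_penalty(candidate_len, reference_len):
    """1 for candidates at least as long as the reference, exp(1 - r/c) otherwise."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

# Without clipping, every "the" would match and the precision would be 7/7.
# With clipping, only two count, because the reference contains "the" twice.
precision = clipped_unigram_precision("the the the the the the the", "the cat is on the mat")

# A 6-token candidate scored against a 9-token reference is penalized.
bp = brevity_penalty(6, 9)

print(precision, bp)
```

Clipping stops a candidate from being rewarded for repeating a matching word, and the brevity penalty compensates for the fact that precision alone favors very short candidates.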
