 REF: https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation#what-are-llm-evaluation-metrics-

# Statistical Scorers

## 1. Perplexity

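Perplexity is a key metric for evaluating how well a language model predicts a sequence of words. Importantly, it does **not** require a ground-truth reference output: it is computed only from the probabilities the model assigns to the tokens of the text being scored.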

### What is Perplexity?

- **Perplexity** measures how "perplexed" or "confused" a model is when predicting the next word in a sequence.
- A **lower perplexity** means the model is less confused and does a better job at predicting the next word.
- A **higher perplexity** indicates more confusion, meaning the model struggles to predict the next word accurately.

### Examples:
1. **Perplexity = 1**: The model assigned probability 1 to every word in the sequence, i.e. it predicted the sequence perfectly.
2. **Perplexity = 10**: On average, the model was as uncertain as if it were choosing uniformly among 10 equally likely options for each next word.
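
A minimal sketch of these two cases, assuming perplexity is defined as the exponential of the average negative log-probability of the observed words. The helper name `perplexity` and the sequence lengths are illustrative choices, not part of any library:

```python
import math

def perplexity(token_probs):
    """Exponential of the mean negative log-probability of the observed tokens."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Every token predicted with probability 1: no uncertainty at all
certain = perplexity([1.0, 1.0, 1.0])

# Every token was one of 10 equally likely options (probability 0.1 each)
uniform_10 = perplexity([0.1] * 5)

print(certain, uniform_10)
```

The sequence length does not matter here: a uniform choice among k options gives a perplexity of k for any number of tokens.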

---

### Example: Perplexity Calculation

**Assume the LLM generated “The hat is on the mat.” for some prompt.**

#### **Step 1: Calculate the probabilities for each word given the previous words.**
For example, we have the following probabilities:

- P(“The”) = 0.5
- P(“hat” | “The”) = 0.4
- P(“is” | “The hat”) = 0.3
- P(“on” | “The hat is”) = 0.4
- P(“the” | “The hat is on”) = 0.5
- P(“mat” | “The hat is on the”) = 0.6

#### **Step 2: Take the natural log of each probability and sum:**

```text
log(P(“The”)) +
log(P(“hat” | “The”)) +
log(P(“is” | “The hat”)) +
log(P(“on” | “The hat is”)) +
log(P(“the” | “The hat is on”)) +
log(P(“mat” | “The hat is on the”)) = log(0.0072) ≈ -4.934
```
#### **Step 3: Negate the average log-probability and apply the exponential:**

```text
exp(-(-4.934) / 6) = exp(0.822) ≈ 2.276
```

Hence, the perplexity is about 2.276: on average, the model was as uncertain as if it were choosing uniformly among roughly 2.3 words at each step of the sequence.
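
The three steps can be checked numerically. This is a small sketch using only the six probabilities listed in Step 1; the variable names are illustrative:

```python
import math

# Conditional probabilities from Step 1, in sequence order
probs = [0.5, 0.4, 0.3, 0.4, 0.5, 0.6]

# Step 2: sum of the natural-log probabilities
log_sum = sum(math.log(p) for p in probs)

# Step 3: negate the average log-probability, then exponentiate
ppl = math.exp(-log_sum / len(probs))

# Equivalent view: the inverse geometric mean of the probabilities
ppl_geometric = math.prod(probs) ** (-1 / len(probs))

print(f"log sum = {log_sum:.3f}, perplexity = {ppl:.3f}")
```

The negation matters: log-probabilities are negative, so without it the exponential would return a value below 1, which cannot be a perplexity.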


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import math

def calculate_perplexity(model, tokenizer, prompt):
    # Tokenize the input prompt
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"]  # shape: (batch_size, seq_length)

    # Forward pass without gradient tracking
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)

    # Convert the logits to log-probabilities over the vocabulary
    logits = outputs.logits
    log_probs = torch.log_softmax(logits, dim=-1)  # (batch_size, seq_length, vocab_size)
    print("\nShape of log probabilities: ", log_probs.shape)

    # batch_size is 1 for a single prompt.
    # seq_length is the number of tokens in the prompt.
    # vocab_size is the size of the model's vocabulary.

    # input_ids.unsqueeze(2) has shape (batch_size, seq_length, 1) and holds the
    # token ID at each position; gather() picks that token's log-probability.
    # NOTE: in a causal LM, the logits at position i are the prediction for the
    # token at position i + 1. This function reads them at position i without
    # shifting, so the value it returns is not the standard next-token perplexity.
    print("\nIndices of token from prompt: ", input_ids.unsqueeze(2))
    token_log_probs = log_probs.gather(2, input_ids.unsqueeze(2))
    print("\nToken probablitites: ", token_log_probs)

    # Average the log-probabilities, negate, and exponentiate
    avg_log_prob = token_log_probs.mean()
    perplexity = math.exp(-avg_log_prob.item())
    return perplexity

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Once upon"

perplexity = calculate_perplexity(model, tokenizer, prompt)
print(f"Perplexity: {perplexity}")



Shape of log probabilities:  torch.Size([1, 2, 50257])

Indices of token from prompt:  tensor([[[7454],
         [2402]]])

Token probablitites:  tensor([[[-11.2868],
         [ -9.3542]]])
Perplexity: 30348.55513310993


In [None]:
prompt = "Once upon a"

perplexity = calculate_perplexity(model, tokenizer, prompt)
print(f"Perplexity: {perplexity}")


Shape of log probabilities:  torch.Size([1, 3, 50257])

Indices of token from prompt:  tensor([[[7454],
         [2402],
         [ 257]]])

Token probablitites:  tensor([[[-11.2868],
         [ -9.3542],
         [ -9.2581]]])
Perplexity: 21297.67980545384


In [None]:
prompt = "Once upon a time"

perplexity = calculate_perplexity(model, tokenizer, prompt)
print(f"Perplexity: {perplexity}")


Shape of log probabilities:  torch.Size([1, 4, 50257])

Indices of token from prompt:  tensor([[[7454],
         [2402],
         [ 257],
         [ 640]]])

Token probablitites:  tensor([[[-11.2868],
         [ -9.3542],
         [ -9.2580],
         [ -7.3801]]])
Perplexity: 11156.584247829815
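
In a causal language model such as GPT-2, the logits at position i are the prediction for the token at position i + 1, so the conventional perplexity scores each token with the logits from the position before it and leaves the first token unscored. Below is a minimal, framework-free sketch of that shifted computation. The function names and the toy logits are hypothetical and are not GPT-2 outputs:

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def shifted_perplexity(logits, token_ids):
    """Perplexity where logits[i] scores token_ids[i + 1].

    The first token has no preceding context, so it is not scored.
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to score a prediction")
    log_probs = [
        log_softmax(logits[i])[token_ids[i + 1]]
        for i in range(len(token_ids) - 1)
    ]
    return math.exp(-sum(log_probs) / len(log_probs))

# Toy example: vocabulary of 3 tokens, sequence of 3 token IDs (made-up numbers)
toy_logits = [
    [0.0, 0.0, 0.0],   # after token 0: uniform over the vocabulary
    [0.0, 0.0, 0.0],   # after token 1: uniform over the vocabulary
    [5.0, 0.0, 0.0],   # after token 2: unused, nothing is left to predict
]
toy_ids = [0, 1, 2]

toy_ppl = shifted_perplexity(toy_logits, toy_ids)
print(toy_ppl)
```

Both scored predictions are uniform over three tokens, so the result is 3. With Hugging Face causal language models, passing `labels=input_ids` returns a `loss` that, per the library documentation, is the mean cross-entropy over the shifted tokens, so `torch.exp(outputs.loss)` should correspond to the same quantity.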


## 2. BLEU Metric

The BLEU (BiLingual Evaluation Understudy) metric is widely used in machine translation tasks to evaluate the quality of text generated by models like GPT-2. It compares the machine-generated translation with one or more reference translations to determine how "good" the generated text is.

BLEU focuses on n-gram precision: it measures how many n-grams (contiguous word sequences, typically of length 1 to 4) in the generated text also appear in the reference text. It uses *modified* precision, which counts each n-gram at most as many times as it occurs in the reference, and applies a brevity penalty so that a system cannot raise precision simply by producing very short translations.
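
A minimal sketch of modified (clipped) n-gram precision for a single reference, assuming whitespace tokenization. The helper names are illustrative and this is not the `evaluate` implementation:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

candidate = "the the the the".split()
reference = "the cat is on the mat".split()

clipped_p1 = modified_precision(candidate, reference, 1)
print(clipped_p1)
```

Without clipping, all four copies of "the" would count as matches and the precision would be 1.0. With clipping, only two are credited, because "the" occurs twice in the reference.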

### Limitations of BLEU

- **Lack of Semantics**: BLEU only measures surface-level n-gram overlap, so a correct paraphrase that uses different words scores poorly.
- **Word Order Problems**: Lower-order n-gram precision ignores word order, so a scrambled sentence can still score high on unigram precision.
- **Non-English Languages**: BLEU relies on word-level n-grams, so it is less reliable for languages with rich morphology, flexible word order, or no whitespace between words.
- **Tokenization Issues**: BLEU scores depend on how the text is tokenized, so scores computed with different tokenizers are not directly comparable.
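
The first two limitations can be illustrated with unigram precision alone. This is a sketch with made-up sentences and an illustrative helper. Full BLEU also uses higher-order n-grams, which do penalize scrambled order:

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision with whitespace tokenization."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    matched = sum(min(count, ref[word]) for word, count in cand.items())
    return matched / sum(cand.values())

reference = "the cat sat on the mat"
scrambled = "mat the on sat cat the"          # same words, meaningless order
paraphrase = "a feline rested upon the rug"   # similar meaning, different words

p_scrambled = unigram_precision(scrambled, reference)
p_paraphrase = unigram_precision(paraphrase, reference)
print(p_scrambled, p_paraphrase)
```

The scrambled sentence reaches a unigram precision of 1.0, while the paraphrase only matches on "the".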


In [None]:
!pip install evaluate



In [None]:
import evaluate
bleu = evaluate.load("bleu")


In [None]:
# Example 1: Perfect Match
predictions = ["the quick brown fox jumps over"]
references = [
    ["the quick brown fox jumps over"]
]
bleu = evaluate.load("bleu")
results = bleu.compute(predictions=predictions, references=references)
print(results)

{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 6, 'reference_length': 6}


In [None]:
# Example 2: Partial Match
predictions = ["the quick brown fox jumps"]
references = [
    ["the quick brown fox leaps over"]]
bleu = evaluate.load("bleu")
results = bleu.compute(predictions=predictions, references=references)
print(results)

{'bleu': 0.5475182535069453, 'precisions': [0.8, 0.75, 0.6666666666666666, 0.5], 'brevity_penalty': 0.8187307530779819, 'length_ratio': 0.8333333333333334, 'translation_length': 5, 'reference_length': 6}


In [None]:
# Example 3: No Match

predictions = ["good morning"]
references = [
    ["the fast brown fox leaps over"]
]
bleu = evaluate.load("bleu")
results = bleu.compute(predictions=predictions, references=references)
print(results)


{'bleu': 0.0, 'precisions': [0.0, 0.0, 0.0, 0.0], 'brevity_penalty': 0.1353352832366127, 'length_ratio': 0.3333333333333333, 'translation_length': 2, 'reference_length': 6}


The brevity penalty (BP) is a factor applied to the final BLEU score to penalize translations that are shorter than the reference. With candidate length c and reference length r, BP = 1 when c > r and BP = exp(1 - r/c) otherwise. A much shorter output has probably missed part of the meaning, yet its n-gram precision can still look high, so the penalty scales the score down. In Example 3 above, c = 2 and r = 6, which gives BP = exp(1 - 6/2) = exp(-2) ≈ 0.135.
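
A minimal sketch of how the pieces combine, assuming a single reference, whitespace tokenization, uniform weights over 1- to 4-grams, and no smoothing. The function `simple_bleu` is illustrative and is not the `evaluate` implementation, although it reproduces the numbers of Example 2:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Counts of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    """Return (bleu, precisions, brevity_penalty) for one candidate and one reference."""
    if not candidate:
        return 0.0, [0.0] * max_n, 0.0

    precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        precisions.append(clipped / total if total else 0.0)

    c_len, r_len = len(candidate), len(reference)
    # No penalty when the candidate is longer than the reference
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)

    # Without smoothing, a single zero precision makes the geometric mean zero
    if min(precisions) == 0:
        return 0.0, precisions, bp
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return bp * geo_mean, precisions, bp

score, precisions, bp = simple_bleu(
    "the quick brown fox jumps".split(),
    "the quick brown fox leaps over".split(),
)
print(score, precisions, bp)
```

The candidate has 5 tokens against a reference of 6, so the penalty is exp(1 - 6/5) ≈ 0.819, which pulls the geometric mean of the precisions (about 0.669) down to roughly 0.548.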

## 3. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

### 3.1. ROUGE-N Metric

ROUGE-N measures the number of matching n-grams between the model-generated text and a human-produced reference.

Consider the reference R and the candidate summary C:

* R: The hat is on the mat.
* C: The hat and the cat.

#### 3.1.1 ROUGE-1

Using **R** and **C**, we are going to compute the `precision`, `recall`, and `F1-score` of the matching `n-grams`. Let’s start computing ROUGE-1 by considering 1-grams only.

* ROUGE-1 ***precision*** is the ratio of the number of unigrams in C that also appear in R (the words “the”, “hat”, and “the”), over the number of unigrams in C.

  ```text
  ROUGE-1 precision = 3/5 = 0.6
  ```

* ROUGE-1 ***recall*** is the ratio of the number of unigrams in R that also appear in C (the words “the”, “hat”, and “the”), over the number of unigrams in R.

  ```text
  ROUGE-1 recall = 3/6 = 0.5
  ```

* Then, the ROUGE-1 F1-score can be obtained directly from the ROUGE-1 precision and recall using the standard F1-score formula.

  ```text
  ROUGE-1 F1-score = 2 * (precision * recall) / (precision + recall) = 0.55
  ```
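
The ROUGE-1 numbers for R and C can be reproduced with a few lines of Python. This sketch assumes a simple tokenization (lowercase, trailing period removed, split on whitespace), which is a choice made here for the example:

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase and drop the final period, as in the worked example
    return text.lower().rstrip(".").split()

reference = Counter(tokenize("The hat is on the mat."))
candidate = Counter(tokenize("The hat and the cat."))

# Counter intersection keeps the minimum count of each shared word
overlap = sum((reference & candidate).values())

precision = overlap / sum(candidate.values())
recall = overlap / sum(reference.values())
f1 = 2 * precision * recall / (precision + recall)

print(overlap, precision, recall, round(f1, 2))
```

The overlap is 3 because “the” is shared twice and “hat” once.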

#### 3.1.2 ROUGE-2
Let’s try computing the ROUGE-2 considering 2-grams.

Remember our reference R and candidate summary C:

* R: The hat is on the mat.
* C: The hat and the cat.

* ROUGE-2 ***precision*** is the ratio of the number of 2-grams in C that also appear in R (only the 2-gram “the hat”), over the number of 2-grams in C.

  ```text
  ROUGE-2 precision = 1/4 = 0.25
  ```

* ROUGE-2 ***recall*** is the ratio of the number of 2-grams in R that also appear in C (only the 2-gram “the hat”), over the number of 2-grams in R.

  ```text
  ROUGE-2 recall = 1/5 = 0.20
  ```

* Therefore, the F1-score is:

  ```text
  ROUGE-2 F1-score = 2 * (precision * recall) / (precision + recall) = 0.22
  ```
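
The same check for ROUGE-2, this time listing the bigrams explicitly so the single shared bigram is visible. As before, the tokenization (lowercase, trailing period removed) is an assumption made for this sketch:

```python
def bigrams(text):
    """Consecutive word pairs of a sentence, lowercased, final period dropped."""
    tokens = text.lower().rstrip(".").split()
    return list(zip(tokens, tokens[1:]))

ref_bigrams = bigrams("The hat is on the mat.")
cand_bigrams = bigrams("The hat and the cat.")

# Credit each candidate bigram at most as often as it occurs in the reference
remaining = list(ref_bigrams)
shared = []
for gram in cand_bigrams:
    if gram in remaining:
        remaining.remove(gram)
        shared.append(gram)

precision = len(shared) / len(cand_bigrams)
recall = len(shared) / len(ref_bigrams)
f1 = 2 * precision * recall / (precision + recall) if shared else 0.0

print(shared, precision, recall, round(f1, 2))
```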

In [None]:
!pip install evaluate
!pip install rouge_score


### 3.2. ROUGE-L Metric

**ROUGE-L** is based on the longest common subsequence (LCS) between our model output and reference, i.e. the longest sequence of words (not necessarily consecutive, but still in order) that is shared between both. A longer shared sequence should indicate more similarity between the two sequences.

We can compute ROUGE-L recall, precision, and F1-score just like we did with ROUGE-N, but this time we replace each n-gram match with the LCS.

Remember our reference R and candidate summary C:

* R: The hat is on the mat.
* C: The hat and the cat.

The LCS is the three-word subsequence `“the hat the”` (the words need not be consecutive, only in the same order), which appears in both R and C.

* ROUGE-L precision is the ratio of the length of the LCS, over the number of unigrams in C.

  ```text
  ROUGE-L precision = 3/5 = 0.6
  ```

* ROUGE-L recall is the ratio of the length of the LCS, over the number of unigrams in R.

  ```text
  ROUGE-L recall = 3/6 = 0.5
  ```

* Therefore, the F1-score is:

  ```text
  ROUGE-L F1-score = 2 * (precision * recall) / (precision + recall) = 0.55
  ```
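
A minimal sketch of the LCS computation behind ROUGE-L, using the standard dynamic-programming recurrence. The function name and the tokenization (lowercase, trailing period removed) are assumptions made for this example:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)             # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))   # skip a token on one side
        prev = curr
    return prev[-1]

reference = "The hat is on the mat.".lower().rstrip(".").split()
candidate = "The hat and the cat.".lower().rstrip(".").split()

lcs = lcs_length(reference, candidate)
precision = lcs / len(candidate)
recall = lcs / len(reference)
f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0

print(lcs, precision, recall, round(f1, 2))
```

For this pair the ROUGE-L scores coincide with ROUGE-1, because all three shared unigrams also occur in the same order in both sentences.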


In [None]:
rouge = evaluate.load('rouge')
predictions = ["hello there", "general kenobi"]
references = ["hello there", "general kenobi"]
results = rouge.compute(predictions=predictions,references=references,use_aggregator=False)
print(results)

{'rouge1': [1.0, 1.0], 'rouge2': [1.0, 1.0], 'rougeL': [1.0, 1.0], 'rougeLsum': [1.0, 1.0]}


In [None]:
import evaluate

rouge = evaluate.load('rouge')

predictions = [
    "Artificial Intelligence is the future",
    "Deep Learning is a subset of Machine Learning",
    "Have a great day!"
]

references = [
    ["AI will revolutionize the world", "The future belongs to Artificial Intelligence"],
    ["Deep Learning is part of AI", "Machine Learning includes Deep Learning"],
    ["Wishing you a wonderful day", "Hope you have an amazing day"]
]

results = rouge.compute(predictions=predictions, references=references,use_aggregator=False)

print(results)


{'rouge1': [0.7272727272727272, 0.6153846153846154, 0.4444444444444445], 'rouge2': [0.4444444444444445, 0.36363636363636365, 0.0], 'rougeL': [0.3636363636363636, 0.5714285714285715, 0.4444444444444445], 'rougeLsum': [0.3636363636363636, 0.5714285714285715, 0.4444444444444445]}
