
# Introduction

## Overview of NLP Generative Tasks

Natural Language Processing (NLP) has seen rapid advancements, leading to the development of various generative tasks. These tasks involve the creation of new text based on certain inputs and contexts. They are pivotal in applications such as chatbots, automated content creation, and language translation. The primary generative tasks we will focus on in this tutorial include:

1. **Text Generation**: This involves automatically generating coherent and contextually relevant text based on a given input. Examples include story generation, content completion, and automated journalism.

2. **Abstractive Summarization**: Unlike extractive summarization which pulls key phrases or sentences directly from the source, abstractive summarization aims to capture the essence of the source material and express it in new, concise terms. This task is crucial in creating summaries for lengthy documents like articles or reports.

3. **Abstractive Question Answering**: This task involves understanding a query posed in natural language and generating an answer that is not necessarily a verbatim segment from the source text. It requires deep comprehension and the ability to generate relevant, accurate responses.

## The Role of Metrics in NLP Model Evaluation

Evaluating the performance of models that handle these generative tasks is not straightforward. Unlike tasks with clear right or wrong answers (such as classification), generative tasks require an assessment of aspects like fluency, coherence, relevance, and factual accuracy of the generated text. This is where metrics come in. They provide a way to quantitatively measure the performance of NLP models, offering insights into aspects such as:

- **Quality**: How well does the generated text align with human standards of language quality?
- **Relevance**: Does the output adequately respond to the input prompt or question?
- **Coherence**: Is the generated text logically structured and understandable?
- **Diversity**: Does the model produce varied and unique outputs, or does it tend to generate repetitive responses?

In this tutorial, we will delve into several key metrics used to evaluate NLP generative tasks, understand how they work, and learn how to implement them using Python in a Jupyter Notebook environment.


# Setting Up the Jupyter Notebook Environment

To effectively work through the examples and code in this tutorial, it's essential to have a properly set up Jupyter Notebook environment. This section will guide you through the process of installing necessary libraries and setting up your Jupyter Notebook.

## Installing Necessary Libraries

In [68]:
!pip install transformers evaluate rouge_score datasets bert_score -qq


## Importing Libraries in Jupyter Notebook

In [54]:
import nltk
import evaluate
import datasets
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import scipy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# Basics of NLP Generative Tasks

In this section, we will briefly introduce the core NLP generative tasks - Text Generation, Abstractive Summarization, Abstractive Question Answering, and Machine Translation. Understanding these tasks is crucial before we dive into the metrics used to evaluate them.

## Text Generation

Text Generation in NLP involves creating meaningful and coherent text automatically. This task has a wide range of applications, from generating creative stories to producing news articles. The key challenge is to ensure that the generated text is relevant, coherent, and contextually appropriate.

## Abstractive Summarization

Abstractive Summarization aims to produce a concise and fluent summary of a longer text. Unlike extractive summarization, which merely selects parts of the source text, abstractive summarization generates new phrases, possibly rephrasing or using new words, to convey the main points of the original text. It's vital in digesting long articles, reports, or even books into shorter, consumable content.

## Abstractive Question Answering

Abstractive Question Answering systems understand a query in natural language and generate a response that directly answers the question. This response is not a mere extraction from a given text but an abstraction, demonstrating an understanding of both the question and the relevant knowledge base or text.

## Machine Translation

Machine Translation is the task of automatically converting text from one language to another. This task is challenging due to the complexity of accurately capturing the meaning and nuances of the original text and translating them into a different language while maintaining fluency and coherence. Applications include translating web pages, documents, or even real-time translation in communication apps.

## Importance of These Tasks

These generative tasks play a crucial role in various applications, from enhancing user experience in AI-driven interfaces to providing critical information in accessible formats. As the demand for sophisticated NLP applications grows, the need for effective and accurate generative models becomes increasingly important.

In the following sections, we will explore various metrics used to evaluate these tasks. Understanding these metrics will help us gauge the performance of NLP models in generating human-like, coherent, and contextually relevant text.

---


# Comprehensive Overview of Evaluation Metrics

Evaluating NLP generative tasks requires a set of metrics that can quantitatively measure various aspects of the generated text. In this section, we will introduce a range of metrics, each providing unique insights into the performance of NLP models. Later, we will explore each metric in detail, including their implementation using the latest libraries from Hugging Face and interpreting their outputs.

## Overview of Key Metrics

1. **BLEU (Bilingual Evaluation Understudy)**: Primarily used in Machine Translation, BLEU measures how many words and phrases in the generated text match a reference text, focusing on precision.

2. **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**: Commonly used in summarization tasks, ROUGE evaluates the quality of summary by measuring the overlap of n-grams between the generated summary and a reference summary.

3. **Perplexity**: Often used in text generation, perplexity measures how well a probability model predicts a sample. It gauges the uncertainty of the language model about the generated text.

4. **METEOR (Metric for Evaluation of Translation with Explicit Ordering)**: Similar to BLEU but with improved correlation with human judgment, METEOR considers word order and synonym matching in translation.

5. **BERTScore**: Utilizing BERT embeddings, this metric compares the semantic similarity between the generated text and the reference text, offering a more nuanced evaluation than n-gram matching.


In the following subsections, we will explore each of these metrics in detail. We will discuss how they work, how to implement them using Python and Hugging Face libraries, and how to interpret their outputs to effectively evaluate NLP models.





# BLEU Metric

## Understanding Clipped Precision and Brevity Penalty

## Clipped Precision vs. Regular Precision

Clipped precision is crucial in the BLEU metric to prevent models from getting undue credit for repeated words. Let's look at an example to understand the difference between regular precision and clipped precision.

### Example to Illustrate Clipped Precision

- **Reference Translation**: "the quick brown fox jumps over the lazy dog"
- **Machine Translation**: "the the the the the the the the the"

#### Regular Precision Calculation
- Total unigrams in Machine Translation: 9
- Unigram matches in Reference: 9 (naively counting "the")
- Regular Precision = (Number of Unigram Matches) / (Total Unigrams in Machine Translation) = 9 / 9 = 1.0

#### Clipped Precision Calculation
- "the" appears only twice in the Reference.
- Clipped count for "the" = 2 (not 9)
- Clipped Precision = 2 / 9 ≈ 0.22

The clipped precision significantly lowers the score, reflecting a more accurate evaluation of the translation quality.
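
The comparison above can be reproduced with a few lines of plain Python. This is a minimal sketch that assumes whitespace tokenization and lowercase input; the variable names are illustrative and not part of any library.

```python
from collections import Counter

reference = "the quick brown fox jumps over the lazy dog".split()
candidate = "the the the the the the the the the".split()

ref_counts = Counter(reference)
cand_counts = Counter(candidate)

# Regular precision: every candidate token that occurs anywhere in the reference counts.
regular_matches = sum(count for token, count in cand_counts.items() if token in ref_counts)
regular_precision = regular_matches / len(candidate)

# Clipped precision: a token is credited at most as often as it occurs in the reference.
clipped_matches = sum(min(count, ref_counts[token]) for token, count in cand_counts.items())
clipped_precision = clipped_matches / len(candidate)

print(regular_precision)            # 1.0
print(round(clipped_precision, 2))  # 0.22
```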

## Brevity Penalty

The brevity penalty is applied when the machine translation is shorter than the reference translation.

### Example of Brevity Penalty

- **Reference Translation**: "The quick brown fox jumps over the lazy dog"
- **Machine Translation**: "A quick fox"

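#### Brevity Penalty Calculation
- Length of Machine Translation: $c = 3$
- Length of Reference Translation: $r = 9$
- Since the machine translation is shorter ($c \leq r$), the brevity penalty is $BP = e^{1 - r/c} = e^{1 - 9/3} = e^{-2} \approx 0.135$, which scales the precision-based score down to about 13.5% of its value.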
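
As a rough illustration, the penalty for this example can be computed directly. The helper name `brevity_penalty` is hypothetical, and returning 0 for an empty candidate is an assumption made here to avoid a division by zero.

```python
import math

def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """BP = 1 if the candidate is longer than the reference, else exp(1 - r / c)."""
    if candidate_len == 0:
        return 0.0  # assumption: an empty candidate receives the maximum penalty
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

# "A quick fox" (3 tokens) against the 9-token reference
bp = brevity_penalty(candidate_len=3, reference_len=9)
print(round(bp, 3))  # 0.135
```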

## Python Implementation using Hugging Face's `evaluate`
To calculate the BLEU score with the `evaluate` library from Hugging Face, use the following Python code:

In [35]:
from evaluate import load

bleu_metric = load("bleu")

reference = ["The quick brown fox jumps over the lazy dog"]
candidate = ["The quick brown fox jumped over the lazy dog"]

result = bleu_metric.compute(predictions=candidate, references=reference, max_order = 1)
result

{'bleu': 0.8888888888888888,
 'precisions': [0.8888888888888888],
 'brevity_penalty': 1.0,
 'length_ratio': 1.0,
 'translation_length': 9,
 'reference_length': 9}

## BLEU equations

Brevity Penalty (BP):

$$
BP =
\begin{cases}
1 & \text{if } c > r \\
e^{(1-r/c)} & \text{if } c \leq r
\end{cases}
$$

Then, the BLEU score is calculated as:

$$
BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log P_n\right)
$$

In the baseline configuration, $N = 4$ and uniform weights $w_n = \frac{1}{N}$ are used, so the exponent is the average of the log precisions $\log P_n$ for $n = 1, \dots, 4$.
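
To make the equations concrete, here is a minimal single-reference sketch in plain Python. It assumes whitespace tokenization and no smoothing, and the function names (`ngrams`, `clipped_precision`, `sentence_bleu`) are illustrative rather than taken from any library.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision P_n for one candidate and one reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matches = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return matches / total if total else 0.0

def sentence_bleu(candidate, reference, max_order=4):
    """BLEU = BP * exp(sum_n w_n * log P_n) with uniform weights w_n = 1 / N."""
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_order + 1)]
    if min(precisions) == 0:
        return 0.0, precisions  # log(0) is undefined; without smoothing the score is 0
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    geo_mean = math.exp(sum(math.log(p) / max_order for p in precisions))
    return bp * geo_mean, precisions

reference = "The quick brown fox jumps over the lazy dog".split()
candidate = "The quick brown fox jumped over the lazy dog".split()
score, precisions = sentence_bleu(candidate, reference)
print([round(p, 3) for p in precisions])  # [0.889, 0.75, 0.571, 0.333]
print(round(score, 4))                    # 0.5969
```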


In [36]:
result = bleu_metric.compute(predictions=candidate, references=reference, max_order = 4)
result

{'bleu': 0.5969491792019646,
 'precisions': [0.8888888888888888,
  0.75,
  0.5714285714285714,
  0.3333333333333333],
 'brevity_penalty': 1.0,
 'length_ratio': 1.0,
 'translation_length': 9,
 'reference_length': 9}

In [43]:
np.exp(np.sum(1/4 * np.log(result['precisions'])))

0.5969491792019646

## BLEU Metric Example for the whole document

When calculating the BLEU score for an entire document with one reference per prediction, you'll follow these steps:

### 1. Tokenize Predictions and References
   - Break down each sentence in the predictions and references into individual words (tokens).

### 2. Count Matches and Clip Counts
   - For each prediction-reference pair:
     - Count the occurrences of each unigram (word) in the prediction.
     - Count the occurrences of each unigram in the corresponding reference.
     - Clip the count of each unigram in the prediction to the count in the reference. This means if a unigram appears more in the prediction than in the reference, its count is reduced to match the reference count.

### 3. Calculate Clipped Precision
   - Sum the clipped counts for all predictions.
   - Divide this sum by the total number of unigrams across all predictions. This gives the proportion of unigrams in the predictions that are correctly predicted, according to the references.

### 4. Calculate Total Lengths
   - Compute the total length of the predictions (sum of the number of unigrams in all predictions).
   - Compute the total length of the references (sum of the number of unigrams in each reference, using the length of the single reference for each prediction).

### 5. Calculate Brevity Penalty
   - The brevity penalty compensates for the tendency of shorter translations to have higher precision. It's calculated using the ratio of the total length of the translations to the total length of the references.
   - If the total length of the translations is less than the total length of the references, the brevity penalty is less than 1, reducing the BLEU score. Otherwise, it is 1 (no penalty).

### 6. Compute BLEU Score
   - Multiply the clipped precision by the brevity penalty to get the final BLEU score. This walkthrough uses unigrams only (`max_order=1`); with higher orders, the single precision is replaced by the geometric mean of the clipped n-gram precisions.

### 7. Interpret the Score
   - A higher BLEU score indicates better translation quality, with a score of 1 being perfect (an exact match with the references).

In summary, the BLEU score calculation compares the translated text against reference translations at the unigram level, combining accuracy (through clipped precision) with a check on output length (through the brevity penalty). This method provides a quantitative measure of translation quality for the entire document.

In [22]:
from evaluate import load

bleu = load("bleu")

predictions = ["The fast brown fox jumps over the lazy dog",
    "A speedy fox jumps over a dog"]

references = [
    ["The swift brown fox leaps over the lazy dog"],
    ["The quick brown fox jumps over the lazy dog"]
  ]


results = bleu.compute(predictions=predictions, references=references, max_order =1)

results

{'bleu': 0.6067166205269093,
 'precisions': [0.6875],
 'brevity_penalty': 0.8824969025845955,
 'length_ratio': 0.8888888888888888,
 'translation_length': 16,
 'reference_length': 18}

In [24]:
import math
from collections import Counter

# Define the predictions and references
predictions = ["the fast brown fox jumps over the lazy dog", "a speedy fox jumps over a dog"]
references = [
    ["the swift brown fox leaps over the lazy dog"],
    ["the quick brown fox jumps over the lazy dog"]
]

# Step 1: Tokenize Predictions and References
tokenized_predictions_manual = [p.split() for p in predictions]
tokenized_references_manual = [[r.split() for r in group] for group in references]

# Step 2: Count Matches and Clip Counts
matches = 0
for pred, refs in zip(tokenized_predictions_manual, tokenized_references_manual):
    pred_counter = Counter(pred)
    ref_counter = Counter(refs[0])  # Use the single reference
    print(pred_counter)
    print(ref_counter)
    print("------------------------------")
    # Clip each unigram's count to its count in the reference
    matches += sum(min(pred_counter[unigram], ref_counter[unigram]) for unigram in pred_counter)

# Step 3: Calculate Clipped Precision
total_unigrams_in_predictions = sum(len(p) for p in tokenized_predictions_manual)
clipped_precision_manual = matches / total_unigrams_in_predictions

# Step 4: Calculate Total Lengths
total_translation_length = total_unigrams_in_predictions
total_reference_length = sum(len(r[0]) for r in tokenized_references_manual)  # Use the length of the single reference

# Step 5: Calculate Brevity Penalty
length_ratio = total_translation_length / total_reference_length
brevity_penalty_manual = math.exp(1 - 1 / length_ratio) if length_ratio < 1 else 1.0

# Step 6: Compute BLEU Score (unigram precision only, matching max_order=1)
bleu_score_manual = clipped_precision_manual * brevity_penalty_manual

# Final BLEU Score and components
bleu_score_manual, clipped_precision_manual, brevity_penalty_manual, length_ratio



Counter({'the': 2, 'fast': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'over': 1, 'lazy': 1, 'dog': 1})
Counter({'the': 2, 'swift': 1, 'brown': 1, 'fox': 1, 'leaps': 1, 'over': 1, 'lazy': 1, 'dog': 1})
------------------------------
Counter({'a': 2, 'speedy': 1, 'fox': 1, 'jumps': 1, 'over': 1, 'dog': 1})
Counter({'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'over': 1, 'lazy': 1, 'dog': 1})
------------------------------


(0.6067166205269093, 0.6875, 0.8824969025845955, 0.8888888888888888)

## Extended BLEU Metric Example with Multiple References

## Scenario Setup

We're evaluating two machine translations, each with a different set of reference translations.

### Translation 1
- **Machine Translation 1**: "The fast brown fox jumps over the lazy dog"
- **Reference Translations**:
  - Reference 1: "The quick brown fox jumps over the lazy dog"
  - Reference 2: "The swift brown fox leaps over the lazy dog"

### Translation 2
- **Machine Translation 2**: "A speedy fox jumps over a dog"
- **Reference Translation**:
  - Reference: "The quick brown fox jumps over the lazy dog"


### Reference Counts for Each Token:
- **Maximum Count Across References:** For each unigram (or n-gram) in the prediction, you determine its count in each reference and then take the maximum of these counts. This method accounts for the possibility that different references might have varying frequencies of the same word.
- **Example:** If "the" appears 2 times in one reference and 3 times in another, you would use 3 as its count for clipping. Similarly, if "fox" appears 1 time in one reference and 2 times in another, you would use 2 for its count.

### Clipping Counts:
- The count of each unigram in the prediction is clipped to this maximum count found across all references. This approach ensures fairness by considering all possible valid translations represented in the references.

### Brevity Penalty:
- **Length of Shortest References:** For the brevity penalty, the `bleu` implementation in Hugging Face's `evaluate` uses the length of the shortest reference for each prediction, which avoids penalizing a translation for being shorter than an unnecessarily long reference. (The original BLEU definition instead uses the reference length closest to the candidate length.)
- **Penalty Calculation:** The brevity penalty is calculated based on the ratio of the total length of the translations to the total length of these shortest references. If the translations are shorter, the penalty reduces the BLEU score, balancing precision with adequate translation length.

In summary, handling multiple references per prediction in BLEU score calculation involves taking the maximum count of each unigram across all references and using the shortest reference length for calculating the brevity penalty. This approach ensures a balanced and fair assessment of translation quality, accommodating the diversity in possible correct translations.
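
The two rules can be sketched by hand for unigrams. This is an illustrative sketch, not the library's code: it assumes whitespace tokenization and that each prediction's references are given as a flat list of strings.

```python
from collections import Counter

predictions = ["the fast brown fox jumps over the lazy dog",
               "a speedy fox jumps over a dog"]
references = [
    ["the quick brown fox jumps over the lazy dog",
     "the swift brown fox leaps over the lazy dog"],
    ["the quick brown fox jumps over the lazy dog"],
]

matches = 0
total_pred_len = 0
total_ref_len = 0
for pred, refs in zip(predictions, references):
    pred_counts = Counter(pred.split())
    ref_tokens = [r.split() for r in refs]

    # Maximum count of each token across all references for this prediction.
    max_ref_counts = Counter()
    for tokens in ref_tokens:
        max_ref_counts |= Counter(tokens)  # Counter union keeps the larger count

    # Counter intersection keeps the smaller count, i.e. the clipped count.
    matches += sum((pred_counts & max_ref_counts).values())
    total_pred_len += sum(pred_counts.values())
    total_ref_len += min(len(tokens) for tokens in ref_tokens)  # shortest reference

precision = matches / total_pred_len
print(matches, total_pred_len, total_ref_len)  # 12 16 18
print(precision)                               # 0.75
```

Here the first prediction gets credit for "jumps" because it appears in one of its two references, even though the other reference uses "leaps".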

In [32]:
from evaluate import load

bleu = load("bleu")

predictions = ["the fast brown fox jumps over the lazy dog",
    "a speedy fox jumps over a dog"]

# Each prediction gets a flat list of reference strings
references = [
    ["the quick brown fox jumps over the lazy dog",
     "the swift brown fox leaps over the lazy dog"],
    ["the quick brown fox jumps over the lazy dog"]
]

results = bleu.compute(predictions=predictions, references=references, max_order=1)

# Round the float fields for display
{k: round(v, 4) if isinstance(v, float) else v for k, v in results.items()}


{'bleu': 0.6619,
 'precisions': [0.75],
 'brevity_penalty': 0.8825,
 'length_ratio': 0.8889,
 'translation_length': 16,
 'reference_length': 18}

## Limitations and Bias of BLEU Metric:

1. **Focus on Token Overlap, Not Meaning**: BLEU primarily assesses the similarity in tokens between the predicted and reference texts, rather than their semantic equivalence. This can lead to situations where BLEU scores do not align with human judgments of translation quality, as humans consider the conveyed meaning rather than just token matching.

2. **Bias Towards Shorter Translations**: The way BLEU is calculated inherently favors shorter translations, which may achieve higher scores due to their token overlap structure. To mitigate this, a brevity penalty is implemented, but this is only a partial solution.

3. **Lack of Comparability Across Datasets and Languages**: BLEU scores are not universally comparable. They can vary significantly when applied to different datasets or languages, making it difficult to use BLEU for standardized cross-context comparisons.

4. **Sensitivity to Parameters and Techniques Used**: The BLEU score is highly sensitive to the specific parameters and techniques used in its computation, such as tokenization methods and normalization processes. This sensitivity means that BLEU scores calculated with different sets of parameters or techniques are not directly comparable. This aspect is further discussed in related literature and debates surrounding the metric's application.

# ROUGE Metric

## Overview of ROUGE

ROUGE stands for **R**ecall-**O**riented **U**nderstudy for **G**isting **E**valuation. It is a collection of metrics used for evaluating text summarization and machine translation. ROUGE measures the overlap of content between generated text and reference texts, emphasizing capturing significant content.

### ROUGE Variants
- **ROUGE-N**: Measures n-gram overlap (unigrams, bigrams, trigrams) between generated text and reference.
- **ROUGE-L**: Focuses on the longest common subsequence (LCS) between the texts.
- **ROUGE-Lsum**: Similar to ROUGE-L, but applies LCS calculation sentence-by-sentence.


## Detailed Example and Calculation
### **Scenario**
**Predictions:**<br>
"police kill the gunman"<br>
"the gunman kill police"<br>
"the gunman police killed"<br>

**References** (all three are the same): "police killed the gunman"

### ROUGE-1 (Unigram Overlap)
For each prediction, we will calculate the number of matching unigrams with the reference:

- **Reference Summary**: ["police", "killed", "the", "gunman"]
- **Prediction 1**: ["police", "kill", "the", "gunman"] - 3 matches ("police", "the", "gunman")
- **Prediction 2**: ["the", "gunman", "kill", "police"] - 3 matches ("the", "gunman", "police")
- **Prediction 3**: ["the", "gunman", "police", "killed"] - 4 matches ("the", "gunman", "police", "killed")

Now, let's calculate the ROUGE-1 scores:

- **ROUGE-1 Recall**: Number of Matching Unigrams / Total Unigrams in Reference
  - For Prediction 1 and 2: 3 / 4 = 0.75
  - For Prediction 3: 4 / 4 = 1.0
- **ROUGE-1 Precision**: Number of Matching Unigrams / Total Unigrams in Prediction
  - For Prediction 1 and 2: 3 / 4 = 0.75
  - For Prediction 3: 4 / 4 = 1.0
- **ROUGE-1 F1 Score**:
$$F1 = 2 \times \frac{{\text{Precision} \times \text{Recall}}}{{\text{Precision} + \text{Recall}}}$$

  - The harmonic mean of precision and recall: **[0.75, 0.75, 1]**

### ROUGE-2 (Bigram Overlap)
Now for bigrams:

- **Reference Summary Bigrams**: ["police killed", "killed the", "the gunman"]
- **Prediction 1 Bigrams**: ["police kill", "kill the", "the gunman"] - 1 match ("the gunman")
- **Prediction 2 Bigrams**: ["the gunman", "gunman kill", "kill police"] - 1 match ("the gunman")
- **Prediction 3 Bigrams**: ["the gunman", "gunman police", "police killed"] - 2 matches ("the gunman", "police killed")

- **ROUGE-2 Recall**: Number of Matching Bigrams / Total Bigrams in Reference
  - For Prediction 1 and 2: 1 / 3 = 0.33
  - For Prediction 3: 2 / 3 = 0.67
- **ROUGE-2 Precision**: Number of Matching Bigrams / Total Bigrams in Prediction
  - For Prediction 1 and 2: 1 / 3 = 0.33
  - For Prediction 3: 2 / 3 = 0.67
- **ROUGE-2 F1 Score**: The harmonic mean of precision and recall: **[0.33, 0.33, 0.67]**
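
The ROUGE-1 and ROUGE-2 figures above can be checked with a short sketch. It assumes whitespace tokenization and no stemming, and `rouge_n_f1` is a hypothetical helper name, not a library function.

```python
from collections import Counter

def rouge_n_f1(prediction: str, reference: str, n: int) -> float:
    """F1 of n-gram overlap between one prediction and one reference."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    pred = Counter(tuple(pred_tokens[i:i + n]) for i in range(len(pred_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    overlap = sum((pred & ref).values())  # intersection keeps the smaller count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

predictions = ["police kill the gunman",
               "the gunman kill police",
               "the gunman police killed"]
reference = "police killed the gunman"

rouge1 = [round(rouge_n_f1(p, reference, 1), 2) for p in predictions]
rouge2 = [round(rouge_n_f1(p, reference, 2), 2) for p in predictions]
print(rouge1)  # [0.75, 0.75, 1.0]
print(rouge2)  # [0.33, 0.33, 0.67]
```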

### ROUGE-L (Longest Common Subsequence): Sentence Level LCS
We will demonstrate the calculations for each sentence separately.

ROUGE-L is based on the longest common subsequence (LCS): the longest sequence of words that appears in both texts in the same order, though not necessarily contiguously.

- **LCS for Prediction 1**: "police ... the gunman", **Length of LCS = 3**
- **LCS for Prediction 2**: "the gunman", **Length of LCS = 2**
- **LCS for Prediction 3**: "the gunman" or "police killed", **Length of LCS = 2**

- **ROUGE-L Recall**: Length of LCS / Total Words in Reference
  - For Prediction 1: 3 / 4 = 0.75
  - For Prediction 2 and 3: 2 / 4 = 0.5
- **ROUGE-L Precision**: Length of LCS / Total Words in Prediction
  - For Prediction 1: 3 / 4 = 0.75
  - For Prediction 2 and 3: 2 / 4 = 0.5
- **ROUGE-L F1 Score**: The harmonic mean of precision and recall: **[0.75, 0.5, 0.5]**
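
The LCS lengths can be computed with the standard dynamic-programming table. The sketch below assumes whitespace tokenization; `lcs_length` and `rouge_l_f1` are illustrative names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(prediction: str, reference: str) -> float:
    """Sentence-level ROUGE-L F1 from LCS-based precision and recall."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

predictions = ["police kill the gunman",
               "the gunman kill police",
               "the gunman police killed"]
reference = "police killed the gunman"

lcs_lengths = [lcs_length(p.split(), reference.split()) for p in predictions]
rouge_l = [rouge_l_f1(p, reference) for p in predictions]
print(lcs_lengths)  # [3, 2, 2]
print(rouge_l)      # [0.75, 0.5, 0.5]
```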





In [113]:
from evaluate import load

# Define a simple whitespace tokenizer function
def whitespace_tokenizer(text):
    return text.split()

# Load ROUGE metric
rouge = load("rouge")

# Define predictions and references
predictions = ["police kill the gunman", "the gunman kill police", "the gunman police killed"]
references = ["police killed the gunman", "police killed the gunman", "police killed the gunman"]

# Compute ROUGE scores using the custom whitespace tokenizer
results = rouge.compute(predictions=predictions, references=references, tokenizer=whitespace_tokenizer,
                        use_stemmer=False, use_aggregator = False)
print("Detailed ROUGE Score:")
results

Detailed ROUGE Score:


{'rouge1': [0.75, 0.75, 1.0],
 'rouge2': [0.3333333333333333, 0.3333333333333333, 0.6666666666666666],
 'rougeL': [0.75, 0.5, 0.5],
 'rougeLsum': [0.75, 0.5, 0.5]}

With a single sentence per prediction, as in the example above, `rougeL` and `rougeLsum` are identical.

Setting `use_aggregator=True` instead returns one aggregated score per metric across the three predictions:

Detailed ROUGE Score:
{'rouge1': 0.8333333333333334, 'rouge2': 0.4444444444444444, 'rougeL': 0.5833333333333334, 'rougeLsum': 0.5833333333333334}


This code computes the ROUGE scores for our example. The output contains F1 scores for ROUGE-N (`rouge1`, `rouge2`), ROUGE-L (`rougeL`), and ROUGE-Lsum (`rougeLsum`), giving a view of summarization quality at the unigram, bigram, and subsequence levels.



## Limitations and Considerations
- **Focus on Overlap**: Like BLEU, ROUGE primarily measures overlap in tokens, which may not fully capture the semantic accuracy and coherence of the generated text.
- **Greater Emphasis on Recall**: ROUGE emphasizes content coverage from the reference in the generated text, which can sometimes overlook the conciseness and relevance aspects of a good summary.
- **Variability with Reference Quality**: The effectiveness of ROUGE is heavily dependent on the quality and representativeness of the reference summaries.
- **Sensitivity to Text Preprocessing**: Similar to BLEU, ROUGE scores can be influenced by the text preprocessing techniques used, such as tokenization and stemming.


# Perplexity

Perplexity measures how well a probability model predicts a sample. In natural language processing it is commonly used to gauge language models: it is the exponential of the average negative log-likelihood the model assigns to each token of a sequence, or equivalently the inverse of the geometric mean of the per-token probabilities. Lower perplexity indicates that the model assigns higher probability to the words that actually occur in the sequence.

### Connection to Cross-Entropy Loss

Perplexity is intrinsically related to the cross-entropy loss function, which is a standard objective for training language models:

- **Cross-Entropy Loss**: During the training of a language model, the cross-entropy loss function evaluates the quality of the model's predictions. It does so by calculating the negative log-likelihood of the predicted probability distribution against the actual distribution of words in the training data. For a given sequence of words $( w_1, w_2, ..., w_N )$, the loss is computed as:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i | w_1, ..., w_{i-1})$$

Here, $p(w_i | w_1, ..., w_{i-1})$ is the probability assigned by the model to the actual word $w_i$, given the preceding words in the sequence.

- **Perplexity Calculation**: Perplexity is defined as the exponential of the average cross-entropy loss across a dataset. If the base of the logarithm in the loss function is $e$, then the perplexity is directly the exponentiation of the loss:

$$ \text{Perplexity} = e^{L}$$

This relationship means that a model with lower cross-entropy loss will have lower perplexity, indicating that the model is "surprised" less often by the text it sees. In essence, perplexity can be understood as a measure of uncertainty in the model's predictions: a perplexity of $k$ corresponds to the model being, on average, as uncertain as if it were choosing uniformly among $k$ tokens. Lower values signify higher confidence and better performance.
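
The relationship between the loss and perplexity can be illustrated without a model. The sketch below uses made-up per-token probabilities (they are assumptions for illustration, not outputs of any real language model), and the helper name `perplexity_from_probs` is hypothetical.

```python
import math

def perplexity_from_probs(token_probs):
    """exp of the average negative log-likelihood of the observed tokens."""
    loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(loss)

# A model that spreads probability uniformly over 4 choices at every step
uniform = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])

# A model that assigns high probability to each observed token
confident = perplexity_from_probs([0.9, 0.8, 0.95, 0.85])

print(round(uniform, 2))  # 4.0
print(confident < uniform)  # True
```

The uniform case shows the "choosing among $k$ options" reading of perplexity: probability 1/4 at every step gives a perplexity of 4.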

### Evaluating Models with Perplexity

In practice, perplexity provides a quantitative measure to compare the performance of different models or the same model at different points in time. For instance, when tuning hyperparameters or making architectural changes, a decrease in perplexity on a validation set would suggest improvements in the model's predictive capabilities.

However, it is essential to complement perplexity with other metrics, especially for tasks such as machine translation, summarization, or text generation, where coherence, fluency, and alignment with human judgment are critical. While perplexity can reflect the model's grasp of the language at a word-by-word level, it may not capture the overall quality of text produced in such generative tasks.

## Perplexity using evaluate

In [49]:
from evaluate import load

# Load the perplexity metric
perplexity = load("perplexity", module_type="metric")

# Define the text samples
input_texts = ["The quick brown fox jumps over the lazy dog."]

# Calculate perplexity using the pre-trained model 'gpt2'
results = perplexity.compute(model_id='gpt2',
                             add_start_token=False,
                             predictions=input_texts)

results


{'perplexities': [162.45657348632812], 'mean_perplexity': 162.45657348632812}

## Calculate Perplexity Manually

In [51]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox jumps over the lazy dog.",
                   return_tensors="pt")

outputs = model(input_ids=inputs["input_ids"],
             labels=inputs["input_ids"])

loss = outputs.loss
ppl = torch.exp(loss)

print(f"Perplexity: {ppl.item():.2f}")


Perplexity: 162.47


In [55]:
perplexity = evaluate.load("perplexity", module_type="metric")
input_texts = datasets.load_dataset("wikitext",
                                    "wikitext-2-raw-v1",
                                    split="test")["text"][:50]
input_texts = [s for s in input_texts if s!='']
results = perplexity.compute(model_id='gpt2',
                             predictions=input_texts)
print(list(results.keys()))
print(round(results["mean_perplexity"], 2))
print(round(results["perplexities"][0], 2))


['perplexities', 'mean_perplexity']
320.84
567.9


# METEOR: Metric for Evaluation of Translation with Explicit Ordering

### Overview

METEOR is a metric for evaluating machine translation output. It is based on the harmonic mean of unigram precision and recall, with recall weighted more heavily than precision. It was designed to correlate well with human judgment at the sentence level, whereas BLEU targets correlation at the corpus level.

### Algorithm and Matching Techniques

METEOR aligns words between candidate and reference sentences using three types of matches:

1. **Exact Match**: The algorithm starts by aligning words that are exactly the same in both the hypothesis and reference.
2. **Stem Match**:  If an exact match is not found, the algorithm looks for a stem match, where words are aligned if their root forms (stems) match after removing any affixes.
3. **Synonym Match**:  Finally, if neither an exact nor a stem match is found, the algorithm checks for synonyms using resources like WordNet.

These matches are incorporated into the calculation of precision and recall:

$$ P = \frac{\text{Number of mapped unigrams in candidate}}{\text{Total unigrams in candidate}} $$
$$ R = \frac{\text{Number of mapped unigrams in candidate}}{\text{Total unigrams in reference}} $$

These are combined using a weighted harmonic mean in which recall is weighted nine times more heavily than precision:

$$ F_{\text{mean}} = \frac{10 \cdot P \cdot R}{R + 9 \cdot P} $$
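
To see the effect of the recall weighting, the formula can be evaluated for two cases that swap precision and recall. This is a minimal sketch; the function name `f_mean` is an assumption made for illustration.

```python
def f_mean(precision, recall):
    """Weighted harmonic mean from the formula above (recall weighted 9:1)."""
    if precision == 0 or recall == 0:
        return 0.0
    return 10 * precision * recall / (recall + 9 * precision)

high_recall = f_mean(precision=0.5, recall=1.0)
high_precision = f_mean(precision=1.0, recall=0.5)
print(round(high_recall, 3), round(high_precision, 3))
```

A candidate with full recall and 0.5 precision scores about 0.909, while the reverse case scores about 0.526, so missing reference content costs far more than producing extra words.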

### Penalty for Fragmentation

METEOR penalizes poor ordering of the matched words. Mapped unigrams are grouped into chunks, where a chunk is a run of matches that are adjacent in both the candidate and the reference. Fewer, longer chunks indicate better word order, while many short chunks lead to a higher penalty:

$$ p = 0.5 \left( \frac{c}{u_m} \right)^3 $$

where $c$ is the number of chunks and $u_m$ is the number of mapped unigrams. The final score is $F_{\text{mean}} \cdot (1 - p)$, so the penalty can reduce the score by up to 50% when no matches extend beyond a single unigram.
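
The chunk count and the penalized score can be sketched directly from an alignment. The code below is an illustration under the formulas given in this section, not a library implementation; the function names and the example alignment are assumptions.

```python
def count_chunks(alignment):
    """Count runs of matches that are adjacent in both candidate and reference.

    alignment: list of (candidate_index, reference_index) pairs.
    """
    chunks = 0
    previous = None
    for cand_i, ref_i in sorted(alignment):
        starts_new_chunk = (
            previous is None
            or cand_i != previous[0] + 1
            or ref_i != previous[1] + 1
        )
        if starts_new_chunk:
            chunks += 1
        previous = (cand_i, ref_i)
    return chunks

def meteor_from_alignment(alignment, candidate_len, reference_len):
    """F_mean scaled by (1 - fragmentation penalty)."""
    mapped = len(alignment)
    if mapped == 0:
        return 0.0
    precision = mapped / candidate_len
    recall = mapped / reference_len
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (count_chunks(alignment) / mapped) ** 3
    return f_mean * (1 - penalty)

# Hypothetical alignment: 7 of 9 words mapped, with gaps at positions 4 and 7
alignment = [(0, 0), (1, 1), (2, 2), (3, 3), (5, 5), (6, 6), (8, 8)]
print(count_chunks(alignment))
print(round(meteor_from_alignment(alignment, 9, 9), 4))
```

The two gaps split the matches into three chunks, so the penalty is $0.5 \cdot (3/7)^3 \approx 0.039$ and the score drops slightly below the $F_{\text{mean}}$ of $7/9$.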

### Correlation with Human Judgment

In reported evaluations, METEOR shows higher correlation with human judgment than BLEU, with coefficients of up to 0.964 at the corpus level and up to 0.403 at the sentence level.

### Advantages of METEOR

- **Alignment with Human Judgment**: It correlates better with human judgment on translation quality compared to BLEU, particularly at the sentence or segment level.
- **Recall-Oriented**: By emphasizing recall, METEOR acknowledges the importance of capturing all the content of the reference translation.
- **Flexible Matching**: The use of stemming and synonyms allows for a more flexible match between translations and references.

### Limitations of METEOR

- **Complexity**: METEOR is more computationally intensive than BLEU due to its multi-stage matching process and use of external linguistic resources for stemming and synonyms.
- **Language Dependency**: While METEOR is designed to be language-independent, its effectiveness can vary depending on the availability and quality of linguistic resources for the target language.

### Conclusion

METEOR's multi-faceted approach, which includes exact, stem, and synonym matching, along with a penalty for fragmentation, offers a nuanced assessment of translation quality. Despite its computational intensity, its strong correlation with human judgment makes it a valuable metric for sentence-level evaluation in machine translation.

### **Exact Match**

In [56]:
from evaluate import load

# Load the METEOR metric
meteor = load('meteor')

# Example prediction and reference with an exact match
predictions = ["The quick brown fox jumps over the lazy dog"]
reference = ["The quick brown fox jumps over the lazy dog"]

# Compute METEOR score
results = meteor.compute(predictions=predictions, references=reference)

# Print the rounded METEOR score
print(round(results['meteor'], 2))


1.0


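*Explanation:* The prediction and reference are identical, so every unigram is mapped in a single chunk and the score rounds to 1.0. Under the penalty formula above, the unrounded value is slightly below 1, because one chunk of nine mapped unigrams still incurs a penalty of $0.5 \cdot (1/9)^3 \approx 0.0007$.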


### **Synonym Match**

In [71]:
from evaluate import load

# Load the METEOR and BLEU metrics
meteor = load('meteor')
bleu = load('bleu')

# Example prediction and a single reference that differ in a few synonyms
predictions = ["The quick brown fox jumps over the lazy dog"]
references = ['The swift brown fox leaps over the lethargic dog']

# Compute the METEOR and BLEU scores
results_meteor = meteor.compute(predictions=predictions, references=references)
results_bleu = bleu.compute(predictions=predictions, references=references)



In [72]:
results_meteor, results_bleu

({'meteor': 0.7471655328798186},
 {'bleu': 0.0,
  'precisions': [0.6666666666666666, 0.25, 0.0, 0.0],
  'brevity_penalty': 1.0,
  'length_ratio': 1.0,
  'translation_length': 9,
  'reference_length': 9})

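*Explanation:* The two sentences are not identical, but six of the nine words match exactly (the BLEU unigram precision of 0.67), and METEOR can additionally credit stem and synonym matches. The reported score is consistent with seven mapped unigrams in three chunks: $\frac{7}{9}\left(1 - 0.5\left(\frac{3}{7}\right)^3\right) \approx 0.747$. BLEU, by contrast, is 0.0 because no 3-gram or 4-gram of the prediction appears in the reference.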


### **Multiple References per Prediction (Partial Match)**

In [66]:
from evaluate import load

# Load the METEOR metric
meteor = load('meteor')

# Example prediction and multiple references with partial matches
predictions = ["The quick brown fox jumps over the lazy dog"]
references = [['A fast fox leaps over a dog',
               'An agile fox hurdles a sleeping dog',
               'A rapid fox skips over a dog resting']]

# Compute METEOR score
results = meteor.compute(predictions=predictions, references=references)

# Print the rounded METEOR score
print(round(results['meteor'], 2))

0.62



*Explanation:* Here, the prediction shares only some words and meanings with each reference, so the alignment is weaker than in the exact and synonym match scenarios. The score of 0.62 is lower than in those cases but still reflects the partial match.

In each case, the `evaluate` library computes the METEOR score from an alignment between the prediction and the references that considers exact matches, stem matches, and synonymy. When several references are supplied for one prediction, the best-scoring reference determines the result. Because stems and synonyms count as matches, the score reflects some similarity in meaning beyond surface word overlap.

# BERTScore
### Overview
BERTScore is a metric for evaluating generated text in tasks such as summarization and translation. It compares a candidate with a reference using contextual embeddings from pre-trained language models such as BERT, RoBERTa, XLNet, and XLM, which capture how each word relates to its surrounding context.

### Embeddings
BERTScore uses the contextual embeddings from these models, which reflect the surrounding context of each token in the text, providing a rich semantic representation.

### Recall
Recall in BERTScore is measured by taking each token $x_i$ in the reference text $X$ and finding its most similar token $y_j$ in the candidate text $Y$, using the cosine similarity of their embeddings.

$$ R_{BERT} = \frac{1}{|X|} \sum_{x_i \in X} \max_{y_j \in Y} \cos(x_i, y_j) $$

### Precision
Precision is calculated the same way with the roles reversed: each token $y_i$ in the candidate text $Y$ is matched to its most similar token $x_j$ in the reference text $X$.

$$ P_{BERT} = \frac{1}{|Y|} \sum_{y_i \in Y} \max_{x_j \in X} \cos(y_i, x_j) $$

### F1 Score
The F1 score is the harmonic mean of precision and recall, providing a single score that balances both aspects.

$$ F_{BERT} = \frac{2 \cdot P_{BERT} \cdot R_{BERT}}{P_{BERT} + R_{BERT}} $$
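
The three formulas above can be sketched in a few lines of plain Python. This is an illustration of the greedy matching only: the toy 2-D vectors are hypothetical stand-ins for contextual embeddings, and the function names are assumptions rather than part of any library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bertscore_from_embeddings(reference_vecs, candidate_vecs):
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    # Recall: each reference token takes its best match in the candidate
    recall = sum(
        max(cosine(x, y) for y in candidate_vecs) for x in reference_vecs
    ) / len(reference_vecs)
    # Precision: each candidate token takes its best match in the reference
    precision = sum(
        max(cosine(y, x) for x in reference_vecs) for y in candidate_vecs
    ) / len(candidate_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical embeddings: two reference tokens, one candidate token
reference_vecs = [[1.0, 0.0], [0.0, 1.0]]
candidate_vecs = [[1.0, 0.0]]
precision, recall, f1 = bertscore_from_embeddings(reference_vecs, candidate_vecs)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

The single candidate token matches one reference token perfectly, so precision is 1.0, but the second reference token has no counterpart, which pulls recall down to 0.5.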

### Weighting
Optionally, an importance weighting based on Inverse Document Frequency (IDF) can be applied to each token, emphasizing the significance of rarer words.
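
As a rough sketch of how IDF weighting changes recall, the code below computes IDF weights from a small set of tokenized reference sentences and uses them to weight each reference token's best similarity. The plus-one smoothed form of IDF, the function names, and the toy corpus and similarity values are all assumptions made for illustration.

```python
import math

def idf_weights(reference_corpus):
    """Plus-one smoothed IDF per token (assumed form) over tokenized sentences."""
    num_sentences = len(reference_corpus)
    sentence_freq = {}
    for tokens in reference_corpus:
        for token in set(tokens):
            sentence_freq[token] = sentence_freq.get(token, 0) + 1
    return {
        token: math.log((num_sentences + 1) / (freq + 1))
        for token, freq in sentence_freq.items()
    }

def idf_weighted_recall(reference_tokens, best_similarities, idf):
    """Average the best-match similarities, weighting each token by its IDF."""
    weights = [idf[token] for token in reference_tokens]
    total = sum(weights)
    if total == 0:
        # Fall back to the unweighted mean when every weight is zero
        return sum(best_similarities) / len(best_similarities)
    return sum(w * s for w, s in zip(weights, best_similarities)) / total

# Hypothetical corpus and similarities
corpus = [["the", "cat"], ["the", "dog"], ["the", "fox"]]
idf = idf_weights(corpus)
weighted = idf_weighted_recall(["the", "cat"], [0.2, 0.9], idf)
unweighted = (0.2 + 0.9) / 2
print(round(weighted, 3), round(unweighted, 3))
```

Because "the" occurs in every sentence it receives zero weight, so the weighted recall is driven entirely by the rarer word "cat".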

### Rescaling
To enhance interpretability, BERTScore values are linearly rescaled based on baselines derived from a monolingual dataset, ensuring the scores fall within an intuitive range.

The rescaling equation for BERTScore is:

$$ \hat{R}_{BERT} = \frac{R_{BERT} - b}{1 - b} $$

The baseline $ b $ in the BERTScore rescaling equation is a precomputed constant number, specific to the language model and the dataset it was derived from. It is calculated in advance using a representative set of reference-candidate pairs from a relevant monolingual dataset. This baseline $ b $ does not change with each new input; it is used consistently to rescale BERTScore values across different inputs to maintain comparability and interpretability.
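
The rescaling is a simple linear map, as the sketch below shows. The baseline value of 0.85 is hypothetical and chosen only for illustration; real baselines depend on the model and language.

```python
def rescale(score, baseline):
    """Linearly rescale a raw BERTScore value using a precomputed baseline."""
    return (score - baseline) / (1 - baseline)

baseline = 0.85  # hypothetical value, not taken from any real model
for raw_score in (0.85, 0.93, 1.0):
    print(raw_score, round(rescale(raw_score, baseline), 3))
```

A raw score equal to the baseline maps to 0 and a perfect score stays at 1, so differences that are compressed into a narrow band of raw values become easier to read.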

### Advantages
BERTScore's emphasis on semantic content makes it robust to paraphrasing and offers a more meaningful assessment than overlap-based metrics.

### Disadvantages
However, its computational intensity and dependency on the quality of pre-trained embeddings can be limiting factors.

### Conclusion
BERTScore stands out for its semantic evaluation capabilities, offering significant benefits for many NLP tasks. Its potential for broader language coverage and adaptation to various text types makes it a promising tool for future advancements in language processing.

### **Exact Match**

In [70]:
from evaluate import load

# Load the BERTScore metric
bertscore = load('bertscore')

# Example prediction and reference with an exact match
predictions = ["The quick brown fox jumps over the lazy dog"]
reference = ["The quick brown fox jumps over the lazy dog"]

# Compute BERTScore using DistilBERT embeddings
results = bertscore.compute(predictions=predictions, references=reference, model_type="distilbert-base-uncased")

# Display precision, recall, and F1
results

{'precision': [1.0000001192092896],
 'recall': [1.0000001192092896],
 'f1': [1.0000001192092896],
 'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.35.0)'}

### **Synonym Match**

In [73]:
from evaluate import load

# Load the METEOR, BLEU, and BERTScore metrics
meteor = load('meteor')
bleu = load('bleu')
bert = load('bertscore')

# Example prediction and a single reference that differ in a few synonyms
predictions = ["The quick brown fox jumps over the lazy dog"]
references = ['The swift brown fox leaps over the lethargic dog']

# Compute all three scores on the same pair
results_meteor = meteor.compute(predictions=predictions, references=references)
results_bleu = bleu.compute(predictions=predictions, references=references)
results_bert = bert.compute(predictions=predictions, references=references, model_type="distilbert-base-uncased")



In [74]:
results_meteor, results_bleu, results_bert

({'meteor': 0.7471655328798186},
 {'bleu': 0.0,
  'precisions': [0.6666666666666666, 0.25, 0.0, 0.0],
  'brevity_penalty': 1.0,
  'length_ratio': 1.0,
  'translation_length': 9,
  'reference_length': 9},
 {'precision': [0.9558224678039551],
  'recall': [0.9063857793807983],
  'f1': [0.9304479360580444],
  'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.35.0)'})