Evaluation metrics such as BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) assess the quality of generated text by comparing it with reference texts. BLEU is commonly used for machine translation, while ROUGE is popular in text summarization. Both quantify how much of a generated text overlaps with a human-written reference, which gives a quick, automatic signal of how well a text generation model is performing.

### Install Required Packages

In [1]:
! pip install evaluate
! pip install rouge_score


### Import Required Libraries
-  Purpose: Importing the evaluate library, a package designed to provide standardized implementations of popular NLP evaluation metrics like BLEU and ROUGE.
-  Why evaluate: This library consolidates several metrics in one place, making it simpler to calculate various metrics in a consistent format.

In [2]:
# Import necessary modules
import evaluate

Loading Metrics: Here, we load the BLEU and ROUGE metrics from the evaluate library and store them in the variables bleu and rouge so that we can use them to calculate scores in later steps. The first call downloads each metric's script.
Benefits: Each loaded metric is an object with its own compute() method and parameters, so the same calling pattern works for different metrics and different text data.

In [3]:
# Load metrics
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

Setting Up Test Cases:

-  predictions holds the generated text being evaluated, one string per example.
-  references holds the "ground truth" texts the model output is compared against. It is a list of lists: each prediction gets its own inner list, which may contain more than one acceptable reference.

Why Use Lists: Both predictions and references are lists, so the metrics can score multiple sentences or text blocks at once.

In [4]:
# Example predictions and references
predictions = ["The future of AI in healthcare looks promising, enhancing diagnostics and treatment."]
references = [["AI in healthcare is advancing with potential in diagnostics and treatments."]]
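The same layout extends to a batch: one prediction string per example and one inner list of references per prediction. The sketch below uses made-up sentences and a hypothetical helper, normalize_references, which is not part of evaluate; it only illustrates the shape the two lists are expected to have.

```python
def normalize_references(predictions, references):
    """Return references as a list of lists, one inner list per prediction.

    Hypothetical helper for illustration only; not part of the evaluate library.
    """
    if len(predictions) != len(references):
        raise ValueError(
            f"{len(predictions)} predictions but {len(references)} reference entries"
        )
    normalized = []
    for index, refs in enumerate(references):
        # Wrap a bare string so every entry becomes a list of references.
        refs = [refs] if isinstance(refs, str) else list(refs)
        if not refs:
            raise ValueError(f"prediction {index} has no reference")
        normalized.append(refs)
    return normalized


# Illustrative data: two predictions, the first with two acceptable references.
batch_predictions = [
    "The cat sat on the mat.",
    "AI improves diagnostics.",
]
batch_references = [
    ["A cat sat on the mat.", "The cat is sitting on the mat."],
    "AI is improving diagnostics.",
]

aligned = normalize_references(batch_predictions, batch_references)
print([len(refs) for refs in aligned])
```

Checking the shapes up front tends to give a clearer error than a length mismatch discovered inside a metric call.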

BLEU Score Calculation:

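- The bleu.compute() function calculates the BLEU score by comparing the predictions with the references.
- BLEU measures n-gram overlap between generated and reference texts. It combines the modified precisions for 1-grams through 4-grams with a brevity penalty that lowers the score when the output is shorter than the reference.

Output: The BLEU score ranges from 0 to 1, with a higher score indicating closer agreement with the reference. Because the precisions are combined as a geometric mean, a single n-gram order with no matches drives the whole score to 0, which is common when scoring one short sentence.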

In [5]:
# Calculate BLEU score
bleu_score = bleu.compute(predictions=predictions, references=references)
print("BLEU Score:", bleu_score)

BLEU Score: {'bleu': 0.0, 'precisions': [0.42857142857142855, 0.23076923076923078, 0.08333333333333333, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 1.1666666666666667, 'translation_length': 14, 'reference_length': 12}
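To see where the printed precisions and the 0.0 score come from, here is a minimal from-scratch sketch of BLEU for a single prediction and a single reference. It is not the library's implementation: simple_tokenize is an assumed, simplified stand-in for the metric's tokenizer, and bleu_sketch is a hypothetical name.

```python
import math
import re
from collections import Counter


def simple_tokenize(text):
    # Assumption: a simplified tokenizer; words and punctuation marks
    # become separate, case-sensitive tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_sketch(prediction, reference, max_order=4):
    """Single-reference BLEU sketch: returns (score, precisions, brevity_penalty)."""
    pred = simple_tokenize(prediction)
    ref = simple_tokenize(reference)
    if not pred:
        return 0.0, [0.0] * max_order, 0.0

    precisions = []
    for n in range(1, max_order + 1):
        pred_counts = ngram_counts(pred, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: an n-gram is credited at most as many times
        # as it occurs in the reference.
        matches = sum((pred_counts & ref_counts).values())
        total = sum(pred_counts.values())
        precisions.append(matches / total if total else 0.0)

    # The brevity penalty only applies when the prediction is not longer
    # than the reference.
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))

    # Geometric mean of the precisions; any zero precision gives a zero score.
    if min(precisions) > 0:
        score = bp * math.exp(sum(math.log(p) for p in precisions) / max_order)
    else:
        score = 0.0
    return score, precisions, bp


prediction = "The future of AI in healthcare looks promising, enhancing diagnostics and treatment."
reference = "AI in healthcare is advancing with potential in diagnostics and treatments."

score, precisions, bp = bleu_sketch(prediction, reference)
print("BLEU-4:", score, precisions, bp)

# Restricting the sketch to 1- to 3-grams avoids the empty 4-gram overlap.
score3, _, _ = bleu_sketch(prediction, reference, max_order=3)
print("BLEU-3:", score3)
```

Under these assumptions the precisions are 6/14, 3/13, 1/12, and 0/11, matching the output above, and the missing 4-gram overlap is what makes the overall score 0. Restricted to three orders, the same sentence pair scores about 0.20.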


ROUGE Score Calculation:

-  rouge.compute() calculates several ROUGE variants: ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum.
-  ROUGE measures how many words, bigrams, or in-order word sequences of the reference also appear in the prediction, which makes it well suited to evaluating summaries.

Output: Each variant reflects a different kind of overlap: single words for ROUGE-1, word pairs for ROUGE-2, and the longest common subsequence for ROUGE-L. ROUGE-Lsum applies the longest-common-subsequence idea sentence by sentence on newline-separated text, so it equals ROUGE-L for a single sentence. The values returned here are F-measures that balance recall against precision, not recall alone.


In [6]:
# Calculate ROUGE score
rouge_score = rouge.compute(predictions=predictions, references=references)
print("ROUGE Scores:", rouge_score)


ROUGE Scores: {'rouge1': 0.43478260869565216, 'rouge2': 0.28571428571428564, 'rougeL': 0.43478260869565216, 'rougeLsum': 0.43478260869565216}
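The numbers above can be reproduced by hand. The following is a minimal sketch, not the library's implementation: rouge_tokenize assumes lowercasing, alphanumeric-only tokens, and no stemming, and the helper names are hypothetical. Each score is the F-measure of precision (matches over prediction length) and recall (matches over reference length).

```python
import re
from collections import Counter


def rouge_tokenize(text):
    # Assumption: lowercase, keep alphanumeric runs only, no stemming.
    return re.findall(r"[a-z0-9]+", text.lower())


def f_measure(matches, pred_total, ref_total):
    if matches == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = matches / pred_total
    recall = matches / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(pred, ref, n):
    pred_counts = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Count each shared n-gram at most as often as it appears on both sides.
    matches = sum((pred_counts & ref_counts).values())
    return f_measure(matches, sum(pred_counts.values()), sum(ref_counts.values()))


def lcs_length(a, b):
    # Dynamic programming over two rows for the longest common subsequence.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(pred, ref):
    return f_measure(lcs_length(pred, ref), len(pred), len(ref))


pred_tokens = rouge_tokenize(
    "The future of AI in healthcare looks promising, enhancing diagnostics and treatment."
)
ref_tokens = rouge_tokenize(
    "AI in healthcare is advancing with potential in diagnostics and treatments."
)

rouge1 = rouge_n(pred_tokens, ref_tokens, 1)
rouge2 = rouge_n(pred_tokens, ref_tokens, 2)
rougeL = rouge_l(pred_tokens, ref_tokens)
print(rouge1, rouge2, rougeL)
```

With 12 prediction tokens, 11 reference tokens, and 5 shared unigrams, ROUGE-1 works out to 10/23, about 0.435. The 3 shared bigrams give ROUGE-2 = 6/21, about 0.286. Note that "treatment" and "treatments" do not match because this sketch applies no stemming.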
