# BLEU - Evaluation Technique (Bilingual Evaluation Understudy)
Scope: BLEU is designed to be computed at the corpus level (not sentence by sentence).

From Hugging Face

BLEU is a precision-focused metric: it measures how much of the generated text also appears in the reference text.

In [None]:
!pip install -q evaluate


In [None]:
import evaluate
bleu = evaluate.load("bleu")

# Example predictions and references - Single sentence
predictions = ["the cat is on the mat"]
references = [["there is a cat on the mat"]]

# Example predictions & reference - Multiple sentence
# predictions = [
#     "the cat is on the mat",
#     "there is a dog in the house"
# ]

# references = [
#     ["there is a cat on the mat"],  # List of references for sentence 1
#     ["a dog is inside the house"]   # List of references for sentence 2
# ]

# Compute BLEU score
# Optional: pass smooth=True to apply add-one smoothing, which avoids a zero score when a higher-order n-gram has no matches.
results = bleu.compute(predictions=predictions, references=references)
print(results)


{'bleu': 0.3851442247849805, 'precisions': [0.8571428571428571, 0.5, 0.4, 0.25], 'brevity_penalty': 0.846481724890614, 'length_ratio': 0.8571428571428571, 'translation_length': 6, 'reference_length': 7}


# BLEU Result - Explanation

In [None]:
{'bleu': 0.0,
 'precisions': [0.8333333333333334, 0.4, 0.25, 0.0],
 'brevity_penalty': 0.846481724890614,
 'length_ratio': 0.8571428571428571,
 'translation_length': 6,
 'reference_length': 7}

**1. 'bleu': 0.0**

This is the final BLEU score, a weighted geometric mean of n-gram precisions (up to 4-grams) adjusted by a brevity penalty.

The score here is 0.0 because the 4-gram precision is 0.0. BLEU is a geometric mean, so a zero precision at any n-gram order forces the overall score to 0, even when the 1-gram to 3-gram matches are good.

With `smooth=True`, each precision becomes (matches + 1) / (total + 1), so none of them is exactly zero. The nonzero score of 0.385 printed earlier has precisions of exactly this form (6/7, 3/6, 2/5, 1/4).
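As a rough sketch of how the final score is assembled, the snippet below combines four precisions and a brevity penalty using uniform weights. The helper name `bleu_from_parts` is hypothetical (it is not part of `evaluate`), and the inputs are the values reported in this section.

```python
import math

def bleu_from_parts(precisions, brevity_penalty):
    """Geometric mean of the n-gram precisions, scaled by the brevity penalty."""
    # A zero precision at any order makes the geometric mean zero.
    if min(precisions) <= 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty * math.exp(log_mean)

bp = 0.846481724890614  # brevity penalty reported for this example

unsmoothed = [5 / 6, 2 / 5, 1 / 4, 0 / 3]  # matches / total per n-gram order
smoothed = [6 / 7, 3 / 6, 2 / 5, 1 / 4]    # (matches + 1) / (total + 1)

print(bleu_from_parts(unsmoothed, bp))          # 0.0
print(round(bleu_from_parts(smoothed, bp), 4))  # 0.3851
```

The two printed values correspond to the unsmoothed result explained here and the smoothed result shown in the earlier output.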

Note: BLEU does not average per-sentence scores. It goes through the sentence pairs in order, accumulates the clipped n-gram match counts and the lengths over the whole corpus, and then computes the precisions and the brevity penalty once from those corpus-level totals.



**2. 'precisions': [0.83, 0.4, 0.25, 0.0]**

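These are the modified (clipped) n-gram precisions: the number of n-grams in the prediction that also occur in the reference, divided by the total number of n-grams in the prediction.

1-gram precision (unigram) = 5/6 = 83.3% of single words matched.

2-gram precision (bigram) = 2/5 = 40% of 2-word sequences matched.

3-gram precision = 1/4 = 25% of 3-word sequences matched.

4-gram precision = 0/3 = 0% of 4-word sequences matched → this is why overall BLEU is 0.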
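The precisions above can be reproduced by hand. The following is a minimal sketch assuming whitespace tokenization and a single reference (with several references, the clip would use the maximum count over all of them); `ngrams` and `clipped_precision` are illustrative helpers, not library functions.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(pred_tokens, ref_tokens, n):
    """Return (matches, total) for n-grams of order n."""
    pred_counts = Counter(ngrams(pred_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    # Counter & Counter keeps the minimum count per n-gram, which is the clipping step.
    matches = sum((pred_counts & ref_counts).values())
    total = sum(pred_counts.values())
    return matches, total

pred = "the cat is on the mat".split()
ref = "there is a cat on the mat".split()

for n in range(1, 5):
    matches, total = clipped_precision(pred, ref, n)
    print(f"{n}-gram: {matches}/{total} = {matches / total:.3f}")
# 1-gram: 5/6 = 0.833
# 2-gram: 2/5 = 0.400
# 3-gram: 1/4 = 0.250
# 4-gram: 0/3 = 0.000
```

Clipping matters for the unigram count: "the" appears twice in the prediction but only once in the reference, so it is credited once.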



**3. 'brevity_penalty': 0.846...**

BLEU penalizes short translations.

A brevity penalty < 1 means the prediction is shorter than the reference.

Here: prediction = 6 tokens, reference = 7 tokens → penalty applied.
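The penalty value can be checked with a short calculation. This sketch assumes the standard BLEU definition, where the penalty is 1 when the prediction is longer than the reference and exp(1 - reference_length / translation_length) otherwise; `brevity_penalty` is an illustrative helper and assumes a non-empty prediction.

```python
import math

def brevity_penalty(translation_length, reference_length):
    """Standard BLEU brevity penalty (assumes translation_length > 0)."""
    if translation_length > reference_length:
        return 1.0  # no penalty for predictions longer than the reference
    return math.exp(1 - reference_length / translation_length)

print(round(brevity_penalty(6, 7), 4))  # 0.8465
print(brevity_penalty(8, 7))            # 1.0
```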


**4. 'length_ratio': 0.857...**

This is translation_length / reference_length = 6 / 7 = ~0.857.

Less than 1 → prediction is shorter than reference.


**5. 'translation_length': 6 and 'reference_length': 7**

Number of tokens (usually words) in the predicted and reference sentences.

# BLEU Score - Using Simple Language Translation Model

If you're generating text using a model like T5, GPT, BART, etc.:


In [None]:
from transformers import pipeline

generator = pipeline("text2text-generation", model="t5-small")

inputs = ["Translate English to French: The house is small."]
references = [["La maison est petite."]]

# Generate
predictions = [generator(x)[0]["generated_text"] for x in inputs]

# BLEU
results = bleu.compute(predictions=predictions, references=references, smooth=True)
print(results)
print("predictions:",predictions)



{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 5, 'reference_length': 5}
predictions: ['La maison est petite.']


# ROUGE Metric - Recall-Oriented Understudy for Gisting Evaluation


It is a set of metrics designed to compare machine-generated text to human-written reference text, based on n-gram overlap, word sequences, and word pairs.

**ROUGE is mostly a recall-oriented metric, meaning:**

It checks how much of the reference is captured by the generated output.

Each ROUGE metric produces:

1. Precision = how much of your output is relevant

2. Recall = how much of the reference is covered

3. F1 = harmonic mean of precision and recall
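To make the three quantities concrete, here is a minimal sketch of n-gram precision, recall, and F1. It assumes lowercasing and whitespace tokenization (the real `rouge_score` tokenizer differs in details), and `rouge_n` is an illustrative helper, not the library function. The example sentences are chosen so that precision and recall differ.

```python
from collections import Counter

def rouge_n(prediction, reference, n=1):
    """Return (precision, recall, f1) based on clipped n-gram overlap."""
    def ngram_counts(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred_counts = ngram_counts(prediction)
    ref_counts = ngram_counts(reference)
    overlap = sum((pred_counts & ref_counts).values())

    precision = overlap / max(sum(pred_counts.values()), 1)  # share of the output that is relevant
    recall = overlap / max(sum(ref_counts.values()), 1)      # share of the reference that is covered
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = rouge_n("the cat sat", "the cat sat on the mat", n=1)
print(p, r, round(f1, 4))  # 1.0 0.5 0.6667
```

Every word of the short prediction appears in the reference (precision 1.0), but it covers only half of the reference (recall 0.5).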


---


| ROUGE Type     | Measures                         | Description                                                                                         |
| -------------- | -------------------------------- | --------------------------------------------------------------------------------------------------- |
| **ROUGE-N**    | N-gram overlap                   | General form: overlap of n-grams between the prediction and the reference                           |
| **ROUGE-1**    | Unigrams (words)                 | ROUGE-N with n = 1: word-level overlap (comparable to BLEU's unigram precision, but also has recall) |
| **ROUGE-2**    | Bigrams                          | ROUGE-N with n = 2: overlap of 2-word sequences                                                     |
| **ROUGE-L**    | Longest common subsequence (LCS) | Longest sequence of words that appear in the same order in both texts (not necessarily adjacent)    |
| **ROUGE-Lsum** | Summary-level LCS                | LCS computed sentence by sentence (text split on newlines) and combined, for multi-sentence summaries |


In [None]:
!pip install rouge_score



In [None]:
import evaluate

rouge = evaluate.load("rouge")

predictions = ["the cat is on the mat"]
references = ["the cat sat on the mat"]

results = rouge.compute(predictions=predictions, references=references)
print(results)


{'rouge1': np.float64(0.8333333333333334), 'rouge2': np.float64(0.6), 'rougeL': np.float64(0.8333333333333334), 'rougeLsum': np.float64(0.8333333333333334)}


# ROUGE - Metric Explanation

In [None]:
{'rouge1': np.float64(0.8333333333333334),    # unigram F1
 'rouge2': np.float64(0.6),                   # bigram F1
 'rougeL': np.float64(0.8333333333333334),    # LCS-based F1
 'rougeLsum': np.float64(0.8333333333333334)  # same as rougeL here, since the text is a single line with no newline-separated sentences
 }


Let’s say:

Reference summary: "the cat sat on the mat"

Generated summary: "the cat is on the mat"

ROUGE-1 looks at word overlaps:

Matched unigrams: the, cat, on, the, mat → 5 of 6, so precision = recall = F1 = 5/6 ≈ 0.833

ROUGE-2 looks at bigram overlaps:

Reference: "the cat", "cat sat", "sat on", "on the", "the mat"

Prediction: "the cat", "cat is", "is on", "on the", "the mat"

Match: "the cat", "on the", "the mat" → 3 of 5, so precision = recall = F1 = 3/5 = 0.6

ROUGE-L looks at the longest common subsequence (LCS):

Match (the longest common subsequence): the → cat → on → the → mat → length 5 of 6, so F1 = 5/6 ≈ 0.833
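The LCS step is the least obvious of the three, so here is a minimal sketch of it using the classic dynamic-programming table, assuming whitespace tokenization. `lcs_length` is an illustrative helper, not the `rouge_score` implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

ref = "the cat sat on the mat".split()
pred = "the cat is on the mat".split()

lcs = lcs_length(pred, ref)
precision = lcs / len(pred)
recall = lcs / len(ref)
f1 = 2 * precision * recall / (precision + recall)
print(lcs, round(f1, 4))  # 5 0.8333
```

The subsequence skips "is" in the prediction and "sat" in the reference, which is why the matched words do not need to be adjacent.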

# BLEU, ROUGE, Perplexity, and BERTScore with Hugging Face Transformers - Summarization Task

In [1]:
!pip install -q transformers datasets rouge-score bert_score nltk


In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

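model_name = "facebook/bart-base"  # alternatives: t5-base, google/pegasus-xsum
# Note: facebook/bart-base is a pretrained checkpoint that has not been fine-tuned for summarization, so it tends to reproduce its input.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cuda")
model.eval()

# Custom input example (summarization task)
# The "summarize: " prefix is a T5 convention; BART treats it as ordinary input text.
input_text = "summarize: The Eiffel Tower is located in Paris and was completed in 1889. It is a major tourist attraction."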

# Tokenize & Generate
inputs = tokenizer(input_text, return_tensors="pt", truncation=True).input_ids.to("cuda")
with torch.no_grad():
    outputs = model.generate(inputs, max_length=50)

# Decode and Print Result
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Input:", input_text)
print("Output:", summary)



Input: summarize: The Eiffel Tower is located in Paris and was completed in 1889. It is a major tourist attraction.
Output: summarize: The Eiffel Tower is located in Paris and was completed in 1889. It is a major tourist attraction.


In [3]:
# Import Libraries
import torch
import math
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from rouge_score import rouge_scorer
import bert_score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Local Summarization Dataset
dataset = [
    {
        "article": "The Eiffel Tower is located in Paris and was completed in 1889. It is a major tourist attraction.",
        "highlights": "The Eiffel Tower in Paris was completed in 1889."
    },
    {
        "article": "Python is a programming language known for its simplicity and readability. It is used widely in AI.",
        "highlights": "Python is a simple, readable language popular in AI."
    },
    {
        "article": "The Amazon rainforest is the largest tropical rainforest in the world and is home to diverse wildlife.",
        "highlights": "Amazon rainforest is the world's largest and rich in biodiversity."
    }
]

# Load Model
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cuda")
model.eval()

# Generate Summaries & Collect Results
predictions = []
references = []
losses = []

for sample in dataset:
    input_text = "summarize: " + sample["article"]
    input_ids = tokenizer(input_text, return_tensors="pt", truncation=True).input_ids.to("cuda")

    with torch.no_grad():
        output_ids = model.generate(input_ids, max_length=100)
        pred = tokenizer.decode(output_ids[0], skip_special_tokens=True)

        target_ids = tokenizer(sample["highlights"], return_tensors="pt", truncation=True).input_ids.to("cuda")
        loss = model(input_ids=input_ids, labels=target_ids).loss.item()

    predictions.append(pred)
    references.append(sample["highlights"])
    losses.append(loss)

# BLEU Score

def real_bleu(pred, ref):
    pred_tokens = pred.lower().split()
    ref_tokens = [ref.lower().split()]  # note: must be a list of references
    smoothie = SmoothingFunction().method4  # avoids BLEU=0 for short outputs
    return sentence_bleu(ref_tokens, pred_tokens, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smoothie)

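# Note: this is the mean of smoothed sentence-level BLEU scores, which is not the same as corpus-level BLEU.
bleu_scores = [real_bleu(pred, ref) for pred, ref in zip(predictions, references)]
avg_bleu = sum(bleu_scores) / len(bleu_scores)

print(f"BLEU Score: {avg_bleu:.4f}")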

# ROUGE-1, ROUGE-2, ROUGE-L (F1 per sample, averaged below)
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge1, rouge2, rougeL = [], [], []

for pred, ref in zip(predictions, references):
    scores = scorer.score(ref, pred)
    rouge1.append(scores["rouge1"].fmeasure)
    rouge2.append(scores["rouge2"].fmeasure)
    rougeL.append(scores["rougeL"].fmeasure)


print(f"ROUGE-1 F1 Score: {sum(rouge1)/len(rouge1):.4f}")
print(f"ROUGE-2 F1 Score: {sum(rouge2)/len(rouge2):.4f}")
print(f"ROUGE-L F1 Score: {sum(rougeL)/len(rougeL):.4f}")

# Perplexity
# The exponential of the average negative log-likelihood of the reference tokens.
# Lower perplexity = better model (less uncertainty).
#   Perplexity = 1   -> the model assigns probability 1 to every correct token.
#   Perplexity = 10  -> the model is as unsure as choosing uniformly among 10 equally likely tokens.
#   Perplexity = 100 -> the model is as unsure as choosing uniformly among 100 equally likely tokens.
# Note: averaging the per-sample losses weights each sample equally rather than each token.
avg_loss = sum(losses) / len(losses)
perplexity = math.exp(avg_loss)
print(f"Perplexity: {perplexity:.2f}")

# BERTScore
P, R, F1 = bert_score.score(predictions, references, lang="en", verbose=False)
print(f"BERTScore - Precision: {P.mean().item():.4f}")
print(f"BERTScore - Recall: {R.mean().item():.4f}")
print(f"BERTScore - F1: {F1.mean().item():.4f}")


BLEU Score: 0.1924
ROUGE-1 F1 Score: 0.6056
ROUGE-2 F1 Score: 0.3723
ROUGE-L F1 Score: 0.5448
Perplexity: 5.66



BERTScore - Precision: 0.9294
BERTScore - Recall: 0.9462
BERTScore - F1: 0.9376
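The perplexity definition used in the cell above (the exponential of the average negative log-likelihood) can be illustrated without a model. This is a minimal sketch in which the token probabilities are made-up inputs; `perplexity` is an illustrative helper, not a library function.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities a model might assign to each correct token.
print(round(perplexity([1.0, 1.0, 1.0]), 2))  # 1.0   (fully confident)
print(round(perplexity([0.1] * 5), 2))        # 10.0  (like a uniform choice among 10 tokens)
print(round(perplexity([0.5, 0.25]), 2))      # 2.83
```

In the notebook cell, the same quantity comes from the model's cross-entropy loss, since that loss is already the mean negative log-likelihood over the reference tokens.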


