# A World-Class Tutorial on NLG Evaluation Metrics: BLEU, ROUGE, METEOR, BERTScore

## Introduction

Welcome, aspiring scientist! This Jupyter Notebook is your comprehensive guide to BLEU, ROUGE, METEOR, and BERTScore, four key metrics for evaluating machine-generated text in natural language processing (NLP). Designed for beginners yet rigorous enough for researchers, this tutorial assumes basic familiarity with NLP (e.g., n-grams, precision, recall) and builds from overlap-based metrics such as BLEU and ROUGE to embedding-based ones such as BERTScore. As you aim to advance your scientific career, this notebook provides:

- **Theory & Tutorials**: From fundamentals to advanced concepts.
- **Practical Code Guides**: Step-by-step Python code with explanations.
- **Visualizations**: Plots and diagrams for intuitive understanding.
- **Applications**: Real-world use cases.
- **Research Directions**: Insights to inspire your innovations.
- **Projects**: Mini and major projects using real datasets.
- **Exercises**: Hands-on tasks with solutions.
- **Future Directions**: Paths for further study.
- **What’s Missing**: Gaps in standard tutorials, filled here.

**Analogy**: Think of these metrics as tools in a lab. BLEU measures exact matches like a precise scale, ROUGE checks for key content like a checklist, METEOR allows flexibility like a chemist substituting ingredients, and BERTScore evaluates semantic ‘vibes’ like an AI art critic.

**Why It Matters**: As a researcher, you’ll use these to benchmark models, publish papers, and innovate. No metric is perfect—they approximate human judgment, missing nuances like creativity. Let’s dive in like Turing cracking codes!

**Note**: Install required libraries: `pip install sacrebleu rouge-score nltk bert-score transformers torch datasets matplotlib seaborn numpy pandas`.

**Important**: For METEOR, you must install NLTK and download WordNet: 
```python
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
```


## 1. Theory & Tutorials

### 1.1 Key Concepts
- **N-grams**: Sequences of n words (e.g., unigram: “cat”; bigram: “the cat”).
- **Precision**: Fraction of hypothesis words/n-grams that are correct.
- **Recall**: Fraction of reference words/n-grams captured by hypothesis.
- **Reference vs. Hypothesis**: Reference = human text; Hypothesis = machine output.
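
The concepts above can be sketched in a few lines of standard-library Python. This is a minimal illustration, assuming whitespace tokenization and lowercasing; the helper names `ngrams` and `overlap_precision_recall` are hypothetical and not part of any library used later.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_precision_recall(reference, hypothesis, n=1):
    """Count-based n-gram precision and recall (whitespace tokens, lowercased)."""
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    hyp_counts = Counter(ngrams(hypothesis.lower().split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram.
    matches = sum((ref_counts & hyp_counts).values())
    precision = matches / max(sum(hyp_counts.values()), 1)
    recall = matches / max(sum(ref_counts.values()), 1)
    return precision, recall

p, r = overlap_precision_recall("the cat sits on the mat", "a feline sat on the rug")
print(f"unigram precision={p:.3f}, recall={r:.3f}")  # both 2/6: only "on" and one "the" match
```

Precision divides by the hypothesis n-gram count and recall by the reference count, so the two differ whenever the texts have different lengths.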

**Research Mindset**: Like Einstein questioning gravity, ask: Do these metrics capture meaning? Your experiments could redefine evaluation.

### 1.2 BLEU (Bilingual Evaluation Understudy)
**Introduced**: 2002, for machine translation.
**Logic**: Counts n-gram overlaps (n=1–4), emphasizing precision, with a brevity penalty (BP) for short outputs.
**Formula**: BLEU = BP * exp(∑ (w_n * log(p_n))), where:
- p_n = (clipped matching n-grams) / (hypothesis n-grams). Clipped = min(hyp count, ref count).
- w_n = 1/4 (uniform weight).
- BP = min(1, exp(1 - r/c)), r = ref length, c = hyp length.
**Pros**: Fast, language-agnostic. **Cons**: Ignores synonyms, context.
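
To make clipping and the brevity penalty concrete, here is a small sketch under simple assumptions (whitespace tokens, a single reference). The function names are illustrative only; the degenerate hypothesis "the the the the" is a toy input chosen to show why clipping matters.

```python
import math
from collections import Counter

def clipped_precision(ref_tokens, hyp_tokens, n):
    """p_n: each hypothesis n-gram count is clipped to its count in the reference."""
    ref_counts = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    hyp_counts = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total

def brevity_penalty(ref_len, hyp_len):
    """BP = min(1, exp(1 - r/c)); hypotheses at least as long as the reference get 1."""
    if hyp_len == 0:
        return 0.0
    return min(1.0, math.exp(1 - ref_len / hyp_len))

ref = "the cat is on the mat".split()
hyp = "the the the the".split()
print(clipped_precision(ref, hyp, 1))       # 0.5: "the" occurs twice in the reference, so 2 of 4 count
print(brevity_penalty(len(ref), len(hyp)))  # exp(1 - 6/4), about 0.607
```

Without clipping, the repeated "the" would earn a unigram precision of 1.0 even though the output is useless.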

### 1.3 ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
**Introduced**: 2004, for summarization.
**Logic**: Measures recall of key content via n-grams or longest common subsequence (LCS).
**Variants**:
- ROUGE-N: Recall = (matching n-grams) / (ref n-grams). F1 = 2*P*R / (P+R).
- ROUGE-L: F1 based on LCS length.
**Pros**: Captures content coverage. **Cons**: Misses fluency.
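
ROUGE-L can be sketched directly from its definition. The version below is a simplified illustration that assumes whitespace tokenization, lowercasing, and no stemming, so its numbers may differ slightly from the `rouge-score` package; `lcs_length` and `rouge_l` are hypothetical helper names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, hypothesis):
    """Return (precision, recall, F1) based on LCS length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)

# The LCS here is "on the" (length 2), so P = R = F1 = 2/6.
print(rouge_l("the cat sits on the mat", "a feline sat on the rug"))
```

Unlike ROUGE-N, the LCS does not require matched words to be adjacent, only to appear in the same order.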

### 1.4 METEOR (Metric for Evaluation of Translation with Explicit ORdering)
**Introduced**: 2005, for translation.
**Logic**: Matches exact words, stems, synonyms; penalizes fragmented alignments.
**Formula**: METEOR = (1 - Penalty) * F_mean, where F_mean = 10*P*R / (9*P + R), Penalty = 0.5 * (chunks / matched unigrams)^3.
**Pros**: Captures meaning. **Cons**: Needs linguistic resources (e.g., WordNet).
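
The scoring step can be illustrated once an alignment is given. The sketch below assumes the counts (matched unigrams, chunks) are already known and uses the original 2005 parameters, with a fragmentation penalty of 0.5 * (chunks / matches)^3; it does not perform the stem or synonym alignment itself, and `meteor_from_counts` is a hypothetical helper.

```python
def meteor_from_counts(matches, hyp_len, ref_len, chunks):
    """METEOR score from alignment counts (original 2005 parameters)."""
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # Harmonic mean weighted 9:1 toward recall.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Fewer, longer chunks mean a less fragmented alignment and a smaller penalty.
    penalty = 0.5 * (chunks / matches) ** 3
    return (1 - penalty) * f_mean

# Two matched words forming one contiguous chunk, in 6-word sentences.
print(meteor_from_counts(matches=2, hyp_len=6, ref_len=6, chunks=1))  # 0.3125
```

When every matched word sits in its own chunk, the penalty reaches its maximum of 0.5 and halves the score.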

### 1.5 BERTScore
**Introduced**: 2019, for semantic tasks.
**Logic**: Uses BERT embeddings to compute cosine similarity of token vectors, capturing meaning.
**Formula**: F1 = 2*P*R / (P+R), where P/R = avg(max cosine sims).
**Pros**: Handles paraphrases; high human correlation. **Cons**: Slow, model-dependent.
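
The greedy matching behind the formula can be shown without a neural model. In this sketch the 2-dimensional vectors are toy stand-ins for contextual token embeddings, chosen only so the arithmetic is easy to follow; `cosine` and `greedy_match_f1` are hypothetical helpers, and the real metric also supports options such as IDF weighting that are omitted here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_f1(ref_vectors, hyp_vectors):
    """Precision, recall, F1 from greedy max-cosine matching of token vectors."""
    sims = [[cosine(h, r) for r in ref_vectors] for h in hyp_vectors]
    # Precision: each hypothesis token takes its best-matching reference token.
    precision = sum(max(row) for row in sims) / len(hyp_vectors)
    # Recall: each reference token takes its best-matching hypothesis token.
    recall = sum(max(row[j] for row in sims) for j in range(len(ref_vectors))) / len(ref_vectors)
    return precision, recall, 2 * precision * recall / (precision + recall)

ref_vectors = [[1.0, 0.0], [0.0, 1.0]]  # toy "embeddings" for two reference tokens
hyp_vectors = [[1.0, 0.0], [1.0, 1.0]]  # toy "embeddings" for two hypothesis tokens
print(greedy_match_f1(ref_vectors, hyp_vectors))
```

Because matching is soft, a hypothesis token that only partly resembles a reference token still earns partial credit, which is how paraphrases score well.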

## 2. Practical Code Guides

Let’s compute these metrics for a unified example using Python libraries.

**Example**:
- Reference: “The cat sits on the mat.”
- Hypothesis: “A feline sat on the rug.”

In [1]:
# Install libraries if needed
# !pip install sacrebleu rouge-score nltk bert-score transformers torch datasets matplotlib seaborn numpy pandas

import sacrebleu
from rouge_score import rouge_scorer
from nltk.translate.meteor_score import meteor_score
from bert_score import score
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Example texts
reference = "The cat sits on the mat."
hypothesis = "A feline sat on the rug."

# BLEU
bleu = sacrebleu.corpus_bleu([hypothesis], [[reference]])
print(f"BLEU Score: {bleu.score:.3f}")

# ROUGE
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
rouge_scores = scorer.score(reference, hypothesis)
print(f"ROUGE-1 F1: {rouge_scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {rouge_scores['rougeL'].fmeasure:.3f}")

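# METEOR (Note: Requires NLTK WordNet; see intro for setup)
# Recent NLTK versions expect pre-tokenized input, so split both strings into tokens.
meteor = meteor_score([reference.split()], hypothesis.split())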
print(f"METEOR Score: {meteor:.3f}")

# BERTScore
P, R, F1 = score([hypothesis], [reference], lang="en", verbose=False)
print(f"BERTScore F1: {F1[0].item():.3f}")


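**Expected Output** (approximate; exact values depend on library and model versions):
- BLEU: low, because few n-grams match exactly. Note that `sacrebleu` reports `bleu.score` on a 0–100 scale.
- ROUGE-1 F1: ~0.333, ROUGE-L F1: ~0.333.
- METEOR: in the same low range as ROUGE for this pair. Stem and WordNet-synonym matching add credit only when words are listed as synonyms, and “feline” is a hypernym of “cat” rather than a synonym.
- BERTScore F1: high (roughly 0.9) due to semantic similarity. Raw BERTScore values tend to cluster in a narrow high range unless `rescale_with_baseline=True` is used.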

**Code Explanation**:
- **BLEU**: `sacrebleu` computes clipped n-gram precision with BP and reports `bleu.score` on a 0–100 scale.
- **ROUGE**: `rouge_scorer` calculates ROUGE-1 and ROUGE-L F1 scores.
- **METEOR**: Requires WordNet; matches synonyms/stems.
- **BERTScore**: Uses BERT embeddings for cosine similarity.

## 3. Visualizations

Visualizations make metrics intuitive. Let’s create a bar plot comparing scores and a heatmap for BERTScore similarities.

In [None]:
# Bar plot of metric scores
metrics = ['BLEU', 'ROUGE-1', 'ROUGE-L', 'METEOR', 'BERTScore']
scores = [bleu.score / 100, rouge_scores['rouge1'].fmeasure, rouge_scores['rougeL'].fmeasure, meteor, F1[0].item()]

plt.figure(figsize=(8, 5))
sns.barplot(x=metrics, y=scores, palette='viridis')
plt.title('Comparison of NLP Evaluation Metrics')
plt.ylabel('Score (0–1)')
plt.ylim(0, 1)
plt.show()

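# BERTScore-style similarity heatmap: pairwise cosine similarity between token embeddings.
# Note: bert-base-uncased is used here for illustration; bert_score picks its own default model.
from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

def get_token_embeddings(text):
    """Return the wordpiece tokens and their unit-normalized contextual embeddings."""
    inputs = tokenizer(text, return_tensors='pt', truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    emb = outputs.last_hidden_state[0]
    emb = emb / emb.norm(dim=1, keepdim=True)  # unit length, so dot product = cosine
    return tokens, emb.numpy()

ref_tokens, ref_emb = get_token_embeddings(reference)
hyp_tokens, hyp_emb = get_token_embeddings(hypothesis)
cos_sim = hyp_emb @ ref_emb.T  # rows: hypothesis tokens, columns: reference tokens

plt.figure(figsize=(8, 6))
sns.heatmap(cos_sim, annot=True, fmt='.2f', cmap='Blues',
            xticklabels=ref_tokens, yticklabels=hyp_tokens)
plt.title('Token-Level Cosine Similarity (bert-base-uncased)')
plt.xlabel('Reference tokens'); plt.ylabel('Hypothesis tokens')
plt.show()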

**Visualization Explanation**:
- **Bar Plot**: Compares metrics (BLEU low, BERTScore high due to semantics).
- **Heatmap**: Shows semantic similarity (closer to 1 = better match).
- **Sketch Idea**: For ROUGE-L, draw sentences with a curved line for LCS; for METEOR, a graph with nodes (words) and edges (match types).

## 4. Applications

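- **Machine Translation (BLEU, METEOR)**: Translation systems are routinely benchmarked with BLEU for exact n-gram overlap, while METEOR is useful when valid translations vary in wording (e.g., legal documents with several acceptable phrasings).
- **Summarization (ROUGE)**: News summarizers, such as models trained on CNN/DailyMail, are scored with ROUGE to check that key facts (e.g., “climate crisis at 1.5°C”) are retained.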
- **Dialogue Systems (BERTScore)**: Evaluates chatbot responses (e.g., “I’m fine” ≈ “Doing well”) for semantic equivalence.

**Research Insight**: Combine metrics for robustness (e.g., ROUGE + BERTScore for summarization tasks).

## 5. Research Directions & Rare Insights

- **Limitations**: BLEU and ROUGE ignore semantics; METEOR needs resources; BERTScore is slow. Question like Einstein: Can we create a metric blending speed and meaning?
- **Rare Insight**: BLEU correlates poorly with human judgment for creative tasks (e.g., story generation). BERTScore tends to do better here but inherits the biases of its underlying model.
- **Innovation**: Experiment with hybrid metrics (e.g., ROUGE’s structure + BERTScore’s semantics) or use newer models like RoBERTa.
- **Ethical Consideration**: Metrics may undervalue culturally nuanced translations. Your research could address bias in evaluation.

## 6. Mini & Major Projects

### 6.1 Mini Project: Compare Metrics on a Small Dataset
**Task**: Evaluate a toy translation dataset.
**Dataset**: Create a small dataset with 3 reference-hypothesis pairs.
**Code**:

In [None]:
references = [
    "The cat sits on the mat.",
    "The dog runs in the park.",
    "The sun shines brightly today."
]
hypotheses = [
    "A feline sat on the rug.",
    "A puppy jogs in the garden.",
    "The sun glows today."
]

bleu_scores = [sacrebleu.corpus_bleu([hyp], [[ref]]).score / 100 for hyp, ref in zip(hypotheses, references)]
rouge_scores = [scorer.score(ref, hyp)['rougeL'].fmeasure for ref, hyp in zip(references, hypotheses)]
# Recent NLTK versions expect pre-tokenized input for METEOR.
meteor_scores = [meteor_score([ref.split()], hyp.split()) for ref, hyp in zip(references, hypotheses)]
bert_scores = score(hypotheses, references, lang="en")[2].numpy()

import pandas as pd
df = pd.DataFrame({
    'BLEU': bleu_scores, 'ROUGE-L': rouge_scores, 'METEOR': meteor_scores, 'BERTScore': bert_scores.flatten()
})
print(df.mean())

plt.figure(figsize=(8, 5))
sns.barplot(data=df)
plt.title('Metric Comparison Across Sentences')
plt.ylabel('Score')
plt.show()

**Expected**: BERTScore highest due to semantic matches; BLEU lowest due to exactness.

### 6.2 Major Project: Evaluate Summarization on CNN/DailyMail
**Task**: Use a real dataset to compare a summarization model’s outputs.
**Dataset**: CNN/DailyMail (available via Hugging Face `datasets`).
**Steps**:
1. Load dataset.
2. Use a pre-trained model (e.g., BART) to generate summaries.
3. Compute metrics.
4. Analyze which metric best reflects quality.

In [None]:
from datasets import load_dataset
from transformers import pipeline

dataset = load_dataset('cnn_dailymail', '3.0.0', split='test[:10]')
summarizer = pipeline('summarization', model='facebook/bart-large-cnn')

# truncation=True keeps long articles within the model's maximum input length.
hypotheses = [summarizer(article['article'], max_length=100, min_length=30, truncation=True)[0]['summary_text'] for article in dataset]
references = [article['highlights'] for article in dataset]

bleu_scores = [sacrebleu.corpus_bleu([hyp], [[ref]]).score / 100 for hyp, ref in zip(hypotheses, references)]
rouge_scores = [scorer.score(ref, hyp)['rougeL'].fmeasure for ref, hyp in zip(references, hypotheses)]
# Recent NLTK versions expect pre-tokenized input for METEOR.
meteor_scores = [meteor_score([ref.split()], hyp.split()) for ref, hyp in zip(references, hypotheses)]
bert_scores = score(hypotheses, references, lang="en")[2].numpy()

df = pd.DataFrame({
    'BLEU': bleu_scores, 'ROUGE-L': rouge_scores, 'METEOR': meteor_scores, 'BERTScore': bert_scores.flatten()
})
print(df.mean())

plt.figure(figsize=(8, 5))
sns.boxplot(data=df)
plt.title('Metric Distribution for CNN/DailyMail Summaries')
plt.ylabel('Score')
plt.show()

**Research Question**: Does BERTScore correlate better with human judgments on abstractive summaries?
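
One way to approach this question is to correlate each metric's scores with human ratings of the same summaries. The sketch below is a minimal standard-library illustration: the `human`, `metric_a`, and `metric_b` lists are made-up placeholder numbers, not results from any real evaluation, and the helper names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float('nan')  # undefined when one sequence is constant
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# PLACEHOLDER data for illustration only (not real ratings or metric scores).
human = [1, 2, 3, 4, 5]
metric_a = [0.1, 0.2, 0.3, 0.4, 0.5]
metric_b = [0.5, 0.1, 0.4, 0.2, 0.3]
print(spearman(human, metric_a), spearman(human, metric_b))
```

With real data, `human` would hold annotator ratings and each metric list would come from the DataFrame columns computed above; rank correlation is a common choice because metric scales are not directly comparable.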

## 7. Exercises

### Exercise 1: Manual BLEU Calculation
**Task**: Compute BLEU for:
- Reference: “The dog barks loudly.”
- Hypothesis: “Dog barks loud.”
**Steps**: Calculate p1–p4, BP, and BLEU score manually.
**Solution**:
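- Preprocessing: lowercase both sentences and drop the final period, so r = 4 and c = 3.
- Unigrams: Matches = dog, barks (2). p1 = 2/3 ≈ 0.667.
- Bigrams: Match = “dog barks” (1 of 2 hypothesis bigrams). p2 = 1/2 = 0.5.
- Trigrams: No match, p3 = 0 (use 1e-4). 4-grams: the 3-word hypothesis has none, p4 = 0 (use 1e-4).
- BP: r=4, c=3, BP = exp(1-4/3) ≈ 0.717.
- BLEU ≈ 0.717 * exp((1/4)*(log(0.667) + log(0.5) + 2*log(1e-4))) ≈ 0.005.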
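
A short script can check the hand calculation. This is a sketch under the same assumptions as the exercise: it lowercases, drops the final period, and floors zero precisions at 1e-4. Library implementations such as `sacrebleu` tokenize and smooth differently, so their scores will not match this number; `manual_bleu` is a hypothetical helper.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Counter of all contiguous n-grams (empty if the text is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def manual_bleu(reference, hypothesis, max_n=4, floor=1e-4):
    """Return ([p_1..p_max_n], BP, BLEU), flooring zero precisions at `floor`."""
    ref = reference.lower().rstrip(".").split()
    hyp = hypothesis.lower().rstrip(".").split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts, hyp_counts = ngram_counts(ref, n), ngram_counts(hyp, n)
        total = sum(hyp_counts.values())
        clipped = sum((ref_counts & hyp_counts).values())
        p_n = clipped / total if total else 0.0
        precisions.append(max(p_n, floor))
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    score = bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
    return precisions, bp, score

precisions, bp, bleu_value = manual_bleu("The dog barks loudly.", "Dog barks loud.")
print([round(p, 4) for p in precisions])  # [0.6667, 0.5, 0.0001, 0.0001]
print(round(bp, 3), round(bleu_value, 4))  # 0.717 0.0054
```

The tiny final score shows how strongly the geometric mean punishes any zero n-gram precision, which is why sentence-level BLEU is usually smoothed.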

### Exercise 2: Compare Metrics
**Task**: Use code to compute all metrics for above example. Why does BERTScore differ?
**Solution**: Run code; BERTScore is higher due to semantic similarity (“loud” ≈ “loudly”).

## 8. Future Directions & Next Steps

- **Explore Datasets**: WMT for translation, XSum for summarization.
- **Advanced Models**: Test metrics with GPT-4 or LLaMA outputs.
- **Innovate Metrics**: Develop a hybrid metric combining ROUGE’s structure and BERTScore’s semantics.
- **Publish**: Submit findings to arXiv or conferences like ACL/NeurIPS.
- **Read**: Papers like “BLEURT” or “MoverScore” for advanced metrics.

## 9. What’s Missing in Standard Tutorials

- **Metric Limitations**: Most tutorials don’t discuss how BLEU fails for creative text or BERTScore’s computational cost.
- **Practical Context**: Few show real dataset applications (e.g., CNN/DailyMail).
- **Research Focus**: Rarely encourage innovation (e.g., hybrid metrics).
- **Ethical Gaps**: Metrics may undervalue non-English or culturally nuanced text—your research can address this.

## 10. Conclusion

You’ve mastered BLEU, ROUGE, METEOR, and BERTScore—tools to benchmark and innovate in NLP. Like Tesla iterating inventions, experiment with these metrics, analyze their limits, and propose improvements. Your next step could redefine evaluation—start coding, researching, and publishing!

**Note**: See `Case_Studies.md` for detailed real-world examples.