# [ROUGE](https://aclanthology.org/W04-1013.pdf)

**Recall-Oriented Understudy for Gisting Evaluation**

A set of metrics that measure the overlap (n-grams, common subsequences, skip-bigrams) between a generated summary and one or more reference summaries.



1.   **ROUGE-N**: based on the count of overlapping n-grams
2.   **ROUGE-L**: based on the longest common subsequence (LCS) between two summaries
3.   **ROUGE-W**: based on a weighted longest common subsequence that rewards consecutive matches
4.   **ROUGE-S**: based on skip-bigram co-occurrence statistics, i.e. pairs of words in sentence order with gaps allowed, between a candidate summary and a set of reference summaries (ROUGE-SU additionally counts unigrams)
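
To make the variants above concrete, here is a minimal from-scratch sketch for a single candidate/reference pair. It is not how `rouge_metric` is implemented: the helper names (`ngrams`, `overlap_scores`, `lcs_length`) are made up for illustration, tokenization is a plain whitespace split, and the skip-bigram count uses an unlimited gap.

```python
from collections import Counter
from itertools import combinations

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_scores(cand_units, ref_units):
    """Recall and precision from two Counters of units (n-grams or skip-bigrams)."""
    overlap = sum((cand_units & ref_units).values())  # clipped co-occurrence count
    recall = overlap / max(sum(ref_units.values()), 1)
    precision = overlap / max(sum(cand_units.values()), 1)
    return recall, precision

def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

candidate = "how are you".split()
reference = "how do you do".split()

# ROUGE-1: unigram overlap
rouge1_r, rouge1_p = overlap_scores(ngrams(candidate, 1), ngrams(reference, 1))

# ROUGE-L: LCS length relative to each text's length
lcs = lcs_length(candidate, reference)
rougeL_r, rougeL_p = lcs / len(reference), lcs / len(candidate)

# ROUGE-S (unlimited gap): ordered word pairs, not necessarily adjacent
skip_r, skip_p = overlap_scores(Counter(combinations(candidate, 2)),
                                Counter(combinations(reference, 2)))

print(rouge1_r, rouge1_p, rougeL_r, rougeL_p, skip_r, skip_p)
```

ROUGE-W is left out of the sketch: it uses the same LCS table but weights runs of consecutive matches more heavily.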



In [None]:
!pip install rouge_metric



In [None]:
from rouge_metric import PyRouge

# Load summary results
hypotheses = [
    'how are you\ni am fine',  # document 1: hypothesis
    'it is fine today\nwe won the football game',  # document 2: hypothesis
]
references = [[
    'how do you do\nfine thanks',  # document 1: reference 1
    'how old are you\ni am three',  # document 1: reference 2
], [
    'it is sunny today\nlet us go for a walk',  # document 2: reference 1
    'it is a terrible day\nwe lost the game',  # document 2: reference 2
]]

# Evaluate document-wise ROUGE scores
rouge = PyRouge(rouge_n=(1, 2, 4), rouge_l=True, rouge_w=True,
                rouge_w_weight=1.2, rouge_s=True, rouge_su=True, skip_gap=4)
scores = rouge.evaluate(hypotheses, references)
print(scores)

{'rouge-1': {'r': 0.5182186234817814, 'p': 0.5555555555555556, 'f': 0.5362379555927943}, 'rouge-2': {'r': 0.19518716577540107, 'p': 0.2125, 'f': 0.20347597966879813}, 'rouge-4': {'r': 0.07142857142857142, 'p': 0.08333333333333333, 'f': 0.07692307692307691}, 'rouge-l': {'r': 0.5182186234817814, 'p': 0.5555555555555556, 'f': 0.5362379555927943}, 'rouge-w-1.2': {'r': 0.33608419409971513, 'p': 0.4734837712933738, 'f': 0.3931242798550236}, 'rouge-s4': {'r': 0.2549450549450549, 'p': 0.2916666666666667, 'f': 0.27207237393198186}, 'rouge-su4': {'r': 0.3149522799575822, 'p': 0.35526315789473684, 'f': 0.3338954468802698}}


In [None]:
print(f"rouge-1: {scores['rouge-1']}")
print(f"rouge-2: {scores['rouge-2']}")
print(f"rouge-4: {scores['rouge-4']}")
print(f"rouge-l: {scores['rouge-l']}")
print(f"rouge-w-1.2: {scores['rouge-w-1.2']}")
print(f"rouge-s4: {scores['rouge-s4']}")
print(f"rouge-su4: {scores['rouge-su4']}")

rouge-1: {'r': 0.5182186234817814, 'p': 0.5555555555555556, 'f': 0.5362379555927943}
rouge-2: {'r': 0.19518716577540107, 'p': 0.2125, 'f': 0.20347597966879813}
rouge-4: {'r': 0.07142857142857142, 'p': 0.08333333333333333, 'f': 0.07692307692307691}
rouge-l: {'r': 0.5182186234817814, 'p': 0.5555555555555556, 'f': 0.5362379555927943}
rouge-w-1.2: {'r': 0.33608419409971513, 'p': 0.4734837712933738, 'f': 0.3931242798550236}
rouge-s4: {'r': 0.2549450549450549, 'p': 0.2916666666666667, 'f': 0.27207237393198186}
rouge-su4: {'r': 0.3149522799575822, 'p': 0.35526315789473684, 'f': 0.3338954468802698}


## Results
1. **r** represents **recall**: the proportion of units in the reference summary (n-grams, LCS words, or skip-bigrams, depending on the variant) that are also present in the candidate summary.
2. **p** represents **precision**: the proportion of units in the candidate summary that are also present in the reference summary.
3. **f** represents the **F-measure**, which combines recall and precision as their harmonic mean: F = 2 * (r * p) / (r + p)
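
As a quick check of the F-measure formula, the sketch below recomputes `f` from the `r` and `p` values printed for `rouge-1` in the first example. The helper name `f_measure` is made up for illustration.

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision (0 when both are 0)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# r and p copied from the rouge-1 output of the first example
rouge_1 = {'r': 0.5182186234817814, 'p': 0.5555555555555556}
f = f_measure(rouge_1['r'], rouge_1['p'])
print(f"{f:.4f}")  # 0.5362, matching the reported 'f'
```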

In [None]:
from rouge_metric import PyRouge

# Load summary results
hypotheses = [
    """
    Jorge Amado nascido in Itabuna, Brazil, 10 de agosto de 1912 – Salvador . Filiado ao Partido Comunista Brasileiro, por ele foi eleito Deputado Federal em 1946 . Teve outros três irmãos: Jofre, Joelson e James, único nacido em Ilhéus . Fled parte of a funda da "Acadovação dos Rebeldes", grupo de jovens que despencede importante .'
    """
] #document 1
references = [[
    """
    Jorge Amado nascido in Itabuna, Brazil, 10 de agosto de 1912 – Salvador . Filiado a Partido Comunista Brasileiro, por ele foi eleito Deputado Federal em 1946 . Teve outros três irmãos: Jofre, Joelson e James
    """
]]

# Evaluate document-wise ROUGE scores
rouge = PyRouge(rouge_n=(1, 2, 4), rouge_l=True, rouge_w=True,
                rouge_w_weight=1.2, rouge_s=True, rouge_su=True, skip_gap=4)
scores = rouge.evaluate(hypotheses, references)
print(f"rouge-1: {scores['rouge-1']}")
print(f"rouge-2: {scores['rouge-2']}")
print(f"rouge-4: {scores['rouge-4']}")
print(f"rouge-l: {scores['rouge-l']}")
print(f"rouge-w-1.2: {scores['rouge-w-1.2']}")
print(f"rouge-s4: {scores['rouge-s4']}")
print(f"rouge-su4: {scores['rouge-su4']}")

rouge-1: {'r': 0.9722222222222222, 'p': 0.6140350877192983, 'f': 0.7526881720430108}
rouge-2: {'r': 0.9142857142857143, 'p': 0.5714285714285714, 'f': 0.7032967032967032}
rouge-4: {'r': 0.8484848484848485, 'p': 0.5185185185185185, 'f': 0.6436781609195402}
rouge-l: {'r': 0.9444444444444444, 'p': 0.5964912280701754, 'f': 0.7311827956989247}
rouge-w-1.2: {'r': 0.41147727286425834, 'p': 0.5321499161545168, 'f': 0.4640976834970099}
rouge-s4: {'r': 0.9151515151515152, 'p': 0.5592592592592592, 'f': 0.6942528735632184}


# BLEU

**Bilingual Evaluation Understudy**

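BLEU measures how many n-grams of the generated summary also appear in the reference summaries. It is precision-oriented: clipped n-gram precisions for several n-gram orders are combined with a geometric mean and multiplied by a brevity penalty that punishes outputs that are too short.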
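
The computation can be sketched in a few lines of plain Python. This is a simplified illustration, not NLTK's implementation: the function names are made up, tokenization is a whitespace split, there is no smoothing (any zero precision gives a score of 0), and equal weights are assumed for 1- to 4-grams.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hypothesis, references, n):
    """Clipped n-gram matches and total n-grams in the hypothesis."""
    hyp_counts = ngram_counts(hypothesis, n)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return clipped, sum(hyp_counts.values())

def bleu_sketch(hypothesis, references, max_n=4):
    log_sum = 0.0
    for n in range(1, max_n + 1):
        clipped, total = modified_precision(hypothesis, references, n)
        if clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_sum += math.log(clipped / total) / max_n
    # Brevity penalty against the reference length closest to the hypothesis
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(hypothesis)), length))
    if len(hypothesis) > ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - ref_len / len(hypothesis))
    return brevity_penalty * math.exp(log_sum)

refs = ["The quick brown fox jumps over the lazy dog".split(),
        "A quick brown fox leaps over a lazy dog".split()]
hyp = "A speedy brown fox jumps over the lazy dog".split()

bleu = bleu_sketch(hyp, refs)
print(f"{bleu:.2f}")  # precisions 8/9, 6/8, 5/7, 4/6 -> about 0.75
```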

In [None]:
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.bleu_score import SmoothingFunction

def evaluate_summary(reference_summaries, generated_summaries):
    # reference_summaries: one list of reference strings per generated summary
    # generated_summaries: one string per document
    assert len(reference_summaries) == len(generated_summaries), "Mismatched summary lengths"

    # corpus_bleu expects token lists; raw strings would be scored as character n-grams
    tokenized_references = [[ref.split() for ref in refs] for refs in reference_summaries]
    tokenized_hypotheses = [hyp.split() for hyp in generated_summaries]

    # Calculate the BLEU score
    smoothie = SmoothingFunction().method4 # using smoothing method 4 as an example
    bleu_score = corpus_bleu(tokenized_references, tokenized_hypotheses, smoothing_function=smoothie)

    return bleu_score

# Example Usage
reference_summaries = [[
    "The quick brown fox jumps over the lazy dog",
    "A quick brown fox leaps over a lazy dog"
]]

generated_summaries = [
    "A speedy brown fox jumps over the lazy dog"
]

bleu_score = evaluate_summary(reference_summaries, generated_summaries)
print(f'BLEU Score: {bleu_score:.2f}')

BLEU Score: 0.75


# BERTScore

A metric for evaluating text generation tasks such as summarization. It embeds the tokens of the generated text and the reference text with BERT, greedily matches each token to its most similar counterpart by cosine similarity, and aggregates the matches into precision, recall, and F1.
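
The matching step can be illustrated without a model. The sketch below uses hand-made 2-dimensional vectors in place of BERT token embeddings and omits IDF weighting and baseline rescaling; the function names and the toy vectors are assumptions for illustration only.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_scores(cand_vecs, ref_vecs):
    """Precision, recall and F1 from greedy best-match cosine similarity."""
    # Precision: each candidate token is matched to its most similar reference token
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy "embeddings": one vector per token (hypothetical values)
candidate_vecs = [(1.0, 0.0), (0.0, 1.0)]
reference_vecs = [(1.0, 0.0), (1.0, 1.0)]

toy_p, toy_r, toy_f1 = greedy_match_scores(candidate_vecs, reference_vecs)
print(f"P={toy_p:.3f} R={toy_r:.3f} F1={toy_f1:.3f}")
```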

In [None]:
!pip install bert-score



In [None]:
from bert_score import score

def evaluate_summary(generated_summary, reference_summary):
    # Compute BERTScore
    P, R, F1 = score([generated_summary], [reference_summary], lang='en', model_type='bert-base-uncased', num_layers=9, rescale_with_baseline=True)

    # Convert tensors to numpy arrays for easier handling
    P = P.numpy()[0]
    R = R.numpy()[0]
    F1 = F1.numpy()[0]

    return P, R, F1

# Example usage:
generated_summary = "The AI model performed exceedingly well on the given task."
reference_summary = "The given task was performed exceedingly well by the AI model."

precision, recall, f1_score = evaluate_summary(generated_summary, reference_summary)

print(f'Precision: {precision:.3f}, Recall: {recall:.3f}, F1 Score: {f1_score:.3f}')

Precision: 0.796, Recall: 0.778, F1 Score: 0.788


# METEOR

**Metric for Evaluation of Translation with Explicit ORdering**

Designed to evaluate machine translation systems, but can also be applied to summarization tasks. It's one of the metrics that tries to overcome the limitations of BLEU by considering synonyms, stemming, and word order when comparing the generated text to a reference text.

In METEOR, words in the generated text are first aligned with words in the reference(s). The score is a weighted harmonic mean of unigram precision and recall (recall is weighted more heavily), reduced by a penalty when the matched words are fragmented or out of order. The alignment and the penalty consider:

1.  Exact match: Words in the generated text and reference that match exactly.
2.  Stem match: Words that match after stemming.
3.  Synonym match: Words that are synonyms but not exact matches.
4.  Word order: The order of matching words; matches that form fewer contiguous chunks are penalized less.
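
The final scoring step can be sketched from the alignment counts alone. The helper below is hypothetical, and its defaults (`alpha=0.9`, `beta=3.0`, `gamma=0.5`) are assumed to mirror NLTK's defaults. The counts are hand-derived for the hypothesis "This is a summary generated by your model." against the reference "This is the reference summary.", assuming only exact matches ("this", "is", "summary", ".") after lowercasing and tokenization.

```python
def meteor_from_counts(matches, hyp_len, ref_len, chunks,
                       alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR-style score from alignment counts (parameter values are assumptions)."""
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # Weighted harmonic mean; alpha close to 1 favors recall
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Fragmentation penalty: more chunks per match means worse word order
    penalty = gamma * (chunks / matches) ** beta
    return (1 - penalty) * f_mean

# 4 matched words, 9 hypothesis tokens, 6 reference tokens,
# 3 contiguous chunks: "this is", "summary", "."
meteor = meteor_from_counts(matches=4, hyp_len=9, ref_len=6, chunks=3)
print(f"{meteor:.2f}")  # 0.50
```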

In [None]:
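import nltk
from nltk.tokenize import word_tokenize
from nltk.translate import meteor_score

nltk.download('wordnet')  # WordNet is used for synonym matching
# word_tokenize also needs the NLTK 'punkt' tokenizer data; download it the same way if it is missing
# Assume `generated_summary` and `reference_summary` are your variables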
generated_summary = "This is a summary generated by your model."
reference_summary = "This is the reference summary."

# Tokenize the summaries
tokenized_generated_summary = word_tokenize(generated_summary)
tokenized_reference_summary = word_tokenize(reference_summary)

# Now compute the METEOR score
score = meteor_score.single_meteor_score(tokenized_reference_summary, tokenized_generated_summary)

print(f'METEOR score: {score:.2f}')


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


METEOR score: 0.50


# MOVERScore

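**Contextualized embeddings combined with Earth Mover's Distance**

The metric compares a generated summary to a reference summary by semantic similarity rather than exact word overlap. Both texts are encoded with contextualized (BERT) embeddings, and the score is derived from the Earth Mover's Distance between them: the minimum cost of "moving" the IDF-weighted words or n-grams of one text onto those of the other.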
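
As a rough illustration of the "moving" idea, the sketch below computes a relaxed mover distance in which every candidate word simply travels to its nearest reference word. This is a simplification, not MoverScore itself: the real metric solves the full optimal-transport problem over BERT embeddings, while the function name, the toy 2-dimensional vectors, and the uniform weights here are assumptions for illustration.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_mover_distance(cand_vecs, ref_vecs, cand_weights=None):
    """Weighted average distance from each candidate word to its nearest reference word."""
    if cand_weights is None:
        cand_weights = [1.0] * len(cand_vecs)  # uniform weights; IDF weights could be used instead
    total = sum(cand_weights)
    return sum((w / total) * min(euclidean(c, r) for r in ref_vecs)
               for w, c in zip(cand_weights, cand_vecs))

# Toy word vectors (hypothetical values)
candidate_words = [(0.0, 0.0), (1.0, 0.0)]
reference_words = [(0.0, 0.0), (1.0, 1.0)]

distance = relaxed_mover_distance(candidate_words, reference_words)
print(distance)  # 0.5: one word matches exactly, the other travels a distance of 1
```

A smaller distance means the two texts are semantically closer; MoverScore reports a similarity derived from this kind of transport cost.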

In [None]:
!pip install moverscore
!pip install pyemd
!pip install pytorch_pretrained_bert



In [None]:
from moverscore import get_idf_dict, word_mover_score
import numpy as np

# Define your reference and candidate summaries
references = [
    "The cat sat on the mat.",
    "The quick brown fox jumped over the lazy dog.",
    "A journey of a thousand miles begins with a single step.",
    "To be or not to be, that is the question.",
    "All that glitters is not gold."
]

candidates = [
    "A cat sat on a mat.",
    "A fast, dark-colored fox leapt over a slow, lazy dog.",
    "A long trip begins with one step.",
    "To exist or not to exist, that's the inquiry.",
    "Not everything that shines is made of gold."
]

# Precompute IDF dictionaries
idf_dict_ref = get_idf_dict(references)  # or provide your own IDF dictionary
idf_dict_hyp = get_idf_dict(candidates)  # or provide your own IDF dictionary

# Compute MOVERScore
scores = word_mover_score(references, candidates, idf_dict_ref, idf_dict_hyp, stop_words=[], n_gram=1, remove_subwords=True)

# Convert scores to a numpy array and print the mean MOVERScore
scores = np.array(scores)
print(f'Mean MOVERScore: {scores.mean():.4f}')


Mean MOVERScore: 0.5540
