# LLM Response Evaluation Using ROUGE Score

**ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating summaries and other machine-generated text against a reference (ground-truth) text.**

* Objective: To compute the ROUGE score of a generated text against a reference text.

In [1]:
!pip install rouge-score

Successfully installed rouge-score-0.1.2


* Each ROUGE variant is reported as three values:
* Precision: Fraction of n-grams in the generated text that also appear in the reference text.
* Recall: Fraction of n-grams in the reference text that also appear in the generated text.
* F1-Score: Harmonic mean of precision and recall.
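
The three values can be worked out by hand for a small pair of sentences. The following is a minimal sketch, not the `rouge-score` implementation: the helper `rouge_n` is a hypothetical name, and it assumes plain lowercase whitespace tokenization with no stemming, so its numbers may differ from the library's on text with punctuation or inflected words.

```python
from collections import Counter


def rouge_n(reference: str, generated: str, n: int = 1):
    """Return (precision, recall, f1) from clipped n-gram overlap."""
    def ngrams(text):
        # Assumption: lowercase + whitespace split, no stemming
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_counts, gen_counts = ngrams(reference), ngrams(generated)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((ref_counts & gen_counts).values())
    precision = overlap / max(sum(gen_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


precision, recall, f1 = rouge_n("The cat is on the mat", "the cat is on mat")
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")
# Precision: 1.0000, Recall: 0.8333, F1: 0.9091
```

All five generated words occur in the reference, so precision is 5/5, while only five of the six reference words are covered, so recall is 5/6.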

# ROUGE Score Implementation

In [2]:
from rouge_score import rouge_scorer

# Reference and generated summaries
reference = "The cat sat on the mat, watching tom and jerry cartoon in television"
generated = "The cat is sitting on the mat, watching tom and jerry"

# Initialize the ROUGE scorer (the commented lines show smaller sets of metrics)
#scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
#scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2'], use_stemmer=True)
#scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rouge3'], use_stemmer=True)
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rouge3', 'rougeL'], use_stemmer=True)

# Compute ROUGE scores: the reference comes first, the generated text second
scores = scorer.score(reference, generated)

# Print results
for rouge_type, score in scores.items():
    print(f"{rouge_type}: Precision: {score.precision:.4f}, Recall: {score.recall:.4f}, F1: {score.fmeasure:.4f}")


rouge1: Precision: 0.8182, Recall: 0.6923, F1: 0.7500
rouge2: Precision: 0.7000, Recall: 0.5833, F1: 0.6364
rouge3: Precision: 0.5556, Recall: 0.4545, F1: 0.5000
rougeL: Precision: 0.8182, Recall: 0.6923, F1: 0.7500


In [14]:
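reference1 = "The cat is on the mat"
generated1 = "the cat is on mat"
# Pass all metric names in a single list; separate lists would be bound to the constructor's other parameters
scorer4 = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rouge3', 'rougeL'], use_stemmer=True)

# Score with the scorer created in this cell
scores = scorer4.score(reference1, generated1)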

for rouge_type, score in scores.items():
    print(f'{rouge_type}: Precision: {score.precision:.4f}, Recall: {score.recall:.4f}, F1: {score.fmeasure:.4f}')

rouge1: Precision: 1.0000, Recall: 0.8333, F1: 0.9091
rouge2: Precision: 0.7500, Recall: 0.6000, F1: 0.6667
rouge3: Precision: 0.6667, Recall: 0.5000, F1: 0.5714
rougeL: Precision: 1.0000, Recall: 0.8333, F1: 0.9091


* In the above example, a single missing word ("the") barely lowers the score: every generated word still matches the reference, so ROUGE-1 and ROUGE-L precision stay at 1.0, and only recall and the higher-order n-gram scores drop.

In [16]:
reference2 = "Cybercrime is one of the most known crime and taking money of most the people which is leading to loss of income"
generated2 = "Cybercrime is one of the most prevalent crimes, affecting many individuals and leading to financial losses"

# Stemming lets "crimes"/"crime" and "losses"/"loss" count as matches
scorer4 = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rouge3', 'rougeL'], use_stemmer=True)

scores = scorer4.score(reference2, generated2)

for rouge_type, score in scores.items():
    print(f'{rouge_type}: Precision: {score.precision:.4f}, Recall: {score.recall:.4f}, F1: {score.fmeasure:.4f}')

rouge1: Precision: 0.6875, Recall: 0.5000, F1: 0.5789
rouge2: Precision: 0.4000, Recall: 0.2857, F1: 0.3333
rouge3: Precision: 0.2857, Recall: 0.2000, F1: 0.2353
rougeL: Precision: 0.6875, Recall: 0.5000, F1: 0.5789


* In the above example, the two sentences are semantically close, but few runs of three consecutive words match exactly, so the trigram score (rouge3) is low.
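
To see which trigrams actually match in this pair, the overlap can be listed directly. This is a rough sketch under stated assumptions: `ngram_list` is a hypothetical helper, tokens are lowercase alphanumeric runs, no stemming is applied, and counts are not clipped (each generated trigram occurs only once here).

```python
import re


def ngram_list(text: str, n: int):
    """Lowercase, keep alphanumeric tokens, and return the n-grams in order."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


reference2 = "Cybercrime is one of the most known crime and taking money of most the people which is leading to loss of income"
generated2 = "Cybercrime is one of the most prevalent crimes, affecting many individuals and leading to financial losses"

ref_trigrams = set(ngram_list(reference2, 3))
gen_trigrams = ngram_list(generated2, 3)
# Keep only the generated trigrams that also occur in the reference
shared = [gram for gram in gen_trigrams if gram in ref_trigrams]

for gram in shared:
    print(" ".join(gram))
print(f"{len(shared)} of {len(gen_trigrams)} generated trigrams appear in the reference")
```

Only the opening phrase "Cybercrime is one of the most" contributes matching trigrams; once the wording diverges, no run of three words lines up again.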

In [13]:
reference3 = "Online crimes are increasingly common, resulting in monetary losses for victims"
generated3 = "The rise in cybercrime has caused substantial financial losses for countless individuals"
# All metric names go in one list; 'rougeL' matches the output shown below
scorer4 = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rouge3', 'rougeL'], use_stemmer=True)

scores = scorer4.score(reference3, generated3)

for rouge_type, score in scores.items():
    print(f'{rouge_type}: Precision: {score.precision:.4f}, Recall: {score.recall:.4f}, F1: {score.fmeasure:.4f}')

rouge1: Precision: 0.2500, Recall: 0.2727, F1: 0.2609
rouge2: Precision: 0.0909, Recall: 0.1000, F1: 0.0952
rouge3: Precision: 0.0000, Recall: 0.0000, F1: 0.0000
rougeL: Precision: 0.2500, Recall: 0.2727, F1: 0.2609


* In the above example, the generated text paraphrases the reference. ROUGE depends on exact n-gram matches, so rewording with synonyms earns no credit, and the scores stay low even though the meaning is preserved.

In [21]:
reference4 = "The quick brown fox jumped over the lazy dog"
generated4 = "The quick brown dog jumped over the lazy fox"
# Only unigrams are scored here, so word order is ignored
scorer1 = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
scores = scorer1.score(reference4, generated4)

for rouge_type, score in scores.items():
    print(f'{rouge_type}: Precision: {score.precision:.4f}, Recall: {score.recall:.4f}, F1: {score.fmeasure:.4f}')

rouge1: Precision: 1.0000, Recall: 1.0000, F1: 1.0000


* In the above example, ROUGE-1 compares single words without regard to their order, so swapping "fox" and "dog" changes the meaning but still yields 1.0 for precision, recall, and F1.
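
Order-sensitive variants react differently to the same swap. The sketch below is a pure-Python approximation, not the `rouge-score` implementation: `lcs_length` is a hypothetical helper, tokenization is a plain lowercase whitespace split, and only precision is computed, using bigram overlap (the idea behind rouge2) and the longest common subsequence (the idea behind rougeL).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


ref_tokens = "The quick brown fox jumped over the lazy dog".lower().split()
gen_tokens = "The quick brown dog jumped over the lazy fox".lower().split()

# Bigram precision: share of generated bigrams found in the reference
ref_bigrams = set(zip(ref_tokens, ref_tokens[1:]))
gen_bigrams = list(zip(gen_tokens, gen_tokens[1:]))
bigram_precision = sum(gram in ref_bigrams for gram in gen_bigrams) / len(gen_bigrams)

# LCS precision: longest in-order match divided by generated length
lcs_precision = lcs_length(ref_tokens, gen_tokens) / len(gen_tokens)

print(f"Bigram precision: {bigram_precision:.4f}")  # 0.6250
print(f"LCS precision: {lcs_precision:.4f}")        # 0.7778
```

Both values fall below 1.0 because the swap breaks three of the eight bigrams and removes two words from the longest in-order match, which suggests reporting rouge2 or rougeL alongside rouge1 when word order matters.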