# NLP Performance Evaluation Metrics

### BLEU Score:
- Measures n-gram precision: the share of n-grams in the model output that also appear in a human-written reference text.
- Originally designed for machine translation, but now widely used across other NLP tasks.
- BLEU stands for Bilingual Evaluation Understudy.

BLEU can be thought of as a strict answer-key check for machines. It does not check grammar or meaning directly.
- It compares what a machine wrote to what a human wrote, and checks how many of the same words and short phrases appear in both.
- It's like grading a student's essay by how many of the same word patterns it uses compared to a perfect answer.
- It’s very strict — it gives higher scores when the machine uses exactly the same words in the same order as a human would.
- Commonly used to evaluate machine translations or summaries.
- What it measures: Precision — how many of the words the machine used were actually right (according to the human example).

### ROUGE Score:
- Focuses primarily on recall.
- Compares overlapping units such as n-grams, word sequences, and word pairs between the generated text and the reference text.
- ROUGE scores are commonly used for tasks such as text summarization.
- ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation.

ROUGE is more like a reading comprehension teacher. 
- It checks how much of the important stuff from the human-written version the machine remembered and included.
- It’s more forgiving than BLEU — it doesn’t require the words to be in the exact same order.
- Often used for testing how good a machine-generated summary is.
- What it measures: Recall — how much of the important content from the human version was captured by the machine.



In [None]:
# Importing needed libraries
import evaluate
import nltk
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from sacrebleu import corpus_bleu

In [None]:
# Example sentences
# reference is the human-written text; candidate is the LLM-generated version
reference = ["the cat is on the mat"]
candidate = ["the cat is on mat"]

In [None]:
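# BLEU Score Calculation
# corpus_bleu takes a list of hypotheses and a list of reference streams,
# where each stream holds one reference per hypothesis.
bleu = corpus_bleu(candidate, [reference])
print(f"BLEU Score: {bleu.score:.2f}")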


## Explanation of Results

The score means that the machine-generated sentence (candidate) matches the human-written reference fairly well, but not perfectly.
- sacrebleu reports BLEU on a scale from 0 to 100; some libraries, such as NLTK, report the same quantity on a scale from 0.0 to 1.0.
- A higher score means a closer match between the candidate and the reference in terms of exact word choices and word order.
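- A score of 57.89 reflects strong but imperfect overlap: every candidate word appears in the reference, but dropping the second "the" breaks the longer n-grams (for example, "on mat" never occurs in the reference) and makes the candidate shorter than the reference, which BLEU penalizes.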
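
To see where 57.89 comes from, the calculation can be reproduced by hand. The sketch below is a simplified, single-reference BLEU that assumes whitespace tokenization and no smoothing; `ngram_counts` and `simple_bleu` are illustrative names, not part of sacrebleu. For this lowercase, punctuation-free sentence pair it should agree with the library, but it is not a replacement for it.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Single-reference BLEU on a 0-100 scale (assumed: whitespace tokens, no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        precisions.append(matched / total if total else 0.0)
    if min(precisions) == 0:
        return 0.0, precisions
    # Geometric mean of the 1-gram to max_n-gram precisions.
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: candidates shorter than the reference are scaled down.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * bp * geo_mean, precisions


score, precisions = simple_bleu("the cat is on mat", "the cat is on the mat")
print(precisions)       # n-gram precisions for n = 1..4
print(round(score, 2))  # 57.89
```

The n-gram precisions are 5/5, 3/4, 2/3, and 1/2, and the brevity penalty is exp(1 - 6/5) because the candidate has 5 tokens against the reference's 6.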


In [None]:
# ROUGE Score Calculation
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference[0], candidate[0])
print(f"ROUGE-1: {scores['rouge1']}")
print(f"ROUGE-L: {scores['rougeL']}")

## Explanation of Results

##### ROUGE-1

Measures overlap of single words (unigrams) between the reference and the candidate.
- Precision = 1.0 → Every word in the candidate was found in the reference.
- Recall = 0.833... → 83.33% of the words in the reference were captured in the candidate.
- F1-score = 0.909... → The harmonic mean of precision and recall; gives a balanced view of performance.
- Interpretation: The candidate used only correct words, but missed one word from the reference (the second "the").
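
The ROUGE-1 numbers above can be checked with a few lines of standard-library Python. This is a minimal sketch that assumes whitespace tokenization and skips stemming; `rouge_1` is an illustrative helper, not the `rouge_score` API.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Unigram precision, recall, and F1 (assumed: whitespace tokens, no stemming)."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = rouge_1("the cat is on mat", "the cat is on the mat")
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=1.000 recall=0.833 f1=0.909
```

Five of the candidate's five words match (precision 5/5), while the reference has six words (recall 5/6).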

##### ROUGE-L

Measures longest common subsequence (LCS) which is the longest series of words that appear in both the reference and candidate in the same order, though not necessarily contiguously.
- The scores match ROUGE-1 here because the LCS is the entire candidate ("the cat is on mat", 5 words).
- All words in the candidate appear in the reference in the same order; the candidate only omits the second "the".
- Interpretation: The structure and ordering of the candidate closely matched the human version, missing only a little content.
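
The LCS behind ROUGE-L can be computed with a standard dynamic-programming table. The sketch below assumes whitespace tokenization and no stemming; `lcs_length` and `rouge_l` are illustrative names rather than functions from `rouge_score`. The second call uses a hypothetical reordered candidate to show that, unlike ROUGE-1, ROUGE-L drops when word order changes.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l(candidate, reference):
    """LCS-based precision, recall, and F1 (assumed: whitespace tokens)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)


reference_text = "the cat is on the mat"
print(rouge_l("the cat is on mat", reference_text))  # LCS covers the whole candidate
print(rouge_l("mat the on is cat", reference_text))  # same words, shuffled order
```

For the original candidate the LCS has length 5, giving the same precision (5/5) and recall (5/6) as ROUGE-1. For the shuffled candidate the LCS shrinks to 2 words, so both precision and recall fall even though every word still appears in the reference.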

In [None]:
# Consolidated version: BLEU with sacrebleu and ROUGE with rouge_score in a single cell

from sacrebleu import corpus_bleu
from rouge_score import rouge_scorer

# Example sentences
reference = ["the cat is on the mat"]
candidate = ["the cat is on mat"]

# BLEU Score Calculation
bleu = corpus_bleu(candidate, [reference])
print(f"BLEU Score: {bleu.score}")

# ROUGE Score Calculation
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference[0], candidate[0])
print(f"ROUGE-1: {scores['rouge1']}")
print(f"ROUGE-L: {scores['rougeL']}")