In [1]:
!pip install sentence-transformers rouge-score bert-score

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting bert-score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Collecting absl-py (from rouge-score)
  Downloading absl_py-2.3.0-py3-none-any.whl.metadata (2.4 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Downloading absl_py-2.3.0-py3-none-any.whl (135 kB)
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py): started
  Building wheel for rouge-score (setup.py): finished with status 'done'
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24972 sha256=9e77d988301f01010db9d990d413144c300d01e9c7da7fac242bd1a87cfaacae
  Stored in directory: c:\users\nisharg\appdata\local\pip\cache\wheels\85\9d\af\01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge-score
Installing collected packages: a

In [3]:
from sentence_transformers import SentenceTransformer, util
from rouge_score import rouge_scorer
from bert_score import score
import numpy as np

# Sample long-answer test set
test_data = [
    {
        "question": "Explain the causes of World War I.",
        "ground_truth": "World War I was caused by a combination of nationalism, militarism, alliances, and imperial rivalries. The immediate trigger was the assassination of Archduke Franz Ferdinand in 1914.",
        "prediction": "The war began due to rising nationalism, a buildup of military power, and a complex web of alliances. The assassination of Archduke Franz Ferdinand served as the spark for the conflict."
    },
    {
        "question": "What are the benefits of exercise?",
        "ground_truth": "Exercise improves cardiovascular health, boosts mood, helps with weight management, strengthens muscles and bones, and can reduce the risk of chronic diseases like diabetes and hypertension.",
        "prediction": "Regular physical activity can improve heart health, regulate body weight, enhance mental well-being, and lower the chances of conditions like diabetes and high blood pressure."
    }
]

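# Load models. Despite the variable name, all-MiniLM-L6-v2 is a
# sentence-embedding model; it is used only for the cosine similarity below.
bert_model = SentenceTransformer("all-MiniLM-L6-v2")
# ROUGE-1, ROUGE-2 and ROUGE-L, with Porter stemming applied before matching
rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)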

results = []

for item in test_data:
    pred = item["prediction"]
    gt = item["ground_truth"]

    # Cosine similarity between the sentence embeddings of prediction and reference
    emb_pred = bert_model.encode(pred, convert_to_tensor=True)
    emb_gt = bert_model.encode(gt, convert_to_tensor=True)
    cosine_sim = util.cos_sim(emb_pred, emb_gt).item()  # 1x1 tensor -> float

    # ROUGE: RougeScorer.score takes (target, prediction), in that order
    r_scores = rouge.score(gt, pred)
    rouge1 = r_scores["rouge1"].fmeasure
    rouge2 = r_scores["rouge2"].fmeasure
    rougel = r_scores["rougeL"].fmeasure

    results.append({
        "question": item["question"],
        "prediction": pred,
        "ground_truth": gt,
        "cosine_similarity": round(cosine_sim, 4),
        "rouge1": round(rouge1, 4),
        "rouge2": round(rouge2, 4),
        "rougeL": round(rougel, 4)
    })

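# BERTScore (evaluate all at once). score() takes (candidates, references);
# with lang='en' it loads the default English model (roberta-large in this run).
preds = [r["prediction"] for r in results]
gts = [r["ground_truth"] for r in results]
P, R, F1 = score(preds, gts, lang='en', verbose=False)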

for i, f in enumerate(F1):
    results[i]["bert_score_f1"] = round(f.item(), 4)

# Print
from pprint import pprint
pprint(results)

# Averages
avg_cosine = np.mean([r["cosine_similarity"] for r in results])
avg_rouge1 = np.mean([r["rouge1"] for r in results])
avg_rouge2 = np.mean([r["rouge2"] for r in results])
avg_rougel = np.mean([r["rougeL"] for r in results])
avg_bert = np.mean([r["bert_score_f1"] for r in results])

print("\nAverage Metrics:")
print(f"Cosine Similarity: {avg_cosine:.4f}")
print(f"ROUGE-1: {avg_rouge1:.4f}")
print(f"ROUGE-2: {avg_rouge2:.4f}")
print(f"ROUGE-L: {avg_rougel:.4f}")
print(f"BERTScore F1: {avg_bert:.4f}")


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[{'bert_score_f1': 0.9299,
  'cosine_similarity': 0.7858,
  'ground_truth': 'World War I was caused by a combination of nationalism, '
                  'militarism, alliances, and imperial rivalries. The '
                  'immediate trigger was the assassination of Archduke Franz '
                  'Ferdinand in 1914.',
  'prediction': 'The war began due to rising nationalism, a buildup of '
                'military power, and a complex web of alliances. The '
                'assassination of Archduke Franz Ferdinand served as the spark '
                'for the conflict.',
  'question': 'Explain the causes of World War I.',
  'rouge1': 0.4483,
  'rouge2': 0.1786,
  'rougeL': 0.3448},
 {'bert_score_f1': 0.9249,
  'cosine_similarity': 0.8387,
  'ground_truth': 'Exercise improves cardiovascular health, boosts mood, helps '
                  'with weight management, strengthens muscles and bones, and '
                  'can reduce the risk of chronic diseases like diabetes and '
 

🔍 ROUGE Scores Breakdown (second example, the exercise question):
Metric	Value	Meaning
rouge1	0.3846	F1 of unigram (single-word) overlap between prediction and reference, after stemming.
rouge2	0.08	F1 of bigram (2-word sequence) overlap. This is much harder to match, especially for paraphrased answers.
rougeL	0.3462	F1 based on the Longest Common Subsequence (LCS): words that appear in the same order in both texts, not necessarily next to each other.
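
To make the three values concrete, here is a minimal pure-Python sketch that recomputes them for the exercise example. It is not the rouge_score implementation: the helper names (`ngrams`, `f1`, `rouge_n`, `lcs_len`, `rouge_l`) are made up for illustration, and tokenization is done by hand (lowercased, punctuation dropped, "improves" reduced to "improve" as a stand-in for `use_stemmer=True`).

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def f1(matches, ref_len, pred_len):
    """Harmonic mean of precision (matches/pred_len) and recall (matches/ref_len)."""
    if matches == 0:
        return 0.0
    precision = matches / pred_len
    recall = matches / ref_len
    return 2 * precision * recall / (precision + recall)

def rouge_n(ref, pred, n):
    """N-gram overlap F1; each n-gram is matched at most as often as it occurs."""
    ref_counts = Counter(ngrams(ref, n))
    pred_counts = Counter(ngrams(pred, n))
    overlap = sum((ref_counts & pred_counts).values())
    return f1(overlap, sum(ref_counts.values()), sum(pred_counts.values()))

def lcs_len(a, b):
    """Length of the longest common subsequence (dynamic programming, row by row)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(ref, pred):
    """LCS-based F1 over the two token sequences."""
    return f1(lcs_len(ref, pred), len(ref), len(pred))

# Hand-normalized tokens (assumption: this mimics the library's tokenizer + stemmer).
ref = ("exercise improve cardiovascular health boosts mood helps with weight "
       "management strengthens muscles and bones and can reduce the risk of "
       "chronic diseases like diabetes and hypertension").split()
pred = ("regular physical activity can improve heart health regulate body weight "
        "enhance mental well being and lower the chances of conditions like "
        "diabetes and high blood pressure").split()

scores = {
    "rouge1": round(rouge_n(ref, pred, 1), 4),
    "rouge2": round(rouge_n(ref, pred, 2), 4),
    "rougeL": round(rouge_l(ref, pred), 4),
}
print(scores)  # {'rouge1': 0.3846, 'rouge2': 0.08, 'rougeL': 0.3462}
```

Both texts have 26 tokens, so precision and recall coincide: 10 shared unigrams give 10/26, 2 shared bigrams give 2/25, and an LCS of 9 words gives 9/26.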

📌 Interpretation
ROUGE-1 is moderate: the two answers share content words such as "health", "weight", and "diabetes", along with function words such as "and", "the", and "of".

ROUGE-2 is low because the prediction paraphrases the ground truth. Exact 2-word sequences such as “cardiovascular health” or “chronic diseases” do not appear in the prediction, which says “heart health” and “conditions” instead.

ROUGE-L is higher than ROUGE-2 because the longest common subsequence only requires matching words to appear in the same order, not next to each other.
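
The gap between ROUGE-1 and ROUGE-2 can be inspected by listing which n-grams the two answers actually share. The sketch below does this with `collections.Counter`; the helper name `ngram_counts` is illustrative, and the token lists are normalized by hand ("improves" written as "improve") under the assumption that this approximates what the stemming tokenizer produces.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hand-normalized tokens for the exercise example (assumption, see lead-in).
ref = ("exercise improve cardiovascular health boosts mood helps with weight "
       "management strengthens muscles and bones and can reduce the risk of "
       "chronic diseases like diabetes and hypertension").split()
pred = ("regular physical activity can improve heart health regulate body weight "
        "enhance mental well being and lower the chances of conditions like "
        "diabetes and high blood pressure").split()

# Counter intersection keeps each n-gram with the smaller of its two counts.
shared_unigrams = ngram_counts(ref, 1) & ngram_counts(pred, 1)
shared_bigrams = ngram_counts(ref, 2) & ngram_counts(pred, 2)

print(sorted(" ".join(g) for g in shared_unigrams.elements()))
# ['and', 'and', 'can', 'diabetes', 'health', 'improve', 'like', 'of', 'the', 'weight']
print(sorted(" ".join(g) for g in shared_bigrams.elements()))
# ['diabetes and', 'like diabetes']
```

Ten single words survive the paraphrase, but only two adjacent word pairs do, which is why ROUGE-2 drops so sharply.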

🆚 Compared with Semantic Metrics
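Metric	Value	Meaning
bert_score_f1	0.9249	Token-level semantic similarity from contextual embeddings (roberta-large in this run, not BERT itself). Closer to 1 means more similar meaning, but raw BERTScore values tend to fall in a narrow high band, so 0.92 is best read relative to other answers rather than as an absolute grade.
cosine_similarity	0.8387	Cosine similarity between the all-MiniLM-L6-v2 sentence embeddings of prediction and reference. It also indicates that the two are semantically close.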
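
The two semantic metrics differ in granularity: the cosine score compares one vector per answer, while BERTScore compares token vectors and greedily pairs each token with its best match on the other side. The following is a simplified sketch of that idea only, assuming no IDF weighting and no baseline rescaling. The function names and the 2-D "embeddings" are hypothetical stand-ins, not output from either model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_scores(cand_vecs, ref_vecs):
    """Precision: each candidate token takes its best-matching reference token.
    Recall: each reference token takes its best-matching candidate token."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy token vectors (hypothetical): two tokens per text.
cand = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [1.0, 1.0]]

p, r, f = greedy_match_scores(cand, ref)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.8536 0.8536 0.8536
```

Because every token is paired with its most similar counterpart, a paraphrase such as "heart health" for "cardiovascular health" can still score well even though no exact bigram matches.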

🔎 Together these suggest that the low ROUGE scores reflect paraphrasing rather than a wrong answer: the prediction is semantically close to the reference.

✅ When to Use ROUGE vs. Semantic Metrics
Use Case	Prefer ROUGE	Prefer BERT / Cosine
Exact phrasing matters	✅	❌
Meaning over wording	❌	✅
Short answers (e.g. QA)	✅	✅
Long, paraphrased responses	❌	✅
Summarization	✅	✅ (BERTScore recommended too)

🔁 Conclusion
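ROUGE-1 = 0.38 → moderate word overlap.

ROUGE-2 = 0.08 → low phrase overlap due to paraphrasing.

ROUGE-L = 0.35 → moderate in-order overlap, slightly below ROUGE-1.

BERTScore (0.92) and cosine similarity (0.84) indicate that the prediction is semantically close to the reference even though it uses different words. Both measure similarity to the reference; neither checks fluency or factual correctness on its own.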

