# 🧪 LLM Output Evaluation (BLEU, ROUGE, METEOR, BERTScore, and More!)

Yo, friend! Ready to see how good your LLM’s answers are? 🎉
We’re diving into evaluating model outputs using BLEU, ROUGE, METEOR, BERTScore, and a hallucination check!

In [None]:
# Step 1: Install the tools we need
# Grabbing HuggingFace’s evaluate and other goodies for metrics. Let’s roll!
!pip install -q evaluate==0.4.3 sacrebleu==2.4.3 rouge-score==0.1.2 sentence-transformers==3.0.1 nltk==3.8.1 bert-score==0.3.13 fsspec==2024.6.0
import nltk
nltk.download('wordnet')  # Needed for METEOR
print("🎉 Evaluation tools loaded — time to judge some outputs!")

In [2]:
# Step 2: Clear cache to avoid loading issues
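# Let’s start fresh to dodge any pesky cache errors!
# Heads up: the path below is the Colab default, and this deletes every cached Hugging Face model and dataset.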
try:
    import shutil
    import os
    cache_dir = "/root/.cache/huggingface"
    if os.path.exists(cache_dir):
        shutil.rmtree(cache_dir)
        print("🧹 Cleared cache to start fresh!")
    os.makedirs(cache_dir, exist_ok=True)
except Exception as e:
    print(f"😕 Cache clearing failed: {e}")
    raise

In [3]:
# Step 3: Set up sample predictions and references
# These are our model’s answers (predictions) vs. the correct answers (references).
try:
    predictions = [
        "The capital of France is Paris.",
        "Cats are mammals that are often kept as pets.",
        "The moon orbits the Earth every 27.3 days.",
        "AI is transforming industries like healthcare and finance."
    ]
    references = [
        "Paris is the capital of France.",
        "Cats are common domestic animals and mammals.",
        "The moon completes one orbit of Earth roughly every 27 days.",
        "Artificial intelligence is revolutionizing sectors like healthcare and finance."
    ]
    print(f"📝 Loaded {len(predictions)} predictions and references — ready to evaluate!")
except Exception as e:
    print(f"😕 Failed to set up data: {e}")
    raise

📝 Loaded 4 predictions and references — ready to evaluate!


In [4]:
# Step 4: Evaluate with BLEU
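# BLEU checks n-gram (word sequence) overlap between predictions and references (like a matching game).
# sacrebleu reports scores on a 0-100 scale, and single-sentence scores are noisy, so treat them as rough.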
try:
    import sacrebleu
    bleu_scores = [sacrebleu.corpus_bleu([pred], [[ref]]) for pred, ref in zip(predictions, references)]
    avg_bleu = sum(score.score for score in bleu_scores) / len(bleu_scores)
    print(f"\n🔵 BLEU Score (average): {avg_bleu:.4f}")
    for i, score in enumerate(bleu_scores):
        print(f"   Sample {i+1}: {score.score:.4f}")
except Exception as e:
    print(f"😕 BLEU evaluation failed: {e}")
    raise


🔵 BLEU Score (average): 22.8866
   Sample 1: 29.0715
   Sample 2: 9.9801
   Sample 3: 10.6933
   Sample 4: 41.8013
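
To make the "matching game" concrete, here is a minimal sketch of the clipped n-gram precision that BLEU is built on. It assumes plain whitespace tokenization, and `ngrams` and `clipped_precision` are hypothetical helper names made up for this example. It is not sacrebleu's implementation, which adds its own tokenizer, a brevity penalty, and smoothing, so these numbers will not match the scores above.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(prediction, reference, n):
    """Fraction of prediction n-grams that also occur in the reference.

    Each n-gram's count is clipped to its count in the reference, so
    repeating a matching word cannot inflate the score.
    """
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    pred_counts = Counter(ngrams(prediction.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matched / total

pred = "The capital of France is Paris."
ref = "Paris is the capital of France."
for n in (1, 2):
    print(f"{n}-gram precision: {clipped_precision(pred, ref, n):.2f}")
```

Notice that punctuation sticks to the last word under whitespace splitting ("Paris." is not "Paris"), which is one reason real BLEU tools tokenize before counting.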


In [5]:
# Step 5: Evaluate with ROUGE
# ROUGE measures overlap in words, phrases, and sequences (great for text similarity).
try:
    from rouge_score import rouge_scorer
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    rouge_scores = [scorer.score(ref, pred) for pred, ref in zip(predictions, references)]
    avg_rouge = {'rouge1': 0, 'rouge2': 0, 'rougeL': 0}
    for score in rouge_scores:
        for key in avg_rouge:
            avg_rouge[key] += score[key].fmeasure / len(rouge_scores)
    print("\n🟠 ROUGE Scores (average):")
    for k, v in avg_rouge.items():
        print(f"   {k}: {v:.4f}")
except Exception as e:
    print(f"😕 ROUGE evaluation failed: {e}")
    raise


🟠 ROUGE Scores (average):
   rouge1: 0.6658
   rouge2: 0.3413
   rougeL: 0.5825
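
ROUGE-L is the one that looks at sequences: it scores the longest common subsequence (LCS) of words shared by prediction and reference. Here is a minimal sketch of that idea, assuming whitespace tokens and no stemming. `lcs_length` and `rouge_l_f1` are hypothetical helpers written for this example, not part of `rouge_score`, so the numbers will not match the averages above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(prediction, reference):
    """F-measure built from LCS-based precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)  # share of the prediction that is in the LCS
    recall = lcs / len(ref_tokens)      # share of the reference that is in the LCS
    return 2 * precision * recall / (precision + recall)

print(f"ROUGE-L F1: {rouge_l_f1('cats are mammals', 'cats are common mammals'):.4f}")
```

Because a subsequence can skip words, the extra "common" in the reference lowers recall but does not break the match.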


In [6]:
# Step 6: Evaluate with METEOR
# METEOR also matches word stems and WordNet synonyms and penalizes scrambled word order,
# so it is more forgiving of paraphrases than BLEU.
try:
    import evaluate
    meteor = evaluate.load("meteor")
    meteor_score = meteor.compute(predictions=predictions, references=references)
    print(f"\n🟢 METEOR Score: {meteor_score['meteor']:.4f}")
except Exception as e:
    print(f"😕 METEOR evaluation failed: {e}")
    raise

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...



🟢 METEOR Score: 0.5388


In [7]:
# Step 7: Evaluate with BERTScore
# BERTScore uses embeddings to check semantic similarity (super smart for meaning!).
try:
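    import evaluate
    # Scoring options such as model_type and device are arguments of compute(), not load().
    # With lang="en" and no model_type, bert-score uses roberta-large (about 1.4 GB).
    # For a lighter model, pass model_type="distilbert-base-uncased" to compute(); scores will differ.
    bertscore = evaluate.load("bertscore")
    bertscore_result = bertscore.compute(predictions=predictions, references=references, lang="en")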
    print(f"\n🟣 BERTScore (average):")
    print(f"   Precision: {sum(bertscore_result['precision']) / len(bertscore_result['precision']):.4f}")
    print(f"   Recall: {sum(bertscore_result['recall']) / len(bertscore_result['recall']):.4f}")
    print(f"   F1: {sum(bertscore_result['f1']) / len(bertscore_result['f1']):.4f}")
except Exception as e:
    print(f"😕 BERTScore evaluation failed: {e}")
    raise

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



🟣 BERTScore (average):
   Precision: 0.9474
   Recall: 0.9388
   F1: 0.9430
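
Where do precision and recall come from when there are no exact word matches? BERTScore matches each token to its most similar token on the other side by cosine similarity. The sketch below shows that greedy matching with tiny hand-made 2-D vectors standing in for real transformer embeddings. The vectors and the helper `greedy_match_scores` are invented for illustration; the real library also offers options like idf weighting and baseline rescaling that are left out here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def greedy_match_scores(pred_vecs, ref_vecs):
    """Return (precision, recall, f1) from greedy cosine matching."""
    # Precision: each prediction token takes its best match in the reference.
    precision = sum(max(cosine(p, r) for r in ref_vecs) for p in pred_vecs) / len(pred_vecs)
    # Recall: each reference token takes its best match in the prediction.
    recall = sum(max(cosine(r, p) for p in pred_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 2-D "embeddings" made up for this example; real ones come from a transformer.
pred_vecs = [[1.0, 0.0], [0.0, 1.0]]
ref_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
precision, recall, f1 = greedy_match_scores(pred_vecs, ref_vecs)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Every prediction vector has a perfect partner, so precision is 1.0, while the third reference vector only finds a partial match and pulls recall down.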


In [8]:
# Step 8: Hallucination check
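# A simple check that flags predictions whose character-level similarity to the reference is low (ratio below 0.6).
# It measures wording, not facts, so a correct paraphrase can still get flagged.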
try:
    import difflib
    def hallucination_flags(preds, refs):
        flags = []
        for pred, ref in zip(preds, refs):
            ratio = difflib.SequenceMatcher(None, pred.lower(), ref.lower()).ratio()
            flags.append("Possible hallucination" if ratio < 0.6 else "Looks good")
        return flags
    flags = hallucination_flags(predictions, references)
    print("\n🚨 Hallucination Flags:")
    for i, (pred, flag) in enumerate(zip(predictions, flags)):
        print(f"   Sample {i+1}: '{pred}' → {flag}")
    print("\n👉 Note: This is a basic check. For better hallucination detection, try an NLI (entailment) model like facebook/bart-large-mnli!")
except Exception as e:
    print(f"😕 Hallucination check failed: {e}")
    raise


🚨 Hallucination Flags:
   Sample 1: 'The capital of France is Paris.' → Looks good
   Sample 2: 'Cats are mammals that are often kept as pets.' → Possible hallucination
   Sample 3: 'The moon orbits the Earth every 27.3 days.' → Looks good
   Sample 4: 'AI is transforming industries like healthcare and finance.' → Looks good

👉 Note: This is a basic check. For better hallucination detection, try an NLI (entailment) model like facebook/bart-large-mnli!
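
The flags above hide the number behind them, which makes the 0.6 cutoff hard to judge. Here is a small sketch that returns the raw ratio next to each flag so you can see how close a sample sits to the threshold and try other cutoffs. `similarity_report` and its `threshold` parameter are hypothetical names for this example; the ratio is the same `difflib` character-level measure used in Step 8.

```python
import difflib

def similarity_report(preds, refs, threshold=0.6):
    """Return a (ratio, flag) tuple per prediction/reference pair."""
    report = []
    for pred, ref in zip(preds, refs):
        # Character-level similarity in [0, 1]; it compares wording, not facts.
        ratio = difflib.SequenceMatcher(None, pred.lower(), ref.lower()).ratio()
        flag = "Possible hallucination" if ratio < threshold else "Looks good"
        report.append((ratio, flag))
    return report

sample_preds = [
    "Cats are mammals that are often kept as pets.",
    "The moon orbits the Earth every 27.3 days.",
]
sample_refs = [
    "Cats are common domestic animals and mammals.",
    "The moon completes one orbit of Earth roughly every 27 days.",
]
for pred, (ratio, flag) in zip(sample_preds, similarity_report(sample_preds, sample_refs)):
    print(f"{ratio:.2f}  {flag}  <- {pred}")
```

Try calling it with a different `threshold` to see how easily a flag flips, which is a good reminder to treat this check as a rough filter only.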


# 📚 Tips for Having Fun
- Add your own predictions and references in Step 3 to test your model’s outputs.
- Try a learned metric like BLEURT, which is trained on human ratings (install via `pip install git+https://github.com/google-research/bleurt.git`). SacreBLEU is already in use in Step 4.
- For hallucination checks, explore NLI (entailment) models like facebook/bart-large-mnli.
- Check out HuggingFace’s evaluate docs (https://huggingface.co/docs/evaluate) for more metric magic!

# 🚀 What’s Next?
- Save this as your third notebook in your learning path.
- Revisit your Fine-Tuning notebook (step 4) to evaluate your DistilBERT model’s outputs.
- Dive deeper into RAG with ChromaDB or Weaviate (steps 5 and 6) for more fun!
