🎯 Let’s kick off **LLM Evaluation Lab 07** — where we compare **BLEU**, **ROUGE**, and **BERTScore** in one unified pipeline. This is your go-to suite when someone asks: *“Which model generates better text?”*

---

# 📒 `07_lab_bleu_rouge_bertscore_eval_suite.ipynb`  
## 📁 `05_llm_engineering/05_llm_evaluation`

---

## 🎯 **Notebook Goals**

- Load ground-truth and generated LLM outputs  
- Compute:
  - BLEU (n-gram precision with a brevity penalty)
  - ROUGE (recall-oriented n-gram and longest-common-subsequence overlap)
  - BERTScore (semantic similarity from contextual embeddings)
- Compare metrics across models (e.g., GPT-2 vs FLAN-T5 vs Falcon)
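
To make the precision versus recall distinction concrete, here is a minimal pure-Python sketch of clipped n-gram overlap. The helper names (`ngrams`, `ngram_precision_recall`) and the whitespace tokenization are assumptions for illustration; the library metrics used later in this lab tokenize differently and add further steps such as BLEU's brevity penalty.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by candidate size (precision) and reference size (recall)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count, which clips repeated n-grams
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall

p, r = ngram_precision_recall("the cat sat", "the cat sat on the mat")
print(f"precision={p:.2f} recall={r:.2f}")  # precision=1.00 recall=0.50
```

A short candidate that only copies part of the reference scores perfectly on precision but poorly on recall, which is why BLEU-style and ROUGE-style scores can disagree on the same output.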

---

## ⚙️ 1. Install Required Libraries

```bash
!pip install evaluate rouge_score bert-score matplotlib
```

---

## 🧪 2. Sample Dataset

You can use your own generations or demo with this:

```python
references = [
    "The Eiffel Tower is located in Paris.",
    "Machine learning allows computers to learn from data.",
    "Photosynthesis occurs in the chloroplasts of plant cells."
]

candidates = [
    "Paris has the Eiffel Tower.",
    "ML helps computers learn things from data.",
    "Plants do photosynthesis using chloroplasts."
]
```
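
If you swap in your own generations, the two lists must stay aligned pairwise. A small sanity check along these lines may help; `check_aligned` is a hypothetical helper, not part of any library used in this lab.

```python
def check_aligned(references, candidates):
    """Raise ValueError if the two lists cannot be scored pairwise."""
    if len(references) != len(candidates):
        raise ValueError(
            f"{len(references)} references but {len(candidates)} candidates"
        )
    for i, (ref, cand) in enumerate(zip(references, candidates)):
        # Empty strings tend to produce misleading or undefined metric values
        if not ref.strip() or not cand.strip():
            raise ValueError(f"empty text at index {i}")
    return len(references)

n_pairs = check_aligned(
    ["The Eiffel Tower is located in Paris."],
    ["Paris has the Eiffel Tower."],
)
print(n_pairs)  # 1
```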

---

## 📊 3. BLEU & ROUGE via 🤗 `evaluate`

```python
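import evaluate

# Load the metric implementations from the Hugging Face Hub
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")  # needs the rouge_score package

# BLEU accepts several references per prediction, so wrap each one in a list
bleu_result = bleu.compute(predictions=candidates, references=[[ref] for ref in references])
rouge_result = rouge.compute(predictions=candidates, references=references)

# Corpus BLEU with the default max_order=4 can come out as 0.0 on short
# paraphrases like the demo set, where no 4-gram matches a reference
print("BLEU:", bleu_result["bleu"])
print("ROUGE-L:", rouge_result["rougeL"])
```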
```
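
For intuition about what `rougeL` measures, here is a minimal sketch of a ROUGE-L F1 score built on the longest common subsequence (LCS). The function names and the whitespace tokenization are assumptions for illustration; the `rouge_score` backend applies its own tokenization, so its numbers will not match this sketch exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Harmonic mean of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the cat is on the mat", "the cat sat on the mat")
print(round(score, 4))  # 0.8333
```

Because the LCS only requires tokens to appear in the same order, not contiguously, ROUGE-L tolerates insertions and substitutions better than fixed n-gram matching.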

---

## 🧠 4. BERTScore

```python
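import bert_score

# score() returns per-sentence precision, recall, and F1 tensors;
# the first call downloads a pretrained model for lang="en"
P, R, F1 = bert_score.score(candidates, references, lang="en", verbose=True)
print("BERTScore F1 (avg):", F1.mean().item())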
```
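
The core idea behind BERTScore is greedy matching of token embeddings by cosine similarity. The sketch below illustrates that idea with made-up 2-dimensional vectors; the vectors and the `greedy_match_prf` helper are assumptions for illustration, while the real library uses contextual embeddings from a pretrained model and offers optional idf weighting and baseline rescaling.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def greedy_match_prf(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall, and F1 over token vectors."""
    # Precision: each candidate token takes its best-matching reference token
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token takes its best-matching candidate token
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy embeddings (hypothetical): the reference has one extra token
# that the candidate only partially covers
cand_vecs = [[1.0, 0.0], [0.0, 1.0]]
ref_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f1 = greedy_match_prf(cand_vecs, ref_vecs)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Because matching is soft, a paraphrase with no exact word overlap can still score well as long as its token embeddings sit close to the reference's.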

---

## 📈 5. Visualize Metric Spread

```python
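import matplotlib.pyplot as plt

# One bar per metric, using the averaged scores computed above.
# The three metrics are on different scales, so compare the same metric
# across models rather than reading one bar against another.
plt.bar(["BLEU", "ROUGE-L", "BERTScore F1"], [
    bleu_result["bleu"],
    rouge_result["rougeL"],
    F1.mean().item()
])
plt.title("LLM Evaluation Metrics")
plt.ylabel("Score")
plt.ylim(0, 1)  # all three scores are expected to fall in [0, 1] here
plt.show()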
```
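
The goals mention comparing several models, while the cells above score a single candidate list. One way to extend the pipeline is sketched below; `compare_models`, `exact_match`, and the `model_a` / `model_b` names with their outputs are hypothetical placeholders, not real generations.

```python
def compare_models(model_outputs, references, metrics):
    """Return {model_name: {metric_name: score}} for every model/metric pair."""
    table = {}
    for model_name, candidates in model_outputs.items():
        table[model_name] = {
            metric_name: metric_fn(candidates, references)
            for metric_name, metric_fn in metrics.items()
        }
    return table

def exact_match(candidates, references):
    """Placeholder metric: fraction of candidates identical to their reference."""
    hits = sum(c.strip() == r.strip() for c, r in zip(candidates, references))
    return hits / len(references)

# Placeholder data purely for illustration
refs = ["a b c", "d e f"]
outputs = {"model_a": ["a b c", "d e f"], "model_b": ["a b c", "x y z"]}

table = compare_models(outputs, refs, {"exact_match": exact_match})
for name, scores in table.items():
    print(name, scores)
```

In the notebook, each metric callable could wrap one of the calls from the earlier cells, for example a function that returns `bleu.compute(...)["bleu"]` for a given candidate list.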

---

## ✅ What You Built

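| Metric       | What It Measures                                                  |
|--------------|-------------------------------------------------------------------|
| BLEU         | N-gram precision with a brevity penalty                           |
| ROUGE        | Recall-oriented n-gram overlap (ROUGE-L uses the longest common subsequence) |
| BERTScore    | Semantic similarity from contextual embeddings                    |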

---

## ✅ Wrap-Up

| Task                 | Done |
|----------------------|------|
| Metrics computed     | ✅   |
| Results visualized   | ✅   |
| Pipeline Colab-ready | ✅   |

---

## 🔮 Next Lab?

📒 `08_lab_human_eval_grading_interface.ipynb`  
Build a UI to collect **human ratings** of LLM outputs — or plug in GPT-4 for automated review.

Ready to jump into *human eval meets LLM eval*, Professor?