🧠 Let’s move into **Lab 08** — this one’s about **human evaluation**, the gold standard in LLM testing. We’ll build a clean UI for **rating generations**, plug in **GPT-4 as a reviewer**, and even prep for **crowdsourcing** if needed.

---

# 📒 `08_lab_human_eval_grading_interface.ipynb`  
## 📁 `05_llm_engineering/05_llm_evaluation`

---

## 🎯 **Notebook Goals**

- Create a **notebook rating UI** with `ipywidgets` (Streamlit-style sliders and a submit button)
- Let humans grade:
  - Fluency ✅
  - Factual accuracy ✅
  - Relevance ✅
- Bonus: Plug in GPT-4 as an **automated rater** that approximates human grading

---

## ⚙️ 1. Sample Dataset: References vs Generations

```python
samples = [
    {
        "prompt": "What is photosynthesis?",
        "reference": "Photosynthesis is the process by which green plants use sunlight to synthesize food.",
        "generation": "Photosynthesis is how plants turn light into energy for survival."
    },
    {
        "prompt": "Where is the Eiffel Tower?",
        "reference": "The Eiffel Tower is in Paris, France.",
        "generation": "The Eiffel Tower is located in Berlin."
    }
]
```

---

## 🧪 2. Build Grading Form (Colab Friendly)

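```python
from IPython.display import display, Markdown
import ipywidgets as widgets

# Submitted ratings accumulate here, one dict per click on "Submit Score".
collected_scores = []

def create_grading_interface(sample):
    """Render one sample with three 1-5 sliders and a submit button."""
    display(Markdown(f"### 🧪 Prompt:\n{sample['prompt']}"))
    display(Markdown(f"**📖 Reference:** {sample['reference']}"))
    display(Markdown(f"**🤖 LLM Output:** {sample['generation']}"))

    fluency = widgets.IntSlider(description="Fluency", min=1, max=5, value=3)
    factual = widgets.IntSlider(description="Factuality", min=1, max=5, value=3)
    relevance = widgets.IntSlider(description="Relevance", min=1, max=5, value=3)
    submit = widgets.Button(description="Submit Score")

    output = widgets.Output()

    def on_submit_clicked(_):
        # Store the slider values so they can be analyzed after grading.
        collected_scores.append({
            "prompt": sample["prompt"],
            "fluency": fluency.value,
            "factuality": factual.value,
            "relevance": relevance.value,
        })
        with output:
            output.clear_output()  # avoid stacking messages on repeated clicks
            display(Markdown(
                f"✅ **Scores Submitted:** Fluency: {fluency.value}, "
                f"Factuality: {factual.value}, Relevance: {relevance.value}"
            ))

    submit.on_click(on_submit_clicked)

    display(fluency, factual, relevance, submit, output)

# One grading form per sample
for sample in samples:
    create_grading_interface(sample)
```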
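Once several people have graded the same output, you will usually want one number per dimension. Here is a minimal sketch, assuming each submitted rating is saved as a dict with `fluency`, `factuality`, and `relevance` keys. The names `mean_scores` and `ratings` are hypothetical, and the example ratings are made up for illustration.

```python
from statistics import mean

DIMENSIONS = ("fluency", "factuality", "relevance")

def mean_scores(ratings):
    """Average each dimension across a list of rating dicts."""
    if not ratings:
        raise ValueError("no ratings to aggregate")
    return {dim: mean(r[dim] for r in ratings) for dim in DIMENSIONS}

# Made-up ratings from three graders for the Eiffel Tower sample.
ratings = [
    {"fluency": 5, "factuality": 1, "relevance": 4},
    {"fluency": 4, "factuality": 1, "relevance": 5},
    {"fluency": 5, "factuality": 2, "relevance": 4},
]
print(mean_scores(ratings))
```

A plain mean hides disagreement between graders, so it may be worth looking at the spread per dimension before trusting the average.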

---

## 🧠 3. Simulate GPT-4 as Human Evaluator

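```python
from openai import OpenAI  # requires openai>=1.0

def gpt_evaluator(prompt, reference, generation):
    """Ask GPT-4 to rate one generation; returns the model's raw text reply."""
    client = OpenAI()  # reads the OPENAI_API_KEY environment variable
    system_prompt = "You are a helpful evaluator who rates LLM outputs. Rate the fluency, factuality, and relevance on a scale from 1 to 5."
    user_input = f"""
    Prompt: {prompt}
    Reference: {reference}
    Output: {generation}
    Rate each dimension (fluency, factuality, relevance) from 1 to 5.
    """
    # openai.ChatCompletion.create was removed in openai 1.0; use the client API.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input}
        ]
    )
    return response.choices[0].message.content

# Example (uncomment to run; this makes a billed API call)
# print(gpt_evaluator(**samples[0]))
```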
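The evaluator returns free text, so the scores still have to be pulled out before they can sit next to the human ratings. Here is a rough sketch, assuming the reply contains lines shaped like `Fluency: 4` (the prompt above does not guarantee that format). `parse_scores` is a hypothetical helper, not part of the OpenAI library.

```python
import re

DIMENSIONS = ("fluency", "factuality", "relevance")

def parse_scores(reply):
    """Extract a 1-5 integer per dimension from the evaluator's text reply.

    A dimension that is missing or not in "Name: N" form maps to None.
    """
    scores = {}
    for dim in DIMENSIONS:
        # Accepts "Fluency: 5", "**Fluency:** 5", "fluency - 4/5", "Fluency = 3"
        pattern = rf"{dim}\**\s*[:=-]\s*\**\s*([1-5])\b"
        match = re.search(pattern, reply, flags=re.IGNORECASE)
        scores[dim] = int(match.group(1)) if match else None
    return scores

# Made-up reply text, only to show the expected shape.
print(parse_scores("Fluency: 5\nFactuality: 1\nRelevance: 4"))
```

Regex parsing is fragile. Asking the model to answer in JSON and parsing that instead is a common alternative, at the cost of a stricter prompt.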

---

## ✅ What You Built

| Feature         | Role |
|------------------|------|
| Rating UI        | Collect human scores |
| GPT-4 Auto Rater | Optional evaluator (for cost/time saving) |
| Manual + Automated | ✅ Combined pipeline |
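
One way to make the manual and automated halves work together is to flag samples where the two disagree, so a human takes a second look. This is a sketch under assumptions: both sides are already available as score dicts, `flag_disagreements` and the threshold of 2 points are hypothetical choices, and the scores below are made up.

```python
def flag_disagreements(human, model, threshold=2):
    """Return the dimensions where human and model scores differ by >= threshold."""
    return [
        dim for dim in human
        if dim in model and abs(human[dim] - model[dim]) >= threshold
    ]

human = {"fluency": 5, "factuality": 1, "relevance": 4}  # made-up human rating
model = {"fluency": 5, "factuality": 4, "relevance": 4}  # made-up GPT-4 rating
print(flag_disagreements(human, model))  # prints ['factuality']
```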

---

## ✅ Wrap-Up

| Task                                               | Done |
|----------------------------------------------------|------|
| Built LLM rating interface                         | ✅ |
| Collected fluency, factuality, and relevance scores | ✅ |
| Plugged in optional GPT-4 evaluator                | ✅ |

---

## 🔮 Next Lab

📒 `09_lab_bias_and_toxicity_metrics_demo.ipynb`  
Let’s analyze **bias and toxicity** in LLM outputs — using open-source tools to detect stereotypes, slurs, or harmful completions.

Ready to go ethical, Professor?