# NB10: LLM App Evaluation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RJuro/unistra-nlp2026/blob/main/notebooks/NB10_llm_evaluation.ipynb)

**Time:** ~40 min

## Learning Goals

- Design evaluation rubrics for LLM-based applications
- Implement automated metrics (accuracy, F1, precision, recall)
- Combine quantitative and qualitative evaluation approaches
- Understand RAGAS concepts for evaluating retrieval-augmented generation

In [None]:
!pip install openai pydantic pandas scikit-learn numpy datasets -q

import os
import json
import time

import pandas as pd
import numpy as np
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Literal, Optional
from datasets import load_dataset
from sklearn.metrics import accuracy_score, classification_report, f1_score

print("All imports successful.")

## 1. Why Evaluation Matters

LLM outputs often *look* good -- they are fluent, coherent, and confident. But looking good is not the same as being correct. Without systematic evaluation, you can easily deploy a system that:

- Misclassifies edge cases that seem reasonable on the surface
- Hallucinates facts in a retrieval-augmented pipeline
- Degrades silently when the input distribution shifts

Evaluation is how we move from "it seems to work" to "we know it works, and we know where it fails."

There are two complementary approaches:

1. **Automated metrics** -- fast, reproducible, scalable. Use them to catch regressions and compare models.
2. **Human judgment** -- slower but catches things metrics miss. Use rubric-based scoring and qualitative error analysis.

A robust evaluation strategy uses **both**.

In [None]:
GROQ_API_KEY = ""  # @param {type:"string"}

# If not set above, try Colab secrets, then fall back to the environment variable
if not GROQ_API_KEY:
    try:
        from google.colab import userdata
        GROQ_API_KEY = userdata.get('GROQ_API_KEY')
    except Exception:  # not running in Colab, or the secret is missing
        GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "")

client = OpenAI(
    api_key=GROQ_API_KEY,
    base_url="https://api.groq.com/openai/v1"
)

# === Model Selection ===
MODEL_FAST = "moonshotai/kimi-k2-instruct"   # Classification + judgment
MODEL_SMART = "moonshotai/kimi-k2-instruct"  # Rubric scoring (same model, strong enough for both)

# Test the connection
resp = client.chat.completions.create(
    model=MODEL_FAST,
    messages=[{"role": "user", "content": "Say 'ready'"}],
    max_tokens=5
)
print(resp.choices[0].message.content)

## 2. Building an Eval Set

The foundation of any evaluation is a **gold-standard eval set** -- a curated collection of examples where we know the correct answer. This set should:

- Be representative of your real data distribution
- Include edge cases and difficult examples
- Have verified, high-quality ground-truth labels
- Be large enough to give stable metrics (30+ examples minimum, 100+ for per-class metrics)
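
To make the sample-size point concrete, the sketch below computes an approximate 95% Wilson score interval for an observed accuracy. The helper name `wilson_interval` and the 80% accuracy figure are illustrative assumptions, not part of the notebook's pipeline.

```python
import math

def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for an accuracy of correct/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Same observed accuracy (80%) at three hypothetical eval-set sizes
for n in (30, 100, 1000):
    lo, hi = wilson_interval(round(0.8 * n), n)
    print(f"n={n:>4}: accuracy 80%, interval [{lo:.2f}, {hi:.2f}]")
```

With 30 examples the interval spans well over twenty percentage points, so small differences between two models on an eval set of that size should not be over-interpreted.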

We use the **tweet_eval stance detection** dataset (climate change topic). Stance classification asks: does this tweet express a position in **favor** of climate change action, **against** it, or **none** (neutral or unrelated)? This is inherently harder than topic classification because it requires understanding the author's *position*, not just the *subject*, which makes it a useful testbed for evaluation.

In [None]:
# Load tweet_eval stance dataset (climate change topic)
ds = load_dataset("cardiffnlp/tweet_eval", "stance_climate")

STANCE_LABELS = {0: "none", 1: "against", 2: "favor"}

# Convert test split to DataFrame
test_full = pd.DataFrame(ds["test"])
test_full["label_name"] = test_full["label"].map(STANCE_LABELS)

# Sample 30 examples as eval set
eval_set = test_full.sample(30, random_state=42)[['text', 'label_name']].reset_index(drop=True)
eval_set.columns = ['input_text', 'expected_label']

print(f"Eval set: {len(eval_set)} examples")
print(f"\nStance distribution:")
print(eval_set['expected_label'].value_counts())
print(f"\nExample tweet:")
print(eval_set.iloc[0]['input_text'][:200])
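
A plain random sample of 30 can end up with very few examples of a rare class. One possible alternative is stratified sampling, sketched here in pure Python. The `stratified_sample` helper and the toy pool are hypothetical and only illustrate the idea.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, label_key, per_class, seed=42):
    """Pick up to `per_class` rows for each label, reproducibly."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        # Take the whole group when it is smaller than per_class
        sample.extend(rng.sample(group, min(per_class, len(group))))
    return sample

# Hypothetical imbalanced pool: 20 favor, 6 none, 3 against
pool = (
    [{"text": f"favor tweet {i}", "label": "favor"} for i in range(20)]
    + [{"text": f"none tweet {i}", "label": "none"} for i in range(6)]
    + [{"text": f"against tweet {i}", "label": "against"} for i in range(3)]
)
picked = stratified_sample(pool, "label", per_class=5)
print(Counter(row["label"] for row in picked))
```

The trade-off is that a stratified set no longer mirrors the real class distribution, so overall accuracy on it is not an estimate of production accuracy. Per-class recall and macro F1 become more stable in exchange.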

## 3. Automated Metrics

For classification tasks, the standard automated metrics are:

- **Accuracy** -- fraction of correct predictions. Simple but can be misleading with imbalanced classes.
- **Macro F1** -- average of per-class F1 scores. Treats all classes equally regardless of their size.
- **Per-class precision/recall** -- shows exactly where the model succeeds and where it fails.

Stance classification is particularly interesting because the three classes have **different confusion costs**: confusing "favor" with "against" is a much more serious error than confusing either with "none." We will run the LLM on every example in our eval set and compute these metrics.
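
As a small worked example of why accuracy can mislead on imbalanced classes, the following sketch computes the metrics by hand on hypothetical toy labels. The helper `per_class_prf` is an illustration written for this example (the notebook itself relies on scikit-learn), and it treats undefined precision or recall as 0, which matches `zero_division=0`.

```python
from collections import Counter

def per_class_prf(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} computed from raw counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # label p was predicted, but wrongly
            fn[t] += 1  # true label t was missed
    out = {}
    for lab in labels:
        prec = tp[lab] / (tp[lab] + fp[lab]) if tp[lab] + fp[lab] else 0.0
        rec = tp[lab] / (tp[lab] + fn[lab]) if tp[lab] + fn[lab] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        out[lab] = (prec, rec, f1)
    return out

toy_labels = ["none", "against", "favor"]
# Hypothetical imbalanced toy data: 8 favor, 1 against, 1 none
toy_true = ["favor"] * 8 + ["against", "none"]
toy_pred = ["favor"] * 10  # a degenerate classifier that always says "favor"

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_prf = per_class_prf(toy_true, toy_pred, toy_labels)
toy_macro_f1 = sum(f1 for _, _, f1 in toy_prf.values()) / len(toy_labels)

print(f"Accuracy: {toy_accuracy:.2f}")
print(f"Macro F1: {toy_macro_f1:.2f}")
```

The always-"favor" classifier reaches 80% accuracy, yet its macro F1 is below 0.3 because two of the three classes are never predicted.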

In [None]:
CATEGORIES = ["none", "against", "favor"]

class StancePrediction(BaseModel):
    label: Literal["none", "against", "favor"] = Field(description="Stance toward climate change action")

def classify_stance(text: str, max_retries: int = 3) -> Optional[StancePrediction]:
    """Classify stance with retry logic and Pydantic validation."""
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model=MODEL_FAST,
                messages=[
                    {"role": "system", "content": f"Determine the stance of this tweet toward climate change action. Classify into one of: {CATEGORIES}. 'favor' means supporting action on climate change, 'against' means opposing it, 'none' means neutral or unrelated. Return JSON with 'label' field only."},
                    {"role": "user", "content": text[:500]}
                ],
                response_format={"type": "json_object"},
                temperature=0.0,
                max_tokens=50
            )
            return StancePrediction.model_validate_json(resp.choices[0].message.content)
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
            else:
                return None

predictions = []
for _, row in eval_set.iterrows():
    result = classify_stance(row['input_text'])
    predictions.append(result.label if result else 'Unknown')
    time.sleep(0.1)

eval_set['predicted_label'] = predictions
valid = eval_set[eval_set.predicted_label != 'Unknown']

print(f"Accuracy: {accuracy_score(valid.expected_label, valid.predicted_label):.1%}")
print(f"Macro F1: {f1_score(valid.expected_label, valid.predicted_label, average='macro', zero_division=0):.3f}")
print(f"\n{classification_report(valid.expected_label, valid.predicted_label, zero_division=0)}")

# Note on interpreting these results:
# The eval set is likely imbalanced (mostly "favor" in this dataset).
# The LLM tends to hedge toward "none" — it under-predicts "favor" for tweets
# with implicit stance (e.g., discussing climate impacts without explicit policy language).
# This is a common LLM behavior on stance detection and makes the task a good eval testbed.
print(f"\nEval set class balance: {dict(valid.expected_label.value_counts())}")
print("Tip: Look at per-class recall to see which stances the LLM misses most.")

## 4. Rubric-based Evaluation

Automated metrics give you a number, but they treat all errors equally. A rubric-based evaluation lets you define **degrees of correctness**. For stance classification, we define an asymmetric rubric:

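- Confusing "favor" with "against" (or vice versa) is the **worst** classification error: the model got the position exactly backwards
- Confusing either stance with "none" is a **moderate** error: the model missed the stance but didn't invert it
- Exact match is a **perfect** score

The rubric below maps these cases onto a 1-5 scale and reserves the lowest score for outputs with no reasonable connection to the tweet (for example, a failed prediction recorded as 'Unknown').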

We use an LLM as a **judge** to apply this rubric consistently. This is sometimes called "LLM-as-judge" evaluation.
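
Before relying on an LLM judge, it can help to have a deterministic baseline to compare against. The sketch below encodes the asymmetric error costs as a lookup rule. The function name `rule_based_score` and the specific score values on a 1-5 scale are assumptions made for illustration.

```python
VALID_LABELS = {"none", "against", "favor"}

def rule_based_score(expected: str, predicted: str) -> int:
    """Deterministic stand-in for the asymmetric rubric (hypothetical score values)."""
    if predicted not in VALID_LABELS:
        return 1  # unusable output, e.g. a failed API call
    if predicted == expected:
        return 5  # exact match
    if "none" in (expected, predicted):
        return 3  # missed or invented a stance, but did not invert it
    return 2      # favor <-> against inversion

toy_pairs = [("favor", "favor"), ("favor", "none"), ("against", "favor"), ("none", "Unknown")]
for expected, predicted in toy_pairs:
    print(f"{expected:<8} {predicted:<8} {rule_based_score(expected, predicted)}")
```

Unlike an LLM judge, this rule never looks at the tweet, so it cannot tell a near miss on an ambiguous tweet from a clear miss. Large disagreements between the two scores are good candidates for manual review.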

In [None]:
RUBRIC = """Score the stance classification on a scale of 1-5:
5: Exact match with ground truth
4: Close — predicted 'none' when true stance was mild, or vice versa
3: Partially correct — got the general sentiment but wrong specific label
2: Opposite direction — confused 'favor' with 'against' or vice versa
1: Completely wrong, no reasonable connection

Return JSON: {"score": <int>, "explanation": "<brief reason>"}"""

class RubricScore(BaseModel):
    score: int = Field(ge=1, le=5, description="Rubric score 1-5")
    explanation: str = Field(description="Brief reason for the score")

def score_with_rubric(input_text: str, expected: str, predicted: str, max_retries: int = 3) -> Optional[RubricScore]:
    """Score a prediction using rubric + LLM-as-judge (uses smart model for quality)."""
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model=MODEL_SMART,  # Kimi-k2 for nuanced judgment
                messages=[
                    {"role": "system", "content": RUBRIC},
                    # Same 500-character cap as the classifier, so the judge sees the same text
                    {"role": "user", "content": f"Tweet: {input_text[:500]}\nExpected stance: {expected}\nPredicted stance: {predicted}"}
                ],
                response_format={"type": "json_object"},
                temperature=0.0,
                max_tokens=100
            )
            return RubricScore.model_validate_json(resp.choices[0].message.content)
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
            else:
                return None

# Score ALL examples in the eval set, keeping each score paired with its row
scored = []  # (row, RubricScore) pairs; failed judge calls are skipped
for i, (_, row) in enumerate(eval_set.iterrows()):
    result = score_with_rubric(row['input_text'], row['expected_label'], row['predicted_label'])
    if result:
        scored.append((row, result))
    time.sleep(0.15)  # Rate limiting for smart model
    if (i + 1) % 10 == 0:
        print(f"Scored {i+1}/{len(eval_set)}...")

# Show a few examples (row and score stay aligned even if some judge calls failed)
for row, s in scored[:5]:
    print(f"Score: {s.score} | Expected: {row['expected_label']:<10} | Predicted: {row['predicted_label']:<10} | {s.explanation[:60]}")

scores = [s for _, s in scored]
if scores:
    avg_score = np.mean([s.score for s in scores])
    print(f"\nAverage rubric score: {avg_score:.1f}/5 (across {len(scores)} examples)")
else:
    print("\nNo examples were scored; check the judge calls above.")

## 5. Qualitative Error Analysis

This is often the most useful part of evaluation. Numbers tell you *how much* the model fails; error analysis tells you *why*.

Stance detection is particularly revealing for error analysis because:
- Some tweets use **sarcasm** — the model may take them literally
- Some tweets discuss climate change without taking a stance — the boundary between "none" and a mild stance is genuinely ambiguous
- Short tweets may lack sufficient context for reliable classification

Look at the actual misclassifications below. Are there patterns?

In [None]:
errors = eval_set[eval_set.expected_label != eval_set.predicted_label]
print(f"Errors: {len(errors)}/{len(eval_set)} ({len(errors)/len(eval_set):.0%})")

# Check for the worst kind of error: favor↔against confusion
if len(errors) > 0:
    flipped = errors[
        ((errors.expected_label == 'favor') & (errors.predicted_label == 'against')) |
        ((errors.expected_label == 'against') & (errors.predicted_label == 'favor'))
    ]
    print(f"Stance inversions (favor↔against): {len(flipped)}/{len(errors)} errors")

for i, (_, row) in enumerate(errors.head(5).iterrows()):
    print(f"\n--- Error {i+1} ---")
    print(f"Tweet: {row['input_text'][:150]}...")
    print(f"Expected: {row['expected_label']}")
    print(f"Predicted: {row['predicted_label']}")
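
To look for patterns across all errors rather than only the first five, one option is to tally the (expected, predicted) pairs. This is a minimal sketch on hypothetical toy labels. In the notebook the two lists would come from `eval_set.expected_label` and `eval_set.predicted_label`.

```python
from collections import Counter

def confusion_counts(expected, predicted):
    """Count (expected, predicted) pairs, keeping only the misclassified ones."""
    return Counter((e, p) for e, p in zip(expected, predicted) if e != p)

# Hypothetical labels standing in for the eval set columns
toy_expected = ["favor", "favor", "favor", "against", "none", "favor"]
toy_predicted = ["none", "none", "favor", "favor", "none", "none"]

toy_confusions = confusion_counts(toy_expected, toy_predicted)
for (e, p), count in toy_confusions.most_common():
    print(f"expected={e:<8} predicted={p:<8} count={count}")
```

A single dominant pair (here, "favor" predicted as "none") suggests a systematic bias that a prompt change might address, whereas evenly spread errors point to harder, example-specific problems.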

## 6. RAGAS Concepts for Retrieval

When evaluating **retrieval-augmented generation (RAG)** systems, classification metrics are not enough. You need to evaluate the full pipeline: retrieval quality *and* generation quality.

The [RAGAS](https://docs.ragas.io/) framework defines several metrics for this purpose. Three concepts matter most here:

1. **Faithfulness** -- Is the generated answer grounded in (supported by) the retrieved documents? A faithful answer does not add information beyond what the context provides.

2. **Context Relevance** -- Are the retrieved documents actually relevant to the question? Irrelevant context can mislead the generator.

3. **Answer Correctness** -- Is the final answer factually correct when compared against a ground-truth reference answer? An answer can be faithful to the retrieved context and still be incorrect if the context itself is wrong or incomplete.

Stance classification connects to these ideas: when an LLM classifies stance, we can ask whether its *reasoning* is faithful to the tweet's actual content, or whether it is projecting assumptions. Below is a simple faithfulness check you can adapt for RAG evaluation.

In [None]:
class FaithfulnessCheck(BaseModel):
    faithful: bool = Field(description="Whether the answer is faithful to the context")
    explanation: str = Field(description="Brief reason")

def check_faithfulness(question: str, context: str, answer: str, max_retries: int = 3) -> Optional[FaithfulnessCheck]:
    """Simple faithfulness check: is the answer supported by the context?"""
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model=MODEL_FAST,
                messages=[{"role": "user", "content": f"""Given this question, context, and answer, is the answer faithful to (supported by) the context?
Question: {question}
Context: {context[:500]}
Answer: {answer}
Return JSON: {{"faithful": true/false, "explanation": "brief reason"}}"""}],
                response_format={"type": "json_object"},
                temperature=0.0,
                max_tokens=100
            )
            return FaithfulnessCheck.model_validate_json(resp.choices[0].message.content)
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
            else:
                return None

# Example
result = check_faithfulness(
    "What causes climate change?",
    "Climate change is primarily caused by greenhouse gas emissions from burning fossil fuels.",
    "Climate change is caused by solar activity."
)
if result:
    print(f"Faithful: {result.faithful} — {result.explanation}")
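
The boolean check above gives one verdict per answer. A graded variant, in the spirit of how RAGAS describes faithfulness, splits the answer into individual claims and reports the fraction supported by the context. The sketch below shows only the aggregation step. The `ClaimVerdict` type and the verdicts are hypothetical, and the claim extraction and per-claim judging (which would need LLM calls) are omitted.

```python
from dataclasses import dataclass

@dataclass
class ClaimVerdict:
    claim: str
    supported: bool  # in practice this would come from a per-claim judge call

def faithfulness_score(verdicts: list[ClaimVerdict]) -> float:
    """Fraction of claims supported by the context; 0.0 if there are no claims."""
    if not verdicts:
        return 0.0
    return sum(v.supported for v in verdicts) / len(verdicts)

# Hypothetical verdicts for a three-claim answer
toy_verdicts = [
    ClaimVerdict("Greenhouse gases drive climate change.", True),
    ClaimVerdict("Burning fossil fuels emits greenhouse gases.", True),
    ClaimVerdict("Solar activity is the main cause.", False),
]
print(f"Faithfulness: {faithfulness_score(toy_verdicts):.2f}")
```

A fractional score makes it possible to track partial hallucination over time, where a single true/false flag would hide whether one claim or every claim was unsupported.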

## Exercise: Build Your Own Evaluation Pipeline

Apply the evaluation framework from this notebook to a different classification task:

1. **Pick a different dataset or label set** — you could use `dair-ai/emotion` (6 emotions) or any classification output from your project
2. **Run the LLM classifier** on 30+ examples
3. **Compute automated metrics** (accuracy, macro F1, classification report)
4. **Design a rubric** appropriate for your task (what are the worst errors? what's a "close miss"?)
5. **Score with rubric** and compare the rubric scores to the automated metrics

**Bonus:** Try two different LLM models (e.g., `llama-3.1-8b-instant` vs `qwen/qwen3-32b`) and compare their evaluation results. Does the more capable model score higher on both metrics and rubric?

In [None]:
# YOUR CODE HERE

# Step 1: Load a different dataset
# from datasets import load_dataset
# ds = load_dataset("dair-ai/emotion")
# eval_examples = pd.DataFrame(ds["test"]).sample(30, random_state=42)

# Step 2: Classify with LLM
# ...

# Step 3: Compute automated metrics
# accuracy = accuracy_score(eval_examples['true_label'], eval_examples['predicted_label'])
# print(classification_report(...))

# Step 4: Design and apply a rubric
# MY_RUBRIC = """..."""

# Step 5: Compare metrics vs rubric scores

## 7. Summary

A complete evaluation framework combines three approaches:

1. **Automated metrics** (accuracy, F1, precision/recall) -- fast, reproducible, good for regression testing and model comparison.
2. **Rubric-based scoring** (LLM-as-judge with defined criteria) -- captures degrees of correctness. For stance detection, this reveals that not all errors are equal.
3. **Qualitative error analysis** (manual inspection of failures) -- reveals *why* the system fails. For stance detection, common failure modes include sarcasm, implicit stance, and topic-adjacent tweets.

**Always use all three.** Metrics alone miss nuance. Rubrics alone miss scale. Qualitative analysis alone misses the big picture.

For RAG systems, add faithfulness and relevance checks (RAGAS concepts) to ensure the retrieval and generation components work together correctly.