
# LLM Evaluation Workshop — Hands‑On Notebook
**Goal:** give you the theory and a framework skeleton, with examples, for evaluating your own agent in **5 phases**:  
1) Vibe Checks → 2) Basic Metrics → 3) LLM‑as‑Judge (+RAG) → 4) Component Validation → 5) Production Monitoring (offline harness).

Use the **example code** as a reference and **replace the `MockAgent` and mock tools** with your own agent and tools. Feel free to use any programming language.

## The mock agent in this notebook is “Account Copilot”, an internal agent with 3 tools:
* calc: aggregates account metrics (sum/avg/min/max, grouped by time/app/source).
* policy_kb.search: retrieves policy snippets (RAG) with citations.
* actions.recommend: suggests next-best actions based on data patterns and policies.

  

> This notebook avoids external calls by default. Where relevant, hooks are provided to plug in your own LLM/judge and data sources.
> The main goal is for you to make every decision about how to evaluate your agent, as a route to "eval-driven development".


## 0. Setup — Agent Interface & Mocks
**This setup is only for the example. Use your own agent, built in your framework.**

In [None]:

# Minimal interfaces. Replace MockAgent with your real agent.
# Your task: define agent behavior and tools. 

from typing import List, Dict, Any
from dataclasses import dataclass, field
import math, json, random, statistics, re

@dataclass
class Step:
    tool: str
    args: Dict[str, Any]
    output: Dict[str, Any]

@dataclass
class AgentResult:
    final: Dict[str, Any]
    trace: List[Step]
    termination: str
    retrieved_context: List[Dict[str, Any]] = field(default_factory=list)

class MockTools:
    # Fake tool implementations for demo purposes only
    @staticmethod
    def calc(args):
        # Pretend to compute DAU sum last 7d
        if args.get("app") == "AcmeApp" and args.get("window") == "7d":
            return {"value": 124500.0}
        return {"value": 0.0, "note": "no data in window"}

    @staticmethod
    def policy_kb_search(args):
        text = "Raw PII exports are disallowed; use aggregated reports."
        return {"hits": [{"id": "kb#12", "text": text}]}

    @staticmethod
    def actions_recommend(args):
        ctx = [{"id": "policy-ops#34",
                "text": "When uninstall spikes exceed 20% WoW, consider churn-risk audiences and alerting."}]
        return {"context": ctx,
                "recommendation": "Create a churn-risk audience and enable an uninstall anomaly alert."}

class MockAgent:
    MAX_STEPS = 4
    def run(self, query: str) -> AgentResult:
        q = query.lower()
        trace = []
        # route
        if "dau" in q or "cpi" in q:
            tool = "calc"; args = {"metric": "DAU", "agg": "sum", "window": "7d", "app": "AcmeApp"}
            out = MockTools.calc(args)
            trace.append(Step(tool, args, out))
            return AgentResult(final={"answer": out["value"], "confidence": 0.9}, trace=trace, termination="success")
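        # Skip raw-PII export requests here so they reach the guardrail refusal branch below.
        if ("pii" in q or "policy" in q) and "export raw pii" not in q: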
            tool = "policy_kb.search"; args = {"query": query}
            out = MockTools.policy_kb_search(args)
            trace.append(Step(tool, args, out))
            return AgentResult(final={"answer": out["hits"][0]["text"],
                                      "citation": out["hits"][0]["id"], "confidence": 0.85},
                               trace=trace, termination="success", retrieved_context=out["hits"])
        if "uninstall" in q or "what should i do" in q or "recommend" in q:
            tool = "actions.recommend"; args = {"query": query}
            out = MockTools.actions_recommend(args)
            trace.append(Step(tool, args, out))
            return AgentResult(final={"recommendation": out["recommendation"],
                                      "citation": out["context"][0]["id"], "confidence": 0.8},
                               trace=trace, termination="success", retrieved_context=out["context"])
        if "export raw pii" in q:
            # guardrail refusal example
            out = MockTools.policy_kb_search({"query": "PII policy"})
            trace.append(Step("policy_kb.search", {"query":"PII policy"}, out))
            return AgentResult(final={"answer": "I can’t export raw PII. Use aggregated reports.",
                                      "citation": out["hits"][0]["id"]},
                               trace=trace, termination="refusal", retrieved_context=out["hits"])
        # default
        return AgentResult(final={"answer": "I don't have a tool for that.", "confidence": 0.2},
                           trace=trace, termination="refusal")

agent = MockAgent()
print("MockAgent ready. Replace with your own: agent = YourAgent()")


## Phase 1 — Vibe Checks (manual smoke tests)

**What we do:** Quick manual spot-checks to see whether the agent "feels" usable and makes sense before investing in formal evaluation.

**What we are looking for:**
* Does the agent return something at all?
* Is the format consistent (short answer, optional notes)?
* Does it avoid glaring hallucinations (e.g., making up policies)?
* Are recommendations sensible, even if not perfect?
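
Vibe checks stay manual, but a tiny pre-screen can point your eyes at the obviously broken responses first. The sketch below is one possible way to do that; `vibe_flags`, the flag names, and the 400-character limit are hypothetical choices, and it assumes the response dict uses the `answer` / `recommendation` / `citation` keys from the mock agent.

```python
from typing import Any, Dict, List

def vibe_flags(final: Dict[str, Any], max_chars: int = 400) -> List[str]:
    """Return flags for one agent response (an empty list means nothing obvious)."""
    flags = []
    # Use whichever payload field is present; keep numeric answers such as 0.0.
    value = final.get("answer", final.get("recommendation"))
    text = "" if value is None else str(value)
    if not text.strip():
        flags.append("empty_response")
    if len(text) > max_chars:
        flags.append("too_long")
    # Assumption: recommendations are expected to carry a citation id.
    if "recommendation" in final and not final.get("citation"):
        flags.append("missing_citation")
    return flags

print(vibe_flags({"answer": "", "confidence": 0.9}))
print(vibe_flags({"recommendation": "Enable an uninstall alert.",
                  "citation": "policy-ops#34"}))
```

A flagged response is a prompt to look closer, not a verdict; hallucinations and poor recommendations still need a human read at this phase.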



**Exercise:** If you have a working agent, run a few representative questions. Eyeball the answers for sanity: format, usefulness, obvious hallucinations.
Otherwise, define the main vibe checks you will run once your agent works, then move on.

> Replace the queries with your own. Keep outputs short and consistent.


In [None]:
# Write vibe checks that exercise every tool, not just the "happy path".
queries = [
    "DAU for AcmeApp last 7 days?",
    "What’s our PII handling policy?",
    "Uninstalls spiked 30% WoW. What should I do?"
]
for q in queries:
    res = agent.run(q)
    print("Q:", q)
    print("A:", res.final)
    print("termination:", res.termination)
    print("trace:", [ (s.tool, s.args) for s in res.trace ])
    print("-"*60)


## Phase 2 — Basic Metrics (quick automatic checks)
**What we do:** Apply simple, reference-based metrics (such as BLEU/ROUGE or string matching) to measure whether outputs roughly align with expectations.

**How this is done**
1. Collect a tiny eval set (5–10 examples) with expected answers.
2. Apply relevant metrics, e.g.:
* IsEqual
* ROUGE/BLEU (text similarity; sensitive to answer length)
* Contains/RegexMatch
* Schema match (IsJson)
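
The string-level metrics in the list above can each be a few lines of standard-library Python. This is a minimal sketch, not a reference implementation: the function names and the normalization choices (trimming, lowercasing) are assumptions you may want to change, and ROUGE/BLEU are left out because they are usually taken from a library.

```python
import json
import re

def is_equal(pred: str, ref: str) -> bool:
    """Exact match after trimming whitespace and lowercasing."""
    return pred.strip().lower() == ref.strip().lower()

def contains(pred: str, needle: str) -> bool:
    """Case-insensitive substring check."""
    return needle.lower() in pred.lower()

def regex_match(pred: str, pattern: str) -> bool:
    """True if the pattern occurs anywhere in the prediction."""
    return re.search(pattern, pred) is not None

def is_json(pred: str, required_keys=()) -> bool:
    """True if pred parses as a JSON object that holds every required key."""
    try:
        obj = json.loads(pred)
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError
        return False
    return isinstance(obj, dict) and all(k in obj for k in required_keys)

print(is_equal(" Yes ", "yes"))
print(contains("Create a churn-risk audience", "Churn-Risk"))
print(regex_match("DAU: 124,500", r"\d{3},\d{3}"))
print(is_json('{"answer": 1}', ["answer"]))
```

Each function returns a plain boolean, so a pass rate over the eval set is just the mean of the results.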



**Exercise:** 
* Define the metrics you should calculate for the entire process or for individual components.
* Create a tiny eval set with expected outputs and simple metrics.

**In this example:**
- numeric tolerance for metrics,
- contains/exact for policy/action phrases,
- optional ROUGE‑1 as a crude overlap score.

> Fill/extend `EVAL_SET` with your own cases.


In [None]:

from typing import Callable

def is_numeric_close(got, exp, tol=0.05):
    try:
        g = float(str(got).replace(",","")); e = float(str(exp).replace(",",""))
        return abs(g-e) <= tol*max(1.0, abs(e))
    except Exception:
        return False

def rouge1_recall(pred: str, ref: str) -> float:
    # token-level recall
    ps = pred.lower().split(); rs = ref.lower().split()
    if not rs: return 1.0
    from collections import Counter
    inter = sum((Counter(ps) & Counter(rs)).values())
    return inter / len(rs)

EVAL_SET = [
    {"q":"DAU for AcmeApp last 7 days?", "expected":"124500", "metric":"numeric"},
    {"q":"What’s our PII handling policy?", "expected":"Raw PII exports are disallowed; use aggregated reports.", "metric":"rouge"},
    {"q":"Uninstalls spiked 30% WoW. What should I do?", "expected":"churn-risk audience", "metric":"contains"}
]

scores = []
for item in EVAL_SET:
    res = agent.run(item["q"])
    ans = json.dumps(res.final)  # stringify in case dict
    if item["metric"] == "numeric":
        ok = is_numeric_close(res.final.get("answer"), item["expected"])
    elif item["metric"] == "rouge":
        ok = rouge1_recall(ans, item["expected"]) >= 0.6
    elif item["metric"] == "contains":
        ok = item["expected"].lower() in ans.lower()
    else:
        ok = False
    scores.append((item["q"], ok))
    print(f"{'PASS' if ok else 'FAIL'} — {item['q']}")
print("Passed:", sum(int(ok) for _,ok in scores), "/", len(scores))


## Phase 3 — LLM‑as‑Judge (+ RAG grounding)
**What we do:** Score answers on quality and truthfulness with a judge model.
For RAG, also check grounding: is the recommendation attributable to the retrieved context?

**What we can check (example)**
1. Factuality/Correctness
2. Grounding (RAG only) – claims supported by retrieved snippets; no outside hallucinations.
3. Relevance – stays on task.
4. Action/tool-calling quality – recommendation is reasonable and safe for context.
5. Format – concise answer; includes citations for policy/RAG outputs.
* Aggregate: weighted average of the scores above.

**RAG specific evaluation:**
1. Context Precision: % of retrieved tokens actually cited/used by the answer.
2. Context Recall: Did we miss obviously relevant chunks for this query?
3. Attribution: Every factual sentence in the answer maps to at least one snippet.
4. No-Context Leakage: Penalize correct answers that don’t cite the given snippets when they should.
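
The first two RAG metrics can be approximated from ids alone. The sketch below works at chunk level rather than token level, which is a simplification of the definition above; the function names and the extra ids (`kb#40`, `kb#7`, `kb#99`) are made up for illustration.

```python
from typing import List

def context_precision(retrieved_ids: List[str], used_ids: List[str]) -> float:
    """Share of retrieved chunks that the answer actually cited or used."""
    if not retrieved_ids:
        return 0.0
    used = set(used_ids)
    return sum(1 for r in retrieved_ids if r in used) / len(retrieved_ids)

def context_recall(retrieved_ids: List[str], relevant_ids: List[str]) -> float:
    """Share of known-relevant chunks that the retriever returned."""
    if not relevant_ids:
        return 1.0  # nothing to find, so nothing was missed
    got = set(retrieved_ids)
    return sum(1 for r in relevant_ids if r in got) / len(relevant_ids)

retrieved = ["kb#12", "kb#40", "kb#7"]
# One of the three retrieved chunks was used by the answer.
print(context_precision(retrieved, used_ids=["kb#12"]))
# One of the two gold-relevant chunks was retrieved.
print(context_recall(retrieved, relevant_ids=["kb#12", "kb#99"]))
```

Context recall needs a gold list of relevant chunks per query, so it is only as trustworthy as that labeling.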

**Bias & hygiene:**

* Use a different model for judging than the one that generated the answers.
* Fix the seed and set temperature to 0.
* Blind the judge to model identity.
* Periodically human-calibrate: sample 10 cases, compare human vs. judge scores, and adjust the rubric/weights if drift appears.
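
For the human-calibration step, a few summary numbers are usually enough to tell whether the judge has drifted. This is a rough sketch: `calibration_report`, the tolerance of one point, and the ten score pairs are all illustrative assumptions, not measured data.

```python
from statistics import mean

def calibration_report(human, judge, tolerance=1):
    """Compare paired 1-5 scores from a human reviewer and the judge model."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [j - h for h, j in zip(human, judge)]
    return {
        "mean_abs_diff": mean(abs(d) for d in diffs),
        "bias": mean(diffs),  # > 0 means the judge scores higher than the human
        "within_tolerance": sum(abs(d) <= tolerance for d in diffs) / len(diffs),
    }

# Made-up scores for a 10-case sample.
human = [5, 4, 2, 3, 4, 1, 5, 3, 4, 2]
judge = [5, 5, 3, 3, 4, 3, 5, 4, 4, 2]
print(calibration_report(human, judge))
```

A consistently positive `bias` suggests the rubric is too lenient; a large `mean_abs_diff` with near-zero bias points at noisy or ambiguous criteria instead.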

**Example prompt for the judge call:**
- You are a strict, impartial evaluator for an internal company “Account Copilot” agent.
Judge ONLY what is provided. Do not invent facts. If evidence is missing, penalize grounding.

Score on a 1–5 scale (1=very poor, 3=acceptable, 5=excellent) using these criteria:

1) factuality — are claims internally consistent and (when context is given) accurate?
   5: all check out; 3: some uncertainty; 1: clear error or contradiction.
2) grounding — are claims attributable to retrieved_context? Penalize unsupported claims.
   5: every factual sentence is supported and cites; 3: mixed; 1: mostly unsupported.
3) relevance — focused on the user’s question and task type; no filler.
4) action_quality — for action/recommendation tasks: safe, concrete, and sensible for AppsFlyer.
   If not an action task, set this to null.
5) format — concise; includes citation ids for policy/RAG answers; no inner monologue.

Weights (default): factuality 0.30, grounding 0.30, relevance 0.15, action_quality 0.20, format 0.05.
If any score is null, renormalize remaining weights.

Return JSON ONLY with this schema:
{
  "scores": {"factuality": int|null, "grounding": int|null, "relevance": int|null, "action_quality": int|null, "format": int|null},
  "overall": float,                     // weighted average, 1–5 rounded to 3 decimals
  "verdict": "pass" | "borderline" | "fail",
  "issues": [                           // short machine-parsable flags
    "unsupported_claim" | "hallucination" | "missing_citation" | "unsafe_action" |
    "off_topic" | "format_violation" | "policy_mismatch"
  ],
  "citations_used": ["doc_id#line", ...],   // ids the answer actually relied on (if any)
  "rationale": "one concise paragraph summarizing why these scores were assigned"
}
No preamble, no additional text.
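
Since the prompt asks for JSON only, it helps to validate the reply before trusting it and to recompute `overall` locally rather than rely on the judge's arithmetic. The following is a sketch under the schema above; `parse_judge_reply` and the sample reply are hypothetical.

```python
import json

WEIGHTS = {"factuality": 0.30, "grounding": 0.30, "relevance": 0.15,
           "action_quality": 0.20, "format": 0.05}

def parse_judge_reply(raw: str) -> dict:
    """Validate the judge's JSON reply and recompute `overall` locally."""
    reply = json.loads(raw)  # raises ValueError if the judge added extra text
    scores = reply["scores"]
    for name in WEIGHTS:
        value = scores.get(name)
        if value is not None and not (isinstance(value, int) and 1 <= value <= 5):
            raise ValueError(f"{name} must be an int in 1..5 or null, got {value!r}")
    # Renormalize the weights over the criteria that were actually scored.
    scored = {k: v for k, v in scores.items() if k in WEIGHTS and v is not None}
    if not scored:
        raise ValueError("judge returned no scores")
    total_weight = sum(WEIGHTS[k] for k in scored)
    weighted = sum(v * WEIGHTS[k] for k, v in scored.items())
    reply["overall"] = round(weighted / total_weight, 3)
    return reply

# Hypothetical reply for a non-action task (action_quality is null).
RAW_REPLY = ('{"scores": {"factuality": 5, "grounding": 4, "relevance": 4, '
             '"action_quality": null, "format": 5}, "verdict": "pass", '
             '"issues": [], "citations_used": ["kb#12"], '
             '"rationale": "Supported by the cited policy."}')
print(parse_judge_reply(RAW_REPLY)["overall"])
```

Rejecting malformed replies outright, instead of repairing them, keeps judge failures visible in your eval logs.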







**Exercise:**
1. Define all the scoring elements & weights
2. Create or mock a gold dataset
3. Score outputs on a rubric using a judge model. 

**The example below includes:**
- a **judge hook** (`call_judge_llm`) — replace with your provider call,
- a **fallback heuristic judge** so the cell runs offline,
- simple **RAG grounding** diagnostics (attribution rate).

> Replace rubric/weights and the judge call as needed.


In [None]:

RUBRIC_WEIGHTS = {"factuality":0.30,"grounding":0.30,"relevance":0.15,"action_quality":0.20,"format":0.05}

def call_judge_llm(user_query:str, assistant_answer:str, retrieved_context:list)->dict:
    """Plug your LLM here. Must return:
    {"scores":{"factuality":1-5,"grounding":1-5,"relevance":1-5,"action_quality":1-5,"format":1-5},
     "rationale":"..."}
    The default below is a simple heuristic so the notebook runs offline.
    """
    # Heuristic: boost if answer mentions key terms & cites context id
    ans = assistant_answer.lower()
    ctx_text = " ".join(c.get("text","").lower() for c in retrieved_context)
    def s(v): return max(1, min(5, v))
    factuality = s(4 if any(x in ans for x in ["124500","disallowed","audience","alert"]) else 3)
    grounding = s(5 if any(c.get("id","").split("#")[0] in assistant_answer for c in retrieved_context) or (retrieved_context and any(w in ans and w in ctx_text for w in ["audience","alert","pii"])) else (3 if retrieved_context else 4))
    relevance = s(4 if len(ans) > 0 else 2)
    action_quality = s(4 if "recommendation" in assistant_answer or "alert" in ans or "audience" in ans else 3)
    fmt = s(4)
    return {"scores":{"factuality":factuality,"grounding":grounding,"relevance":relevance,"action_quality":action_quality,"format":fmt},
            "rationale":"heuristic offline judge"}

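def overall_score(scores:dict)->float:
    # Skip null (None) scores and renormalize the remaining weights, as the rubric specifies.
    used = {k: w for k, w in RUBRIC_WEIGHTS.items() if scores.get(k) is not None}
    if not used: return 0.0
    return round(sum(scores[k]*w for k, w in used.items()) / sum(used.values()), 3)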

def attribution_rate(answer:str, snippets:list)->float:
    # naive: count sentences that share terms with any snippet
    sents = [s.strip() for s in re.split(r'[.!?]\s+', answer) if s.strip()]
    if not sents: return 1.0
    txts = [s.get("text","").lower() for s in snippets]
    used = 0
    for snt in sents:
        snt_l = snt.lower()
        if any(any(w in snt_l for w in t.split()[:5]) for t in txts):
            used += 1
    return used/len(sents)

# Run judge on 3 sample cases
CASES = [
    {"q":"DAU for AcmeApp last 7 days?"},
    {"q":"What’s our PII handling policy?"},
    {"q":"Uninstalls spiked 30% WoW. What should I do?"}
]
for c in CASES:
    r = agent.run(c["q"])
    ans_str = json.dumps(r.final)
    j = call_judge_llm(c["q"], ans_str, r.retrieved_context)
    ov = overall_score(j["scores"])
    ar = attribution_rate(ans_str, r.retrieved_context) if r.retrieved_context else None
    print(c["q"], "-> overall:", ov, "| scores:", j["scores"], "| attr_rate:", ar)


## Phase 4 — Component Validation (unit + contract tests)

- **What we do:** Test each building block in isolation so failures are obvious and cheap to fix.
- **What we can test:**
1. Correctness & Contracts (schema, results, alignment)
2. Robustness & Safety
3. Performance & Cost
4. Agents: Planning, Tool Use, Guardrails, Convergence 
5. Multi-agent: Specific agent results, Handoff, Coordination
6. RAG: Retrieval & Grounding
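
For the "Correctness & Contracts" item, a contract test can be as small as checking that each tool's output has the expected keys and types. The sketch below assumes the output shapes used by the mock tools; `check_contract` and the two contract dicts are hypothetical names.

```python
from typing import Any, Dict, List

def check_contract(output: Dict[str, Any], required: Dict[str, type]) -> List[str]:
    """Return a list of contract violations (an empty list means the output conforms)."""
    problems = []
    for key, expected_type in required.items():
        if key not in output:
            problems.append(f"missing key: {key}")
        elif not isinstance(output[key], expected_type):
            problems.append(
                f"{key}: expected {expected_type.__name__}, "
                f"got {type(output[key]).__name__}"
            )
    return problems

# Assumed contracts, mirroring the mock tool outputs.
CALC_CONTRACT = {"value": float}
SEARCH_CONTRACT = {"hits": list}

print(check_contract({"value": 124500.0}, CALC_CONTRACT))   # []
print(check_contract({"value": "124500"}, CALC_CONTRACT))   # ['value: expected float, got str']
print(check_contract({}, SEARCH_CONTRACT))                  # ['missing key: hits']
```

Returning a list of problems rather than raising on the first one makes a failing test report every violation at once.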




**Exercise:** 
1. Plan how to validate the different components. 
2. Validate individual components. 

The example below shows three patterns:
- `calc` numeric correctness vs a reference,
- `policy_kb.search` retrieval metrics (Recall@K, MRR) + snippet hygiene,
- RAG retriever checks for `actions.recommend` (expected phrases appear in the retrieved context).

> Replace mock functions with your own; keep the tests.


In [None]:

# ---- calc correctness ----
def reference_sum_dau_last7(app:str)->float:
    # Trusted baseline (pretend). Replace with SQL/Pandas.
    return 124500.0 if app=="AcmeApp" else 0.0

def test_calc_sum_last7d():
    got = MockTools.calc({"metric":"DAU","agg":"sum","window":"7d","app":"AcmeApp"})
    exp = reference_sum_dau_last7("AcmeApp")
    assert abs(got["value"]-exp) <= 0.01*max(1.0,abs(exp))
test_calc_sum_last7d()
print("calc: PASS")

# ---- policy_kb.search retrieval ----
GOLD = [
  {"q":"What is our PII handling policy?","relevant":["kb#12"],"must":["PII","aggregated"]}
]

def recall_at_k(ids, relevant, k=5):
    # Fraction of relevant ids that appear in the top-k results.
    if not relevant: return 1.0
    return sum(1 for r in relevant if r in ids[:k]) / len(relevant)

def mrr_at_k(ids, relevant, k=5):
    for i,d in enumerate(ids[:k], start=1):
        if d in relevant: return 1.0/i
    return 0.0

hits = MockTools.policy_kb_search({"query":GOLD[0]["q"]})["hits"]
IDs = [h["id"] for h in hits]
assert recall_at_k(IDs, GOLD[0]["relevant"], 3) >= 1.0
assert mrr_at_k(IDs, GOLD[0]["relevant"], 3) >= 1.0
text = hits[0]["text"]
assert all(m.lower() in text.lower() for m in GOLD[0]["must"])
print("policy_kb.search: PASS")

# ---- RAG retriever checks ----
def rag_attribution_ok(query:str, phrases:list)->bool:
    out = MockTools.actions_recommend({"query": query})
    txt = " ".join(s["text"].lower() for s in out["context"])
    return all(p in txt for p in [ph.lower() for ph in phrases])

assert rag_attribution_ok("Uninstalls up 30% WoW", ["churn-risk","alerting"])
print("RAG retriever: PASS")


### Optional — Tool‑Calling Validation on a Gold Set


**Exercise:** Given a table of `question` and expected `tools_to_call`, measure tool plan quality.


In [None]:

GOLD_TOOLS = [
    {"question":"What is 5 * 5?","tools_to_call":"calc"},
    {"question":"What are the latest models from OpenAI?","tools_to_call":"policy_kb.search"},  # placeholder
    {"question":"Uninstalls spiked 30% WoW. What should I do?","tools_to_call":"actions.recommend"}
]

def parse_tools(s): return [t.strip() for t in s.split(",") if t.strip()]

def prf1(gold_set, used_set):
    tp = len(gold_set & used_set); fp = len(used_set - gold_set); fn = len(gold_set - used_set)
    P = tp/(tp+fp) if tp+fp else 1.0
    R = tp/(tp+fn) if tp+fn else 1.0
    F1 = 2*P*R/(P+R) if P+R else 0.0
    return P,R,F1

def validate_tool_calls(q, expected_csv, ordered=True):
    exp = parse_tools(expected_csv)
    res = agent.run(q)
    used = [s.tool for s in res.trace]
    P,R,F1 = prf1(set(exp), set(used))
    exact = int(used==exp) if ordered else None
    return {"q":q,"used":used,"exp":exp,"precision":round(P,3),"recall":round(R,3),"f1":round(F1,3),"seq_exact":exact}

for row in GOLD_TOOLS:
    print(validate_tool_calls(row["question"], row["tools_to_call"]))


## Phase 5 — Production Monitoring (offline harness)
- **What we do:** Track the agent's behavior on live (or replayed) traffic so regressions, failures, and drift surface quickly.
- **Main categories of what we monitor:**
1. SLOs & Health
2. Tool-Calling Health
3. RAG Health
4. Safety & Guardrails
5. Online Quality: periodic LLM-as-Judge 
6. Telemetry & Tracing
7. Rollouts & Regressions
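
For "Rollouts & Regressions", one simple pattern is a gate that compares a candidate's KPIs against a baseline before widening a rollout. This sketch assumes both are dicts with `success_rate` and `p95_latency_ms`; `regression_gate`, its thresholds, and the sample numbers are hypothetical.

```python
def regression_gate(baseline: dict, candidate: dict,
                    max_drop: float = 0.02,
                    max_latency_ratio: float = 1.2) -> list:
    """Return reasons to block a rollout; an empty list means the candidate may proceed."""
    reasons = []
    # Block if the success rate falls by more than the allowed absolute drop.
    if candidate["success_rate"] < baseline["success_rate"] - max_drop:
        reasons.append("success_rate dropped")
    # Block if p95 latency grows beyond the allowed ratio.
    if candidate["p95_latency_ms"] > baseline["p95_latency_ms"] * max_latency_ratio:
        reasons.append("p95 latency regressed")
    return reasons

# Made-up KPI snapshots for illustration.
baseline = {"success_rate": 0.95, "p95_latency_ms": 100.0}
candidate = {"success_rate": 0.90, "p95_latency_ms": 130.0}
print(regression_gate(baseline, candidate))
```

The same gate can run against canary or shadow traffic; only the source of the `candidate` dict changes.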



**Exercise:** 
1. Plan key metrics and monitors to implement
2. Instrument traces and compute a few online‑style KPIs locally. In production you’d emit these as telemetry events (e.g., Prometheus). Here we simulate counters/histograms.


In [None]:

from collections import defaultdict
import time

metrics = defaultdict(int)
latency = []

def handle_request(q):
    t0 = time.perf_counter()  # monotonic clock, better suited to timing intervals than time.time()
    res = agent.run(q)
    latency.append((time.perf_counter()-t0)*1000.0)
    metrics[f"term_{res.termination}"] += 1
    # tool-level stats
    for step in res.trace:
        metrics[f"tool_{step.tool}"] += 1
    # rag stats
    if any(s.tool=="actions.recommend" for s in res.trace):
        has_ctx = any("context" in s.output for s in res.trace)
        if not has_ctx: metrics["rag_no_context"] += 1
    return res

traffic = [
    "DAU for AcmeApp last 7 days?",
    "What’s our PII handling policy?",
    "Uninstalls spiked 30% WoW. What should I do?",
    "Export raw PII for AcmeApp."
]

for q in traffic:
    handle_request(q)

def pctl(a, p):
    if not a: return 0.0
    s = sorted(a); k = int((len(s)-1)*p/100)
    return s[k]

print({
    "success_rate": metrics["term_success"]/max(1,sum(metrics[k] for k in metrics if k.startswith("term_"))),
    "p95_latency_ms": round(pctl(latency,95),2),
    "tool_calls": {k.replace("tool_",""):v for k,v in metrics.items() if k.startswith("tool_")}
})



## Next Steps
- Swap `MockAgent` with your real agent; keep the same return contract (`AgentResult`).
- Configure evals at every level: end-to-end answers, individual components, and tool calls.
- Replace the heuristic judge with your LLM call in `call_judge_llm`.
- Wire real references/baselines for `calc`, a real KB for `policy_kb.search`, and your retriever for RAG tests.
- Expand gold sets and negative cases; add CI to run Phases 2–4 on every change.
- In production, emit per-step telemetry; add canary/shadow gating before full rollout.
