Skip to content

Evaluation

Nikolai Sachok edited this page Jun 18, 2026 · 2 revisions

Evaluation

This page teaches why a RAG system is evaluated the way it is — why retrieval and generation are measured separately, what each metric actually tells you and its blind spot, why a golden set plus an LLM judge, and how you'd harden all of this for production. The companion page Architecture explains what the pipeline does; this one explains how you know it works — and how you'd know it broke.

Strata-RAG is eval-first: retrieval quality and generation quality are measured, not assumed, and they are measured separately, because they fail for different reasons and a single end-to-end score hides which half is broken.


1. Why retrieval and generation are measured separately

A RAG answer can be wrong in two fundamentally different ways:

  1. Retrieval failed — the right passage was never pulled into the context. No amount of careful generation can recover; the model is answering from the wrong material.
  2. Generation failed — the right passage was retrieved, but the model ignored it, embellished it, or contradicted it (a faithfulness failure / hallucination).

These have different fixes. A retrieval failure means you tune the embedder, the hybrid mix, the reranker, or the chunking. A generation failure means you tune the prompt, the grounding instructions, or the model. A single end-to-end "is the answer good?" score conflates them — it goes down, and you don't know which knob to turn. That conflation is the classic RAG-eval mistake. So Strata-RAG keeps two harnesses: a retrieval metric suite over a golden set, and an independent faithfulness judge over generation. You always know which stage to fix.

There's also a causality reason to separate them: retrieval is the ceiling on generation. If recall is low, the generator never had a chance — you fix retrieval first, then measure generation on top of a retrieval baseline you trust. Measuring generation while retrieval is broken just measures the model's willingness to bluff.


2. Retrieval metrics — what each tells you, and its blind spot

The golden set is a set of questions, each mapped to the source ids that are genuinely relevant (hand-labelled). The metric functions are pure and unit-tested — an eval harness you can't trust is worthless, so the metrics themselves are pinned by tests independent of any index. All four are computed at a cutoff K (the number of chunks handed to the generator):

Metric The question it answers The blind spot
Recall@K Did we find it? Of the relevant items, how many landed in the top K. Says nothing about ordering or junk — 100% recall with the answer at rank K, buried under 4 irrelevant hits, still scores 1.0. The headline number, but not the whole story.
Precision@K How much junk? Of the top K returned, how many are actually relevant. Trades against recall and is insensitive to rank — it doesn't care if the junk is at the top or bottom. With a small fixed K and 1 relevant item, precision is capped low by construction.
MRR Is the best one near the top? Rewards putting the first relevant item high. Only sees the first relevant hit — blind to everything after it. Great for "one right answer" questions, weak for "find all N."
nDCG@K Rank-aware quality. Credits relevant items more when higher, normalised against the ideal ordering. The most complete single signal, but harder to interpret in isolation and assumes your relevance labels are right — a mislabelled golden set quietly corrupts it.

Why all four, not one. Each is partial. Recall tells you if the material is even reachable; precision tells you how much noise rides along; MRR tells you if the top hit is good (which matters because the generator weights early context most); nDCG rolls rank-awareness into one number. Read together they triangulate where retrieval is weak. Read alone, any one of them flatters or maligns a system unfairly — high recall can hide terrible ordering; high precision can hide low recall at a small K.

The shared blind spot of all four: they are only as honest as the golden labels. If a genuinely-relevant passage isn't labelled relevant, a correct retrieval is scored as a miss. The golden set is the eval's foundation — and its weakest point. Hardening it (more questions, multiple labellers, periodic review) is where production eval investment goes.

Slicing the eval by question kind (and why)

Because the engine answers two classes of question (Architecture §1), the eval must be sliced or it measures the wrong thing:

  • --kind retrieval evaluates only the semantic questions — the real test of the embedder and reranker. Aggregation/metadata questions (counts, publisher lookups) are answered by the sidecar, not the embedding index — scoring them as a retrieval test would be meaningless (a publisher lookup is a SQL join, not a similarity search). Including them would pollute the retrieval signal with questions retrieval was never meant to answer.
  • --dense-only turns off BM25 + rerank to isolate the raw embedder contribution — the cleanest A/B signal when comparing two embedding models, because hybrid + rerank can mask a weaker embedder.

Each result table is self-labelling — tagged with its model / collection / mode — so two runs sit side by side and the winner is unambiguous. The two A/B axes are model (e.g. MiniLM vs mpnet) × mode (dense-only vs hybrid+rerank). Isolating the variable under test is the whole discipline: change one thing, measure, compare like with like.


3. Generation — LLM-as-judge faithfulness

Retrieval being correct doesn't make the answer correct. The model can hallucinate a detail not in the context, or answer a different question than the one asked. You can't catch that by checking the pipeline ran — you have to judge the content of the answer.

Why LLM-as-judge. The alternatives each fall short for faithfulness:

  • n-gram overlap (ROUGE/BLEU) needs reference answers and measures surface similarity, not whether each claim is supported by the retrieved context — the thing that actually matters.
  • Human eval is the gold standard but doesn't scale and can't run in CI.
  • No generation eval (the common gap) means you've measured retrieval and just hoped the model used it faithfully.

So a second, independent LLM grades the first against a strict rubric. It sees three things — the question, the exact context the generator was given, and the answer — and returns a structured JSON verdict. Two dimensions (the classic RAG-eval pair):

  • faithfulness / groundedness — is every claim in the answer supported by the context? (catches hallucination). Correctly saying "the context doesn't contain this" when it indeed doesn't is fully faithful.
  • answer_relevance — does the answer address the question actually asked? (catches "technically true but off-topic").

Because the verdict is structured (scores + severities), you can both gate a single response (block answers that fail) and aggregate over a question set into a pass-rate — a real regression test. The gate logic (compute_gate) is separated from the LLM call and unit-tested, so the decision is deterministic even though the grading is a model.

The blind spots of an LLM judge (worth stating plainly, because a judge you trust blindly is just a more expensive hallucination):

  • it's itself a model — it has cost-per-call, and bias/variance; mitigated by a strict structured rubric and a separable, tested gate;
  • it can only judge faithfulness against the context it's shown — if the context is wrong (a retrieval failure), the judge can't know the world answer is different. This is exactly why retrieval is measured separately and first;
  • it inherits the judging model's own blind spots — a stronger model can be passed as judge than as generator (the design allows it), and in production you'd calibrate the judge against human labels and track agreement.

The decisive lesson: grounding is orthogonal to injection defense

Faithfulness eval measures one thing (invented facts). The prompt-injection guardrails measure another (an attack obeyed). They are not the same problem: "answer only from the context" stops hallucination — but does nothing about an instruction that is itself sitting in the retrieved context. That injected instruction is "grounded" too. Keeping these two measurements separate is deliberate, and it's why the next section exists.


4. Adversarial / guardrail eval — measuring the defenses

The injection defenses are themselves measured, not asserted. The injection eval pushes an adversarial fixture set through the input scanner and reports an attack-success-rate (attacks missed / total) plus the false-positive rate on clean text — the two errors you trade off (missing an attack vs. flagging a benign chunk). It's deterministic (no LLM), so it runs anywhere, including CI.

The payoff of measuring a guardrail: toggle a single defense layer off, re-run, and watch the attack-success-rate move — that's the layer's exact contribution, proven rather than claimed. A guardrail whose value you can't quantify is a guardrail you can't trust or tune.


5. The principle, and how you'd harden it for production

Everything that matters is turned into a number against a fixed reference set:

  • retrieval → golden-set Recall@K / Precision@K / MRR / nDCG,
  • generation → LLM-judge faithfulness + answer-relevance,
  • defenses → attack-success-rate over adversarial fixtures.

Measured, separated, and reproducible out of the box on the synthetic sample corpus.

Hardening this for production — what the demo deliberately leaves as a seam:

  • A bigger, governed golden set. The metrics are only as honest as the labels. Production means more questions, multiple labellers, inter-annotator agreement, and periodic re-labelling as the corpus drifts.
  • A CI eval gate. Run the retrieval + faithfulness eval on every change and block merges that regress a threshold — eval as a build gate, not a manual command. (Strata-RAG ships the harness; wiring it as a hard CI gate is roadmap.)
  • A calibrated judge. Pin a held-out judge model, measure its agreement with human labels, and track that agreement over time so judge-drift doesn't silently move your pass-rate.
  • Observability of eval over time. Trend Recall/nDCG/faithfulness across releases (Prometheus + Grafana, Langfuse/Phoenix traces) rather than reading a one-off table — roadmap.
  • Slice-level eval. Per-question-kind, per-source-set, per-language slices, so a regression in one slice isn't hidden by an average. (The kind-slicing in §2 is the seam this builds on; a multilingual slice is the Phase-3 A/B.)

The thread through all of it is the same as the engine's: make the boundary explicit and turn it into a number. You don't assert retrieval works, the answer is faithful, or the guardrail holds — you measure each, separately, against a fixed reference, so a regression is a falling number, not a surprised user.