# Feature Track 1: Evaluation & Validation

---

Shipping a RAG system without systematic evaluation is like navigating without instruments. The pipeline may *seem* to work on the queries we tested by hand, but we have no way to know where it breaks, how often, or whether a change we made helped or hurt.

**Evaluation closes the feedback loop:**

```
Change a parameter  ──►  Measure quantitatively  ──►  Decide based on data
```

**Prerequisite:** Run `feature0_baseline_rag.ipynb` Steps 1–2 first to build the vector store.

| Notebook | Focus |
|---|---|
| Feature 0 | Working baseline prototype |
| **Feature Track 1 (this notebook)** | Quantitative evaluation |
| Feature Track 2 | Reliable, structured outputs |
| Feature Track 3 | Better retrieval strategies |
| Feature Track 4 | Multi-step agent workflows |

---

## Foundation

### Why Systematic Evaluation?
Without metrics, we are forced to manually re-read answers for a handful of test queries and guess whether a change helped or hurt. With metrics, we run the evaluation suite and get a number that we can track across pipeline changes and use to justify decisions.

**Concrete example from Feature 0:** The baseline RAG sometimes described a non-existent product as if it existed, cited a superseded GWP figure, or reported an unverified CO₂ reduction without flagging it. These are exactly the queries that matter for compliance. How often does this happen? We need an answer after every change to the pipeline.

---

### The RAG Pipeline

Each arrow is a potential failure point. Evaluation targets a specific stage so we can isolate *where* a problem is.

```
Ingestion (run once)
  Documents ──► [1] Chunker ──► [2] Embedder ──► [3] Vector DB

Querying (every user question)
  User query ──► [2] Embedder ──► [3] Retriever ──► Top-k Chunks
                                                          │
                                                   [4] LLM + Prompt
                                                          │
                                                   Answer + Sources
```

| Step | If it fails |
|---|---|
| [1] Chunking | Context split mid-fact; tables broken |
| [2] Embedding | Wrong chunks returned |
| [3] Vector search | Relevant chunks not retrieved |
| [4] Generation | Hallucination; ignores context; incomplete answer |

---

## RAGAS

[RAGAS](https://docs.ragas.io) (*Retrieval Augmented Generation Assessment*) is an open-source Python library for evaluating RAG pipelines, widely adopted in the LLM/RAG ecosystem.

#### How it works internally
Rather than asking a judge LLM "rate this answer 0–5", RAGAS decomposes the answer into individual atomic claims:

```
Answer: "The Logypal 1 GWP is 3.2 kg CO₂e, verified by Bureau Veritas."

  Claim 1: "GWP is 3.2 kg CO₂e"          -> supported by context?  ✓
  Claim 2: "verified by Bureau Veritas"  -> supported by context?  ✓

  Faithfulness = 2 supported / 2 total = 1.0
```

This catches partial hallucination, e.g. a correct figure with a fabricated verifier name.
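
The scoring step can be sketched in a few lines of plain Python. This is an illustration only, not the RAGAS implementation: `ClaimVerdict` and `faithfulness_score` are hypothetical names, and the verdicts are hard-coded here, whereas in RAGAS they come from the judge LLM.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    """One atomic claim and whether the retrieved context supports it (hypothetical type)."""

    claim: str
    supported: bool


def faithfulness_score(verdicts: list[ClaimVerdict]) -> float:
    """Fraction of claims supported by the context; NaN when there are no claims."""
    if not verdicts:
        return float("nan")
    return sum(v.supported for v in verdicts) / len(verdicts)


# Hard-coded verdicts for illustration; a judge LLM would produce these.
fully_grounded = [
    ClaimVerdict("GWP is 3.2 kg CO₂e", supported=True),
    ClaimVerdict("verified by Bureau Veritas", supported=True),
]
partly_hallucinated = [
    ClaimVerdict("GWP is 3.2 kg CO₂e", supported=True),
    ClaimVerdict("verified by a made-up verifier", supported=False),
]

print(faithfulness_score(fully_grounded))       # 1.0
print(faithfulness_score(partly_hallucinated))  # 0.5
```

A single 0 to 5 rating would blur the second case; the per-claim ratio makes the unsupported half of the answer visible.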

#### Metrics overview

| Metric | Ground truth needed? | What it catches |
|---|---|---|
| `Faithfulness` | No | Claims not supported by the retrieved context |
| `AnswerRelevancy` | No | Off-topic or evasive answers |
| `ContextPrecision` | **Yes** (reference answer) | Irrelevant chunks ranked above relevant ones |
| `ContextRecall` | **Yes** (reference answer) | Facts needed to answer that are missing from the retrieved context |
| `AnswerCorrectness` | **Yes** (reference answer) | Wrong or missing facts vs. the reference answer |

The first two metrics can be run **on any query** with no labelling effort. The last three require a **test set**: a curated list of (query, reference answer) pairs.

#### Strengths and weaknesses
- Standardised, reproducible metrics widely used in industry
- `Faithfulness` and `AnswerRelevancy` require zero labelling effort
- Needs a capable judge LLM (defaults to OpenAI) and often makes multiple LLM calls per sample per metric, which adds cost and latency
- LLM judge has its own biases; metrics are proxies, not absolute ground truth

---

### First Look at RAGAS

#### Setup

**Prerequisites:** `conversational-toolkit` and `backend` installed in editable mode. The vector store must already exist, so run `feature0_baseline_rag.ipynb` Steps 1–2 first.

RAGAS uses OpenAI as its judge LLM by default, so `OPENAI_API_KEY` must be set.


In [None]:
import os
import pathlib
import warnings

from langchain_openai import ChatOpenAI, OpenAIEmbeddings as LangChainOpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper  # type: ignore[import-untyped]
from ragas.metrics import (  # type: ignore[attr-defined]
    Faithfulness as RagasFaithfulness,
    AnswerRelevancy as RagasAnswerRelevancy,
)

from conversational_toolkit.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from conversational_toolkit.evaluation import Evaluator
from conversational_toolkit.evaluation.adapters import evaluate_with_ragas
from conversational_toolkit.vectorstores.chromadb import ChromaDBVectorStore

from sme_kt_zh_collaboration_rag.feature0_baseline_rag import (
    EMBEDDING_MODEL,
    VS_PATH,
    SYSTEM_PROMPT,
    build_llm,
    build_agent,
)

warnings.filterwarnings("ignore", category=DeprecationWarning)

_secret_path = pathlib.Path("/secrets/OPENAI_API_KEY")
if "OPENAI_API_KEY" not in os.environ and _secret_path.exists():
    os.environ["OPENAI_API_KEY"] = _secret_path.read_text().strip()

RETRIEVER_TOP_K = 5
BACKEND = "openai"  # "ollama" or "openai"
# Note: RAGAS uses OpenAI for its judge LLM regardless of BACKEND above.

if not BACKEND:
    raise ValueError('Set BACKEND to "ollama" or "openai" before running.')

# RAG pipeline
embedding_model = SentenceTransformerEmbeddings(model_name=EMBEDDING_MODEL)
vs = ChromaDBVectorStore(db_path=str(VS_PATH))
llm = build_llm(backend=BACKEND)
agent = build_agent(
    vector_store=vs,
    embedding_model=embedding_model,
    llm=llm,
    top_k=RETRIEVER_TOP_K,
    system_prompt=SYSTEM_PROMPT,
    number_query_expansion=0,
)

# AnswerRelevancy internally calls embed_query() / embed_documents() to compare generated questions against the original query. langchain_openai.OpenAIEmbeddings implements this interface and is accepted directly by ragas.evaluate().
ragas_embeddings = LangChainOpenAIEmbeddings(model="text-embedding-3-small")

# RAGAS defaults to max_tokens=3072 for its judge LLM. Long answers with many atomic claims overflow this limit mid-JSON, causing "output is incomplete" errors. Wrap ChatOpenAI with a higher limit and pass it explicitly to evaluate_with_ragas().
ragas_llm = LangchainLLMWrapper(
    ChatOpenAI(model="gpt-4o-mini", max_completion_tokens=8192)
)

print(f"Embedding model : {EMBEDDING_MODEL}")
print(f"Vector store    : {VS_PATH}")
print(f"RAG agent LLM   : {BACKEND}")
print("RAGAS judge LLM : gpt-4o-mini (OpenAI)")
print("Setup complete.")

---

### Part 1: Metrics Without a Test Set

These two metrics need only the query and the system's response, **no ground truth required**:

- **Faithfulness**: are all claims in the answer supported by the retrieved context?
- **AnswerRelevancy**: does the answer directly address the question?
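
For intuition about AnswerRelevancy: the judge LLM generates questions that the answer would plausibly be answering, and the score is the mean cosine similarity between the embeddings of those questions and the embedding of the original query. The sketch below shows only that final averaging step, using hand-made toy vectors instead of real embeddings. `cosine_similarity` and `answer_relevancy_score` are hypothetical helpers, not RAGAS functions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy_score(
    query_vec: list[float], generated_question_vecs: list[list[float]]
) -> float:
    """Mean similarity between the query and the questions derived from the answer."""
    sims = [cosine_similarity(query_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy 2-D "embeddings" (assumed values, chosen so the arithmetic is easy to follow).
query_vec = [1.0, 0.0]
on_topic = [[1.0, 0.0], [3.0, 4.0]]  # similarities 1.0 and 0.6
off_topic = [[0.0, 1.0]]             # similarity 0.0

print(round(answer_relevancy_score(query_vec, on_topic), 3))   # 0.8
print(round(answer_relevancy_score(query_vec, off_topic), 3))  # 0.0
```

An evasive or off-topic answer yields generated questions that sit far from the original query in embedding space, which pulls the mean down.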

The `evaluate_with_ragas()` adapter converts our `EvaluationSample` objects to RAGAS format, runs the judge LLM, and returns an `EvaluationReport`.

*Can take a few minutes: RAGAS makes multiple judge LLM calls per sample.*

In [None]:
queries = [
    "Does PrimePack AG offer a product called the Lara Pallet?",
    "Which products in the portfolio have a third-party verified EPD?",
    "Can the 68% CO2 reduction claim for tesapack ECO (product 50-102) be included in a customer sustainability response?",
    "Are any tape products confirmed to be PFAS-free?",
    "Which suppliers are not yet compliant with the EPD requirement by end of 2025?",
]

print(
    f"Building {len(queries)} evaluation samples (runs the RAG agent once per query)..."
)
samples = await Evaluator.build_samples_from_agent(agent=agent, queries=queries)
print(f"Done. {len(samples)} samples built.\n")


print("Running RAGAS: Faithfulness + AnswerRelevancy\n")

report_basic = evaluate_with_ragas(
    samples=samples,
    metrics=[
        RagasFaithfulness(),  # type: ignore[call-arg]
        RagasAnswerRelevancy(strictness=1),  # type: ignore[call-arg]
    ],
    llm=ragas_llm,
    embeddings=ragas_embeddings,
)

print("─" * 40)
print(f"Samples evaluated: {report_basic.num_samples}")
print("─" * 40)
for metric_name, score in report_basic.summary().items():
    print(f"{metric_name:<22}  {score:.3f}")
print("─" * 40)

In [None]:
# Per-sample breakdown: find which queries score best / worst so we know where to focus improvement efforts.

import math

f_result = next(
    (r for r in report_basic.results if "faithfulness" in r.metric_name.lower()), None
)
a_result = next(
    (
        r
        for r in report_basic.results
        if "relevancy" in r.metric_name.lower() or "relevance" in r.metric_name.lower()
    ),
    None,
)

f_scores: list[float] = (f_result.per_sample_scores if f_result else None) or []
a_scores: list[float] = (a_result.per_sample_scores if a_result else None) or []


def fmt(v: float) -> str:
    return "  N/A" if math.isnan(v) else f"{v:>5.2f}"


print("Per-sample scores (F = Faithfulness, A = AnswerRelevancy)\n")
print(f"{'#':<3} {'F':>5} {'A':>5} {'query':<42} {'response'}")
print("─" * 100)
for i, (sample, f, a) in enumerate(zip(samples, f_scores, a_scores), 1):
    # Truncate long strings to 40 characters so each row stays on one line.
    q = sample.query[:40] + ".." if len(sample.query) > 40 else sample.query
    answer = sample.answer or ""  # guard against a missing answer
    r = answer[:40] + ".." if len(answer) > 40 else answer
    print(f"{i:<3} {fmt(f)} {fmt(a)} {q:<42} {r}")

---

### Part 2: Context Retrieval Evaluation (Requires a Test Set)

Faithfulness and AnswerRelevancy tell us about generation quality, but they say nothing about whether the retriever is finding the *right* chunks. For that we need a **test set**.

#### What is a test set?

A test set is a curated list of `(query, reference_answer)` pairs. The reference answer serves as a proxy for "what facts should be in the retrieved context", so RAGAS can judge retrieval without requiring us to manually label which chunks are relevant.

#### Context retrieval metrics

| Metric | What it measures | Score = 1.0 means | Low score means |
|---|---|---|---|
| **ContextPrecision** | Are retrieved chunks relevant to the reference answer? Weighted by rank: relevant chunks ranked first score better. | Every retrieved chunk is relevant, and the most relevant are ranked first | The retriever returns noisy, off-topic chunks or ranks them poorly |
| **ContextRecall** | What fraction of the reference answer's claims can be attributed to the retrieved context? | Every fact needed to answer the question is present in the retrieved context | The retriever is missing chunks that contain key facts |

**Interpretation:**
- Low `ContextPrecision` -> improve chunk filtering or reduce `top_k`
- Low `ContextRecall` -> relevant documents are not indexed, chunks are too small, or embedding mismatch
- Both low -> consider better chunking strategy or a different embedding model (see Feature Track 3)
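
The rank weighting in `ContextPrecision` is easiest to see in a small calculation. The sketch below assumes the commonly documented formulation, the mean of precision@k over the ranks k that hold a relevant chunk, and treats `ContextRecall` as the fraction of reference-answer claims found in the retrieved context. The relevance flags are hand-written here (in RAGAS the judge LLM decides them), and both function names are hypothetical.

```python
def context_precision(relevant_flags: list[bool]) -> float:
    """Mean of precision@k over ranks k that hold a relevant chunk (0.0 if none)."""
    hits = 0
    precisions: list[float] = []
    for k, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


def context_recall(claim_found_in_context: list[bool]) -> float:
    """Fraction of reference-answer claims attributable to the retrieved context."""
    return sum(claim_found_in_context) / len(claim_found_in_context)


# The same two relevant chunks among five retrieved, ranked differently.
good_ranking = [True, True, False, False, False]
poor_ranking = [False, False, False, True, True]

print(round(context_precision(good_ranking), 3))  # 1.0
print(round(context_precision(poor_ranking), 3))  # 0.325

# Two of three reference claims are covered by the retrieved chunks.
print(round(context_recall([True, True, False]), 3))  # 0.667
```

Both rankings retrieve the same chunks, yet the score drops from 1.0 to 0.325 when the relevant ones appear last. Under this formulation, trailing irrelevant chunks do not lower the score, so the metric mainly rewards ranking relevant chunks first.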

In [None]:
EVALUATION_QUERIES: list[dict] = [
    # portfolio scope
    {
        "query": "Does PrimePack AG offer a product called the Lara Pallet?",
        "ground_truth_answer": (
            "No. The Lara Pallet does not exist in the PrimePack AG portfolio. The current pallet portfolio is: Noé Pallet (32-100, CPR System), Wooden Pallet 1208 (32-101, CPR System), Recycled Plastic Pallet (32-102, CPR System), Logypal 1 (32-103, Relicyc), LogyLight (32-104, Relicyc), and EP 08 (32-105, StabilPlastik)."
        ),
    },
    # multi-product retrieval
    {
        "query": "Which products in the portfolio have a third-party verified EPD?",
        "ground_truth_answer": (
            "Products with third-party verified EPDs: 50-100 (IPG Hot Melt Tape), 50-101 (IPG Water-Activated Tape), 32-100 (Noé Pallet, CPR System), 32-103 (Logypal 1, Relicyc), 32-105 (EP 08, StabilPlastik), 11-100 (Cartonpallet CMP, redbox), 11-101 (Corrugated cardboard, Grupak)."
        ),
    },
    # claim verification
    {
        "query": "Can the 68% CO2 reduction claim for tesapack ECO (product 50-102) be included in a customer sustainability response?",
        "ground_truth_answer": (
            "No. The 68% CO2 reduction figure is a self-declared internal assessment by Tesa SE, not independently verified through an EPD. PrimePack AG's procurement policy classifies this as Level B/C evidence. It may only be cited with an explicit caveat that it is unverified."
        ),
    },
    # missing data
    {
        "query": "What verified environmental data is available for the LogyLight pallet (product 32-104)?",
        "ground_truth_answer": (
            "No verified environmental data is available for LogyLight (32-104). The datasheet explicitly states that GWP and all other LCA figures are not yet available. An LCA study has been commissioned (REL-LCA-2024-07) and a third-party verified EPD was expected by Q2 2025, but no verified figures exist."
        ),
    },
    # source conflict
    {
        "query": "Which GWP source should be used for Relicyc Logypal 1: the 2021 datasheet or the 2023 EPD?",
        "ground_truth_answer": (
            "The 2023 third-party verified EPD (Relicyc EPD No. S-P-10482) is the authoritative source. The 2021 internal datasheet reporting 4.1 kg CO2e per pallet is marked SUPERSEDED and must not be cited. When two sources conflict, PrimePack AG's policy requires preferring the more recent third-party verified source."
        ),
    },
    # missing data
    {
        "query": "Are any tape products confirmed to be PFAS-free?",
        "ground_truth_answer": (
            "No tape product is confirmed PFAS-free. As of January 2025, no PFAS declarations have been received from IPG or Tesa SE. The Tesa claim 'hot-melt, free of intentionally added solvents' does not constitute a PFAS declaration. No tape product may be described as PFAS-free until explicit supplier declarations are received and reviewed."
        ),
    },
    # policy (tests procurement policy retrieval)
    {
        "query": "Which suppliers are not yet compliant with the EPD requirement by end of 2025?",
        "ground_truth_answer": (
            "Tesa SE (supplier of tesapack ECO, product 50-102) and CPR System (supplier of Wooden Pallet 32-101 and Recycled Plastic Pallet 32-102) are not yet compliant with the EPD requirement by end of 2025."
        ),
    },
]

In [None]:
gt_queries = [q["query"] for q in EVALUATION_QUERIES]
gt_answers = [q["ground_truth_answer"] for q in EVALUATION_QUERIES]

print(f"Test set: {len(gt_queries)} query-answer pairs\n")
for i, (q, a) in enumerate(zip(gt_queries, gt_answers), 1):
    print(f"  {i}. {q}")
    print(f"     -> {a[:80]}{'...' if len(a) > 80 else ''}\n")

print(f"Building {len(gt_queries)} samples (runs the RAG agent once per query)...")
samples_gt = await Evaluator.build_samples_from_agent(
    agent=agent,
    queries=gt_queries,
    ground_truth_answers=gt_answers,
)
print(f"Done. {len(samples_gt)} samples built.")

In [None]:
from ragas.metrics import (  # type: ignore[attr-defined]
    ContextPrecision as RagasContextPrecision,
    ContextRecall as RagasContextRecall,
)

print("Running RAGAS: ContextPrecision + ContextRecall\n")

report_context = evaluate_with_ragas(
    samples=samples_gt,
    metrics=[
        RagasContextPrecision(),  # type: ignore[call-arg]
        RagasContextRecall(),  # type: ignore[call-arg]
    ],
    llm=ragas_llm,
    embeddings=ragas_embeddings,
)

print("─" * 40)
print(f"Samples evaluated: {report_context.num_samples}")
print("─" * 40)
for metric_name, score in report_context.summary().items():
    print(f"{metric_name:<28}  {score:.3f}")
print("─" * 40)

- A low `context_precision` score means the retriever returned off-topic chunks (or ranked them poorly).
- A low `context_recall` score means key facts are missing from the retrieved context.

In [None]:
# Per-sample breakdown: find which queries have retrieval gaps
cp_result = next(
    (r for r in report_context.results if "precision" in r.metric_name.lower()), None
)
cr_result = next(
    (r for r in report_context.results if "recall" in r.metric_name.lower()), None
)

cp_scores: list[float] = (cp_result.per_sample_scores if cp_result else None) or []
cr_scores: list[float] = (cr_result.per_sample_scores if cr_result else None) or []

print("Per-sample context scores (CP = ContextPrecision, CR = ContextRecall)\n")
print(f"{'#':<3} {'CP':>5} {'CR':>5}  {'query'}")
print("─" * 85)
for i, (sample, cp, cr) in enumerate(zip(samples_gt, cp_scores, cr_scores), 1):
    q = sample.query[:68] + "..." if len(sample.query) > 68 else sample.query
    print(f"{i:<3} {fmt(cp):>5} {fmt(cr):>5}  {q}")

---

## Summary

We now have four RAGAS metrics for the baseline pipeline:

| Metric | Ground truth? | Tells you |
|---|---|---|
| `Faithfulness` | No | Are answers grounded in the retrieved context? |
| `AnswerRelevancy` | No | Are answers on-topic? |
| `ContextPrecision` | Yes | Are retrieved chunks relevant and well-ranked? |
| `ContextRecall` | Yes | Does the retrieved context contain all necessary facts? |
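
To close the loop described at the start (change, measure, decide), these summary scores can be compared across pipeline changes. A minimal sketch, assuming each run's `summary()` yields a plain mapping from metric name to score, as the printing loops above suggest. `compare_runs` is a hypothetical helper, and the scores below are made-up placeholders, not measured results.

```python
def compare_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, str]:
    """Label each metric present in both runs as improved, regressed, or unchanged."""
    verdicts: dict[str, str] = {}
    for name in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[name] - baseline[name]
        if delta > tolerance:
            verdicts[name] = "improved"
        elif delta < -tolerance:
            verdicts[name] = "regressed"
        else:
            verdicts[name] = "unchanged"
    return verdicts


# Placeholder scores for illustration only (NOT real evaluation results).
baseline_scores = {"faithfulness": 0.70, "answer_relevancy": 0.90, "context_recall": 0.60}
candidate_scores = {"faithfulness": 0.80, "answer_relevancy": 0.90, "context_recall": 0.50}

for metric, verdict in compare_runs(baseline_scores, candidate_scores).items():
    print(f"{metric:<20} {verdict}")
```

The `tolerance` band is an assumed value. Because judge LLM scores vary somewhat between runs, treating small differences as "unchanged" helps avoid reading noise as progress.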