# Retrieval Evaluation (Evidence-Based)

This notebook evaluates the retrieval component of our RAG system using
**evidence satisfaction** rather than flat relevance.

The same evaluation logic is reused across different categories of questions:
- Conceptual / Declarative
- Structural / Symbolic
- Compound / Multi-evidence

Each category is evaluated independently by passing a different gold file.


## Evaluation Metrics

We evaluate retrieval performance using **Precision@K** and **Recall@K**, defined under an **evidence-based relevance** setting rather than flat document matching.

In this context, a question is considered *successfully answered* if the retrieved chunks satisfy the logical evidence required to answer the question.

---

### Precision@K

**Plain-English definition:**

> Precision measures **what fraction of the retrieved documents are actually useful** for answering the question.

In other words:

> *Of what the retriever returned, how much contributed toward answering the question?*

**Formal definition:**

Let:

* **R** be the set of documents (chunks, in this notebook) that are relevant for answering the question
* **K** be the set of top-K retrieved documents, so that |K| equals the cutoff K

```
Precision@K = |R ∩ K| / |K|
```

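Because most questions require only a small amount of evidence (often a single sufficient chunk or a small set of chunks), precision values naturally fall as K grows. The implementation in this notebook uses a coarser proxy for the formula above: a question scores `1 / top_k` when its evidence is satisfied and 0 otherwise, so per-question Precision@5 is either 0.200 or 0.000.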
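
As a minimal sketch of the set-based formula above, the snippet below computes Precision@K over chunk identities. The function name `precision_at_k` and the example `(file_path, chunk_index)` pairs are hypothetical and are not taken from the project.

```python
from typing import Iterable, Set, Tuple

ChunkId = Tuple[str, int]  # assumed identity: (file_path, chunk_index)


def precision_at_k(relevant: Set[ChunkId], retrieved: Iterable[ChunkId]) -> float:
    """Return |R ∩ K| / |K| for one question (0.0 if nothing was retrieved)."""
    retrieved_set = set(retrieved)
    if not retrieved_set:
        return 0.0
    return len(relevant & retrieved_set) / len(retrieved_set)


# Hypothetical example: one relevant chunk among five retrieved chunks.
relevant = {("app/main.py", 0)}
retrieved = [
    ("app/main.py", 0),
    ("README.md", 1),
    ("README.md", 2),
    ("app/db.py", 0),
    ("app/models.py", 3),
]
example_precision = precision_at_k(relevant, retrieved)
print(f"Precision@5: {example_precision:.3f}")  # prints "Precision@5: 0.200"
```

With a single relevant chunk in the top five, the set-based value coincides with the `1 / top_k` proxy; the two differ once more than one retrieved chunk is relevant.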

---

### Recall@K

**Plain-English definition:**

> Recall measures **what fraction of the documents needed to answer the question were successfully retrieved**.

In other words:

> *Of what was needed to answer the question, how much did the retriever manage to find?*

**Formal definition:**

Let:

* **R** be the set of all documents required to answer the question
* **K** be the set of top-K retrieved documents

```
Recall@K = |R ∩ K| / |R|
```

In this evaluation, recall is not computed as the fraction above but is treated as **binary per question**: a question has recall 1 if the retrieved chunks satisfy the evidence requirements, and 0 otherwise. This reflects whether the retriever succeeded in providing *sufficient evidence* to answer the question at all.
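
To make the difference between the fractional formula and the binary treatment concrete, here is a small sketch. Both function names and the chunk identifiers are hypothetical, and the binary variant assumes the simplest case, where every required chunk must be present (a plain AND).

```python
from typing import Set, Tuple

ChunkId = Tuple[str, int]  # assumed identity: (file_path, chunk_index)


def fractional_recall(required: Set[ChunkId], retrieved: Set[ChunkId]) -> float:
    """Set-based Recall@K: |R ∩ K| / |R| (0.0 when nothing is required)."""
    if not required:
        return 0.0
    return len(required & retrieved) / len(required)


def binary_recall(required: Set[ChunkId], retrieved: Set[ChunkId]) -> float:
    """Binary variant, assuming all required chunks must be retrieved."""
    return 1.0 if required <= retrieved else 0.0


# Hypothetical question needing two chunks, of which only one was retrieved.
required = {("app/config.py", 0), ("app/db.py", 1)}
retrieved = {("app/config.py", 0), ("README.md", 3)}

partial = fractional_recall(required, retrieved)
strict = binary_recall(required, retrieved)
print(f"fractional: {partial:.1f}, binary: {strict:.1f}")  # fractional: 0.5, binary: 0.0
```

The binary score gives no partial credit, which is why multi-evidence questions can score 0 even when some of their evidence was found.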


In [31]:
import sys
from pathlib import Path

# Move up until we find the project root (the folder containing 'backend')
current = Path.cwd().resolve()

for parent in [current] + list(current.parents):
    if (parent / "backend").exists():
        PROJECT_ROOT = parent
        break
else:
    raise RuntimeError("Could not find project root containing 'backend'")

if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

print("Project root added to PYTHONPATH:", PROJECT_ROOT)


Project root added to PYTHONPATH: D:\GitHub Doc Bot (Capstone Project 1)


In [32]:
import json
from pathlib import Path
from typing import Dict, Tuple, List

from backend.tasks.query.rag.embed_query import embed_query
from backend.tasks.query.rag.retrieve import retrieve


In [33]:
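def normalize_chunk(chunk: Dict) -> Tuple[str, int]:
    """Reduce a chunk to a hashable (file_path, chunk_index) identity."""
    return chunk["file_path"], chunk["chunk_index"]


def evaluate_evidence(node: Dict, retrieved_set: set) -> bool:
    """Recursively check whether the retrieved chunks satisfy an evidence tree."""
    # Leaf node: satisfied only if this exact chunk was retrieved.
    if "chunk" in node:
        return normalize_chunk(node["chunk"]) in retrieved_set

    op = node["op"]
    children = node["children"]

    # AND: every child must be satisfied.
    if op == "AND":
        return all(evaluate_evidence(c, retrieved_set) for c in children)

    # OR: at least one child must be satisfied.
    if op == "OR":
        return any(evaluate_evidence(c, retrieved_set) for c in children)

    # K_OF_N: at least k of the children must be satisfied.
    if op == "K_OF_N":
        k = node["k"]
        return sum(evaluate_evidence(c, retrieved_set) for c in children) >= k

    raise ValueError(f"Unknown operator: {op}")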
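
The sketch below shows what an evidence tree might look like and how the AND / OR / K_OF_N rules combine. The file paths, chunk indices, and the `leaf` and `is_satisfied` helpers are hypothetical; `is_satisfied` is a compact restatement of the same rules so that the example runs on its own (unlike the function above, it raises `KeyError` rather than `ValueError` on an unknown operator).

```python
from typing import Dict, Set, Tuple

ChunkId = Tuple[str, int]


def leaf(path: str, index: int) -> Dict:
    """Build a leaf node in the shape the evaluator expects."""
    return {"chunk": {"file_path": path, "chunk_index": index}}


def is_satisfied(node: Dict, retrieved: Set[ChunkId]) -> bool:
    """Count satisfied children and compare against the operator's threshold."""
    if "chunk" in node:
        chunk = node["chunk"]
        return (chunk["file_path"], chunk["chunk_index"]) in retrieved
    children = node["children"]
    hits = sum(is_satisfied(child, retrieved) for child in children)
    needed = {"AND": len(children), "OR": 1, "K_OF_N": node.get("k", 0)}
    return hits >= needed[node["op"]]


# Hypothetical tree: main.py chunk 0 AND (README chunk 2 OR setup doc chunk 1).
evidence = {
    "op": "AND",
    "children": [
        leaf("backend/main.py", 0),
        {"op": "OR", "children": [leaf("README.md", 2), leaf("docs/setup.md", 1)]},
    ],
}

full_hit = is_satisfied(evidence, {("backend/main.py", 0), ("docs/setup.md", 1)})
partial_hit = is_satisfied(evidence, {("README.md", 2)})

# Hypothetical K_OF_N node: any two of three chunks are sufficient.
two_of_three = {
    "op": "K_OF_N",
    "k": 2,
    "children": [leaf("a.py", 0), leaf("b.py", 0), leaf("c.py", 0)],
}
k_hit = is_satisfied(two_of_three, {("a.py", 0), ("c.py", 0)})

print(full_hit, partial_hit, k_hit)  # True False True
```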


In [34]:
BASE_DIR = Path(".").resolve()
DATA_DIR = BASE_DIR / "data"


def evaluate_file(
    filename: str,
    top_k: int = 5,
    verbose: bool = True
) -> Dict:
    """Evaluate retrieval against one gold file and return summary metrics."""
    gold_file = DATA_DIR / filename

    with open(gold_file, "r", encoding="utf-8") as f:
        gold = json.load(f)

    # Per-question scores, averaged after the loop.
    precisions = []
    recalls = []

    for item in gold:
        question = item["question"]
        evidence = item["evidence"]

        query_embedding = embed_query(question)
        retrieved = retrieve(query_embedding, top_k=top_k)

        retrieved_set = {
            normalize_chunk(c) for c in retrieved
        }

        satisfied = evaluate_evidence(evidence, retrieved_set)

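        # Coarse proxy: a satisfied question is credited with one useful
        # result out of top_k; recall is binary per question.
        precision = (1 / top_k) if satisfied else 0.0
        recall = 1.0 if satisfied else 0.0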

        precisions.append(precision)
        recalls.append(recall)

        if verbose:
            print(f"Q: {question}")
            print(f"  Evidence satisfied: {satisfied}")
            print(f"  Precision@{top_k}: {precision:.3f}")
            print(f"  Recall@{top_k}:    {recall:.3f}")
            print('-' * 60)

    avg_precision = sum(precisions) / len(precisions) if precisions else 0.0
    avg_recall = sum(recalls) / len(recalls) if recalls else 0.0

    print("==== Summary ====")
    print(f"Questions evaluated: {len(gold)}")
    print(f"Average Precision@{top_k}: {avg_precision:.3f}")
    print(f"Average Recall@{top_k}:    {avg_recall:.3f}")

    return {
        "questions": len(gold),
        "avg_precision": avg_precision,
        "avg_recall": avg_recall
    }
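
The `1 / top_k` proxy above credits at most one useful chunk per question. One possible alternative, sketched here under assumptions, is to collect every chunk named in the evidence tree and measure its overlap with the retrieved set, which follows the set-based Precision@K formula more closely. The helpers `collect_chunks` and `leaf_precision` are hypothetical and are not part of the notebook; the sketch also treats every leaf under an OR as relevant, which may or may not match the intended notion of relevance.

```python
from typing import Dict, Set, Tuple

ChunkId = Tuple[str, int]  # (file_path, chunk_index)


def collect_chunks(node: Dict) -> Set[ChunkId]:
    """Gather every (file_path, chunk_index) leaf referenced by an evidence tree."""
    if "chunk" in node:
        chunk = node["chunk"]
        return {(chunk["file_path"], chunk["chunk_index"])}
    leaves: Set[ChunkId] = set()
    for child in node["children"]:
        leaves |= collect_chunks(child)
    return leaves


def leaf_precision(evidence: Dict, retrieved: Set[ChunkId]) -> float:
    """Fraction of retrieved chunks that appear anywhere in the evidence tree."""
    if not retrieved:
        return 0.0
    return len(collect_chunks(evidence) & retrieved) / len(retrieved)


# Hypothetical evidence: two chunks that must both be present.
evidence = {
    "op": "AND",
    "children": [
        {"chunk": {"file_path": "app/config.py", "chunk_index": 0}},
        {"chunk": {"file_path": "app/db.py", "chunk_index": 1}},
    ],
}
retrieved = {
    ("app/config.py", 0),
    ("app/db.py", 1),
    ("README.md", 0),
    ("README.md", 4),
}
overlap_precision = leaf_precision(evidence, retrieved)
print(f"Leaf-overlap precision: {overlap_precision:.2f}")  # prints 0.50
```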


## Conceptual / Declarative Questions

These questions focus on descriptive, natural-language facts such as
frameworks, databases, and configuration.


In [35]:
conceptual_results = evaluate_file(
    filename="conceptual_gold.json",
    top_k=5
)


Q: Which web framework is used to build the API?
  Evidence satisfied: True
  Precision@5: 0.200
  Recall@5:    1.000
------------------------------------------------------------
Q: Where is the FastAPI application instance created?
  Evidence satisfied: True
  Precision@5: 0.200
  Recall@5:    1.000
------------------------------------------------------------
Q: Which database is used for persistent storage?
  Evidence satisfied: True
  Precision@5: 0.200
  Recall@5:    1.000
------------------------------------------------------------
Q: Which environment variable stores the MongoDB connection URI?
  Evidence satisfied: True
  Precision@5: 0.200
  Recall@5:    1.000
------------------------------------------------------------
Q: What programming language is used to implement the backend of this project?
  Evidence satisfied: True
  Precision@5: 0.200
  Recall@5:    1.000
------------------------------------------------------------
Q: What database technology does the application depe

## Structural / Symbolic Questions

These questions depend on code structure rather than descriptive text,
including route decorators, schemas, and endpoint definitions.


In [36]:
structural_results = evaluate_file(
    filename="structural_gold.json",
    top_k=5
)


Q: Which file contains the main API route definitions?
  Evidence satisfied: False
  Precision@5: 0.000
  Recall@5:    0.000
------------------------------------------------------------
Q: Where is the database client initialized?
  Evidence satisfied: True
  Precision@5: 0.200
  Recall@5:    1.000
------------------------------------------------------------
Q: Which library is used to define request and response schemas?
  Evidence satisfied: False
  Precision@5: 0.000
  Recall@5:    0.000
------------------------------------------------------------
Q: Where is the data model for notes defined?
  Evidence satisfied: False
  Precision@5: 0.000
  Recall@5:    0.000
------------------------------------------------------------
Q: Which endpoint is responsible for creating a new note?
  Evidence satisfied: True
  Precision@5: 0.200
  Recall@5:    1.000
------------------------------------------------------------
Q: Which endpoint deletes an existing note?
  Evidence satisfied: True
  Preci

## Compound / Multi-evidence Questions

These questions require combining information across multiple chunks,
often with logical relationships (AND / OR / K-of-N).


In [37]:
compound_results = evaluate_file(
    filename="compound_gold.json",
    top_k=5
)


Q: Which file defines the FastAPI app and also registers API routes?
  Evidence satisfied: False
  Precision@5: 0.000
  Recall@5:    0.000
------------------------------------------------------------
Q: Which file or documentation source specifies the database used by the application?
  Evidence satisfied: True
  Precision@5: 0.200
  Recall@5:    1.000
------------------------------------------------------------
Q: Where is the MongoDB connection configured and where is it consumed?
  Evidence satisfied: False
  Precision@5: 0.000
  Recall@5:    0.000
------------------------------------------------------------
Q: Which files together define the data schema and validation logic for notes?
  Evidence satisfied: False
  Precision@5: 0.000
  Recall@5:    0.000
------------------------------------------------------------
Q: Which endpoint definitions expose CRUD functionality for notes?
  Evidence satisfied: False
  Precision@5: 0.000
  Recall@5:    0.000
----------------------------------

In [None]:
import pandas as pd

summary = pd.DataFrame([
    {"Category": "Conceptual", **conceptual_results},
    {"Category": "Structural", **structural_results},
    {"Category": "Compound", **compound_results},
])

summary


     Category  questions  avg_precision  avg_recall
0  Conceptual         10           0.16         0.8
1  Structural         10           0.08         0.4
2    Compound         10           0.02         0.1
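
Because each satisfied question contributes exactly `1 / top_k` to precision, the average precision in each row is the average recall divided by `top_k`. The short check below reproduces the precision column from the recall column; the satisfied counts are derived from the table (`avg_recall × questions`), not read from the run itself.

```python
TOP_K = 5

# (category, questions, avg_recall) copied from the summary table.
rows = [("Conceptual", 10, 0.8), ("Structural", 10, 0.4), ("Compound", 10, 0.1)]

expected_precision = {}
for category, questions, avg_recall in rows:
    # Number of questions whose evidence was satisfied, derived from avg_recall.
    satisfied = round(avg_recall * questions)
    # Each satisfied question scores 1 / TOP_K; every other question scores 0.
    expected_precision[category] = satisfied * (1 / TOP_K) / questions
    print(
        f"{category}: {satisfied}/{questions} satisfied, "
        f"expected avg_precision = {expected_precision[category]:.2f}"
    )
```

This prints 0.16, 0.08, and 0.02, matching the table, so under this scoring scheme the precision column carries no information beyond the recall column.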


## Key Observations

- Embedding-based retrieval performs well for conceptual questions with
  strong natural-language signal (average Recall@5 of 0.8).
- Performance degrades for structural questions (average Recall@5 of 0.4),
  which is consistent with the lack of code-aware or syntax-aware retrieval.
- Compound questions fare worst (average Recall@5 of 0.1), further exposing
  the absence of multi-hop or iterative retrieval mechanisms.

These results are consistent with inherent limitations of embedding-only
retrieval rather than with implementation errors.
