
# Practice Notebook: RAG Evaluation Metrics
**Audience:** 4th-year CS students  
**Focus Metrics:** Precision@K, Recall@K, Mean Average Precision (MAP), Mean Reciprocal Rank (MRR)

## What you'll do
1. Review what each metric measures and **what it’s used for** in RAG evaluation.  
2. Work through **guided practice problems** that use synthetic ranked results and ground-truth relevance labels.  
3. Use helper functions to **compute** P@K, R@K, AP, MAP, and MRR — then check your work with a **solutions toggle**.

> **Context:** In Retrieval-Augmented Generation (RAG), the retriever's quality strongly determines how well the generator can answer. These metrics help you quantify how good your retriever is at bringing useful context to the LLM.


In [None]:

# Toggle solutions here.
# Set to True to reveal solution cells, or False to hide them.
SHOW_SOLUTIONS = False



## 1) Metrics Overview — What they measure and why we use them

**Precision@K**  
- **What it measures:** The **fraction of the top-K retrieved documents that are actually relevant**.  
- **Why it matters:** In RAG, **top-of-list quality** matters because only a small number of docs go into the LLM context. High Precision@K means you minimize noise presented to the generator.

**Recall@K**  
- **What it measures:** The **fraction of all relevant documents that appear within the top-K**.  
- **Why it matters:** High recall ensures you **don’t miss** critical evidence the generator might need. This is especially important for knowledge-dense queries with multiple supporting docs.
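
In symbols, the two definitions above can be sketched as follows. The notation is an assumption introduced here, not part of the original notebook: $\mathrm{rel}_i \in \{0, 1\}$ indicates whether the document at rank $i$ is relevant, and $R$ is the set of all relevant documents for the query.

$$
\mathrm{P@}K = \frac{1}{K}\sum_{i=1}^{K} \mathrm{rel}_i,
\qquad
\mathrm{R@}K = \frac{1}{|R|}\sum_{i=1}^{K} \mathrm{rel}_i
$$

Both metrics share the same numerator (the number of hits in the top K); only the denominator differs. For instance, with $K = 5$, two hits in the top 5, and $|R| = 4$, this gives $\mathrm{P@}5 = 2/5 = 0.4$ and $\mathrm{R@}5 = 2/4 = 0.5$.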

**Average Precision (AP)**  
- **What it measures:** Aggregates **precision at the ranks where relevant documents appear** for a single query.  
- **Why it matters:** Rewards methods that **rank relevant documents early and often** — a smooth measure that considers the whole ranking.
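
A small worked example may make the AP definition concrete. The ranking and labels below are made up for illustration (letters stand in for doc IDs), and the loop is only a sketch of the idea, not the notebook's helper function.

```python
# Hypothetical ranking (letters are made-up doc IDs) and relevance labels.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "e"}

hits = 0
precisions_at_hits = []
for rank, doc_id in enumerate(ranked, start=1):
    if doc_id in relevant:
        hits += 1
        precisions_at_hits.append(hits / rank)  # precision at this rank

# Relevant docs sit at ranks 1, 3, 5, so the precisions are 1/1, 2/3, 3/5.
ap = sum(precisions_at_hits) / len(precisions_at_hits)
print(round(ap, 4))  # 0.7556
```

Moving a relevant document up the list raises the precision recorded at its rank, which is why AP rewards early placement.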

**Mean Average Precision (MAP)**  
- **What it measures:** The **mean of AP** across many queries.  
- **Why it matters:** A **global measure** of ranking quality over a dataset — useful when queries have **multiple relevant** documents.

**Reciprocal Rank (RR)** and **Mean Reciprocal Rank (MRR)**  
- **What they measure:** **RR = 1 / (rank of the first relevant doc)** for a query; **MRR** is the average of RR across queries.  
- **Why it matters:** Ideal when you **only need one good document quickly** (e.g., QA or chat) — it emphasizes how early the **first** relevant doc appears.
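
As a quick illustration of RR and MRR, the sketch below uses three hypothetical queries where the first relevant document appears at rank 1, at rank 3, and not at all. The variable names are assumptions made for this example.

```python
# Hypothetical 1-indexed ranks of the first relevant doc for three queries;
# None means no relevant doc was retrieved for that query.
first_relevant_ranks = [1, 3, None]

# RR is 1/rank, or 0 when nothing relevant was retrieved.
rr_values = [0.0 if r is None else 1.0 / r for r in first_relevant_ranks]

# MRR is the plain average of the per-query RR values.
mrr = sum(rr_values) / len(rr_values)
print(round(mrr, 4))  # 0.4444
```

Note that the query with no relevant result still counts in the average, contributing an RR of 0.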



## 2) Helper Functions (use these for the problems)
These functions implement the IR metrics in plain Python (NumPy is used only for averaging). They accept:
- `ranked_doc_ids`: a list of **doc IDs in ranked order** (best to worst) for a query
- `relevant_doc_ids`: a **set** of relevant doc IDs for that query
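
Retrievers often return scored hits rather than a bare list of IDs. The sketch below shows one way to get from such output to the two inputs described above. The `scored_hits` data is hypothetical, and it assumes that a higher score means a better match.

```python
# Hypothetical retriever output: (doc_id, similarity score) pairs, unsorted.
# Assumption: higher score = better match.
scored_hits = [(7, 0.42), (3, 0.91), (12, 0.15), (5, 0.77)]

# Sort by score, highest first, and keep only the IDs.
ranked_doc_ids = [
    doc_id for doc_id, _ in sorted(scored_hits, key=lambda pair: pair[1], reverse=True)
]

# Ground-truth labels go into a set for fast membership checks.
relevant_doc_ids = {5, 12, 40}

print(ranked_doc_ids)  # [3, 5, 7, 12]
```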


In [None]:

from typing import List, Set, Dict
import numpy as np

def precision_at_k(ranked_doc_ids: List[int], relevant_doc_ids: Set[int], k: int) -> float:
    """Precision@K: fraction of top-K that is relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for d in top_k if d in relevant_doc_ids)
    return hits / float(k)

def recall_at_k(ranked_doc_ids: List[int], relevant_doc_ids: Set[int], k: int) -> float:
    """Recall@K: fraction of all relevant docs that appear in top-K."""
    if len(relevant_doc_ids) == 0:
        return 0.0
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for d in top_k if d in relevant_doc_ids)
    return hits / float(len(relevant_doc_ids))

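def average_precision(ranked_doc_ids: List[int], relevant_doc_ids: Set[int]) -> float:
    """Average Precision (AP) for a single query.
    AP = mean of Precision@k over ranks k where the item at rank k is relevant.
    Note: this averages over the relevant docs that were retrieved. A common
    alternative divides by the total number of relevant docs instead, which
    gives a lower value whenever some relevant docs are missing from the ranking.
    If no relevant docs are retrieved, AP = 0.
    """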
    if len(relevant_doc_ids) == 0:
        return 0.0
    precisions = []
    hits = 0
    for i, doc_id in enumerate(ranked_doc_ids, start=1):  # ranks start at 1
        if doc_id in relevant_doc_ids:
            hits += 1
            precisions.append(hits / i)  # precision at rank i
    return float(np.mean(precisions)) if precisions else 0.0

def reciprocal_rank(ranked_doc_ids: List[int], relevant_doc_ids: Set[int]) -> float:
    """Reciprocal Rank (RR): 1 / (rank of the first relevant doc), or 0 if none."""
    for i, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            return 1.0 / i
    return 0.0

def evaluate_query(ranked: List[int], rel: Set[int], ks=(5,10)) -> Dict[str, float]:
    """Convenience function to compute P@K, R@K, AP, RR for a single query."""
    out = {}
    for k in ks:
        out[f'P@{k}'] = precision_at_k(ranked, rel, k)
        out[f'R@{k}'] = recall_at_k(ranked, rel, k)
    out['AP'] = average_precision(ranked, rel)
    out['RR'] = reciprocal_rank(ranked, rel)
    return out



## 3) Practice A — Single Query (Basics)
**Given:** a ranked list of doc IDs for a query, and the set of relevant doc IDs.  
**Compute:** Precision@5, Precision@10, Recall@5, Recall@10.

> Tip: Start by counting how many relevant docs appear in the top-K, then divide by K (for precision) or by the number of relevant docs (for recall).
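
The counting procedure in the tip can be sketched step by step. The data here is hypothetical and deliberately different from the practice problems, so it does not give away any answers.

```python
# Hypothetical data, deliberately different from the practice problems.
ranked = [10, 20, 30, 40, 50, 60]
relevant = {20, 50, 60, 99}  # doc 99 is relevant but never retrieved

results = {}
for k in (3, 6):
    top_k = ranked[:k]
    hits = [d for d in top_k if d in relevant]  # step 1: count the hits
    precision = len(hits) / k                   # step 2a: divide by K
    recall = len(hits) / len(relevant)          # step 2b: divide by |relevant|
    results[k] = (precision, recall)
    print(f"K={k}: hits={hits} P@{k}={precision:.2f} R@{k}={recall:.2f}")
# K=3: hits=[20] P@3=0.33 R@3=0.25
# K=6: hits=[20, 50, 60] P@6=0.50 R@6=0.75
```

Because doc 99 never appears in the ranking, recall cannot reach 1.0 here no matter how large K gets.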


In [None]:

# --- Problem 3.1 ---
# Ranked results for Query Q1 (best -> worst)
ranked_Q1 = [3, 9, 12, 7, 2, 5, 10, 4, 8, 1, 6, 11]
# Ground-truth relevant docs for Q1
relevant_Q1 = {2, 5, 8, 14}  # note: some IDs may not appear in the top results

# TODO: Compute P@5, P@10, R@5, R@10 by hand first, then run this cell to check.
# The calls below use the helper functions; evaluate_query is a shortcut
# that returns all of these values at once.

P5_Q1  = precision_at_k(ranked_Q1, relevant_Q1, 5)
P10_Q1 = precision_at_k(ranked_Q1, relevant_Q1, 10)
R5_Q1  = recall_at_k(ranked_Q1, relevant_Q1, 5)
R10_Q1 = recall_at_k(ranked_Q1, relevant_Q1, 10)

print({ "P@5": P5_Q1, "P@10": P10_Q1, "R@5": R5_Q1, "R@10": R10_Q1 })


In [None]:

# --- Solution 3.1 (revealed if SHOW_SOLUTIONS=True) ---
if SHOW_SOLUTIONS:
    sol = evaluate_query([3,9,12,7,2,5,10,4,8,1,6,11], {2,5,8,14})
    print("Q1 Solution:", sol)


In [None]:

# --- Problem 3.2 ---
ranked_Q2 = [1, 13, 6, 4, 8, 3, 10, 2, 5, 12, 9, 7]
relevant_Q2 = {4, 6, 9}

# TODO: Compute P@5, P@10, R@5, R@10
P5_Q2  = precision_at_k(ranked_Q2, relevant_Q2, 5)
P10_Q2 = precision_at_k(ranked_Q2, relevant_Q2, 10)
R5_Q2  = recall_at_k(ranked_Q2, relevant_Q2, 5)
R10_Q2 = recall_at_k(ranked_Q2, relevant_Q2, 10)

print({ "P@5": P5_Q2, "P@10": P10_Q2, "R@5": R5_Q2, "R@10": R10_Q2 })


In [None]:

# --- Solution 3.2 ---
if SHOW_SOLUTIONS:
    sol = evaluate_query([1,13,6,4,8,3,10,2,5,12,9,7], {4,6,9})
    print("Q2 Solution:", sol)



## 4) Practice B — Average Precision (AP) and Reciprocal Rank (RR)
**Given:** ranked list + relevant set.  
**Compute:** the **AP** (single query) and **RR** (single query).

- **AP**: For each rank *k* where the item is relevant, compute Precision@k. AP is the **mean** of those precisions.  
- **RR**: 1 divided by the **rank of the first relevant** item.
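
The AP described here averages only over the relevant items that appear in the ranking. A widely used alternative divides the same sum by the total number of relevant documents, so relevant documents that are never retrieved pull the score down. The sketch below compares the two on hypothetical data; the function and variable names are assumptions made for this example.

```python
from typing import List, Set

def precisions_at_hits(ranked: List[int], relevant: Set[int]) -> List[float]:
    """Precision at each rank where a relevant doc appears."""
    hits, out = 0, []
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            out.append(hits / rank)
    return out

ranked = [4, 1, 7, 2]   # hypothetical ranking
relevant = {1, 2, 9}    # doc 9 is relevant but never retrieved

p = precisions_at_hits(ranked, relevant)  # hits at ranks 2 and 4: [1/2, 2/4]

# Variant used in this notebook: normalize by relevant docs retrieved.
ap_retrieved = sum(p) / len(p) if p else 0.0
# Alternative: normalize by all relevant docs, retrieved or not.
ap_all = sum(p) / len(relevant) if relevant else 0.0

print(round(ap_retrieved, 4), round(ap_all, 4))  # 0.5 0.3333
```

The two variants agree whenever every relevant document appears somewhere in the ranked list, which is the case for the AP problems in this section.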


In [None]:

# --- Problem 4.1 (AP & RR) ---
ranked_Q3 = [20, 2, 5, 17, 9, 11, 3, 7, 8, 6, 1, 4]
relevant_Q3 = {1, 3, 8, 11}

# TODO: Compute AP and RR for Q3
AP_Q3 = average_precision(ranked_Q3, relevant_Q3)
RR_Q3 = reciprocal_rank(ranked_Q3, relevant_Q3)

print({ "AP": AP_Q3, "RR": RR_Q3 })


In [None]:

# --- Solution 4.1 ---
if SHOW_SOLUTIONS:
    sol = evaluate_query([20,2,5,17,9,11,3,7,8,6,1,4], {1,3,8,11})
    print("Q3 Solution:", sol)


In [None]:

# --- Problem 4.2 (AP & RR) ---
ranked_Q4 = [9, 5, 2, 4, 1, 3, 6, 7, 8]
relevant_Q4 = {2}

# TODO: Compute AP and RR for Q4
AP_Q4 = average_precision(ranked_Q4, relevant_Q4)
RR_Q4 = reciprocal_rank(ranked_Q4, relevant_Q4)

print({ "AP": AP_Q4, "RR": RR_Q4 })


In [None]:

# --- Solution 4.2 ---
if SHOW_SOLUTIONS:
    sol = evaluate_query([9,5,2,4,1,3,6,7,8], {2})
    print("Q4 Solution:", sol)



## 5) Practice C — MAP & MRR over Multiple Queries
Below is a small **synthetic dataset** of **four queries**. For each query you get:
- `ranked`: list of doc IDs in the retriever's ranked order (best to worst)
- `relevant`: a set of ground-truth relevant doc IDs

**Tasks:**
1. Compute P@5, P@10, R@5, R@10, AP, and RR **per query**.  
2. Compute **MAP** (mean of AP across all queries) and **MRR** (mean of RR across all queries).
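
The aggregation in task 2 is a plain average over queries. A minimal sketch, using made-up per-query scores rather than the toy set's answers (the `per_query_scores` name and values are hypothetical):

```python
from statistics import mean

# Hypothetical per-query scores (made-up numbers, not the toy set's answers).
per_query_scores = [
    {"qid": "A", "AP": 0.50, "RR": 1.00},
    {"qid": "B", "AP": 0.25, "RR": 0.50},
    {"qid": "C", "AP": 0.00, "RR": 0.00},  # nothing relevant retrieved
]

# Every query gets equal weight, including those that scored zero.
map_score = mean(q["AP"] for q in per_query_scores)
mrr_score = mean(q["RR"] for q in per_query_scores)
print(map_score, mrr_score)  # 0.25 0.5
```

Dropping the zero-score query before averaging would inflate both numbers, so queries with no relevant results should stay in the mean.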


In [None]:

# --- Synthetic dataset: 4 queries ---
toy_data = [
    {
        "qid": "Q5",
        "ranked":   [3, 1, 6, 8, 2, 5, 9, 4, 7, 10],
        "relevant": {1, 2, 5}
    },
    {
        "qid": "Q6",
        "ranked":   [11, 4, 9, 2, 1, 7, 3, 6, 5, 8],
        "relevant": {2, 11}
    },
    {
        "qid": "Q7",
        "ranked":   [14, 12, 15, 1, 5, 2, 3, 6, 8, 9],
        "relevant": {5, 6, 9, 14}
    },
    {
        "qid": "Q8",
        "ranked":   [21, 19, 17, 16, 18, 20, 22, 23, 24, 25],
        "relevant": {30}  # intentionally missing (no relevant retrieved)
    },
]

# --- YOUR WORK: compute per-query metrics and then MAP/MRR ---
per_query = []
for item in toy_data:
    ranked = item["ranked"]
    rel = item["relevant"]
    result = evaluate_query(ranked, rel, ks=(5,10))
    per_query.append({ "qid": item["qid"], **result })

import pandas as pd
df_results = pd.DataFrame(per_query)
df_results


In [None]:

# --- Aggregate MAP and MRR over the toy set ---
MAP = df_results["AP"].mean()
MRR = df_results["RR"].mean()

summary = pd.DataFrame({
    "Metric": ["MAP", "MRR", "Precision@5 (mean)", "Precision@10 (mean)", "Recall@5 (mean)", "Recall@10 (mean)"],
    "Score":  [MAP, MRR, df_results["P@5"].mean(), df_results["P@10"].mean(), df_results["R@5"].mean(), df_results["R@10"].mean()]
})
summary


In [None]:

# --- Solution discussion for Section 5 (shown only if SHOW_SOLUTIONS=True) ---
if SHOW_SOLUTIONS:
    from pprint import pprint
    print("Per-query metrics:")
    pprint(per_query)
    print("\nMAP, MRR, and means:")
    print(summary)



## 6) Concept Checks (Short Answer)

1. **When would you prefer Recall@10 over Precision@5 in RAG?**  
   *Hint:* Think about tasks where missing evidence is costly.

2. **Why might MRR be a better indicator than MAP for a chat assistant that inserts only a single snippet into the prompt?**  
   *Hint:* Consider what the generator needs to answer a single-turn question.

3. **How can increasing K change your evaluation story even if MAP stays similar?**  
   *Hint:* Consider precision vs. recall trade-offs and context-window limits.

4. **If a retriever has high Recall@10 but low Precision@5, what concrete change might you try?**  
   *Hint:* Consider rerankers or hybrid search to improve ranking at the top.
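
For question 3, it may help to watch precision and recall move as K grows. The sketch below sweeps K over a hypothetical ranking; the data is invented for illustration.

```python
# Hypothetical ranking with relevant docs at ranks 1, 4, and 8.
ranked = [5, 9, 2, 7, 1, 3, 8, 6, 4, 0]
relevant = {5, 7, 6}

rows = []
for k in (1, 3, 5, 10):
    hits = sum(1 for d in ranked[:k] if d in relevant)
    rows.append((k, hits / k, hits / len(relevant)))

for k, p, r in rows:
    print(f"K={k:>2}  P@K={p:.2f}  R@K={r:.2f}")
# K= 1  P@K=1.00  R@K=0.33
# K= 3  P@K=0.33  R@K=0.33
# K= 5  P@K=0.40  R@K=0.67
# K=10  P@K=0.30  R@K=1.00
```

Recall can only stay flat or rise as K increases, while precision can move in either direction, so the same ranking tells a different story depending on where you cut it.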



## 7) (Optional) Visual Aid — Where do the first relevant documents appear?
This simple plot shows the **1-indexed rank of the first relevant document** for each query in our toy set (ignoring queries with none).  
Lower is better — it reflects the behavior MRR cares about.


In [None]:

import matplotlib.pyplot as plt

first_ranks = []
qids = []  # keep query IDs so bars stay labeled when a query is skipped
for item in toy_data:
    ranked = item["ranked"]
    rel = item["relevant"]
    # Find the 1-indexed rank of the first relevant doc, if any
    pos = None
    for i, d in enumerate(ranked, start=1):
        if d in rel:
            pos = i
            break
    if pos is not None:
        qids.append(item["qid"])
        first_ranks.append(pos)

plt.figure()
plt.bar(qids, first_ranks)
plt.title("Rank of First Relevant Document (lower is better)")
plt.xlabel("Query ID")
plt.ylabel("First Relevant Rank")
plt.show()
