# 📐 Evaluation as Education: Measuring 'Goodness'

In AI Engineering, "It feels better" is not a metric. We need numbers.
This notebook covers the basics of **Retrieval Evaluation**.

## 1. Recall vs. Precision

- **Recall**: Of all the relevant documents, what fraction did we retrieve? (FOMO metric)
- **Precision**: Of the documents we retrieved, what fraction is actually relevant? (Noise metric)

In [None]:
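def calculate_recall_at_k(retrieved_ids, relevant_ids, k=3):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        # Recall is undefined with no relevant documents; return 0.0 instead of dividing by zero
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)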

# Scenario: We need Doc #5 and Doc #8.
relevant = [5, 8]
# System returns:
retrieved = [1, 2, 5, 9, 8]

print("Recall@3:", calculate_recall_at_k(retrieved, relevant, k=3))
print("Recall@5:", calculate_recall_at_k(retrieved, relevant, k=5))
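
The section defines precision but only implements recall. A minimal sketch of the matching precision metric follows. The name `calculate_precision_at_k` is a hypothetical helper, not part of the original notebook, and it assumes document IDs in `retrieved_ids` are unique. It divides by the number of results actually returned in the top k, which is one common convention; dividing by `k` itself is another.

```python
def calculate_precision_at_k(retrieved_ids, relevant_ids, k=3):
    """Fraction of the top-k results that are relevant (hypothetical helper)."""
    top_k = retrieved_ids[:k]
    if not top_k:
        # Nothing was retrieved, so there is nothing to be precise about
        return 0.0
    relevant = set(relevant_ids)
    # Count how many of the returned documents are in the relevant set
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)

# Same scenario as above: we need Doc #5 and Doc #8.
relevant = [5, 8]
retrieved = [1, 2, 5, 9, 8]

print("Precision@3:", calculate_precision_at_k(retrieved, relevant, k=3))
print("Precision@5:", calculate_precision_at_k(retrieved, relevant, k=5))
```

On this data, one of the top 3 results is relevant (precision 1/3) and two of the top 5 are (precision 0.4). Raising k from 3 to 5 here lifts recall to 1.0 while precision stays well below it, which is the usual tension between the two metrics.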

## 2. Mean Reciprocal Rank (MRR)

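How *early* did the right answer appear? Finding it at position #1 is better than finding it at #10. For a single query the score is the reciprocal rank: 1 divided by the rank of the relevant result, or 0 if it never appears. MRR is the mean of that score across all queries.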

In [None]:
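def calculate_mrr(retrieved_ids, relevant_id):
    """Reciprocal rank of relevant_id for a single query; 0.0 if it was not retrieved."""
    try:
        # list.index is 0-based, so add 1 to get a 1-based rank
        rank = retrieved_ids.index(relevant_id) + 1
        return 1 / rank
    except ValueError:
        # relevant_id does not appear in the results at all
        return 0.0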

print("MRR (Found at 3):", calculate_mrr([1, 2, 5], 5))
print("MRR (Found at 1):", calculate_mrr([5, 1, 2], 5))
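
The calls above each score one query. To get the "mean" in MRR, the per-query scores are averaged over an evaluation set. A minimal sketch, assuming each query is a `(retrieved_ids, relevant_ids)` pair; the names `reciprocal_rank` and `mean_reciprocal_rank` and the sample queries are illustrative, not from the original notebook. This version also accepts several relevant documents per query and scores the first one found.

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant result (1-based), or 0.0 if none appears."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average the reciprocal rank over (retrieved_ids, relevant_ids) pairs."""
    if not queries:
        # No queries to evaluate; return 0.0 rather than dividing by zero
        return 0.0
    total = sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in queries)
    return total / len(queries)

# Illustrative evaluation set: three queries with one relevant document each
queries = [
    ([1, 2, 5], [5]),  # found at rank 3 -> 1/3
    ([5, 1, 2], [5]),  # found at rank 1 -> 1
    ([1, 2, 3], [5]),  # never found     -> 0
]

print("MRR:", mean_reciprocal_rank(queries))
```

The three scores average to 4/9, roughly 0.44. A query whose answer is never retrieved contributes 0 and pulls the mean down, so MRR penalizes complete misses as well as late hits.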