# Notebook 2: Evaluating Your RAG Application

> **Note** : This notebook is a preview of what we cover in [improvingrag.com](https://improvingrag.com). Stop settling for "Looks Good to Me" and start building better systems. Learn how to turn RAG from a risky experiment into a structured, data-driven practice. Check out [improvingrag.com](https://improvingrag.com) for a proven foundational framework to help you go beyond the basics to improve performance, quality, and user experience. 


In [Notebook 1: Systematically Improving Your RAG Application](#notebook-1-systematically-improving-your-rag-application) we saw why focusing solely on generation can be both costly and slow. In this notebook, we'll shift our focus over to the retrieval side of things, showing how we can benchmark the performance of our retrieval system and improve it over time.

## Why This Matters

Without accurate and relevant retrieval, even the most advanced language models struggle to generate useful answers. By establishing objective performance baselines using synthetic evaluation datasets and key retrieval metrics, you can:
- **Diagnose Issues:** Identify gaps where relevant information is missed.
- **Measure Effectiveness:** Use quantitative metrics (e.g., Recall, Mean Reciprocal Rank) to assess and compare retrieval performance.
- **Iterate Rapidly:** Drive targeted, data-driven improvements to your retrieval pipeline, ultimately enhancing overall user satisfaction.

## What You'll Learn

In this notebook, you will learn to:

1. **Generate Synthetic Evaluation Datasets**
   - Leverage language models to create diverse synthetic queries.
   - Introduce controlled variations to mimic real-world user inputs.
   - Continuously refine these datasets with actual user data.

2. **Define and Compute Key Retrieval Metrics**
   - Understand metrics such as Precision, Recall, and Mean Reciprocal Rank (MRR).
   - Evaluate retrieval performance at multiple cutoff points (e.g., top-5, top-10).
   - Connect these metrics to real-world impacts on user experience.

3. **Establish a Baseline for Your Retrieval System**
   - Assess single-query performance and aggregate results across a dataset.
   - Benchmark improvements from enhancements like better metadata captioning.
   - Use data-driven insights to guide systematic, iterative improvements in your RAG application.

By the end of this notebook, you’ll have a solid framework for measuring and improving the retrieval component of your RAG system, setting the stage for faster, more effective iterations.

## Evaluation Datasets

By leveraging language models, we can bootstrap evaluation datasets with synthetic data. This lets us find specific areas where our retrieval might struggle before we ship to production. The process shouldn't stop once we ship: we can continuously iterate on these synthetic datasets so that they stay representative of the queries we can expect in production.

In this section, we'll examine this in 3 parts:

1. First, we'll talk briefly about what synthetic data is.
2. Then we'll look at why we need to be mindful of the diversity in our synthetic data, and how introducing controlled variation into the process helps us generate a more representative set of queries.
3. Finally, we'll look at how we can generate an initial set of synthetic questions, using the items in our dataset for reference, to benchmark our retrieval system.

At the end of this section you should have a clear idea of what synthetic data is and how you can generate your own variations of it.

### What is Synthetic Data?

Synthetic data is data that's not generated by a human. In our specific context here, we're referring to data that's generated by a language model.

By leveraging language models to generate questions for us and thinking carefully about the constraints we want to apply to these questions, we can generate datasets that are much richer and more diverse than what we could write ourselves. This is because of the sheer amount of data that's been used to train these language models in the first place.

In [1]:
from openai import OpenAI
import instructor
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())


class Question(BaseModel):
    question: str
    answer: str


# Sample information about Paris
paris_info = """
Paris is the capital and largest city of France, with an estimated population of 2.1 million residents.
Located in northern France, it is a major global center for art, fashion, gastronomy and culture.
The city is known for its iconic landmarks including:
- The Eiffel Tower
- The Louvre Museum
- Notre-Dame Cathedral
- Arc de Triomphe
Paris is also home to world-class universities, financial institutions, and is one of the world's leading tourist destinations.
"""

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that can generate detailed questions and answers based on the information provided below. Include specific facts and figures where relevant.",
        },
        {"role": "user", "content": paris_info},
    ],
    response_model=Question,
)

print(f"Generated Question: {resp.question}")
print(f"Answer: {resp.answer}")

Generated Question: What is the estimated population of Paris?
Answer: The estimated population of Paris is 2.1 million residents.


With this, we've generated our first synthetic question! By scaling this process out, we can generate a large number of questions easily and efficiently.

However, this doesn't come without some challenges - the biggest of which is that of diversity which we'll talk about in the next section.

### Why Diversity Matters

As mentioned above, the biggest challenge when generating synthetic data is that of diversity. If we pass the language model the same prompt over and over again, we'll get nearly identical outputs every time.

Let's see this in action below when we use the prompt above to generate 4 questions.

In [8]:
from openai import OpenAI
import instructor
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())


class Question(BaseModel):
    question: str
    answer: str


# Sample information about Paris
paris_info = """
Paris is the capital and largest city of France, with an estimated population of 2.1 million residents.
Located in northern France, it is a major global center for art, fashion, gastronomy and culture.
The city is known for its iconic landmarks including:
- The Eiffel Tower
- The Louvre Museum
- Notre-Dame Cathedral
- Arc de Triomphe
Paris is also home to world-class universities, financial institutions, and is one of the world's leading tourist destinations.
"""

for _ in range(4):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that can generate detailed questions and answers based on the information provided below.",
            },
            {"role": "user", "content": f"""
             content: { paris_info }
             """},
        ],
        response_model=Question,
    )

    print(f"Generated Question: {resp.question}")
    # print(f"Answer: {resp.answer}")

Generated Question: What is the capital and largest city of France, and what is its estimated population?
Generated Question: What is the capital and largest city of France?
Generated Question: What is the capital and largest city of France, and what is its estimated population?
Generated Question: What is the capital and largest city of France?


We can see that given the same prompt, the language model will generate very similar outputs over and over again. In order to combat this, we need to introduce some controlled variation into the process.

Now this needs to be done in a way that makes sense. Just using random bits of information in the prompt won't introduce the variation you need to get a diverse set of questions.

In [9]:
from openai import OpenAI
import instructor
from pydantic import BaseModel
import random

client = instructor.from_openai(OpenAI())


class Question(BaseModel):
    question: str
    answer: str


# Sample information about Paris
paris_info = """
Paris is the capital and largest city of France, with an estimated population of 2.1 million residents.
Located in northern France, it is a major global center for art, fashion, gastronomy and culture.
The city is known for its iconic landmarks including:
- The Eiffel Tower
- The Louvre Museum
- Notre-Dame Cathedral
- Arc de Triomphe
Paris is also home to world-class universities, financial institutions, and is one of the world's leading tourist destinations.
"""

for _ in range(4):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"You are a helpful assistant that can generate detailed questions and answers based on the information provided below. Include specific facts and figures where relevant. Note that the current time is {random.randint(1, 12)}:{random.randint(0, 59)}",
            },
            {"role": "user", "content": f"""
             content: { paris_info }
             """},
        ],
        response_model=Question,
    )

    print(f"Generated Question: {resp.question}")
    # print(f"Answer: {resp.answer}")

Generated Question: What is the estimated population of Paris?
Generated Question: What is the population of Paris as of the latest estimates?
Generated Question: What is the estimated population of Paris and what is its significance in the world?
Generated Question: What is the capital and largest city of France, and what is its estimated population?


Simply adding random noise to the prompt was not enough to get a diverse set of questions. In this specific case, the four questions are worded differently, but they all still ask about the same fact: the population of Paris.
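One rough way to put a number on how repetitive a batch of generated questions is would be to compare their word overlap. The sketch below is only an illustration and is not part of the original notebook: the helper names (`tokenize`, `jaccard`, `mean_pairwise_similarity`) are hypothetical, and Jaccard similarity over word tokens is a crude proxy that ignores meaning. The `generated` list is copied from the output above.

```python
import re
from itertools import combinations


def tokenize(text: str) -> set[str]:
    # Lowercase the text and keep alphanumeric word tokens only
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    # Jaccard similarity: shared tokens divided by all distinct tokens
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_pairwise_similarity(questions: list[str]) -> float:
    # Average the similarity over every unordered pair of questions
    pairs = list(combinations(questions, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Questions copied from the run above
generated = [
    "What is the estimated population of Paris?",
    "What is the population of Paris as of the latest estimates?",
    "What is the estimated population of Paris and what is its significance in the world?",
    "What is the capital and largest city of France, and what is its estimated population?",
]

print(f"Unique strings: {len(set(generated))} of {len(generated)}")
print(f"Mean pairwise word overlap: {mean_pairwise_similarity(generated):.2f}")
```

A score close to 1 suggests the questions are near-duplicates, while a lower score suggests more varied wording. Exact-string uniqueness alone can be misleading, since paraphrases of the same question count as "unique".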

Instead, what we need to do is to introduce smart modes of variation.

For instance, in this case, we could do the following:

- We could vary the tone that we use in the question
- We could ask for different types of questions - trivia, history, science
- We could also vary the specific details that we ask for in the question

Let's see this in action below.

In [10]:
from openai import OpenAI
import instructor
from pydantic import BaseModel
import random

client = instructor.from_openai(OpenAI())


class Question(BaseModel):
    question: str
    answer: str


# Sample information about Paris
paris_info = """
Paris is the capital and largest city of France, with an estimated population of 2.1 million residents.
Located in northern France, it is a major global center for art, fashion, gastronomy and culture.
The city is known for its iconic landmarks including:
- The Eiffel Tower
- The Louvre Museum
- Notre-Dame Cathedral
- Arc de Triomphe
Paris is also home to world-class universities, financial institutions, and is one of the world's leading tourist destinations.
"""
# Define the question categories and tones used to steer each generated question
questions = ["historical", "cultural", "geographical", "art"]
tones = ["curt", "formal", "technical", "casual"]
for question, tone in zip(questions, tones):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"You're a helpful assistant that can generate detailed questions and answers based on the information provided below. Make sure that the question you generate is a question pertaining to {question} and written in a {tone} tone.",
            },
            {"role": "user", "content": f"Context: {paris_info}"},
        ],
        response_model=Question,
    )

    print(f"Generated Question: {resp.question}")
    # print(f"Answer: {resp.answer}")
    print(" ")

Generated Question: What is the capital and largest city of France?
 
Generated Question: What are some of the most notable cultural landmarks in Paris that contribute to its reputation as a global center for art and culture?
 
Generated Question: What factors contribute to Paris's status as a major global center for art, fashion, gastronomy, and culture, despite its population of 2.1 million residents?
 
Generated Question: What makes Paris such a hotspot for art lovers?
 


By being more deliberate about the variation that we introduce, we can generate a much more diverse set of questions. This gives us a more representative set of questions to evaluate our retrieval with.
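The example above pairs each category with exactly one tone via `zip`, which yields only four combinations. A possible extension, sketched below under our own assumptions, is to cross every category with every tone and sample from the resulting grid. The names `question_types`, `tone_options`, `sample_variations` and `build_system_prompt` are hypothetical and are not part of the original notebook.

```python
import itertools
import random

question_types = ["historical", "cultural", "geographical", "art"]
tone_options = ["curt", "formal", "technical", "casual"]


def sample_variations(n: int, seed: int = 0) -> list[tuple[str, str]]:
    # Enumerate every (question type, tone) pair, then sample without replacement
    all_pairs = list(itertools.product(question_types, tone_options))
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return rng.sample(all_pairs, k=min(n, len(all_pairs)))


def build_system_prompt(question_type: str, tone: str) -> str:
    # Hypothetical helper that mirrors the system prompt used above
    return (
        "You're a helpful assistant that can generate detailed questions and answers "
        "based on the information provided below. Make sure that the question you "
        f"generate pertains to {question_type} topics and is written in a {tone} tone."
    )


for question_type, tone in sample_variations(6):
    print(build_system_prompt(question_type, tone))
```

Each prompt could then be sent to the model in the same way as in the loop above. Sampling without replacement means no combination repeats until the grid is exhausted.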

Now that we've seen some of the issues with generating synthetic questions, let's see how we can generate our first set of synthetic questions using the items in our dataset for reference.

### Generating Synthetic Questions

Going back to the previous section, we want to generate a set of questions that includes a mix of the following conditions:

- Questions about price
- Questions about material
- Availability constraints

Let's see how we can do so for a single item in our dataset by reading in an item from our local `lancedb` database.

In [13]:
from lancedb import connect

db = connect("./lancedb")
table = db.open_table("items")



In [14]:
df = table.to_pandas().head(4)
item = df.iloc[0].to_dict()
item

{'id': 1,
 'title': 'Lace Detail Sleeveless Top',
 'description': "Elevate your casual wardrobe with this elegant sleeveless top featuring intricate lace detailing at the neckline. Perfect for both day and night, it's crafted from a soft, breathable fabric for all-day comfort.",
 'brand': 'H&M',
 'category': 'Tops',
 'product_type': 'Tank Tops',
 'attributes': '[{"name": "Sleeve Length", "value": "Sleeveless"}, {"name": "Neckline", "value": "Crew Neck"}]',
 'material': 'Cotton',
 'pattern': 'Solid',
 'price': 181.04,
 'vector': array([ 0.09288761,  0.04094775, -0.00205971, ..., -0.00216066,
         0.02410295, -0.02825646], shape=(1536,), dtype=float32),
 'in_stock': False}

In [58]:
import random

import instructor
from pydantic import BaseModel


class UserQuery(BaseModel):
    chain_of_thought: str
    query: str


async def generate_synthetic_question(client: instructor.AsyncInstructor, item: dict):
    condition = [
        "price",
        "material",
        "whether it's in stock",
    ]
    return await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """
Generate a natural shopping query where this item would be the perfect recommendation. This query should be a query about the {{ query_type }} of this specific item.

Item Details:
- Title: {{title}}
- Description: {{description}}
- Brand: {{brand}}
- Material: {{material}}
- Pattern: {{pattern}}
- Attributes: {{attributes}}
- Price: {{price}}
- Category: {{category}}
- Product Type: {{product_type}}
- In Stock : {{ stock_status }}

Requirements:
- Query should be 20-30 words
- Conversational tone
- The query should describe the aspects above that make the item above a perfect match for the user's requirements
- Try to mention things which might be synonyms for the item and avoid mentioning it directly. Instead we should use specific attributes that the item has in order to make it a good fit. Make sure to use the exact attribute name so that it's unambiguous
- for the price range, keep it to 15 bucks on either side of the price max
- Do not mention the item's name or brand in the query itself

Remember: The query should describe what someone would be looking for if this exact item would be their perfect match!
""",
            }
        ],
        context={
            "query_type": random.choice(condition),
            "stock_status": item["in_stock"],
            "title": item["title"],
            "description": item["description"],
            "brand": item["brand"],
            "material": item["material"],
            "pattern": item["pattern"],
            "attributes": item["attributes"],
            "price": item["price"],
            "category": item["category"],
            "product_type": item["product_type"],
        },
        response_model=UserQuery,
    )

In [59]:
import instructor
from openai import AsyncOpenAI
from rich import print

client = instructor.from_openai(AsyncOpenAI())

print(await generate_synthetic_question(client, item))

Now that we’ve explored how to generate synthetic queries—and importantly, how to introduce the controlled variations needed to mimic real user inputs—let’s shift our focus to measurement. 

In the next section, we'll introduce key retrieval metrics like Recall and Mean Reciprocal Rank (MRR) to objectively assess the performance of our retrieval system.

## Retrieval Metrics

Let's start by looking at a few metrics that we can use to evaluate the quality of our retrieval.

### Key Retrieval Metrics

**Precision** measures how many of our retrieved items are actually relevant:

$$ \text{Precision} = \frac{\text{Number of Relevant Items Retrieved}}{\text{Total Number of Retrieved Items}} $$ 

For example, if your system retrieves 10 documents but only 5 are relevant, that's 50% precision. Low precision indicates your system is wasting resources processing irrelevant content.

**Recall** measures how many of the total relevant items we managed to find:

$$ \text{Recall} = \frac{\text{Number of Relevant Items Retrieved}}{\text{Total Number of Relevant Items}} $$ 

If there are 20 relevant documents in your database but you only retrieve 10 of them, that's 50% recall. Low recall suggests you're missing important information.
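The two worked examples above can be reproduced with a few lines of Python. This is a minimal sketch using made-up document ids; the function names `precision_score` and `recall_score` are illustrative and are separate from the helpers defined later in this notebook.

```python
def precision_score(retrieved: list[str], relevant: set[str]) -> float:
    # Share of retrieved documents that are relevant
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)


def recall_score(retrieved: list[str], relevant: set[str]) -> float:
    # Share of relevant documents that were retrieved
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


# Scenario 1: 10 documents retrieved, only 5 of them relevant
retrieved_a = [f"doc{i}" for i in range(10)]
relevant_a = {f"doc{i}" for i in range(5)}
print(precision_score(retrieved_a, relevant_a))  # 0.5

# Scenario 2: 20 relevant documents exist, we retrieve 10 of them
relevant_b = {f"doc{i}" for i in range(20)}
retrieved_b = [f"doc{i}" for i in range(10)]
print(recall_score(retrieved_b, relevant_b))  # 0.5
```

Note that the two metrics pull in different directions: in the second scenario every retrieved document is relevant, so precision is 1.0 even though recall is only 0.5.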

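**Mean Reciprocal Rank (MRR)** measures the average reciprocal rank of the first relevant document across a set of queries:

$$ \text{MRR} = \frac{\sum_{i=1}^{n} \frac{1}{\text{rank}(i)}}{n} $$

Here $n$ is the number of queries and $\text{rank}(i)$ is the position of the first relevant document in the results for query $i$. The higher the MRR, the better: it penalizes systems that rank irrelevant documents ahead of the first relevant one.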
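To make the averaging concrete, here is a small worked example over three toy queries. It is a sketch with made-up data: the names `reciprocal_rank` and `mean_reciprocal_rank` are illustrative, and a query whose relevant document is never retrieved is assumed to contribute 0.

```python
def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    # 1 / position of the first relevant document, or 0 if none was retrieved
    for position, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / position
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    # Average the reciprocal rank over all queries
    if not runs:
        return 0.0
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)


runs = [
    (["a", "b", "c"], {"a"}),  # first relevant document at rank 1 -> 1
    (["x", "y", "z"], {"z"}),  # first relevant document at rank 3 -> 1/3
    (["p", "q", "r"], {"s"}),  # relevant document not retrieved -> 0
]

mrr = mean_reciprocal_rank(runs)
print(round(mrr, 3))  # (1 + 1/3 + 0) / 3 = 0.444
```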

In practice, we often measure these metrics at specific cutoff points (like top-5 or top-10 results), denoted as Precision@K or Recall@K. These cut-off points are always going to be determined by some specific business or product requirement. Here are some practical examples of how we might use these metrics:



| **Use Case**           | **Primary Metrics**  | **Reasoning**                                                                                                          |
|------------------------|----------------------|------------------------------------------------------------------------------------------------------------------------|
| **Retrieval**          | Recall, MRR          | Ensures the correct context is provided by retrieving all relevant items and ranking them highly.                      |
| **Tool Calling**       | Precision, Recall    | Guarantees that all necessary tools are invoked while avoiding irrelevant tool calls that waste resources.             |
| **Metadata Filtering** | Recall, MRR          | Focuses on improving the quality of returned results by effectively filtering a large dataset for the most pertinent items. |



In [32]:
def calculate_mrr(predictions: list[str], gt: list[str]):
    mrr = 0
    for label in gt:
        if label in predictions:
            # Find the relevant item that has the smallest index
            mrr = max(mrr, 1 / (predictions.index(label) + 1))
    return mrr


def calculate_recall(predictions: list[str], gt: list[str]):
    # Calculate the proportion of relevant items that were retrieved
    return len([label for label in gt if label in predictions]) / len(gt)


def calculate_precision(predictions: list[str], gt: list[str]):
    # Calculate the proportion of retrieved items that are relevant
    return len([label for label in predictions if label in gt]) / len(predictions)

We'd ideally also like to compute these metrics at a variety of different cutoff points. Let's see how we can do this below.

In [33]:
import itertools 

def get_metrics_at_k(
    metrics: list[str],
    sizes: list[int],
):
    metric_to_score_fn = {
        "mrr": calculate_mrr,
        "recall": calculate_recall,
    }

    for metric in metrics:
        if metric not in metric_to_score_fn:
            raise ValueError(f"Metric {metric} not supported")

    eval_metrics = [(metric, metric_to_score_fn[metric]) for metric in metrics]

    return {
        # Bind metric_fn and size as defaults so each lambda keeps its own values
        f"{metric_name}@{size}": lambda predictions, gt, m=metric_fn, s=size: m(
            predictions[:s], gt
        )
        for (metric_name, metric_fn), size in itertools.product(eval_metrics, sizes)
    }

metrics = get_metrics_at_k(metrics=["mrr", "recall"], sizes=[5, 10, 25])

We can then apply this dictionary of metrics to a set of retrieved items as seen below.

In [13]:

retrieved_item_ids = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
desired_item = [10]

scores = {metric: score_fn(retrieved_item_ids, desired_item) for metric, score_fn in metrics.items()}
scores


{'mrr@5': 0,
 'mrr@10': 0.1,
 'mrr@25': 0.1,
 'recall@5': 0.0,
 'recall@10': 1.0,
 'recall@25': 1.0}

We can see that we've computed the metrics for a variety of different cutoff points. This is a function that we can parameterise and apply to calculate mrr and recall at a variety of different cutoff points.
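One detail in `get_metrics_at_k` that is easy to miss is the use of default arguments (`m=metric_fn, s=size`) in the lambda. Python closures look up loop variables when the function is called, not when it is defined, so without the defaults every metric would end up using the last cutoff. The standalone snippet below illustrates this with toy lambdas that are not part of the notebook's code.

```python
sizes = [5, 10, 25]

# Late binding: each lambda looks up `size` when it is called,
# so all three use the final value of the loop variable (25).
late_bound = [lambda items: items[:size] for size in sizes]

# Default argument: `s=size` is evaluated when the lambda is created,
# so each lambda keeps its own cutoff.
early_bound = [lambda items, s=size: items[:s] for size in sizes]

items = list(range(30))
print([len(fn(items)) for fn in late_bound])   # [25, 25, 25]
print([len(fn(items)) for fn in early_bound])  # [5, 10, 25]
```

`functools.partial` would be another way to get the same early binding.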

Before we apply this to a real query, let's decide which of these metrics fit our use case.

### What's a good fit for our use case?

In our RAG application, we're going to be using `recall` and `MRR` as our primary metrics. This is for two reasons:

1. **Recall**: We want to make sure that for the synthetic question we generate for a specific item, we're able to retrieve that item. Recall tells us whether this is the case.

2. **MRR**: More often than not, we're also going to be displaying these suggested items to the user. In this case, we'd want the relevant item to be ranked as highly as possible so that the user has a high chance of clicking on it. MRR helps us understand how well we rank the relevant item.
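Because each synthetic query here has exactly one target item, both metrics become easy to interpret for a single query: recall@k is 1 if the target appears in the top k and 0 otherwise, and the reciprocal rank is 1 divided by the target's position. The sketch below uses made-up ids, and the helper names `hit_at_k` and `reciprocal_rank_at_k` are illustrative only.

```python
def hit_at_k(retrieved_ids: list[int], target_id: int, k: int) -> float:
    # With a single target, recall@k is simply "was it in the top k?"
    return 1.0 if target_id in retrieved_ids[:k] else 0.0


def reciprocal_rank_at_k(retrieved_ids: list[int], target_id: int, k: int) -> float:
    # 1 / position of the target within the top k, or 0 if it is missing
    top_k = retrieved_ids[:k]
    if target_id not in top_k:
        return 0.0
    return 1 / (top_k.index(target_id) + 1)


toy_retrieved_ids = [7, 3, 9, 1, 5]
toy_target_id = 9  # sits at position 3

for k in (1, 3, 5):
    hit = hit_at_k(toy_retrieved_ids, toy_target_id, k)
    rr = reciprocal_rank_at_k(toy_retrieved_ids, toy_target_id, k)
    print(f"k={k}: recall={hit}, reciprocal rank={rr:.2f}")
```

Averaged over many queries, recall@k becomes the fraction of queries whose target item made it into the top k.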

With a clear understanding of metrics such as Recall and MRR, we’re ready to put these numbers into practice. In the next section, we'll evaluate our existing retrieval system. We'll first demonstrate how to calculate these metrics for a single query before we scale it out to our entire dataset. 

This will allow us to establish a robust baseline which will serve as our reference point for measuring future improvements.

## Computing a Baseline

### Evaluating A Single Query

We've generated a small dataset of synthetic questions ahead of time so that you can measure the performance of your retrieval system. Let's read in the dataset from the `queries.json` file and then use it to compute our baseline metrics.

Let's see how we might evaluate the recall and MRR for a single query. We'll use the query at index 3 in our dataset which our retrieval system will struggle with slightly.

In [1]:
import json

# Load the queries we generated previously (the file stores one JSON object per line)
with open("./data/queries.json", "r") as f:
    queries = [json.loads(line) for line in f]


In [42]:
queries[3]

{'query': 'Searching for a cropped cotton top with short sleeves and a stylish turtleneck for casual outings and everyday wear, ideally priced below $400.',
 'title': "Fila Women's Cropped Logo T-Shirt",
 'brand': 'Fila',
 'description': "Elevate your casual look with the Fila Women's Cropped Logo T-Shirt. Featuring the iconic Fila logo in a sleek design, this short-sleeve turtleneck adds a touch of sporty elegance to any ensemble.",
 'category': 'Tops',
 'product_type': 'T-Shirts',
 'attributes': '[{"name": "Sleeve Length", "value": "Short Sleeve"}, {"name": "Neckline", "value": "Turtleneck"}, {"name": "Fit", "value": "Cropped"}]',
 'material': 'Cotton',
 'pattern': 'Solid',
 'id': 4,
 'price': 374.89}

Let's now see how well our retrieval system performs on this specific query. We'll do so by fetching the top 25 items from our retrieval system and then computing the recall and MRR at a variety of cutoff points.

In [43]:
target_query = queries[3]
retrieved_item_ids = [item['id'] for item in table.search(target_query["query"]).limit(25).to_list()]
desired_item = [target_query["id"]]

Once we've fetched the top 25 items from our retrieval system, we can then compute the recall and MRR at a variety of cutoff points as seen below

In [44]:
metrics = get_metrics_at_k(metrics=["mrr", "recall"], sizes=[5, 10, 15, 20, 25])
scores = {metric: score_fn(retrieved_item_ids, desired_item) for metric, score_fn in metrics.items()}
scores

{'mrr@5': 0,
 'mrr@10': 0,
 'mrr@15': 0,
 'mrr@20': 0,
 'mrr@25': 0,
 'recall@5': 0.0,
 'recall@10': 0.0,
 'recall@15': 0.0,
 'recall@20': 0.0,
 'recall@25': 0.0}

We can see that our retrieval system wasn't able to retrieve the relevant item for this specific query. Let's take a closer look at the query and at what was retrieved.

In [60]:
from rich import print

print(target_query['query'])


In [61]:
table.search(target_query["query"]).select(["id", "title","category", "product_type","price","material","attributes"]).limit(25).to_pandas().head(20)

|   | id | title | category | product_type | price | material | attributes | _distance |
|---|----|-------|----------|--------------|-------|----------|------------|-----------|
| 0 | 145 | Off-Shoulder Crop Top | Tops | Blouses | 184.61 | Cotton | `[{"name": "Sleeve Length", "value": "Short Sle...` | 0.832879 |
| 1 | 130 | Women's Cutout Cropped Top | Tops | Blouses | 222.68 | Cotton | `[{"name": "Sleeve Length", "value": "Long Slee...` | 0.886718 |
| 2 | 73 | Smocked Button Front Crop Top | Tops | Tank Tops | 355.15 | Cotton | `[{"name": "Sleeve Length", "value": "Sleeveles...` | 0.89511 |
| 3 | 5 | Plaid Crop Top | Tops | Tank Tops | 261.05 | Cotton | `[{"name": "Sleeve Length", "value": "Sleeveles...` | 0.915811 |
| 4 | 108 | Lace-Up Cropped Blouse | Tops | Blouses | 158.65 | Cotton | `[{"name": "Sleeve Length", "value": "Short Sle...` | 0.931424 |
| 5 | 4 | Fila Women's Cropped Logo T-Shirt | Tops | T-Shirts | 374.89 | Cotton | `[{"name": "Sleeve Length", "value": "Short Sle...` | 0.949217 |
| 6 | 85 | Women's Sleeveless Tie-Waist Crop Top | Tops | Tank Tops | 163.79 | Spandex | `[{"name": "Sleeve Length", "value": "Sleeveles...` | 0.950213 |
| 7 | 172 | Patterned Long Sleeve Top | Tops | Sweaters | 382.99 | Cotton | `[{"name": "Sleeve Length", "value": "Long Slee...` | 0.960017 |
| 8 | 44 | Women's Black Wrap Crop Top | Tops | Blouses | 99.28 | Cotton | `[{"name": "Sleeve Length", "value": "Long Slee...` | 0.963834 |
| 9 | 51 | Women's Cropped Logo T-Shirt | Tops | T-Shirts | 226.56 | Cotton | `[{"name": "Sleeve Length", "value": "Short Sle...` | 0.969768 |


We can see that while the price was adhered to, we've missed the mark on the other aspects. In the rows shown above:

1. We've got a mix of different materials: most of the items are cotton, but Spandex appears in the mix too.
2. We've got a mix of different product types: blouses, tank tops and a sweater show up when what we wanted was a T-shirt.
3. We've also got a mix of different attributes: we wanted short-sleeved shirts but we've got sleeveless and long-sleeved items in the mix as well.

As a result, our original item didn't appear in the top 25 results. Let's now compute the recall and MRR for all of our given queries and see how we perform as a whole.

### Computing Our Baseline

We'll use pandas to store the results of our computations and then dump the results into a simple .csv file for future reference. To speed things up, we'll use a `ThreadPoolExecutor` to run the retrieval calls for each query in parallel.

In [62]:
from lancedb.table import Table

def retrieve_items(table: Table, query_item: dict, metrics: dict):
    # Fetch the top 25 results for the query and score them against the target item
    retrieved_item_ids = [
        result["id"] for result in table.search(query_item["query"]).limit(25).to_list()
    ]
    desired_item = [query_item["id"]]
    return {
        metric: score_fn(retrieved_item_ids, desired_item)
        for metric, score_fn in metrics.items()
    }


db = connect("./lancedb")
table = db.open_table("items")
retrieve_items(table,queries[3],metrics)

{'mrr@5': 0,
 'mrr@10': 0,
 'mrr@15': 0,
 'mrr@20': 0,
 'mrr@25': 0,
 'recall@5': 0.0,
 'recall@10': 0.0,
 'recall@15': 0.0,
 'recall@20': 0.0,
 'recall@25': 0.0}

In [64]:
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

# Create a ThreadPoolExecutor to run queries in parallel
with ThreadPoolExecutor() as executor:
    # Map the retrieve_items function across all queries
    results = list(executor.map(
        lambda q: retrieve_items(table, q, metrics),
        queries
    ))

# Convert results to DataFrame for analysis
results_df = pd.DataFrame(results)
round(results_df.mean(),2)

mrr@5        0.47
mrr@10       0.49
mrr@15       0.49
mrr@20       0.49
mrr@25       0.49
recall@5     0.63
recall@10    0.76
recall@15    0.79
recall@20    0.84
recall@25    0.84
dtype: float64
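The same fan-out-and-average pattern can be tried without a database. The sketch below swaps the LanceDB search for a hypothetical in-memory lookup (`FAKE_INDEX` and `fake_search` are stand-ins we made up, as are the toy queries), so it only illustrates the control flow and not real retrieval quality.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical stand-in for a retrieval backend: query text -> ranked item ids
FAKE_INDEX = {
    "red cotton shirt": [3, 1, 2],
    "blue denim jeans": [5, 4, 6],
    "wool winter scarf": [9, 8, 7],
}


def fake_search(query: str, limit: int) -> list[int]:
    return FAKE_INDEX.get(query, [])[:limit]


def score_query(query_item: dict) -> dict:
    # Score a single query against its one target item
    retrieved = fake_search(query_item["query"], limit=3)
    found = query_item["id"] in retrieved
    return {
        "recall@3": 1.0 if found else 0.0,
        "mrr@3": 1 / (retrieved.index(query_item["id"]) + 1) if found else 0.0,
    }


toy_queries = [
    {"query": "red cotton shirt", "id": 3},    # target at rank 1
    {"query": "blue denim jeans", "id": 4},    # target at rank 2
    {"query": "wool winter scarf", "id": 10},  # target never retrieved
]

with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map yields results in the same order as the inputs
    toy_results = list(executor.map(score_query, toy_queries))

averages = {name: mean(result[name] for result in toy_results) for name in toy_results[0]}
print(averages)
```

Threads are a reasonable fit here because each retrieval call spends most of its time waiting on I/O, such as the embedding request.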

With this, we've now computed an initial set of baseline scores for our retrieval system using metrics like recall and MRR. In the next section, we'll see how we can improve the performance of our retrieval system by enriching item descriptions with metadata.

## Better Captioning

One of the easiest changes that we can make to our retrieval system is to modify the text that we're using for similarity search. In this section, we'll see how we can improve performance by simply concatenating metadata to the item description and using this new description for similarity search.

We'll start by loading in our dataset using `lancedb` and then we'll create a new table that embeds the item description and metadata information together.

In [66]:
# Import required libraries for LanceDB and data handling
import json

import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector

from datasets import load_dataset

# Load the e-commerce product dataset
dataset = load_dataset("ivanleomk/ai-engineer-summit-ecommerce-taxonomy")["train"]

# Initialize the OpenAI text embedding model
func = get_registry().get("openai").create(name="text-embedding-3-small")


# Define a Pydantic model for our database schema
class Item(LanceModel):
    id: int
    title: str
    description: str = func.SourceField()  # Field that will be embedded
    brand: str
    category: str
    product_type: str
    attributes: str
    material: str
    pattern: str
    price: float
    vector: Vector(func.ndims()) = func.VectorField()  # Store the embedding vector
    in_stock: bool


# Connect to LanceDB
db = lancedb.connect("./lancedb")
table_name = "items_better_captioning"


# Create and populate table if it doesn't exist
if table_name not in db.table_names():
    # Create new table with our schema
    concat_table = db.create_table(table_name, schema=Item, mode="overwrite")
    entries = []
    
    # Process each row in the dataset
    for row in dataset:
        # Create enhanced description by combining multiple fields
        entries.append(
            {
                "id": row["id"],
                "title": row["title"],
                "description": f"""
title: {row["title"]}
description: {row["description"]}
brand: {row["brand"]}
category: {row["category"]}
product_type: {row["product_type"]}
price: {row["price"]}
attributes: {json.dumps(row["attributes"])}
""".strip(),  # Combine metadata with description for better search
                "brand": row["brand"],
                "category": row["category"],
                "product_type": row["product_type"],
                "attributes": row["attributes"],
                "material": row["material"],
                "pattern": row["pattern"],
                "price": row["price"],
                "in_stock": row["in_stock"],
            }
        )

    # Add all entries to the table
    concat_table.add(entries)

# Open the table for querying
concat_table = db.open_table(table_name)
print(f"{concat_table.count_rows()} rows in the table")

Now that we've created our new table, let's see how much of an improvement we get when we run vector search over this new description. We'll use the same queries that we used previously and compute the recall and MRR at a variety of cutoff points.

In [69]:
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

# Create a ThreadPoolExecutor to run queries in parallel
with ThreadPoolExecutor() as executor:
    # Map the retrieve_items function across all queries
    results = list(executor.map(
        lambda q: retrieve_items(concat_table, q, metrics),
        queries
    ))

# Convert results to DataFrame for analysis
results_df = pd.DataFrame(results)
round(results_df.mean(),2)

mrr@5        0.56
mrr@10       0.58
mrr@15       0.58
mrr@20       0.58
mrr@25       0.58
recall@5     0.71
recall@10    0.84
recall@15    0.87
recall@20    0.89
recall@25    0.92
dtype: float64

In this case, we see an average increase of ~18% for MRR and ~10% for recall across all values of `k` for this new improved description that we're embedding.

| Metric | Semantic Search (Baseline) | Semantic Search + Metadata |
|---------|--------------------------|----------------------------------|
| MRR@5   | 0.47 | 0.56 (+19.15%) |
| MRR@10  | 0.49 | 0.58 (+18.43%) |
| MRR@15  | 0.49 | 0.58 (+18.42%) |
| MRR@20  | 0.49 | 0.58 (+18.21%) |
| MRR@25  | 0.49 | 0.58 (+18.21%) |
| Recall@5  | 0.63 | 0.71 (+12.50%) |
| Recall@10 | 0.76 | 0.84 (+10.34%) |
| Recall@15 | 0.79 | 0.87 (+10.00%) |
| Recall@20 | 0.84 | 0.89 (+5.95%) |
| Recall@25 | 0.84 | 0.92 (+9.52%) |
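The relative changes can be recomputed from the averages printed by the two evaluation runs above. The sketch below hard-codes those rounded means, so the percentages it prints may differ slightly from figures computed on unrounded scores; `relative_change` is an illustrative helper name.

```python
# Rounded means copied from the two evaluation runs above
baseline = {
    "mrr@5": 0.47, "mrr@10": 0.49, "mrr@15": 0.49, "mrr@20": 0.49, "mrr@25": 0.49,
    "recall@5": 0.63, "recall@10": 0.76, "recall@15": 0.79, "recall@20": 0.84, "recall@25": 0.84,
}
with_metadata = {
    "mrr@5": 0.56, "mrr@10": 0.58, "mrr@15": 0.58, "mrr@20": 0.58, "mrr@25": 0.58,
    "recall@5": 0.71, "recall@10": 0.84, "recall@15": 0.87, "recall@20": 0.89, "recall@25": 0.92,
}


def relative_change(old: float, new: float) -> float:
    # Percentage change relative to the old score
    return (new - old) / old * 100


for name, old in baseline.items():
    new = with_metadata[name]
    print(f"{name}: {old:.2f} -> {new:.2f} ({relative_change(old, new):+.2f}%)")
```

Keeping this calculation in code rather than doing it by hand makes it easy to re-run whenever the baseline or the improved scores change.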

Instead of simply hoping that our tweaks will work or relying on arbitrary advice, we now have measurable evidence of improvement. For example, by concatenating metadata to the item description, we’ve boosted our MRR by roughly 18% and our recall by roughly 10% on average. This marks a significant shift, from a mindset of “I hope this works” to one where we can quantify the benefits and make informed, objective decisions.

This allows us to prioritise changes that make a difference and give us the biggest leverage rather than hoping it works.

## Conclusion

In this notebook, we explored a systematic approach to evaluating the retrieval component of your RAG application. By bootstrapping our evaluation datasets with synthetic queries and leveraging key metrics like Recall and Mean Reciprocal Rank (MRR), we established a robust performance baseline. We discussed the challenges of generating diverse queries and demonstrated how controlled variation in query generation can lead to a more representative evaluation of our system.

Our approach allowed us to quantify the impact of improvements—such as enhancing item descriptions with richer metadata—which directly translated into measurable gains in recall and MRR. This systematic methodology not only helps identify high-leverage enhancements, like metadata filtering, but also extends to other areas such as fine-tuning embedding models or implementing re-ranking strategies. By quantifying the impact of each individual change, you can focus your efforts on modifications that truly move the needle.

In the next notebook, we'll dive into metadata filtering. We'll also provide a sneak peek into how query understanding using function calling can further improve the performance of our retrieval system. This continuous, data-driven refinement ensures that every change you make is aligned with delivering significant, measurable improvements.