# Notebook 2 : Evaluating Your RAG Application

In the previous notebook, we looked at some of the key limitations of semantic search in finding the right information given some form of constraint. 

In this notebook, we'll examine how to bootstrap simple evaluation datasets using synthetic data, introduce two metrics for measuring the quality of our retrieval, and compute an initial baseline for our semantic search approach.

## Evaluation Datasets

By leveraging language models, we can bootstrap evaluation datasets with synthetic data. This allows us to identify specific areas where our retrieval might struggle before we ship to production.

Even after we ship to production, we can continuously iterate on our synthetic datasets by using real user conversations as references for our language model to generate more variations of.

This allows us to update our synthetic datasets over time and make sure that they're representative of the queries we can expect in production. In this section, we'll examine this in three parts:

1. First we'll look briefly at what Synthetic Data is
2. Then we'll talk about how we can generate an initial set of synthetic questions using the items in our dataset for reference
3. Lastly, we'll show how we might use user queries as references for generating more variations

At the end of this section you should have a clear idea of what synthetic data is and how you can generate your own variations of it.

### What is Synthetic Data?

Synthetic data is data that isn't produced by a human. In our specific context, we're referring to data that's generated by a language model.

By leveraging language models to generate questions for us and thinking carefully about the constraints we want to apply to these questions, we can generate datasets that are much richer and more diverse than what we could write ourselves. This is because of the sheer amount of data that's been used to train these language models in the first place.

In [1]:
from openai import OpenAI
import instructor
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())


class Question(BaseModel):
    question: str
    answer: str


# Sample information about Paris
paris_info = """
Paris is the capital and largest city of France, with an estimated population of 2.1 million residents.
Located in northern France, it is a major global center for art, fashion, gastronomy and culture.
The city is known for its iconic landmarks including:
- The Eiffel Tower
- The Louvre Museum
- Notre-Dame Cathedral
- Arc de Triomphe
Paris is also home to world-class universities, financial institutions, and is one of the world's leading tourist destinations.
"""

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that can generate detailed questions and answers based on the information provided below. Include specific facts and figures where relevant.",
        },
        {"role": "user", "content": paris_info},
    ],
    response_model=Question,
)

print(f"Generated Question: {resp.question}")
print(f"Answer: {resp.answer}")

Generated Question: What is the estimated population of Paris?
Answer: The estimated population of Paris is 2.1 million residents.


With this, we've generated our first synthetic question! By scaling this process out, we can generate a large number of questions easily and efficiently.

However, this doesn't come without challenges. The biggest of these is diversity, which we'll talk about in the next section.

### Diversity

As mentioned above, the biggest challenge when generating synthetic data is diversity. If we pass the language model the same prompt over and over again, we'll get nearly the same output almost every time.

Let's see this in action below when we use the prompt above to generate 4 questions.

In [2]:
from openai import OpenAI
import instructor
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())


class Question(BaseModel):
    question: str
    answer: str


# Sample information about Paris
paris_info = """
Paris is the capital and largest city of France, with an estimated population of 2.1 million residents.
Located in northern France, it is a major global center for art, fashion, gastronomy and culture.
The city is known for its iconic landmarks including:
- The Eiffel Tower
- The Louvre Museum
- Notre-Dame Cathedral
- Arc de Triomphe
Paris is also home to world-class universities, financial institutions, and is one of the world's leading tourist destinations.
"""

for _ in range(4):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that can generate detailed questions and answers based on the information provided below. Include specific facts and figures where relevant.",
            },
            {"role": "user", "content": paris_info},
        ],
        response_model=Question,
    )

    print(f"Generated Question: {resp.question}")
    print(f"Answer: {resp.answer}")

Generated Question: What is the estimated population of Paris?
Answer: The estimated population of Paris is 2.1 million residents.
Generated Question: What is the estimated population of Paris?
Answer: The estimated population of Paris is 2.1 million residents.
Generated Question: What is the capital and largest city of France, and what is its estimated population?
Answer: Paris is the capital and largest city of France, with an estimated population of 2.1 million residents.
Generated Question: What is the estimated population of Paris?
Answer: The estimated population of Paris is 2.1 million residents.


We can see that given the same prompt, the language model will generate very similar outputs over and over again. In order to combat this, we need to introduce some controlled variation into the process.

Now this needs to be done in a way that makes sense. Just using random bits of information in the prompt won't introduce the variation you need to get a diverse set of questions.

In [3]:
from openai import OpenAI
import instructor
from pydantic import BaseModel
import random

client = instructor.from_openai(OpenAI())


class Question(BaseModel):
    question: str
    answer: str


# Sample information about Paris
paris_info = """
Paris is the capital and largest city of France, with an estimated population of 2.1 million residents.
Located in northern France, it is a major global center for art, fashion, gastronomy and culture.
The city is known for its iconic landmarks including:
- The Eiffel Tower
- The Louvre Museum
- Notre-Dame Cathedral
- Arc de Triomphe
Paris is also home to world-class universities, financial institutions, and is one of the world's leading tourist destinations.
"""

for _ in range(4):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"You are a helpful assistant that can generate detailed questions and answers based on the information provided below. Include specific facts and figures where relevant. Note that the current time is {random.randint(1, 12)}:{random.randint(0, 59)}",
            },
            {"role": "user", "content": paris_info},
        ],
        response_model=Question,
    )

    print(f"Generated Question: {resp.question}")
    print(f"Answer: {resp.answer}")

Generated Question: What is the capital and largest city of France, and what is its estimated population?
Answer: Paris is the capital and largest city of France, with an estimated population of 2.1 million residents.
Generated Question: What is the estimated population of Paris, and what are some of its iconic landmarks?
Answer: Paris has an estimated population of 2.1 million residents. Some of its iconic landmarks include the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe.
Generated Question: What is the estimated population of Paris?
Answer: The estimated population of Paris is 2.1 million residents.
Generated Question: What is the estimated population of Paris?
Answer: The estimated population of Paris is 2.1 million residents.


Simply introducing variation in the prompt itself was not enough to get a diverse set of questions. In this specific case, we had two unique questions, with the rest being the same.
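One rough way to put a number on this is the fraction of distinct questions in a batch. The sketch below is an illustration only: `distinct_ratio` is a hypothetical helper, and it counts exact matches after light normalisation, so near-duplicates that differ by a few words still count as distinct.

```python
def distinct_ratio(questions: list[str]) -> float:
    """Fraction of questions that are distinct after light normalisation."""
    if not questions:
        return 0.0
    # Lowercase and collapse whitespace so trivial formatting differences don't count
    normalised = {" ".join(q.lower().split()) for q in questions}
    return len(normalised) / len(questions)


# The four questions generated in the run above
generated = [
    "What is the capital and largest city of France, and what is its estimated population?",
    "What is the estimated population of Paris, and what are some of its iconic landmarks?",
    "What is the estimated population of Paris?",
    "What is the estimated population of Paris?",
]

print(distinct_ratio(generated))  # 3 distinct strings out of 4 -> 0.75
```

Note that even the distinct questions here all ask about population, which an exact-match count like this can't detect.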

What we need instead are smarter, more deliberate modes of variation.

For instance, in this case, we could do the following:

- We could vary the tone that we use in the question
- We could ask for different types of questions - trivia, history, science
- We could also vary the specific details that we ask for in the question

Let's see this in action below.

In [4]:
from openai import OpenAI
import instructor
from pydantic import BaseModel
import random

client = instructor.from_openai(OpenAI())


class Question(BaseModel):
    question: str
    answer: str


# Sample information about Paris
paris_info = """
Paris is the capital and largest city of France, with an estimated population of 2.1 million residents.
Located in northern France, it is a major global center for art, fashion, gastronomy and culture.
The city is known for its iconic landmarks including:
- The Eiffel Tower
- The Louvre Museum
- Notre-Dame Cathedral
- Arc de Triomphe
Paris is also home to world-class universities, financial institutions, and is one of the world's leading tourist destinations.
"""
# Vary the question category and the tone; zip pairs them up, giving four combinations
questions = ["historical", "cultural", "geographical", "art"]
tones = ["curt", "formal", "technical", "casual"]
for question, tone in zip(questions, tones):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"You're a helpful assistant that can generate detailed questions and answers based on the information provided below. Make sure that the question you generate is a question pertaining to {question} and written in a {tone} tone.",
            },
            {"role": "user", "content": f"Context: {paris_info}"},
        ],
        response_model=Question,
    )

    print(f"Generated Question: {resp.question}")
    print(f"Answer: {resp.answer}")
    print(" ")

Generated Question: What is the estimated population of Paris?
Answer: The estimated population of Paris is 2.1 million residents.
 
Generated Question: What are some key aspects that contribute to Paris's status as a major global center for art, fashion, gastronomy, and culture?
Answer: Key aspects contributing to Paris's status include its rich history, iconic landmarks like the Eiffel Tower and the Louvre Museum, a vibrant arts scene, prestigious fashion houses, diverse culinary offerings, and its role as home to world-class universities and financial institutions.
 
Generated Question: What are some of the iconic landmarks that contribute to Paris being a major global center for art, fashion, gastronomy, and culture?
Answer: Some iconic landmarks that contribute to Paris's global cultural significance include the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe.
 
Generated Question: Hey, what are some of the iconic landmarks in Paris that art enthusia

By being more deliberate about the variation that we introduce, we can generate a much more diverse set of questions. This is useful because it gives us a more representative set of questions to evaluate our retrieval with.
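Pairing categories and tones with `zip` only produces four combinations. As a minimal sketch of how the space could be widened, the example below crosses the two lists with `itertools.product` and samples from the result. The helper names (`sample_combinations`, `build_system_prompt`) and the prompt wording are assumptions made for illustration, not part of the notebook.

```python
import itertools
import random

question_types = ["historical", "cultural", "geographical", "art"]
tones = ["curt", "formal", "technical", "casual"]

# zip pairs the lists element-wise (4 combinations); product crosses them (16)
all_combinations = list(itertools.product(question_types, tones))


def sample_combinations(k: int, seed: int = 0) -> list[tuple[str, str]]:
    """Pick k distinct (question_type, tone) pairs; the seed is only for reproducibility."""
    rng = random.Random(seed)
    return rng.sample(all_combinations, k)


def build_system_prompt(question_type: str, tone: str) -> str:
    # Hypothetical prompt builder, loosely mirroring the system prompt used above
    return (
        "Generate a question and answer based on the context provided. "
        f"The question should focus on {question_type} aspects and be written in a {tone} tone."
    )


for question_type, tone in sample_combinations(3):
    print(build_system_prompt(question_type, tone))
```

Each sampled prompt can then be sent to the model in the same way as in the cell above.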

Now that we've seen the main pitfall of synthetic questions and how to work around it, let's see how we can generate our first set of synthetic questions using the items in our dataset for reference.

### Generating Synthetic Questions

Going back to the previous section, what we want is a set of questions that covers a mix of the following conditions:

- Questions about price
- Questions about material
- Specific occasions that we'd like to use the product for
- Availability constraints

Let's see how we can do so for a single item in our dataset by reading in an item from our local `lancedb` database.

In [5]:
from lancedb import connect

db = connect("./lancedb")
table = db.open_table("items")

In [6]:
df = table.to_pandas().head(4)
item = df.iloc[0].to_dict()
item

{'id': 1,
 'title': 'Lace Detail Sleeveless Top',
 'description': "Elevate your casual wardrobe with this elegant sleeveless top featuring intricate lace detailing at the neckline. Perfect for both day and night, it's crafted from a soft, breathable fabric for all-day comfort.",
 'brand': 'H&M',
 'category': 'Tops',
 'product_type': 'Tank Tops',
 'attributes': '[{"name": "Sleeve Length", "value": "Sleeveless"}, {"name": "Neckline", "value": "Crew Neck"}]',
 'material': 'Cotton',
 'pattern': 'Solid',
 'price': 181.04,
 'vector': array([ 0.09296898,  0.04098858, -0.00206752, ..., -0.00216847,
         0.02412327, -0.02827667], shape=(1536,), dtype=float32),
 'in_stock': False}

In [9]:
import random

import instructor
from pydantic import BaseModel


class UserQuery(BaseModel):
    chain_of_thought: str
    query: str


async def generate_synthetic_question(client: instructor.AsyncInstructor, item: dict):
    condition = [
        "price",
        "material",
        "occasions to wear the outfit for",
        "whether it's in stock",
    ]
    return await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """
Generate a natural shopping query where this item would be the perfect recommendation. This query should be a query about the {{ query_type }} of this specific item.

Item Details:
- Title: {{title}}
- Description: {{description}}
- Brand: {{brand}}
- Material: {{material}}
- Pattern: {{pattern}}
- Attributes: {{attributes}}
- Price: {{price}}
- Category: {{category}}
- Subcategory: {{subcategory}}
- Product Type: {{product_type}}
- In Stock : {{ stock_status }}

Requirements:
- Query should be 20-30 words
- Conversational tone
- The query should describe the aspects above that make the item above a perfect match for the user's requirements
- Try to mention things which might be synonyms for the item and avoid mentioning it directly. Instead we should use specific attributes that the item has in order to make it a good fit. Make sure to use the exact attribute name so that it's unambiguous
- for the price range, keep it to 15 bucks on either side of the price max
- Do not mention the item's name or brand in the query itself

For an office blouse that costs $120:
"Need something elegant for my new corporate job. Looking for a silk top with long sleeves and a modest neckline, under $150." ( Within the price range here )

For casual wear that costs $65:
"Shopping for my weekend brunches. Need a cotton top that's both comfy and stylish, maybe with some interesting pattern - ideally something between 40-100 bucks if possible." ( 65 is less than 69 )

Remember: The query should describe what someone would be looking for if this exact item would be their perfect match!
""",
            }
        ],
        context={
            "query_type": random.choice(condition),
            "stock_status": item["in_stock"],
            "title": item["title"],
            "description": item["description"],
            "brand": item["brand"],
            "material": item["material"],
            "pattern": item["pattern"],
            "attributes": item["attributes"],
            "price": item["price"],
            "category": item["category"],
            "product_type": item["product_type"],
        },
        response_model=UserQuery,
    )

In [10]:
import instructor
from openai import AsyncOpenAI
from rich import print

client = instructor.from_openai(AsyncOpenAI())

print(await generate_synthetic_question(client, item))

With this, we've generated our first synthetic question! We've varied the query types and the item itself so that the generated queries are diverse and representative of the items in our dataset.
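The same pattern can be seeded with real user queries once they're available, as mentioned at the start of this section. The sketch below only builds the chat messages; `build_variation_messages`, its prompt wording and the two sample queries are made-up placeholders, not something taken from the dataset.

```python
def build_variation_messages(user_queries: list[str], n_variations: int = 3) -> list[dict]:
    """Build chat messages asking a model to write new queries in the style of real ones.

    Hypothetical helper: the prompt wording and parameters are illustrative only.
    """
    if not user_queries:
        raise ValueError("Need at least one reference query")

    # List the reference queries as bullet points inside the prompt
    examples = "\n".join(f"- {q}" for q in user_queries)
    system_prompt = (
        f"Here are real shopping queries from users:\n{examples}\n\n"
        f"Write {n_variations} new queries that ask for similar things but vary "
        "the wording, level of detail and constraints (price, material, occasion, availability)."
    )
    return [{"role": "system", "content": system_prompt}]


# Made-up reference queries, standing in for real user conversations
messages = build_variation_messages(
    ["need a cotton top under $50 for brunch", "is the black wrap dress in stock?"]
)
print(messages[0]["content"])
```

The resulting `messages` list could then be passed to `client.chat.completions.create` with a suitable `response_model`, in the same way as the cells above.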

Let's now shift our attention to what metrics we should use to evaluate the quality of our retrieval.

## Retrieval Metrics

Let's start by looking at a few metrics that we can use to evaluate the quality of our retrieval.

### Key Retrieval Metrics

**Precision** measures how many of our retrieved items are actually relevant:

$$ \text{Precision} = \frac{\text{Number of Relevant Items Retrieved}}{\text{Total Number of Retrieved Items}} $$ 

For example, if your system retrieves 10 documents but only 5 are relevant, that's 50% precision. Low precision indicates your system is wasting resources processing irrelevant content.

**Recall** measures how many of the total relevant items we managed to find:

$$ \text{Recall} = \frac{\text{Number of Relevant Items Retrieved}}{\text{Total Number of Relevant Items}} $$ 

If there are 20 relevant documents in your database but you only retrieve 10 of them, that's 50% recall. Low recall suggests you're missing important information.

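**Mean Reciprocal Rank (MRR)** averages, over all queries, the reciprocal of the rank at which the first relevant document appears:

$$ \text{MRR} = \frac{\sum_{i=1}^{n} \frac{1}{\text{rank}_i}}{n} $$

Here $n$ is the number of queries and $\text{rank}_i$ is the position of the first relevant document for query $i$. The higher the MRR, the better: it penalizes systems that rank irrelevant documents ahead of the first relevant one.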
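To make the averaging step concrete, here is a small worked sketch over three made-up queries. The function names and item ids are illustrative assumptions; a query whose relevant item is never retrieved contributes 0.

```python
def reciprocal_rank(predictions: list[int], relevant: set[int]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for position, item_id in enumerate(predictions, start=1):
        if item_id in relevant:
            return 1 / position
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[int], set[int]]]) -> float:
    """Average the per-query reciprocal ranks."""
    return sum(reciprocal_rank(p, r) for p, r in runs) / len(runs)


# Three made-up queries: relevant item at rank 1, at rank 2, and not retrieved at all
runs = [
    ([7, 3, 9], {7}),
    ([4, 8, 2], {8}),
    ([5, 6, 1], {0}),
]
print(mean_reciprocal_rank(runs))  # (1 + 0.5 + 0) / 3 = 0.5
```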

In practice, we often measure these metrics at specific cutoff points (like top-5 or top-10 results), denoted as Precision@K or Recall@K. These cut-off points are always going to be determined by some specific business or product requirement.
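As a quick illustration of how the cutoff changes the picture, the sketch below computes Precision@K and Recall@K on a made-up ranking. The function names and data are assumptions for this example only.

```python
def precision_at_k(predictions: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k retrieved items that are relevant."""
    top_k = predictions[:k]
    if not top_k:
        return 0.0
    return sum(1 for p in top_k if p in relevant) / len(top_k)


def recall_at_k(predictions: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(relevant & set(predictions[:k])) / len(relevant)


# Made-up ranking of ten items; "z" is relevant but never retrieved
ranked = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
relevant = {"b", "e", "h", "z"}

for k in (2, 5, 10):
    print(k, precision_at_k(ranked, relevant, k), recall_at_k(ranked, relevant, k))
```

Widening the cutoff raises recall here (0.25, 0.5, 0.75) while precision drops (0.5, 0.4, 0.3), which is the usual trade-off when choosing K.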



In [11]:
def calculate_mrr(predictions: list[str], gt: list[str]):
    mrr = 0
    for label in gt:
        if label in predictions:
            # Find the relevant item that has the smallest index
            mrr = max(mrr, 1 / (predictions.index(label) + 1))
    return mrr


def calculate_recall(predictions: list[str], gt: list[str]):
    # Calculate the proportion of relevant items that were retrieved
    return len([label for label in gt if label in predictions]) / len(gt)


def calculate_precision(predictions: list[str], gt: list[str]):
    # Calculate the proportion of retrieved items that are relevant
    return len([label for label in predictions if label in gt]) / len(predictions)

We'd ideally also like to compute these metrics at a variety of different cutoff points. Let's see how we can do this for MRR and recall below.

In [12]:
import itertools 

def get_metrics_at_k(
    metrics: list[str],
    sizes: list[int],
):
    metric_to_score_fn = {
        "mrr": calculate_mrr,
        "recall": calculate_recall,
    }

    for metric in metrics:
        if metric not in metric_to_score_fn:
            raise ValueError(f"Metric {metric} not supported")

    eval_metrics = [(metric, metric_to_score_fn[metric]) for metric in metrics]

    # Bind metric_fn and size as default arguments so each lambda keeps its own values
    return {
        f"{metric_name}@{size}": lambda predictions, gt, m=metric_fn, s=size: m(
            predictions[:s], gt
        )
        for (metric_name, metric_fn), size in itertools.product(eval_metrics, sizes)
    }

metrics = get_metrics_at_k(metrics=["mrr", "recall"], sizes=[5, 10, 25])

We can then apply these metrics to a set of retrieved items as seen below.

In [13]:

retrieved_item_ids = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
desired_item = [10]

scores = {metric: score_fn(retrieved_item_ids, desired_item) for metric, score_fn in metrics.items()}
scores


{'mrr@5': 0,
 'mrr@10': 0.1,
 'mrr@25': 0.1,
 'recall@5': 0.0,
 'recall@10': 1.0,
 'recall@25': 1.0}

We've now computed MRR and recall at several cutoff points with a single parameterised helper.

Before applying this to a real query, let's decide which metrics matter most for our use case.

### What's a good fit for our use case?

In our RAG application, we're going to be using `recall` and `MRR` as our primary metrics. This is for two reasons:

1. **Recall** : Each synthetic question is generated from a specific item, so we want to make sure that we're able to retrieve that item. Recall tells us whether this is the case.

2. **MRR** : More often than not, we're also going to be displaying these suggested items to the user. In this case, we'd want the relevant item to be ranked as highly as possible so that the user has a high chance of clicking on it. MRR helps us understand how well we're able to rank the relevant item.

For the remainder of this section, we're going to be looking at how we can use these metrics to compute a baseline for our retrieval system. 


## Computing a Baseline

### Evaluating A Single Query

We've generated a small dataset of synthetic questions ahead of time so that we don't need to generate them on the fly. Let's read in the dataset from the `queries.json` file, which stores one JSON object per line, and then use it to compute our baseline metrics.

Let's see how we might evaluate the recall and MRR for a single query. We'll use the query at index 3 in our dataset which our retrieval system will struggle with slightly.

In [14]:
import json

# Load in queries that we generated previously
with open("queries.json", "r") as f:
    queries = [json.loads(line) for line in f]


In [15]:
queries[3]

{'query': 'Searching for a cropped cotton top with short sleeves and a stylish turtleneck for casual outings and everyday wear, ideally priced below $400.',
 'title': "Fila Women's Cropped Logo T-Shirt",
 'brand': 'Fila',
 'description': "Elevate your casual look with the Fila Women's Cropped Logo T-Shirt. Featuring the iconic Fila logo in a sleek design, this short-sleeve turtleneck adds a touch of sporty elegance to any ensemble.",
 'category': 'Women',
 'subcategory': 'Tops',
 'product_type': 'T-Shirts',
 'attributes': '[{"name": "Sleeve Length", "value": "Short Sleeve"}, {"name": "Neckline", "value": "Turtleneck"}, {"name": "Fit", "value": "Cropped"}]',
 'material': 'Cotton',
 'pattern': 'Solid',
 'id': 4,
 'price': 374.89,
 'occasions': '["Everyday Wear", "Casual Outings", "Smart Casual", "Activewear", "Beachwear", "Loungewear", "Travel"]'}

Let's now see how well our retrieval system performs on this specific query. We'll do so by fetching the top 25 items from our retrieval system and then computing the recall and MRR at a variety of cutoff points.

In [16]:
target_query = queries[3]
retrieved_item_ids = [item['id'] for item in table.search(target_query["query"]).limit(25).to_list()]
desired_item = [target_query["id"]]

Once we've fetched the top 25 items from our retrieval system, we can then compute the recall and MRR at a variety of cutoff points as seen below.

In [17]:
metrics = get_metrics_at_k(metrics=["mrr", "recall"], sizes=[5, 10, 15, 20, 25])
scores = {metric: score_fn(retrieved_item_ids, desired_item) for metric, score_fn in metrics.items()}
scores

{'mrr@5': 0,
 'mrr@10': 0,
 'mrr@15': 0,
 'mrr@20': 0,
 'mrr@25': 0,
 'recall@5': 0.0,
 'recall@10': 0.0,
 'recall@15': 0.0,
 'recall@20': 0.0,
 'recall@25': 0.0}

We can see that our retrieval system wasn't able to retrieve the relevant item for this query, even within the top 25 results. Let's take a closer look at what the query was and what was retrieved.

In [18]:
from rich import print

print(target_query['query'])


In [22]:
table.search(target_query["query"]).select(["id", "title","category", "product_type","price","material","attributes"]).limit(25).to_pandas().head(20)

| | id | title | category | product_type | price | material | attributes | _distance |
|---|---|---|---|---|---|---|---|---|
| 0 | 67 | Women's Long Sleeve Turtleneck Top | Tops | Sweaters | 154.75 | Cotton | [{"name": "Sleeve Length", "value": "Long Slee... | 0.898017 |
| 1 | 26 | Striped Long Sleeve Top | Tops | T-Shirts | 81.09 | Cotton | [{"name": "Sleeve Length", "value": "Long Slee... | 0.901962 |
| 2 | 73 | Smocked Button Front Crop Top | Tops | Tank Tops | 355.15 | Cotton | [{"name": "Sleeve Length", "value": "Sleeveles... | 0.9195 |
| 3 | 44 | Women's Black Wrap Crop Top | Tops | Blouses | 99.28 | Cotton | [{"name": "Sleeve Length", "value": "Long Slee... | 0.92839 |
| 4 | 172 | Patterned Long Sleeve Top | Tops | Sweaters | 382.99 | Cotton | [{"name": "Sleeve Length", "value": "Long Slee... | 0.928833 |
| 5 | 5 | Plaid Crop Top | Tops | Tank Tops | 261.05 | Cotton | [{"name": "Sleeve Length", "value": "Sleeveles... | 0.9335 |
| 6 | 59 | Women's Long Sleeve Turtleneck Top | Tops | Sweaters | 225.63 | Cotton | [{"name": "Sleeve Length", "value": "Long Slee... | 0.93963 |
| 7 | 177 | Sleeveless Button-Down Blouse | Tops | Blouses | 353.57 | Cotton | [{"name": "Sleeve Length", "value": "Sleeveles... | 0.942118 |
| 8 | 85 | Women's Sleeveless Tie-Waist Crop Top | Tops | Tank Tops | 163.79 | Spandex | [{"name": "Sleeve Length", "value": "Sleeveles... | 0.944312 |
| 9 | 130 | Women's Cutout Cropped Top | Tops | Blouses | 222.68 | Cotton | [{"name": "Sleeve Length", "value": "Long Slee... | 0.946629 |


We can see that while the price constraint was respected, we've missed the mark on the other aspects.

1. We've got a mix of different materials: most items are cotton, but other materials such as Spandex appear as well
2. We've got a mix of different product types: the rows above include sweaters, tank tops and blouses, when what we wanted was a T-Shirt
3. We've also got a mix of different attributes: we wanted short sleeves, but the results are dominated by long-sleeved and sleeveless tops
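These mismatches can also be tallied programmatically. The sketch below copies the `product_type` and `material` columns of the ten rows shown above into plain dictionaries and counts how many satisfy each constraint. `match_rate` is a hypothetical helper; sleeve length is left out because the attribute strings are truncated in the output.

```python
from collections import Counter

# id, product_type and material of the ten retrieved rows shown above
retrieved = [
    {"id": 67, "product_type": "Sweaters", "material": "Cotton"},
    {"id": 26, "product_type": "T-Shirts", "material": "Cotton"},
    {"id": 73, "product_type": "Tank Tops", "material": "Cotton"},
    {"id": 44, "product_type": "Blouses", "material": "Cotton"},
    {"id": 172, "product_type": "Sweaters", "material": "Cotton"},
    {"id": 5, "product_type": "Tank Tops", "material": "Cotton"},
    {"id": 59, "product_type": "Sweaters", "material": "Cotton"},
    {"id": 177, "product_type": "Blouses", "material": "Cotton"},
    {"id": 85, "product_type": "Tank Tops", "material": "Spandex"},
    {"id": 130, "product_type": "Blouses", "material": "Cotton"},
]

# Constraints taken from the target item (a cotton T-Shirt)
wanted = {"product_type": "T-Shirts", "material": "Cotton"}


def match_rate(rows: list[dict], field: str, value: str) -> float:
    """Fraction of rows whose field equals the wanted value."""
    return sum(1 for row in rows if row[field] == value) / len(rows)


for field, value in wanted.items():
    print(field, match_rate(retrieved, field, value))

print(Counter(row["product_type"] for row in retrieved))
```

Only one of the ten rows is a T-Shirt, even though nine of them are cotton, which suggests the embedding is matching on material and general style rather than on product type.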

As a result, our original item didn't appear in the top 25 results. Let's now compute the recall and MRR for all of our given queries and see how we perform as a whole.

### Computing Our Baseline

We'll use pandas to store the results of our computations so that they can be dumped into a simple .csv file for future reference. To speed things up, we'll use a ThreadPoolExecutor to run the (synchronous) retrieval calls in parallel.
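One property worth knowing before relying on this: `executor.map` yields results in the order of its inputs, not in the order the calls finish, so the results line up with the list of queries. A minimal sketch, where `fake_retrieve` is a made-up stand-in for the real retrieval call:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_retrieve(query_id: int) -> dict:
    # Stand-in for a retrieval call; later ids sleep less, so they finish first
    time.sleep(0.01 * (5 - query_id))
    return {"query_id": query_id, "recall@5": float(query_id % 2)}


with ThreadPoolExecutor(max_workers=4) as executor:
    # map preserves input order even though the calls complete out of order
    results = list(executor.map(fake_retrieve, range(5)))

print([r["query_id"] for r in results])  # [0, 1, 2, 3, 4]
```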

In [23]:
from lancedb.table import Table

def retrieve_items(table: Table, query: dict, metrics: dict):
    # Fetch the top 25 results and score them against the item the query was generated from
    retrieved_item_ids = [row["id"] for row in table.search(query["query"]).limit(25).to_list()]
    desired_item = [query["id"]]
    return {metric: score_fn(retrieved_item_ids, desired_item) for metric, score_fn in metrics.items()}


db = connect("./lancedb")
table = db.open_table("items")
retrieve_items(table,queries[2],metrics)

{'mrr@5': 1.0,
 'mrr@10': 1.0,
 'mrr@15': 1.0,
 'mrr@20': 1.0,
 'mrr@25': 1.0,
 'recall@5': 1.0,
 'recall@10': 1.0,
 'recall@15': 1.0,
 'recall@20': 1.0,
 'recall@25': 1.0}

In [24]:
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

# Create a ThreadPoolExecutor to run queries in parallel
with ThreadPoolExecutor() as executor:
    # Map the retrieve_items function across all queries
    results = list(executor.map(
        lambda q: retrieve_items(table, q, metrics),
        queries
    ))

# Convert results to DataFrame for analysis
results_df = pd.DataFrame(results)
results_df.mean()

mrr@5        0.471930
mrr@10       0.489756
mrr@15       0.491635
mrr@20       0.494731
mrr@25       0.494731
recall@5     0.631579
recall@10    0.763158
recall@15    0.789474
recall@20    0.842105
recall@25    0.842105
dtype: float64
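To keep these numbers around for later comparison, the averages can be written out as CSV. The sketch below uses only the standard library and hard-codes the averages printed above. `write_baseline` is a hypothetical helper, and with pandas available `results_df.mean().to_csv(...)` would be a shorter route.

```python
import csv
import io

# Averages copied from the output above
baseline = {
    "mrr@5": 0.471930,
    "mrr@10": 0.489756,
    "mrr@15": 0.491635,
    "mrr@20": 0.494731,
    "mrr@25": 0.494731,
    "recall@5": 0.631579,
    "recall@10": 0.763158,
    "recall@15": 0.789474,
    "recall@20": 0.842105,
    "recall@25": 0.842105,
}


def write_baseline(scores: dict[str, float], fh) -> None:
    """Write one metric per row to any writable text file handle."""
    writer = csv.writer(fh)
    writer.writerow(["metric", "score"])
    for metric, score in scores.items():
        writer.writerow([metric, f"{score:.6f}"])


# An in-memory buffer keeps the example side-effect free
buffer = io.StringIO()
write_baseline(baseline, buffer)
print(buffer.getvalue())
```

To write to disk instead, the same function can be given a file opened with `open(path, "w", newline="")`.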

With a recall@25 of roughly 0.84 and an MRR@25 of roughly 0.49, this is a good baseline for us to start with. In the next section, we'll see how we can use metadata filtering to improve these results.