# Notebook 1: Systematically Improving Your RAG Application

## What is RAG?

Language models have a fixed knowledge cutoff date and no access to private information about your specific domain. To get around this, we add relevant information to the prompt so that the model can use it to answer the question.

This is known as Retrieval-Augmented Generation (RAG). Let's see an example below.

In [1]:
import openai
import instructor
from pydantic import BaseModel
from rich import print


class Response(BaseModel):
    response: str


client = instructor.from_openai(openai.OpenAI())

print(
    client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": "What's the price of AX123"},
        ],
        response_model=Response,
    )
)

This doesn't work because the model knows nothing about the specific item we're asking about. To fix this, we can inject some additional context into the prompt.

In [2]:
print(
    client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant. Here is the price list:\nAX123: $49.99\nBX456: $29.99\nCX789: $79.99\nDX012: $39.99",
            },
            {"role": "user", "content": "What's the price of AX123"},
        ],
        response_model=Response,
    )
)

Without the right context, the model cannot answer the question correctly. That is why we start with retrieval before evaluating the faithfulness and quality of the responses: it lets us confirm that the model was given the right context in the first place.
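
The second cell above hard-codes the price list into the system message. As a minimal sketch of the same idea, the snippet below assembles that message from a plain dictionary. The `build_messages` helper and the `prices` dictionary are hypothetical names introduced here for illustration; they are not part of the notebook.

```python
def build_messages(question: str, prices: dict[str, float]) -> list[dict[str, str]]:
    """Format a price list into a system message and pair it with the user question."""
    price_lines = "\n".join(f"{sku}: ${price:.2f}" for sku, price in prices.items())
    system = f"You are a helpful assistant. Here is the price list:\n{price_lines}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]


# Same items as the hard-coded prompt in the cell above
prices = {"AX123": 49.99, "BX456": 29.99, "CX789": 79.99, "DX012": 39.99}
messages = build_messages("What's the price of AX123", prices)
print(messages[0]["content"])
```

The resulting list has the same shape as the `messages=` argument used in the cells above, so in a real application the dictionary would be replaced by whatever the retrieval step returns.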

## Starting with Retrieval

![](./assets/rag.png)

RAG more often than not resembles a recommendation system, with a multi-step ranking process that identifies the documents best suited to the user's query.

Most people build a RAG system and assume that semantic search is all they need. This is not the case. Let's look at a few examples where semantic search fails to retrieve the relevant documents.

We'll be using the `ivanleomk/ai-engineer-summit-ecommerce-taxonomy` dataset, which I've cleaned ahead of time and which contains a list of items and their descriptions.

In [7]:
from datasets import load_dataset

dataset = load_dataset("ivanleomk/ai-engineer-summit-ecommerce-taxonomy")['train']

In [9]:
from lancedb.pydantic import LanceModel, Vector
import lancedb
from lancedb.embeddings import get_registry

# Create Embedding Function
func = get_registry().get("openai").create(name="text-embedding-3-small")


# Define a Model that will be used as the schema for our collection
class Item(LanceModel):
    id: int
    title: str
    description: str = func.SourceField()
    brand: str
    category: str
    product_type: str
    attributes: str
    material: str
    pattern: str
    price: float
    vector: Vector(func.ndims()) = func.VectorField()
    in_stock: bool


db = lancedb.connect("./lancedb")
table_name = "items"

if table_name not in db.table_names():
    table = db.create_table(table_name, schema=Item, mode="overwrite")
    entries = []
    for row in dataset:
        entries.append(
            {
                "id": row["id"],
                "title": row["title"],
                "description": row["description"],
                "brand": row["brand"],
                "category": row["category"],
                "product_type": row["product_type"],
                "attributes": row["attributes"],
                "material": row["material"],
                "pattern": row["pattern"],
                "price": row["price"],
                "in_stock": row["in_stock"]
            }
        )

    table.add(entries)

table = db.open_table(table_name)
print(f"{table.count_rows()} rows in the table")

[90m[[0m2025-02-12T18:07:22Z [33mWARN [0m lance::dataset::write::insert[90m][0m No existing dataset at /Users/ivanleo/Documents/coding/ai-engineer-summit/lancedb/items.lance, it will be created


191 rows in the table


## Where Semantic Search Fails

Because semantic search ranks documents by the cosine similarity between the query embedding and the document embeddings, it sometimes misses specific nuances of a query. As a result, there are a few categories of queries where we expect semantic search to fail.

1. Queries for complementary items - Eg. `I'd like a shirt that goes well with my blue polyester skirt`
2. Queries for items that have specific attributes - Eg. `I'm looking for a blue t-shirt that's under $50`
3. Queries for items that have specific constraints - Eg. `I'm looking for a skirt that's in size S that I can wear to a party tonight`

Let's see how semantic search performs for these queries and how we end up retrieving the wrong documents.
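
To make the ranking signal concrete, here is a minimal sketch of the cosine similarity computation itself. The three-dimensional vectors are made up for illustration and are not real embedding output.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (invented values, not model output)
query = [1.0, 0.0, 1.0]
doc_same_direction = [2.0, 0.0, 2.0]
doc_orthogonal = [0.0, 1.0, 0.0]

print(cosine_similarity(query, doc_same_direction))  # close to 1: same direction
print(cosine_similarity(query, doc_orthogonal))      # 0: nothing in common
```

The score only measures how closely two vectors point in the same direction. Nothing in it enforces a hard constraint such as a price ceiling or a stock flag, which is why the categories of queries listed above are difficult for similarity search alone.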

### Complementary Items

Let's try a query of `I have a blue t-shirt at the moment and I'd love a pair of jeans to go with it`.

We can see that for complementary items, even though we're asking for a pair of jeans, we end up retrieving mostly T-shirts.

In [11]:
import pandas as pd

items = [
    {
        "title": item["title"],
        "description": item["description"],
        "category": item["category"],
        "material": item["material"],
        "price": item["price"],
        "product_type": item["product_type"],
        "in_stock": item["in_stock"],
    }
    for item in table.search(query="I have a blue t-shirt at the moment and I'd love a pair of jeans to go with it").limit(10).to_list()
]

pd.DataFrame(items)

Unnamed: 0,title,description,category,material,price,product_type,in_stock
0,High-Waist Blue Jeans,These high-waist blue jeans are a staple for a...,Bottoms,Denim,231.27,Jeans,False
1,High-Waisted Blue Jeans,These classic high-waisted blue jeans offer a ...,Bottoms,Denim,362.77,Jeans,False
2,Thunderstorm Print Jeans,Make a bold statement with these black jeans b...,Bottoms,Denim,397.68,Jeans,False
3,Denim High-Waisted Shorts,"Versatile and comfortable, these denim high-wa...",Bottoms,Denim,390.63,Shorts,False
4,Floral V-Neck T-Shirt,This stylish navy V-neck t-shirt features a bo...,Tops,Cotton,319.97,T-Shirts,False
5,Graphic Logo T-Shirt,This Levi's graphic logo t-shirt features a bo...,Tops,Cotton,354.38,T-Shirts,False
6,Striped Crew Neck T-Shirt,This classic striped tee features a timeless c...,Tops,Cotton,139.2,T-Shirts,True
7,Women's Classic Grey V-Neck T-Shirt,This versatile grey V-neck T-shirt offers both...,Tops,Cotton,344.87,T-Shirts,True
8,Printed Yellow T-Shirt,Brighten your wardrobe with this vibrant yello...,Tops,Cotton,90.49,T-Shirts,True
9,Basic Crew Neck T-Shirt,This classic yellow crew neck t-shirt offers a...,Tops,Cotton,204.79,T-Shirts,False


Six of the ten retrieved items are T-shirts, which is not what we're looking for. Remember that the query is asking for jeans, but because it also mentions a `t-shirt`, semantic search retrieves items that are related to T-shirts as well.

### Specific Attributes

Now let's see how well semantic search performs for queries that have specific attributes.

In [12]:
import pandas as pd

items = [
    {
        "title": item["title"],
        "description": item["description"],
        "category": item["category"],
        "material": item["material"],
        "price": item["price"],
        "product_type": item["product_type"],
        "in_stock": item["in_stock"],
    }
    for item in table.search(query="I want a skirt that's under $150 which is made of Cotton").limit(10).to_list()
]

pd.DataFrame(items)

Unnamed: 0,title,description,category,material,price,product_type,in_stock
0,Women's White Sleeveless Top with Skirt,This elegant white sleeveless top pairs perfec...,Tops,Cotton,262.91,Tank Tops,False
1,White Eyelet Mini Skirt,"Featuring a delicate eyelet design, this white...",Bottoms,Cotton,102.31,Skirts,True
2,Floral Print Skirt,Flaunt your femininity with this charming flor...,Bottoms,Cotton,300.24,Skirts,False
3,Rust Midi Skirt,This chic rust-colored midi skirt offers a sop...,Bottoms,Cotton,338.49,Skirts,False
4,High-Waisted Pencil Skirt,This elegant high-waisted pencil skirt is desi...,Bottoms,Polyester,64.18,Skirts,True
5,Sleeveless Button-Down Blouse,This classic sleeveless button-down blouse is ...,Tops,Cotton,353.57,Blouses,False
6,Women's High-Waisted Midi Skirt,Make a statement with this chic high-waisted m...,Bottoms,Polyester,389.11,Skirts,False
7,Striped Midi Skirt,Add a touch of elegance to your ensemble with ...,Bottoms,Polyester,349.57,Skirts,True
8,Women's Classic Grey V-Neck T-Shirt,This versatile grey V-neck T-shirt offers both...,Tops,Cotton,344.87,T-Shirts,True
9,Black Denim Skirt,A versatile black denim skirt with front pocke...,Bottoms,Denim,289.67,Skirts,True


We can see here that semantic search retrieves mostly skirts (7 of the 10 results), but 4 of them are not made of cotton. Additionally, although we asked for an item under $150, only 2 of the retrieved items are under $150, and only one of those is made of cotton.
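
These counts can be verified mechanically. The sketch below copies four columns of the results table above by hand and checks each row against the constraints in the query. The `satisfies` helper is a hypothetical name used only for this illustration.

```python
# (title, material, price, product_type) copied from the results table above
retrieved = [
    ("Women's White Sleeveless Top with Skirt", "Cotton", 262.91, "Tank Tops"),
    ("White Eyelet Mini Skirt", "Cotton", 102.31, "Skirts"),
    ("Floral Print Skirt", "Cotton", 300.24, "Skirts"),
    ("Rust Midi Skirt", "Cotton", 338.49, "Skirts"),
    ("High-Waisted Pencil Skirt", "Polyester", 64.18, "Skirts"),
    ("Sleeveless Button-Down Blouse", "Cotton", 353.57, "Blouses"),
    ("Women's High-Waisted Midi Skirt", "Polyester", 389.11, "Skirts"),
    ("Striped Midi Skirt", "Polyester", 349.57, "Skirts"),
    ("Women's Classic Grey V-Neck T-Shirt", "Cotton", 344.87, "T-Shirts"),
    ("Black Denim Skirt", "Denim", 289.67, "Skirts"),
]


def satisfies(row, material="Cotton", max_price=150.0, product_type="Skirts"):
    """True when a row meets every constraint stated in the query."""
    _, row_material, row_price, row_type = row
    return row_material == material and row_price < max_price and row_type == product_type


skirts = [row for row in retrieved if row[3] == "Skirts"]
non_cotton_skirts = [row for row in skirts if row[1] != "Cotton"]
under_150 = [row for row in retrieved if row[2] < 150]
matching = [row[0] for row in retrieved if satisfies(row)]

print(len(skirts), len(non_cotton_skirts), len(under_150), matching)
```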

### Availability Constraints

Now let's see how well semantic search does when we specify constraints on the availability of the item.

In [13]:
import pandas as pd

items = [
    {
        "title": item["title"],
        "description": item["description"],
        "category": item["category"],
        "material": item["material"],
        "price": item["price"],
        "in_stock": item["in_stock"],
    }
    for item in table.search(query="I want a skirt that's in stock now").limit(10).to_list()
]

pd.DataFrame(items)

Unnamed: 0,title,description,category,material,price,in_stock
0,Green Plaid Mini Skirt,Add a pop of pattern to your outfit with this ...,Bottoms,Polyester,191.17,True
1,Women's Pleated Midi Skirt,Add a pop of color to your outfit with this vi...,Bottoms,Polyester,318.64,False
2,Floral Print Skirt,Flaunt your femininity with this charming flor...,Bottoms,Cotton,300.24,False
3,Black Denim Skirt,A versatile black denim skirt with front pocke...,Bottoms,Denim,289.67,True
4,Rust Midi Skirt,This chic rust-colored midi skirt offers a sop...,Bottoms,Cotton,338.49,False
5,Women's High-Waisted Midi Skirt,Make a statement with this chic high-waisted m...,Bottoms,Polyester,389.11,False
6,Plaid Pencil Skirt,This plaid pencil skirt is a versatile additio...,Bottoms,Cotton,275.42,False
7,High-Waisted Pencil Skirt,This elegant high-waisted pencil skirt is desi...,Bottoms,Polyester,64.18,True
8,White Eyelet Mini Skirt,"Featuring a delicate eyelet design, this white...",Bottoms,Cotton,102.31,True
9,Women's White Sleeveless Top with Skirt,This elegant white sleeveless top pairs perfec...,Tops,Cotton,262.91,False


Similar to the previous examples, semantic search mostly retrieves the right category of items here, but we have no way to filter out items that are not in stock.

In fact, 6 of the 10 retrieved items (60%) are not in stock.
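
The share of out-of-stock results can be tallied directly from the `in_stock` column of the table above. The values below are copied by hand from that output, in rank order.

```python
# in_stock column of the ten results above, in rank order
in_stock = [True, False, False, True, False, False, False, True, True, False]

out_of_stock = in_stock.count(False)
share_out_of_stock = out_of_stock / len(in_stock)
print(f"{out_of_stock} of {len(in_stock)} results are out of stock ({share_out_of_stock:.0%})")
```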

## Implementing Filters

One way to get around this issue is to apply metadata filters alongside the vector search. Let's see how we can do this.

LanceDB supports filtering out of the box through `.where(...)`. The cells below pass `prefilter=True`, which applies the filter before the vector search rather than to the results afterwards. Let's revisit our previous examples and see how we can implement these filters.
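
The cells below pass `prefilter=True` to `.where(...)`. To show why the order of filtering and ranking matters, here is a small pure-Python sketch. The catalogue and its similarity scores are invented for illustration, and the sketch does not model how LanceDB executes a filter internally; it only contrasts the two orderings.

```python
# Toy catalogue: (title, product_type, similarity score to the query).
# Scores are invented; a real system would compute them from embeddings.
catalogue = [
    ("Floral V-Neck T-Shirt", "T-Shirts", 0.91),
    ("Graphic Logo T-Shirt", "T-Shirts", 0.89),
    ("High-Waist Blue Jeans", "Jeans", 0.88),
    ("Striped Crew Neck T-Shirt", "T-Shirts", 0.86),
    ("Thunderstorm Print Jeans", "Jeans", 0.80),
    ("Olive Green Skinny Jeans", "Jeans", 0.75),
]


def top_k(rows, k):
    """Return the k rows with the highest similarity score."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:k]


def is_jeans(row):
    return row[1] == "Jeans"


k = 3
# Post-filter: rank everything, keep the top k, then drop non-matching rows.
post_filtered = [row for row in top_k(catalogue, k) if is_jeans(row)]
# Pre-filter: drop non-matching rows first, then rank what is left.
pre_filtered = top_k([row for row in catalogue if is_jeans(row)], k)

print(len(post_filtered), len(pre_filtered))
```

With post-filtering, two of the three top-ranked rows are T-shirts and get discarded, leaving a single result. With pre-filtering, all three slots are filled with jeans.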

### Complementary Items

Let's revisit our previous example where we wanted a pair of jeans to go with a blue t-shirt. We can ensure we get the right items by applying a filter on the product type of the item itself.

In [15]:
import pandas as pd

items = [
    {
        "title": item["title"],
        "description": item["description"],
        "category": item["category"],
        "material": item["material"],
        "product_type": item["product_type"],
        "price": item["price"],
    }
    for item in table.search(query="I have a blue t-shirt at the moment and I'd love a pair of jeans to go with it")
    .limit(10)
    .where("product_type='Jeans'", prefilter=True)
    .to_list()
]

pd.DataFrame(items)

Unnamed: 0,title,description,category,material,product_type,price
0,High-Waist Blue Jeans,These high-waist blue jeans are a staple for a...,Bottoms,Denim,Jeans,231.27
1,High-Waisted Blue Jeans,These classic high-waisted blue jeans offer a ...,Bottoms,Denim,Jeans,362.77
2,Thunderstorm Print Jeans,Make a bold statement with these black jeans b...,Bottoms,Denim,Jeans,397.68
3,Olive Green Skinny Jeans,"Crafted from stretchy cotton, these olive gree...",Bottoms,Cotton,Jeans,392.11
4,High-Waisted Skinny Jeans,These high-waisted skinny jeans are a wardrobe...,Bottoms,Denim,Jeans,31.08
5,High Rise Skinny Jeans,Elevate your denim collection with these class...,Bottoms,Denim,Jeans,382.0
6,High-Waisted Relaxed Jeans,These high-waisted jeans offer a relaxed fit a...,Bottoms,Denim,Jeans,288.76
7,High-Rise Skinny Jeans,These elegant pink high-rise skinny jeans offe...,Bottoms,Denim,Jeans,144.55
8,Mid Rise Skinny Jeans,These mid-rise skinny jeans offer a flattering...,Bottoms,Denim,Jeans,103.01
9,Mid Rise Skinny Jeans,Experience the perfect fit with these mid-rise...,Bottoms,Denim,Jeans,11.87


By applying the right filter on our retrieved items, we can ensure that we're only retrieving the right items. This is important because it ensures that we're providing the model with the relevant context that it needs to answer the question correctly.

### Specific Attributes

Now let's revisit the previous example where we wanted a skirt that's under $150 which is made of cotton. We can ensure we get the right items by applying filters on the material, price, and product type of the item itself.

In [16]:
import pandas as pd

items = [
    {
        "title": item["title"],
        "description": item["description"],
        "category": item["category"],
        "material": item["material"],
        "product_type": item["product_type"],
        "price": item["price"],
    }
    for item in table.search(query="I want a skirt that's under $150 which is made of Cotton")
    .limit(10)
    .where("material='Cotton' AND price < 150 AND product_type='Skirts'", prefilter=True)
    .to_list()
]

pd.DataFrame(items)

Unnamed: 0,title,description,category,material,product_type,price
0,White Eyelet Mini Skirt,"Featuring a delicate eyelet design, this white...",Bottoms,Cotton,Skirts,102.31
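
The filter string in the cell above is written by hand. As a sketch of how such a string could be assembled from structured constraints, the hypothetical `build_where` and `quote` helpers below (not part of LanceDB or the notebook) join the conditions with `AND`. The column names are assumed to match the `Item` schema defined earlier.

```python
def quote(value: str) -> str:
    """Wrap a string in single quotes, doubling embedded quotes (standard SQL escaping)."""
    return "'" + value.replace("'", "''") + "'"


def build_where(material=None, max_price=None, product_type=None, in_stock=None) -> str:
    """Build a SQL-style filter string from whichever constraints are provided."""
    clauses = []
    if material is not None:
        clauses.append(f"material = {quote(material)}")
    if max_price is not None:
        clauses.append(f"price < {max_price}")
    if product_type is not None:
        clauses.append(f"product_type = {quote(product_type)}")
    if in_stock is not None:
        clauses.append(f"in_stock = {in_stock}")
    return " AND ".join(clauses)


print(build_where(material="Cotton", max_price=150, product_type="Skirts"))
```

The returned string could then be passed to `.where(...)` in place of the literal. In a fuller system the keyword arguments would come from parsing the user's query, which is outside the scope of this sketch.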


### Availability Constraints

Now let's look at the last example, where we wanted a skirt that's in stock now. We can ensure we get the right items by applying a filter on the `in_stock` attribute of the item, together with its product type.

In [17]:
import pandas as pd

items = [
    {
        "title": item["title"],
        "description": item["description"],
        "category": item["category"],
        "material": item["material"],
        "product_type": item["product_type"],
        "price": item["price"],
    }
    for item in table.search(query="I want a skirt that's in stock now")
    .limit(10)
    .where("in_stock=True AND product_type='Skirts'", prefilter=True)
    .to_list()
]

pd.DataFrame(items)

Unnamed: 0,title,description,category,material,product_type,price
0,Green Plaid Mini Skirt,Add a pop of pattern to your outfit with this ...,Bottoms,Polyester,Skirts,191.17
1,Black Denim Skirt,A versatile black denim skirt with front pocke...,Bottoms,Denim,Skirts,289.67
2,High-Waisted Pencil Skirt,This elegant high-waisted pencil skirt is desi...,Bottoms,Polyester,Skirts,64.18
3,White Eyelet Mini Skirt,"Featuring a delicate eyelet design, this white...",Bottoms,Cotton,Skirts,102.31
4,Striped Midi Skirt,Add a touch of elegance to your ensemble with ...,Bottoms,Polyester,Skirts,349.57


Here too, the filter ensures that every retrieved item is an in-stock skirt, so the model only sees context that satisfies the constraint in the query.

Metadata filters like these are very common in RAG applications, where we might have chunks that have:

1. Different Permissions - only specific users have access to certain documents
2. Different Document Types - if a user asks about signed proposals, we only want to retrieve chunks from proposals that have been signed
3. Different Time Periods - if a user asks about a specific time period, we only want to retrieve chunks from that period. For instance, in a stock application where the user asks about stock prices between February 1st and February 10th, we can't generate a response based on stock prices from last year.

In all of these cases, we can ensure that we're providing the model with the right context that it needs to answer the question correctly by applying the right filters on the retrieved items.
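
The three cases above can be sketched as a single filter over chunk metadata. Everything in this example is hypothetical: the `Chunk` fields, the `allowed_chunks` helper, and the sample data are made up to illustrate the idea and do not come from the dataset used in this notebook.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    text: str
    allowed_users: frozenset[str]  # who may read the source document
    doc_type: str                  # e.g. "signed_proposal", "draft_proposal"
    published: date                # date the chunk's content refers to


def allowed_chunks(chunks, user, doc_type, start, end):
    """Keep chunks the user may read, of the requested type, dated within [start, end]."""
    return [
        chunk
        for chunk in chunks
        if user in chunk.allowed_users
        and chunk.doc_type == doc_type
        and start <= chunk.published <= end
    ]


chunks = [
    Chunk("Proposal A, signed", frozenset({"alice"}), "signed_proposal", date(2025, 2, 3)),
    Chunk("Proposal B, draft", frozenset({"alice"}), "draft_proposal", date(2025, 2, 4)),
    Chunk("Proposal C, signed", frozenset({"bob"}), "signed_proposal", date(2025, 2, 5)),
    Chunk("Proposal D, signed", frozenset({"alice"}), "signed_proposal", date(2024, 2, 5)),
]

visible = allowed_chunks(chunks, "alice", "signed_proposal", date(2025, 2, 1), date(2025, 2, 10))
print([chunk.text for chunk in visible])
```

Only the first chunk passes all three checks: the second has the wrong document type, the third belongs to another user, and the fourth falls outside the requested period.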



## RAG Assumptions

Now that we've seen how semantic search fails to retrieve the right items when a query carries implicit constraints, let's think about the assumptions we might be making when we build our RAG applications:


1. Cosine Similarity is enough to retrieve the right items
2. We have all the data we need to answer any user questions
3. We've got retrieval nailed down

Most people build RAG applications without thinking about these assumptions and they end up building systems that don't work as expected. Let's see how we can systematically improve our RAG applications by evaluating the faithfulness and quality of the responses.

In the next notebook, we'll see how we can start using synthetic data to evaluate the quality of our RAG applications by generating our first set of synthetic questions and then walking through simple metrics we can use to evaluate the quality of the responses.