# Notebook 3 : Metadata-Enhanced Retrieval

> **Note** : This notebook is a preview of what we cover in [improvingrag.com](https://improvingrag.com). Stop settling for "Looks Good to Me" and start building better systems. Learn how to turn RAG from a risky experiment into a structured, data-driven practice. Check out [improvingrag.com](https://improvingrag.com) for a proven foundational framework to help you go beyond the basics to improve performance, quality, and user experience. 

In this notebook, we'll explore how to incorporate LLM-generated metadata into retrieval systems to enhance precision and relevance. While semantic search helps find related content, structured metadata like product categories, types, and attributes provides crucial guardrails that align results with user intent. By combining both approaches, we can build more precise and reliable search experiences.

## Why this matters

Even the best retrieval systems can benefit from context. When users search for products, they often have specific requirements beyond semantic relevance - like price ranges, materials, or occasions. While semantic search helps find relevant items, structured metadata ensures results match specific criteria. Manual metadata tagging is expensive and inconsistent, but language models offer a fast, scalable way to generate this metadata automatically.

## What you'll learn

In this notebook, you will learn to:

1. Implement Metadata Filtering
   - Map user queries to structured metadata using a predefined taxonomy.
   - Validate and apply filters to refine the set of candidate items for retrieval.

2. Quantify Retrieval Improvements
   - Evaluate the impact of metadata filtering on performance using key metrics (Recall and MRR).
   - Compare baseline results with enhanced outcomes to understand measurable gains.

3. Enhance Item Descriptions with Metadata
   - Integrate rich metadata into item descriptions for better semantic matching.
   - Leverage enriched data to drive more accurate and relevant retrieval results.

By the end of this notebook, you'll have a clear, data-driven approach to elevating your retrieval system—transforming it from a simple search mechanism into a finely tuned, precision tool that consistently delivers improved performance.

## Using Metadata Fields for filtering

In this section, we'll see how we can leverage query understanding to map user queries into metadata filters that we can apply on our retrieved items. To make things easier, we'll be using the `instructor` library which provides structured outputs from LLM responses. With its in-built jinja support, we can use the same values for validating the generated filters and formatting our prompt itself. This makes it easy to iterate on our prompt and build in complex validation.

It's worth noting that metadata filtering is just one piece of building robust retrieval systems. While we focus on product search here, similar principles apply whether you're searching documentation, routing support tickets, or selecting tools - you need structured ways to narrow results based on user intent.


### Defining Our Metadata Taxonomy

In order for us to map our user queries to a set of known fields, we'll be using a taxonomy that we've defined at `taxonomy.yml`.  This taxonomy lists categories (like ‘Tops’ and ‘Bottoms’), product types, and key attributes (such as ‘Sleeve Length’). This structure helps us filter and refine search results to better match what the user is looking for.

For instance, a T-shirt with a crew neck might be classified as

- Category : Top
- Product Type : T-Shirt
- Attributes : ['crew neck', 'short sleeve']

By mapping a user query to these known fields, which our items have already been annotated with, we can filter the retrieved items on those fields. This narrows the scope of the search and improves the precision of our retrieval system.
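
To make the mapping concrete, here is a minimal sketch of how a taxonomy entry and an annotated item could be represented in plain Python. The nested layout (a category holding a `product_type` list and an `attributes` mapping) is an assumption inferred from how the taxonomy is used later in this notebook. The specific values and the `matches` helper are hypothetical.

```python
# Hypothetical, hand-written taxonomy fragment; the real one is loaded from taxonomy.yml.
taxonomy = {
    "Tops": {
        "product_type": ["T-Shirts", "Blouses", "Tank Tops"],
        "attributes": {
            "Sleeve Length": ["Short Sleeve", "Long Sleeve", "Sleeveless"],
            "Neckline": ["Crew Neck", "V-Neck"],
        },
    }
}

# A hypothetical item annotated with fields drawn from the taxonomy.
item = {
    "category": "Tops",
    "product_type": "T-Shirts",
    "attributes": {"Sleeve Length": "Short Sleeve", "Neckline": "Crew Neck"},
}


def matches(
    item: dict,
    category: str,
    product_types: list[str],
    attributes: dict[str, list[str]],
) -> bool:
    """Return True if the item satisfies every requested field."""
    if item["category"] != category:
        return False
    if product_types and item["product_type"] not in product_types:
        return False
    # Every requested attribute must have one of the allowed values on the item.
    return all(
        item["attributes"].get(name) in allowed
        for name, allowed in attributes.items()
    )


print(matches(item, "Tops", ["T-Shirts"], {"Sleeve Length": ["Short Sleeve"]}))  # True
print(matches(item, "Tops", ["T-Shirts"], {"Sleeve Length": ["Sleeveless"]}))  # False
```

Because every field comes from a closed vocabulary, a filter either matches an item or it does not, which is what makes the precision gains measurable later on.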

We'll start by loading in the taxonomy data and printing out the keys to see what we have available to us.

In [1]:
from helpers.taxonomy import process_taxonomy_file

taxonomy_map = process_taxonomy_file("./data/taxonomy.yml")
taxonomy_map.keys()

dict_keys(['Tops', 'Bottoms', 'Dresses', 'Outerwear'])

We'll then print out the `Tops` category to see what we have available to us.

In [3]:
from rich import print

print(taxonomy_map["Tops"])

Now that we've understood what our taxonomy looks like, let's take a look at how we can define this taxonomy as a response model.

### Defining Our Response Model

For each category (`Tops`, `Bottoms`, `Dresses`, and `Outerwear`), we have a set of metadata fields that we can use to filter our items.

Defining these in a `.yaml` file is a flexible way to let domain experts define the metadata fields. We can then read the fields and values in at run-time and use them to generate metadata filters for our retrieval system.

In the code below, we define a `QueryFilters` model that we'll use to generate metadata filters from a user query. A `model_validator` checks the extracted filters against the taxonomy, so the output always conforms to it.

Notice how the validator reads the taxonomy from the `info: ValidationInfo` context. This lets us pass the same taxonomy data to both the prompt and the validation step when we generate the filters.

In [4]:
from typing import Optional
from pydantic import BaseModel, model_validator, ValidationInfo


class Attribute(BaseModel):
    name: str
    values: list[str]


class QueryFilters(BaseModel):
    attributes: list[Attribute]
    min_price: Optional[float] = None
    max_price: Optional[float] = None
    category: str
    product_type: list[str]

    @model_validator(mode="after")
    def validate_attributes(self, info: ValidationInfo):
        taxonomy_data = info.context["taxonomy_data"]
        # Validate category exists in taxonomy
        if self.category not in taxonomy_data:
            raise ValueError(
                f"Invalid category: {self.category}. Valid categories are {taxonomy_data.keys()}"
            )

        # Validate product types
        valid_types = taxonomy_data[self.category]["product_type"]
        for product_type in self.product_type:
            if product_type not in valid_types:
                raise ValueError(
                    f"Invalid product type: {product_type}. Valid product types are {valid_types}"
                )

        # Validate attribute exists in taxonomy
        valid_attrs = taxonomy_data[self.category]["attributes"]
        for attr in self.attributes:
            if attr.name not in valid_attrs:
                raise ValueError(f"Invalid attribute name: {attr.name}")
            for value in attr.values:
                if value not in valid_attrs[attr.name]:
                    raise ValueError(
                        f"Invalid value {value} for attribute {attr.name}. Valid values are {valid_attrs[attr.name]}"
                    )

        return self
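
To see the validation context in isolation, the sketch below runs a trimmed-down model through pydantic directly, without any LLM call. `ToyFilters` and `toy_taxonomy` are hypothetical stand-ins for `QueryFilters` and the real taxonomy. The `(info.context or {})` guard is an addition for the case where no context is supplied.

```python
from pydantic import BaseModel, ValidationError, ValidationInfo, model_validator

# Hypothetical, trimmed-down taxonomy used only for this illustration.
toy_taxonomy = {"Tops": {"product_type": ["T-Shirts", "Tank Tops"]}}


class ToyFilters(BaseModel):
    category: str
    product_type: list[str]

    @model_validator(mode="after")
    def check_against_taxonomy(self, info: ValidationInfo):
        # `info.context` is whatever was passed as `context=` at validation time.
        taxonomy = (info.context or {}).get("taxonomy_data", {})
        if self.category not in taxonomy:
            raise ValueError(f"Invalid category: {self.category}")
        valid_types = taxonomy[self.category]["product_type"]
        for product_type in self.product_type:
            if product_type not in valid_types:
                raise ValueError(f"Invalid product type: {product_type}")
        return self


# A payload that respects the taxonomy validates cleanly.
ok = ToyFilters.model_validate(
    {"category": "Tops", "product_type": ["T-Shirts"]},
    context={"taxonomy_data": toy_taxonomy},
)
print(ok)

# A payload with an unknown product type is rejected.
try:
    ToyFilters.model_validate(
        {"category": "Tops", "product_type": ["Jeans"]},
        context={"taxonomy_data": toy_taxonomy},
    )
except ValidationError as exc:
    # This message is the kind of feedback a retry loop could send back to the model.
    print(exc.errors()[0]["msg"])
```

Keeping the taxonomy out of the model definition means the same class can be validated against different taxonomies just by changing the context.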

With our response model and taxonomy in place, we can now tackle the core challenge - translating natural language queries into structured filters. Let's see this in action in the next section.

### Query Understanding

Now that we've defined our response model, we can use it to generate a metadata filter for our user queries. 

In the example below, we use the `context` parameter to pass in the taxonomy data as a variable called `taxonomy_data`. We can then reference it in both our prompt and the validator above, since `instructor` ships with built-in support for jinja templating.


In [5]:
from openai import AsyncOpenAI
import instructor
from helpers.taxonomy import process_taxonomy_file
from rich import print

# Import in our taxonomy data
taxonomy_data = process_taxonomy_file("./data/taxonomy.yml")


async def extract_query_filters(
    client: instructor.AsyncInstructor, query: str
) -> QueryFilters:
    """
    Extract structured filters from a natural language query using LLM.

    Args:
        query (str): Natural language query from user

    Returns:
        QueryFilters: Structured filters extracted from the query
    """
    return await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": """
You are a helpful assistant that extracts user requirements from a query.

Use this following taxonomy as a reference for what fields are available to you. Only use values from the provided taxonomy.

<taxonomy>
{{ taxonomy_data }}
</taxonomy>
            """,
            },
            {"role": "user", "content": query},
        ],
        context={
            "taxonomy_data": taxonomy_data,
        },
        response_model=QueryFilters,
    )


client = instructor.from_openai(AsyncOpenAI())
resp = await extract_query_filters(client, "Need a new skirt around 40 bucks")
print(resp)

Let's see how this works with another query that uses specific attributes, in this case a short-sleeved top.

In [6]:
query = "I'm looking for a t-shirt suitable for summer that's got a short sleeve"
resp = await extract_query_filters(client, query)
print(resp)

Since these attributes are specific to the product type itself, we'll need to do a mix of post-filtering and pre-filtering to get the best results. As a result, even though we're evaluating the performance of our retrieval system up to a maximum value of 25, we'll fetch 100 items for each query and then filter them down.

This helps us cast a wide enough net to ensure that we're able to retrieve the most relevant items for the user's query. 

Let's now see how we can combine pre-filtering (using metadata fields) with post-filtering (refining attributes) to ensure that our retrieval system returns the most relevant items.

In [8]:
import json
from lancedb import connect
import pandas as pd


def retrieve_and_filter(query: str, table, filters: QueryFilters, max_k=100):
    query_parts = []

    # Pre-filter on category (always present) and on the price bounds when provided
    query_parts.append(f"category='{filters.category}'")

    # Compare against None so that a bound of 0 is not silently ignored
    if filters.min_price is not None:
        query_parts.append(f"price >= {filters.min_price}")
    if filters.max_price is not None:
        query_parts.append(f"price <= {filters.max_price}")

    query_string = " AND ".join(query_parts)
    items = (
        table.search(query=query)
        .where(query_string, prefilter=True)
        .limit(max_k)
        .to_list()
    )

    items = [
        {
            **item,
            "attributes": json.loads(item["attributes"]),
        }
        for item in items
    ]

    if filters.product_type:
        items = [item for item in items if item["product_type"] in filters.product_type]

    if filters.attributes:
        for attr in filters.attributes:
            if not attr.values:
                continue
            curr_items = []
            for item in items:
                attr_name = attr.name
                attr_values = attr.values
                item_attr_values = item["attributes"]
                for item_attr in item_attr_values:
                    if (
                        item_attr["name"] == attr_name
                        and item_attr["value"] in attr_values
                    ):
                        curr_items.append(item)
                        break

            items = curr_items

    return items
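
The two-stage flow above can be sketched on toy data without a database. Everything here is hypothetical: the candidates, the `prefilter`, `postfilter` and `search` helpers, and the `fetch_k` and `eval_k` parameters, which mirror the idea of fetching 100 items and evaluating at most 25.

```python
# Toy candidates, already ordered by a (hypothetical) semantic similarity ranking.
candidates = [
    {"id": 1, "category": "Tops", "price": 30.0, "product_type": "T-Shirts"},
    {"id": 2, "category": "Bottoms", "price": 25.0, "product_type": "Jeans"},
    {"id": 3, "category": "Tops", "price": 90.0, "product_type": "T-Shirts"},
    {"id": 4, "category": "Tops", "price": 45.0, "product_type": "Blouses"},
    {"id": 5, "category": "Tops", "price": 20.0, "product_type": "T-Shirts"},
]


def prefilter(items, category, max_price):
    """Stage 1: coarse constraints, simulated here in Python instead of the database."""
    return [i for i in items if i["category"] == category and i["price"] <= max_price]


def postfilter(items, product_types):
    """Stage 2: finer constraints applied to the over-fetched list."""
    return [i for i in items if i["product_type"] in product_types]


def search(items, category, max_price, product_types, fetch_k=100, eval_k=25):
    pool = prefilter(items, category, max_price)[:fetch_k]  # over-fetch
    kept = postfilter(pool, product_types)  # ranking order is preserved
    return kept[:eval_k]  # only the top eval_k are scored


top = search(candidates, "Tops", 70.0, ["T-Shirts"])
print([i["id"] for i in top])  # [1, 5]
```

Over-fetching matters because post-filtering can only remove items. If the first stage returned just 25 candidates, a strict attribute filter could leave very few results to evaluate.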



In [9]:
# Define test query and setup clients
test_query = "I want a t-shirt that is at most 70 bucks"
client = instructor.from_openai(AsyncOpenAI())

# Connect to database
db = connect("./lancedb")
table = db.open_table("items")

# Generate filters and retrieve results
generated_filter = await extract_query_filters(client, test_query)
results = retrieve_and_filter(test_query, table, generated_filter)

# Display results as dataframe
pd.DataFrame(results).loc[
    :,
    [
        "title",
        "description",
        "category",
        "product_type",
        "price",
        "attributes",
        "in_stock",
    ],
]


| | title | description | category | product_type | price | attributes | in_stock |
|---|---|---|---|---|---|---|---|
| 0 | Graphic Print T-Shirt | Embrace casual comfort with the classic Tommy ... | Tops | T-Shirts | 27.37 | [{'name': 'Sleeve Length', 'value': 'Short Sle... | True |
| 1 | Women's Short Sleeve Logo T-Shirt | This casual short sleeve t-shirt features the ... | Tops | T-Shirts | 38.77 | [{'name': 'Sleeve Length', 'value': 'Short Sle... | False |
| 2 | Classic Women's Crew Neck T-Shirt | This timeless crew neck t-shirt offers a relax... | Tops | T-Shirts | 14.6 | [{'name': 'Sleeve Length', 'value': 'Short Sle... | True |
| 3 | Classic White Crew Neck Tee | Elevate your everyday wardrobe with this class... | Tops | T-Shirts | 18.12 | [{'name': 'Sleeve Length', 'value': 'Short Sle... | True |
| 4 | Heart Graphic Maternity T-Shirt | Embrace your pregnancy with this chic maternit... | Tops | T-Shirts | 60.99 | [{'name': 'Sleeve Length', 'value': 'Short Sle... | False |
| 5 | Navy Short Sleeve T-Shirt | Experience the perfect blend of comfort and st... | Tops | T-Shirts | 63.09 | [{'name': 'Sleeve Length', 'value': 'Short Sle... | False |
| 6 | Ribbed Short Sleeve Top | Elevate your casual wardrobe with this stylish... | Tops | T-Shirts | 56.94 | [{'name': 'Sleeve Length', 'value': 'Short Sle... | True |
| 7 | Long Sleeve Scoop Neck Top | Upgrade your basics with this chic long sleeve... | Tops | T-Shirts | 15.64 | [{'name': 'Sleeve Length', 'value': 'Long Slee... | True |


Now that we've seen how we can integrate metadata filtering with our existing retrieval system, let's benchmark its performance against our baseline semantic search system.

## Evaluating Retrieval Performance

Now that we've seen how to integrate these metadata filters into our retrieval system, let's quantify their impact. We'll use the same two metrics as in the previous notebook, recall and MRR, and then look at how we can improve the query understanding step above.

We'll do so in three parts:

1. First, we'll compute baseline MRR and recall metrics for our current metadata filter prompt.
2. Then we'll find specific edge cases where the filters are not working as expected and iterate on our prompt to fix them.
3. Finally, we'll measure how much the improved metadata filtering raises retrieval performance.
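
As a reminder of what the two metrics measure, here is a minimal sketch using their common definitions. The implementation in `helpers.metrics` is not shown in this notebook, so treat `recall_at_k` and `mrr_at_k` below as assumptions about its behaviour rather than a copy of it.

```python
def recall_at_k(retrieved: list, expected: list, k: int) -> float:
    """Fraction of expected items that appear in the top-k retrieved items."""
    top_k = retrieved[:k]
    return sum(1 for item in expected if item in top_k) / len(expected)


def mrr_at_k(retrieved: list, expected: list, k: int) -> float:
    """Reciprocal rank of the first expected item within the top-k, else 0."""
    for rank, item in enumerate(retrieved[:k], start=1):
        if item in expected:
            return 1 / rank
    return 0.0


retrieved = [7, 3, 9, 1, 4, 8, 2, 6, 5]
expected = [5]  # the relevant item sits at rank 9

print(mrr_at_k(retrieved, expected, 5), recall_at_k(retrieved, expected, 5))  # 0.0 0.0
print(round(mrr_at_k(retrieved, expected, 10), 2), recall_at_k(retrieved, expected, 10))  # 0.11 1.0
```

Recall only asks whether the item was found within the first `k` results, while MRR also rewards finding it early. That is why the two metrics can move in different directions when filters are added.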

### Computing a Baseline

Let's start by computing the baseline performance of our current retrieval system. We'll use the same queries that we generated previously. We'll use a ThreadPoolExecutor to run our retrieval in parallel since we're using the synchronous LanceDB client.

We'll then identify low-performing queries and iterate on our prompt to improve the performance of our retrieval system.

In [10]:
import instructor
from tqdm.asyncio import tqdm_asyncio
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm
import json

# Load in queries that we generated previously
with open("./data/queries.json", "r") as f:
    queries = [json.loads(line) for line in f]


# Then run retrieval in parallel
def retrieve_item(args):
    query, generated_filter = args
    results = retrieve_and_filter(query["query"], table, generated_filter)
    return {
        "query": query["query"],
        "retrieved_items": results,
        "expected_items": [query["id"]],
        "filters": generated_filter,
    }


# First generate all filters
client = instructor.from_openai(AsyncOpenAI())
filters = await tqdm_asyncio.gather(
    *[
        extract_query_filters(client, query["query"])
        for query in tqdm(queries, desc="Generating filters")
    ]
)


with ThreadPoolExecutor() as executor:
    results = list(
        tqdm(
            executor.map(retrieve_item, zip(queries, filters)),
            total=len(queries),
            desc="Retrieving results",
        )
    )


Generating filters: 100%|██████████| 38/38 [00:00<00:00, 926648.56it/s]
100%|██████████| 38/38 [00:18<00:00,  2.06it/s]
Retrieving results: 100%|██████████| 38/38 [00:00<00:00, 46.58it/s]


With the results in hand, we can compute baseline metrics for our initial prompt using MRR and recall.

In [14]:
from helpers.metrics import get_metrics_at_k

retrieved_items = [
    [item["id"] for item in result["retrieved_items"]] for result in results
]
labels = [item["expected_items"] for item in results]

metrics = get_metrics_at_k(metrics=["mrr", "recall"], sizes=[5, 10, 15, 20, 25])
computed_metrics = [
    {
        metric: score_fn(retrieved_item_ids, desired_item)
        for metric, score_fn in metrics.items()
    }
    for retrieved_item_ids, desired_item in zip(retrieved_items, labels)
]
df = pd.DataFrame(computed_metrics)
df.mean().round(2)

mrr@5        0.62
mrr@10       0.63
mrr@15       0.63
mrr@20       0.63
mrr@25       0.63
recall@5     0.66
recall@10    0.71
recall@15    0.74
recall@20    0.74
recall@25    0.74
dtype: float64

These metrics show that metadata filters improve the MRR of our retrieval system by roughly 29-32%. However, recall drops by roughly 6-12% for k ≥ 10.

| Metric | Semantic Search | Semantic Search + Metadata Filters |
|--------|----------------|-----------------------------------|
| MRR@5 | 0.47 | 0.62 (+31.91%) |
| MRR@10 | 0.49 | 0.63 (+28.57%) |
| MRR@15 | 0.49 | 0.63 (+28.57%) |
| MRR@20 | 0.49 | 0.63 (+28.57%) |
| MRR@25 | 0.49 | 0.63 (+28.57%) |
| Recall@5 | 0.63 | 0.66 (+4.76%) |
| Recall@10 | 0.76 | 0.71 (-6.58%) |
| Recall@15 | 0.79 | 0.74 (-6.33%) |
| Recall@20 | 0.84 | 0.74 (-11.90%) |
| Recall@25 | 0.84 | 0.74 (-11.90%) |

Even though we're seeing gains of ~28% in MRR, we're also seeing a drop in recall at higher k values. This highlights a broader pattern in retrieval systems: improvements often come with trade-offs that need to be measured and weighed carefully. Rather than assuming more complexity always helps, we need clear metrics to guide our investments.
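
The percentages in the table are relative changes against the semantic search baseline. A small helper (hypothetical, not part of the notebook's `helpers` package) makes the arithmetic explicit:

```python
def relative_change(baseline: float, new: float) -> float:
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100


# Values taken from the comparison table.
print(f"MRR@5: {relative_change(0.47, 0.62):+.2f}%")  # MRR@5: +31.91%
print(f"Recall@20: {relative_change(0.84, 0.74):+.2f}%")  # Recall@20: -11.90%
```

Note that these are relative, not absolute, differences: a drop from 0.84 to 0.74 is 10 percentage points but an 11.90% relative decrease.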

In the next section, we'll identify some edge cases where the filters are not working as expected and then iterate on our prompt to improve the performance of our retrieval system.

### Iterating On Our Prompt

Since we saved the generated filters alongside each result, we can inspect them to find a query where the filters are not working as expected, and then iterate on our prompt to fix it.





In [15]:
df.head().round(2)

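| | mrr@5 | mrr@10 | mrr@15 | mrr@20 | mrr@25 | recall@5 | recall@10 | recall@15 | recall@20 | recall@25 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.11 | 0.11 | 0.11 | 0.11 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |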


We can see that the first and fifth queries have a recall of 0 at every value of k, meaning the expected item was never returned. Let's take a look at the first of these queries and the filters generated for it.

In [16]:
results[0]["query"], results[0]["filters"], results[0]["expected_items"]

('Searching for a trendy sleeveless top with lace details for casual outings and dinner dates. Needs to be cotton and comfortable, ideally under $200.',
 QueryFilters(attributes=[Attribute(name='Sleeve Length', values=['Sleeveless']), Attribute(name='Fit', values=['Regular', 'Slim', 'Oversized'])], min_price=None, max_price=200.0, category='Tops', product_type=['T-Shirts', 'Blouses', 'Tank Tops']),
 [1])

Let's see the metadata for this specific item.

In [17]:
row = table.to_pandas()
item = row[row["id"] == 1].to_dict("records")[0]
print(item)

We can see here that the generated filters include the right product type (`Tank Tops`), but they also add a `Fit` attribute that the user never asked for, which results in the expected item being filtered out. Let's try modifying our prompt so that it only adds attributes the user has explicitly mentioned.
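
This kind of check can be automated when there are many failing queries to look through. The sketch below reports which requested attributes an expected item fails to satisfy. The `failed_constraints` helper and the item's attribute values are hypothetical; the requested filters are the ones generated for the failing query above.

```python
def failed_constraints(
    item_attributes: list[dict], requested: dict[str, list[str]]
) -> list[str]:
    """Return the names of requested attributes that the item does not satisfy."""
    item_values = {attr["name"]: attr["value"] for attr in item_attributes}
    return [
        name
        for name, allowed in requested.items()
        if allowed and item_values.get(name) not in allowed
    ]


# Hypothetical item metadata; the real values come from the `items` table.
item_attributes = [
    {"name": "Sleeve Length", "value": "Sleeveless"},
    {"name": "Fit", "value": "Relaxed"},
]
# Attribute filters generated for the failing query.
requested = {
    "Sleeve Length": ["Sleeveless"],
    "Fit": ["Regular", "Slim", "Oversized"],
}

print(failed_constraints(item_attributes, requested))  # ['Fit']
```

Running this over every query with zero recall would show whether one over-eager attribute accounts for most of the misses, which helps decide what to change in the prompt.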

In [18]:
import instructor


async def improved_extract_query_filters(
    client: instructor.AsyncInstructor, query: str
) -> QueryFilters:
    return await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": """
You are a helpful assistant that extracts user requirements from a query.

Use this following taxonomy as a reference for what fields are available to you. Only use values from the provided taxonomy.

<taxonomy>
{{ taxonomy_data }}
</taxonomy>

Guidelines:

- If a specific filter isn't needed, just return an empty list or null value for that
- If the attribute exists on multiple types, make sure that you only look at the specific types listed under the category you have chosen
- Make sure that you've chosen from the right attribute values for each attribute type. This is very important.

Here are some general rules about how to generate these filters
1. Dresses and Skirts should always go together
2. Potential Filters should only be added if the user has explicitly mentioned it. When selecting filter values, aim to make them more flexible. For instance, if the user is asking for a well fitting top, we can consider both regular and relaxed fit. If another attribute value might be a good match for the user's query, include it too. 
3. If the user hasn't mentioned the attribute for the product type in his query, don't include that attribute in the filter. For instance if the user only mentions that they want something that's comfortable, don't include a filter for the fit of the product.
4. If the user mentions a rough range ( eg. around 50 bucks), let's just use a buffer of 30 bucks on each side ( Eg. 20-80)
5. If the user mentions a vague price (Eg. I have a high budget), just set max price to 1000
6. Make sure to look carefully at the user's query to determine if they've specified a specific fit - eg. regular, relaxed, cropped. ( Relaxed and Relaxed should always go together)

""",
            },
            {"role": "user", "content": query},
        ],
        context={
            "taxonomy_data": taxonomy_data,
        },
        response_model=QueryFilters,
    )


In [24]:
from rich import print

query = "Searching for a trendy sleeveless top with lace details for casual outings and dinner dates. Needs to be cotton and comfortable, ideally under $200."
print(await improved_extract_query_filters(client, query))

We can see that with this new prompt, we're able to generate a more accurate filter. More specifically, we're no longer applying the `fit` filter without the user explicitly specifying it. 

Let's now take this new and improved prompt and see how it performs on our dataset.

In [25]:
import instructor
from tqdm.asyncio import tqdm_asyncio
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm
import json

# Load in queries that we generated previously
with open("./data/queries.json", "r") as f:
    queries = [json.loads(line) for line in f]


# Then run retrieval in parallel
def retrieve_item(args):
    query, generated_filter = args
    results = retrieve_and_filter(query["query"], table, generated_filter)
    return {
        "query": query["query"],
        "retrieved_items": results,
        "expected_items": [query["id"]],
        "filters": generated_filter,
    }


# First generate all filters
client = instructor.from_openai(AsyncOpenAI())
filters = await tqdm_asyncio.gather(
    *[
        improved_extract_query_filters(client, query["query"])
        for query in tqdm(queries, desc="Generating filters")
    ]
)


with ThreadPoolExecutor() as executor:
    results = list(
        tqdm(
            executor.map(retrieve_item, zip(queries, filters)),
            total=len(queries),
            desc="Retrieving results",
        )
    )


Generating filters: 100%|██████████| 38/38 [00:00<00:00, 777480.74it/s]
100%|██████████| 38/38 [00:06<00:00,  5.78it/s]
Retrieving results: 100%|██████████| 38/38 [00:05<00:00,  6.76it/s]


In [21]:
from helpers.metrics import get_metrics_at_k

retrieved_items = [
    [item["id"] for item in result["retrieved_items"]] for result in results
]
labels = [item["expected_items"] for item in results]

metrics = get_metrics_at_k(metrics=["mrr", "recall"], sizes=[5, 10, 15, 20, 25])
computed_metrics = [
    {
        metric: score_fn(retrieved_item_ids, desired_item)
        for metric, score_fn in metrics.items()
    }
    for retrieved_item_ids, desired_item in zip(retrieved_items, labels)
]
df = pd.DataFrame(computed_metrics)
df.mean()

mrr@5        0.728070
mrr@10       0.739140
mrr@15       0.739140
mrr@20       0.739140
mrr@25       0.739140
recall@5     0.789474
recall@10    0.868421
recall@15    0.868421
recall@20    0.868421
recall@25    0.868421
dtype: float64

With this new and improved prompt, MRR@5 rises from 0.62 to 0.73 and recall@25 rises from 0.74 to 0.87, so the filtered system now beats the semantic search baseline on both metrics at every k.

| Metric     | Semantic Search | Semantic Search + Metadata Filters | Semantic Search + Improved Metadata Filters |
|------------|----------------|----------------------|-------------------------------|
| mrr@5      | 0.47          | 0.62 (+31.91%)      | 0.73 (+55.32%)              |
| mrr@10     | 0.49          | 0.63 (+28.57%)      | 0.74 (+51.02%)              |
| mrr@15     | 0.49          | 0.63 (+28.57%)      | 0.74 (+51.02%)              |
| mrr@20     | 0.49          | 0.63 (+28.57%)      | 0.74 (+51.02%)              |
| mrr@25     | 0.49          | 0.63 (+28.57%)      | 0.74 (+51.02%)              |
| recall@5   | 0.63          | 0.66 (+4.76%)       | 0.79 (+25.40%)              |
| recall@10  | 0.76          | 0.71 (-6.58%)       | 0.87 (+14.47%)              |
| recall@15  | 0.79          | 0.74 (-6.33%)       | 0.87 (+10.13%)              |
| recall@20  | 0.84          | 0.74 (-11.90%)      | 0.87 (+3.57%)               |
| recall@25  | 0.84          | 0.74 (-11.90%)      | 0.87 (+3.57%)               |

By having clear, objective metrics for our retrieval system, we were able to identify specific issues with our retrieval pipeline and iterate on our prompt to tackle them. We could then quantify the improvement from each individual change in terms of recall and MRR.

This is a huge step forward, and really highlights the importance of having a system in place to be able to understand the impact of each component of our system. As our pipelines grow in complexity, we need to know clearly the latency and performance impact of each component. 

In our case, if the latency cost of generating metadata filters were too high, we could fall back to semantic search alone and compensate with a larger `k`. For instance, semantic search at k=25 reaches a recall of 0.84, which is comparable to the 0.79 to 0.87 that the improved metadata filters reach at k=5 and k=10, although its MRR stays lower.
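
One lightweight way to track the latency side of this trade-off is to time each stage separately. The sketch below is an illustration only: `timed` and `timings` are hypothetical names, and the `time.sleep` calls stand in for the real filter extraction and retrieval calls.

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}


@contextmanager
def timed(stage: str):
    """Accumulate wall-clock seconds spent in a named pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start


# Stand-ins for the real stages (filter extraction and retrieval).
with timed("extract_filters"):
    time.sleep(0.02)
with timed("retrieve"):
    time.sleep(0.01)

for stage, seconds in timings.items():
    print(f"{stage}: {seconds * 1000:.1f} ms")
```

Recording these numbers next to recall and MRR makes it possible to judge whether a gain in quality is worth the extra time per query.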

## Using LLMs to generate Metadata

In many real-world cases, beyond basic fields like date or author, there may not be detailed tags or summaries available. This is where large language models (LLMs) can step in to generate metadata and enrich your document representations. This is particularly useful when expanding our taxonomy to new fields, such as occasions, without extensive manual labeling.

While human annotators can be used to generate metadata for our items, they are often expensive and time-consuming to train. Additionally, ensuring consistency across annotators can be a challenge. We can combine the best of both worlds by using LLMs to generate metadata for our items initially, and then using human annotators to validate the labels. Over time, as we gain more confidence in the LLM's ability to generate accurate metadata, we can use it to generate metadata for a larger portion of our items.
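
The review loop described above could be sketched as follows: sample a fraction of the LLM-labelled items for human review, then track how often reviewers accept the labels unchanged. The helper names, the labels, and the exact-match agreement criterion are all assumptions made for illustration.

```python
import random


def sample_for_review(item_ids: list[int], fraction: float, seed: int = 0) -> list[int]:
    """Pick a reproducible random subset of items for human validation."""
    rng = random.Random(seed)
    n = max(1, round(len(item_ids) * fraction))
    return sorted(rng.sample(item_ids, n))


def agreement_rate(
    llm_labels: dict[int, set[str]], human_labels: dict[int, set[str]]
) -> float:
    """Share of reviewed items where the human kept the LLM's labels unchanged."""
    matches = sum(1 for i, labels in human_labels.items() if llm_labels[i] == labels)
    return matches / len(human_labels)


# Hypothetical labels for four items.
llm_labels = {1: {"Work"}, 2: {"Party"}, 3: {"Brunch", "Travel"}, 4: {"Work"}}
human_labels = {1: {"Work"}, 3: {"Brunch"}}  # reviewers checked items 1 and 3

print(agreement_rate(llm_labels, human_labels))  # 0.5
print(len(sample_for_review(list(llm_labels), fraction=0.5)))  # 2
```

As the agreement rate rises, the review fraction can be lowered, which is one way to shift more of the labelling work onto the model.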

### Defining Our Taxonomy

By using a visual language model like GPT-4o, we can automatically generate metadata, such as the occasions an outfit is suitable for. This extra layer of information can improve search relevance and help users find items that match their needs, even when manual tagging is limited.

We'll start with a limited set of occasions, such as Weddings, Date Night and Work, but this could be scaled out to a much larger set of metadata fields.

In [26]:
occasions = [
    "Weddings",
    "Casual Outing",
    "Date Night",
    "Work",
    "Party",
    "Brunch",
    "Travel",
    "Interview",
    "Cocktail Event",
]

### Vision Language Models

Let's first take a look at the two outfits that we're looking to add to our collection. Notice how each image corresponds to a different outfit style. Since a Vision Language Model is able to understand the visual content of the image, it's able to generate a label for the occasion(s) that each of these outfits are suitable for.


In [32]:
from IPython.display import Image, display
from IPython.display import HTML

# Display the two images side by side using HTML
display(
    HTML("""
    <div style="display: flex; justify-content: center; gap: 100px;">
        <div>
            <p>Image 1:</p>
            <img src="assets/img1.jpg" width="300" height="400"/>
        </div>
        <div>
            <p>Image 2:</p>
            <img src="assets/img2.jpg" width="300" height="400"/>
        </div>
    </div>
""")
)


We can see that the two outfits above are suitable for two very different occasions. 

One is definitely much more casual than the other. Let's see how we can use a visual language model like GPT-4o to help us generate metadata for our items.

In [37]:
from pydantic import BaseModel
from typing import Literal


class Occasions(BaseModel):
    occasions: list[
        Literal[
            "Weddings",
            "Casual Outing",
            "Date Night",
            "Work",
            "Party",
            "Brunch",
            "Travel",
            "Interview",
            "Cocktail Event",
        ]
    ]

In [38]:
from openai import OpenAI

client = instructor.from_openai(OpenAI())

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that generates metadata for items. Identify what occasions the following outfit could be worn for",
        },
        {
            "role": "user",
            "content": [
                instructor.Image.from_path("assets/img1.jpg"),
            ],
        },
    ],
    response_model=Occasions,
)

print(resp)

In [39]:
from openai import OpenAI

client = instructor.from_openai(OpenAI())

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that generates metadata for items. Identify what occasions the following outfit could be worn for",
        },
        {
            "role": "user",
            "content": [
                instructor.Image.from_path("assets/img2.jpg"),
            ],
        },
    ],
    response_model=Occasions,
)

print(resp)

This approach not only scales the metadata annotation process but also opens the door to richer embeddings and more nuanced search filters.
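
As one illustration of the "richer embeddings" idea, generated metadata can be appended to an item's description before it is embedded. The sketch below is an assumption about how that could look: `enrich_description`, the item fields and the occasion labels are hypothetical, and the embedding step itself is left out.

```python
def enrich_description(item: dict, occasions: list[str]) -> str:
    """Append structured and generated metadata to the text that will be embedded."""
    parts = [item["description"].rstrip(".") + "."]
    parts.append(f"Category: {item['category']}.")
    parts.append(f"Product type: {item['product_type']}.")
    if occasions:
        parts.append(f"Suitable for: {', '.join(occasions)}.")
    return " ".join(parts)


# Hypothetical item and hypothetical model output.
item = {
    "description": "A relaxed cotton tee with a crew neck",
    "category": "Tops",
    "product_type": "T-Shirts",
}
print(enrich_description(item, ["Casual Outing", "Brunch"]))
```

With the occasions written into the text, a query such as "something for brunch" has a chance of matching semantically even when no explicit filter is extracted.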

## Conclusion

In our first notebook, we talked briefly about why a systematic approach to building RAG systems is important and covered some of the limitations of naive semantic search. In the second notebook, we looked at key metrics for evaluating and improving a retrieval system. In this notebook, we built on that by using LLM-generated metadata to enhance retrieval quality, aligning results with implicit user filters and quantifying the improvement with the same metrics against our semantic search baseline.

This underscores the importance of a systematic approach to building out RAG systems. It's not enough to just throw a bunch of data at an embedding model and hope for the best. Instead, we need to be able to make decisions about what features to invest in based on objective metrics and data. 

Rather than relying on intuition, we need systematic ways to measure impact and guide investments. This notebook showed one piece of the puzzle - check out [improvingrag.com](https://improvingrag.com) for a comprehensive guide to building better RAG systems. We'll cover everything from smart query routing to UX design for feedback collection to leveraging contrastive learning for improved retrieval.

Stop settling for "Looks Good to Me" and start building better systems today.