# Notebook 3 : Metadata-Enhanced Retrieval

> **Note** : This notebook is a preview of [Systematically Improving Your RAG Application](https://maven.com/applied-llms/rag-playbook) which shows you how to turn RAG from a risky experiment into a structured, data-driven practice. You'll learn how to pinpoint what's working, diagnose what's not, and steadily raise the bar on performance and user satisfaction.
> 
> For a free preview of the course material, please check out [improvingrag.com](https://improvingrag.com).

In the previous notebook, we established a performance baseline for our retrieval system using synthetic queries and key metrics like Recall and Mean Reciprocal Rank (MRR). In this notebook, we build on that foundation by incorporating metadata filtering to further enhance retrieval precision and relevance.


## Why this matters

Even a robust retrieval system can benefit from additional context. By leveraging structured metadata (such as product categories, types, and attributes), we can:

1. Improve Relevance: Narrow down search results to closely match the user's intent.
2. Enhance Precision: Filter out extraneous items, ensuring that only the most pertinent information is returned.
3. Quantify Impact: Use measurable metrics to validate improvements and guide further refinements.

## What you'll learn

In this notebook, you will learn to:

1. Implement Metadata Filtering
   - Map user queries to structured metadata using a predefined taxonomy.
   - Validate and apply filters to refine the set of candidate items for retrieval.

2. Quantify Retrieval Improvements
   - Evaluate the impact of metadata filtering on performance using key metrics (Recall and MRR).
   - Compare baseline results with enhanced outcomes to understand measurable gains.

3. Enhance Item Descriptions with Metadata
   - Integrate rich metadata into item descriptions for better semantic matching.
   - Leverage enriched data to drive more accurate and relevant retrieval results.

By the end of this notebook, you'll have a clear, data-driven approach to improving your retrieval system with metadata, along with the metrics to show how much each change helps.

## Using Metadata Fields for filtering

To use metadata fields for filtering, we need to map a user query to the set of metadata fields that our database items have been annotated with.

This set of fields is often known as a taxonomy. We've defined a `taxonomy.yml` ahead of time that lists the metadata fields we used to annotate the dataset we previously ingested.

To make things easier, we'll use the `instructor` library, which provides structured outputs from LLM responses. Its built-in Jinja support lets us use the same values both to format the prompt and to validate the generated filters.

Let's see what this taxonomy looks like:

In [2]:
from helpers.taxonomy import process_taxonomy_file

taxonomy_map = process_taxonomy_file("./taxonomy.yml")

In [3]:
taxonomy_map.keys()

dict_keys(['Tops', 'Bottoms', 'Dresses', 'Outerwear'])

In [4]:
from rich import print

print(taxonomy_map["Tops"])
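The printed output of this cell is not reproduced here. Judging from how the validator further down indexes the data (`taxonomy_data[category]["product_type"]` and `taxonomy_data[category]["attributes"][name]`), each category appears to map to a list of product types and a dictionary of attribute names to allowed values. A minimal sketch of that shape follows; the values are hypothetical and are not taken from the real `taxonomy.yml`:

```python
# Hypothetical stand-in for one entry of taxonomy_map. The real values live in
# taxonomy.yml and may differ; only the nesting is inferred from the notebook.
example_taxonomy = {
    "Tops": {
        "product_type": ["T-Shirts", "Blouses", "Tank Tops"],
        "attributes": {
            "Sleeve Length": ["Sleeveless", "Short Sleeve", "Long Sleeve"],
            "Fit": ["Regular", "Relaxed", "Cropped"],
        },
    },
}

# Walk the structure the same way the validator does: category first,
# then product types, then each attribute and its allowed values.
for category, spec in example_taxonomy.items():
    print(category, "->", spec["product_type"])
    for name, allowed in spec["attributes"].items():
        print(f"  {name}: {allowed}")
```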

### Defining Our Response Model

We can see that each category (`Tops`, `Bottoms`, `Dresses` and `Outerwear`) has a set of metadata fields that we can use to filter our items.

Defining these in a `.yaml` file is a flexible way to let domain experts define the metadata fields. We can then read these fields and values in at run-time and use them to generate metadata filters that we apply to our retrieval system.

In the code below, we've defined a `QueryFilters` model that we'll use to generate metadata filters from a user query. A `model_validator` checks the extracted filters so that they conform to the taxonomy we've defined.

Notice how the validator reads the taxonomy from the `info: ValidationInfo` context. `instructor` passes the same `context` to the validator and to the Jinja prompt template, so the model is shown exactly the values its output is checked against.

In [20]:
from typing import Optional
from pydantic import BaseModel, model_validator, ValidationInfo


class Attribute(BaseModel):
    name: str
    values: list[str]


class QueryFilters(BaseModel):
    attributes: list[Attribute]
    min_price: Optional[float] = None
    max_price: Optional[float] = None
    category: str
    product_type: list[str]

    @model_validator(mode="after")
    def validate_attributes(self, info: ValidationInfo):
        taxonomy_data = info.context["taxonomy_data"]
        # Validate category exists in taxonomy
        if self.category not in taxonomy_data:
            raise ValueError(
                f"Invalid category: {self.category}. Valid categories are {taxonomy_data.keys()}"
            )

        # Validate product types
        valid_types = taxonomy_data[self.category]["product_type"]
        for product_type in self.product_type:
            if product_type not in valid_types:
                raise ValueError(
                    f"Invalid product type: {product_type}. Valid product types are {valid_types}"
                )

        # Validate attribute exists in taxonomy
        valid_attrs = taxonomy_data[self.category]["attributes"]
        for attr in self.attributes:
            if attr.name not in valid_attrs:
                raise ValueError(f"Invalid attribute name: {attr.name}")
            for value in attr.values:
                if value not in valid_attrs[attr.name]:
                    raise ValueError(
                        f"Invalid value {value} for attribute {attr.name}. Valid values are {valid_attrs[attr.name]}"
                    )

        return self
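Because the validator only depends on the validation context, it can be exercised without calling an LLM. The sketch below uses a cut-down model and Pydantic's `model_validate(..., context=...)`; `MiniFilters` and the tiny `TAXONOMY` dictionary are hypothetical and exist only for this illustration:

```python
from pydantic import BaseModel, ValidationError, ValidationInfo, model_validator

# Hypothetical miniature taxonomy, only for this illustration.
TAXONOMY = {"Tops": {"product_type": ["T-Shirts", "Tank Tops"]}}


class MiniFilters(BaseModel):
    category: str
    product_type: list[str]

    @model_validator(mode="after")
    def check_against_taxonomy(self, info: ValidationInfo):
        # info.context is whatever was passed as `context=` at validation time.
        taxonomy = info.context["taxonomy_data"]
        if self.category not in taxonomy:
            raise ValueError(f"Invalid category: {self.category}")
        valid_types = taxonomy[self.category]["product_type"]
        for product_type in self.product_type:
            if product_type not in valid_types:
                raise ValueError(f"Invalid product type: {product_type}")
        return self


context = {"taxonomy_data": TAXONOMY}

# A value that exists in the taxonomy passes validation.
ok = MiniFilters.model_validate(
    {"category": "Tops", "product_type": ["Tank Tops"]}, context=context
)

# A value outside the taxonomy is rejected with a ValidationError.
try:
    MiniFilters.model_validate(
        {"category": "Tops", "product_type": ["Jeans"]}, context=context
    )
    rejected = False
except ValidationError as exc:
    rejected = True
    print(exc.errors()[0]["msg"])
```

Testing the validator this way makes it easier to tell taxonomy problems apart from prompt problems.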

### Query Understanding

Now that we've defined our response model, we can use it to generate a metadata filter for our user queries. We'll see an example below.


In [60]:
from openai import OpenAI
import instructor
from helpers.taxonomy import process_taxonomy_file
from rich import print

# Wrap the OpenAI client in the relevant instructor method
client = instructor.from_openai(OpenAI())

# Import in our taxonomy data
taxonomy_data = process_taxonomy_file("taxonomy.yml")


query = "I want a Tank-Top that's got a short sleeve or sleeveless which is under 100 bucks for an interview"


resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": """
          You are a helpful assistant that extracts user requirements from a query.
                    
Use these references:
- Taxonomy: {{ taxonomy_data }}

Guidelines:
- If a filter isn't needed, return an empty list
- Only add attributes and filters that a user has mentioned explicitly
- Only use values from the provided taxonomy. 
- If the attribute exists on multiple types, make sure that you only look at the specific types listed under the category you have chosen
- If the user hasn't mentioned a specific product type, let's just use all of them
- If the user gives a range (Eg. around 50), just give a buffer of 20 on each side (Eg. 30-70)
- If the user gives a vague price (Eg. I have a high budget), just set max price to 1000
- Only classify an item as unisex if the user has explicitly mentioned it; otherwise default to Women's categories.
- If you're looking at blouses, make sure to include tank tops along the way and vice versa
- If the user mentions bottoms and doesn't specify a specific length, let's include both short and long bottoms such as jeans, shorts and pants
- Make sure to look carefully at the user's query to determine if they've specified a specific fit - eg. regular, relaxed, cropped. (Regular and Relaxed should always go together)


Extract the requirements and format them according to the QueryFilters model.
            """,
        },
        {"role": "user", "content": query},
    ],
    context={
        "taxonomy_data": taxonomy_data,
    },
    response_model=QueryFilters,
)

print(resp)

We can see that, given the user query, the LLM has generated a set of metadata filters that conform to our taxonomy.
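The price guideline in the prompt above ("around 50" becomes 30 to 70) is plain arithmetic, so it can be spot-checked outside the model. A minimal sketch, where `price_window` is a hypothetical helper that is not part of the notebook's code, and clamping the minimum at zero is an added assumption:

```python
def price_window(target: float, buffer: float = 20.0) -> tuple[float, float]:
    """Return (min_price, max_price) for a rough price such as 'around 50'."""
    # Clamp at zero so a small target never produces a negative minimum.
    return max(0.0, target - buffer), target + buffer


print(price_window(50))  # (30.0, 70.0)
print(price_window(10))  # (0.0, 30.0)
```

Comparing this window with the `min_price` and `max_price` the model returns for a few test queries is a cheap way to see whether the guideline is being followed.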

In [34]:
import instructor
from openai import AsyncOpenAI


async def extract_query_filters(
    client: instructor.AsyncInstructor, query: str
) -> QueryFilters:
    """
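    Extract structured filters from a natural language query using an LLM.

    Relies on the module-level `taxonomy_data` for prompt rendering and validation.

    Args:
        client (instructor.AsyncInstructor): Async instructor-wrapped client
        query (str): Natural language query from user

    Returns:
        QueryFilters: Structured filters extracted from the query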
    """
    return await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that extracts user requirements from a query. Refer to this taxonomy for valid categories, subcategories, product types and attributes: {{ taxonomy_data }}. If a filter isn't needed, just return an empty list. If the user is looking for a specific attribute, just return the attribute name and the values that the user is looking for",
            },
            {"role": "user", "content": query},
        ],
        context={
            "taxonomy_data": taxonomy_data,
        },
        response_model=QueryFilters,
    )


client = instructor.from_openai(AsyncOpenAI())
print(await extract_query_filters(client, "Need a new skirt around 40 bucks"))

Let's now see how we can use this metadata filtering with LanceDB.

In [120]:
import json
from lancedb import connect
import pandas as pd


def retrieve_and_filter(query: str, table, filters: QueryFilters, max_k=75):
    query_parts = []

    # Prefilter on category (always provided) and on price when a bound is given
    query_parts.append(f"category='{filters.category}'")

    # Compare against None so that a bound of 0 is not silently dropped
    if filters.min_price is not None:
        query_parts.append(f"price >= {filters.min_price}")
    if filters.max_price is not None:
        query_parts.append(f"price <= {filters.max_price}")

    query_string = " AND ".join(query_parts)
    items = (
        table.search(query=query)
        .where(query_string, prefilter=True)
        .limit(max_k)
        .to_list()
    )

    items = [
        {
            **item,
            "attributes": json.loads(item["attributes"]),
        }
        for item in items
    ]

    if filters.product_type:
        items = [item for item in items if item["product_type"] in filters.product_type]

    if filters.attributes:
        for attr in filters.attributes:
            if not attr.values:
                continue
            curr_items = []
            for item in items:
                attr_name = attr.name
                attr_values = attr.values
                item_attr_values = item["attributes"]
                for item_attr in item_attr_values:
                    if (
                        item_attr["name"] == attr_name
                        and item_attr["value"] in attr_values
                    ):
                        curr_items.append(item)
                        break

            items = curr_items

    return items
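To make the two filtering stages concrete without a database, here is a small in-memory sketch. It mirrors the logic above: a SQL-style prefilter string, then a post-filter where every requested attribute must match and any one of its accepted values is enough. `build_where`, `matches`, and the sample items are hypothetical and only illustrate the idea:

```python
from typing import Optional


def build_where(
    category: str, min_price: Optional[float], max_price: Optional[float]
) -> str:
    """Build the SQL-style prefilter string from the structured filters."""
    parts = [f"category='{category}'"]
    if min_price is not None:
        parts.append(f"price >= {min_price}")
    if max_price is not None:
        parts.append(f"price <= {max_price}")
    return " AND ".join(parts)


def matches(item: dict, wanted: dict[str, list[str]]) -> bool:
    """True if the item has an accepted value for every requested attribute."""
    for name, accepted in wanted.items():
        if not accepted:
            continue  # an empty value list means "no constraint"
        if not any(
            attr["name"] == name and attr["value"] in accepted
            for attr in item["attributes"]
        ):
            return False
    return True


# Invented items, shaped like the rows after json.loads on "attributes".
items = [
    {"id": 1, "attributes": [{"name": "Fit", "value": "Regular"},
                             {"name": "Length", "value": "Mini"}]},
    {"id": 2, "attributes": [{"name": "Fit", "value": "Cropped"},
                             {"name": "Length", "value": "Mini"}]},
]

where = build_where("Bottoms", None, 300.0)
wanted = {"Fit": ["Regular", "Relaxed"], "Length": ["Mini"]}
kept = [item["id"] for item in items if matches(item, wanted)]

print(where)  # category='Bottoms' AND price <= 300.0
print(kept)   # [1]
```

Each extra attribute can only shrink the result set, so an over-eager filter can remove the item the user actually wanted. That is worth keeping in mind when reading the metrics later on.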

In [None]:
test_query = "I want a skirt that is at most 300 bucks"
client = instructor.from_openai(AsyncOpenAI())
db = connect("./lancedb")
table = db.open_table("items")
generated_filter = await extract_query_filters(client, test_query)
results = retrieve_and_filter(test_query, table, generated_filter)
pd.DataFrame(results).loc[
    :,
    [
        "title",
        "description",
        "category",
        "product_type",
        "price",
        "attributes",
        "in_stock",
    ],
]


| | title | description | category | product_type | price | attributes | in_stock |
|---|---|---|---|---|---|---|---|
| 0 | High-Waisted Pencil Skirt | This elegant high-waisted pencil skirt is desi... | Bottoms | Skirts | 64.18 | [{'name': 'Rise', 'value': 'High Rise'}, {'nam... | True |
| 1 | White Eyelet Mini Skirt | Featuring a delicate eyelet design, this white... | Bottoms | Skirts | 102.31 | [{'name': 'Length', 'value': 'Mini'}, {'name':... | True |
| 2 | Plaid Pencil Skirt | This plaid pencil skirt is a versatile additio... | Bottoms | Skirts | 275.42 | [{'name': 'Fit', 'value': 'Straight'}, {'name'... | False |
| 3 | Black Denim Skirt | A versatile black denim skirt with front pocke... | Bottoms | Skirts | 289.67 | [{'name': 'Rise', 'value': 'Mid Rise'}, {'name... | True |
| 4 | Green Plaid Mini Skirt | Add a pop of pattern to your outfit with this ... | Bottoms | Skirts | 191.17 | [{'name': 'Rise', 'value': 'High Rise'}, {'nam... | True |


We can see that the filter is working as expected. Let's now apply this new filtering to our retrieval system and see how it performs.

In [72]:
import json

# Load in queries that we generated previously
with open("queries.json", "r") as f:
    queries = [json.loads(line) for line in f]


In [73]:
import instructor
from lancedb.table import Table
from tqdm.asyncio import tqdm_asyncio


async def retrieve_item(client: instructor.AsyncInstructor, item: dict, table: Table):
    generated_filter = await extract_query_filters(client, item["query"])
    results = retrieve_and_filter(item["query"], table, generated_filter)
    return {
        "query": item["query"],
        "retrieved_items": results,
        "expected_items": [item["id"]],
        "filters": generated_filter,
    }


client = instructor.from_openai(AsyncOpenAI())
coros = [retrieve_item(client, query, table) for query in queries]
results = await tqdm_asyncio.gather(*coros)


100%|██████████| 38/38 [01:48<00:00,  2.86s/it]


We'll now use these results to evaluate the performance of our retrieval system.

In [67]:
retrieved_items = [
    [item["id"] for item in result["retrieved_items"]] for result in results
]
labels = [item["expected_items"] for item in results]

In [74]:
from helpers.metrics import get_metrics_at_k

metrics = get_metrics_at_k(metrics=["mrr", "recall"], sizes=[5, 10, 25])
computed_metrics = [
    {
        metric: score_fn(retrieved_item_ids, desired_item)
        for metric, score_fn in metrics.items()
    }
    for retrieved_item_ids, desired_item in zip(retrieved_items, labels)
]
df = pd.DataFrame(computed_metrics)

In [75]:
df.mean()

mrr@5        0.500000
mrr@10       0.527778
mrr@25       0.527778
recall@5     0.500000
recall@10    0.750000
recall@25    0.750000
dtype: float64
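`helpers.metrics` is not shown in this notebook, so the following is only a sketch of the standard definitions of Recall@k and reciprocal rank at k, not the helper's actual code. The function names and sample ids are hypothetical:

```python
def recall_at_k(retrieved: list[str], expected: list[str], k: int) -> float:
    """Fraction of expected ids that appear in the top-k retrieved ids."""
    top_k = set(retrieved[:k])
    return sum(1 for item_id in expected if item_id in top_k) / len(expected)


def mrr_at_k(retrieved: list[str], expected: list[str], k: int) -> float:
    """Reciprocal rank of the first expected id within the top k, else 0."""
    for rank, item_id in enumerate(retrieved[:k], start=1):
        if item_id in expected:
            return 1.0 / rank
    return 0.0


retrieved = ["b", "c", "a", "d"]
expected = ["a"]

print(recall_at_k(retrieved, expected, k=2))  # 0.0
print(recall_at_k(retrieved, expected, k=5))  # 1.0
print(mrr_at_k(retrieved, expected, k=5))     # 0.3333333333333333
```

With a single expected item per query, as in this notebook, Recall@k for one query is either 0 or 1, and the averaged value is the share of queries whose target item made it into the top k. If a filter removes the target item entirely, raising k cannot recover it, which is one possible reason the scores flatten out at larger k.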

### Braintrust

When working with larger evaluation datasets, it's important to be able to visualise the results easily. We'll switch to Braintrust for this so that we can iterate on the prompt for our metadata filters more effectively.

Below, we redefine our `extract_query_filters` function with a revised prompt, which we can then modify iteratively to get better results.

In [130]:
import instructor


async def extract_query_filters(
    client: instructor.AsyncInstructor, query: str
) -> QueryFilters:
    return await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """
You are a helpful assistant that extracts user requirements from a query.

Use the following taxonomy as a reference for what fields are available to you. Only use values from the provided taxonomy.

<taxonomy>
{{ taxonomy_data }}
</taxonomy>

Guidelines:

- If a specific filter isn't needed, just return an empty list or null value for that
- If the attribute exists on multiple types, make sure that you only look at the specific types listed under the category you have chosen
- Make sure that you've chosen from the right attribute values for each attribute type. This is very important.

Here are some general rules about how to generate these filters
1. Dresses and Skirts should always go together
2. Potential Filters should only be added if the user has explicitly mentioned it. When selecting filter values, aim to make them more flexible. For instance, if the user is asking for a well fitting top, we can consider both regular and relaxed fit. If another attribute value might be a good match for the user's query, include it too. 
3. If the user hasn't mentioned the attribute for the product type in their query, don't include that attribute in the filter. For instance if the user only mentions that they want something that's comfortable, don't include a filter for the fit of the product.
4. If the user mentions a rough range (eg. around 50 bucks), let's just use a buffer of 30 bucks on each side (Eg. 20-80)
5. If the user mentions a vague price (Eg. I have a high budget), just set max price to 1000
6. Make sure to look carefully at the user's query to determine if they've specified a specific fit - eg. regular, relaxed, cropped. (Regular and Relaxed should always go together)

""",
            },
            {"role": "user", "content": query},
        ],
        context={
            "taxonomy_data": taxonomy_data,
        },
        response_model=QueryFilters,
    )


Now let's define our Braintrust `EvalAsync` run and then use it to iterate on our prompt.

In [None]:
import json

import instructor
import lancedb
import openai
from braintrust import EvalAsync, Score
from helpers.metrics import get_metrics_at_k
from helpers.taxonomy import process_taxonomy_file


def evaluate_braintrust(input, output, **kwargs):
    metrics = get_metrics_at_k(metrics=["mrr", "recall"], sizes=[1, 3, 5, 10, 15, 25])
    return [
        Score(
            name=metric,
            score=score_fn(output, kwargs["expected"]),
            metadata={"query": input, "result": output, **kwargs["metadata"]},
        )
        for metric, score_fn in metrics.items()
    ]


client = instructor.from_openai(openai.AsyncOpenAI())
taxonomy_data = process_taxonomy_file("taxonomy.yml")
with open("queries.json", "r") as f:
    queries = [json.loads(line) for line in f]
db = lancedb.connect("./lancedb")
table = db.open_table("items")


async def generate_filters_and_retrieve_items(query: dict, hooks) -> list:
    generated_filter = await extract_query_filters(client, query["query"])
    results = retrieve_and_filter(query["query"], table, generated_filter)
    items_without_vector = [
        {k: v for k, v in item.items() if k != "vector"} for item in results
    ]

    hooks.meta(filters=generated_filter.model_dump(), items=items_without_vector)
    return [item["id"] for item in results]


await EvalAsync(
    "query-generation",
    data=lambda: [{"input": query, "expected": [query["id"]]} for query in queries],
    task=generate_filters_and_retrieve_items,
    scores=[evaluate_braintrust],
)

Skipping git metadata. This is likely because the repository has not been published to a remote yet. Remote named 'origin' didn't exist
Experiment main-1739390828 is running at https://www.braintrust.dev/app/567/p/query-generation/experiments/main-1739390828
query-generation (data): 38it [00:00, 74792.84it/s]


main-1739390828 compared to main-1739390737:
73.68% (-01.99%) 'mrr@1'     score	(1 improvements, 1 regressions)
78.07% (-02.11%) 'mrr@3'     score	(1 improvements, 1 regressions)
78.07% (-02.11%) 'mrr@5'     score	(1 improvements, 1 regressions)
79.18% (-02.14%) 'mrr@10'    score	(1 improvements, 1 regressions)
79.18% (-02.14%) 'mrr@15'    score	(1 improvements, 1 regressions)
79.18% (-02.14%) 'mrr@25'    score	(1 improvements, 1 regressions)
73.68% (-01.99%) 'recall@1'  score	(1 improvements, 1 regressions)
84.21% (-02.28%) 'recall@3'  score	(1 improvements, 1 regressions)
84.21% (-02.28%) 'recall@5'  score	(1 improvements, 1 regressions)
92.11% (-02.49%) 'recall@10' score	(1 improvements, 1 regressions)
92.11% (-02.49%) 'recall@15' score	(1 improvements, 1 regressions)
92.11% (-02.49%) 'recall@25' score	(1 improvements, 1 regressions)

1739390828.04s start
1739390845.33s end
11.27s (+287.71%) 'duration'	(11 improvements, 27 regressions)

See results for main-1739390828 at https://ww

EvalResultWithSummary(summary="...", results=[...])

We can see that the filters are working as expected. With metadata filters in place, MRR@5 rises from 0.47 to 0.78 and Recall@5 from 0.63 to 0.84 compared with the semantic search baseline. The full comparison is shown below.

| Metric | Semantic Search (Baseline) | Semantic Search + Metadata Filters |
|---------|--------------------------|----------------------------------|
| MRR@5   | 0.47 | 0.78 (+65.43%) |
| MRR@10  | 0.49 | 0.79 (+61.67%) |
| MRR@15  | 0.49 | 0.79 (+61.06%) |
| MRR@20  | 0.49 | 0.79 (+60.05%) |
| MRR@25  | 0.49 | 0.79 (+60.05%) |
| Recall@5  | 0.63 | 0.84 (+33.33%) |
| Recall@10 | 0.76 | 0.92 (+20.70%) |
| Recall@15 | 0.79 | 0.92 (+16.67%) |
| Recall@20 | 0.84 | 0.92 (+9.38%) |
| Recall@25 | 0.84 | 0.92 (+9.38%) |

Most importantly, we've been able to quantify the improvement in our retrieval at each step. This allows us to know the impact of each step in our retrieval pipeline. 
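The percentages in the table are relative gains over the baseline. As a small sketch of that calculation (`relative_gain` is a hypothetical helper, and the inputs are the rounded scores from the table):

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Percentage change of `improved` relative to `baseline`."""
    return (improved - baseline) / baseline * 100


# Recall@5: 0.63 -> 0.84
print(round(relative_gain(0.63, 0.84), 2))  # 33.33

# MRR@5: 0.47 -> 0.78
print(round(relative_gain(0.47, 0.78), 2))  # 65.96
```

The second value differs slightly from the +65.43% in the table. One plausible explanation is that the table was computed from unrounded scores, so small differences like this are to be expected when recomputing from two-decimal values.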