# Week 1 : Systematically Improving Your RAG Application

## Benchmarking Retrieval Methods

> If you have not already, please run the `1. synthetic_questions.ipynb` notebook to generate the synthetic questions we'll be using in this notebook before proceeding.

After generating the synthetic questions, we'll use them to benchmark the recall and MRR of various retrieval methods. These metrics give us an objective measure of how well a retrieval method performs.

This allows us to make an informed decision about whether a retrieval method is worth the cost and latency to implement. For instance, if we get a 1% improvement in recall@10 but latency increases by 10x, we might not want to use that method.
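
To weigh a recall gain against its latency cost, both need to be measured on the same set of questions. The sketch below times an arbitrary retrieval callable with `time.perf_counter`. The names `measure_latency` and `fake_retrieve` are hypothetical and not part of this notebook; `fake_retrieve` only stands in for a real retriever.

```python
import statistics
import time
from typing import Callable


def measure_latency(retrieve_fn: Callable[[str], list], questions: list[str]) -> dict:
    """Time each call to retrieve_fn and summarise the latencies in seconds."""
    latencies = []
    for question in questions:
        start = time.perf_counter()
        retrieve_fn(question)
        latencies.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(latencies),
        "max_s": max(latencies),
        "n": len(latencies),
    }


# Hypothetical stand-in retriever, used only to illustrate the helper
def fake_retrieve(question: str) -> list:
    return [question.lower()]


latency_stats = measure_latency(fake_retrieve, ["What is recall?", "What is MRR?"])
```

Comparing `mean_s` across retrieval methods next to their recall@10 makes the trade-off explicit.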

We'll do this in two steps:

1. First, we'll take our original dataset of SQL snippets from `567-labs/bird-rag` and ingest it into a local LanceDB instance.
2. Then we'll show you how to measure recall and MRR at different levels of k.

Recall that in our original dataset, each row had the following columns:

- `id`: The unique identifier for each row
- `query`: The SQL snippet
- `difficulty`: The difficulty of the question

When we query our database with a synthetic question, we'll retrieve a list of chunks and their chunk ids that match the query. Our goal here is to verify that we're able to retrieve the correct chunk id for each question. 

We'll be using `braintrust` to collect the data and discuss the trade-offs of each method. We like `braintrust` because it lets us run multiple experiments using a simple `Eval` object and share the results with a team.


## Setting up Our RAG Pipeline

In this example, we're using a local `lancedb` instance for three reasons:

1. LanceDB handles the embedding of our data for us.
2. It provides embedding search, hybrid search and re-ranking methods within a single API.
3. We can use Pydantic to define our table schema and easily ingest our data.

This makes it quick and easy for us to compare the performance of each method.

In [10]:
import datasets
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

# Create LanceDB Instance
db = lancedb.connect("./lancedb")

# Define and create Table using Pydantic
func = get_registry().get("openai").create(name="text-embedding-3-small")


class Chunk(LanceModel):
    id: str
    query: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()


table = db.create_table("chunks", schema=Chunk, mode="overwrite")

# Ingest dataset into table
dataset = datasets.load_dataset("567-labs/bird-rag")["train"]
formatted_dataset = [{"id": item["id"], "query": item["query"]} for item in dataset]

table.add(formatted_dataset)

table.create_fts_index("query", replace=True)
print(f"{table.count_rows()} chunks ingested into the database")

1528 chunks ingested into the database


## Defining Metrics

Let's now start by evaluating the retrieval performance of our model. We'll do so by measuring the recall and MRR at different levels of k.

$$ \text{Recall} = \frac{\text{Number of Relevant Items Retrieved}}{\text{Total Number of Relevant Items}} $$ 

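$$ \text{MRR} = \frac{\sum_{i=1}^{n} \frac{1}{\text{rank}(i)}}{n} $$

Here $n$ is the number of queries and $\text{rank}(i)$ is the position of the first relevant item retrieved for query $i$.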

As language models improve, so do their context windows and reasoning abilities, which makes them better at picking out the relevant information from what we retrieve. By optimizing for recall, we make sure the language model has access to all the necessary information, which can lead to more accurate and reliable responses.

MRR@K is a useful metric if we want to display retrieved results as citations to users. We normally show a smaller list of retrieved results to users and we want to make sure that the correct result is ranked highly during retrieval so that it's more likely to be selected.
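
The MRR formula above averages over queries, whereas a single query only yields one reciprocal rank. A minimal sketch of that averaging follows, using made-up ids and assuming a query with no relevant hit contributes 0. The helper names are ours and are separate from the functions defined in this notebook.

```python
def reciprocal_rank(predictions: list[str], relevant: list[str]) -> float:
    """Return 1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for position, item in enumerate(predictions, start=1):
        if item in relevant:
            return 1 / position
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], list[str]]]) -> float:
    """Average the reciprocal rank over every (predictions, relevant) pair."""
    return sum(reciprocal_rank(p, r) for p, r in runs) / len(runs)


# Illustrative ids only, not taken from the dataset
runs = [
    (["a", "b", "c"], ["a"]),  # first relevant item at rank 1 -> 1
    (["x", "y", "z"], ["z"]),  # first relevant item at rank 3 -> 1/3
    (["x", "y", "z"], ["q"]),  # nothing relevant retrieved -> 0
]
mrr = mean_reciprocal_rank(runs)  # (1 + 1/3 + 0) / 3 = 4/9
```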

In [10]:
def calculate_mrr(predictions: list[str], gt: list[str]):
    mrr = 0
    for label in gt:
        if label in predictions:
            # Find the relevant item that has the smallest index
            mrr = max(mrr, 1 / (predictions.index(label) + 1))
    return mrr


def calculate_recall(predictions: list[str], gt: list[str]):
    # Calculate the proportion of relevant items that were retrieved
    return len([label for label in gt if label in predictions]) / len(gt)

### Computing Metrics

We want to determine whether we can retrieve the relevant SQL snippet for each synthetic question. We can do so by querying our LanceDB instance. Since the `search` API supports re-rankers and different search methods, it lets us quickly test out different retrieval methods.

Remember that in our earlier notebook, we generated the question below for the corresponding SQL snippet.

Question: Which schools in France are locally funded and have a greater difference between their total K-12 enrollment and enrollment for ages 5-17 compared to the average difference for locally funded schools?

Snippet:
```sql
SELECT T2.School, T2.DOC FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode WHERE
T2.FundingType = 'Locally funded' AND (T1.`Enrollment (K-12)` - T1.`Enrollment (Ages 5-17)`) > (SELECT 
AVG(T3.`Enrollment (K-12)` - T3.`Enrollment (Ages 5-17)`) FROM frpm AS T3 INNER JOIN schools AS T4 ON T3.CDSCode = 
T4.CDSCode WHERE T4.FundingType = 'Locally funded')
```

Let's see this in action by fetching the top 25 results from our database that match this question using embedding search. Notice that we're using the chunk id to calculate recall and MRR.


In [16]:
from rich import print

question = "Which schools in France are locally funded and have a greater difference between their total K-12 enrollment and enrollment for ages 5-17 compared to the average difference for locally funded schools?"

query = "SELECT T2.School, T2.DOC FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode WHERE T2.FundingType = 'Locally funded' AND (T1.`Enrollment (K-12)` - T1.`Enrollment (Ages 5-17)`) > (SELECT AVG(T3.`Enrollment (K-12)` - T3.`Enrollment (Ages 5-17)`) FROM frpm AS T3 INNER JOIN schools AS T4 ON T3.CDSCode = T4.CDSCode WHERE T4.FundingType = 'Locally funded')"

retrieved_items = table.search(question).limit(25).to_list()

for item in retrieved_items[:2]:
    print(item["id"], item["query"])

We can see that the first item is a perfect match for our desired chunk. We're going to calculate the recall@25 and MRR@25 for this specific snippet using its id, which is 28. You can verify this by looking at the filtered subset of the Hugging Face dataset [here](https://huggingface.co/datasets/567-labs/bird-rag/viewer/default/train?q=T4.FundingType).

In [15]:
predicted_ids = [item["id"] for item in retrieved_items]
ground_truth_ids = ["28"]  # The ID of the expected SQL snippet

# Calculate metrics
mrr_score = calculate_mrr(predicted_ids, ground_truth_ids)
recall_score = calculate_recall(predicted_ids, ground_truth_ids)
rank_position = predicted_ids.index("28") + 1  # Adding 1 because index starts at 0

print(f"MRR@25: {mrr_score}")
print(f"Recall@25: {recall_score}")
print(f"Relevant item found at rank: {rank_position}")

This is a simple query, since many of the specific column names in the snippet, e.g. `Enrollment (K-12)` and `Enrollment (Ages 5-17)`, are echoed in the question. Even so, it shows how we can put an objective number on how well our retrieval system is performing.
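
Note that `predicted_ids.index(...)` raises a `ValueError` when the expected id is not among the retrieved items, which can happen for harder questions. A small helper along these lines returns `None` instead; the name `find_rank` is ours, not part of the notebook.

```python
from typing import Optional


def find_rank(predictions: list[str], label: str) -> Optional[int]:
    """Return the 1-based rank of label in predictions, or None if absent."""
    try:
        return predictions.index(label) + 1
    except ValueError:
        return None


# Illustrative ids only
rank_hit = find_rank(["12", "28", "7"], "28")   # 2
rank_miss = find_rank(["12", "28", "7"], "99")  # None
```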

We use a wrapper to compute metrics for multiple subsets of the retrieved items, so we don't have to hard-code a separate function for each cutoff. This makes it easy to vary both k and the set of metrics.

In [13]:
# Define the names for our metrics
eval_metrics = [["mrr", calculate_mrr], ["recall", calculate_recall]]

# Define the different sizes of k that we want to compute the metrics at
sizes = [1, 3, 5, 10]

# Define wrapper functions that will take a subset of k items and calculate the relevant metric
metrics = {}
for metric_name, metric_fn in eval_metrics:
    for size in sizes:
        key = f"{metric_name}@{size}"
        metrics[key] = lambda predictions, gt, m=metric_fn, s=size: m(
            predictions[:s], gt
        )

We can see this in action below with a simple example.

Our MRR@1 and Recall@1 will be 0 since there are no relevant items at k=1. However, at k=3, we have `a` in the top 3 retrieved items. This gives us an MRR@3 of 1/3 and a Recall@3 of 1/2 (since we retrieved 1 of 2 relevant items).

At k=5, we have both `a` and `b` in the top 5 retrieved items. This gives us an MRR@5 of 1/3, because the first relevant item is still at rank 3, and a Recall@5 of 2/2 (since we retrieved all of the relevant items). MRR@10 and Recall@10 are the same because only 5 items were retrieved.

In [14]:
labels = ["a", "b"]
preds = ["x", "y", "a", "c", "b"]

{metric: score_fn(preds, labels) for metric, score_fn in metrics.items()}

{'mrr@1': 0,
 'mrr@3': 0.3333333333333333,
 'mrr@5': 0.3333333333333333,
 'mrr@10': 0.3333333333333333,
 'recall@1': 0.0,
 'recall@3': 0.5,
 'recall@5': 1.0,
 'recall@10': 1.0}
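
The `m=metric_fn, s=size` default arguments in the wrapper loop matter. Python closures look up loop variables when the lambda is called, not when it is defined, so without the defaults every entry in `metrics` would use the last metric and the last size. A minimal illustration, using its own variable names rather than the notebook's:

```python
cutoffs = [1, 3, 5]
items = ["a", "b", "c", "d", "e"]

# Late binding: each lambda reads `k` when called, after the loop has ended at k=5
late_bound = [lambda xs: xs[:k] for k in cutoffs]

# Early binding: the default argument freezes the current value of `k`
early_bound = [lambda xs, k=k: xs[:k] for k in cutoffs]

late_lengths = [len(fn(items)) for fn in late_bound]    # [5, 5, 5]
early_lengths = [len(fn(items)) for fn in early_bound]  # [1, 3, 5]
```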

## Running Your First Benchmark

We use braintrust here because it makes complex experiments easy to set up. We can use the `Eval` object to define a single experiment and run it with different configurations.

We do so by defining:

- `Task`: a function that takes in the input and returns an output.
- `Scorer`: a function that compares the output to the expected output and returns a score. In our case, since we're measuring recall and MRR at different levels of k, we'll return a list of scores.

Since our task will run retrieval with different methods, we'll define a `retrieve` function that takes additional parameters, so it can be reconfigured for each experiment.

We implemented the metrics in the previous section. Now we want a single function that retrieves the top k items from our database with different retrieval methods, because it's much easier to modify one function with configuration parameters than to keep track of a separate function for each method.

In [5]:
from braintrust import Score
from lancedb.rerankers import CohereReranker
import lancedb
from typing import Literal

db = lancedb.connect("./lancedb")
table = db.open_table("chunks")
# Configure a cohere reranker
reranker = CohereReranker(model_name="rerank-multilingual-v3.0", column="query")


def retrieve(
    question: str,
    max_k=25,
    mode: Literal["vector", "fts", "hybrid"] = "vector",
    use_reranker: bool = False,
):
    results = table.search(question, query_type=mode).limit(max_k)
    if use_reranker:
        results = results.rerank(reranker=reranker)
    return results.to_list()


# Similar to our previous section, we can use the id of each item to compute the recall and MRR metrics.
def evaluate_braintrust(input, output, **kwargs):
    predictions = [item["id"] for item in output]
    labels = [kwargs["metadata"]["chunk_id"]]
    return [
        Score(
            name=metric,
            score=score_fn(predictions, labels),
            metadata={"query": input, "result": output, **kwargs["metadata"]},
        )
        for metric, score_fn in metrics.items()
    ]

In [7]:
# TODO: Need to wait for braintrust to fix the issue with their api before I can run this

from itertools import product
from braintrust import Eval, init_dataset

modes = ["vector", "fts", "hybrid"]
# Named differently from the `reranker` object defined above, so that
# `retrieve` still sees the CohereReranker instead of this list
reranker_options = [True, False]

results = {}
for mode, use_reranker in product(modes, reranker_options):
    results[(mode, use_reranker)] = await Eval(
        "Text-2-SQL",
        data=init_dataset(project="Text-2-SQL", name="Bird-Bench-Questions"),
        task=lambda input: retrieve(input, 25, mode=mode, use_reranker=use_reranker),
        scores=[evaluate_braintrust],
    )

print(results)
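
To sanity-check the numbers outside of braintrust, the same aggregation can be sketched as a plain loop. This is not how `Eval` works internally. `evaluate_locally`, `recall_at_k`, `toy_retrieve` and the two example rows are hypothetical stand-ins; in the notebook you would pass a function built on `retrieve` and the synthetic questions instead.

```python
from statistics import mean
from typing import Callable


def recall_at_k(predictions: list[str], gt: list[str], k: int) -> float:
    """Proportion of ground-truth ids found in the top-k predictions."""
    top_k = predictions[:k]
    return sum(label in top_k for label in gt) / len(gt)


def evaluate_locally(
    retrieve_ids: Callable[[str], list[str]],
    rows: list[dict],
    ks: tuple[int, ...] = (1, 3, 5, 10),
) -> dict[str, float]:
    """Average recall@k over rows shaped like {"question": ..., "chunk_id": ...}."""
    per_k: dict[int, list[float]] = {k: [] for k in ks}
    for row in rows:
        predicted = retrieve_ids(row["question"])
        for k in ks:
            per_k[k].append(recall_at_k(predicted, [row["chunk_id"]], k))
    return {f"recall@{k}": mean(scores) for k, scores in per_k.items()}


# Hypothetical retriever and rows, used only to show the shape of the output
def toy_retrieve(question: str) -> list[str]:
    return ["3", "1", "2"] if "first" in question else ["9", "8", "7"]


rows = [
    {"question": "first question", "chunk_id": "1"},
    {"question": "second question", "chunk_id": "5"},
]
summary = evaluate_locally(toy_retrieve, rows)
```

Running one such loop per `(mode, use_reranker)` pair gives a small table of averages that can be compared against what braintrust reports.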