# Benchmarking Retrieval Methods

> If you have not already, please run the `1. synthetic_questions.ipynb` notebook to generate the synthetic questions and chunks before proceeding.

Recall that the dataset in our previous notebook had the following three fields:

- `id`: a unique identifier for each query.
- `query`: a sample SQL query.
- `difficulty`: a label indicating how difficult the query is to generate. It is one of `simple`, `moderate` or `challenging`.

We then generated synthetic questions for each query marked as `challenging` and stored them in Braintrust.

With these new questions, we can now benchmark embedding search against hybrid search. We'll use Braintrust to collect the results and then discuss how each method trades off retrieval performance against complexity.

## Evaluation Method

We'll do so in two steps:

1. First, we'll ingest all of the queries in our dataset into a single LanceDB table.
2. Next, we'll write a function that takes a query and returns the top k results from our database.

We can then measure recall@k and MRR@k by checking whether, given a `question`, the correct `chunk_id` appears among the top k most relevant results retrieved from our database.


## Setting up Our RAG Pipeline

In this example, we're using a local `lancedb` instance, for three reasons:

1. LanceDB handles the embedding of our data for us.
2. It provides embedding search, hybrid search and other re-ranking methods within a single API.
3. We can use Pydantic to handle the creation and ingestion of our data.

This makes it quick and easy for us to compare the performance of each method.

In [26]:
import datasets
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

# Create LanceDB Instance
db = lancedb.connect("./lancedb")

# Define and create Table using Pydantic
func = get_registry().get("openai").create(name="text-embedding-3-small")


class Chunk(LanceModel):
    id: str
    query: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()


table = db.create_table("chunks", schema=Chunk, mode="overwrite")

# Ingest dataset into table
dataset = datasets.load_dataset("567-labs/bird-rag")["train"]
formatted_dataset = [{"id": item["id"], "query": item["query"]} for item in dataset]

table.add(formatted_dataset)
print(f"{table.count_rows()} chunks ingested into the database")

1528 chunks ingested into the database


## Defining Metrics

Let's now start by evaluating the retrieval performance of our model. We'll do so by measuring the recall and MRR at different levels of k.

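For a question with a set of relevant chunks, the two metrics are defined as follows:

- Recall@k: the fraction of relevant chunks that appear in the top k retrieved results.
- MRR@k: the reciprocal of the rank of the highest-ranked relevant chunk within the top k results, or 0 if none of them is retrieved. Averaging this value over all questions gives the mean reciprocal rank.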

In [4]:
def calculate_mrr(predictions: list[str], gt: list[str]):
    mrr = 0
    for label in gt:
        if label in predictions:
            mrr = max(mrr, 1 / (predictions.index(label) + 1))
    return mrr


def calculate_recall(predictions: list[str], gt: list[str]):
    return len([label for label in gt if label in predictions]) / len(gt)
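As a quick sanity check, the two metrics can be run on toy data at several cutoffs. This is only a sketch: the ids in `predictions` and `gt` are made up for illustration, and the two functions are repeated so that the snippet runs on its own.

```python
def calculate_mrr(predictions: list[str], gt: list[str]) -> float:
    # Reciprocal rank of the best-ranked relevant item, 0 if none is retrieved
    mrr = 0.0
    for label in gt:
        if label in predictions:
            mrr = max(mrr, 1 / (predictions.index(label) + 1))
    return mrr


def calculate_recall(predictions: list[str], gt: list[str]) -> float:
    # Fraction of relevant items that appear anywhere in the predictions
    return len([label for label in gt if label in predictions]) / len(gt)


# Hypothetical toy data: five retrieved ids (best first) and two relevant ids
predictions = ["q7", "q2", "q9", "q4", "q1"]
gt = ["q2", "q4"]

for k in (1, 3, 5):
    top_k = predictions[:k]  # truncate to the cutoff before scoring
    print(k, calculate_mrr(top_k, gt), calculate_recall(top_k, gt))
```

Truncating the predictions before scoring is what turns each function into its @k variant: at k=1 neither relevant id is retrieved, at k=3 only `q2` is found (rank 2), and at k=5 both are found while the MRR stays at the rank of the best hit.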

### Computing Metrics

Now let's see how we can compute these using Braintrust and LanceDB. We'll start by writing a function that takes a query and returns the top k results from our database.

In [28]:
# TODO: Make this more complex to support a larger max_k, re-rankers etc + mode
def retrieve(query: str, k: int = 10) -> list[str]:
    results = table.search(query).limit(k).to_list()
    return [result["query"] for result in results]


retrieve("What is the capital of France?")[:2]

["SELECT T1.driverId FROM lapTimes AS T1 INNER JOIN races AS T2 on T1.raceId = T2.raceId WHERE T2.name = 'French Grand Prix' AND T1.lap = 3 ORDER BY T1.time DESC LIMIT 1",
 'SELECT T2.City FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode GROUP BY T2.City ORDER BY SUM(T1.`Enrollment (K-12)`) ASC LIMIT 5']
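To compare embedding search with hybrid search, the retrieval function needs a switchable mode. The following is a minimal sketch rather than the notebook's implementation: `retrieve_with_mode` and its `mode` parameter are hypothetical names, the table is passed in explicitly, and hybrid mode assumes a full-text index has already been built on the `query` column (for example with `table.create_fts_index("query")`).

```python
from typing import Any


def retrieve_with_mode(
    table: Any, query: str, k: int = 10, mode: str = "vector"
) -> list[str]:
    """Return the top-k stored SQL queries for a natural-language question.

    `table` is assumed to be a LanceDB table whose schema has a `query` column
    and an embedding function attached, as in the `Chunk` model above.
    """
    if mode == "vector":
        # Default search: the question is embedded and matched by vector similarity
        search = table.search(query)
    elif mode == "hybrid":
        # Hybrid search combines full-text and vector results;
        # assumption: an FTS index on `query` exists
        search = table.search(query, query_type="hybrid")
    else:
        raise ValueError(f"Unsupported mode: {mode!r}")

    results = search.limit(k).to_list()
    return [row["query"] for row in results]
```

Taking the table as an argument keeps the function easy to exercise with a stub and makes it straightforward to run one experiment per mode.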

Let's now define a dictionary of scoring functions that take the returned results and compute recall and MRR at each cutoff k.

In [6]:
import itertools

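eval_metrics = [["mrr", calculate_mrr], ["recall", calculate_recall]]
# Cutoffs k, matching the scores reported in the experiment output below
sizes = [3, 5, 10, 15, 25]

# Bind metric_fn and size as default arguments so that each lambda keeps its own
# values instead of the last ones seen by the comprehension
metrics = {
    f"{metric_name}@{size}": lambda predictions, gt, m=metric_fn, s=size: m(
        predictions[:s], gt
    )
    for (metric_name, metric_fn), size in itertools.product(eval_metrics, sizes)
}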
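Binding `metric_fn` and `size` as default arguments matters here: a lambda that referenced the loop variables directly would only see their final values when it is eventually called. A small sketch on toy data shows the difference (the `first_hit` metric and the inputs are hypothetical, chosen only for illustration):

```python
def first_hit(predictions: list[str], gt: list[str]) -> float:
    # Toy metric: 1.0 if any relevant id is in the predictions, else 0.0
    return float(any(label in predictions for label in gt))


sizes = [1, 3]

# Late binding: every lambda looks up `size` when called, after the loop has ended
late = {f"hit@{size}": lambda p, g: first_hit(p[:size], g) for size in sizes}

# Default-argument binding: each lambda stores its own cutoff in `s`
bound = {f"hit@{size}": lambda p, g, s=size: first_hit(p[:s], g) for size in sizes}

predictions = ["q7", "q2", "q9"]
gt = ["q2"]

print(late["hit@1"](predictions, gt))   # scored with the last size (3), not 1
print(bound["hit@1"](predictions, gt))  # scored with a cutoff of 1
```

With late binding, `hit@1` silently reports the value for the largest cutoff, which would make every `@k` column in the experiment identical.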

In [30]:
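import hashlib

from braintrust import Score, init_dataset, Eval


# Assumption: `hash_query` is not defined in this notebook, so a stand-in is
# given here. Any deterministic hash works because the retrieved and expected
# queries are hashed the same way before being compared.
def hash_query(query: str) -> str:
    return hashlib.md5(query.encode("utf-8")).hexdigest()


def evaluate_braintrust(input, output, **kwargs):
    # Compare hashes of the SQL strings rather than the raw strings
    hashed_queries = [hash_query(query) for query in output]
    hashed_expected = [hash_query(query) for query in kwargs["expected"]]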
    return [
        Score(
            name=metric,
            score=score_fn(hashed_queries, hashed_expected),
            metadata={"query": input, "result": output, **kwargs["metadata"]},
        )
        for metric, score_fn in metrics.items()
    ]


await Eval(
    "Text-2-SQL",
    data=init_dataset(project="Text-2-SQL", name="Bird-Bench-Questions"),
    task=lambda input: retrieve(input, 25),
    scores=[evaluate_braintrust],
)

Experiment add-braintrust-support-1729604364 is running at https://www.braintrust.dev/app/567/p/Text-2-SQL/experiments/add-braintrust-support-1729604364
Text-2-SQL (data): 290it [00:01, 169.57it/s]


Text-2-SQL (tasks):   0%|          | 0/290 [00:00<?, ?it/s]


add-braintrust-support-1729604364 compared to add-braintrust-support-1729604072:
58.39% 'mrr@3'     score
59.79% 'mrr@5'     score
61.35% 'mrr@10'    score
61.70% 'mrr@15'    score
61.85% 'mrr@25'    score
73.45% 'recall@3'  score
79.31% 'recall@5'  score
90.69% 'recall@10' score
95.17% 'recall@15' score
97.93% 'recall@25' score

7.12s duration

See results for add-braintrust-support-1729604364 at https://www.braintrust.dev/app/567/p/Text-2-SQL/experiments/add-braintrust-support-1729604364


EvalResultWithSummary(summary="...", results=[...])

TODO: Add in more complex metrics here for different stuff