# Context

In `make_synthetic_questions.ipynb`, we generated synthetic questions to bootstrap evaluation of the retrieval system in our hardware store's Q&A system.

This notebook shows the first step in calculating precision and recall with different retrieval parameters. We will run more advanced experiments in future notebooks after we have these baseline scores.

## Data

Here is a brief review of the data.

In [1]:
import json
import lancedb
import os
import pandas as pd
from typing import List, Dict
from concurrent.futures import ThreadPoolExecutor

pd.set_option("display.max_colwidth", 160)

db = lancedb.connect("./lancedb")
reviews_table = db.open_table("reviews")
reviews_table.to_pandas().head()

Unnamed: 0,id,product_title,product_description,review,vector
0,0,Cordless Drill,"This powerful cordless drill features a lightweight design and a 2-speed transmission, allowing you to tackle various tasks with ease. Ideal for both profes...","I've owned several cordless drills over the years, but this one is exceptional. It is lightweight, making it easy to use for extended periods without fatigu...","[0.012451614, 0.009677136, -0.013976053, -0.02875701, 0.017634707, 0.0003763458, -0.01216502, 0.0825392, 0.03673287, -0.021500682, 0.023537332, 0.03043999, ..."
1,1,Cordless Drill,"This powerful cordless drill features a lightweight design and a 2-speed transmission, allowing you to tackle various tasks with ease. Ideal for both profes...","As a professional contractor, I rely on my tools every day. This cordless drill has exceeded my expectations with its powerful motor and ergonomic design. T...","[0.028722303, -0.00047558517, -0.0009903691, -0.019788183, 0.018789813, 0.0009975688, -0.035096504, 0.07188256, 0.015909903, -0.010137284, 0.028875899, -0.0..."
2,2,Cordless Drill,"This powerful cordless drill features a lightweight design and a 2-speed transmission, allowing you to tackle various tasks with ease. Ideal for both profes...",I'm a DIY enthusiast and bought this cordless drill for home projects. It's perfect for everything from hanging shelves to assembling furniture. The drill i...,"[-0.011681931, 0.0018038707, -0.02288322, -0.030081118, 0.042812254, 0.0008550435, -0.035051655, 0.07488628, 0.0060666325, 0.0003996797, 0.019682853, 0.0155..."
3,3,Cordless Drill,"This powerful cordless drill features a lightweight design and a 2-speed transmission, allowing you to tackle various tasks with ease. Ideal for both profes...","After using this cordless drill for several months, I can confidently say it's one of the best I've owned. The lightweight design makes it comfortable to us...","[0.018682713, 0.02133356, -0.0092318645, -0.031441357, 0.02241695, 0.0017821187, 0.0009205933, 0.06265221, 0.05357017, 0.0011827967, 0.046009492, 0.00949118..."
4,4,Cordless Drill,"This powerful cordless drill features a lightweight design and a 2-speed transmission, allowing you to tackle various tasks with ease. Ideal for both profes...","This cordless drill has become my go-to tool for all my DIY projects. The lightweight design reduces strain on my wrist, and the 2-speed transmission is inc...","[0.016187381, 0.004412105, -0.019909551, -0.01881957, 0.019747214, -0.013334877, -0.027829308, 0.060018543, 0.0031047072, -0.008151668, 0.006702225, 0.02581..."


In [2]:
with open("synthetic_eval_dataset.json", "r") as f:
    synthetic_questions = json.load(f)
synthetic_questions[:5]

[{'question': 'How good is the battery life on this cordless drill?',
  'answer': 'It comes with two included batteries, ensuring that you never run out of power on the job.',
  'chunk_id': '0'},
 {'question': 'Is this cordless drill easy to handle for long tasks?',
  'answer': 'Yes, its lightweight design makes it easy to use for extended periods without fatigue.',
  'chunk_id': '0'},
 {'question': 'How powerful is the motor in this cordless drill?',
  'answer': 'The cordless drill features a powerful motor that exceeds expectations for professional use.',
  'chunk_id': '1'},
 {'question': 'What design features make this drill suitable for overhead tasks?',
  'answer': 'The cordless drill has a lightweight design and ergonomic build, making it perfect for overhead tasks.',
  'chunk_id': '1'},
 {'question': 'How durable are the batteries for this cordless drill?',
  'answer': 'The batteries charge quickly and last a long time, which is a huge plus.',
  'chunk_id': '2'}]

## Set Up Evaluation

Load the evaluation questions into a structured format.

In [3]:
from pydantic import BaseModel


class EvalQuestion(BaseModel):
    question: str
    answer: str
    chunk_id: str


eval_questions = [EvalQuestion(**question) for question in synthetic_questions]

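Build a simple search function. For each retrieved review, it reports whether that review is the chunk the question was generated from.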

In [4]:
def run_simple_request(q: EvalQuestion, n_return_vals=5):
    results = (
        reviews_table.search(q.question).select(["id"]).limit(n_return_vals).to_list()
    )
    return [str(q.chunk_id) == str(r["id"]) for r in results]

Now run the benchmark. For simplicity, this cell only compares different retrieval sizes using plain semantic search.

In [5]:
def score(hits):
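    # This implementation assumes each question has exactly one relevant chunk,
    # so recall is the fraction of questions whose chunk was retrieved.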
    n_retrieval_requests = len(hits)
    total_retrievals = sum(len(l) for l in hits)
    true_positives = sum(sum(sublist) for sublist in hits)
    precision = true_positives / total_retrievals if total_retrievals > 0 else 0
    recall = true_positives / n_retrieval_requests if n_retrieval_requests > 0 else 0
    return {"precision": precision, "recall": recall}


def score_simple_search(n_to_retrieve: int) -> Dict[str, float]:
    # parallelize to speed this up 5-10X
    with ThreadPoolExecutor() as executor:
        hits = list(
            executor.map(lambda q: run_simple_request(q, n_to_retrieve), eval_questions)
        )
    return score(hits)


k_to_retrieve = [5, 10, 20]
scores = pd.DataFrame([score_simple_search(n) for n in k_to_retrieve])
scores["n_retrieved"] = k_to_retrieve
scores

   precision    recall  n_retrieved
0   0.101410  0.507048            5
1   0.070749  0.707489           10
2   0.044361  0.887225           20
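
Each synthetic question points at a single `chunk_id`, so at most one of the retrieved results can be a hit. That is why precision in the table above is roughly recall divided by the number of results retrieved. The sketch below reproduces the same arithmetic as `score` on a small, made-up `toy_hits` list (hypothetical data, not from the reviews table) and then checks that relationship against the reported numbers.

```python
from typing import Dict, List


def score(hits: List[List[bool]]) -> Dict[str, float]:
    """Same arithmetic as the notebook's score(): one relevant chunk per question."""
    n_questions = len(hits)
    total_retrieved = sum(len(h) for h in hits)
    true_positives = sum(sum(h) for h in hits)
    precision = true_positives / total_retrieved if total_retrieved else 0.0
    recall = true_positives / n_questions if n_questions else 0.0
    return {"precision": precision, "recall": recall}


# Hypothetical toy data: 4 questions, 5 results each, 2 questions find their chunk.
toy_hits = [
    [False, True, False, False, False],
    [False, False, False, False, False],
    [True, False, False, False, False],
    [False, False, False, False, False],
]
toy_scores = score(toy_hits)
print(toy_scores)  # {'precision': 0.1, 'recall': 0.5}

# (n_retrieved, precision, recall) copied from the results table above.
reported = [(5, 0.10141, 0.507048), (10, 0.070749, 0.707489), (20, 0.044361, 0.887225)]
for k, precision, recall in reported:
    # With one relevant chunk per question, precision should equal recall / k.
    assert abs(precision - recall / k) < 1e-5
```

Because precision is tied to recall in this setup, recall is the more informative number to watch when comparing retrieval sizes.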


If you have Cohere set up, you can see if a reranker improves results (we'll talk more about rerankers in the coming weeks).

In [6]:
try:
    import cohere
    from diskcache import Cache
    cohere_api_key = os.environ["COHERE_API_KEY"]

    # Use diskcache to reduce re-running in case of error (or addition of new data)
    cache = Cache("./cohere_cache")
    
    def run_reranked_request(q: EvalQuestion, n_return_vals=5, n_to_rerank=40) -> List[bool]:
        # Check the cache first so repeated runs skip both the search and the rerank call.
        # The key includes chunk_id and n_to_rerank because both affect the result.
        cache_key = f"{q.chunk_id}_{q.question}_{n_return_vals}_{n_to_rerank}".replace("?", "")
        cached_result = cache.get(cache_key)
        if cached_result is not None:
            return cached_result

        # First, get more results than we need
        initial_results = reviews_table.search(q.question) \
            .select(["id", "review"]) \
            .limit(n_to_rerank) \
            .to_list()

        # Prepare texts for reranking
        texts = [r["review"] for r in initial_results]
        
        # Rerank using Cohere
        co = cohere.Client(cohere_api_key)
        reranked = co.rerank(
            query=q.question,
            documents=texts,
            top_n=n_return_vals
        )
        
        # Map reranked results back to original IDs
        reranked_ids = [initial_results[r.index]["id"] for r in reranked.results]
        result = [str(q.chunk_id) == str(r) for r in reranked_ids]
        cache.set(cache_key, result)
        return result

    def score_reranked_search(n_to_retrieve: int, n_to_rerank: int = 40) -> Dict[str, float]:
        with ThreadPoolExecutor() as executor:
            hits = list(executor.map(
                lambda q: run_reranked_request(q, n_to_retrieve, n_to_rerank), 
                eval_questions
            ))
        return score(hits)

    k_to_retrieve = [5, 10, 20]
    reranked_scores = pd.DataFrame([score_reranked_search(n) for n in k_to_retrieve])
    reranked_scores["n_retrieved"] = k_to_retrieve
    print(reranked_scores)
except Exception as e:
    print(f"Could not run reranker.\n{e}")
    print("Ensure the COHERE_API_KEY environment variable is set and the cohere and diskcache libraries are installed.")
    print("A 'Connection reset by peer' error is likely rate limiting from Cohere.")

   precision    recall  n_retrieved
0   0.125198  0.625991            5
1   0.081806  0.818062           10
2   0.046960  0.939207           20
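
To read the two result tables side by side, one option is to compute the recall gain from reranking at each retrieval size. The sketch below hard-codes the scores printed in this notebook. The `compare` helper and the `misses_recovered` measure (the share of the baseline's missed questions that the reranker recovers) are illustrative additions, not part of the original notebook.

```python
from typing import Dict, List

# Recall values copied from the two result tables in this notebook, keyed by n_retrieved.
baseline_recall: Dict[int, float] = {5: 0.507048, 10: 0.707489, 20: 0.887225}
reranked_recall: Dict[int, float] = {5: 0.625991, 10: 0.818062, 20: 0.939207}


def compare(baseline: Dict[int, float], reranked: Dict[int, float]) -> List[dict]:
    """Return one row per retrieval size with the absolute and relative recall gain."""
    rows = []
    for k in sorted(baseline):
        gain = reranked[k] - baseline[k]
        missed = 1.0 - baseline[k]
        rows.append(
            {
                "n_retrieved": k,
                "recall_gain": gain,
                # Share of the baseline's missed questions that reranking recovered.
                "misses_recovered": gain / missed if missed > 0 else 0.0,
            }
        )
    return rows


comparison = compare(baseline_recall, reranked_recall)
for row in comparison:
    print(
        f"k={row['n_retrieved']:>2}  "
        f"recall gain={row['recall_gain']:.3f}  "
        f"misses recovered={row['misses_recovered']:.1%}"
    )
```

On these numbers the reranker improves recall at every retrieval size, with the largest absolute gain at 5 results. Note that reranking can only reorder the 40 candidates returned by the initial search, so its recall cannot exceed the baseline's recall at 40 results.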
