# Context

In `make_synthetic_questions.ipynb`, we generated synthetic questions to bootstrap evaluation of the retrieval system in our hardware store's Q&A system.

This notebook shows the first step in calculating precision and recall with different retrieval parameters. We will run more advanced experiments in future notebooks after we have these baseline scores.

## Data

Here is a brief review of the data.

In [1]:
import json
import lancedb
import os
import pandas as pd
from typing import List, Dict
from concurrent.futures import ThreadPoolExecutor

pd.set_option("display.max_colwidth", 160)

db = lancedb.connect("./lancedb")
reviews_table = db.open_table("reviews")
reviews_table.to_pandas().head()

Unnamed: 0,id,product_title,product_description,review,vector
0,0,Cordless Drill,"This lightweight cordless drill features a powerful lithium-ion battery, making it perfect for both indoor and outdoor projects. With multiple speed setting...","I've been using this cordless drill for about six months now, and it has exceeded my expectations. The lithium-ion battery lasts a long time, allowing me to...","[0.0031058379, 0.0037674536, 0.0050993524, -0.02172642, 0.01334788, 0.0073500015, -0.01471156, 0.04703539, 0.047220293, -0.004362619, 0.017080665, -0.030393..."
1,1,Cordless Drill,"This lightweight cordless drill features a powerful lithium-ion battery, making it perfect for both indoor and outdoor projects. With multiple speed setting...","This cordless drill is a game-changer for my DIY projects. The powerful battery charges quickly and holds its charge well, even when not in use. I appreciat...","[0.032876715, -0.010506455, -0.018205797, -0.0066769994, 0.015029025, 0.0103158485, -0.009143331, 0.0552643, 0.023843126, -0.02174068, 0.026176611, -0.00222..."
2,2,Cordless Drill,"This lightweight cordless drill features a powerful lithium-ion battery, making it perfect for both indoor and outdoor projects. With multiple speed setting...","I purchased this cordless drill for some home renovation work, and it has been fantastic. The battery life is impressive, and the drill itself is very power...","[-0.019077417, 0.011568582, -0.0027465487, -0.027230835, -0.009889271, -0.017392453, -0.025828583, 0.07703341, 0.0393083, 0.005713613, 0.021825379, -0.00500..."
3,3,Cordless Drill,"This lightweight cordless drill features a powerful lithium-ion battery, making it perfect for both indoor and outdoor projects. With multiple speed setting...","As a professional contractor, I need reliable tools, and this cordless drill fits the bill perfectly. The lithium-ion battery provides consistent power, and...","[0.012545761, 0.017653596, 0.00077406655, -0.018319337, 0.035100583, -0.005041835, -0.044489816, 0.04885156, 0.005320183, -0.006657403, 0.021039689, -0.0318..."
4,4,Cordless Drill,"This lightweight cordless drill features a powerful lithium-ion battery, making it perfect for both indoor and outdoor projects. With multiple speed setting...","This is by far the best cordless drill I've ever owned. The battery life is outstanding, and it recharges quickly. The multiple speed settings are perfect f...","[0.0071220105, -0.004285458, -0.008581569, -0.015916655, -0.005252283, 0.0040084613, -0.015597044, 0.063964926, 0.0049886033, 0.012081316, 0.020455139, -0.0..."


In [2]:
with open("synthetic_eval_dataset.json", "r") as f:
    synthetic_questions = json.load(f)
synthetic_questions[:5]

[{'question': 'How long does the battery last on this cordless drill?',
  'answer': 'The lithium-ion battery lasts a long time, allowing for multiple projects on a single charge.',
  'chunk_id': '0'},
 {'question': 'Is this cordless drill comfortable to use for long periods?',
  'answer': 'Yes, the comfortable grip means it can be used for extended periods without fatigue.',
  'chunk_id': '0'},
 {'question': 'How long does the battery last on this cordless drill?',
  'answer': 'The powerful battery charges quickly and holds its charge well, even when not in use.',
  'chunk_id': '1'},
 {'question': 'Is this drill comfortable to use for extended periods?',
  'answer': 'The ergonomic design and lightweight build make it comfortable to use for hours on end.',
  'chunk_id': '1'},
 {'question': "How's the battery life on this cordless drill?",
  'answer': 'The battery life is impressive.',
  'chunk_id': '2'}]
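
Note that the sample above already contains the same question text ("How long does the battery last on this cordless drill?") attached to two different chunks (`chunk_id` 0 and 1). Retrieval alone cannot tell such questions apart, so it may help to know how common they are before reading too much into the scores. Below is a minimal sketch that uses a hand-copied subset of the records above (answers omitted) rather than the full file; `find_ambiguous_questions` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict

# Hand-copied subset of the records shown above (answers omitted for brevity).
sample_questions = [
    {"question": "How long does the battery last on this cordless drill?", "chunk_id": "0"},
    {"question": "Is this cordless drill comfortable to use for long periods?", "chunk_id": "0"},
    {"question": "How long does the battery last on this cordless drill?", "chunk_id": "1"},
    {"question": "Is this drill comfortable to use for extended periods?", "chunk_id": "1"},
    {"question": "How's the battery life on this cordless drill?", "chunk_id": "2"},
]


def find_ambiguous_questions(questions):
    """Return question texts that map to more than one chunk_id (hypothetical helper)."""
    chunks_by_question = defaultdict(set)
    for q in questions:
        chunks_by_question[q["question"]].add(q["chunk_id"])
    return {text: sorted(ids) for text, ids in chunks_by_question.items() if len(ids) > 1}


ambiguous = find_ambiguous_questions(sample_questions)
print(ambiguous)
```

To check the full dataset, pass `synthetic_questions` in place of the sample.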

## Set Up Evaluation

Load the evaluation questions into a structured format.

In [3]:
from pydantic import BaseModel


class EvalQuestion(BaseModel):
    question: str
    answer: str
    chunk_id: str


eval_questions = [EvalQuestion(**question) for question in synthetic_questions]

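Build a simple search function. It returns one boolean per retrieved result, marking whether that result is the chunk the question was generated from.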

In [4]:
def run_simple_request(q: EvalQuestion, n_return_vals=5):
    results = (
        reviews_table.search(q.question).select(["id"]).limit(n_return_vals).to_list()
    )
    return [str(q.chunk_id) == str(r["id"]) for r in results]

Now run the benchmark. For simplicity, this cell only compares different retrieval sizes using plain semantic search.

In [5]:
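def score(hits):
    # hits holds one list of booleans per question; True marks a retrieved
    # result that is the chunk the question was generated from.
    n_retrieval_requests = len(hits)
    total_retrievals = sum(len(l) for l in hits)
    true_positives = sum(sum(sublist) for sublist in hits)
    # Precision: share of all retrieved results that are the source chunk.
    precision = true_positives / total_retrievals if total_retrievals > 0 else 0
    # Recall: each question has exactly one source chunk, so this is the
    # share of questions whose source chunk was retrieved.
    recall = true_positives / n_retrieval_requests if n_retrieval_requests > 0 else 0
    return {"precision": precision, "recall": recall}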


def score_simple_search(n_to_retrieve: int) -> Dict[str, float]:
    # parallelize to speed this up 5-10X
    with ThreadPoolExecutor() as executor:
        hits = list(
            executor.map(lambda q: run_simple_request(q, n_to_retrieve), eval_questions)
        )
    return score(hits)


k_to_retrieve = [5, 10, 20]
scores = pd.DataFrame([score_simple_search(n) for n in k_to_retrieve])
scores["n_retrieved"] = k_to_retrieve
scores

   precision    recall  n_retrieved
0   0.103765  0.518826            5
1   0.072674  0.726744           10
2   0.045903  0.918051           20
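
Each synthetic question was generated from exactly one chunk, so (assuming chunk ids are unique) at most one of the retrieved results per question can be a hit. When every query returns the same number of results k, precision is therefore recall divided by k. That is why precision falls while recall rises in the results above (for example, 0.518826 / 5 ≈ 0.103765). The sketch below repeats the scoring logic on made-up hit lists to show the relationship; the `toy_hits` values are illustrative and not taken from the dataset.

```python
def score(hits):
    """Same logic as the notebook's score(): one list of booleans per question."""
    n_requests = len(hits)
    total_retrievals = sum(len(h) for h in hits)
    true_positives = sum(sum(h) for h in hits)
    precision = true_positives / total_retrievals if total_retrievals > 0 else 0
    recall = true_positives / n_requests if n_requests > 0 else 0
    return {"precision": precision, "recall": recall}


# Four made-up questions with k = 5 results each; True marks the source chunk.
toy_hits = [
    [True, False, False, False, False],   # hit at rank 1
    [False, False, True, False, False],   # hit at rank 3
    [False, False, False, False, False],  # miss
    [False, False, False, False, True],   # hit at rank 5
]

k = 5
toy_scores = score(toy_hits)
print(toy_scores)

# With one relevant chunk per question, precision equals recall / k.
assert abs(toy_scores["precision"] - toy_scores["recall"] / k) < 1e-12
```

Three of the four toy questions find their chunk, giving a recall of 0.75 and a precision of 3 / 20 = 0.15. Under this setup recall is the more informative number, since precision cannot exceed 1 / k.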


If you have Cohere set up, you can see if a reranker improves results (we'll talk more about rerankers in the coming weeks).

In [6]:
try:
    import cohere
    from diskcache import Cache
    cohere_api_key = os.environ["COHERE_API_KEY"]

    # Use diskcache to reduce re-running in case of error (or addition of new data)
    cache = Cache("./cohere_cache")
    
    def run_reranked_request(q: EvalQuestion, n_return_vals=5, n_to_rerank=40) -> List[bool]:
        # First, get more results than we need
        initial_results = reviews_table.search(q.question) \
            .select(["id", "review"]) \
            .limit(n_to_rerank) \
            .to_list()
        
        # Prepare texts for reranking
        texts = [r["review"] for r in initial_results]
        
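        # Include chunk_id and n_to_rerank in the key: the same question text can
        # belong to different chunks, and the cached hit list depends on both.
        cache_key = f"{q.chunk_id}_{q.question}_{n_return_vals}_{n_to_rerank}".replace("?", "")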
        # Try to get the result from cache
        cached_result = cache.get(cache_key)
        if cached_result is not None:
            return cached_result
        
        # Rerank using Cohere
        co = cohere.Client(cohere_api_key)
        reranked = co.rerank(
            query=q.question,
            documents=texts,
            top_n=n_return_vals
        )
        
        # Map reranked results back to original IDs
        reranked_ids = [initial_results[r.index]["id"] for r in reranked.results]
        result = [str(q.chunk_id) == str(r) for r in reranked_ids]
        cache.set(cache_key, result)
        return result

    def score_reranked_search(n_to_retrieve: int, n_to_rerank: int = 40) -> Dict[str, float]:
        with ThreadPoolExecutor() as executor:
            hits = list(executor.map(
                lambda q: run_reranked_request(q, n_to_retrieve, n_to_rerank), 
                eval_questions
            ))
        return score(hits)

    k_to_retrieve = [5, 10, 20]
    reranked_scores = pd.DataFrame([score_reranked_search(n) for n in k_to_retrieve])
    reranked_scores["n_retrieved"] = k_to_retrieve
    print(reranked_scores)
except Exception as e:
    print(f"Could not run reranker.\n{e}")
    print("Ensure the COHERE_API_KEY environment variable is set and the cohere and diskcache libraries are installed.")
    print("A 'Connection reset by peer' error is likely rate limiting from Cohere.")

   precision    recall  n_retrieved
0   0.127409  0.637043            5
1   0.082669  0.826689           10
2   0.047148  0.942968           20
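
To read the two result tables side by side, the recall gain from reranking can be computed per retrieval size. The sketch below re-enters the recall values printed above by hand; in the notebook itself the `scores` and `reranked_scores` DataFrames could be used directly. The names `baseline_recall`, `reranked_recall`, and `recall_gain` are illustrative.

```python
# Recall values copied from the two result tables above, keyed by n_retrieved.
baseline_recall = {5: 0.518826, 10: 0.726744, 20: 0.918051}
reranked_recall = {5: 0.637043, 10: 0.826689, 20: 0.942968}

# Absolute recall gain from reranking at each retrieval size.
recall_gain = {
    k: round(reranked_recall[k] - baseline_recall[k], 6) for k in baseline_recall
}

for k, gain in recall_gain.items():
    print(f"k={k:>2}: recall gain = {gain:+.6f}")
```

In this run the gain shrinks as the retrieval size grows, from roughly 0.12 at 5 results to roughly 0.02 at 20. Part of that is a ceiling effect: the reranker only reorders the top `n_to_rerank` (40) results from the initial search, so reranked recall can never exceed the plain search's recall at 40.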
