# Goal

This notebook shows how to make synthetic data to bootstrap evaluation of your retrieval system. 

This synthetic data contains many triplets of `(RAG system input, system output, desired chunk to retrieve)`. For this example, we will work on a hardware retailer's system that answers user questions based on existing product reviews. Each synthetic example will look like this:

```
Q: How frequently do I need to replace the blades on this saw?
A: A customer reported getting 7-10 hours of active use between blade replacements.
Chunk ID: 3
```

Once you have many of these triplets, you can experiment with different retrieval strategies (e.g. different embedding models, embedding vs. keyword search, etc.) to determine which strategies most consistently retrieve the desired chunks.
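
As a minimal sketch of how such a comparison might be scored, the snippet below computes recall@k: the fraction of questions whose desired chunk appears among the top `k` retrieved chunk IDs. The names `recall_at_k` and `keyword_retrieve`, and the toy corpus, are hypothetical and introduced only for illustration; they are not part of this notebook's code, and a real retriever would replace the keyword-overlap function.

```python
import re
from typing import Callable, Dict, List


def recall_at_k(
    eval_rows: List[Dict[str, str]],
    retrieve: Callable[[str, int], List[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose desired chunk_id is in the top-k results."""
    if not eval_rows:
        return 0.0
    hits = sum(
        1 for row in eval_rows if row["chunk_id"] in retrieve(row["question"], k)
    )
    return hits / len(eval_rows)


# Hypothetical toy corpus keyed by chunk ID (assumption, for illustration only).
toy_corpus = {
    "0": "The 16 oz weight is perfect for driving nails.",
    "1": "The grip is ergonomic and reduces hand fatigue.",
    "2": "The steel head is tough and has no dents.",
}


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def keyword_retrieve(question: str, k: int) -> List[str]:
    """Rank chunk IDs by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(
        toy_corpus,
        key=lambda chunk_id: len(q_words & tokenize(toy_corpus[chunk_id])),
        reverse=True,
    )
    return ranked[:k]


toy_evals = [
    {"question": "Is the weight good for driving nails?", "chunk_id": "0"},
    {"question": "Does the grip reduce hand fatigue?", "chunk_id": "1"},
]
score = recall_at_k(toy_evals, keyword_retrieve, k=1)
print(f"recall@1 = {score:.2f}")
```

Swapping `keyword_retrieve` for another function with the same signature lets the same triplets score a different strategy.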

# A Starting Point

A simple approach would follow this pseudocode:

```
synth_data = []
for chunk in corpus:
    response = call_llm(f"Give me a JSON array of 10 question/answer pairs. The questions should be things someone might ask about a product before purchase. The answer should be something contained in this text: {chunk}")
    q_a_pairs = json.loads(response.content)
    q_a_c = [{'question': q, 'answer': a, 'chunk': chunk} for (q, a) in q_a_pairs]
    synth_data.extend(q_a_c)
```

A practical implementation should address three issues that arise in the naive pseudocode.

| Issue | Solution |
|---------|----------|
| Inconsistent formatting of LLM response (e.g. different keys) | Instructor library |
| Bad questions | Guidance/examples in prompt |
| Time waiting for LLM responses when iterating over many chunks | Async LLM calls|

# Reusable Code to Bootstrap Evals

The code in this notebook addresses these issues. The code is also available as [this script](https://gist.github.com/jxnl/5627c9d463ffe0b085896f7890fab1bf).

## Data

This course uses synthetic data for the use case of answering questions on a hardware retailer's website from product reviews. We created this data in `make_product_reviews.ipynb`. Here is a small sample of the data.

In [1]:
import lancedb
import pandas as pd

pd.set_option("display.max_colwidth", 160)

db = lancedb.connect("./lancedb")
reviews_table = db.open_table("reviews")
sample_reviews = reviews_table.to_pandas()
sample_reviews.review

0      I've been using this hammer for a few months now, and it's become my go-to tool for all my DIY projects. The 16 oz weight is perfect for driving nails witho...
1      This hammer is a solid addition to my toolbox. The balance between the handle and the head makes it easy to control, and the 16 oz weight is just right for ...
2      I purchased this hammer for some home renovation work, and it has exceeded my expectations. The steel head is tough and has withstood a lot of heavy use wit...
3      As a professional carpenter, I rely on my tools daily, and this hammer has not disappointed. The 16 oz weight is perfect for driving nails quickly and effic...
4      This hammer is a great value for the price. The 16 oz weight is perfect for general carpentry and DIY projects. The grip is comfortable and doesn't slip, ev...
                                                                                    ...                                                                              

## Structure The Data

We use Pydantic and Instructor as a reliable interface between our LLMs and the structured data formats we need in order to run code on LLM output.

In [2]:
from pydantic import BaseModel


class Review(BaseModel):
    id: str
    product_title: str
    product_description: str
    review: str


sample_chunks = [
    Review(
        id=str(row.id),
        product_title=row.product_title,
        product_description=row.product_description,
        review=row.review,
    )
    for _, row in sample_reviews.iterrows()
]

n_questions = 2  # number of questions to get in each LLM call
example_questions = [
    "What does the reviewer like about the product?",
    "What does the reviewer think could be improved?",
]

Now let's see how we build questions from a single chunk.

In [3]:
from typing import List
import instructor
from openai import AsyncOpenAI

# Patch the AsyncOpenAI client
client = instructor.from_openai(AsyncOpenAI())


class QuestionAnswer(BaseModel):
    question: str
    answer: str


class ChunkEval(QuestionAnswer):
    chunk_id: str


async def generate_evals(
    review: Review, n_questions: int, example_questions: List[str]
) -> List[ChunkEval]:

    prompt = f"""
        Generate `{n_questions}` question-answer pairs about a {review.product_title}. The answers should primarily be derived from information in this product review:

        <content>
        {review.review}
        </content>

        While they should contain information from the product review, you may also find it helpful context to see a product description:
        <content>
        {review.product_description}
        </content>

        Example questions:
        {chr(10).join(f'- {q}' for q in example_questions)}

        Provide a concise and specific answer for each question.
        Do not use the exact example questions. Use them only as inspiration for the types of more specific questions to generate.
        Do not include answers that are not in the content.
        Questions should ask about product characteristics (e.g. durability) and answers should refer to product characteristics without referring to the reviewer specifically.
        Stylistically, the questions should resemble what people would ask a RAG-based answer bot on a retailer's website. So they can be a little informal, messy or scattered.
        """

    try:
        pairs = client.chat.completions.create_iterable(
            model="gpt-4o-mini",
            response_model=QuestionAnswer,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
        return [
            ChunkEval(question=pair.question, answer=pair.answer, chunk_id=review.id)
            async for pair in pairs
        ]
    except Exception as e:
        print(f"Error generating evals: {str(e)}")
        return []


first_chunk_res = await generate_evals(sample_chunks[0], n_questions, example_questions)
first_chunk_res

[ChunkEval(question='How heavy is the hammer and is it good for driving nails?', answer='The hammer weighs 16 oz, which is perfect for driving nails without too much effort.', chunk_id='0'),
 ChunkEval(question='Is the grip comfortable for long use?', answer="Yes, the grip is comfortable even during extended use, and there hasn't been any noticeable wear on it.", chunk_id='0')]

To run `generate_evals` for many chunks in parallel, wrap it with a function that also takes a semaphore, which caps how many LLM calls are in flight at once.

In [4]:
import asyncio


class ChunkProcessingError(Exception):
    pass


async def process_chunk(
    review: Review,
    n_questions: int,
    example_questions: List[str],
    semaphore: asyncio.Semaphore,
) -> List[ChunkEval]:
    async with semaphore:
        try:
            return await generate_evals(review, n_questions, example_questions)
        except Exception as e:
            print(f"Unexpected error processing chunk {review.id}: {str(e)}")
            raise ChunkProcessingError(f"Failed to process chunk {review.id}") from e


# Test that we get the same results as directly calling generate_evals
await process_chunk(
    sample_chunks[0], n_questions, example_questions, asyncio.Semaphore(1)
)

[ChunkEval(question='How heavy is the hammer and is it good for driving nails?', answer='The hammer weighs 16 oz, which is perfect for driving nails without too much effort.', chunk_id='0'),
 ChunkEval(question='Is the grip comfortable for long use?', answer='Yes, the grip is comfortable even during extended use, and there has been no noticeable wear on it.', chunk_id='0')]
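
To see the semaphore pattern in isolation, here is a small sketch that replaces the LLM call with `asyncio.sleep`. The names `fake_llm_call` and `run_all` are hypothetical stand-ins, not functions from this notebook. The sketch records the peak number of simultaneously active calls to show that it does not exceed the semaphore's limit.

```python
import asyncio
from typing import Dict, List, Tuple


async def fake_llm_call(i: int, semaphore: asyncio.Semaphore, state: Dict[str, int]) -> int:
    """Hypothetical stand-in for an LLM request; sleeps instead of calling an API."""
    async with semaphore:  # waits here if the limit is already reached
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # simulated network latency
        state["active"] -= 1
        return i * 2


async def run_all(n_tasks: int, max_concurrency: int) -> Tuple[List[int], int]:
    semaphore = asyncio.Semaphore(max_concurrency)
    state = {"active": 0, "peak": 0}
    # gather schedules every task at once; the semaphore throttles them
    results = await asyncio.gather(
        *(fake_llm_call(i, semaphore, state) for i in range(n_tasks))
    )
    return list(results), state["peak"]


# In a notebook cell, use `await run_all(...)` instead of asyncio.run.
results, peak = asyncio.run(run_all(n_tasks=20, max_concurrency=3))
print(f"{len(results)} calls finished, peak concurrency {peak}")
```

`asyncio.gather` returns results in the order the tasks were passed in, so the output still lines up with the input chunks even though calls finish at different times.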

Now you can build the full dataset with `create_synthetic_dataset`, which calls `process_chunk` on every chunk.

In [5]:
import json


async def create_synthetic_dataset(
    reviews: List[Review],
    n_questions: int,
    example_questions: List[str],
    max_concurrency: int = 10,
) -> List[ChunkEval]:
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [
        process_chunk(review, n_questions, example_questions, semaphore)
        for review in reviews
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    dataset = []
    for result in results:
        if isinstance(result, ChunkProcessingError):
            print(result)
        elif isinstance(result, list):
            dataset.extend(result)
        else:
            print(f"Unexpected result type: {type(result)}")

    return dataset


def save_dataset(dataset: List[ChunkEval], filename: str):
    with open(filename, "w") as f:
        json.dump([chunk_eval.model_dump() for chunk_eval in dataset], f, indent=2)


synthetic_dataset = await create_synthetic_dataset(
    sample_chunks, n_questions, example_questions
)
save_dataset(synthetic_dataset, "synthetic_eval_dataset.json")

print(f"Generated {len(synthetic_dataset)} ChunkEvals.")
print("Dataset saved as 'synthetic_eval_dataset.json'")

Generated 1800 ChunkEvals.
Dataset saved as 'synthetic_eval_dataset.json'
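
The error handling in `create_synthetic_dataset` relies on `asyncio.gather(..., return_exceptions=True)`, which places raised exceptions in the results list instead of propagating the first one. The sketch below illustrates that behaviour with a toy coroutine. `DemoError`, `maybe_fail`, and `collect` are hypothetical names used only for this example.

```python
import asyncio
from typing import List, Tuple


class DemoError(Exception):
    """Hypothetical error type standing in for a failed chunk."""


async def maybe_fail(chunk_id: str) -> List[str]:
    # Pretend that chunk "1" fails, e.g. because of an API error.
    if chunk_id == "1":
        raise DemoError(f"Failed to process chunk {chunk_id}")
    return [f"question about chunk {chunk_id}"]


async def collect(chunk_ids: List[str]) -> Tuple[List[str], List[str]]:
    results = await asyncio.gather(
        *(maybe_fail(c) for c in chunk_ids), return_exceptions=True
    )
    dataset: List[str] = []
    failures: List[str] = []
    # Results come back in input order, so they can be zipped with the IDs.
    for chunk_id, result in zip(chunk_ids, results):
        if isinstance(result, Exception):
            failures.append(chunk_id)  # keep the ID so the chunk can be retried
        else:
            dataset.extend(result)
    return dataset, failures


# In a notebook cell, use `await collect(...)` instead of asyncio.run.
dataset, failures = asyncio.run(collect(["0", "1", "2"]))
print(dataset, failures)
```

Recording the failed chunk IDs, rather than only printing the error, is one possible way to make a retry pass straightforward.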


View the data as a DataFrame.

In [6]:
data = [(i.question, i.answer, i.chunk_id) for i in synthetic_dataset]
pd.DataFrame(data, columns=["question", "answer", "chunk_id"]).head()

| | question | answer | chunk_id |
|---|---|---|---|
| 0 | How heavy is this hammer and is it good for driving nails? | The hammer weighs 16 oz, which is perfect for driving nails without too much effort. | 0 |
| 1 | Is the grip comfortable for long use? | Yes, the grip is comfortable even during extended use, and there hasn't been any noticeable wear on it. | 0 |
| 2 | How does the hammer feel in terms of balance and control? | The balance between the handle and the head makes it easy to control. | 1 |
| 3 | Is the grip comfortable for long use? | The grip is ergonomic and reduces hand fatigue, which is a big plus during long projects. | 1 |
| 4 | How durable is the hammer's steel head? | The steel head is tough and has withstood a lot of heavy use without any dents or chips. | 2 |
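
Note that the sample above contains the same question text for chunks 0 and 1. Such questions may make retrieval scores harder to interpret, since more than one chunk is a reasonable answer. The following sketch, which assumes rows shaped like the `question`/`chunk_id` columns above, lists questions that map to more than one chunk. The function name `questions_with_multiple_chunks` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def questions_with_multiple_chunks(rows: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each normalized question to its chunk IDs, keeping only repeats."""
    chunks_by_question = defaultdict(set)
    for row in rows:
        # Normalize case and whitespace so trivial variations are grouped.
        key = " ".join(row["question"].lower().split())
        chunks_by_question[key].add(row["chunk_id"])
    return {
        question: sorted(chunk_ids)
        for question, chunk_ids in chunks_by_question.items()
        if len(chunk_ids) > 1
    }


# Rows copied from the sample output above.
rows = [
    {"question": "Is the grip comfortable for long use?", "chunk_id": "0"},
    {"question": "Is the grip comfortable for long use?", "chunk_id": "1"},
    {"question": "How durable is the hammer's steel head?", "chunk_id": "2"},
]
ambiguous = questions_with_multiple_chunks(rows)
print(ambiguous)
```

Whether to drop, merge, or keep such rows is a judgment call that depends on how the evaluation will be scored.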
