# Goal

This notebook shows how to make synthetic data to bootstrap evaluation of your retrieval system. This synthetic data will contain many triplets of `(RAG system input, system output, desired chunk to retrieve)`. For the sake of example, the system input will be a question and the output will be an answer. A single triplet of generated data might be:

```
Q: When should we use retrieval augmented generation? 
A: We use retrieval augmented generation when we want an LLM to answer a question while considering information it does not already "know." This information could be non-public or it might have been created after the model's training data cutoff.
Chunk ID: 3
```

Once you have many of these triplets, you can experiment with different retrieval strategies (e.g. different embedding models, embedding vs. keyword search, etc.) to determine which strategies most consistently retrieve the desired chunks.
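As a rough illustration of how such triplets can be scored, the sketch below computes recall@k: the fraction of questions whose desired chunk appears in the top `k` retrieved results. The `recall_at_k` helper, the `Retriever` signature, the toy corpus, and the keyword-overlap retriever are all assumptions made for this example; swap in your own retriever.

```python
from typing import Callable, Dict, List

# Hypothetical retriever signature: takes a question and k, returns ranked chunk IDs.
Retriever = Callable[[str, int], List[str]]


def recall_at_k(dataset: List[Dict[str, str]], retrieve: Retriever, k: int = 3) -> float:
    """Fraction of questions whose desired chunk is among the top-k retrieved IDs."""
    if not dataset:
        return 0.0
    hits = sum(
        1 for row in dataset if row["chunk_id"] in retrieve(row["question"], k)
    )
    return hits / len(dataset)


# Toy corpus and a naive keyword-overlap retriever, for illustration only.
toy_corpus = {
    "chunk1": "Machine learning automates analytical model building.",
    "chunk2": "Python is a high-level, interpreted programming language.",
}


def _words(text: str) -> set:
    # Lowercase and strip basic punctuation before splitting on whitespace.
    for ch in "?.,":
        text = text.replace(ch, "")
    return set(text.lower().split())


def keyword_retriever(question: str, k: int) -> List[str]:
    q_words = _words(question)
    scores = {cid: len(q_words & _words(text)) for cid, text in toy_corpus.items()}
    # Rank chunk IDs by word overlap, highest first.
    return sorted(scores, key=scores.get, reverse=True)[:k]


toy_dataset = [
    {"question": "What does machine learning automate?", "chunk_id": "chunk1"},
    {"question": "What type of language is Python?", "chunk_id": "chunk2"},
]

score = recall_at_k(toy_dataset, keyword_retriever, k=1)
```

Running the same function against several retrievers on one dataset gives a like-for-like comparison of strategies.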

# A Starting Point

A simple approach would follow this pseudocode:

```
import json

synth_data = []
for chunk in corpus:
    # call_llm is a placeholder for your LLM client call
    response = call_llm(f"Give me a JSON array of 10 question/answer pairs derived from this text: {chunk}")
    q_a_pairs = json.loads(response.content)
    q_a_c = [{'question': q, 'answer': a, 'chunk': chunk} for (q, a) in q_a_pairs]
    synth_data.extend(q_a_c)
```

A practical implementation needs to address three issues that arise with the naive pseudocode.

| Issue | Solution |
|---------|----------|
| Inconsistent formatting of LLM response (e.g. different keys) | Instructor library |
| Bad questions | Guidance/examples in prompt |
| Time waiting for LLM responses when iterating over many chunks | Async LLM calls |
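To make the first issue concrete, here is a hand-rolled sketch of schema checking using only the standard library. It is not how Instructor works internally; Instructor validates responses against a Pydantic model for you. The `QAPair` dataclass and `parse_pairs` helper are names invented for this illustration.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class QAPair:
    question: str
    answer: str


def parse_pairs(raw: str) -> List[QAPair]:
    """Parse a JSON array of objects, rejecting items without the expected keys."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array")
    pairs = []
    for i, item in enumerate(items):
        if not isinstance(item, dict) or set(item) != {"question", "answer"}:
            raise ValueError(
                f"item {i} does not have exactly the keys 'question' and 'answer'"
            )
        pairs.append(QAPair(question=str(item["question"]), answer=str(item["answer"])))
    return pairs


# Two made-up LLM responses: one with the expected keys, one with different keys.
good = '[{"question": "What is RAG?", "answer": "Retrieval augmented generation."}]'
bad = '[{"q": "What is RAG?", "a": "Retrieval augmented generation."}]'

parsed = parse_pairs(good)
try:
    parse_pairs(bad)
    bad_rejected = False
except ValueError:
    bad_rejected = True
```

Without a check like this, the `bad` response would fail later and less clearly, for example when a downstream step looks up a `question` key that is not there.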

# Reusable Code to Bootstrap Evals

The code in this notebook addresses these issues. The code is also available as [this script](https://gist.github.com/jxnl/5627c9d463ffe0b085896f7890fab1bf).

## Data

The following data serves as an example. Replace it with your own data.

In [1]:
from pydantic import BaseModel

class TextChunk(BaseModel):
    id: str
    content: str

sample_chunks = [
        TextChunk(
            id="chunk1",
            content="Machine learning is a method of data analysis that automates analytical model building.",
        ),
        TextChunk(
            id="chunk2",
            content="Python is a high-level, interpreted programming language known for its simplicity and readability.",
        ),
        TextChunk(
            id="chunk3",
            content="Climate change refers to long-term shifts in temperatures and weather patterns, mainly caused by human activities.",
        ),
    ]

n_questions = 3 # number of questions to get in each LLM call
example_questions = [
    "What is the main topic of this text?",
    "Can you summarize the key points in this content?",
    "How does this information relate to current trends in the field?",
]

Next, see how to generate questions for a single chunk:

In [2]:
from typing import List
import instructor
from openai import AsyncOpenAI

# Patch the AsyncOpenAI client
client = instructor.from_openai(AsyncOpenAI())


class QuestionAnswer(BaseModel):
    question: str
    answer: str

class ChunkEval(QuestionAnswer):
    chunk_id: str

async def generate_evals(
    chunk: TextChunk, n_questions: int, example_questions: List[str]
) -> List[ChunkEval]:

    prompt = f"""
        Generate `{n_questions}` question-answer pairs based on the following content:

        <content>
        {chunk.content}
        </content>

        Example questions:
        {chr(10).join(f'- {q}' for q in example_questions)}

        Generate diverse questions that probe different aspects of the content. 
        Provide a concise answer for each question.
        Do not use the exact example questions, but use them as inspiration for the types of questions to generate.
        Do not include answers that are not in the content.
        """

    try:
        pairs = client.chat.completions.create_iterable(
            model="gpt-4o",
            response_model=QuestionAnswer,
            messages=[{"role": "user", "content": prompt}],
        )
        return [
            ChunkEval(question=pair.question, answer=pair.answer, chunk_id=chunk.id)
            async for pair in pairs
        ]
    except Exception as e:
        print(f"Error generating evals: {str(e)}")
        return []


first_chunk_res = await generate_evals(sample_chunks[0], n_questions, example_questions)
first_chunk_res

[ChunkEval(question='What is machine learning?', answer='Machine learning is a method of data analysis that automates analytical model building.', chunk_id='chunk1'),
 ChunkEval(question='What does machine learning automate?', answer='Machine learning automates analytical model building.', chunk_id='chunk1'),
 ChunkEval(question='What is machine learning used for?', answer='Machine learning is used for data analysis.', chunk_id='chunk1')]

To run `generate_evals` for many chunks in parallel, wrap it with a function that also takes a semaphore, which caps the number of LLM calls in flight at once.

In [3]:
import asyncio

class ChunkProcessingError(Exception):
    pass

async def process_chunk(
    chunk: TextChunk,
    n_questions: int,
    example_questions: List[str],
    semaphore: asyncio.Semaphore
) -> List[ChunkEval]:
    async with semaphore:
        try:
            return await generate_evals(chunk, n_questions, example_questions)
        except Exception as e:
            print(f"Unexpected error processing chunk {chunk.id}: {str(e)}")
            raise ChunkProcessingError(f"Failed to process chunk {chunk.id}") from e

# Sanity check: the result should have the same shape as a direct generate_evals call
# (the exact wording varies between LLM calls)
await process_chunk(sample_chunks[0], n_questions, example_questions, asyncio.Semaphore(1))

[ChunkEval(question='What is machine learning?', answer='Machine learning is a method of data analysis that automates analytical model building.', chunk_id='chunk1'),
 ChunkEval(question='What does machine learning automate?', answer='Machine learning automates the building of analytical models.', chunk_id='chunk1'),
 ChunkEval(question='What field does machine learning belong to?', answer='Machine learning belongs to the field of data analysis.', chunk_id='chunk1')]
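To see the effect of the semaphore without spending API calls, the sketch below replaces the LLM call with a short sleep and records how many calls are in flight at once. `fake_llm_call`, `run_demo`, and the `tracker` dict are hypothetical names used only for this demonstration.

```python
import asyncio


async def fake_llm_call(i: int, semaphore: asyncio.Semaphore, tracker: dict) -> int:
    async with semaphore:
        # Count how many calls are currently inside the semaphore.
        tracker["active"] += 1
        tracker["peak"] = max(tracker["peak"], tracker["active"])
        await asyncio.sleep(0.01)  # stand-in for network latency
        tracker["active"] -= 1
        return i


async def run_demo(n_calls: int = 10, max_concurrency: int = 3) -> dict:
    semaphore = asyncio.Semaphore(max_concurrency)
    tracker = {"active": 0, "peak": 0}
    # gather returns results in the order the coroutines were passed in.
    results = await asyncio.gather(
        *(fake_llm_call(i, semaphore, tracker) for i in range(n_calls))
    )
    return {"results": results, "peak": tracker["peak"]}
```

In a notebook, call `await run_demo()`; in a script, use `asyncio.run(run_demo())`. The reported peak should never exceed `max_concurrency`, which is what keeps a large corpus from triggering a burst of simultaneous requests.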

Now you can run `process_chunk` over all chunks to build the full dataset:

In [4]:
import json

async def create_synthetic_dataset(
    chunks: List[TextChunk],
    n_questions: int,
    example_questions: List[str],
    max_concurrency: int = 10,
) -> List[ChunkEval]:
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [
        process_chunk(chunk, n_questions, example_questions, semaphore)
        for chunk in chunks
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    dataset = []
    for result in results:
        if isinstance(result, ChunkProcessingError):
            print(result)
        elif isinstance(result, list):
            dataset.extend(result)
        else:
            print(f"Unexpected result type: {type(result)}")

    return dataset


def save_dataset(dataset: List[ChunkEval], filename: str):
    with open(filename, "w") as f:
        json.dump([chunk_eval.model_dump() for chunk_eval in dataset], f, indent=2)

synthetic_dataset = await create_synthetic_dataset(sample_chunks, n_questions, example_questions)
save_dataset(synthetic_dataset, "synthetic_eval_dataset.json")

print(f"Generated {len(synthetic_dataset)} ChunkEvals.")
print("Dataset saved as 'synthetic_eval_dataset.json'")


Generated 9 ChunkEvals.
Dataset saved as 'synthetic_eval_dataset.json'
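When you come back to run retrieval experiments, you will need to read the file again. A minimal sketch of that step follows; `load_dataset` is a hypothetical helper, not part of the original script, and the demo writes a made-up row to a temporary file so it runs on its own.

```python
import json
import os
import tempfile
from typing import Dict, List


def load_dataset(filename: str) -> List[Dict[str, str]]:
    """Read a saved dataset back into a list of dicts, checking the expected keys."""
    with open(filename) as f:
        rows = json.load(f)
    required = {"question", "answer", "chunk_id"}
    for i, row in enumerate(rows):
        missing = required - set(row)
        if missing:
            raise ValueError(f"row {i} is missing keys: {sorted(missing)}")
    return rows


# Round-trip demo with a made-up row and a temporary file.
sample_rows = [
    {
        "question": "What is machine learning?",
        "answer": "A method of data analysis.",
        "chunk_id": "chunk1",
    }
]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_dataset.json")
    with open(path, "w") as f:
        json.dump(sample_rows, f, indent=2)
    reloaded = load_dataset(path)
```

If you prefer typed objects over dicts, each row can be passed to the Pydantic model, for example `[ChunkEval(**row) for row in rows]`.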


View the data as a DataFrame

In [5]:
import pandas as pd
data = [(i.question, i.answer, i.chunk_id) for i in synthetic_dataset]
pd.DataFrame(data, columns=["question", "answer", "chunk_id"])

| | question | answer | chunk_id |
|---|----------|--------|----------|
| 0 | What is machine learning? | Machine learning is a method of data analysis ... | chunk1 |
| 1 | What does machine learning automate? | Machine learning automates analytical model bu... | chunk1 |
| 2 | What can the method of data analysis mentioned... | It can be used for automating analytical model... | chunk1 |
| 3 | What type of programming language is Python? | Python is a high-level, interpreted programmin... | chunk2 |
| 4 | What are two notable features of Python mentio... | Python is known for its simplicity and readabi... | chunk2 |
| 5 | How is Python executed? | Python is an interpreted programming language. | chunk2 |
| 6 | What is climate change? | Climate change refers to long-term shifts in t... | chunk3 |
| 7 | What is mainly causing climate change accordin... | Human activities are mainly causing climate ch... | chunk3 |
| 8 | What does the term 'long-term shifts' refer to... | It refers to changes in temperatures and weath... | chunk3 |
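Because `generate_evals` returns an empty list when a call fails, some chunks can end up with fewer rows than requested. The sketch below flags those chunks so they can be retried; `find_short_chunks` is a hypothetical helper and the IDs in the demo are made up.

```python
from collections import Counter
from typing import Dict, Iterable


def find_short_chunks(
    dataset_chunk_ids: Iterable[str],
    all_chunk_ids: Iterable[str],
    n_questions: int,
) -> Dict[str, int]:
    """Map each chunk ID with fewer than n_questions rows to its actual row count."""
    counts = Counter(dataset_chunk_ids)
    return {
        cid: counts.get(cid, 0)
        for cid in all_chunk_ids
        if counts.get(cid, 0) < n_questions
    }


# Made-up example: chunk2 came back one question short and chunk3 failed entirely.
demo_ids = ["chunk1"] * 3 + ["chunk2"] * 2
short = find_short_chunks(demo_ids, ["chunk1", "chunk2", "chunk3"], n_questions=3)
```

With the notebook's objects, the call would look like `find_short_chunks((e.chunk_id for e in synthetic_dataset), (c.id for c in sample_chunks), n_questions)`.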
