# Introduction

In this notebook, we'll use a simple pipeline to generate synthetic questions. We'll start with an open-source Hugging Face dataset of SQL queries and generate questions from them. We can then use a RAG system to answer these questions and evaluate its recall.


## Formatting Our Data

Let's start by defining the data types we'll use throughout the notebook.

1. Chunk: This represents the data that we'll generate questions from
2. QuestionAnswer: This represents a question-and-answer pair generated by a language model
3. ChunkEval: This represents a single evaluation record that we'll use to measure the recall of the RAG system

In [26]:
from pydantic import BaseModel

class Chunk(BaseModel):
    chunk_id: str
    text: str
    metadata: dict

class QuestionAnswer(BaseModel):
    chain_of_thought: str
    question: str
    answer: str

class ChunkEval(BaseModel):
    chunk_id: str
    question: str
    answer: str
    chunk: str

For this specific example, we'll be using the Bird-Bench dataset. We've uploaded it to Hugging Face ahead of time, so let's load it.

In [10]:
import datasets

dataset = datasets.load_dataset("567-labs/bird-dev-snippets")

# We only take challenging questions
challenging_questions = [item for item in dataset["original"] if item['metadata']['difficulty'] == 'challenging']
challenging_questions[0]

{'question': 'Consider the average difference between K-12 enrollment and 15-17 enrollment of schools that are locally funded, list the names and DOC type of schools which has a difference above this average.',
 'labels': ["SELECT T2.School, T2.DOC FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode WHERE T2.FundingType = 'Locally funded' AND (T1.`Enrollment (K-12)` - T1.`Enrollment (Ages 5-17)`) > (SELECT AVG(T3.`Enrollment (K-12)` - T3.`Enrollment (Ages 5-17)`) FROM frpm AS T3 INNER JOIN schools AS T4 ON T3.CDSCode = T4.CDSCode WHERE T4.FundingType = 'Locally funded')"],
 'query': "SELECT T2.School, T2.DOC FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode WHERE T2.FundingType = 'Locally funded' AND (T1.`Enrollment (K-12)` - T1.`Enrollment (Ages 5-17)`) > (SELECT AVG(T3.`Enrollment (K-12)` - T3.`Enrollment (Ages 5-17)`) FROM frpm AS T3 INNER JOIN schools AS T4 ON T3.CDSCode = T4.CDSCode WHERE T4.FundingType = 'Locally funded')",
 'metadata': {'db_id':

Now that we have our dataset, let's format it into our Chunk data type. Once we've done so, we can start generating synthetic questions that we can use to evaluate our RAG system.
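
Each chunk needs a stable identifier so that a generated question can later be traced back to its source query. One common approach, and the one the next cell takes, is to hash the query text. The sketch below uses only the standard library and made-up sample queries (assumptions, not rows from the dataset) to show what that implies: identical text always maps to the same ID, while any difference in case or whitespace produces a different one.

```python
import hashlib


def hash_query(query: str) -> str:
    """Return a deterministic hex digest for a query string."""
    return hashlib.sha256(query.encode()).hexdigest()


# Illustrative queries only; the first two are exact duplicates.
sample_queries = ["SELECT 1", "SELECT 1", "SELECT 2", "select 1"]

# Keep the first query seen for each content-derived ID.
unique_by_id: dict[str, str] = {}
for query in sample_queries:
    unique_by_id.setdefault(hash_query(query), query)

print(f"{len(sample_queries)} queries -> {len(unique_by_id)} unique IDs")
```

If the source data contains exact duplicate queries, they will share a `chunk_id`, so it can be worth checking for collisions before using the IDs as evaluation labels.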

In [13]:
import hashlib


def hash_query(query: str) -> str:
    return hashlib.sha256(query.encode()).hexdigest()

chunks = [
    Chunk(chunk_id=hash_query(item['query']), text=item['query'], metadata=item['metadata'])
    for item in challenging_questions
]

print(chunks[0].model_dump_json(indent=2))

{
  "chunk_id": "0c19c282e65f21e5acb0809c95b3fffcb077b434b6ee137390068454a01d8b6a",
  "text": "SELECT T2.School, T2.DOC FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode WHERE T2.FundingType = 'Locally funded' AND (T1.`Enrollment (K-12)` - T1.`Enrollment (Ages 5-17)`) > (SELECT AVG(T3.`Enrollment (K-12)` - T3.`Enrollment (Ages 5-17)`) FROM frpm AS T3 INNER JOIN schools AS T4 ON T3.CDSCode = T4.CDSCode WHERE T4.FundingType = 'Locally funded')",
  "metadata": {
    "db_id": "california_schools",
    "difficulty": "challenging",
    "evidence": "Difference between K-12 enrollment and 15-17 enrollment can be computed by `Enrollment (K-12)` - `Enrollment (Ages 5-17)`",
    "query": "SELECT T2.School, T2.DOC FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode WHERE T2.FundingType = 'Locally funded' AND (T1.`Enrollment (K-12)` - T1.`Enrollment (Ages 5-17)`) > (SELECT AVG(T3.`Enrollment (K-12)` - T3.`Enrollment (Ages 5-17)`) FROM frpm AS T3 INNER JOIN schools 

## Generating Questions

Now that we've formatted our data, we can start generating synthetic questions. We'll use OpenAI's gpt-4o model to generate them. We're using a relatively simple prompt here; you can experiment with more complex prompts to generate better questions, but it's best to start simple.

Because we're going to be generating a large number of questions at once, we'll use the `instructor` library and run our requests in parallel to speed things up.

In [15]:
import openai
import instructor
from asyncio import Semaphore
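# Note: this alias shadows the standard-library `asyncio` name, so
# `asyncio.gather` below is tqdm's progress-bar wrapper around gather.
from tqdm.asyncio import tqdm_asyncio as asyncio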

client = instructor.from_openai(openai.AsyncOpenAI())

`instructor` supports prompt templating with the Jinja templating language. This lets us write prompts with Jinja variables and pass in the relevant data through the `context` argument.

In [27]:
async def generate_questions(chunk: Chunk, sem: Semaphore) -> ChunkEval:
    async with sem:
        resp = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role":"user",
                "content":"""
                Generate a question and answer pair that the following SQL snippet below will be able to answer. 

                The question should 
                - Be answerable only by the data that will be returned by the SQL snippet
                - Not mention specific information in the SQL snippet directly

                SQL Snippet:
                {{ snippet }}
                """
            }],
            response_model=QuestionAnswer,
            context={
                "snippet": chunk.text,
            },
            timeout=20,
        )

        return ChunkEval(
            chunk_id=chunk.chunk_id,
            question=resp.question,
            answer=resp.answer,
            chunk=chunk.text
        )

# Allow at most 10 requests in flight at any time
sem = Semaphore(10)
coros = [generate_questions(chunk, sem) for chunk in chunks]
questions: list[ChunkEval] = await asyncio.gather(*coros)


100%|██████████| 145/145 [01:49<00:00,  1.33it/s]
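
The semaphore is what keeps the number of in-flight requests bounded. The sketch below reproduces that pattern without any network calls, using the standard-library `asyncio` (not the `tqdm` alias used above). `bounded_gather` and `fake_request` are hypothetical helpers introduced for illustration. It also passes `return_exceptions=True`, which is an assumption about how you might want to handle failures: a single timeout is returned as a value instead of aborting the whole batch.

```python
import asyncio


async def bounded_gather(items, worker, limit: int):
    """Run worker(item) for every item with at most `limit` running at once."""
    sem = asyncio.Semaphore(limit)

    async def run(item):
        async with sem:
            return await worker(item)

    # return_exceptions=True returns failures in place instead of raising,
    # so one failed request does not discard the other results.
    return await asyncio.gather(*(run(item) for item in items), return_exceptions=True)


async def demo():
    in_flight = 0
    peak = 0

    async def fake_request(n: int) -> int:
        # Stand-in for an API call; tracks how many calls overlap.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        if n == 3:
            raise TimeoutError("simulated timeout")
        return n * 2

    results = await bounded_gather(range(8), fake_request, limit=3)
    return results, peak


# In a notebook, use `results, peak = await demo()` instead of asyncio.run.
results, peak = asyncio.run(demo())
succeeded = [r for r in results if not isinstance(r, Exception)]
print(f"peak concurrency: {peak}, succeeded: {len(succeeded)}/{len(results)}")
```

With this variant, failed items can be filtered out or retried afterwards, at the cost of having to check each result's type.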


In [25]:
print(questions[0].model_dump_json(indent=2))


{
  "chunk_id": "0c19c282e65f21e5acb0809c95b3fffcb077b434b6ee137390068454a01d8b6a",
  "question": "Which locally funded schools have a significantly higher proportion of enrollment outside the ages 5-17 compared to the average locally funded school?",
  "answer": "The selected schools and their DOCs that have a greater difference between total K-12 enrollment and ages 5-17 enrollment than the average among locally funded schools."
}
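
The prompt asks the model not to mention specific information from the SQL snippet, but nothing enforces that. A rough sketch of one possible sanity check follows; `leaked_terms` is a hypothetical helper, and the heuristic (matching quoted literals and backticked identifiers verbatim, ignoring case) is an assumption, not something the original pipeline does.

```python
import re


def leaked_terms(sql: str, question: str) -> list[str]:
    """Return quoted literals and backticked identifiers from `sql`
    that appear verbatim (case-insensitively) in `question`."""
    # Each match is a tuple: (single-quoted literal, backticked identifier).
    matches = re.findall(r"'([^']+)'|`([^`]+)`", sql)
    terms = {literal or identifier for literal, identifier in matches}
    lowered = question.lower()
    return sorted(term for term in terms if term.lower() in lowered)


# Illustrative inputs only.
sql = (
    "SELECT School FROM schools "
    "WHERE FundingType = 'Locally funded' AND `Enrollment (K-12)` > 100"
)
question = "Which locally funded schools have more than 100 students?"

print(leaked_terms(sql, question))
```

Applied to the sample question above, a check like this would flag the `'Locally funded'` literal, which suggests that the prompt alone does not fully prevent such leaks. Whether that matters depends on how hard you want the retrieval task to be.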


## Uploading our Dataset

Now that we've generated our questions, let's upload them to Braintrust so that we can use them in our evaluation later on.

In [33]:
import braintrust


dataset = braintrust.init_dataset(project="Retrieval", name="Synthetic Questions")
for question in questions:
    dataset.insert(
        input=question.question,
        expected=[question.chunk],
        metadata={
            "chunk_id": question.chunk_id,
            "chunk": question.chunk,
        },
    )

print(dataset.summarize())


Total records: 145 (145 new or updated records)
See results for all datasets in Retrieval at https://www.braintrust.dev/app/567/p/Retrieval
See results for Synthetic Questions at https://www.braintrust.dev/app/567/p/Retrieval/datasets/Synthetic%20Questions
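
Each uploaded record pairs a question (`input`) with the chunk it was generated from (`expected`), which is the shape needed to measure recall. A minimal sketch of how such records could be scored follows, assuming a hypothetical retriever has already returned a ranked list of chunk texts for each question. `recall_at_k`, `mean_recall_at_k`, and the sample data are illustrative and not part of Braintrust or this notebook.

```python
def recall_at_k(retrieved: list[str], expected: list[str], k: int) -> float:
    """Fraction of expected chunks that appear in the top-k retrieved chunks."""
    if not expected:
        raise ValueError("expected must contain at least one chunk")
    top_k = set(retrieved[:k])
    hits = sum(1 for chunk in expected if chunk in top_k)
    return hits / len(expected)


def mean_recall_at_k(records: list[dict], k: int) -> float:
    """Average recall@k over all evaluation records."""
    scores = [recall_at_k(r["retrieved"], r["expected"], k) for r in records]
    return sum(scores) / len(scores)


# Illustrative data only: "retrieved" stands in for a retriever's ranked output.
records = [
    {
        "expected": ["SELECT a FROM t"],
        "retrieved": ["SELECT b FROM t", "SELECT a FROM t", "SELECT c FROM t"],
    },
    {
        "expected": ["SELECT x FROM u"],
        "retrieved": ["SELECT y FROM u", "SELECT z FROM u", "SELECT x FROM u"],
    },
]

for k in (1, 2, 3):
    print(f"recall@{k} = {mean_recall_at_k(records, k):.2f}")
```

Because every synthetic question here has exactly one expected chunk, recall@k for a single record is either 0 or 1, and the mean is the share of questions whose source query was retrieved in the top k.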
