# Context

After releasing your RAG-based product, you will want to characterize the types of questions you receive and how well you perform on each cluster of questions. This notebook generates synthetic data that stands in for your production data. You won't reuse code from this notebook; instead, it creates the assets used in `analyze_clusters`, a notebook you can reuse in your own work.

# Code

## Question Type Data

Classification code won't have access to the true data-generating process, that is, how often each question type is asked and what fraction of its questions receive a thumbs up. So we keep those statistics here, separate from the other question-type data in `question_types.py`. The frequency and satisfaction data are defined below.

In [1]:
from question_types import QuestionTypes

# recency_factor scales the sampled days_ago to create a temporal trend:
# values < 1 skew a question type toward more recent queries, values > 1 toward older ones.
question_type_stats = {
    QuestionTypes.COMPARISON: {
        "avg_questions_per_item": 7,
        "frac_thumbs_up": 0.1,
        "recency_factor": 1,
    },
    QuestionTypes.VAGUE: {
        "avg_questions_per_item": 5,
        "frac_thumbs_up": 0.7,
        "recency_factor": 1,
    },
    QuestionTypes.TYPICAL_PRICE: {
        "avg_questions_per_item": 1,
        "frac_thumbs_up": 0.3,
        "recency_factor": 2,
    },
    QuestionTypes.CUSTOMER_SERVICE: {
        "avg_questions_per_item": 2,
        "frac_thumbs_up": 0.7,
        "recency_factor": 1,
    },
    QuestionTypes.VISUAL: {
        "avg_questions_per_item": 2,
        "frac_thumbs_up": 0.1,
        "recency_factor": 1,
    },
    QuestionTypes.ACCESSORIES: {
        "avg_questions_per_item": 2,
        "frac_thumbs_up": 0.3,
        "recency_factor": 1,
    },
    QuestionTypes.COMPATIBILITY: {
        "avg_questions_per_item": 1,
        "frac_thumbs_up": 0.8,
        "recency_factor": 1,
    },
    QuestionTypes.COUNTRY_OF_ORIGIN: {
        "avg_questions_per_item": 1,
        "frac_thumbs_up": 0.6,
        "recency_factor": 0.7,
    },
    QuestionTypes.ENVIRONMENTAL: {
        "avg_questions_per_item": 1,
        "frac_thumbs_up": 0.1,
        "recency_factor": 1,
    },
    QuestionTypes.AUTHENTIC: {
        "avg_questions_per_item": 1,
        "frac_thumbs_up": 0.1,
        "recency_factor": 0.5,
    },
    QuestionTypes.MATERIALS: {
        "avg_questions_per_item": 1,
        "frac_thumbs_up": 0.2,
        "recency_factor": 1,
    },
    QuestionTypes.TIME_SENSITIVE: {
        "avg_questions_per_item": 3,
        "frac_thumbs_up": 0.1,
        "recency_factor": 1,
    },
    QuestionTypes.TREND: {
        "avg_questions_per_item": 1,
        "frac_thumbs_up": 0.1,
        "recency_factor": 1,
    },
}
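
The three parameters interact in simple ways: `avg_questions_per_item` sets each type's expected share of traffic, `frac_thumbs_up` sets its satisfaction rate, and `recency_factor` scales the 30-day window used for `days_ago` in the generation code. Here is a minimal sketch of those relationships, using a four-type subset of the values above. The plain string keys are an assumption so the sketch runs without `question_types.py`, and `summarize` and its field names are hypothetical helpers, not part of the course code.

```python
# Subset of the stats above, keyed by plain strings (assumption: lets the
# sketch run without importing QuestionTypes from question_types.py).
stats = {
    "COMPARISON": {"avg_questions_per_item": 7, "frac_thumbs_up": 0.1, "recency_factor": 1},
    "VAGUE": {"avg_questions_per_item": 5, "frac_thumbs_up": 0.7, "recency_factor": 1},
    "TYPICAL_PRICE": {"avg_questions_per_item": 1, "frac_thumbs_up": 0.3, "recency_factor": 2},
    "AUTHENTIC": {"avg_questions_per_item": 1, "frac_thumbs_up": 0.1, "recency_factor": 0.5},
}


def summarize(stats, window_days=30):
    """Hypothetical helper: expected traffic share and days_ago range per type."""
    total = sum(s["avg_questions_per_item"] for s in stats.values())
    summary = {}
    for name, s in stats.items():
        summary[name] = {
            # Ratio of expected question counts per product.
            "expected_share": s["avg_questions_per_item"] / total,
            # days_ago = int(uniform(0, 30) * recency_factor), so it stays below this bound.
            "days_ago_upper_bound": window_days * s["recency_factor"],
        }
    return summary


summary = summarize(stats)

# Traffic-weighted satisfaction across the subset.
overall_satisfaction = sum(
    s["avg_questions_per_item"] * s["frac_thumbs_up"] for s in stats.values()
) / sum(s["avg_questions_per_item"] for s in stats.values())

for name, row in summary.items():
    print(name, round(row["expected_share"], 3), row["days_ago_upper_bound"])
print("overall satisfaction:", round(overall_satisfaction, 3))
```

High-volume, low-satisfaction types such as `COMPARISON` dominate the weighted average, which is the kind of pattern the downstream cluster analysis is meant to surface.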


## Load Product Data
We embed product details in the prompt to generate questions.

In [2]:
import lancedb
from question_types import Product

try:
    db = lancedb.connect("../../week1_bootstrap_evals/lancedb")
    products = db.open_table("products").to_pandas()[["title", "description"]]
    products = [
        Product(title=row["title"], description=row["description"])
        for _, row in products.iterrows()
    ]
except Exception as e:
    print("Error loading product data. Run the week1 course notebooks first to create the products DB.")
    print(f"Error: {str(e)}")
    products = []

## Main Generation Code
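
The generation cell creates one coroutine per product and question type, then relies on a shared `asyncio.Semaphore(50)` to cap how many API calls are in flight at once. The sketch below isolates that pattern with a fake call in place of the OpenAI request. `fake_api_call`, `run_all`, and the `tracker` dict are hypothetical names used only for illustration.

```python
import asyncio


async def fake_api_call(i, semaphore, tracker):
    """Hypothetical stand-in for one LLM request."""
    async with semaphore:  # waits here while `limit` calls are already running
        tracker["active"] += 1
        tracker["peak"] = max(tracker["peak"], tracker["active"])
        await asyncio.sleep(0.01)  # simulated network latency
        tracker["active"] -= 1
        return i


async def run_all(n_tasks, limit):
    semaphore = asyncio.Semaphore(limit)
    tracker = {"active": 0, "peak": 0}
    # gather schedules every coroutine immediately; the semaphore does the throttling.
    results = await asyncio.gather(
        *(fake_api_call(i, semaphore, tracker) for i in range(n_tasks))
    )
    return results, tracker["peak"]


results, peak = asyncio.run(run_all(n_tasks=20, limit=5))
print(f"completed={len(results)} peak_concurrency={peak}")
```

Inside a notebook, where an event loop is already running, call `await run_all(...)` instead of `asyncio.run(...)`. Sharing one semaphore across all question types, as the cell below does, bounds the total number of concurrent requests rather than the number per type.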

In [3]:
import asyncio
from typing import List
import instructor
import json
from openai import AsyncOpenAI
from numpy.random import poisson, uniform
from question_types import UntypedQuestion, Question, Product, question_type_details
client = instructor.from_openai(AsyncOpenAI())

async def generate_questions(
    n_questions: int,
    product: Product,
    question_type: QuestionTypes,
    semaphore: asyncio.Semaphore,
) -> List[UntypedQuestion]:
    async with semaphore:
        question_type_info = question_type_details[question_type]
        question_type_title = question_type_info.title
        question_type_description = question_type_info.description
        question_type_examples = question_type_info.examples
        recency_factor = question_type_stats[question_type]["recency_factor"]

        prompt = f"""Create {n_questions} questions that someone might ask about a {product.title} before buying it online.

The description of the {product.title} is: {product.description}

Your questions should specifically be in the following category of questions: `{question_type_title}`.
This category of questions is described as: `{question_type_description}`.

Here are examples of questions in that category:
`{question_type_examples[0]}`
`{question_type_examples[1]}`

The questions should be varied. Do not have duplicates.
Use creative license to make the questions specific and concrete (e.g. you can make up other product names or product details to make concrete questions)
Respond only with the list of questions."""

        frac_thumbs_up = question_type_stats[question_type]["frac_thumbs_up"]
        try:
            questions = client.chat.completions.create_iterable(
                model="gpt-4o-mini",
                response_model=Question,
                messages=[{"role": "user", "content": prompt}],
            )
            return [
                UntypedQuestion(
                    question=Question(text=q.text),
                    product=product,
                    thumbs_up=bool(uniform() < frac_thumbs_up),  # cast numpy.bool_ to a plain bool
                    days_ago=int(uniform(0, 30) * recency_factor)
                )
                async for q in questions
            ]
        except Exception as e:
            print(f"Error generating evals: {str(e)}")
            return []


async def all_questions_for_qtype(question_type: QuestionTypes, semaphore: asyncio.Semaphore) -> List[UntypedQuestion]:
    # Index by the enum member itself, matching the lookups in generate_questions.
    avg_questions_per_item = question_type_stats[question_type]["avg_questions_per_item"]
    tasks = []
    for product in products:
        n_questions = poisson(avg_questions_per_item)
        task = generate_questions(n_questions, product, question_type, semaphore)
        tasks.append(task)
    
    results = await asyncio.gather(*tasks)
    return [question for sublist in results for question in sublist]

async def generate_all_questions():
    semaphore = asyncio.Semaphore(50)
    tasks = []
    for qt in QuestionTypes:
        task = all_questions_for_qtype(qt, semaphore)
        tasks.append(task)
    
    results = await asyncio.gather(*tasks)
    return [question for sublist in results for question in sublist]

questions = await generate_all_questions()

* 'allow_population_by_field_name' has been renamed to 'populate_by_name'
* 'smart_union' has been removed


Error generating evals: 1 validation error for Question
  Invalid JSON: EOF while parsing a string at line 1 column 73 [type=json_invalid, input_value='{"text":"Are the 12-poin... on a Toyota Prius?”}', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/json_invalid
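
The logged error appears to come from a single malformed JSON string (the `input_value` ends with a curly quote rather than a straight one). Because the list comprehension sits inside the `try`, the handler returns `[]` and every question from that call is discarded. The sketch below contrasts that behavior with collecting items one at a time. It assumes the stream yields valid items before raising, which has not been verified for `create_iterable`. `flaky_stream` and its question strings are made up for illustration.

```python
import asyncio


async def flaky_stream():
    """Hypothetical stream: two good items, then a parse failure."""
    yield "Does this fit a 2015 sedan?"
    yield "Is the warranty transferable?"
    raise ValueError("Invalid JSON: EOF while parsing a string")


async def collect_all_or_nothing(stream):
    # Same shape as the notebook: any error discards the whole batch.
    try:
        return [item async for item in stream]
    except Exception:
        return []


async def collect_partial(stream):
    # Append as we go so items received before the failure are kept.
    items = []
    try:
        async for item in stream:
            items.append(item)
    except Exception as e:
        print(f"Stream failed after {len(items)} items: {e}")
    return items


async def main():
    dropped = await collect_all_or_nothing(flaky_stream())
    kept = await collect_partial(flaky_stream())
    return dropped, kept


dropped, kept = asyncio.run(main())
print(len(dropped), len(kept))
```

For synthetic data, losing an occasional batch is harmless, so the all-or-nothing handling in the notebook is a reasonable simplification.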


## Save Questions

We could also save the questions in LanceDB, but we use JSON for simplicity.

In [4]:
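def serialize_question(q: UntypedQuestion):
    # Flatten each question into plain JSON-friendly types.
    return {
        "question": q.question.text,
        "product": q.product.model_dump(),  # Pydantic v2 replacement for the deprecated .dict()
        "thumbs_up": q.thumbs_up,
        "days_ago": q.days_ago
    }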

with open("prod_questions.json", "w") as f:
    serialized = [serialize_question(q) for q in questions]
    json.dump(serialized, f, default=str)
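
As a quick sanity check on the saved format, the sketch below writes a few records with the same keys as `serialize_question` to a temporary file, reads them back, and computes two simple aggregates. The records themselves are invented for illustration, and the temporary path is used so the sketch does not overwrite `prod_questions.json`.

```python
import json
import pathlib
import tempfile

# Hypothetical records using the same keys that serialize_question emits.
records = [
    {"question": "How does it compare to model X?",
     "product": {"title": "Cordless Drill", "description": "18V drill"},
     "thumbs_up": False, "days_ago": 3},
    {"question": "Is it any good?",
     "product": {"title": "Cordless Drill", "description": "18V drill"},
     "thumbs_up": True, "days_ago": 21},
    {"question": "Where is it made?",
     "product": {"title": "Tape Measure", "description": "25 ft tape"},
     "thumbs_up": True, "days_ago": 6},
]

path = pathlib.Path(tempfile.mkdtemp()) / "prod_questions.json"
with open(path, "w") as f:
    json.dump(records, f, default=str)

with open(path) as f:
    loaded = json.load(f)

# Booleans and ints survive the round trip, so aggregates work directly.
thumbs_up_rate = sum(r["thumbs_up"] for r in loaded) / len(loaded)
last_week = [r["question"] for r in loaded if r["days_ago"] <= 7]

print(f"{len(loaded)} questions, thumbs-up rate {thumbs_up_rate:.2f}")
print("asked in the last 7 days:", last_week)
```

Note that `default=str` silently converts any non-serializable value into a string, so it is worth confirming that `thumbs_up` and `days_ago` come back as a boolean and an integer rather than text.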