# Context

Any project has a limited budget of time and effort for improving its retrieval pipeline, so you need to prioritize. The most valuable improvements are those that:
1. Affect many queries
2. Affect queries with room for improvement
3. Affect high value queries

This notebook shows how to monitor production traffic and identify which areas to improve (focusing on criteria 1 and 2 above).

Specifically, we use an LLM to categorize queries by topic or functionality area. Basic analytics then show:
1. Which categories have many queries
2. Which categories have low measured customer satisfaction (here, a low thumbs-up rate)

We could do further analytics to look at how these stats change over time (e.g. as we bring in new types of users).
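
As a rough sketch of that over-time analysis: assuming each logged query also carried a timestamp, the same statistics could be grouped by month and category. The `asked_at` column and the toy rows below are hypothetical; the data used in this notebook has no timestamps.

```python
import pandas as pd

# Toy data: `asked_at` is a hypothetical column, not part of prod_questions.json.
df = pd.DataFrame(
    {
        "asked_at": pd.to_datetime(
            ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-10", "2024-02-25"]
        ),
        "question_type": ["Comparison", "General", "Comparison", "Comparison", "General"],
        "thumbs_up": [False, True, False, True, True],
    }
)

# Bucket queries by calendar month, then compute volume and satisfaction
# for each (month, category) pair.
monthly_stats = (
    df.assign(month=df["asked_at"].dt.to_period("M"))
    .groupby(["month", "question_type"])
    .agg(
        num_questions=("thumbs_up", "size"),
        fraction_thumbs_up=("thumbs_up", "mean"),
    )
    .reset_index()
)
print(monthly_stats)
```

A category whose volume grows month over month while its thumbs-up rate stays low is a likely candidate for attention.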

## Raw Data

In [1]:
import asyncio
from typing import List
import instructor
import json
from openai import AsyncOpenAI
import pandas as pd

from question_types import (
    UntypedQuestion,
    TypedQuestion,
    Question,
    Product,
    QuestionTypes,
    question_type_details,
)

client = instructor.from_openai(AsyncOpenAI())

def read_to_question(q: dict) -> UntypedQuestion:
    question = Question(text=q["question"])
    product = Product(title=q["product"]["title"], description=q["product"]["description"])
    return UntypedQuestion(question=question, product=product, thumbs_up=q["thumbs_up"])

with open("prod_questions.json", "r") as f:
    prod_questions = json.load(f)
    untyped_questions = [read_to_question(q) for q in prod_questions]

prod_questions[:3]

[{'question': 'How does the weight of this claw hammer compare to the ProHammer 2000?',
  'product': {'title': 'Hammer',
   'description': 'A versatile claw hammer for general carpentry and home repair. Features an ergonomic grip and balanced weight for efficient and comfortable use.'},
  'thumbs_up': False},
 {'question': 'Is the ergonomic grip of this hammer more comfortable than the GripMaster 300?',
  'product': {'title': 'Hammer',
   'description': 'A versatile claw hammer for general carpentry and home repair. Features an ergonomic grip and balanced weight for efficient and comfortable use.'},
  'thumbs_up': False},
 {'question': 'In terms of versatility, how does this hammer stack up against the MultiTool Hammer Pro?',
  'product': {'title': 'Hammer',
   'description': 'A versatile claw hammer for general carpentry and home repair. Features an ergonomic grip and balanced weight for efficient and comfortable use.'},
  'thumbs_up': False}]

# Classifying Queries

The raw query types are defined in `question_types.py`. We include these in our prompt and ask an LLM to categorize each question. We could assign each question either to a single category or to multiple categories. This example assigns each question to a single category.
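
For comparison, a multi-category variant could be sketched as follows. This is a minimal illustration, not the approach this notebook takes: the `Category` enum is a small hypothetical stand-in for `QuestionTypes`, and `parse_categories` assumes the model was asked to reply with a comma-separated list of category names.

```python
from collections import Counter
from enum import Enum
from typing import List


class Category(Enum):
    # Hypothetical stand-in for QuestionTypes; values mirror category names.
    COMPARISON = "Comparison"
    MATERIALS = "Materials"
    TIME_SENSITIVE = "Time Sensitive"


def parse_categories(raw: str) -> List[Category]:
    """Parse a comma-separated model reply into known categories, skipping unknown names."""
    known = {c.value.lower(): c for c in Category}
    parsed: List[Category] = []
    for name in raw.split(","):
        category = known.get(name.strip().lower())
        if category is not None and category not in parsed:
            parsed.append(category)
    return parsed


# Example replies (made up for illustration); "Shipping" is not a known category.
replies = ["Comparison, Materials", "time sensitive", "Comparison, Shipping"]
labels = [parse_categories(r) for r in replies]

# With multiple labels, a question counts once toward each of its categories.
counts = Counter(c.value for question_labels in labels for c in question_labels)
print(counts)
```

Note that with multiple labels the per-category counts no longer sum to the number of questions, which changes how the volume statistics should be read.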

In [2]:
async_client = instructor.from_openai(AsyncOpenAI())

q_type_explanation_list = [
    f"NAME: {q.title}\nDESCRIPTION: {q.description}\nEXAMPLE: {q.example}"
    for q in question_type_details.values()
]

q_type_explanation_str = "\n---\n".join(q_type_explanation_list)


async def categorize_question(
    question: UntypedQuestion, semaphore: asyncio.Semaphore = asyncio.Semaphore(1)
) -> TypedQuestion:
    async with semaphore:
        question_text = question.question.text
        prompt = f"""
        Classify the attached question into one of the following categories:
        {', '.join([q.value for q in QuestionTypes])}

        Here are descriptions of each category:
        {q_type_explanation_str}

        Here is the question:
        Question: {question_text}

        For context, the product is listed on a hardware store website with the following description:
        {question.product.description}

        Respond with only the category name.
        """

        try:
            result = await async_client.chat.completions.create(
                model="gpt-4o-mini",
                response_model=str,
                messages=[{"role": "user", "content": prompt}],
            )

            # Convert the string result to the corresponding QuestionTypes enum
            # (strip stray whitespace so an otherwise valid name is not rejected)
            question_type = QuestionTypes(result.strip())

            return TypedQuestion(
                question=question.question,
                question_type=question_type,
                product=question.product,
                thumbs_up=question.thumbs_up,
            )
        except Exception as e:
            print(f"Error classifying question: {str(e)}")
            return None

Run this for all questions. We use async calls because there are many questions and most of the time is spent waiting for API responses.

In [None]:

async def categorize_questions(max_concurrency: int = 20) -> List[TypedQuestion]:
    out = []
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [categorize_question(o, semaphore) for o in untyped_questions]
    categorized_questions = await asyncio.gather(*tasks, return_exceptions=True)
    for cq in categorized_questions:
        if not isinstance(cq, Exception):
            out.append(cq)
        else:
            print(f"Error categorizing question: {str(cq)}")
    return out


categorized_questions = await categorize_questions()
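
To see the semaphore pattern in isolation, here is a small self-contained sketch that needs no API key. `fake_classify` is a hypothetical stand-in for the API call (it only sleeps), and the two counters exist purely to show that no more than `max_concurrency` calls are in flight at once.

```python
import asyncio
from typing import List

in_flight = 0
peak_in_flight = 0


async def fake_classify(item: int, semaphore: asyncio.Semaphore) -> int:
    """Hypothetical stand-in for an API call; sleeps instead of calling a model."""
    global in_flight, peak_in_flight
    async with semaphore:
        in_flight += 1
        peak_in_flight = max(peak_in_flight, in_flight)
        await asyncio.sleep(0.01)  # simulate waiting on the network
        in_flight -= 1
        return item * 2


async def run_all(items: List[int], max_concurrency: int = 3) -> List[int]:
    # Create the semaphore inside the running event loop and share it across tasks.
    semaphore = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(*(fake_classify(i, semaphore) for i in items))


results = asyncio.run(run_all(list(range(10))))
print(results, peak_in_flight)
```

Inside a notebook, where an event loop is already running, use `results = await run_all(list(range(10)))` instead of `asyncio.run(...)`.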

# Analytics

Convert the data into a DataFrame and calculate basic statistics.

In [3]:
clustered_questions = pd.DataFrame(
    [
        {
            "question_text": q.question.text,
            "question_type": q.question_type.value,
            "product_title": q.product.title,
            "thumbs_up": q.thumbs_up,
        }
        for q in categorized_questions
        if q is not None
    ]
)
clustered_questions.head()

| | question_text | question_type | product_title | thumbs_up |
|---|---|---|---|---|
| 0 | How does the weight of this claw hammer compar... | Comparison | Hammer | False |
| 1 | Is the ergonomic grip of this hammer more comf... | Comparison | Hammer | False |
| 2 | In terms of versatility, how does this hammer ... | Comparison | Hammer | False |
| 3 | Does this hammer provide better balance compar... | Comparison | Hammer | False |
| 4 | How does the durability of this claw hammer co... | Comparison | Hammer | False |


In [4]:
cluster_stats = (
    clustered_questions.groupby("question_type")
    .agg(
        num_questions=("question_text", "size"),
        fraction_thumbs_up=("thumbs_up", "mean"),
        count_not_thumbs_up=("thumbs_up", lambda x: x.size - x.sum()),
    )
    .reset_index()
)

cluster_stats.round(2).sort_values("num_questions", ascending=False)

| | question_type | num_questions | fraction_thumbs_up | count_not_thumbs_up |
|---|---|---|---|---|
| 2 | Comparison | 404 | 0.16 | 341 |
| 8 | Materials | 275 | 0.23 | 212 |
| 9 | Time Sensitive | 263 | 0.06 | 247 |
| 3 | Compatibility | 255 | 0.75 | 64 |
| 4 | Country of Origin | 248 | 0.61 | 96 |
| 1 | Authenticity and counterfeits | 244 | 0.08 | 224 |
| 6 | Environmental Impact | 218 | 0.13 | 189 |
| 7 | General | 201 | 0.69 | 62 |
| 5 | Customer Service | 154 | 0.63 | 57 |
| 0 | Accessories | 94 | 0.29 | 67 |
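
Because `count_not_thumbs_up` combines volume and dissatisfaction, ranking by it is one simple way to apply criteria 1 and 2 together. The sketch below rebuilds the relevant columns from the output above so it runs on its own; the `share_of_unsatisfied` column is an illustrative addition, not part of the original analysis.

```python
import pandas as pd

# Values copied from the cluster_stats output above.
stats = pd.DataFrame(
    {
        "question_type": [
            "Comparison", "Materials", "Time Sensitive", "Compatibility",
            "Country of Origin", "Authenticity and counterfeits",
            "Environmental Impact", "General", "Customer Service", "Accessories",
        ],
        "num_questions": [404, 275, 263, 255, 248, 244, 218, 201, 154, 94],
        "count_not_thumbs_up": [341, 212, 247, 64, 96, 224, 189, 62, 57, 67],
    }
)

# Share of all unsatisfied queries that each category accounts for.
stats["share_of_unsatisfied"] = (
    stats["count_not_thumbs_up"] / stats["count_not_thumbs_up"].sum()
)

# Rank categories by the absolute number of unsatisfied queries.
ranked = stats.sort_values("count_not_thumbs_up", ascending=False).reset_index(drop=True)
print(ranked.round(2))
```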


# Conclusion

What areas would you prioritize?

Some candidates:
- Bring in reviews from multiple products. `Comparison` is the most common category, and it is served poorly right now (16% thumbs up).
- Add metadata filtering by review date, or other temporal data. `Time Sensitive` queries are common and extremely poorly served (6% thumbs up).
- If you run a platform with many sellers (e.g. Amazon), allow filtering by seller within a given SKU. This may help with `Authenticity and counterfeits`, another large category that is very poorly served (8% thumbs up).

The exact prioritization depends on how much effort each potential improvement requires, how effective you expect it to be, and so on. But now you have a starting point to inform these decisions.
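
One way to make that trade-off explicit is a simple expected-value-per-effort score. The sketch below uses the `count_not_thumbs_up` values from the statistics above, but the `expected_fix_rate` and `effort_weeks` numbers are made-up placeholders to replace with your own estimates, and the scoring formula is just one reasonable choice among many.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    unsatisfied_queries: int  # count_not_thumbs_up for the targeted category
    expected_fix_rate: float  # ASSUMPTION: fraction of those queries the change would fix
    effort_weeks: float       # ASSUMPTION: rough engineering effort

    @property
    def score(self) -> float:
        """Expected number of newly satisfied queries per week of effort."""
        return self.unsatisfied_queries * self.expected_fix_rate / self.effort_weeks


# Fix rates and effort values are placeholders, not measurements.
candidates = [
    Candidate("Cross-product reviews (Comparison)", 341, 0.5, 4.0),
    Candidate("Review-date filtering (Time Sensitive)", 247, 0.6, 2.0),
    Candidate("Seller filtering (Authenticity and counterfeits)", 224, 0.4, 3.0),
]

ranked_candidates = sorted(candidates, key=lambda c: c.score, reverse=True)
for c in ranked_candidates:
    print(f"{c.name}: {c.score:.1f}")
```

The ranking is only as good as the estimates fed into it, so treat the output as a conversation starter rather than a decision.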