# Query -> Category

<small>
(from <a href="http://maven.com/softwaredoug/cheat-at-search">Cheat at Search with LLMs</a> training course by Doug Turnbull.)
</small>

We learned that we want to model aspects of the user's _information need_, not just make queries better.

One common dimension of that need is query -> category classification.

The Wayfair dataset has an existing classification scheme. We'll use that. It has entries such as:

```
Furniture / Living Room Furniture / Chairs & Seating / Accent Chairs,
Rugs / Area Rugs,
...
```

We'll call the top level the "category" and the second level the "subcategory".
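
As a minimal sketch of that naming, the first two levels can be pulled out of a hierarchy string by splitting on `/`. The function name `split_hierarchy` is hypothetical and only for illustration; the search strategy later in this notebook does the equivalent split with pandas.

```python
def split_hierarchy(hierarchy: str) -> tuple[str, str]:
    """Return (category, subcategory) from a '/'-delimited hierarchy string."""
    parts = [part.strip() for part in hierarchy.split("/")]
    category = parts[0] if len(parts) > 0 else ""
    # Entries with only one level have no subcategory
    subcategory = parts[1] if len(parts) > 1 else ""
    return category, subcategory


print(split_hierarchy("Furniture / Living Room Furniture / Chairs & Seating / Accent Chairs"))
print(split_hierarchy("Rugs / Area Rugs"))
```

Deeper levels (such as "Chairs & Seating") are ignored here, matching the two-level scheme used in the rest of the notebook.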


In this notebook, we'll use OpenAI to classify our queries into a category and subcategory per query.

## Boilerplate

Install deps, mount GDrive, prompt for your OpenAI key (placed in your GDrive), and import the needed Cheat at Search helpers.

We cover this extensively in the [synonyms notebook](https://colab.research.google.com/drive/1aUCvcBa1YdmsbIgYc74jlknl9_iRotp1) walkthrough.

In [None]:
!pip install git+https://github.com/softwaredoug/cheat-at-search.git
from cheat_at_search.data_dir import mount
mount(use_gdrive=True)
from cheat_at_search.search import run_strategy, graded_bm25, ndcgs, ndcg_delta, vs_ideal
from cheat_at_search.wands_data import products

products


Unnamed: 0,product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count,features,category,sub_category,cat_subcat
0,0,solid wood platform bed,Beds,Furniture / Bedroom Furniture / Beds & Headboa...,"good , deep sleep can be quite difficult to ha...",overallwidth-sidetoside:64.7|dsprimaryproducts...,15.0,4.5,15.0,"[overallwidth-sidetoside:64.7, dsprimaryproduc...",Furniture,Bedroom Furniture,Furniture / Bedroom Furniture
1,1,all-clad 7 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,"create delicious slow-cooked meals , from tend...",capacityquarts:7|producttype : slow cooker|pro...,100.0,2.0,98.0,"[capacityquarts:7, producttype : slow cooker, ...",Kitchen & Tabletop,Small Kitchen Appliances,Kitchen & Tabletop / Small Kitchen Appliances
2,2,all-clad electrics 6.5 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,prepare home-cooked meals on any schedule with...,features : keep warm setting|capacityquarts:6....,208.0,3.0,181.0,"[features : keep warm setting, capacityquarts:...",Kitchen & Tabletop,Small Kitchen Appliances,Kitchen & Tabletop / Small Kitchen Appliances
3,3,all-clad all professional tools pizza cutter,"Slicers, Peelers And Graters",Browse By Brand / All-Clad,this original stainless tool was designed to c...,overallwidth-sidetoside:3.5|warrantylength : l...,69.0,4.5,42.0,"[overallwidth-sidetoside:3.5, warrantylength :...",Browse By Brand,All-Clad,Browse By Brand / All-Clad
4,4,baldwin prestige alcott passage knob with roun...,Door Knobs,Home Improvement / Doors & Door Hardware / Doo...,the hardware has a rich heritage of delivering...,compatibledoorthickness:1.375 '' |countryofori...,70.0,5.0,42.0,"[compatibledoorthickness:1.375 '' , countryofo...",Home Improvement,Doors & Door Hardware,Home Improvement / Doors & Door Hardware
...,...,...,...,...,...,...,...,...,...,...,...,...,...
42989,42989,malibu pressure balanced diverter fixed shower...,Shower Panels,Home Improvement / Bathroom Remodel & Bathroom...,the malibu pressure balanced diverter fixed sh...,producttype : shower panel|spraypattern : rain...,3.0,4.5,2.0,"[producttype : shower panel, spraypattern : ra...",Home Improvement,Bathroom Remodel & Bathroom Fixtures,Home Improvement / Bathroom Remodel & Bathro...
42990,42990,emmeline 5 piece breakfast dining set,Dining Table Sets,Furniture / Kitchen & Dining Furniture / Dinin...,,basematerialdetails : steel| : gray wood|ofhar...,1314.0,4.5,864.0,"[basematerialdetails : steel, : gray wood, of...",Furniture,Kitchen & Dining Furniture,Furniture / Kitchen & Dining Furniture
42991,42991,maloney 3 piece pub table set,Dining Table Sets,Furniture / Kitchen & Dining Furniture / Dinin...,this pub table set includes 1 counter height t...,additionaltoolsrequirednotincluded : power dri...,49.0,4.0,41.0,[additionaltoolsrequirednotincluded : power dr...,Furniture,Kitchen & Dining Furniture,Furniture / Kitchen & Dining Furniture
42992,42992,fletcher 27.5 '' wide polyester armchair,Teen Lounge Furniture|Accent Chairs,Furniture / Living Room Furniture / Chairs & S...,"bring iconic , modern style to your space in a...",legmaterialdetails : rubberwood|backheight-sea...,1746.0,4.5,1226.0,"[legmaterialdetails : rubberwood, backheight-s...",Furniture,Living Room Furniture,Furniture / Living Room Furniture


## Query -> Category classification baseline

We'll set up the task of classifying each query into a category and subcategory. Here is a first-pass baseline that might make sense.

### Define allowed output



You'll notice we constrain the allowed output of the LLM to the allowed categories. We do this in Pydantic with a `Literal` type.
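
To make the idea of an enumerated label set concrete, here is a small standard-library sketch. It assumes a toy three-label `Literal` (`ToyCategories` and `is_allowed` are hypothetical names, not part of the notebook) and uses `typing.get_args` to list the allowed values. Pydantic performs a comparable membership check when it validates a `Literal` field; how `AutoEnricher` passes the schema to the model is not shown here.

```python
from typing import Literal, get_args

# A tiny, illustrative subset of labels (not the full list used in this notebook)
ToyCategories = Literal["Furniture", "Rugs", "Lighting"]


def is_allowed(label: str) -> bool:
    """True when `label` is one of the values enumerated in the Literal type."""
    return label in get_args(ToyCategories)


print(get_args(ToyCategories))
print(is_allowed("Rugs"))
print(is_allowed("Couches"))
```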

In [None]:
from pydantic import BaseModel, Field
from typing import List, Literal
from cheat_at_search.enrich import AutoEnricher


Categories = Literal['Furniture',
                     'Home Improvement',
                     'Décor & Pillows',
                     'Outdoor',
                     'Storage & Organization',
                     'Lighting',
                     'Rugs',
                     'Bed & Bath',
                     'Kitchen & Tabletop',
                     'Baby & Kids',
                     'School Furniture and Supplies',
                     'Appliances',
                     'Holiday Décor',
                     'Commercial Business Furniture',
                     'Pet',
                     'Contractor',
                     'Sale',
                     'Foodservice ',
                     'Reception Area',
                     'Clips']

SubCategories = Literal['Bedroom Furniture',
 'Small Kitchen Appliances',
 'All-Clad',
 'Doors & Door Hardware',
 'Bathroom Remodel & Bathroom Fixtures',
 'Home Accessories',
 'Living Room Furniture',
 'Outdoor Décor',
 'Flooring, Walls & Ceiling',
 'Garage & Outdoor Storage & Organization',
 'Cookware & Bakeware',
 'Bedding',
 'Kitchen Utensils & Tools',
 'Shower Curtains & Accessories',
 'Wall Shelving & Organization',
 'Clocks',
 'Bedding Essentials',
 'Kitchen & Dining Furniture',
 'Office Furniture',
 'Tableware & Drinkware',
 'Nursery Bedding',
 'Cat',
 'Outdoor Shades',
 'Outdoor & Patio Furniture',
 'Ceiling Lights',
 'Area Rugs',
 'Outdoor Lighting',
 'Window Treatments',
 'Garden',
 'Closet Storage & Organization',
 'Wall Décor',
 'Mirrors',
 'Shoe Storage',
 'Toddler & Kids Playroom',
 'Game Tables & Game Room Furniture',
 'Decorative Pillows & Blankets',
 'School Furniture',
 'Wall Lights',
 'Bathroom Storage & Organization',
 'Commercial Office Furniture',
 'Flowers & Plants',
 'Mattresses & Foundations',
 'Cleaning & Laundry Organization',
 'Kitchen Organization',
 'Candles & Holders',
 'Christmas',
 'Toddler & Kids Bedroom Furniture',
 'Front Door Décor & Curb Appeal',
 'Storage Furniture',
 'School Spaces',
 'Hardware',
 'Light Bulbs & Hardware',
 'Ceiling Fans',
 'Doormats',
 'Entry & Hallway',
 'Storage Containers & Drawers',
 'Holiday Lighting',
 'Kitchen Mats',
 'Facilities & Maintenance',
 'Table & Floor Lamps',
 'Bird',
 'Kitchen Appliances',
 'Building Equipment',
 'Art',
 'Picture Frames & Albums',
 'Outdoor Heating',
 'Outdoor Recreation',
 'Bathroom Accessories & Organization',
 'School Boards & Technology',
 'Closeout',
 'Reception Seating',
 'Foodservice Tables',
 'Kitchen Remodel & Kitchen Fixtures',
 'Hot Tubs & Saunas',
 'Teen Bedroom Furniture',
 'Outdoor Fencing & Flooring',
 'Chairs',
 'Bath Rugs & Towels',
 'Fish',
 'Dog',
 'Chicken',
 'Boards & Tech Accessories',
 'Commercial Contractor',
 'Clamps',
 'Jewelry Organization',
 'Entry & Mudroom Furniture',
 'Outdoor Cooking & Tableware',
 'Seasonal Décor',
 'Nursery Furniture',
 'Storage & Organization Sale',
 'Washers & Dryers',
 'Baby & Kids Décor & Lighting',
 'Outdoor Remodel',
 'Plumbing',
 'Birch Lane™',
 'Office Organization',
 'Kitchen & Dining Sale',
 'Baby & Kids Storage',
 'Shop All Characters',
 'Commercial Kitchen',
 'Guest Room Amenities',
 'Charlton Home',
 'Wade Logan®',
 'Heating, Cooling & Air Quality',
 'Thanksgiving',
 'Fourth of July',
 'Vacuums & Deep Cleaners',
 'Stair Tread Rugs',
 'Small Spaces',
 'Toddler & Kids Bedding & Bath',
 'Classroom Décor',
 'Early Education Play Area',
 'Zoomie Kids',
 'Fryers',
 'August Grove',
 'Dorm Décor & Back to School Essentials',
 'Symple Stuff',
 'Wayfair Basics®',
 'The Holiday Aisle',
 'Chair Pads & Cushions',
 'The Monogram Shop',
 'Wedding',
 'Reception Desks & Tables',
 'Rug Pads',
 'Latitude Run',
 'Accommodations Furniture',
 'Easter',
 'Furniture Sale',
 'Novelty Lights',
 "Valentine's Day",
 'Outdoor Sale',
 'Classroom & Training Furniture',
 'Rebrilliant',
 'Commercial Kitchen Storage',
 'Teen Bedding',
 'Tommy Bahama Home',
 'Appliances Sale',
 'Massage Products']


class Query(BaseModel):
    """
    Base model for search queries, containing common query attributes.
    """
    keywords: str = Field(
        ...,
        description="The original search query keywords sent in as input"
    )
    query_intent: str = Field(
        description="Explain the intent of this query"
    )

class QueryCategory(Query):
    """
    Representation of the category and subcategory classification of a search query
    """
    category: Categories = Field(
        description="Category of the product"
    )
    sub_category: SubCategories = Field(
        description="Sub-category of the product"
    )
    labeling_explanation: str = Field(
        description="Why did you label this the way you did?"
    )

    @property
    def classification(self) -> str:
        return f"{self.category} / {self.sub_category}"



### Query classification code

Here we define a prompt and set up the enricher.

In [None]:
enricher = AutoEnricher(
     model="openai/gpt-4o",
     system_prompt="You are a helpful furniture shopping agent that helps users construct search queries.",
     response_model=QueryCategory
)

def get_prompt(query):
    prompt = f"""
        As a helpful agent, you'll receive requests from users looking for furniture products.

        Your task is to search with a structured query against a furniture product catalog.

        Here is the user's request:

        {query}

        Return Category / Subcategory:

        * Category - the allowed categories (as listed in schema) for the product.
        * SubCategory - the allowed subcategories (as listed in schema) for the product.
    """
    return prompt


def categorized(query):
    prompt = get_prompt(query)
    return enricher.enrich(prompt)


categorized("tv stand")

QueryCategory(keywords='tv stand', query_intent='The user is looking for a piece of furniture to hold a television, typically used in living rooms or entertainment areas.', category='Furniture', sub_category='Living Room Furniture', labeling_explanation="A TV stand is a piece of furniture commonly used in living rooms to support televisions and related media equipment, fitting under the 'Living Room Furniture' subcategory.")

### Category search strategy

Below is how we'll use the category in search. We:

1. Tokenize category / subcategory into their own fields
2. Predict the category / subcategory for the query
3. Apply a constant boost (`category_boost` / `sub_category_boost`) to products whose category / subcategory matches the prediction
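
The boosting step can be sketched in plain Python on toy data. This is a simplified illustration, not the notebook's implementation: it assumes exact string equality on the category instead of a tokenized field match, and `boost_matching` and `top_k` are hypothetical helper names.

```python
def boost_matching(scores, doc_categories, predicted_category, boost):
    """Add a constant `boost` to every document whose category matches."""
    return [
        score + boost if doc_category == predicted_category else score
        for score, doc_category in zip(scores, doc_categories)
    ]


def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


# Toy lexical scores for three products and their categories
base_scores = [4.0, 6.5, 3.0]
doc_categories = ["Furniture", "Rugs", "Furniture"]

boosted = boost_matching(base_scores, doc_categories, "Furniture", boost=10)
print(boosted)
print(top_k(boosted, k=2))
```

Because the boost is a constant, it reorders products across categories but preserves the lexical ordering among products that share the predicted category.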

In [None]:
from searcharray import SearchArray
from cheat_at_search.tokenizers import snowball_tokenizer
from cheat_at_search.strategy.strategy import SearchStrategy
import numpy as np


class CategorySearch(SearchStrategy):
    def __init__(self, products, query_to_cat,
                 name_boost=9.3,
                 description_boost=4.1,
                 category_boost=10,
                 sub_category_boost=5):
        super().__init__(products)
        self.index = products
        self.index['product_name_snowball'] = SearchArray.index(
            products['product_name'], snowball_tokenizer)
        self.index['product_description_snowball'] = SearchArray.index(
            products['product_description'], snowball_tokenizer)

        cat_split = products['category hierarchy'].fillna('').str.split("/")

        products['category'] = cat_split.apply(
            lambda x: x[0].strip() if len(x) > 0 else ""
        )
        products['subcategory'] = cat_split.apply(
            lambda x: x[1].strip() if len(x) > 1 else ""
        )
        self.index['category_snowball'] = SearchArray.index(
            products['category'], snowball_tokenizer
        )
        self.index['subcategory_snowball'] = SearchArray.index(
            products['subcategory'], snowball_tokenizer
        )

        self.query_to_cat = query_to_cat
        self.name_boost = name_boost
        self.description_boost = description_boost
        self.category_boost = category_boost
        self.sub_category_boost = sub_category_boost

    def search(self, query, k=10):
        """Dumb baseline lexical search, but add a constant boost when
           a product matches the predicted category or subcategory"""
        bm25_scores = np.zeros(len(self.index))
        structured = self.query_to_cat(query)
        tokenized = snowball_tokenizer(query)

        # ****
        # Baseline BM25 search from before
        for token in tokenized:
            bm25_scores += self.index['product_name_snowball'].array.score(token) * self.name_boost
            bm25_scores += self.index['product_description_snowball'].array.score(
                token) * self.description_boost

        # ****
        # If there's a subcategory, boost that by a constant amount
        if structured.sub_category and structured.sub_category != "No SubCategory Fits":
            tokenized_subcategory = snowball_tokenizer(structured.sub_category)
            # Boolean mask, so it can be used to index bm25_scores below
            subcategory_match = np.ones(len(self.index), dtype=bool)
            if tokenized_subcategory:
                subcategory_match = self.index['subcategory_snowball'].array.score(tokenized_subcategory) > 0
            bm25_scores[subcategory_match] += self.sub_category_boost

        # ****
        # If there's a category, boost that by a constant amount
        if structured.category and structured.category != "No Category Fits":
            tokenized_category = snowball_tokenizer(structured.category)
            # Boolean mask, so it can be used to index bm25_scores below
            category_match = np.ones(len(self.index), dtype=bool)
            if tokenized_category:
                category_match = self.index['category_snowball'].array.score(tokenized_category) > 0
            bm25_scores[category_match] += self.category_boost

        top_k = np.argsort(-bm25_scores)[:k]
        scores = bm25_scores[top_k]

        return top_k, scores


In [None]:
categorized_search = CategorySearch(products, categorized)
graded_categorized = run_strategy(categorized_search)
graded_categorized

Searching: 100%|██████████| 480/480 [27:11<00:00,  3.40s/it]


Unnamed: 0,product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count,features,...,query_id,rank,query_class,id,label,grade,discounted_gain,idcg,dcg,ndcg
0,7465,hair salon chair,Massage Chairs|Recliners,Furniture / Living Room Furniture / Chairs & S...,offers a wide selection of professional salon ...,fauxleathertype : pu|legheight-toptobottom:18|...,69.0,4.5,53.0,"[fauxleathertype : pu, legheight-toptobottom:1...",...,0,1,Massage Chairs,80.0,Exact,2.0,3.00,8.786905,8.10119,0.921962
1,25431,barberpub salon massage chair,Massage Chairs,Furniture / Living Room Furniture / Chairs & S...,salon chairs are a wonderful avenue for hairst...,supplierintendedandapproveduse : non residenti...,4.0,5.0,4.0,[supplierintendedandapproveduse : non resident...,...,0,2,Massage Chairs,29.0,Exact,2.0,1.50,8.786905,8.10119,0.921962
2,7468,mercer41 hair salon chair hydraulic styling ch...,Massage Chairs,Furniture / Living Room Furniture / Chairs & S...,mercer41 beauty offers a wide selection profes...,seatfillmaterial : foam|waterrepellant : no re...,1.0,5.0,1.0,"[seatfillmaterial : foam, waterrepellant : no ...",...,0,3,Massage Chairs,104.0,Exact,2.0,1.00,8.786905,8.10119,0.921962
3,39461,professional salon reclining massage chair,Massage Chairs,Furniture / Living Room Furniture / Chairs & S...,new and in a good condition . first-rate metal...,overalldepth-fronttoback:39.4|warrantylength:1...,,,,"[overalldepth-fronttoback:39.4, warrantylength...",...,0,4,Massage Chairs,114.0,Exact,2.0,0.75,8.786905,8.10119,0.921962
4,9234,beauty salon task chair,,Furniture / Office Furniture / Office Chairs,"applicable scene : office , home life , beauty...",overallheight-toptobottom:37|backcolor : brown...,,,,"[overallheight-toptobottom:37, backcolor : bro...",...,0,5,Massage Chairs,32.0,Partial,1.0,0.20,8.786905,8.10119,0.921962
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4795,22194,wine glass rack,Kitchen Sink Storage,Kitchen & Tabletop / Kitchen Organization / Co...,drip-dry up to eight wineglasses with this cle...,glasscapacity:8|countryoforigin : united state...,5.0,4.5,3.0,"[glasscapacity:8, countryoforigin : united sta...",...,487,6,,,,0.0,0.00,8.786905,0.00000,0.000000
4796,40243,madisen hanging wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,complement your farmhouse kitchen decor with t...,producttype : wine glass rack|overallwidth-sid...,29.0,5.0,20.0,"[producttype : wine glass rack, overallwidth-s...",...,487,7,,,,0.0,0.00,8.786905,0.00000,0.000000
4797,40244,kena hanging wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,spruce up your farmhouse kitchen decor with th...,warrantylength:1 year|producttype : wine glass...,23.0,5.0,18.0,"[warrantylength:1 year, producttype : wine gla...",...,487,8,,,,0.0,0.00,8.786905,0.00000,0.000000
4798,39976,wall mounted wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,"the latest addition to this collection , this ...",overallheight-toptobottom:4|design : wall moun...,34.0,4.5,18.0,"[overallheight-toptobottom:4, design : wall mo...",...,487,9,,,,0.0,0.00,8.786905,0.00000,0.000000


In [None]:
graded_categorized['query'].unique()

array(['salon chair', 'smart coffee table', 'dinosaur',
       'turquoise pillows', 'chair and a half recliner',
       'sofa with ottoman', 'acrylic clear chair', 'driftwood mirror',
       'home sweet home sign', 'coffee table fire pit', 'king poster bed',
       'ombre rug', 'large spoon and fork wall decor',
       'outdoor privacy wall', 'beds that have leds',
       'black 5 drawer dresser by guilford', 'blk 18x18 seat cushions',
       'closet storage with zipper',
       'chrome bathroom 4 light vanity light', 'gurney  slade 56',
       'foutains with brick look', 'living curtains pearl',
       'light and navy blue decorative pillow',
       'stoneford end tables white and wood',
       'wood coffee table set by storage', 'sunflower', 'leather chairs',
       'outdoor welcome rug', 'rooster decor', 'bathroom vanity knobs',
       '3 1/2 inch drawer pull', 'burnt orange curtains',
       'dark gray dresser', 'non slip shower floor tile',
       'bar stool with backrest', 'enclo

### Analyze the change

1. Compare the mean NDCG of the BM25 baseline and this search strategy. We note a nice improvement.

2. We then look at which queries were significantly improved or harmed, to ask whether we would ship this to prod.

3. And we look at individual queries.
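
For reference, the `discounted_gain`, `idcg`, and `ndcg` columns in the graded results above are consistent with a gain of `2**grade - 1` discounted by `1 / rank`. The sketch below reproduces those visible values under that assumption. It is inferred from the displayed rows only, so the helper library's actual implementation may differ, and the function names are hypothetical.

```python
def discounted_gains(grades):
    """Gain (2**grade - 1) discounted by 1/rank, with rank starting at 1."""
    return [(2 ** grade - 1) / rank for rank, grade in enumerate(grades, start=1)]


def ndcg(grades, ideal_grades):
    """DCG of the returned ranking divided by the DCG of an ideal ranking."""
    dcg = sum(discounted_gains(grades))
    idcg = sum(discounted_gains(ideal_grades))
    return dcg / idcg if idcg > 0 else 0.0


# Grades of the first five results shown for 'salon chair'
top_five = [2, 2, 2, 2, 1]
print(discounted_gains(top_five))

# Ideal DCG when all ten results have grade 2
print(sum(discounted_gains([2] * 10)))
```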

In [None]:
ndcgs(graded_bm25).mean(), ndcgs(graded_categorized).mean()

In [None]:
deltas = ndcg_delta(graded_categorized, graded_bm25)
deltas

In [None]:
sig_improved = len(deltas[deltas > 0.1])
print(f"Num Significantly Improved: {sig_improved}")
deltas[deltas > 0.1]

In [None]:
sig_harmed = len(deltas[deltas < -0.1])
print(f"Num Significantly Harmed: {sig_harmed}")
total_changed = sig_harmed + sig_improved
print(f"Prop improved/harmed: {sig_improved / total_changed} | {sig_harmed / total_changed}")
deltas[deltas < -0.1]

In [None]:
categorized("chair pillow cushion")

### Look at a query

In [None]:
QUERY = "bathroom vanity knobs"
categorized(QUERY)

In [None]:
graded_categorized[graded_categorized['query'] == QUERY][['product_name', 'category hierarchy', 'grade']]

In [None]:
graded_bm25[graded_bm25['query'] == QUERY][['product_name', 'category hierarchy', 'grade']]

## Define a ground truth for categories / subcategories

We need to measure the classifier's own accuracy, so we can understand how it relates to NDCG improvements.

This will let us debug our classifier's errors more carefully.
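
The cells that follow build this ground truth with pandas: keep the grade-2 products for each query, count their categories, and keep a category only when its share exceeds `CUTOFF`. The same rule can be sketched for a single query in plain Python. The name `dominant_label` is hypothetical and the inputs are toy data.

```python
from collections import Counter


def dominant_label(labels, cutoff=0.8, no_fit_label="No Category Fits"):
    """Return the label covering more than `cutoff` of `labels`, else `no_fit_label`."""
    if not labels:
        return no_fit_label
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > cutoff else no_fit_label


# 9 of 10 relevant products share a category: a clear ground truth
print(dominant_label(["Rugs"] * 9 + ["Décor & Pillows"]))

# Only 2 of 3 agree: no dominant category at a 0.8 cutoff
print(dominant_label(["Furniture", "Furniture", "Outdoor"]))
```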

In [None]:
CUTOFF = 0.8

from cheat_at_search.wands_data import labeled_query_products, queries

# Get relevant products per query
top_products = labeled_query_products[labeled_query_products['grade'] == 2]
top_products

In [None]:
# Aggregate top categories
categories_per_query_ideal = top_products.groupby('query')['category'].value_counts().reset_index()
categories_per_query_ideal

In [None]:
# Get as percentage of all categories for this query
top_cat_proportion = categories_per_query_ideal.groupby(['query', 'category']).sum() / categories_per_query_ideal.groupby('query').sum()
top_cat_proportion = top_cat_proportion.drop(columns='category').reset_index()

# Only look at cases where the category's share is > CUTOFF
top_cat_proportion = top_cat_proportion[top_cat_proportion['count'] > CUTOFF].copy()
top_cat_proportion['category'] = top_cat_proportion['category'].fillna('No Category Fits')
ground_truth_cat = top_cat_proportion
ground_truth_cat

In [None]:
# Give No Category Fits to all others without dominant category
ground_truth_cat = ground_truth_cat.merge(queries, how='right', on='query')[['query', 'category', 'count']]
ground_truth_cat['category'] = ground_truth_cat['category'].fillna('No Category Fits')
ground_truth_cat

### Category prediction precision of baseline
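
The cell below reports two numbers: precision, measured only over queries where the classifier commits to a label, and coverage, the share of queries it attempted. A minimal sketch of those two quantities on toy data, assuming a hypothetical helper `precision_and_coverage` that takes (predicted, expected) pairs:

```python
def precision_and_coverage(pairs, no_fit_label="No Category Fits"):
    """`pairs` is a list of (predicted, expected) labels, one per query."""
    # Queries where the classifier abstained are excluded from precision
    attempted = [(pred, exp) for pred, exp in pairs if pred != no_fit_label]
    hits = sum(1 for pred, exp in attempted if pred == exp)
    precision = hits / len(attempted) if attempted else 0.0
    coverage = len(attempted) / len(pairs) if pairs else 0.0
    return precision, coverage


pairs = [
    ("Furniture", "Furniture"),
    ("Rugs", "Décor & Pillows"),
    ("No Category Fits", "Lighting"),
    ("Lighting", "Lighting"),
]
print(precision_and_coverage(pairs))
```

Reading the two together matters: a classifier that abstains often can show high precision while helping few queries.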

In [None]:
def get_pred(cat, column):
    if column == 'category':
        return cat.category
    elif column == 'sub_category':
        return cat.sub_category
    else:
        raise ValueError(f"Unknown column {column}")


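def prec_cat(ground_truth, column, no_fit_label, categorized, N=500):
    """Precision of the predicted `column` over up to N queries where the classifier commits to a label."""
    hits = []
    misses = []
    # Shuffle so that stopping early at N still evaluates a random sample
    for _, row in ground_truth.sample(frac=1).iterrows():
        query = row['query']
        expected_category = row[column].strip()

        cat = categorized(query)
        pred = get_pred(cat, column)
        # The classifier abstained: skip the query rather than count it as a hit or miss
        if pred == no_fit_label:
            print(f"Skipping {query}")
            continue
        if pred == expected_category:
            hits.append((expected_category, cat))
        else:
            print("***")
            # Print the prediction for the column being evaluated (not always cat.category)
            print(f"{query} -- predicted:{pred} != expected:{expected_category}")
            misses.append((expected_category, cat))
            num_so_far = len(hits) + len(misses)
            print(f"prec (N={num_so_far}) -- {len(hits) / num_so_far}")
            print(f"coverage {num_so_far / len(ground_truth)}")

        # Stop once N queries have been judged
        if len(hits) + len(misses) >= N:
            break
    num_judged = len(hits) + len(misses)
    # Avoid dividing by zero when every query was skipped
    return (len(hits) / num_judged if num_judged else 0.0), hits, misses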

prec, hits, misses = prec_cat(ground_truth_cat, 'category', 'No Category Fits', categorized, N=500)
prec
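
To make the two numbers concrete, here is a minimal sketch of precision and coverage on hand-written predictions. The `precision_and_coverage` helper and the toy dictionaries are hypothetical and exist only for illustration. Unlike `prec_cat`, which prints judged-so-far over all queries as it goes, this computes coverage once over the full set.

```python
def precision_and_coverage(expected, predicted, no_fit_label):
    """Precision over queries where the classifier commits; coverage is how often it commits."""
    # Drop queries where the classifier abstained
    judged = [q for q in expected if predicted[q] != no_fit_label]
    hits = sum(1 for q in judged if predicted[q] == expected[q])
    precision = hits / len(judged) if judged else 0.0
    coverage = len(judged) / len(expected) if expected else 0.0
    return precision, coverage


# Hypothetical ground truth and classifier output
expected = {
    "rug": "Rugs",
    "lamp": "Lighting",
    "teal chair": "Furniture",
    "gift": "No Category Fits",
}
predicted = {
    "rug": "Rugs",                # hit
    "lamp": "Decor",              # miss
    "teal chair": "Furniture",    # hit
    "gift": "No Category Fits",   # abstained, so not judged
}

precision, coverage = precision_and_coverage(expected, predicted, "No Category Fits")
print(f"precision={precision:.3f} coverage={coverage:.2f}")
```

A classifier that abstains often can show high precision with low coverage, so it helps to read the two numbers together.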

In [None]:
from cheat_at_search.wands_data import labeled_query_products, queries

def get_top_category(column, no_fit_label, cutoff=0.8):
    # Get relevant products per query
    top_products = labeled_query_products[labeled_query_products['grade'] == 2]

    # Aggregate top categories
    categories_per_query_ideal = top_products.groupby('query')[column].value_counts().reset_index()

    # Get as percentage of all categories for this query
    top_cat_proportion = categories_per_query_ideal.groupby(['query', column]).sum() / categories_per_query_ideal.groupby('query').sum()
    top_cat_proportion = top_cat_proportion.drop(columns=column).reset_index()

    # Only keep cases where one value accounts for more than `cutoff` of a query's relevant products
    # (use the `cutoff` argument, not the global CUTOFF)
    top_cat_proportion = top_cat_proportion[top_cat_proportion['count'] > cutoff].copy()
    top_cat_proportion[column] = top_cat_proportion[column].fillna(no_fit_label)
    ground_truth_cat = top_cat_proportion
    # Give the no-fit label to every query without a dominant value
    ground_truth_cat = ground_truth_cat.merge(queries, how='right', on='query')[['query', column, 'count']]
    ground_truth_cat[column] = ground_truth_cat[column].fillna(no_fit_label)
    return ground_truth_cat

ground_truth_sub_cat = get_top_category('sub_category', 'No SubCategory Fits')
ground_truth_sub_cat

In [None]:
prec, hits, misses = prec_cat(ground_truth_sub_cat, 'sub_category', 'No SubCategory Fits', categorized, N=500)
prec

In [None]:
impacted_queries = [
 'drum picture',
 'bathroom freestanding cabinet',
 'outdoor lounge chair',
 'wood rack wide',
 'outdoor light fixtures',
 'bathroom vanity knobs',
 'door jewelry organizer',
 'beds that have leds',
 'non slip shower floor tile',
 'turquoise chair',
 'modern outdoor furniture',
 'podium with locking cabinet',
 'closet storage with zipper',
 'barstool patio sets',
 'ayesha curry kitchen',
 'led 60',
 'wisdom stone river 3-3/4',
 'liberty hardware francisco',
 'french molding',
 'glass doors for bath',
 'accent leather chair',
 'dark gray dresser',
 'wainscoting ideas',
 'floating bed',
 'dining table vinyl cloth',
 'entrance table',
 'storage dresser',
 'almost heaven sauna',
 'toddler couch fold out',
 'outdoor welcome rug',
 'wooden chair outdoor',
 'emma headboard',
 'outdoor privacy wall',
 'driftwood mirror',
 'white abstract',
 'bedroom accessories',
 'bathroom lighting',
 'light and navy blue decorative pillow',
 'gnome fairy garden',
 'medium size chandelier',
 'above toilet cabinet',
 'odum velvet',
 'ruckus chair',
 'modern farmhouse lighting semi flush mount',
 'teal chair',
 'bedroom wall decor floral, multicolored with some teal (prints)',
 'big basket for dirty cloths',
 'milk cow chair',
 'small wardrobe grey',
 'glow in the dark silent wall clock',
 'medium clips',
 'desk for kids tjat ate 10 year old',
 'industrial pipe dining  table',
 'itchington butterfly',
 'midcentury tv unit',
 'gas detector',
 'fleur de lis living candle wall sconce bronze',
 'zodiac pillow',
 'papasan chair frame only',
 'bed side table']
prec, hits, misses = prec_cat(ground_truth_cat[ground_truth_cat['query'].isin(impacted_queries)],
                              'category', 'No Category Fits', categorized, N=500)
prec

In [None]:
products['category hierarchy'].unique()
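
For reference, the top two levels of a hierarchy string can be pulled apart roughly like this. It is a sketch on made-up rows, assuming levels are separated by `/`; the `category` and `sub_category` columns used above come from the course helpers, which may derive them differently.

```python
import pandas as pd

# Hypothetical hierarchy strings in the same shape as 'category hierarchy'
hierarchy = pd.Series([
    "Furniture / Living Room Furniture / Chairs & Seating / Accent Chairs",
    "Rugs / Area Rugs",
    None,  # a product with no hierarchy
])

# Split into levels; missing hierarchies become a single empty level
parts = hierarchy.fillna("").str.split("/")

# First level is the category, second level is the subcategory
category = parts.str[0].str.strip().replace("", "No Category Fits")
sub_category = parts.str[1].str.strip().fillna("No SubCategory Fits")

print(pd.DataFrame({"category": category, "sub_category": sub_category}))
```

Stripping whitespace here matters because splitting on `/` leaves padding around each level, which would otherwise break exact string comparisons against predicted labels.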

In [None]:
labeled_query_products[labeled_query_products['grade'] == 2][['query', 'category hierarchy', 'grade']]
# What category / subcategory occurs in > 80% of the relevant results for each query