## WANDS - Baseline

This notebook runs a simple BM25 relevance baseline on a tiny set of test queries and their labeled documents, without any OpenAI calls or anything fancy :)

## Setup

### Download the WANDS e-commerce search dataset

[WANDS is a dataset from Wayfair](https://github.com/wayfair/WANDS) for search experimentation.

In [None]:
!git clone https://github.com/wayfair/WANDS.git
!ls WANDS/dataset

Cloning into 'WANDS'...
remote: Enumerating objects: 40, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 40 (delta 7), reused 23 (delta 3), pack-reused 0 (from 0)
Receiving objects: 100% (40/40), 33.32 MiB | 8.20 MiB/s, done.
Resolving deltas: 100% (7/7), done.
Updating files: 100% (19/19), done.
label.csv  product.csv	query.csv


### Install dependencies

* SearchArray - a BM25 index
* Pystemmer - for simple stemming
* openai - for openai API access

In [None]:
!pip install SearchArray==0.0.70 pystemmer openai

Collecting SearchArray==0.0.70
  Downloading searcharray-0.0.70.tar.gz (1.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pystemmer
  Downloading PyStemmer-2.2.0.1.tar.gz (303 kB)
  Preparing metadata (setup.py) ... done
Collecting openai
  Downloading openai-1.45.0-py3-none-any.whl.metadata (22 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.

In [None]:
import pandas as pd
import numpy as np
import Stemmer
import string
from searcharray import SearchArray

### Load WANDS dataset

* products: all the products metadata
* queries: the list of queries
* labels: 0, 1, 2 relevance labels for a subset of products for each query

In [None]:
products = pd.read_csv("WANDS/dataset/product.csv",
                       delimiter="\t")
queries = pd.read_csv("WANDS/dataset/query.csv",
                       delimiter="\t")
labels = pd.read_csv("WANDS/dataset/label.csv",
                      delimiter="\t")

In [None]:
products

Unnamed: 0,product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count
0,0,solid wood platform bed,Beds,Furniture / Bedroom Furniture / Beds & Headboa...,"good , deep sleep can be quite difficult to ha...",overallwidth-sidetoside:64.7|dsprimaryproducts...,15.0,4.5,15.0
1,1,all-clad 7 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,"create delicious slow-cooked meals , from tend...",capacityquarts:7|producttype : slow cooker|pro...,100.0,2.0,98.0
2,2,all-clad electrics 6.5 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,prepare home-cooked meals on any schedule with...,features : keep warm setting|capacityquarts:6....,208.0,3.0,181.0
3,3,all-clad all professional tools pizza cutter,"Slicers, Peelers And Graters",Browse By Brand / All-Clad,this original stainless tool was designed to c...,overallwidth-sidetoside:3.5|warrantylength : l...,69.0,4.5,42.0
4,4,baldwin prestige alcott passage knob with roun...,Door Knobs,Home Improvement / Doors & Door Hardware / Doo...,the hardware has a rich heritage of delivering...,compatibledoorthickness:1.375 '' |countryofori...,70.0,5.0,42.0
...,...,...,...,...,...,...,...,...,...
42989,42989,malibu pressure balanced diverter fixed shower...,Shower Panels,Home Improvement / Bathroom Remodel & Bathroom...,the malibu pressure balanced diverter fixed sh...,producttype : shower panel|spraypattern : rain...,3.0,4.5,2.0
42990,42990,emmeline 5 piece breakfast dining set,Dining Table Sets,Furniture / Kitchen & Dining Furniture / Dinin...,,basematerialdetails : steel| : gray wood|ofhar...,1314.0,4.5,864.0
42991,42991,maloney 3 piece pub table set,Dining Table Sets,Furniture / Kitchen & Dining Furniture / Dinin...,this pub table set includes 1 counter height t...,additionaltoolsrequirednotincluded : power dri...,49.0,4.0,41.0
42992,42992,fletcher 27.5 '' wide polyester armchair,Teen Lounge Furniture|Accent Chairs,Furniture / Living Room Furniture / Chairs & S...,"bring iconic , modern style to your space in a...",legmaterialdetails : rubberwood|backheight-sea...,1746.0,4.5,1226.0


In [None]:
queries

Unnamed: 0,query_id,query,query_class
0,0,salon chair,Massage Chairs
1,1,smart coffee table,Coffee & Cocktail Tables
2,2,dinosaur,Kids Wall Décor
3,3,turquoise pillows,Accent Pillows
4,4,chair and a half recliner,Recliners
...,...,...,...
475,483,rustic twig,Faux Plants and Trees
476,484,nespresso vertuo next premium by breville with...,Espresso Machines
477,485,pedistole sink,Kitchen Sinks
478,486,54 in bench cushion,Furniture Cushions


In [None]:
# Map the text labels to numeric relevance grades
labels.loc[labels['label'] == 'Exact', 'grade'] = 2
labels.loc[labels['label'] == 'Partial', 'grade'] = 1
labels.loc[labels['label'] == 'Irrelevant', 'grade'] = 0
# Attach the query text and query class to each label row
labels = labels.merge(queries, how='left', on='query_id')
labels

Unnamed: 0,id,query_id,product_id,label,grade,query,query_class
0,0,0,25434,Exact,2.0,salon chair,Massage Chairs
1,1,0,12088,Irrelevant,0.0,salon chair,Massage Chairs
2,2,0,42931,Exact,2.0,salon chair,Massage Chairs
3,3,0,2636,Exact,2.0,salon chair,Massage Chairs
4,4,0,42923,Exact,2.0,salon chair,Massage Chairs
...,...,...,...,...,...,...,...
233443,234010,478,15439,Partial,1.0,worn leather office chair,Office Chairs
233444,234011,478,451,Partial,1.0,worn leather office chair,Office Chairs
233445,234012,478,30764,Irrelevant,0.0,worn leather office chair,Office Chairs
233446,234013,478,16796,Partial,1.0,worn leather office chair,Office Chairs


## Downsample to just labeled results per query

To keep this a quick, educational experiment, let's work with only the products that have a relevance label (relevant or irrelevant) for some query.

In [None]:
merged = labels.merge(products, on="product_id", how="inner")
downsample = merged.groupby('product_id').first().drop(columns=['grade', 'id', 'label', 'query_id',
                                                                'query', 'query_class'])
downsample

product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count
0,solid wood platform bed,Beds,Furniture / Bedroom Furniture / Beds & Headboa...,"good , deep sleep can be quite difficult to ha...",overallwidth-sidetoside:64.7|dsprimaryproducts...,15.0,4.5,15.0
1,all-clad 7 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,"create delicious slow-cooked meals , from tend...",capacityquarts:7|producttype : slow cooker|pro...,100.0,2.0,98.0
2,all-clad electrics 6.5 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,prepare home-cooked meals on any schedule with...,features : keep warm setting|capacityquarts:6....,208.0,3.0,181.0
3,all-clad all professional tools pizza cutter,"Slicers, Peelers And Graters",Browse By Brand / All-Clad,this original stainless tool was designed to c...,overallwidth-sidetoside:3.5|warrantylength : l...,69.0,4.5,42.0
4,baldwin prestige alcott passage knob with roun...,Door Knobs,Home Improvement / Doors & Door Hardware / Doo...,the hardware has a rich heritage of delivering...,compatibledoorthickness:1.375 '' |countryofori...,70.0,5.0,42.0
...,...,...,...,...,...,...,...,...
42989,malibu pressure balanced diverter fixed shower...,Shower Panels,Home Improvement / Bathroom Remodel & Bathroom...,the malibu pressure balanced diverter fixed sh...,producttype : shower panel|spraypattern : rain...,3.0,4.5,2.0
42990,emmeline 5 piece breakfast dining set,Dining Table Sets,Furniture / Kitchen & Dining Furniture / Dinin...,,basematerialdetails : steel| : gray wood|ofhar...,1314.0,4.5,864.0
42991,maloney 3 piece pub table set,Dining Table Sets,Furniture / Kitchen & Dining Furniture / Dinin...,this pub table set includes 1 counter height t...,additionaltoolsrequirednotincluded : power dri...,49.0,4.0,41.0
42992,fletcher 27.5 '' wide polyester armchair,Teen Lounge Furniture|Accent Chairs,Furniture / Living Room Furniture / Chairs & S...,"bring iconic , modern style to your space in a...",legmaterialdetails : rubberwood|backheight-sea...,1746.0,4.5,1226.0


## Index downsample (lexical)

This is the sort of lexical indexing we would do in a search engine like Solr, OpenSearch, or Elasticsearch.
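As a rough illustration of what a lexical index stores, here is a minimal inverted-index sketch in plain Python. It is not how SearchArray is implemented; the helper `build_inverted_index` and the three sample documents (with made-up ids) exist only for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency} (hypothetical helper)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index


# Sample product names; the ids are made up for illustration
docs = {0: "solid wood platform bed", 1: "kids desk", 2: "cordova kids desk"}
index = build_inverted_index(docs)

# Looking up a term returns the documents that contain it
print(sorted(index["desk"]))
```

A scoring function such as BM25 then only has to visit the documents listed under each query term instead of scanning every document.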

In [None]:
stemmer = Stemmer.Stemmer('english', maxCacheSize=0)

fold_to_ascii = dict([(ord(x), ord(y)) for x, y in zip(u"‘’´“”–-", u"'''\"\"--")])
punct_trans = str.maketrans({key: ' ' for key in string.punctuation})
all_trans = {**fold_to_ascii, **punct_trans}


def stem_word(word):
    return stemmer.stemWord(word)


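def snowball_tokenizer(text):
    # Missing values arrive as NaN (a float) or None; index them as empty documents
    if text is None or isinstance(text, float):
        return []
    # Fold and strip punctuation, then split on any apostrophes that remain
    text = text.translate(all_trans).replace("'", " ")
    split = text.lower().split()
    return [stem_word(token)
            for token in split]


def ws_tokenizer(text):
    # Lowercased whitespace tokens only (no stemming)
    if text is None or isinstance(text, float):
        return []
    text = text.translate(all_trans)
    split = text.lower().split()
    return split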
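To see what the translation table in the cell above does to real product text, here is a small standalone sketch. It rebuilds the same tables (written with escapes so the folded characters are explicit) and uses a hypothetical helper, `ws_tokens`, that mirrors the whitespace tokenizer for plain strings.

```python
import string

# Same mapping as above: curly quotes, acute accent, and en dash fold to ASCII
fold_to_ascii = {ord(x): ord(y)
                 for x, y in zip("\u2018\u2019\u00b4\u201c\u201d\u2013-", "'''\"\"--")}
# Every ASCII punctuation character becomes a space
punct_trans = str.maketrans({key: ' ' for key in string.punctuation})
all_trans = {**fold_to_ascii, **punct_trans}


def ws_tokens(text):
    # Hypothetical stand-in for the whitespace tokenizer, for strings only
    return text.translate(all_trans).lower().split()


print(ws_tokens("All-Clad 7 qt. slow cooker"))  # ['all', 'clad', '7', 'qt', 'slow', 'cooker']
print(ws_tokens("kid's desk"))                  # ['kid', 's', 'desk']
print(ws_tokens("kid\u2019s desk"))             # ["kid's", 'desk']
```

`str.translate` makes a single pass, so a curly apostrophe is folded to an ASCII apostrophe but not then turned into a space. The Snowball tokenizer covers that case with its extra `.replace("'", " ")`; the whitespace tokenizer does not.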

In [None]:
downsample['product_name_snowball'] = SearchArray.index(downsample['product_name'],
                                                        tokenizer=snowball_tokenizer)
downsample['product_description_snowball'] = SearchArray.index(downsample['product_description'],
                                                               tokenizer=snowball_tokenizer)
downsample['product_class_snowball'] = SearchArray.index(downsample['product_class'],
                                                         tokenizer=snowball_tokenizer)

downsample['product_name_ws'] = SearchArray.index(downsample['product_name'],
                                                  tokenizer=ws_tokenizer)
downsample['product_description_ws'] = SearchArray.index(downsample['product_description'],
                                                         tokenizer=ws_tokenizer)
downsample['product_class_ws'] = SearchArray.index(downsample['product_class'],
                                                   tokenizer=ws_tokenizer)

2024-09-15 14:11:59,538 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2024-09-15 14:11:59,551 - searcharray.indexing - INFO - 0 Batch Start tokenization
2024-09-15 14:11:59,557 - searcharray.indexing - INFO - Tokenizing 42986 documents
2024-09-15 14:11:59,884 - searcharray.indexing - INFO - Tokenized 10000 (23.26338807983995%)
2024-09-15 14:12:00,275 - searcharray.indexing - INFO - Tokenized 20000 (46.5267761596799%)
2024-09-15 14:12:00,693 - searcharray.indexing - INFO - Tokenized 30000 (69.79016423951985%)
2024-09-15 14:12:01,095 - searcharray.indexing - INFO - Tokenized 40000 (93.0535523193598%)
2024-09-15 14:12:01,362 - searcharray.indexing - INFO - Tokenization -- vstacking
2024-09-15 14:12:01,370 - searcharray.indexing - INFO - Tokenization -- DONE
2024-09-15 14:12:01,381 - searcharray.indexing - INFO - Inverting docs->terms
2024-09-15 14:12:01,445 - searcharray.indexing - INFO - Encoding positions to bit array
2024-09-15 14:12:01,511 - searcharray.indexing - INFO - Batch tokenization complete
2024-09-15 14:12:01,517 - searcharray.indexing - INFO - (main thread) Processing 1 batch results
2024-09-15 14:12:01,556 - searcharray.indexing - INFO - Indexing from tokenization complete


2024-09-15 14:12:01,597 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2024-09-15 14:12:01,613 - searcharray.indexing - INFO - 0 Batch Start tokenization
2024-09-15 14:12:01,618 - searcharray.indexing - INFO - Tokenizing 42986 documents
2024-09-15 14:12:02,914 - searcharray.indexing - INFO - Tokenized 10000 (23.26338807983995%)
2024-09-15 14:12:04,265 - searcharray.indexing - INFO - Tokenized 20000 (46.5267761596799%)
2024-09-15 14:12:05,908 - searcharray.indexing - INFO - Tokenized 30000 (69.79016423951985%)
2024-09-15 14:12:08,104 - searcharray.indexing - INFO - Tokenized 40000 (93.0535523193598%)
2024-09-15 14:12:08,793 - searcharray.indexing - INFO - Tokenization -- vstacking
2024-09-15 14:12:08,830 - searcharray.indexing - INFO - Tokenization -- DONE
2024-09-15 14:12:08,866 - searcharray.indexing - INFO - Inverting docs->terms
2024-09-15 14:12:09,470 - searcharray.indexing - INFO - Encoding positions to bit array
2024-09-15 14:12:09,755 - searcharray.indexing - INFO - Batch tokenization complete
2024-09-15 14:12:09,760 - searcharray.indexing - INFO - (main thread) Processing 1 batch results
2024-09-15 14:12:09,862 - searcharray.indexing - INFO - Indexing from tokenization complete


2024-09-15 14:12:09,987 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2024-09-15 14:12:09,999 - searcharray.indexing - INFO - 0 Batch Start tokenization
2024-09-15 14:12:10,011 - searcharray.indexing - INFO - Tokenizing 42986 documents
2024-09-15 14:12:10,234 - searcharray.indexing - INFO - Tokenized 10000 (23.26338807983995%)
2024-09-15 14:12:10,442 - searcharray.indexing - INFO - Tokenized 20000 (46.5267761596799%)
2024-09-15 14:12:10,693 - searcharray.indexing - INFO - Tokenized 30000 (69.79016423951985%)
2024-09-15 14:12:10,946 - searcharray.indexing - INFO - Tokenized 40000 (93.0535523193598%)
2024-09-15 14:12:11,158 - searcharray.indexing - INFO - Tokenization -- vstacking
2024-09-15 14:12:11,163 - searcharray.indexing - INFO - Tokenization -- DONE
2024-09-15 14:12:11,170 - searcharray.indexing - INFO - Inverting docs->terms
2024-09-15 14:12:11,187 - searcharray.indexing - INFO - Encoding positions to bit array
2024-09-15 14:12:11,202 - searcharray.indexing - INFO - Batch tokenization complete
2024-09-15 14:12:11,209 - searcharray.indexing - INFO - (main thread) Processing 1 batch results
2024-09-15 14:12:11,231 - searcharray.indexing - INFO - Indexing from tokenization complete


2024-09-15 14:12:11,245 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2024-09-15 14:12:11,255 - searcharray.indexing - INFO - 0 Batch Start tokenization
2024-09-15 14:12:11,260 - searcharray.indexing - INFO - Tokenizing 42986 documents
2024-09-15 14:12:11,513 - searcharray.indexing - INFO - Tokenized 10000 (23.26338807983995%)
2024-09-15 14:12:11,797 - searcharray.indexing - INFO - Tokenized 20000 (46.5267761596799%)
2024-09-15 14:12:12,038 - searcharray.indexing - INFO - Tokenized 30000 (69.79016423951985%)
2024-09-15 14:12:12,285 - searcharray.indexing - INFO - Tokenized 40000 (93.0535523193598%)
2024-09-15 14:12:12,520 - searcharray.indexing - INFO - Tokenization -- vstacking
2024-09-15 14:12:12,532 - searcharray.indexing - INFO - Tokenization -- DONE
2024-09-15 14:12:12,544 - searcharray.indexing - INFO - Inverting docs->terms
2024-09-15 14:12:12,597 - searcharray.indexing - INFO - Encoding positions to bit array
2024-09-15 14:12:12,661 - searcharray.indexing - INFO - Batch tokenization complete
2024-09-15 14:12:12,667 - searcharray.indexing - INFO - (main thread) Processing 1 batch results
2024-09-15 14:12:12,702 - searcharray.indexing - INFO - Indexing from tokenization complete


2024-09-15 14:12:12,740 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2024-09-15 14:12:12,757 - searcharray.indexing - INFO - 0 Batch Start tokenization
2024-09-15 14:12:12,763 - searcharray.indexing - INFO - Tokenizing 42986 documents
2024-09-15 14:12:13,496 - searcharray.indexing - INFO - Tokenized 10000 (23.26338807983995%)
2024-09-15 14:12:14,214 - searcharray.indexing - INFO - Tokenized 20000 (46.5267761596799%)
2024-09-15 14:12:14,960 - searcharray.indexing - INFO - Tokenized 30000 (69.79016423951985%)
2024-09-15 14:12:15,693 - searcharray.indexing - INFO - Tokenized 40000 (93.0535523193598%)
2024-09-15 14:12:16,367 - searcharray.indexing - INFO - Tokenization -- vstacking
2024-09-15 14:12:16,429 - searcharray.indexing - INFO - Tokenization -- DONE
2024-09-15 14:12:16,474 - searcharray.indexing - INFO - Inverting docs->terms
2024-09-15 14:12:17,765 - searcharray.indexing - INFO - Encoding positions to bit array
2024-09-15 14:12:18,309 - searcharray.indexing - INFO - Batch tokenization complete
2024-09-15 14:12:18,321 - searcharray.indexing - INFO - (main thread) Processing 1 batch results
2024-09-15 14:12:18,582 - searcharray.indexing - INFO - Indexing from tokenization complete


2024-09-15 14:12:19,074 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2024-09-15 14:12:19,122 - searcharray.indexing - INFO - 0 Batch Start tokenization
2024-09-15 14:12:19,129 - searcharray.indexing - INFO - Tokenizing 42986 documents
2024-09-15 14:12:19,834 - searcharray.indexing - INFO - Tokenized 10000 (23.26338807983995%)
2024-09-15 14:12:20,540 - searcharray.indexing - INFO - Tokenized 20000 (46.5267761596799%)
2024-09-15 14:12:21,323 - searcharray.indexing - INFO - Tokenized 30000 (69.79016423951985%)
2024-09-15 14:12:22,096 - searcharray.indexing - INFO - Tokenized 40000 (93.0535523193598%)
2024-09-15 14:12:22,521 - searcharray.indexing - INFO - Tokenization -- vstacking
2024-09-15 14:12:22,531 - searcharray.indexing - INFO - Tokenization -- DONE
2024-09-15 14:12:22,544 - searcharray.indexing - INFO - Inverting docs->terms
2024-09-15 14:12:22,572 - searcharray.indexing - INFO - Encoding positions to bit array
2024-09-15 14:12:22,609 - searcharray.indexing - INFO - Batch tokenization complete
2024-09-15 14:12:22,622 - searcharray.indexing - INFO - (main thread) Processing 1 batch results
2024-09-15 14:12:22,669 - searcharray.indexing - INFO - Indexing from tokenization complete


## Baseline relevance

A naive search over product name and description. For each query term we take the maximum BM25 score across the fields, then sum those per-term maxima to get the document's score.
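For readers who have not met BM25, one common textbook formulation of the per-term score can be sketched as follows. Treat it as an assumption for illustration: the defaults `k1=1.2` and `b=0.75` and the IDF form are conventional choices, the function name `bm25_term_score` is made up here, and SearchArray's own implementation may differ in detail.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one document field (illustrative only)."""
    # Rarer terms (small doc_freq) get a larger IDF
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term frequency saturates; longer-than-average fields are penalized
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)


# Same term, same frequency, different field lengths
short = bm25_term_score(tf=1, doc_len=2, avg_doc_len=6, num_docs=1000, doc_freq=50)
long_ = bm25_term_score(tf=1, doc_len=12, avg_doc_len=6, num_docs=1000, doc_freq=50)
print(short > long_)  # a match in a short field outscores the same match in a long one
```

The length effect is visible in the example search output: the two-word name `kids desk` scores 2.307074 on `desk`, while the three-word `cordova kids desk` scores 2.121599.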

In [None]:
labels

Unnamed: 0,id,query_id,product_id,label,grade,query,query_class
0,0,0,25434,Exact,2.0,salon chair,Massage Chairs
1,1,0,12088,Irrelevant,0.0,salon chair,Massage Chairs
2,2,0,42931,Exact,2.0,salon chair,Massage Chairs
3,3,0,2636,Exact,2.0,salon chair,Massage Chairs
4,4,0,42923,Exact,2.0,salon chair,Massage Chairs
...,...,...,...,...,...,...,...
233443,234010,478,15439,Partial,1.0,worn leather office chair,Office Chairs
233444,234011,478,451,Partial,1.0,worn leather office chair,Office Chairs
233445,234012,478,30764,Irrelevant,0.0,worn leather office chair,Office Chairs
233446,234013,478,16796,Partial,1.0,worn leather office chair,Office Chairs


In [None]:
def labeled_results(scores, query_labels, downsample, debug=None, N=10):
    """Filter to top N labeled results and assign grades."""

    labeled_mask = downsample.index.isin(query_labels['product_id'])
    scores = scores[labeled_mask]
    sorted_idx = np.argsort(scores)[::-1][:N]
    results = downsample[labeled_mask].iloc[sorted_idx].copy()
    if debug:
        for key, dbg_score in debug.items():
            term, field = key
            results[f'debug_{field}_{term}'] = dbg_score[labeled_mask][sorted_idx]
        # Collapse debug cols to dict
        debug_cols = [col for col in results.columns if col.startswith('debug_')]
        results['debug'] = results[debug_cols].apply(lambda row: row.dropna().to_dict(), axis=1)

    results['scores'] = scores[sorted_idx]
    query_id = query_labels['query_id'].iloc[0]
    results = results.merge(labels[labels['query_id'] == query_id][['product_id', 'grade']],
                            how='left',
                            on='product_id')
    # Assign instead of fillna(..., inplace=True), which acts on a temporary under newer pandas
    results['grade'] = results['grade'].fillna(0.001)

    return results

def search_term_centric(query,
                        downsample,
                        labels,
                        fields=["product_name_snowball", "product_description_snowball"],
                        N=10):
    debug = {}
    scores = np.zeros(len(downsample))
    terms_so_far = set()
    for term in snowball_tokenizer(query):
        if term in terms_so_far:
            continue
        terms_so_far.add(term)
        best_term_match = np.zeros(len(downsample))
        for field in fields:
            field_w_boost = field.split('^')
            if len(field_w_boost) == 1:
                boost = 1
                search_field = field
            elif len(field_w_boost) == 2:
                boost = int(field_w_boost[1])
                search_field = field_w_boost[0]

            term_score = downsample[search_field].array.score(term) * boost
            best_term_match = np.maximum(best_term_match, term_score)
            debug[(term, field)] = term_score

        scores += best_term_match

    query_labels = labels[labels['query'] == query]
    return labeled_results(scores, query_labels, downsample, debug=debug, N=N)


search_term_centric("desk for kids", downsample, labels)

Unnamed: 0,product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count,product_name_snowball,...,product_class_ws,debug_product_name_snowball_desk,debug_product_description_snowball_desk,debug_product_name_snowball_for,debug_product_description_snowball_for,debug_product_name_snowball_kid,debug_product_description_snowball_kid,debug,scores,grade
0,19145,osterley 31.5 '' w writing desk and chair set,Kids Desks,Baby & Kids / Toddler & Kids Bedroom Furniture...,height adjustable wooden student desk and chai...,overallproductweight:49.6|overallwidth-sidetos...,2.0,4.0,2.0,"Terms({'w', 'and', '31', 'write', 'desk', 'set...",...,"Terms({'kids', 'desks'})",1.431226,2.997029,0.0,0.332299,0.0,2.190823,{'debug_product_name_snowball_desk': 1.4312262...,5.52015,2.0
1,17053,kids desk,Kids Chairs,Baby & Kids / Toddler & Kids Playroom / Playro...,meals are an important part of family time . e...,seatingtype : chair|overallwidth-sidetoside:18...,,,,"Terms({'desk', 'kid'})",...,"Terms({'chairs', 'kids'})",2.307074,0.0,0.0,0.0,3.058081,0.0,{'debug_product_name_snowball_desk': 2.3070743...,5.365156,2.0
2,18018,cordova kids desk,Kids Chairs,Baby & Kids / Toddler & Kids Playroom / Playro...,this office chair is perfect for the kids and ...,overallheight-toptobottom:34|purposefuldistres...,5.0,5.0,5.0,"Terms({'desk', 'cordova', 'kid'})",...,"Terms({'chairs', 'kids'})",2.121599,1.607222,0.0,0.419173,2.81223,1.781645,{'debug_product_name_snowball_desk': 2.1215991...,5.353002,0.0
3,8434,brister kids desk,Kids Chairs,Baby & Kids / Toddler & Kids Playroom / Playro...,if you are looking for a proper chair set for ...,overallproductweight:7.5|seatdepth-fronttoback...,,,,"Terms({'brister', 'desk', 'kid'})",...,"Terms({'chairs', 'kids'})",2.121599,0.0,0.0,0.417697,2.81223,0.0,{'debug_product_name_snowball_desk': 2.1215991...,5.351526,0.0
4,25007,monarch hill haven kids 47.48 '' w writing des...,Kids Desks,Sale / Closeout / Toddler & Kids Bedroom Furni...,it ’ s what every intrepid explorer dreams of ...,estimatedtimetosetup:120|additionaltoolsrequir...,80.0,4.5,65.0,"Terms({'47', 'w', 'drawer', 'hill', 'with', '4...",...,"Terms({'kids', 'desks'})",1.230949,2.42737,0.0,0.355727,1.631652,2.541056,{'debug_product_name_snowball_desk': 1.2309492...,5.324153,2.0
5,17416,willette kids study desk,Kids Desks,Baby & Kids / Toddler & Kids Bedroom Furniture...,this kids ' complete desk system is designed t...,shelfheight:0.59|overallwidth-sidetoside:43|to...,51.0,5.0,35.0,"Terms({'desk', 'willett', 'kid', 'studi'})",...,"Terms({'kids', 'desks'})",1.963727,2.273478,0.0,0.417697,2.602966,2.154607,{'debug_product_name_snowball_desk': 1.9637268...,5.294141,2.0
6,38997,keighley kids desk,Kids Chairs,Baby & Kids / Toddler & Kids Playroom / Playro...,this article offers excellent design and quali...,seatingtype : chair|overallproductweight:6|col...,3.0,4.5,3.0,"Terms({'desk', 'kid', 'keighley'})",...,"Terms({'chairs', 'kids'})",2.121599,0.0,0.0,0.342826,2.81223,0.0,{'debug_product_name_snowball_desk': 2.1215991...,5.276655,0.0
7,755,alirra kids desk,Kids Chairs,Baby & Kids / Toddler & Kids Playroom / Playro...,this article offers excellent design and quali...,overallproductweight:24|overallheight-toptobot...,,,,"Terms({'desk', 'alirra', 'kid'})",...,"Terms({'chairs', 'kids'})",2.121599,0.0,0.0,0.342826,2.81223,0.0,{'debug_product_name_snowball_desk': 2.1215991...,5.276655,0.0
8,754,arias kids desk,Kids Chairs,Baby & Kids / Toddler & Kids Playroom / Playro...,this article offers excellent design and quali...,overalldepth-fronttoback:12|overallwidth-sidet...,,,,"Terms({'desk', 'aria', 'kid'})",...,"Terms({'chairs', 'kids'})",2.121599,0.0,0.0,0.341431,2.81223,0.0,{'debug_product_name_snowball_desk': 2.1215991...,5.27526,0.0
9,9611,sesame street kids desk chair with storage com...,Kids Chairs|Licensed Products,Baby & Kids / Toddler & Kids Playroom / Playro...,add a functional appeal to your kid ’ s room w...,seatheight-floortoseat:10|productcare : wipe c...,344.0,5.0,259.0,"Terms({'cup', 'compart', 'and', 'with', 'stora...",...,"Terms({'chairs', 'licensed', 'products', 'kids'})",1.291176,2.219154,0.0,0.333629,1.711484,2.648081,{'debug_product_name_snowball_desk': 1.2911757...,5.200864,2.0
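The debug columns in the output above make the scoring easy to check by hand. The sketch below copies the per-field scores for the first two rows and recomputes the total as "best field per term, summed over terms". The dictionary layout and the helper `term_centric_score` are made up for this check; the numbers are taken from the table.

```python
# (name score, description score) per stemmed query term, copied from the debug columns
debug_scores = {
    19145: {"desk": (1.431226, 2.997029), "for": (0.0, 0.332299), "kid": (0.0, 2.190823)},
    17053: {"desk": (2.307074, 0.0), "for": (0.0, 0.0), "kid": (3.058081, 0.0)},
}
# Values of the `scores` column for the same two rows
reported = {19145: 5.52015, 17053: 5.365156}


def term_centric_score(per_term):
    # Take the best-scoring field for each term, then sum over terms
    return sum(max(field_scores) for field_scores in per_term.values())


for product_id, per_term in debug_scores.items():
    computed = term_centric_score(per_term)
    print(product_id, abs(computed - reported[product_id]) < 1e-5)
```

Both rows agree with the reported score to within rounding. Product 19145 gets most of its score from the description, while 17053 (`kids desk`) scores entirely on the name.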


## Issue our searches to generate baseline

In [None]:
NUM_QUERIES = 10
# Join each label row to its query and product metadata
query_products_labeled = labels.merge(queries, on='query_id', how='left').merge(products, on='product_id', how='left')

test_queries = ['desk for kids', 'jordanna solid wood rocking', 'oriental vanity',
       'nectar queen mattress',
       'bedroom wall decor floral, multicolored with some teal (prints)',
       '48 in entry table with side by side drawer', 'alyse 8 light',
       'kari 2 piece', 'tufted upholstered bed diamond']

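print(labels.columns)
# Product IDs that have a label for any of the test queries
product_ids = labels[labels['query'].isin(test_queries)]['product_id']
# Note: this filters on the label row `id`, not on `product_id`; `product_id` may be the intended column
query_products_labeled = query_products_labeled[query_products_labeled['id'].isin(product_ids)]
print(len(query_products_labeled))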


def search_all(labels, queries, downsample, search_fn):
    all_results = []

    for query in queries:
        results = search_fn(query, downsample, labels)
        results['rank'] = np.arange(len(results)) + 1
        results['query'] = query
        all_results.append(results)

    return pd.concat(all_results)

def search_baseline(query, downsample, labels):
    return search_term_centric(query, downsample, labels,
                               fields=['product_name_snowball',
                                       'product_description_snowball'])

results_baseline = search_all(labels,
                              test_queries,
                              downsample, search_baseline)
results_baseline

Index(['id', 'query_id', 'product_id', 'label', 'grade', 'query',
       'query_class'],
      dtype='object')
6021


Unnamed: 0,product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count,product_name_snowball,...,debug_product_name_snowball_piec,debug_product_description_snowball_piec,debug_product_name_snowball_tuft,debug_product_description_snowball_tuft,debug_product_name_snowball_upholst,debug_product_description_snowball_upholst,debug_product_name_snowball_bed,debug_product_description_snowball_bed,debug_product_name_snowball_diamond,debug_product_description_snowball_diamond
0,19145,osterley 31.5 '' w writing desk and chair set,Kids Desks,Baby & Kids / Toddler & Kids Bedroom Furniture...,height adjustable wooden student desk and chai...,overallproductweight:49.6|overallwidth-sidetos...,2.0,4.0,2.0,"Terms({'w', 'and', '31', 'write', 'desk', 'set...",...,,,,,,,,,,
1,17053,kids desk,Kids Chairs,Baby & Kids / Toddler & Kids Playroom / Playro...,meals are an important part of family time . e...,seatingtype : chair|overallwidth-sidetoside:18...,,,,"Terms({'desk', 'kid'})",...,,,,,,,,,,
2,18018,cordova kids desk,Kids Chairs,Baby & Kids / Toddler & Kids Playroom / Playro...,this office chair is perfect for the kids and ...,overallheight-toptobottom:34|purposefuldistres...,5.0,5.0,5.0,"Terms({'desk', 'cordova', 'kid'})",...,,,,,,,,,,
3,8434,brister kids desk,Kids Chairs,Baby & Kids / Toddler & Kids Playroom / Playro...,if you are looking for a proper chair set for ...,overallproductweight:7.5|seatdepth-fronttoback...,,,,"Terms({'brister', 'desk', 'kid'})",...,,,,,,,,,,
4,25007,monarch hill haven kids 47.48 '' w writing des...,Kids Desks,Sale / Closeout / Toddler & Kids Bedroom Furni...,it ’ s what every intrepid explorer dreams of ...,estimatedtimetosetup:120|additionaltoolsrequir...,80.0,4.5,65.0,"Terms({'47', 'w', 'drawer', 'hill', 'with', '4...",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5,14834,abbot diamond queen upholstered standard bed,Beds,Furniture / Bedroom Furniture / Beds & Headboa...,manifest the extraordinary in your bedroom dec...,warrantylength:1 year|overallproductweight:82|...,90.0,4.5,63.0,"Terms({'diamond', 'queen', 'standard', 'abbot'...",...,,,0.000000,1.522166,1.554309,1.467250,1.383881,1.632850,3.163621,3.048794
6,33376,melonie tufted upholstered standard bed,Beds,Furniture / Bedroom Furniture / Beds & Headboa...,from its arch headboard to the elegant diamond...,overallwidth-sidetoside:78|overallheight-topto...,,,,"Terms({'tuft', 'meloni', 'standard', 'bed', 'u...",...,,,1.791392,2.136845,1.661958,1.536990,1.479726,1.257240,0.000000,2.347469
7,40512,acamar upholstered standard bed,Beds,Furniture / Bedroom Furniture / Beds & Headboa...,this wingback bed sets a modern feel with its ...,overallheight-toptobottom:56|weightcapacity:50...,115.0,4.5,91.0,"Terms({'upholst', 'standard', 'acamar', 'bed'})",...,,,0.000000,1.522166,1.785627,1.996178,1.589835,1.855856,0.000000,2.240954
8,8360,rita tufted upholstered low profile standard bed,Beds,Furniture / Bedroom Furniture / Beds & Headboa...,a wingback design pair with diamond-tufted det...,headboarddepth-fronttoback:10|overalllength-he...,144.0,4.5,90.0,"Terms({'tuft', 'standard', 'rita', 'profil', '...",...,,,1.573444,1.748894,1.459757,0.000000,1.299697,1.790775,0.000000,2.574747


In [None]:
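def ndcg_m(results):
    # Normalizer: 2**2 = 4 at each of the top 10 positions, discounted by ln(rank + 1).
    # Gains use 2**grade - 1 (3 for grade 2), so a perfect top 10 scores 0.75, not 1.0.
    max_dcg = np.sum((2**(np.ones(10) + 1)) / np.log(np.arange(10) + 2))
    # Adds a 'gain' column to the caller's DataFrame in place.
    results['gain'] = (2**results['grade'] - 1) / np.log(results['rank'] + 1)
    dcgs = results.groupby('query')['gain'].sum()
    dcgs = dcgs.sort_values()
    return dcgs / max_dcg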

ndcg_ms_baseline = ndcg_m(results_baseline)
ndcg_ms_baseline.mean(), ndcg_ms_baseline

(0.3236988733393219,
 query
 bedroom wall decor floral, multicolored with some teal (prints)    0.094809
 kari 2 piece                                                       0.199784
 nectar queen mattress                                              0.250000
 oriental vanity                                                    0.250000
 alyse 8 light                                                      0.360046
 jordanna solid wood rocking                                        0.360046
 48 in entry table with side by side drawer                         0.366825
 desk for kids                                                      0.439587
 tufted upholstered bed diamond                                     0.592192
 Name: gain, dtype: float64)
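
The scores above are easier to read once the normalizer is worked through by hand. The following is a minimal sketch, not part of the original notebook, that recomputes the pieces of ndcg_m for a hypothetical perfect top 10 (ten grade-2 results). It assumes grades top out at 2, as in the labels shown here.

```python
import numpy as np

ranks = np.arange(1, 11)          # positions 1..10
discount = np.log(ranks + 1)      # natural log, same discount as ndcg_m

# Normalizer used by ndcg_m: 2**2 = 4 at every position (no "- 1").
max_dcg = np.sum(2 ** (np.ones(10) + 1) / discount)

# DCG of a hypothetical perfect top 10: grade 2 everywhere, gain 2**2 - 1 = 3.
perfect_dcg = np.sum((2 ** np.full(10, 2.0) - 1) / discount)

ceiling = perfect_dcg / max_dcg   # the best score ten grade-2 results can reach
rank1_gain = 3 / np.log(2)        # gain of a grade-2 result at rank 1

print(ceiling)
print(rank1_gain)
```

Under these assumptions the ceiling is 0.75 rather than 1.0, so the values reported by ndcg_m are best compared against each other (or against the ideal ordering) rather than read as a conventional NDCG.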

### Get full baseline

For comprehensive eval, get a full baseline with every query

In [None]:
results_baseline_full = search_all(labels,
                                   labels['query'].unique(),
                                   downsample, search_baseline)
results_baseline_full

Unnamed: 0,product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count,product_name_snowball,...,debug_product_name_snowball_drudg,debug_product_description_snowball_drudg,debug_product_name_snowball_report,debug_product_description_snowball_report,debug_product_name_snowball_pedistol,debug_product_description_snowball_pedistol,debug_product_name_snowball_to,debug_product_description_snowball_to,debug_product_name_snowball_fireplac,debug_product_description_snowball_fireplac
0,15612,massage chair,Massage Chairs,Furniture / Living Room Furniture / Chairs & S...,features heavy duty steel frame . premium chro...,overallheight-toptobottom:35.5|productcare : d...,59.0,4.5,50.0,"Terms({'massag', 'chair'})",...,,,,,,,,,,
1,7465,hair salon chair,Massage Chairs|Recliners,Furniture / Living Room Furniture / Chairs & S...,offers a wide selection of professional salon ...,fauxleathertype : pu|legheight-toptobottom:18|...,69.0,4.5,53.0,"Terms({'salon', 'hair', 'chair'})",...,,,,,,,,,,
2,7467,reclining faux leather massage chair,Massage Chairs|Recliners,Furniture / Living Room Furniture / Chairs & S...,offers a wide selection of professional beauty...,cushionfillmaterial : foam|minimumdoorwidth-si...,2.0,3.5,1.0,"Terms({'faux', 'leather', 'reclin', 'chair', '...",...,,,,,,,,,,
3,7466,reclining massage chair,Massage Chairs|Recliners,Furniture / Living Room Furniture / Chairs & S...,offers a wide selection of professional beauty...,overallproductweight:53|upholsterycolor : yell...,2.0,2.0,2.0,"Terms({'reclin', 'massag', 'chair'})",...,,,,,,,,,,
4,7468,mercer41 hair salon chair hydraulic styling ch...,Massage Chairs,Furniture / Living Room Furniture / Chairs & S...,mercer41 beauty offers a wide selection profes...,seatfillmaterial : foam|waterrepellant : no re...,1.0,5.0,1.0,"Terms({'hydraul', 'mercer41', 'style', 'salon'...",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8,30566,dyar log rack,Log Storage,Storage & Organization / Garage & Outdoor Stor...,keep the log rack in place with the symple stu...,warrantylength:1 year|woodconstructiontype : o...,116.0,4.5,84.0,"Terms({'log', 'rack', 'dyar'})",...,,,,,,,,,,
9,16930,mccandless 5 bottle wall mounted wine bottle a...,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,thinking of enjoying a glass of your favorite ...,dswoodtone : medium wood|fullorlimitedwarranty...,261.0,4.5,159.0,"Terms({'mount', 'rack', 'wine', 'bottl', 'and'...",...,,,,,,,,,,
0,40258,2 piece adoria hanging picture frame set,Picture Frames,Décor & Pillows / Picture Frames & Albums / Al...,use this photo collage holder to create a pers...,picturesize:3.5 '' x 5 '' |warrantylength:1 ye...,22.0,4.5,13.0,"Terms({'hang', 'frame', 'adoria', 'piec', 'set...",...,,,,,,,0.000000,0.332776,0.0,0.0
1,40171,giddings family theme wall hanging 8 opening p...,Picture Frames,Décor & Pillows / Picture Frames & Albums / Al...,instantly transform a bare wall into a gallery...,overallheight-toptobottom:17|orientation : hor...,130.0,4.0,77.0,"Terms({'hang', 'frame', 'theme', 'wall', 'open...",...,,,,,,,0.000000,0.183112,0.0,0.0


In [None]:
# ndcg_m is the same function defined in the earlier cell; reuse it for the full baseline.

ndcg_ms_baseline_full = ndcg_m(results_baseline_full)
ndcg_ms_baseline_full.mean(), ndcg_ms_baseline_full


(0.43300707548366063,
 query
 star wars rug                 0.000000
 drudge report                 0.000000
 merlyn 6                      0.019600
 milk cow chair                0.027511
 carpet 5x6                    0.036163
                                 ...   
 garage sports storage rack    0.750000
 mattress foam topper queen    0.753473
 bathroom single faucet        0.796045
 waterfall faucet              0.796045
 outdoor seat/back cushion     1.124548
 Name: gain, Length: 480, dtype: float64)
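
With grades capped at 2 and ten ranked rows per query, the score from ndcg_m cannot exceed 0.75, so a value such as 1.124548 suggests that query contributed more than ten rows to the sum. That is an inference, not something the notebook confirms. A small diagnostic along these lines could flag such queries; the toy frame and the EXPECTED_ROWS name are hypothetical stand-ins for results_baseline_full and the top-10 cutoff.

```python
import pandas as pd

# Hypothetical results frame: query "a" has the expected 3 rows, "b" has one extra.
toy = pd.DataFrame({
    "query": ["a", "a", "a", "b", "b", "b", "b"],
    "rank":  [1, 2, 3, 1, 2, 3, 4],
})
EXPECTED_ROWS = 3  # stands in for the top-10 cutoff used in the notebook

# Count ranked rows per query and keep only the queries that exceed the cutoff.
rows_per_query = toy.groupby("query").size()
oversized = rows_per_query[rows_per_query > EXPECTED_ROWS]
print(oversized)
```

Running the same group-and-count on the real results frame would show whether the outlier queries simply have extra rows.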

## Confirm our evaluation works

Sanity check: every labeled product for each test query should exist in the downsample index, and every result in the baseline top 10 should appear in that query's labels.

In [None]:
for query in test_queries:
    # Every labeled product for this query must exist in the downsample index.
    product_ids = labels[labels['query'] == query]['product_id']
    for product_id in product_ids:
        assert product_id in downsample.index

    top10 = results_baseline[results_baseline['query'] == query]
    labeled_ids = set(labels[labels['query'] == query]['product_id'])
    for _, result_row in top10.iterrows():
        # A grade strictly between 0 and 1 (presumably a placeholder for unlabeled
        # results) must not be in the labels; every other result must be.
        if 0 < result_row['grade'] < 1:
            assert result_row['product_id'] not in labeled_ids
        else:
            assert result_row['product_id'] in labeled_ids

## Save downsample, etc

In [None]:
from google.colab import drive

drive.mount('/content/drive/')

CACHE_PATH = "/content/drive/MyDrive/wands-gar/"
!mkdir -p {CACHE_PATH}

# Reuse CACHE_PATH so the output location is defined in one place.
results_baseline_full.to_pickle(CACHE_PATH + '0.results_baseline_full.pkl')
results_baseline.to_pickle(CACHE_PATH + '0.results_baseline.pkl')
downsample.to_pickle(CACHE_PATH + '0.downsample.pkl')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
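
A later notebook can read these files back with pd.read_pickle. The sketch below round-trips a small stand-in frame through a temporary directory so it runs outside Colab; the frame contents and the temporary path are illustrative assumptions, not the real cached data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for results_baseline; the real frame has many more columns.
frame = pd.DataFrame({"query": ["desk for kids"], "rank": [1], "grade": [2.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "0.results_baseline.pkl"
    frame.to_pickle(path)            # same call the notebook uses to save
    restored = pd.read_pickle(path)  # how a follow-up notebook would load it

print(restored.equals(frame))
```

Pickle files should only be loaded from locations you trust, since unpickling can execute arbitrary code.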


## Analyze best results per query

In [None]:
best_results = (labels[labels['query'].isin(test_queries)]
                .sort_values(['query', 'grade'], ascending=[True, False])
                .groupby('query_id').head(10)
                .merge(downsample, how='left', on='product_id'))

best_results['rank'] = best_results.groupby('query_id').cumcount() + 1
best_results

Unnamed: 0,id,query_id,product_id,label,grade,query,query_class,product_name,product_class,category hierarchy,...,rating_count,average_rating,review_count,product_name_snowball,product_description_snowball,product_class_snowball,product_name_ws,product_description_ws,product_class_ws,rank
0,13194,101,1860,Exact,2.0,48 in entry table with side by side drawer,Sofa & Console Tables,stronghurst 48 '' solid wood console table,Desks|Sofa & Console Tables,Furniture / Living Room Furniture / Console Ta...,...,448.0,4.5,319.0,"Terms({'wood', '48', 'stronghurst', 'tabl', 's...","Terms({'this', 'all', 'sit', 'to', 'with', 'eq...","Terms({'sofa', 'desk', 'tabl', 'consol'})","Terms({'table', 'console', 'wood', '48', 'stro...","Terms({'this', 'all', 'lets', 'to', 'with', 'n...","Terms({'sofa', 'tables', 'console', 'desks'})",1
1,13195,101,32892,Exact,2.0,48 in entry table with side by side drawer,Sofa & Console Tables,stroud metal console table,Sofa & Console Tables,Furniture / Living Room Furniture / Console Ta...,...,256.0,4.5,153.0,"Terms({'metal', 'tabl', 'consol', 'stroud'})","Terms({'this', 'manufactur', 'addit', 'to', 'w...","Terms({'sofa', 'tabl', 'consol'})","Terms({'table', 'metal', 'console', 'stroud'})","Terms({'this', 'variety', 'to', 'with', 'neutr...","Terms({'sofa', 'tables', 'console'})",2
2,75900,101,32907,Exact,2.0,48 in entry table with side by side drawer,Sofa & Console Tables,bautista 48 '' console table,Sofa & Console Tables,Furniture / Living Room Furniture / Console Ta...,...,209.0,4.5,131.0,"Terms({'tabl', 'bautista', '48', 'consol'})","Terms({'this', 'addit', 'add', 'yet', 'mdf', '...","Terms({'sofa', 'tabl', 'consol'})","Terms({'table', 'bautista', 'console', '48'})","Terms({'this', 'add', 'yet', 'mdf', 'to', 'wit...","Terms({'sofa', 'tables', 'console'})",3
3,75921,101,26500,Exact,2.0,48 in entry table with side by side drawer,Sofa & Console Tables,trinidad 48 '' console table,Sofa & Console Tables,Furniture / Living Room Furniture / Console Ta...,...,461.0,4.5,283.0,"Terms({'tabl', '48', 'consol', 'trinidad'})","Terms({'this', 'manufactur', 'but', 'add', 'to...","Terms({'sofa', 'tabl', 'consol'})","Terms({'table', 'console', '48', 'trinidad'})","Terms({'this', 'but', 'add', 'to', 'with', ""'""...","Terms({'sofa', 'tables', 'console'})",4
4,13070,101,700,Partial,1.0,48 in entry table with side by side drawer,Sofa & Console Tables,2 - drawer end table,End Tables,Furniture / Living Room Furniture / Coffee Tab...,...,2.0,2.5,2.0,"Terms({'end', 'tabl', 'drawer', '2'})",Terms(set()),"Terms({'end', 'tabl'})","Terms({'table', 'end', 'drawer', '2'})",Terms(set()),"Terms({'end', 'tables'})",5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,12964,100,42880,Exact,2.0,tufted upholstered bed diamond,Beds,agnese tufted upholstered low profile platform...,Beds,Furniture / Bedroom Furniture / Beds & Headboa...,...,1048.0,5.0,721.0,"Terms({'platform', 'tuft', 'profil', 'bed', 'u...","Terms({'hassl', 'centerpiec', 'with', 'sophist...",Terms({'bed'}),"Terms({'platform', 'upholstered', 'tufted', 'b...","Terms({'confidence', 'with', 'impressive', 'to...",Terms({'beds'}),6
86,12965,100,17735,Exact,2.0,tufted upholstered bed diamond,Beds,aileen tufted upholstered low profile platform...,Beds,Furniture / Bedroom Furniture / Beds & Headboa...,...,668.0,4.5,437.0,"Terms({'platform', 'tuft', 'aileen', 'profil',...","Terms({'this', 'upholsteri', 'engin', 'tv', 'r...",Terms({'bed'}),"Terms({'platform', 'aileen', 'upholstered', 't...","Terms({'tv', 'with', 'has', 'central', 'full',...",Terms({'beds'}),7
87,12968,100,40521,Exact,2.0,tufted upholstered bed diamond,Beds,alcantara diamond tufted upholstered standard bed,Beds,Furniture / Bedroom Furniture / Beds & Headboa...,...,2047.0,4.5,1445.0,"Terms({'alcantara', 'tuft', 'diamond', 'standa...","Terms({'this', 'manufactur', '100', 'most', 'a...",Terms({'bed'}),"Terms({'alcantara', 'diamond', 'standard', 'up...","Terms({'this', '100', 'most', 'all', 'importan...",Terms({'beds'}),8
88,12970,100,8828,Exact,2.0,tufted upholstered bed diamond,Beds,ammara tufted upholstered standard bed,Beds,Furniture / Bedroom Furniture / Beds & Headboa...,...,,,,"Terms({'tuft', 'standard', 'ammara', 'bed', 'u...","Terms({'this', 'tast', 'to', 'with', 'distinct...",Terms({'bed'}),"Terms({'standard', 'ammara', 'upholstered', 't...","Terms({'this', 'to', 'with', 'distinct', 'or',...",Terms({'beds'}),9


In [None]:
ndcg_ms_ideal = ndcg_m(best_results)
ndcg_ms_ideal.mean(), ndcg_ms_ideal

(0.4774549302739514,
 query
 nectar queen mattress                                              0.250000
 oriental vanity                                                    0.250000
 alyse 8 light                                                      0.360046
 jordanna solid wood rocking                                        0.360046
 kari 2 piece                                                       0.360046
 48 in entry table with side by side drawer                         0.531894
 bedroom wall decor floral, multicolored with some teal (prints)    0.685062
 desk for kids                                                      0.750000
 tufted upholstered bed diamond                                     0.750000
 Name: gain, dtype: float64)

In [None]:
(ndcg_ms_baseline / ndcg_ms_ideal).sort_values()

query,gain
"bedroom wall decor floral, multicolored with some teal (prints)",0.138395
kari 2 piece,0.554886
desk for kids,0.586116
48 in entry table with side by side drawer,0.689659
tufted upholstered bed diamond,0.789589
alyse 8 light,1.0
jordanna solid wood rocking,1.0
nectar queen mattress,1.0
oriental vanity,1.0
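
The division above works even though the two Series are sorted differently, because pandas aligns on the query index before dividing. A minimal sketch using two of the values reported earlier in this notebook:

```python
import pandas as pd

# Two queries, deliberately listed in different orders in each Series.
baseline = pd.Series({"desk for kids": 0.439587, "kari 2 piece": 0.199784}, name="gain")
ideal = pd.Series({"kari 2 piece": 0.360046, "desk for kids": 0.750000}, name="gain")

# Division matches entries by index label, not by position.
ratio = (baseline / ideal).sort_values()
print(ratio)
```

A ratio of 1.0 means the baseline already matches the best achievable ordering for that query under this metric.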


In [None]:
best_results[best_results['query'] == 'desk for kids'][['product_name', 'product_description', 'grade', 'product_id']]

Unnamed: 0,product_name,product_description,grade,product_id
30,42.8 `` simple student table kids desk white w...,"made of high-quality material , with drawers a...",2.0,22465
31,aadhya kids study desk,curate a cozy and organized workspace for your...,2.0,38525
32,abella kids writing desk,the features a truly unique finish reminiscent...,2.0,24327
33,adalyn kids study 35.44 '' writing desk,bring a streamlined style to your home office ...,2.0,42740
34,alessia kids study writing desk,this desk is an ideal pick whether your child ...,2.0,17397
35,avalon kids 41.6 '' writing desk with hutch an...,the avalon kids 41.6 '' writing desk with hutc...,2.0,21865
36,barfield 34.25 '' w art desk and chair set,"nature makes life full of poetry , the sun is ...",2.0,20953
37,betsy 31.5 '' w writing desk and chair set,,2.0,6974
38,biergh wooden 31 '' writing desk and chair set,,2.0,41138
39,bisa kids study 36 '' writing desk and chair set,"whether your child needs to work on homework ,...",2.0,38433


In [None]:
best_results[(best_results['query'] == 'desk for kids')]

Unnamed: 0,id,query_id,product_id,label,grade,query,query_class,product_name,product_class,category hierarchy,...,average_rating,review_count,product_name_snowball,product_description_snowball,product_class_snowball,product_name_ws,product_description_ws,product_class_ws,rank,gain
30,47209,441,22465,Exact,2.0,desk for kids,Kids Desks,42.8 `` simple student table kids desk white w...,Kids Desks,Baby & Kids / Toddler & Kids Bedroom Furniture...,...,5.0,4.0,"Terms({'drawer', 'with', 'kid', 'desk', 'stude...","Terms({'with', 'storag', 'for', 'set', 'puzzl'...","Terms({'desk', 'kid'})","Terms({'simple', 'table', 'kids', 'with', 'dra...","Terms({'with', 'for', 'set', 'high', 'working'...","Terms({'kids', 'desks'})",1,4.328085
31,47210,441,38525,Exact,2.0,desk for kids,Kids Desks,aadhya kids study desk,Kids Desks,Baby & Kids / Toddler & Kids Bedroom Furniture...,...,4.0,86.0,"Terms({'desk', 'kid', 'studi', 'aadhya'})","Terms({'this', 'sure', 'to', 'with', 'for', 's...","Terms({'desk', 'kid'})","Terms({'desk', 'study', 'aadhya', 'kids'})","Terms({'this', 'sure', 'workspace', 'to', 'wit...","Terms({'kids', 'desks'})",2,2.730718
32,47212,441,24327,Exact,2.0,desk for kids,Kids Desks,abella kids writing desk,Kids Tables and Sets,Baby & Kids / Toddler & Kids Playroom / Playro...,...,5.0,4.0,"Terms({'desk', 'write', 'kid', 'abella'})","Terms({'with', 'leav', 'was', 'his', 'provid',...","Terms({'and', 'set', 'tabl', 'kid'})","Terms({'desk', 'writing', 'abella', 'kids'})","Terms({'with', 'professional', 'was', 'lodges'...","Terms({'and', 'sets', 'tables', 'kids'})",3,2.164043
33,47214,441,42740,Exact,2.0,desk for kids,Kids Desks,adalyn kids study 35.44 '' writing desk,Kids Desks,Baby & Kids / Toddler & Kids Bedroom Furniture...,...,4.0,86.0,"Terms({'write', 'kid', 'adalyn', '44', 'desk',...","Terms({'this', 'glide', 'engin', 'plenti', 'to...","Terms({'desk', 'kid'})","Terms({'kids', 'adalyn', '44', 'desk', 'writin...","Terms({'this', 'plenty', 'to', 'provides', 'wi...","Terms({'kids', 'desks'})",4,1.864005
34,47215,441,17397,Exact,2.0,desk for kids,Kids Desks,alessia kids study writing desk,Kids Desks,Baby & Kids / Toddler & Kids Bedroom Furniture...,...,4.0,16.0,"Terms({'alessia', 'kid', 'desk', 'studi', 'wri...","Terms({'engin', 'project', 'next', 'crayon', '...","Terms({'desk', 'kid'})","Terms({'kids', 'alessia', 'desk', 'writing', '...","Terms({'project', 'plenty', 'next', 'their', '...","Terms({'kids', 'desks'})",5,1.674332
35,47224,441,21865,Exact,2.0,desk for kids,Kids Desks,avalon kids 41.6 '' writing desk with hutch an...,Kids Desks,Baby & Kids / Toddler & Kids Bedroom Furniture...,...,4.5,567.0,"Terms({'6', 'avalon', '41', 'and', 'with', 'ki...","Terms({'mdf', 'with', 'has', 'stabil', 'blend'...","Terms({'desk', 'kid'})","Terms({'6', '41', 'kids', 'and', 'with', 'desk...","Terms({'materials', 'mdf', 'with', 'ensures', ...","Terms({'kids', 'desks'})",6,1.541695
36,47226,441,20953,Exact,2.0,desk for kids,Kids Desks,barfield 34.25 '' w art desk and chair set,Kids Desks,Baby & Kids / Toddler & Kids Bedroom Furniture...,...,5.0,4.0,"Terms({'w', 'and', '34', 'desk', 'set', '25', ...","Terms({'such', 'sun', 'interior', 'fragranc', ...","Terms({'desk', 'kid'})","Terms({'w', 'and', '34', 'desk', 'set', '25', ...","Terms({'such', 'sun', 'interior', 'flowing', '...","Terms({'kids', 'desks'})",7,1.442695
37,47230,441,6974,Exact,2.0,desk for kids,Kids Desks,betsy 31.5 '' w writing desk and chair set,Kids Desks,Baby & Kids / Toddler & Kids Bedroom Furniture...,...,3.0,6.0,"Terms({'w', 'and', '31', 'write', 'betsi', 'de...",Terms(set()),"Terms({'desk', 'kid'})","Terms({'w', 'betsy', 'and', '31', 'desk', 'set...",Terms(set()),"Terms({'kids', 'desks'})",8,1.365359
38,47231,441,41138,Exact,2.0,desk for kids,Kids Desks,biergh wooden 31 '' writing desk and chair set,Kids Desks,Baby & Kids / Toddler & Kids Bedroom Furniture...,...,4.5,3.0,"Terms({'wooden', 'and', '31', 'biergh', 'desk'...",Terms(set()),"Terms({'desk', 'kid'})","Terms({'wooden', 'and', '31', 'biergh', 'desk'...",Terms(set()),"Terms({'kids', 'desks'})",9,1.302883
39,47232,441,38433,Exact,2.0,desk for kids,Kids Desks,bisa kids study 36 '' writing desk and chair set,Kids Desks,Baby & Kids / Toddler & Kids Bedroom Furniture...,...,4.5,143.0,"Terms({'36', 'and', 'kid', 'desk', 'set', 'stu...","Terms({'this', 'all', 'to', 'paper', 'for', 's...","Terms({'desk', 'kid'})","Terms({'kids', '36', 'and', 'desk', 'set', 'wr...","Terms({'this', 'workspace', 'all', 'crafts', '...","Terms({'kids', 'desks'})",10,1.251097


In [None]:
results_baseline[(results_baseline['query'] == 'desk for kids') & (results_baseline['grade'] == 0)][['product_name', 'product_description', 'grade']]

Unnamed: 0,product_name,product_description,grade
2,cordova kids desk,this office chair is perfect for the kids and ...,0.0
3,brister kids desk,if you are looking for a proper chair set for ...,0.0
6,keighley kids desk,this article offers excellent design and quali...,0.0
7,alirra kids desk,this article offers excellent design and quali...,0.0
8,arias kids desk,this article offers excellent design and quali...,0.0


In [None]:
best_results[best_results['query'] == 'desk for kids'][['product_id', 'product_name', 'product_description', 'grade']]

Unnamed: 0,product_id,product_name,product_description,grade
30,22465,42.8 `` simple student table kids desk white w...,"made of high-quality material , with drawers a...",2.0
31,38525,aadhya kids study desk,curate a cozy and organized workspace for your...,2.0
32,24327,abella kids writing desk,the features a truly unique finish reminiscent...,2.0
33,42740,adalyn kids study 35.44 '' writing desk,bring a streamlined style to your home office ...,2.0
34,17397,alessia kids study writing desk,this desk is an ideal pick whether your child ...,2.0
35,21865,avalon kids 41.6 '' writing desk with hutch an...,the avalon kids 41.6 '' writing desk with hutc...,2.0
36,20953,barfield 34.25 '' w art desk and chair set,"nature makes life full of poetry , the sun is ...",2.0
37,6974,betsy 31.5 '' w writing desk and chair set,,2.0
38,41138,biergh wooden 31 '' writing desk and chair set,,2.0
39,38433,bisa kids study 36 '' writing desk and chair set,"whether your child needs to work on homework ,...",2.0


In [None]:
results_baseline[results_baseline['query'] == 'desk for kids'][['product_id', 'product_name', 'product_description', 'grade']]

Unnamed: 0,product_id,product_name,product_description,grade
0,19145,osterley 31.5 '' w writing desk and chair set,height adjustable wooden student desk and chai...,2.0
1,17053,kids desk,meals are an important part of family time . e...,2.0
2,18018,cordova kids desk,this office chair is perfect for the kids and ...,0.0
3,8434,brister kids desk,if you are looking for a proper chair set for ...,0.0
4,25007,monarch hill haven kids 47.48 '' w writing des...,it ’ s what every intrepid explorer dreams of ...,2.0
5,17416,willette kids study desk,this kids ' complete desk system is designed t...,2.0
6,38997,keighley kids desk,this article offers excellent design and quali...,0.0
7,755,alirra kids desk,this article offers excellent design and quali...,0.0
8,754,arias kids desk,this article offers excellent design and quali...,0.0
9,9611,sesame street kids desk chair with storage com...,add a functional appeal to your kid ’ s room w...,2.0
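
Comparing the two 'desk for kids' tables by product id makes the difference concrete. The sketch below hard-codes the ids shown in the outputs above and checks how many the baseline top 10 shares with the ideal list.

```python
# Product ids copied from the baseline and ideal 'desk for kids' tables above.
baseline_ids = {19145, 17053, 18018, 8434, 25007, 17416, 38997, 755, 754, 9611}
ideal_ids = {22465, 38525, 24327, 42740, 17397, 21865, 20953, 6974, 41138, 38433}

# Ids that appear in both lists.
overlap = baseline_ids & ideal_ids
print(len(overlap))
```

An empty overlap does not by itself mean the baseline is wrong: best_results keeps only the first ten top-graded labels per query, so other grade-2 products (such as the five the baseline returned) are equally valid under this metric.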


In [None]:
results_baseline[results_baseline['product_id'] == 18018].iloc[0]['product_description']

"this office chair is perfect for the kids and their homework desks but it 's also a great chair for a lady 's or little girl 's bedroom for sitting at a vanity and putting on makeup . it 's fully adjustable and has a plastic seat , back , base , and casters which coming all together to bring you the most comfortable experience ."

In [None]:
best_results[best_results['query'] == 'desk for kids'][['product_name', 'product_description', 'grade', 'product_id']].iloc[0]['product_description']

'made of high-quality material , with drawers and storage functions , it is suitable for your kids . our table and chairs set are ideal for study , playing games , working on puzzles , and more .'

In [None]:
first_best = best_results[best_results['query'] == 'desk for kids'].iloc[0]
first_best['product_name'] + first_best['product_description']

'42.8 `` simple student table kids desk white with drawersmade of high-quality material , with drawers and storage functions , it is suitable for your kids . our table and chairs set are ideal for study , playing games , working on puzzles , and more .'

In [None]:
downsample.iloc[5000]['product_name']

'lavaca upholstered platform bed'

In [None]:
downsample.iloc[5000]['product_description']

'impart a traditional appeal to your bedroom with the inclusion of this traditional bed . constructed from a combination of solid wood and veneer it features a leatherette padded scrolled headboard with button tufting . the headboard and sleigh-type footboard display nailhead trim adding more style to it .'