# Problem Context
A simple e-commerce search engine was proposed: `TF-IDF` over product name and description, with no pre-processing. Its final **MAP@10 was 0.29**.

# Proposed Approach

Use BM-25 (default settings) as the search engine, taking all the "easy to access" product information (name, description, class and category) as input, with simple pre-processing that guards against some pitfalls: ignore empty tokens, keep numbers, remove punctuation, etc. Final **MAP@10 was 0.38**.

# Questions to Answer

**1) The search engine in the notebook has a MAP@10 across all queries of 0.29. This is considered low. Please propose some updates to increase the score. For reference, large ecommerce websites have MAP@10 values between 0.6 — 0.8, although there is no expectation for your solution to be in that range. The strength of your ideas holds greater weight than the final MAP score of the solution.**

Why did I propose this simple approach?
- It is easy and fast to implement and deploy.
- BM-25 is a well-known improvement over TF-IDF in retrieval scenarios such as RAG (it was essentially developed to improve on TF-IDF-style weighting, adding term-frequency saturation and document-length normalization).
- For sparse methods such as TF-IDF and BM-25, text processing really matters: collapse repeated spaces, convert everything to lowercase, remove stopwords, etc.
- Let's not get complex at the beginning: I could simply use embeddings + cosine similarity to solve the case, but computing the embeddings for the whole product dataset would take a long time, and BM-25 is a good start.
- I used the other two columns (class and category) on a simple principle: more data = better model. I did not use `product_features` because I would need to understand it better first.

So far, this method achieved **0.38**, which is an improvement over the 0.29 baseline.

If I had more time and resources, what would I do? (Keep in mind that each item below would be compared against my base model (0.38) before I move on to the next one.)
- Explore the `product_features` column to understand its meaning and how to extract useful information from it. If it proves useful, I would add it to `search_text` (after the same pre-processing);
- Improve text pre-processing through a careful analysis of the product information: the idea is to gain a deeper knowledge of the data and, hopefully, turn it into better pre-processing;
- Implement stemming and lemmatization in the pre-processing phase;
- Optimize the BM-25 hyper-parameters: `k1`, `b` and the BM-25 variant;
- Implement a dense approach in parallel: convert `search_text` to a dense vector (embedding) and use cosine similarity to pick the top K (I started implementing this with sentence_transformers and open-source models from Hugging Face, but generating the embeddings for the products dataset would take a long time);
- Optimize the choice of embedding model (I know it would take a lot of time, but I want to mention it);
- Compare the BM-25 (sparse) and embedding (dense) approaches to decide between a hybrid (combining both through an RRF score) and keeping only one (less complexity; this would be the choice if one is far better than the other).
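
As an illustration of the hybrid option, here is a minimal sketch of Reciprocal Rank Fusion (RRF). It is not part of the notebook: the function name, the product IDs and the constant `k=60` (a commonly used default) are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse several ranked lists of product IDs into one list.

    Each list contributes 1 / (k + rank) for every item it contains,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, product_id in enumerate(ranking, start=1):
            scores[product_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [product_id for product_id, _ in fused[:top_n]]


# Hypothetical top-4 results from a sparse and a dense retriever
sparse_top = [101, 205, 330, 412]
dense_top = [205, 101, 999, 330]
fused_top = reciprocal_rank_fusion([sparse_top, dense_top], top_n=3)
print(fused_top)
```

RRF only needs the rank positions, so the BM-25 scores and the cosine similarities never have to be put on a common scale.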

The pros and cons:

- Sparse with BM-25:
    PROS - Fast to train, low latency, easy to deploy, a good starting point that performs well most of the time.
    CONS - No semantic understanding (home != house != apartment), requires pre-processing
- Dense with Embeddings:
    PROS - Semantic understanding, few dependencies (only need to load the data and the model and apply cosine similarity)
    CONS - Complex to deploy when there are millions of items to search (needs an approximate nearest-neighbor search library such as Faiss), slower to prototype than sparse methods
- Hybrid Approach: Adds complexity to deployment
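
To make the dense option concrete, the sketch below ranks products by cosine similarity in plain Python. The three-dimensional vectors are made up for illustration; in practice they would come from an embedding model, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # A zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def dense_top_k(query_vec, product_vecs, k=10):
    """Return the k product IDs whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), pid) for pid, vec in product_vecs.items()]
    # Highest similarity first; ties broken by product ID
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [pid for _, pid in scored[:k]]


# Made-up "embeddings"; real ones would be produced by a model
product_vecs = {
    "sofa": [0.9, 0.1, 0.0],
    "couch": [0.8, 0.2, 0.1],
    "desk lamp": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.0, 0.0]
top_two = dense_top_k(query_vec, product_vecs, k=2)
print(top_two)
```

This brute-force scan compares the query with every product, which is the part an index such as Faiss would replace at scale.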

**2) Currently, partial matches are treated as irrelevants, which penalizes the model too strictly. Can you implement another function that leverages the partial match count to provide a fairer assessment of performance? Please provide a justification for why you chose this function and the tradeoffs. If you choose to implement additional evaluation metrics, please provide a justification for using them along with tradeoffs.**

I proposed a weighted **MAP@K** metric, with a weight of 1 for exact matches and a weight of 0.5 for partial matches. I then validated it by checking that, when given only the exact matches, my custom weighted implementation reproduces the result of your implementation.

I consider this the simplest approach given my time constraints.
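
Written out, the per-query score computed by `map_at_k_weighted` in the notebook is the following (the symbols are introduced here only for this explanation):

$$
\mathrm{AP}_w@K \;=\; \frac{1}{\min(|R|,\,K)} \sum_{i=1}^{K} w_i \cdot \frac{\sum_{j=1}^{i} w_j}{i}
$$

Here $R$ is the set of products labelled Exact or Partial for the query, and $w_i$ is 1 if the item at rank $i$ is an exact match, 0.5 if it is a partial match, and 0 otherwise (including repeated items). The reported number is the mean of this score over all queries. With weights restricted to 0 and 1 the expression reduces to the usual AP@K. Note that a partial hit is discounted twice, once in the running sum and once as the outer factor: a single partial match at rank 1 contributes $0.5 \cdot 0.5 / 1 = 0.25$.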

**3) For this prompt you can choose one of two options, but you DO NOT need to do both. We value the ability to improve model performance and refactor code equally, so please choose based on what you feel most comfortable doing: (A) Please implement at least one change you suggest for prompt 1 to demonstrate an improvement in the MAP score. Please document your code changes with comments and markdown cells so we can follow your thought process. (B) Please modify the code to make it more object oriented, more flexible to accommodate changes to the retrieval model, and getting it ready for production (such as adding logging, error handling, etc.)**

For this case I selected (A): I added the option to use stemming or lemmatization and compared the final results. As the notebook shows, the results did improve, going from **0.38** to **0.42**.

In [1]:
#clone the git repo that contains the data and additional information about the dataset
#!git clone https://github.com/wayfair/WANDS.git

In [2]:
# Packages
import re
import nltk
import string
import pandas as pd
from rank_bm25 import BM25Plus
from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

In [3]:
# Download NLTK requirements
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kaike\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kaike\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kaike\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Execution Steps

## Data Preprocessing

In [4]:
# Load dataset - query
df_query = pd.read_csv("WANDS/dataset/query.csv", sep='\t')
# Load dataset - products
df_product = pd.read_csv("WANDS/dataset/product.csv", sep='\t')
# Load dataset - labels
df_label = pd.read_csv("WANDS/dataset/label.csv", sep='\t')
df_label_grouped = df_label.groupby('query_id')

In [5]:
# Verify label options
df_label['label'].unique().tolist()

['Exact', 'Irrelevant', 'Partial']

In [6]:
# Add labels to query to future evaluation - Strict Classes
df_query['ids_true_partial'] = df_query['query_id'].apply(
    lambda qid:
    df_label.loc[(df_label['label'] == 'Partial') & (df_label['query_id'] == qid), 'product_id'].tolist()
)
df_query['ids_true_irrelevant'] = df_query['query_id'].apply(
    lambda qid:
    df_label.loc[(df_label['label'] == 'Irrelevant') & (df_label['query_id'] == qid), 'product_id'].tolist()
)
df_query['ids_true_exact'] = df_query['query_id'].apply(
    lambda qid:
    df_label.loc[(df_label['label'] == 'Exact') & (df_label['query_id'] == qid), 'product_id'].tolist()
)

In [7]:
# Remove NaN
search_cols = ['product_name', 'product_class', 'category hierarchy', 'product_description']
df_product[search_cols] = df_product[search_cols].fillna('')

In [8]:
def generate_product_search_column(name, class_type, category, description):
    '''
    Build the search column: join the product fields and collapse repeated whitespace.
    '''
    # Construct
    search_text = ' '.join([name, class_type, category, description])
    search_text = re.sub(r'\s+', ' ', search_text)
    search_text = search_text.strip()
    # Return
    return search_text

In [9]:
# Construct search column over product
df_product['search_text'] = df_product.apply(lambda row: 
                                             generate_product_search_column(
                                                row['product_name'],
                                                row['product_class'],
                                                row['category hierarchy'],
                                                row['product_description']
                                            ),axis=1)

In [10]:
def apply_text_preprocessing(text, normalization='none'):
    '''
    Apply text pre-processing. This function will:
    - Convert everything to lowercase
    - Replace numbers (including decimals such as 6.5 or 7,5) with a placeholder token,
      so that punctuation removal does not split them
    - Replace all remaining punctuation with a space
    - Remove English stopwords
    - Optionally apply lemmatization or stemming
    - Collapse any sequence of blank spaces into a single space
    '''
    # Ensure we're working with string input
    if not isinstance(text, str):
        return ''

    # Initialize stemmer and lemmatizer if needed
    stemmer = PorterStemmer() if normalization == 'stem' else None
    lemmatizer = WordNetLemmatizer() if normalization == 'lemma' else None
    
    # Convert to lowercase
    text = text.lower()
    
    # Handle numerical cases so that punctuation removal does not split decimals.
    # This pattern matches:
    # 1. Standard numbers (123)
    # 2. Numbers with decimal points (123.45 or .45)
    # 3. Numbers with decimal commas (123,45 or ,45)
    numeric_pattern = r'(?:\d+[.,]\d+|\d+|[.,]\d+)'
    numeric_matches = re.findall(numeric_pattern, text)
    
    # Replace numeric patterns with a placeholder token.
    # NOTE: the placeholder is never swapped back for the values in numeric_matches,
    # so every number ends up as the same token in the processed text.
    placeholder = " NUMERICPLACEHOLDERNOTHINGISEQUALTOTHIS "
    processed_text = re.sub(numeric_pattern, placeholder, text)
    
    # Remove all punctuation (replace with space)
    translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    processed_text = processed_text.translate(translator)
    
    # Tokenize and process words
    words = processed_text.split(' ') # word_tokenize(processed_text)
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word not in stop_words]
    
    # Apply normalization if requested
    if normalization == 'stem' and stemmer:
        normalized_words = [stemmer.stem(word) for word in filtered_words]
    elif normalization == 'lemma' and lemmatizer:
        normalized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
    else:
        normalized_words = filtered_words
    
    # Join words with single space and strip any leading/trailing whitespace
    processed_text = ' '.join(normalized_words)
    
    # Remove any sequence of multiple spaces with single space
    processed_text = re.sub(r'\s+', ' ', processed_text).strip()

    # Returning ...
    return processed_text
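
The function above swaps numbers for a placeholder but does not restore them, so all numbers share a single token. One possible variation, sketched below, keeps each number as its own token by tokenizing with a single regular expression. This is an assumption-laden sketch, not the notebook's method: `tokenize_keep_numbers` is a hypothetical helper, stopword removal and stemming are left out to keep it self-contained, and the scores reported in this notebook were produced with the function as shown above.

```python
import re

# Numbers (with optional . or , decimals) are tried first, then runs of letters.
# Everything else (punctuation, whitespace) is skipped by findall.
TOKEN_PATTERN = re.compile(r'\d+[.,]\d+|[.,]\d+|\d+|[^\W\d_]+')


def tokenize_keep_numbers(text):
    """Lowercase the text and return word and number tokens, dropping punctuation."""
    if not isinstance(text, str):
        return []
    return TOKEN_PATTERN.findall(text.lower())


tokens = tokenize_keep_numbers('Area Rug, 6.5 x 9,5 ft. (Set of 2)')
print(tokens)
```

With this tokenizer a query for a "6.5" rug would only match products that mention 6.5, at the cost of splitting mixed tokens such as "3x5" into separate pieces.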

In [11]:
# Apply pre-processing - products
df_product['search_text_proc'] = df_product['search_text'].apply(lambda x: apply_text_preprocessing(x, 'none'))
df_product['search_text_proc_stem'] = df_product['search_text'].apply(lambda x: apply_text_preprocessing(x, 'stem'))
df_product['search_text_proc_lemma'] = df_product['search_text'].apply(lambda x: apply_text_preprocessing(x, 'lemma'))

In [12]:
# Apply pre-processing - query
df_query['query_proc'] = df_query['query'].apply(lambda x: apply_text_preprocessing(x))
df_query['query_proc_stem'] = df_query['query'].apply(lambda x: apply_text_preprocessing(x, 'stem'))
df_query['query_proc_lemma'] = df_query['query'].apply(lambda x: apply_text_preprocessing(x, 'lemma'))

## Recommendation System Engine

Sparse Approach

In [13]:
def tokenize_text(text):
    '''
    Simple tokenizer to avoid empty tokens.
    '''
    return [w for w in text.split(' ') if len(w) > 0]

In [14]:
def generate_search_engine(corpus, indexes):
    '''
    Train BM-25.
    '''
    # Tokenize each search text
    tokenized_corpus = [tokenize_text(product_text) for product_text in corpus]
    # Generate the search engine
    search_engine = BM25Plus(tokenized_corpus)
    # Returning
    return {'search_engine': search_engine, 'corpus': corpus, 'indexes': indexes}

In [15]:
def search_items(query_text, k, bm25_engine, bm25_indexes):
    '''
    Search for the k most relevant items using BM-25.
    '''
    # Tokenize the text
    tokenized_query = tokenize_text(query_text)
    # Execute the search
    top_searches_indexes = bm25_engine.get_top_n(tokenized_query, bm25_indexes, n=k)
    # Returning
    return top_searches_indexes
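
To show what a BM-25 engine computes for each query, here is a minimal pure-Python sketch of classic Okapi BM-25 scoring. It is not the `rank_bm25` implementation: `BM25Plus` is a BM-25 variant that additionally lower-bounds the term-frequency component, and the `k1=1.5`, `b=0.75` values and the IDF smoothing used below are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document against the query with classic Okapi BM25."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in tokenized_corpus for term in set(doc))

    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        # b controls how strongly longer-than-average documents are penalized
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed IDF (one common variant), always positive
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # k1 controls how quickly repeated occurrences stop adding score
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus, already tokenized
corpus = [
    ['blue', 'area', 'rug'],
    ['wooden', 'dining', 'table'],
    ['area', 'rug', 'pad', 'for', 'area', 'rug'],
]
scores = bm25_scores(['area', 'rug'], corpus)
best_doc = max(range(len(corpus)), key=scores.__getitem__)
print(scores, best_doc)
```

The two parameters in the sketch, `k1` and `b`, are the ones a hyper-parameter search would tune.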

In [16]:
# Construct the search engine
bm25 = generate_search_engine(df_product['search_text_proc'], df_product['product_id'])
bm25_stem = generate_search_engine(df_product['search_text_proc_stem'], df_product['product_id'])
bm25_lemma = generate_search_engine(df_product['search_text_proc_lemma'], df_product['product_id'])

In [17]:
# Test one of the search engines on a single query
query_test = df_query.loc[17, 'query_proc']
search_indexes_results = search_items(query_test, 10, bm25['search_engine'], bm25['indexes'])

In [18]:
# Calculate for all query items
df_query['ids_predicted'] = df_query.apply(lambda x: search_items(x['query_proc'], 10, bm25['search_engine'], bm25['indexes']), axis=1)
df_query['ids_predicted_stem'] = df_query.apply(lambda x: search_items(x['query_proc_stem'], 10, bm25_stem['search_engine'], bm25_stem['indexes']), axis=1)
df_query['ids_predicted_lemma'] = df_query.apply(lambda x: search_items(x['query_proc_lemma'], 10, bm25_lemma['search_engine'], bm25_lemma['indexes']), axis=1)

## System Evaluation

Strict Classes

In [19]:
def map_at_k(true_ids, predicted_ids, k=10):
    """
    Calculate the Mean Average Precision at K (MAP@K).

    Parameters:
        true_ids (list): List of relevant product IDs.
        predicted_ids (list): List of predicted product IDs.
        k (int): Number of top elements to consider.
            NOTE: IF you wish to change top k, please provide a justification for choosing the new value

    Returns:
        float: MAP@K score.
    """
    #if either list is empty, return 0
    if not len(true_ids) or not len(predicted_ids):
        return 0.0

    score = 0.0
    num_hits = 0.0

    for i, p_id in enumerate(predicted_ids[:k]):
        if p_id in true_ids and p_id not in predicted_ids[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)

    return score / min(len(true_ids), k)

In [20]:
# Calculate AP@10 per query (the mean over all queries is MAP@10)
df_query['precision_at_10_exact'] = df_query.apply(lambda x: map_at_k(x['ids_true_exact'], x['ids_predicted']), axis=1)
df_query['precision_at_10_partial'] = df_query.apply(lambda x: map_at_k(x['ids_true_partial'], x['ids_predicted']), axis=1)

# Check Mean over the test dataset
print('> Mean Precision@10 EXACT: ', df_query['precision_at_10_exact'].mean().round(2))
print('> Mean Precision@10 PARTIAL: ', df_query['precision_at_10_partial'].mean().round(2))

> Mean Precision@10 EXACT:  0.38
> Mean Precision@10 PARTIAL:  0.32


In [21]:
# Calculate AP@10 per query for each normalization option (none, stem, lemma)
df_query['precision_at_10_exact'] = df_query.apply(lambda x: map_at_k(x['ids_true_exact'], x['ids_predicted']), axis=1)
df_query['precision_at_10_exact_stem'] = df_query.apply(lambda x: map_at_k(x['ids_true_exact'], x['ids_predicted_stem']), axis=1)
df_query['precision_at_10_exact_lemma'] = df_query.apply(lambda x: map_at_k(x['ids_true_exact'], x['ids_predicted_lemma']), axis=1)

# Check Mean over the test dataset
print('> Mean Precision@10 EXACT: ', df_query['precision_at_10_exact'].mean().round(2))
print('> Mean Precision@10 EXACT - STEM: ', df_query['precision_at_10_exact_stem'].mean().round(2))
print('> Mean Precision@10 EXACT - LEMMA: ', df_query['precision_at_10_exact_lemma'].mean().round(2))

> Mean Precision@10 EXACT:  0.38
> Mean Precision@10 EXACT - STEM:  0.42
> Mean Precision@10 EXACT - LEMMA:  0.42


## System Evaluation

Partial Included

In [22]:
# Define weights
EXACT_WEIGHT = 1.0
PARTIAL_WEIGHT = 0.5

# Construct IDs / Weight
df_query['ids_true_exact_partial'] = df_query['query_id'].apply(
    lambda qid:
    df_label.loc[(df_label['label'].isin(['Exact', 'Partial'])) & (df_label['query_id'] == qid), 'product_id'].tolist()
)
df_query['weight_ids_true_exact_partial'] = df_query['query_id'].apply(
    lambda qid:
    df_label.loc[(df_label['label'].isin(['Exact', 'Partial'])) & (df_label['query_id'] == qid), 'label'].tolist()
)
df_query['weight_ids_true_exact_partial'] = df_query['weight_ids_true_exact_partial'].apply(
    lambda x_list:
    [EXACT_WEIGHT if x == 'Exact' else PARTIAL_WEIGHT for x in x_list]
)

In [23]:
def map_at_k_weighted(true_ids, class_weights, predicted_ids, k=10):
    """
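    Calculate the custom weighted Average Precision at K for one query.

    Each hit adds its class weight to the running hit count, and its precision
    term is scaled by the same weight. With all weights equal to 1 this
    reduces to map_at_k.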
    """
    # if either list is empty, return 0
    if not len(true_ids) or not len(predicted_ids):
        return 0.0

    # Create a dictionary to map true_ids to their corresponding weights
    true_id_to_weight = {true_id: weight for true_id, weight in zip(true_ids, class_weights)}

    # Initiate the variables
    score = 0.0
    num_hits = 0.0
    sum_weights = 0.0

    # Loop over the predicted results
    for i, p_id in enumerate(predicted_ids[:k]):
        # Checking
        if p_id in true_ids and p_id not in predicted_ids[:i]:
            # Get the weight of the current true_id
            weight = true_id_to_weight[p_id]
            num_hits += weight
            score += (num_hits / (i + 1.0)) * weight
            sum_weights += weight

    # Normalize by the number of relevant items, capped at k (same denominator as map_at_k)
    if sum_weights > 0:
        return score / min(len(true_ids), k)
    else:
        return 0.0

In [24]:
# Calculate weighted AP@10 per query - Weighted Matches
df_query['precision_at_10_weighted'] = df_query.apply(
    lambda x: 
    map_at_k_weighted(x['ids_true_exact_partial'], x['weight_ids_true_exact_partial'], x['ids_predicted']),
    axis=1
)

# Check Mean over the test dataset
print('> Mean Precision@10 WEIGHTED: ', df_query['precision_at_10_weighted'].mean().round(2))

> Mean Precision@10 WEIGHTED:  0.42


In [25]:
# Calculate AP@10 per query - Exact Matches (Weighted Version) - Metric validation
df_query['precision_at_10_exact_validation'] = df_query.apply(
    lambda x: 
    map_at_k_weighted(x['ids_true_exact'], [1.0] * len(x['ids_true_exact']), x['ids_predicted']),
    axis=1
)

# Check Mean over the test dataset
print('> Mean Precision@10 EXACT VALIDATION: ', df_query['precision_at_10_exact_validation'].mean().round(2))

> Mean Precision@10 EXACT VALIDATION:  0.38
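
The same check can be reproduced by hand on a toy ranking. The sketch below uses compact restatements of the two metric functions (the weighted one takes an ID-to-weight dictionary instead of two lists); the function names and product IDs are hypothetical and only serve the example.

```python
def average_precision_at_k(true_ids, predicted_ids, k=10):
    """Plain AP@K, mirroring map_at_k."""
    if not true_ids or not predicted_ids:
        return 0.0
    hits, score = 0.0, 0.0
    for i, p_id in enumerate(predicted_ids[:k]):
        if p_id in true_ids and p_id not in predicted_ids[:i]:
            hits += 1.0
            score += hits / (i + 1.0)
    return score / min(len(true_ids), k)


def weighted_average_precision_at_k(id_to_weight, predicted_ids, k=10):
    """Weighted AP@K, mirroring map_at_k_weighted with an ID -> weight dict."""
    if not id_to_weight or not predicted_ids:
        return 0.0
    hits, score = 0.0, 0.0
    for i, p_id in enumerate(predicted_ids[:k]):
        if p_id in id_to_weight and p_id not in predicted_ids[:i]:
            weight = id_to_weight[p_id]
            hits += weight
            score += (hits / (i + 1.0)) * weight
    return score / min(len(id_to_weight), k)


predicted = [7, 3, 9, 4]  # hypothetical ranking returned by the engine
exact_ids = [7, 4]        # hypothetical exact matches; product 3 is a partial match

plain = average_precision_at_k(exact_ids, predicted)
weighted_exact_only = weighted_average_precision_at_k({7: 1.0, 4: 1.0}, predicted)
weighted_with_partial = weighted_average_precision_at_k({7: 1.0, 4: 1.0, 3: 0.5}, predicted)
print(plain, weighted_exact_only, weighted_with_partial)
```

In this toy case the exact-only weighted score equals the plain score (0.75), while including the partial match gives about 0.67: the denominator grows from 2 to 3, and the partial hit itself only adds 0.375 to the numerator.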
