<a id="top"></a>
<img width="40%" alt="Bluelight AI Logo" href="https://bluelightai.com/" src="https://github.com/BlueLightAI/cobalt-examples/blob/main/assets/blai-logo-light.png?raw=true">

# Use Cobalt to Pick the Best Model for Your E-Commerce Needs
<a href="https://bluelightai.com/contact">Give Feedback</a> | <a href="https://bluelightai.com/">Our Website</a> | <a href="https://docs.cobalt.bluelightai.com/">Cobalt Docs</a> | <a href="https://bluelightaicom.slack.com/archives/C0807BUJ4KE">Slack Channel</a> 

**Last update:** 2024-12-11 (Created: 2024-11-15)

## Introduction


At [BluelightAI](https://bluelightai.com/) we are **thrilled** to help you identify the best model for your use case!

**Business Context for This Notebook**: 

An ecommerce retailer is spending millions of dollars bringing customers to their website, obtaining inventory and optimizing their models.

Here, they use BluelightAI Cobalt to compare two different prospective retrieval models on their customer product search dataset before deploying one.


**Model and Dataset Details**

We compare the retrieval performance of an [SBERT](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html) and an [E5](https://huggingface.co/intfloat/e5-base-v2) model on a popular ecommerce benchmark [dataset](https://huggingface.co/datasets/Marqo/marqo-GS-10M) from Marqo. 

The E5 model was fine-tuned on this dataset using Marqo's ecommerce fine-tuning [Marqtune](https://www.marqo.ai/blog/introducing-marqtune) Platform.

### Install dependencies

For Setup Instructions, see the [Cobalt Docs](https://docs.cobalt.bluelightai.com/).

In [None]:
# %pip install cobalt-ai
# %pip install sentence_transformers==3.3.1

### Import libraries

In [None]:
import warnings

# cobalt.setup_license()
import pandas as pd

import cobalt
from cobalt.embedding_models import SentenceTransformerEmbeddingModel
from cobalt.lab.generate_interpretable_dataframe import get_interpretable_groups
from cobalt.lab.neighbors import get_raw_subset_with_label

warnings.filterwarnings("ignore")

**Why BluelightAI Cobalt:** 

The time you have to understand and fix a model’s errors is limited, expensive and hard to scale to the size of your dataset. Cobalt automates the otherwise painful step of looking for patterns in how a model is performing. We also make comparing models on your dataset easy.


1. We identify groups of customer queries (inputs into your machine learning model) that have similar natural language using [TDA](https://www.nature.com/articles/srep01236) 👥 

2. We provide an easy Pandas DataFrame table so you can do model comparisons on these groups of user queries. 💡

3. This helps you to do risk analysis, model improvement, and model selection so that you can deploy the best possible model for your use case  📊

#### Data Prep

For each query, we simply need an evaluation or performance score from your current or prospective model(s):
- Common [evaluation metrics](https://weaviate.io/blog/retrieval-evaluation-metrics) for search retrieval include Precision, Recall, MRR, NDCG, etc.
- Business evaluation scores often include add-to-cart rate, purchase rate, clickthrough rate, etc.

In this notebook we precomputed a common evaluation score called [NDCG](https://www.marqo.ai/blog/what-is-normalized-discounted-cumulative-gain-ndcg), which can be used to evaluate a product search model before deployment.

That is, before deploying to an ecommerce retailer's website, it is useful to compare with BluelightAI Cobalt how two different models perform on the training data.

Note: The "Score" per query has a best value of 1 and a worst value of 0 (NDCG metric).
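
To make the score concrete, here is a minimal sketch of how NDCG can be computed for a single query. It assumes graded relevance labels listed in ranked order and the common linear-gain, log2-discount formulation. The function names and the example relevance values are illustrative, and the precomputed scores used in this notebook may follow a different variant or cutoff.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(
        rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances, k=None):
    """NDCG@k for one query; `relevances` are in the order the model ranked them."""
    ranked = list(relevances)[:k]
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    # A query with no relevant results gets 0 by convention in this sketch.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical relevance grades for two rankings of the same four products.
perfect_ranking = [3, 2, 1, 0]
reversed_ranking = [0, 1, 2, 3]

print(ndcg(perfect_ranking))   # ideal ordering
print(ndcg(reversed_ranking))  # most relevant product ranked last
```

The ideal ordering scores exactly 1, and any ordering that pushes relevant products down the list scores lower.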

In [None]:
from urllib.request import urlretrieve

base_path = "https://examples.cobalt.bluelightai.com/marqo-gs-10m/v1"
e5_results_file = "training_epoch_1_ndcg_per_query.csv"
sbert_results_file = "ndcg_per_query_gs_100k_training_2024-10-23_mini_lm_l6_v2.csv"
urlretrieve(f"{base_path}/{e5_results_file}", e5_results_file)
urlretrieve(f"{base_path}/{sbert_results_file}", sbert_results_file)

In [None]:
sbert_minilm_ndcg_per_query_df = pd.read_csv(sbert_results_file, index_col=0)
sbert_minilm_ndcg_per_query_df.head(3)

In [None]:
e5_marqtune_ndcg_per_query = pd.read_csv(e5_results_file, index_col=0)
e5_marqtune_ndcg_per_query = e5_marqtune_ndcg_per_query.drop(columns=["Score"])
e5_marqtune_ndcg_per_query.head(3)

In [None]:
print(
    f"There are {len(sbert_minilm_ndcg_per_query_df)} queries in the Sbert_minilm dataset"
)
print(f"There are {len(e5_marqtune_ndcg_per_query)} queries in the E5_marqtune dataset")
print("The queries are the same in both datasets and the rows are aligned")
print("ie: row 2 is the query for Customizable Buttons for Men in both dataframes")

Without BluelightAI, current approaches analyze performance using the average on the whole dataset:

In [None]:
sbert_print = sbert_minilm_ndcg_per_query_df["ndcg_score"].mean().round(2)
print(f"The SBERT model had an average NDCG score of {sbert_print} on this dataset")
e5_print = e5_marqtune_ndcg_per_query["ndcg_score"].mean().round(2)
print(f"The E5 model had an average NDCG score of {e5_print} on this dataset")

***Limitations of Current Approaches***

- An average over the whole dataset does not show where your model performs poorly.

- Reviewing queries one at a time to understand and improve model performance does not scale.
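
As a toy illustration of the first limitation, the sketch below uses made-up scores and hand-assigned group names (not values from this dataset) to show how a healthy-looking overall mean can hide a group that performs poorly.

```python
import pandas as pd

# Hypothetical per-query scores; the groups are assigned by hand for illustration.
toy = pd.DataFrame(
    {
        "group": ["shoes"] * 6 + ["wine corks"] * 2,
        "ndcg_score": [0.9, 0.8, 0.9, 0.8, 0.9, 0.9, 0.2, 0.2],
    }
)

overall_mean = toy["ndcg_score"].mean()
per_group_mean = toy.groupby("group")["ndcg_score"].mean()

print(f"Overall mean: {overall_mean:.2f}")
print(per_group_mean.round(2))
```

In this toy table the groups are given. On a real dataset, finding such groups is the hard part.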

**How BluelightAI Cobalt addresses these limitations:**

1. Automatically identify problematic groups of data in your model, saving days or weeks of troubleshooting effort.

2. Quickly compare models and assess the deployment risk of each one on your dataset.

#### Data Prep: Compare Models on the same Queries

We can combine our dataframes because the queries are identical and aligned
(i.e., row 2 is the query for Customizable Buttons for Men in both dataframes).
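
Positional alignment is an assumption worth verifying before combining columns. A minimal sketch of such a check, shown on two small hypothetical dataframes (the helper name `queries_are_aligned` is illustrative and not part of Cobalt, and it assumes both tables contain a `query` column):

```python
import pandas as pd


def queries_are_aligned(df_a, df_b, column="query"):
    """Return True if both frames share the same index and the same query per row."""
    return df_a.index.equals(df_b.index) and df_a[column].equals(df_b[column])


# Hypothetical stand-ins for the two per-query result tables.
toy_a = pd.DataFrame({"query": ["red shoes", "wine corks"], "ndcg_score": [0.9, 0.6]})
toy_b = pd.DataFrame({"query": ["red shoes", "wine corks"], "ndcg_score": [0.8, 0.4]})
toy_shuffled = toy_b.iloc[::-1]

print(queries_are_aligned(toy_a, toy_b))         # rows match
print(queries_are_aligned(toy_a, toy_shuffled))  # same queries, different order
```

Column assignment in pandas aligns on the index, so a mismatched index would silently produce NaN values rather than raise an error. That is why an explicit check can be useful.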

In [None]:
model_comparison_df = sbert_minilm_ndcg_per_query_df.copy()
model_comparison_df = model_comparison_df.rename(columns={"ndcg_score": "sbert_model"})
model_comparison_df["e5_model"] = e5_marqtune_ndcg_per_query["ndcg_score"]

In [None]:
model_comparison_df["Switching_to_E5_Impact"] = (
    model_comparison_df["e5_model"] - model_comparison_df["sbert_model"]
)

#### And Now... BluelightAI Cobalt 🔥

1. We find groups of user queries that have similar natural language using [TDA](https://www.nature.com/articles/srep01236) 👥 🔗

2. We then illuminate the performance of these groups on each of your models 🔎 

3. This makes identifying problematic groups and comparing models quick and easy

In [None]:
# First load your dataframe into a `CobaltDataset`.
ds = cobalt.CobaltDataset(model_comparison_df)

m = SentenceTransformerEmbeddingModel("all-MiniLM-L6-v2")

# Compute an embedding for each user query with the embedding model specified above.
# The `device` argument selects GPU acceleration; "mps" targets Apple GPUs (Metal),
# so change it to match your hardware.
# If you already have embeddings for each of your samples, skip this step
# and pass them to add_embedding_array() instead.
embedding = m.embed(model_comparison_df["query"].tolist(), device="mps")

# And add the embedding to the dataset, using the "cosine" similarity metric.
ds.add_embedding_array(embedding, metric="cosine", name="sbert")
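
For intuition about the "cosine" metric named above, here is a minimal sketch of cosine similarity on hypothetical three-dimensional vectors (real sentence embeddings have hundreds of dimensions). It is an illustration only, not Cobalt's internal implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings; the variable names only suggest what they might represent.
wine_corks = [1.0, 2.0, 0.0]
bottle_corks = [2.0, 4.0, 0.0]   # same direction, different length
bathtub_caddy = [0.0, 0.0, 3.0]  # orthogonal to the other two

print(cosine_similarity(wine_corks, bottle_corks))
print(cosine_similarity(wine_corks, bathtub_caddy))
```

Because the measure depends only on direction, two queries whose embeddings point the same way count as similar regardless of vector length.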

The `results` variable holds your output table!

In [None]:
results, workspace, keywords_per_level = get_interpretable_groups(
    ds,
    text_column_name="query",
    n_gram_range="up_to_bigrams",
    min_level=0,
    max_level=20,
    max_keywords=3,
    return_intermediates=True,
)

In [None]:
results = (
    results[(results["level"] == 10) & (results["query_count"] > 10)]
    .round(2)
    .reset_index(drop=True)
)
graph = workspace.graphs["New Graph"]

In [None]:
results

### Observing the Results  🧠

Note: "Score" has a best value of 1 and a worst value of 0 (NDCG metric).

Easily navigate the clustering in the dataframe: 
- Filter the results by a minimum query_count
- Sort for largest impact!
- Etcetera
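
These operations are ordinary pandas calls. The sketch below uses a small hypothetical table with the same column names as the `results` dataframe (`Label`, `level`, `query_count`, `Switching_to_E5_Impact`); the rows and the threshold are made up.

```python
import pandas as pd

# Made-up rows that mimic the shape of `results`.
toy_results = pd.DataFrame(
    {
        "Label": ["group a", "group b", "group c", "group d"],
        "level": [10, 10, 10, 10],
        "query_count": [11, 40, 5, 25],
        "Switching_to_E5_Impact": [-0.22, 0.15, -0.40, 0.02],
    }
)

# Keep groups with at least 10 queries, then list the biggest regressions first.
min_query_count = 10
filtered = toy_results[toy_results["query_count"] >= min_query_count]
largest_regressions = filtered.sort_values(by="Switching_to_E5_Impact")

print(largest_regressions[["Label", "query_count", "Switching_to_E5_Impact"]])
```

Passing `ascending=False` to `sort_values` would list the largest improvements first instead.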

### Interpreting Results

We quickly found that many groups of queries, such as "corks, wine corkscrew, bottles corks" searches, actually perform better with the SBERT model!

**Context:** Averaged over the whole dataset, the E5 model had a higher NDCG score than the SBERT model because it was fine-tuned on this dataset.

Examples:
1. "corks, wine corkscrew, bottles corks" has 11 queries and an average score of 0.65 with the SBERT model, compared with 0.43 with the E5 model.

2. "bathtub caddy, caddy, caddy book" has 11 queries and an average score of 0.43 with the SBERT model, compared with 0.21 with the E5 model.

In [None]:
results.sort_values(by=["Switching_to_E5_Impact"]).head()

#### Bigger Groups with the "level" column:

- Higher values in the "level" column correspond to larger groups of your source data.

- Each level contains all of the unique points from the source, so combining rows from different levels can count the same query more than once.

- Levels are part of our clustering algorithm design and act as "zoom" levels on patterns in the data.
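
The caution about combining levels can be illustrated with a small hypothetical table shaped like `results`. The sketch assumes, as stated above, that every query appears in exactly one group at each level; the labels and counts are made up.

```python
import pandas as pd

# Made-up groups at two levels covering the same 100 hypothetical queries.
toy_levels = pd.DataFrame(
    {
        "Label": ["corks", "caddy", "kitchen", "bath", "home goods"],
        "level": [5, 5, 5, 5, 10],
        "query_count": [20, 30, 25, 25, 100],
    }
)

# Each level accounts for all 100 queries on its own.
queries_per_level = toy_levels.groupby("level")["query_count"].sum()
print(queries_per_level)

# Summing across levels counts every query once per level.
total_across_levels = toy_levels["query_count"].sum()
print(total_across_levels)
```

Filtering to a single level before aggregating, as this notebook does with `results["level"] == 10`, avoids the double counting.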

#### Inspecting the Original Samples for a group

- For any "Label" you want to understand more about, pass it together with its "level" value below (the pair identifies the group uniquely)

In [None]:
# Insert the label of interest here; note that order of keywords may vary
see_label = "corks, bottles corks, wine corkscrew"
# Make sure this matches your row of interest from the results dataframe
level_column_for_see_label = 10

In [None]:
results[
    (results["Label"] == see_label) & (results["level"] == level_column_for_see_label)
].head()

Simply run the next cell to see the matching source data!

In [None]:
raw_data = get_raw_subset_with_label(
    coarseness=level_column_for_see_label,
    label=see_label,
    g=graph,
    ds=ds,
    keywords_per_level=keywords_per_level,
)
raw_data.df

### Possible Next Steps:
- **Risk Analysis**
1. For important customer query patterns, weigh whether the performance is satisfactory for model deployment.

2. You can evaluate more prospective models with BluelightAI until your bar for minimal performance is met, or

3. You can make targeted improvements to the models on the queries you are concerned about.

- **Curate your Training Data** 
1. Ensure that your dataset is comprehensive and representative of the real-world scenarios your model will face.
2. Curate and improve your data for fine-tuning your model. 

    [Marqtune](https://www.marqo.ai/blog/introducing-marqtune) helps with fine-tuning ecommerce models. 
    
    [BluelightAI](https://bluelightai.com/) can help you to track performance at each of your fine-tuning model checkpoints for each of your queries and their associated groups.
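
One way to make the risk-analysis step concrete is to flag groups whose score under the candidate model falls below a chosen performance bar. In the sketch below the `group_summary` table and the threshold are hypothetical, and it assumes per-group average scores stored in columns named `sbert_model` and `e5_model`.

```python
import pandas as pd

# The first two rows echo the examples discussed earlier; the third is made up.
group_summary = pd.DataFrame(
    {
        "Label": [
            "corks, wine corkscrew, bottles corks",
            "bathtub caddy, caddy, caddy book",
            "hypothetical group",
        ],
        "sbert_model": [0.65, 0.43, 0.50],
        "e5_model": [0.43, 0.21, 0.80],
    }
)

minimum_acceptable_score = 0.40  # hypothetical bar; choose one that fits your business
candidate = "e5_model"

# Groups where the candidate model does not meet the bar.
at_risk = group_summary[group_summary[candidate] < minimum_acceptable_score]
print(at_risk[["Label", candidate]])
```

Groups flagged this way are natural candidates for further evaluation or for targeted fine-tuning data.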

Feel free to email support@bluelightai.com for enhancements 💪 or troubleshooting 🙏

<div style="display: flex; align-items: center; justify-content: space-between;">
    <div style="flex: 1; text-align: left;">
        <a href="#top" style="text-decoration: none; color: inherit;"> 
            <h3>Top of Page</h3> 
        </a>
    </div>
    <div style="flex: 1; text-align: right;">
        <img width="50%" alt="Bluelight AI Logo" href="https://bluelightai.com/" src="https://github.com/BlueLightAI/cobalt-examples/blob/main/assets/blai-logo-light.png?raw=true" style="float: right;">
    </div>
</div>