In [None]:
import os
import pandas as pd
import matplotlib

# 1. Model Selection
This is an asymmetric semantic search problem: short queries are matched against longer product texts. Semantic search models can be grouped by the datasets they were trained on. One group is the Multi-QA models, which are trained on question-answer datasets such as Amazon-QA (question-answer pairs from Amazon product pages). The other group is the MSMARCO Passage models, which are trained on search queries from Bing.

Because ESCI is an e-commerce dataset similar to Amazon's, I selected a model from the Multi-QA group. I reviewed the model cards of the six Multi-QA models and considered only those that document their training datasets in detail. Balancing performance against query speed, I chose `multi-qa-MiniLM-L6-cos-v1`, which offers moderate performance and very fast query speed.

The model is intended for semantic search: it encodes queries or questions and text paragraphs so that relevant paragraphs can be retrieved for a given query.
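
As a rough illustration of how such a model might be used for ranking, the sketch below scores product texts against a query by cosine similarity. It is not the code that produced the results in this notebook: the helper names `rank_by_cosine` and `embed_and_rank` are hypothetical, and the embedding step assumes the `sentence-transformers` package is installed and that the model is published under the identifier shown.

```python
import numpy as np


def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices (and scores) sorted from most to least similar."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    # Normalize so that the dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores, kind="stable")
    return order, scores[order]


def embed_and_rank(query, product_texts,
                   model_name="sentence-transformers/multi-qa-MiniLM-L6-cos-v1"):
    """Encode a query and product texts, then rank the products (assumed workflow)."""
    # Imported lazily so the ranking helper above works without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    query_vec = model.encode(query)                 # 1-D embedding
    doc_vecs = model.encode(list(product_texts))    # one row per product text
    return rank_by_cosine(query_vec, doc_vecs)
```

Calling `embed_and_rank("wireless earbuds", product_texts)` would return the product indices in ranked order; the query string and `product_texts` are placeholders here.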

# 2. Model Evaluation
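
For reference, the sketch below shows one common way to compute NDCG for a single query from its ranked ESCI labels. The gain values in `ESCI_GAIN` and the linear-gain, log2-discount formulation are assumptions for illustration; the scores stored in `query_ndcg_df` may have been computed with different gains or a different implementation.

```python
import numpy as np

# Assumed gains per ESCI label (Exact, Substitute, Complement, Irrelevant);
# the values used to build query_ndcg_df may differ.
ESCI_GAIN = {"E": 1.0, "S": 0.1, "C": 0.01, "I": 0.0}


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    gains = np.asarray(gains, dtype=float)
    discounts = np.log2(np.arange(2, gains.size + 2))
    return float(np.sum(gains / discounts))


def ndcg(ranked_labels):
    """NDCG for one query, given ESCI labels in the order the model ranked them."""
    gains = [ESCI_GAIN[label] for label in ranked_labels]
    ideal = dcg(sorted(gains, reverse=True))
    # A query with no relevant products has no meaningful ideal ranking.
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

A ranking that places every product in descending order of gain scores 1.0, and the score drops as relevant products are pushed down the list.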

In [None]:
# Read in data
output_folder = "output"
query_ndcg_file = os.path.join(output_folder, "query_ndcg_df.parquet")
query_ndcg_df = pd.read_parquet(query_ndcg_file)

In [None]:
# Check NDCG scores statistics
query_ndcg_df['ndcg_score'].describe()

In [None]:
# Plot histogram of NDCG scores
query_ndcg_df['ndcg_score'].hist(bins=50)

In [None]:
# Inspect queries with NDCG scores below 0.5
query_ndcg_df.loc[query_ndcg_df['ndcg_score'] < 0.5]

The mean and median NDCG scores are both greater than 0.9, indicating strong performance of the selected model. However, the model performs worse on queries in Japanese or Spanish. To evaluate robustness further, repeated cross-validation, stratified by query language so that Japanese and Spanish queries are proportionally sampled, could be used to obtain the distribution of mean NDCG scores. Performance can also be evaluated separately on Japanese and Spanish queries, and a two-sample test (a t-test, or a non-parametric alternative such as the Mann-Whitney U test) could assess whether NDCG scores are significantly higher than a baseline.
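
A per-language comparison along these lines could be sketched as follows. This is only an illustration on made-up numbers: the `locale` column and the `compare_locales` helper are hypothetical (the notebook does not show whether `query_ndcg_df` carries a language column), and the Mann-Whitney U test from SciPy is used here as one rank-based, non-parametric option for the two-sample comparison.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Toy stand-in for query_ndcg_df; the `locale` column is hypothetical
# and the scores are invented for illustration only.
toy_df = pd.DataFrame({
    "locale": ["us"] * 6 + ["jp"] * 6,
    "ndcg_score": [0.95, 0.97, 0.92, 0.99, 0.96, 0.94,
                   0.70, 0.82, 0.65, 0.88, 0.75, 0.60],
})


def compare_locales(df, locale_a, locale_b, score_col="ndcg_score"):
    """One-sided Mann-Whitney U test: do scores for locale_a tend to be larger?"""
    a = df.loc[df["locale"] == locale_a, score_col]
    b = df.loc[df["locale"] == locale_b, score_col]
    stat, p_value = mannwhitneyu(a, b, alternative="greater")
    return stat, p_value


# Per-locale summary statistics, then the test itself.
summary = toy_df.groupby("locale")["ndcg_score"].agg(["count", "mean", "median"])
stat, p_value = compare_locales(toy_df, "us", "jp")
```

The same helper could compare the model's scores against those of a baseline model by placing both sets of scores in one frame with a grouping column.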

Since the model was not trained specifically on the ESCI dataset, its performance could potentially be improved by fine-tuning it on the ESCI dataset directly, or by further training on language-specific subsets (e.g., Japanese or Spanish queries) to better capture linguistic nuances.

The current evaluation is also based on a relatively small sample, which may bias the results or limit their generalizability. Evaluating on a larger, more representative sample would give a more robust assessment of model performance.
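
One way to gauge how much the sample size limits the conclusions is a bootstrap confidence interval for the mean NDCG. The sketch below uses a percentile bootstrap on invented scores; `bootstrap_mean_ci` is a hypothetical helper, and the number of resamples and the 95% level are arbitrary choices.

```python
import numpy as np


def bootstrap_mean_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample queries with replacement and record the mean of each resample.
    idx = rng.integers(0, scores.size, size=(n_resamples, scores.size))
    means = scores[idx].mean(axis=1)
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


# Invented scores for illustration only.
toy_scores = [0.95, 0.97, 0.62, 0.99, 0.96, 0.94, 0.70, 0.88, 1.00, 0.91]
lower, upper = bootstrap_mean_ci(toy_scores)
```

On the notebook's data this would be called as `bootstrap_mean_ci(query_ndcg_df['ndcg_score'])`; a wide interval would suggest that more queries are needed before drawing firm conclusions.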