<a href="https://colab.research.google.com/github/KJOELJOYSON2427/Fine_Tunning_Bert_Query_Aware_Retrieval_-_Reranking_Olist_Dataset_using_Pytorch/blob/main/Fine_Tunning_Bert_Query_Aware_Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# 🔍 Hybrid Search Engine on Olist E-commerce Reviews

This notebook builds a **production-style hybrid search system** on the **Olist e-commerce reviews dataset**. Rather than stopping at **sentiment analysis** or **exploratory data analysis (EDA)**, it builds a working **search engine** that combines several retrieval techniques.

---

## **Architecture Overview**

Our search system combines three complementary approaches to provide **fast and accurate results**:

1. **BM25 (Lexical Search)**
   - 🔹 Fast keyword-based retrieval
   - 🔹 Excellent for exact matches and surface-level search

2. **BERT Embeddings (Semantic Search)**
   - 🔹 Understands meaning beyond keywords
   - 🔹 Captures context and semantic similarity between queries and reviews

3. **Cross-Encoder Reranking**
   - 🔹 Precision scoring for top results
   - 🔹 Reranks candidate documents from BM25 and BERT for maximum relevance
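
As a rough illustration of how the three stages fit together, the sketch below wires toy stand-ins into a retrieve-then-rerank flow. The scorers (`lexical_score`, `semantic_score`, `rerank_score`) and the `hybrid_search` function are hypothetical placeholders, not the notebook's actual components. The point is only the control flow: two cheap retrievers propose candidates, and a costlier scorer orders their union.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def top_k(query: str, docs: List[str], scorer: Scorer, k: int) -> List[int]:
    """Return the indices of the k best-scoring documents for one scorer."""
    ranked = sorted(range(len(docs)), key=lambda i: scorer(query, docs[i]), reverse=True)
    return ranked[:k]


def hybrid_search(query: str, docs: List[str], lexical: Scorer, semantic: Scorer,
                  reranker: Scorer, k_retrieve: int = 3, k_final: int = 2) -> List[int]:
    """Union the candidates of two retrievers, then rerank the union."""
    candidates = set(top_k(query, docs, lexical, k_retrieve))
    candidates |= set(top_k(query, docs, semantic, k_retrieve))
    reranked = sorted(candidates, key=lambda i: reranker(query, docs[i]), reverse=True)
    return reranked[:k_final]


# Toy stand-ins below are assumptions for illustration only.
def lexical_score(query: str, doc: str) -> float:
    """Count shared whitespace tokens (stand-in for BM25)."""
    return float(len(set(query.split()) & set(doc.split())))


def semantic_score(query: str, doc: str) -> float:
    """Character-trigram Jaccard overlap (stand-in for embedding similarity)."""
    def grams(s: str) -> set:
        return {s[i:i + 3] for i in range(len(s) - 2)}

    q, d = grams(query), grams(doc)
    return len(q & d) / len(q | d) if q | d else 0.0


def rerank_score(query: str, doc: str) -> float:
    """Sum of both signals (stand-in for a cross-encoder)."""
    return lexical_score(query, doc) + semantic_score(query, doc)


docs = ["entrega rapida", "produto bom", "entrega atrasada", "otima qualidade"]
results = hybrid_search("entrega rapida", docs, lexical_score, semantic_score, rerank_score)
print([docs[i] for i in results])
```

In this toy setup the reranker only ever sees the small candidate set, which is what keeps an expensive final scorer affordable.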

---

## **Why Hybrid Search?**

Large-scale search systems, such as those behind **Amazon**, **Google**, and other e-commerce platforms, are widely described as combining several retrieval stages to **balance speed and accuracy**:

- 🔹 **BM25** for quick filtering  
- 🔹 **BERT embeddings** for semantic understanding  
- 🔹 **Cross-encoder** for precise ranking  

By combining these, we ensure that our search engine returns **both relevant and contextually meaningful results**.

---

Let's dive in and **build a search engine that feels like the real deal!**




**Goal:** Given a user query like "fast delivery" or "poor quality product", retrieve the most relevant product reviews from thousands of customer reviews using a two-stage system: candidate retrieval followed by reranking.

# Stage 0: Environment Setup

In [20]:
import sys
import subprocess
import warnings
import os

# Helpers for silencing Python-level stdout/stderr during installation
import io
from contextlib import redirect_stderr, redirect_stdout

# Capture and suppress all output during installation
with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):

    # Reinstall protobuf pinned to 3.20.3
    subprocess.run([sys.executable, "-m", "pip", "uninstall", "-y", "protobuf"],
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "protobuf==3.20.3"],
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

    # Install the retrieval packages
    for package in ["sentence-transformers", "rank-bm25", "faiss-cpu"]:
        subprocess.run([sys.executable, "-m", "pip", "install", "-q", package],
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

    # Suppress Python warnings and TensorFlow C++ logging
    warnings.filterwarnings('ignore')
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
    # Hide all GPUs from CUDA, so everything below runs on CPU
    os.environ['CUDA_VISIBLE_DEVICES'] = ''

    # Import TensorFlow inside the suppressed context so its startup messages stay hidden
    import tensorflow as tf

    import logging
    logging.getLogger('tensorflow').setLevel(logging.ERROR)
    logging.getLogger('transformers').setLevel(logging.ERROR)

# Now import everything normally (errors already suppressed)
import pandas as pd
import numpy as np
import re
from typing import List, Tuple

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import faiss
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output
import time
from collections import defaultdict

print("Setup complete - all libraries loaded successfully!")

Setup complete - all libraries loaded successfully!


# Stage 1: Data Loading and Exploration

We load the Olist reviews dataset and perform initial exploration. The main dataset contains customer reviews with text, ratings, and metadata.

In [21]:
# Load the reviews dataset
reviews_df = pd.read_csv('/content/olist_order_reviews_dataset.csv')
orders_df = pd.read_csv('/content/olist_orders_dataset.csv')
order_items_df = pd.read_csv('/content/olist_order_items_dataset.csv')
products_df = pd.read_csv('/content/olist_products_dataset.csv')
translation_df = pd.read_csv('/content/product_category_name_translation.csv')


print("Dataset loaded successfully!")
print(f"\nReviews shape: {reviews_df.shape}")
print(f"Orders shape: {orders_df.shape}")
print(f"Order items shape: {order_items_df.shape}")
print(f"Products shape: {products_df.shape}")

Dataset loaded successfully!

Reviews shape: (99224, 7)
Orders shape: (99441, 8)
Order items shape: (112650, 7)
Products shape: (32951, 9)


In [22]:

# Display first few rows
print("\nSample Reviews:")
reviews_df.head()



Sample Reviews:


Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18 00:00:00,2018-01-18 21:46:59
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10 00:00:00,2018-03-11 03:05:13
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17 00:00:00,2018-02-18 14:36:24
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53


# Initial Data Inspection


Checking data quality, missing values, and review text distribution.

In [23]:
# Check missing values in reviews
print("Missing values in reviews dataset:")
print(reviews_df.isnull().sum())

print("\n" + "="*50)
print("Review Score Distribution:")
print(reviews_df['review_score'].value_counts().sort_index())

print("\n" + "="*50)
# Check non-null reviews
non_null_reviews = reviews_df['review_comment_message'].notna().sum()
print(f"Reviews with text: {non_null_reviews} out of {len(reviews_df)} ({non_null_reviews/len(reviews_df)*100:.2f}%)")

Missing values in reviews dataset:
review_id                      0
order_id                       0
review_score                   0
review_comment_title       87656
review_comment_message     58247
review_creation_date           0
review_answer_timestamp        0
dtype: int64

Review Score Distribution:
review_score
1    11424
2     3151
3     8179
4    19142
5    57328
Name: count, dtype: int64

Reviews with text: 40977 out of 99224 (41.30%)



# Stage 2: Data Preprocessing and Enrichment

We clean the review text, combine title and message, and enrich reviews with product information. This creates a comprehensive search corpus.

In [24]:
def clean_text(text):
    """Clean and normalize text data."""
    if pd.isna(text):
        return ""

    # Convert to string and lowercase
    text = str(text).lower()

    # Replace everything except ASCII letters, digits, whitespace, and basic
    # punctuation with a space. Note: this also replaces accented Portuguese
    # characters (e.g. "parabéns" becomes "parab ns"), as the outputs below show.
    text = re.sub(r'[^a-zA-Z0-9\s\.\,\!\?]', ' ', text)

    # Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Create a copy for processing
df = reviews_df.copy()

# Clean review title and message
df['review_title_clean'] = df['review_comment_title'].apply(clean_text)
df['review_message_clean'] = df['review_comment_message'].apply(clean_text)


# Combine title and message
df['review_text'] = df['review_title_clean'] + ' ' + df['review_message_clean']
df['review_text'] = df['review_text'].str.strip()

# Keep only reviews with at least 10 characters of text
df = df[df['review_text'].str.len() >= 10].copy()

print(f"After cleaning: {len(df)} reviews remaining")
print(f"Average review length: {df['review_text'].str.len().mean():.0f} characters")


After cleaning: 38968 reviews remaining
Average review length: 74 characters
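
The ASCII-only pattern in `clean_text` replaces accented characters with spaces, which is why later outputs show fragments such as `parab ns` and `n o`. One possible alternative, sketched here under the assumption that folding accents to their base letters is acceptable, is to decompose the text with `unicodedata.normalize` before filtering. `clean_text_keep_letters` is a hypothetical name, not part of the notebook, and switching to it would change the outputs shown in later cells.

```python
import math
import re
import unicodedata


def clean_text_keep_letters(text) -> str:
    """Lowercase, fold accents to base letters, and drop other symbols."""
    # Treat None and float NaN as missing text
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return ""
    text = str(text).lower()
    # NFKD splits "é" into "e" plus a combining accent, which is then dropped
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Same idea as the notebook's filter: keep letters, digits, basic punctuation
    text = re.sub(r"[^a-z0-9\s\.\,\!\?]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_text_keep_letters("Parabéns! Não chegou, péssimo."))
```

Folding accents also means that unaccented queries such as `entrega rapida` and accented review text such as `entrega rápida` end up with the same tokens.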


In [25]:
# Enrich reviews with product information:
# merge with order items to attach a product_id to each review.
# order_items_df has one row per item, so multi-item orders yield several rows per review.
df = df.merge(order_items_df[['order_id', 'product_id']], on='order_id', how='left')

df

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,review_title_clean,review_message_clean,review_text,product_id
0,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06,,recebi bem antes do prazo estipulado.,recebi bem antes do prazo estipulado.,52c80cedd4e90108bf4fa6a206ef6b03
1,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53,,parab ns lojas lannister adorei comprar pela i...,parab ns lojas lannister adorei comprar pela i...,3880d25d502b15b1de6fddc42ad1d67a
2,8670d52e15e00043ae7de4c01cc2fe06,b9bf720beb4ab3728760088589c62129,4,recomendo,aparelho eficiente. no site a marca do aparelh...,2018-05-22 00:00:00,2018-05-23 16:45:47,recomendo,aparelho eficiente. no site a marca do aparelh...,recomendo aparelho eficiente. no site a marca ...,61a4100ccd6d9c4c808a1fd954ddb8ad
3,4b49719c8a200003f700d3d986ea1a19,9d6f15f95d01e79bd1349cc208361f09,4,,"Mas um pouco ,travando...pelo valor ta Boa.\r\n",2018-02-16 00:00:00,2018-02-20 10:52:22,,"mas um pouco ,travando...pelo valor ta boa.","mas um pouco ,travando...pelo valor ta boa.",acffe5d7cd56e6b564cf6841486644ff
4,3948b09f7c818e2d86c9a546758b2335,e51478e7e277a83743b6f9991dbfa3fb,5,Super recomendo,"Vendedor confiável, produto ok e entrega antes...",2018-05-23 00:00:00,2018-05-24 03:00:01,super recomendo,"vendedor confi vel, produto ok e entrega antes...","super recomendo vendedor confi vel, produto ok...",6871a3c157d6f51697e887f3c3598479
...,...,...,...,...,...,...,...,...,...,...,...
45945,98fffa80dc9acbde7388bef1600f3b15,d398e9c82363c12527f71801bf0e6100,4,,para este produto recebi de acordo com a compr...,2017-11-29 00:00:00,2017-11-30 15:52:51,,para este produto recebi de acordo com a compr...,para este produto recebi de acordo com a compr...,0bdd9a52d192d64770e33ae559512478
45946,df5fae90e85354241d5d64a8955b2b09,509b86c65fe4e2ad5b96408cfef9755e,5,,Entregou dentro do prazo. O produto chegou em ...,2018-02-07 00:00:00,2018-02-19 19:47:23,,entregou dentro do prazo. o produto chegou em ...,entregou dentro do prazo. o produto chegou em ...,b5e2d4b3fa2f9ecad4c1df951afcc8ce
45947,a709d176f59bc3af77f4149c96bae357,d5cb12269711bd1eaf7eed8fd32a7c95,3,,"O produto não foi enviado com NF, não existe v...",2018-05-19 00:00:00,2018-05-20 21:51:06,,"o produto n o foi enviado com nf, n o existe v...","o produto n o foi enviado com nf, n o existe v...",79da264732f717f10ebf5d102aa6c32a
45948,b3de70c89b1510c4cd3d0649fd302472,55d4004744368f5571d1f590031933e4,5,,"Excelente mochila, entrega super rápida. Super...",2018-03-22 00:00:00,2018-03-23 09:10:43,,"excelente mochila, entrega super r pida. super...","excelente mochila, entrega super r pida. super...",8ba0118a487ec8671aed57e5ef846574


In [26]:
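# Note: this repeats the merge from the previous cell. Because df already has a
# product_id column, pandas adds _x/_y suffixes, and multi-item orders are multiplied again.
df = df.merge(order_items_df[['order_id', 'product_id']], on='order_id', how='left')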
df

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,review_title_clean,review_message_clean,review_text,product_id_x,product_id_y
0,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06,,recebi bem antes do prazo estipulado.,recebi bem antes do prazo estipulado.,52c80cedd4e90108bf4fa6a206ef6b03,52c80cedd4e90108bf4fa6a206ef6b03
1,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53,,parab ns lojas lannister adorei comprar pela i...,parab ns lojas lannister adorei comprar pela i...,3880d25d502b15b1de6fddc42ad1d67a,3880d25d502b15b1de6fddc42ad1d67a
2,8670d52e15e00043ae7de4c01cc2fe06,b9bf720beb4ab3728760088589c62129,4,recomendo,aparelho eficiente. no site a marca do aparelh...,2018-05-22 00:00:00,2018-05-23 16:45:47,recomendo,aparelho eficiente. no site a marca do aparelh...,recomendo aparelho eficiente. no site a marca ...,61a4100ccd6d9c4c808a1fd954ddb8ad,61a4100ccd6d9c4c808a1fd954ddb8ad
3,4b49719c8a200003f700d3d986ea1a19,9d6f15f95d01e79bd1349cc208361f09,4,,"Mas um pouco ,travando...pelo valor ta Boa.\r\n",2018-02-16 00:00:00,2018-02-20 10:52:22,,"mas um pouco ,travando...pelo valor ta boa.","mas um pouco ,travando...pelo valor ta boa.",acffe5d7cd56e6b564cf6841486644ff,acffe5d7cd56e6b564cf6841486644ff
4,3948b09f7c818e2d86c9a546758b2335,e51478e7e277a83743b6f9991dbfa3fb,5,Super recomendo,"Vendedor confiável, produto ok e entrega antes...",2018-05-23 00:00:00,2018-05-24 03:00:01,super recomendo,"vendedor confi vel, produto ok e entrega antes...","super recomendo vendedor confi vel, produto ok...",6871a3c157d6f51697e887f3c3598479,6871a3c157d6f51697e887f3c3598479
...,...,...,...,...,...,...,...,...,...,...,...,...
68209,98fffa80dc9acbde7388bef1600f3b15,d398e9c82363c12527f71801bf0e6100,4,,para este produto recebi de acordo com a compr...,2017-11-29 00:00:00,2017-11-30 15:52:51,,para este produto recebi de acordo com a compr...,para este produto recebi de acordo com a compr...,0bdd9a52d192d64770e33ae559512478,0bdd9a52d192d64770e33ae559512478
68210,df5fae90e85354241d5d64a8955b2b09,509b86c65fe4e2ad5b96408cfef9755e,5,,Entregou dentro do prazo. O produto chegou em ...,2018-02-07 00:00:00,2018-02-19 19:47:23,,entregou dentro do prazo. o produto chegou em ...,entregou dentro do prazo. o produto chegou em ...,b5e2d4b3fa2f9ecad4c1df951afcc8ce,b5e2d4b3fa2f9ecad4c1df951afcc8ce
68211,a709d176f59bc3af77f4149c96bae357,d5cb12269711bd1eaf7eed8fd32a7c95,3,,"O produto não foi enviado com NF, não existe v...",2018-05-19 00:00:00,2018-05-20 21:51:06,,"o produto n o foi enviado com nf, n o existe v...","o produto n o foi enviado com nf, n o existe v...",79da264732f717f10ebf5d102aa6c32a,79da264732f717f10ebf5d102aa6c32a
68212,b3de70c89b1510c4cd3d0649fd302472,55d4004744368f5571d1f590031933e4,5,,"Excelente mochila, entrega super rápida. Super...",2018-03-22 00:00:00,2018-03-23 09:10:43,,"excelente mochila, entrega super r pida. super...","excelente mochila, entrega super r pida. super...",8ba0118a487ec8671aed57e5ef846574,8ba0118a487ec8671aed57e5ef846574


In [27]:
# 1. See how many product_id columns you actually have
print(df.filter(like='product_id').columns.tolist())

# 2. Drop the duplicated column and keep only one
cols_to_keep = ['product_id_x']  # or 'product_id_y'; pick the one that has values
cols_to_drop = [col for col in df.columns if col.startswith('product_id') and col not in cols_to_keep]

df = df.drop(columns=cols_to_drop)

# 3. Rename it back to the original name
df = df.rename(columns={'product_id_x': 'product_id'})

['product_id_x', 'product_id_y']
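
The row count grows across the two merge cells above because `order_items_df` holds one row per item: a left merge on `order_id` repeats each review once per item, and repeating the merge compounds the effect. The toy frames below (invented data, not Olist rows) reproduce this and show one possible guard, which is to run the merge once against a de-duplicated lookup. Keeping only the first product per order is an assumption made for illustration; whether it fits depends on how multi-item orders should be treated.

```python
import pandas as pd

reviews = pd.DataFrame({"order_id": ["o1", "o2"], "review_text": ["bom", "ruim"]})
items = pd.DataFrame({"order_id": ["o1", "o1", "o2"], "product_id": ["p1", "p2", "p3"]})

# One-to-many left merge: the review for o1 appears once per item
once = reviews.merge(items, on="order_id", how="left")

# Repeating the same merge adds _x/_y suffixes and multiplies rows again
twice = once.merge(items, on="order_id", how="left")

# Possible guard: keep one product per order before merging.
# validate="many_to_one" makes pandas raise if the lookup still has duplicate keys.
first_item = items.drop_duplicates(subset="order_id", keep="first")
guarded = reviews.merge(first_item, on="order_id", how="left", validate="many_to_one")

print(len(reviews), len(once), len(twice), len(guarded))
print(twice.columns.tolist())
```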


In [28]:
df['product_id'].value_counts()

Unnamed: 0_level_0,count
product_id,Unnamed: 1_level_1
422879e10f46682990de24d770e7f83d,487
368c6c730842d78016ad823897a372db,418
53759a2ecddad2bb87a079a1f1519f73,400
389d119b48cf3043d311335e499d9c6b,340
a62e25e09e05e6faf31d90c6ec1aa3d1,339
...,...
9f4923c125aadccb0e99914c96c8829d,1
49454c6dad93119e2815c43df82b2f57,1
a599c9fe3eab8ac297b1bab1e3b60f88,1
d249623077b8906276b8d77ced95f8c4,1


In [36]:
# Merge with products to get category
df = df.merge(products_df[['product_id', 'product_category_name']],
              on='product_id', how='left')


In [37]:
df

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,review_title_clean,review_message_clean,review_text,product_id,product_category_name
0,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06,,recebi bem antes do prazo estipulado.,recebi bem antes do prazo estipulado.,52c80cedd4e90108bf4fa6a206ef6b03,ferramentas_jardim
1,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53,,parab ns lojas lannister adorei comprar pela i...,parab ns lojas lannister adorei comprar pela i...,3880d25d502b15b1de6fddc42ad1d67a,esporte_lazer
2,8670d52e15e00043ae7de4c01cc2fe06,b9bf720beb4ab3728760088589c62129,4,recomendo,aparelho eficiente. no site a marca do aparelh...,2018-05-22 00:00:00,2018-05-23 16:45:47,recomendo,aparelho eficiente. no site a marca do aparelh...,recomendo aparelho eficiente. no site a marca ...,61a4100ccd6d9c4c808a1fd954ddb8ad,eletroportateis
3,4b49719c8a200003f700d3d986ea1a19,9d6f15f95d01e79bd1349cc208361f09,4,,"Mas um pouco ,travando...pelo valor ta Boa.\r\n",2018-02-16 00:00:00,2018-02-20 10:52:22,,"mas um pouco ,travando...pelo valor ta boa.","mas um pouco ,travando...pelo valor ta boa.",acffe5d7cd56e6b564cf6841486644ff,beleza_saude
4,3948b09f7c818e2d86c9a546758b2335,e51478e7e277a83743b6f9991dbfa3fb,5,Super recomendo,"Vendedor confiável, produto ok e entrega antes...",2018-05-23 00:00:00,2018-05-24 03:00:01,super recomendo,"vendedor confi vel, produto ok e entrega antes...","super recomendo vendedor confi vel, produto ok...",6871a3c157d6f51697e887f3c3598479,informatica_acessorios
...,...,...,...,...,...,...,...,...,...,...,...,...
68209,98fffa80dc9acbde7388bef1600f3b15,d398e9c82363c12527f71801bf0e6100,4,,para este produto recebi de acordo com a compr...,2017-11-29 00:00:00,2017-11-30 15:52:51,,para este produto recebi de acordo com a compr...,para este produto recebi de acordo com a compr...,0bdd9a52d192d64770e33ae559512478,relogios_presentes
68210,df5fae90e85354241d5d64a8955b2b09,509b86c65fe4e2ad5b96408cfef9755e,5,,Entregou dentro do prazo. O produto chegou em ...,2018-02-07 00:00:00,2018-02-19 19:47:23,,entregou dentro do prazo. o produto chegou em ...,entregou dentro do prazo. o produto chegou em ...,b5e2d4b3fa2f9ecad4c1df951afcc8ce,esporte_lazer
68211,a709d176f59bc3af77f4149c96bae357,d5cb12269711bd1eaf7eed8fd32a7c95,3,,"O produto não foi enviado com NF, não existe v...",2018-05-19 00:00:00,2018-05-20 21:51:06,,"o produto n o foi enviado com nf, n o existe v...","o produto n o foi enviado com nf, n o existe v...",79da264732f717f10ebf5d102aa6c32a,telefonia
68212,b3de70c89b1510c4cd3d0649fd302472,55d4004744368f5571d1f590031933e4,5,,"Excelente mochila, entrega super rápida. Super...",2018-03-22 00:00:00,2018-03-23 09:10:43,,"excelente mochila, entrega super r pida. super...","excelente mochila, entrega super r pida. super...",8ba0118a487ec8671aed57e5ef846574,papelaria





In [39]:
# Merge with translation for English category names
df = df.merge(translation_df, on='product_category_name', how='left')



In [40]:
# Fill missing category with 'unknown'
df['product_category_name_english'] = df['product_category_name_english'].fillna('unknown')


In [41]:

print("Data enrichment complete!")
print(f"\nFinal dataset shape: {df.shape}")
print(f"\nCategories available: {df['product_category_name_english'].nunique()}")

Data enrichment complete!

Final dataset shape: (68214, 13)

Categories available: 72


# Sample Enriched Data

Let's inspect a few enriched reviews with product information.

In [42]:
# Display sample enriched data
sample_cols = ['review_text', 'review_score', 'product_category_name_english']
print("Sample enriched reviews:\n")
df[sample_cols].head(10)

Sample enriched reviews:



Unnamed: 0,review_text,review_score,product_category_name_english
0,recebi bem antes do prazo estipulado.,5,garden_tools
1,parab ns lojas lannister adorei comprar pela i...,5,sports_leisure
2,recomendo aparelho eficiente. no site a marca ...,4,small_appliances
3,"mas um pouco ,travando...pelo valor ta boa.",4,health_beauty
4,"super recomendo vendedor confi vel, produto ok...",5,computers_accessories
5,"gostaria de saber o que houve, sempre recebi e...",2,cool_stuff
6,n o chegou meu produto p ssimo,1,luggage_accessories
7,timo loja nota 10,5,health_beauty
8,obrigado pela aten ao amim dispensada,5,auto
9,a compra foi realizada facilmente. a entrega f...,5,electronics


# Building the Search Corpus

We prepare the final search corpus by creating a clean dataset with all necessary fields. We'll also create a subset for faster prototyping if needed.

In [43]:
from SearchCorpus import SearchCorpusBuilder

builder = SearchCorpusBuilder(df)
corpus = builder.build_corpus()

Using a sample of 30000 reviews for the search system

Final corpus size: 30000 documents
Average text length: 80 characters
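
The `SearchCorpus` module is not shown in this notebook, so its behavior can only be inferred from the printed output and the resulting columns. The class below is a hypothetical stand-in, not the actual `SearchCorpusBuilder`: it assumes the builder selects the five search fields, keeps one row per review, samples down to a cap, and resets the index. The names `SearchCorpusBuilderSketch`, `max_docs`, and `seed` are all invented for this sketch.

```python
import pandas as pd


class SearchCorpusBuilderSketch:
    """Hypothetical stand-in for the notebook's SearchCorpusBuilder."""

    COLUMNS = ["review_id", "review_text", "review_score",
               "product_category_name_english", "order_id"]

    def __init__(self, df: pd.DataFrame, max_docs: int = 30000, seed: int = 42):
        self.df = df
        self.max_docs = max_docs
        self.seed = seed
        self.corpus = None

    def build_corpus(self) -> pd.DataFrame:
        # Keep only the fields the search system needs
        corpus = self.df[self.COLUMNS].copy()
        # Assumption: one document per review, so drop repeated review_ids
        corpus = corpus.drop_duplicates(subset="review_id")
        # Sample down for faster prototyping when the corpus is large
        if len(corpus) > self.max_docs:
            corpus = corpus.sample(n=self.max_docs, random_state=self.seed)
        self.corpus = corpus.reset_index(drop=True)
        return self.corpus


toy = pd.DataFrame({
    "review_id": ["r1", "r1", "r2", "r3"],
    "review_text": ["bom produto", "bom produto", "entrega rapida", "otima qualidade"],
    "review_score": [5, 5, 4, 5],
    "product_category_name_english": ["auto", "auto", "toys", "unknown"],
    "order_id": ["o1", "o1", "o2", "o3"],
    "extra": [0, 1, 2, 3],
})
sketch_builder = SearchCorpusBuilderSketch(toy, max_docs=2)
sketch_corpus = sketch_builder.build_corpus()
print(sketch_corpus)
```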




In [44]:
search_corpus = builder.corpus
search_corpus

Unnamed: 0,review_id,review_text,review_score,product_category_name_english,order_id
0,dae5a78729fd69013f55c992ff15f298,otimo produto muito bonito,4,furniture_decor,ecc812c240c4f03824a3c5d7e3b6d351
1,9e21efc628414102721b3f7cfc3abbf3,produto lindo.,5,garden_tools,84c1ef4f5624b36f93862b353cd6fef0
2,03129dea7c12fa5878b2e629ccdf2ce6,comprei 14 unidades e recebi somente 9. ainda ...,1,telephony,9bdc4d4c71aa1de4606060929dee888c
3,639b7d9539b902df47a5ea460fbd5f01,s o produto tem uma voz irritante ao mudar ent...,5,electronics,a4f4688be9887418ee0c130b6f061586
4,704841be76891e4fed352e235aa80abe,"o produto chegou, o tempo foi bom. por m n o e...",1,sports_leisure,70ad0c0c83d69921cc0675f3e56d4e02
...,...,...,...,...,...
29995,0b309a3dd442730e7ea90f7fbf4e0696,timo vendedor recomendo a todos.,5,home_appliances,99f48f57d8e3a6f6bcc13902ee0067c7
29996,ff0693cd9ed942fed4dc10a397be570d,"boa tarde, n o recebi as p talas, somente o ad...",1,furniture_decor,732f26c55f8366066c096f572cb73854
29997,84f78072bcd80e27587ab488f5f47981,eu recomendo muito excelente pre o,4,sports_leisure,87ce6f9094188f36623b250901635869
29998,ca101636d2830ba9ec964f0546d75e96,produto chegou e muito bom!,5,housewares,1b76e3d417b12339d254588a95496581


# Stage 4: BM25 Index Creation (Lexical Search)

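BM25 is a probabilistic ranking function that scores each document by how often the query terms occur in it, weighted by how rare each term is across the corpus and normalized by document length. It is fast and effective for keyword matching.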
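
To make the scoring concrete, here is a minimal pure-Python sketch of the textbook Okapi BM25 formula, using the `+1` IDF variant so that weights stay non-negative. It is not the code behind `BM25Okapi`; library implementations such as `rank_bm25` may differ in details like IDF smoothing, so the numbers will not match exactly. The parameter defaults `k1=1.5` and `b=0.75` are common choices assumed here.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized, shorter ones boosted
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query:
            tf = term_freq[term]
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [d.split() for d in ["entrega rapida", "produto bom", "entrega atrasada produto ruim"]]
pt_scores = bm25_scores("entrega rapida".split(), docs)
en_scores = bm25_scores("fast delivery".split(), docs)
print(pt_scores)
print(en_scores)
```

Because only terms that literally occur in a document contribute, the English query scores zero against every Portuguese document in this toy corpus.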

**Tokenizer**

In [45]:
def tokenize(text):
    """Simple whitespace tokenizer"""
    return text.lower().split()
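
A whitespace split keeps punctuation attached to tokens, so `rapida.` and `rapida` count as different terms for BM25. A possible variant, shown only as a sketch, extracts runs of word characters with a regular expression. `tokenize_words` is a hypothetical name, and the notebook's index is built with the simpler `tokenize` above.

```python
import re
from typing import List

# \w matches Unicode letters and digits for str patterns in Python 3
_WORD_RE = re.compile(r"\w+")


def tokenize_words(text: str) -> List[str]:
    """Lowercase the text and return runs of word characters as tokens."""
    return _WORD_RE.findall(text.lower())


print("produto lindo.".lower().split())
print(tokenize_words("produto lindo."))
```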


In [46]:
from BM25IndexCreation import BM25SearchEngine

bm25_engine = BM25SearchEngine(search_corpus, tokenize)


Tokenizing documents...
Building BM25 index...
BM25 Index created for 30000 documents!

Sample reviews (checking language):
1. otimo produto muito bonito...
2. produto lindo....
3. comprei 14 unidades e recebi somente 9. ainda n o foi resolvida a quest o....

Detected: Reviews are in Portuguese
Recommendation: Use Portuguese queries for BM25, or rely more on semantic search


In [47]:
bm25_engine.search("fast delivery", top_n=5)


Unnamed: 0,review_id,review_text,review_score,product_category_name_english,order_id,bm25_score


# Testing BM25 Search

Quick test to verify BM25 is working correctly.

## **Important Note: Language Considerations**

The Olist dataset contains reviews in Portuguese. This affects our search components differently:

1. **BM25 (Keyword Search)**
   - 🔹 Requires exact word matches
   - 🔹 English queries get zero scores on Portuguese text
   - 🔹 Solution: use Portuguese queries for BM25

2. **BERT Semantic Search**
   - 🔹 Multilingual embedding models can capture cross-language similarity
   - 🔹 Can work reasonably well even with English queries on Portuguese text
   - 🔹 Tends to capture meaning across languages better than keyword matching

3. **Hybrid System Advantage**
   - 🔹 Even if BM25 returns zero scores, semantic search can still find relevant results
   - 🔹 Cross-encoder reranking can also operate across languages, depending on the model
   - 🔹 The system remains functional with English queries because of the semantic component

For best results with this Portuguese dataset, use Portuguese queries. The system can still return results for English queries thanks to the semantic search component.


In [48]:
# Test BM25 with Portuguese queries (since reviews are in Portuguese)
test_queries_pt = [
    "entrega rapida",  # fast delivery
    "produto bom",     # good product
    "qualidade"        # quality
]

print("Testing BM25 with Portuguese queries:\n")
print("=" * 80)

for query in test_queries_pt:
    print("\nQUERY →", query.upper())
    print("—" * 80)

    # Retrieve the top 5 reviews for this query
    results = bm25_engine.search(query=query, top_n=5)

    # Show the results as a table, or a message if nothing matched
    if not results.empty:
        display(results[['review_text', 'review_score', 'bm25_score']].round(4))
    else:
        print("   → No results found")

    print("=" * 80)

Testing BM25 with Portuguese queries:


QUERY → ENTREGA RAPIDA
————————————————————————————————————————————————————————————————————————————————


Unnamed: 0,review_text,review_score,bm25_score
7279,entrega rapida,5,11.8765
26092,entrega rapida,4,11.8765
10958,entrega rapida,5,11.8765
17372,entrega rapida,5,11.8765
24341,entrega rapida,5,11.8765



QUERY → PRODUTO BOM
————————————————————————————————————————————————————————————————————————————————


Unnamed: 0,review_text,review_score,bm25_score
635,bom bom produto,4,5.9947
27101,bom bom produto,4,5.9947
1514,bom produto bom,4,5.9947
11176,bom bom produto,4,5.9947
17227,muito bom produto muito bom,5,5.6236



QUERY → QUALIDADE
————————————————————————————————————————————————————————————————————————————————


Unnamed: 0,review_text,review_score,bm25_score
9819,excelente qualidade,5,5.4485
28650,boa qualidade,3,5.4485
12455,boa qualidade,5,5.4485
16052,otima qualidade,5,5.4485
14012,boa qualidade,5,5.4485




# BERT Semantic Search Setup

We use SentenceTransformers to encode reviews into dense vectors. These embeddings capture semantic meaning beyond exact keyword matches.

<pre>
Transformer → Pooling → Sentence Embedding
</pre>
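
The pooling step in the pipeline above can be illustrated with a minimal NumPy sketch. It assumes mean pooling, a common choice for sentence embedding models; the function name `mean_pool` and the toy arrays are hypothetical stand-ins for real transformer outputs.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padding positions."""
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden dim
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy "transformer output": 1 sentence, 3 token positions, 4 hidden dimensions.
# The last position is padding and must not influence the result.
token_embeddings = np.array([[[1.0, 2.0, 3.0, 4.0],
                              [3.0, 2.0, 1.0, 0.0],
                              [9.0, 9.0, 9.0, 9.0]]])
attention_mask = np.array([[1, 1, 0]])

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding.shape)  # (1, 4)
print(sentence_embedding)        # [[2. 2. 2. 2.]]
```

The result is one fixed-size vector per sentence regardless of its token count, which is what makes the embeddings directly comparable.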

In [49]:
# Load pre-trained sentence transformer model
print("Loading BERT model for semantic search...")
print("This may take a minute on first run...")

model_name = 'all-MiniLM-L6-v2'  # Fast and efficient model

semantic_model = SentenceTransformer(model_name)

print(f"Model loaded: {model_name}")
print(f"Embedding dimension:{semantic_model.get_sentence_embedding_dimension()}")

Loading BERT model for semantic search...
This may take a minute on first run...


Model loaded: all-MiniLM-L6-v2
Embedding dimension:384


# Creating Document Embeddings
We encode all reviews into vector representations. This is the most time-consuming step but only needs to be done once.

In [50]:
from DocumentEmbedding import SemanticEmbeddingEngine

semantic_engine = SemanticEmbeddingEngine(
    dataframe=search_corpus,
    text_column="review_text",
    model_name=model_name,
    batch_size=32
)



Loading semantic model: all-MiniLM-L6-v2 ...
Encoding documents into embeddings...

Embeddings created!
Shape: (30000, 384)
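
Since encoding only needs to happen once, one option is to cache the embedding matrix on disk and reload it in later sessions. The sketch below is a hedged illustration: `load_or_compute_embeddings`, `fake_encode`, and the cache file name are hypothetical, and the stand-in encoder exists only so the example runs without downloading a model.

```python
import os
import tempfile
import numpy as np

def load_or_compute_embeddings(texts, encode_fn, cache_path):
    """Return cached embeddings if the file exists; otherwise encode and save."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    embeddings = np.asarray(encode_fn(texts), dtype=np.float32)
    np.save(cache_path, embeddings)
    return embeddings

# Stand-in encoder (hypothetical) that also counts how often it is called
calls = {"count": 0}

def fake_encode(texts):
    calls["count"] += 1
    return [[float(len(t)), float(t.count(" "))] for t in texts]

texts = ["otimo produto muito bonito", "produto lindo"]
cache_path = os.path.join(tempfile.mkdtemp(), "corpus_embeddings.npy")

first = load_or_compute_embeddings(texts, fake_encode, cache_path)   # encodes, saves
second = load_or_compute_embeddings(texts, fake_encode, cache_path)  # loads from disk

print(first.shape, calls["count"])
```

In the notebook, `encode_fn` would be the model's encoding call. This simple cache does not detect changes to the corpus or model, so the file should be deleted whenever either changes.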


In [59]:
def embed_query(query):
    """Encode a query string into a dense vector with the semantic model."""
    query_emb = semantic_engine.model.encode(query, convert_to_numpy=True)

    if query_emb is None or len(query_emb) == 0:
        raise ValueError("Query embedding is empty: model encoding failed")
    return query_emb

In [55]:
results = semantic_engine.search("good quality product", top_n=5)



# Building FAISS Index for Fast Vector Search

<pre>
FAISS (Facebook AI Similarity Search) is a library designed for efficient similarity search on high-dimensional vectors. It is mostly used for embeddings from NLP models, images, or other feature vectors.

Purpose: Quickly find nearest neighbors for a query vector among millions of embeddings.

Why fast: It uses optimized vector indexing, quantization, and sometimes GPU acceleration.

Typical workflow:
  1. Normalize or preprocess embeddings.
  2. Build an index.
  3. Add vectors to the index.
  4. Query the index for nearest neighbors.
</pre>
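
To make the workflow above concrete without depending on the FAISS package, here is a NumPy sketch of what an exact (flat) inner-product index computes. This is a brute-force stand-in, not FAISS itself: the function names `normalize_rows` and `flat_ip_search` are hypothetical, and the random vectors replace real review embeddings.

```python
import numpy as np

def normalize_rows(vectors):
    """L2-normalize each row so that inner product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def flat_ip_search(index_vectors, query_vectors, top_k):
    """Exhaustive inner-product search; returns (scores, indices), best first."""
    similarities = query_vectors @ index_vectors.T          # (n_queries, n_docs)
    top_indices = np.argsort(-similarities, axis=1)[:, :top_k]
    top_scores = np.take_along_axis(similarities, top_indices, axis=1)
    return top_scores, top_indices

# Normalize the vectors and treat the matrix as the "index"
rng = np.random.default_rng(0)
corpus_vectors = normalize_rows(rng.normal(size=(100, 8)).astype(np.float32))

# Query with a vector that is already in the index, so the best hit is known
query_vectors = corpus_vectors[[42]]

scores, indices = flat_ip_search(corpus_vectors, query_vectors, top_k=5)
print(indices[0][0], scores[0][0])
```

A flat index compares the query against every stored vector, which is exact but scales linearly with corpus size. FAISS performs this kind of search with optimized kernels and also offers approximate index types that trade a little accuracy for speed on larger collections.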

In [None]:
corpus_embeddings = semantic_engine.corpus_embeddings
test_query_semantic = "good quality product"
query_embedding = embed_query(test_query_semantic)

In [67]:
corpus_embeddings.shape[1]

384

In [76]:
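import faiss
import numpy as np

# FAISSIndex is a custom wrapper class, assumed to be defined in an earlier cell
faiss_index = FAISSIndex(corpus_embeddings)
indices, scores = faiss_index.search(query_embedding, top_k=5)
print(indices[0], scores[0])  # best match: corpus row index and its similarity

# Show the top matches with their similarity scores
for i, (idx, score) in enumerate(zip(indices, scores), 1):
    print(f"\n{i}. Similarity Score: {score:.3f}")
    print(f"   Review: {search_corpus.iloc[idx]['review_text'][:100]}...")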

8843 0.3988079

1. Similarity Score: 0.399
   Review: super recomendo produto original...

2. Similarity Score: 0.371
   Review: timos produtos super recomendo...

3. Similarity Score: 0.371
   Review: timos produtos super recomendo...

4. Similarity Score: 0.371
   Review: timos produtos super recomendo...

5. Similarity Score: 0.371
   Review: timos produtos super recomendo...


# Stage 6: Cross-Encoder Reranker Setup
Cross-encoders provide precise relevance scoring by jointly encoding the query and the document. They are much slower than bi-encoders but more accurate, so they are applied only to the top candidates returned by the earlier stages.
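
The reranking pattern can be sketched independently of any particular model. In the example below, `rerank` and `toy_overlap_score` are hypothetical names, and the word-overlap scorer is only a placeholder so the code runs offline; in practice the scoring function would wrap a cross-encoder model that scores each (query, document) pair, for example through the `CrossEncoder.predict` method in sentence-transformers.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Score each (query, document) pair jointly and keep the best top_n."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def toy_overlap_score(query, doc):
    """Placeholder scorer (hypothetical): share of query words present in doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)

# Candidates as they might arrive from the first-stage retrievers
candidates = [
    "produto lindo",
    "entrega rapida",
    "entrega muito rapida e produto bom",
]

top = rerank("entrega rapida", candidates, toy_overlap_score, top_n=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

Because the scorer sees the query and the document together, its cost grows with the number of candidates, which is why it runs on a short list rather than the full corpus.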