

# 🔍 Hybrid Search Engine on Olist E-commerce Reviews

This notebook demonstrates a **hybrid search system** built on the **Olist e-commerce reviews dataset**. Rather than stopping at **sentiment analysis** or **exploratory data analysis (EDA)**, it builds a working **search engine** that combines several retrieval techniques.

---

## **Architecture Overview**

Our search system combines three complementary approaches to provide **fast and accurate results**:

1. **BM25 (Lexical Search)**
   - 🔹 Fast keyword-based retrieval
   - 🔹 Excellent for exact matches and surface-level search

2. **BERT Embeddings (Semantic Search)**
   - 🔹 Understands meaning beyond keywords
   - 🔹 Captures context and semantic similarity between queries and reviews

3. **Cross-Encoder Reranking**
   - 🔹 Precision scoring for top results
   - 🔹 Reranks candidate documents from BM25 and BERT for maximum relevance

---

## **Why Hybrid Search?**

Modern search systems, including those run by companies such as **Amazon** and **Google**, commonly use a hybrid, multi-stage approach to **balance speed and accuracy**:

- 🔹 **BM25** for quick filtering  
- 🔹 **BERT embeddings** for semantic understanding  
- 🔹 **Cross-encoder** for precise ranking  

By combining these, we ensure that our search engine returns **both relevant and contextually meaningful results**.
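As a rough illustration of how the three components could fit together, the sketch below takes candidates from a lexical and a semantic retriever, merges them, and reranks the union. The function name `hybrid_search`, its parameters, and the toy stand-in retrievers are all hypothetical; the notebook's own engines may combine results differently.

```python
from typing import Callable, Dict, List


def hybrid_search(
    query: str,
    lexical_search: Callable[[str, int], List[int]],
    semantic_search: Callable[[str, int], List[int]],
    rerank_score: Callable[[str, int], float],
    candidates_per_stage: int = 50,
    top_n: int = 5,
) -> List[int]:
    """Retrieve candidates from both retrievers, then rerank the union."""
    # Stage 1: each retriever returns document ids, best first.
    lexical_ids = lexical_search(query, candidates_per_stage)
    semantic_ids = semantic_search(query, candidates_per_stage)

    # Merge candidates, dropping duplicates while keeping first-seen order.
    candidates = list(dict.fromkeys(lexical_ids + semantic_ids))

    # Stage 2: score every candidate with the slower, more precise reranker.
    scored = [(rerank_score(query, doc_id), doc_id) for doc_id in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_n]]


# Toy stand-ins for the three components (hypothetical, for illustration only).
toy_relevance: Dict[int, float] = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.7}
result = hybrid_search(
    "entrega rapida",
    lexical_search=lambda q, k: [0, 1][:k],
    semantic_search=lambda q, k: [1, 2, 3][:k],
    rerank_score=lambda q, doc_id: toy_relevance[doc_id],
    top_n=3,
)
print(result)
```

Only the merged candidate set is passed to the reranker, which is what keeps the expensive scoring step limited to a small number of documents.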

---

Let's dive in and **build a search engine that feels like the real deal!**




# Goal: given a user query such as "fast delivery" or "poor quality product", retrieve the most relevant reviews from thousands of customer reviews using a two-stage system (candidate retrieval, then reranking)

# Stage 0: Environment Setup

In [1]:
import sys
import subprocess
import warnings
import os

# Redirect all stderr to devnull temporarily
import io
from contextlib import redirect_stderr, redirect_stdout

# Capture and suppress ALL output during installation
with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):

    def pip_quiet(*args):
        """Run pip for the current interpreter, discarding its output."""
        subprocess.run([sys.executable, "-m", "pip", *args],
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

    # Fix protobuf: pin it to version 3.20.3
    pip_quiet("uninstall", "-y", "protobuf")
    pip_quiet("install", "-q", "protobuf==3.20.3")

    # Install packages
    for package in ("sentence-transformers", "rank-bm25", "faiss-cpu"):
        pip_quiet("install", "-q", package)

    # Suppress warnings
    warnings.filterwarnings('ignore')
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'   # hide TensorFlow C++ log messages
    os.environ['CUDA_VISIBLE_DEVICES'] = ''    # hide GPUs so everything runs on CPU

    # Import TensorFlow first to trigger errors in suppressed context
    import tensorflow as tf

    import logging
    logging.getLogger('tensorflow').setLevel(logging.ERROR)
    logging.getLogger('transformers').setLevel(logging.ERROR)


# Now import everything normally (errors already suppressed)
import pandas as pd
import numpy as np
import re
from typing import List, Tuple

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import faiss
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output
import time
from collections import defaultdict

print("Setup complete - all libraries loaded successfully!")

Setup complete - all libraries loaded successfully!


# Stage 1: Data Loading and Exploration

We load the Olist reviews dataset and perform initial exploration. The main dataset contains customer reviews with text, ratings, and metadata.

In [2]:
# Load the reviews dataset
reviews_df = pd.read_csv('/content/olist_order_reviews_dataset.csv')
orders_df = pd.read_csv('/content/olist_orders_dataset.csv')
order_items_df = pd.read_csv('/content/olist_order_items_dataset.csv')
products_df = pd.read_csv('/content/olist_products_dataset.csv')
translation_df = pd.read_csv('/content/product_category_name_translation.csv')


print("Dataset loaded successfully!")
print(f"\nReviews shape: {reviews_df.shape}")
print(f"Orders shape: {orders_df.shape}")
print(f"Order items shape: {order_items_df.shape}")
print(f"Products shape: {products_df.shape}")

Dataset loaded successfully!

Reviews shape: (99224, 7)
Orders shape: (99441, 8)
Order items shape: (112650, 7)
Products shape: (32951, 9)


In [3]:

# Display first few rows
print("\nSample Reviews:")
reviews_df.head()



Sample Reviews:


Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18 00:00:00,2018-01-18 21:46:59
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10 00:00:00,2018-03-11 03:05:13
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17 00:00:00,2018-02-18 14:36:24
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53


# Initial Data Inspection


Checking data quality, missing values, and review text distribution.

In [4]:
# Check missing values in reviews
print("Missing values in reviews dataset:")
print(reviews_df.isnull().sum())

print("\n" + "="*50)
print("Review Score Distribution:")
print(reviews_df['review_score'].value_counts().sort_index())

print("\n" + "="*50)
# Check non-null reviews
non_null_reviews = reviews_df['review_comment_message'].notna().sum()
print(f"Reviews with text: {non_null_reviews} out of {len(reviews_df)} ({non_null_reviews/len(reviews_df)*100:.2f}%)")

Missing values in reviews dataset:
review_id                      0
order_id                       0
review_score                   0
review_comment_title       87656
review_comment_message     58247
review_creation_date           0
review_answer_timestamp        0
dtype: int64

Review Score Distribution:
review_score
1    11424
2     3151
3     8179
4    19142
5    57328
Name: count, dtype: int64

Reviews with text: 40977 out of 99224 (41.30%)



# Stage 2: Data Preprocessing and Enrichment

We clean the review text, combine title and message, and enrich reviews with product information. This creates a comprehensive search corpus.

In [5]:
def clean_text(text):
    """Clean and normalize text data"""
    if pd.isna(text):
        return ""

    # Convert to string and lowercase
    text = str(text).lower()

    # Remove special characters but keep spaces and basic punctuation.
    # Note: this ASCII-only class also replaces accented letters with spaces,
    # so "parabéns" becomes "parab ns" in the cleaned text.
    text = re.sub(r'[^a-zA-Z0-9\s\.\,\!\?]', ' ', text)

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Create a copy for processing
df = reviews_df.copy()

# Clean review title and message
df['review_title_clean'] = df['review_comment_title'].apply(clean_text)
df['review_message_clean'] = df['review_comment_message'].apply(clean_text)


# Combine title and message
df['review_text'] = df['review_title_clean'] + ' ' + df['review_message_clean']
df['review_text'] = df['review_text'].str.strip()

# Keep only reviews with at least 10 characters of cleaned text
df = df[df['review_text'].str.len() >= 10].copy()

print(f"After cleaning: {len(df)} reviews remaining")
print(
    f"Average review length: {df['review_text'].str.len().mean():.0f} characters"
    )


After cleaning: 38968 reviews remaining
Average review length: 74 characters
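Because the reviews are in Portuguese, a cleaning step that preserves accented letters may be preferable to an ASCII-only character class. The sketch below is one possible alternative, not the cleaning used for the outputs in this notebook; the function names `clean_text_keep_accents` and `fold_accents` are hypothetical, and the missing-value handling of `clean_text` is omitted for brevity.

```python
import re
import unicodedata


def clean_text_keep_accents(text: str) -> str:
    """Lowercase and keep Unicode letters, digits and basic punctuation."""
    # Compose characters first so an accented letter is a single code point.
    text = unicodedata.normalize("NFC", str(text)).lower()
    # In Python 3, \w matches Unicode letters and digits, so accents survive.
    text = re.sub(r"[^\w\s.,!?]", " ", text)
    # \w also matches the underscore; drop it to mirror the original cleaner.
    text = text.replace("_", " ")
    return re.sub(r"\s+", " ", text).strip()


def fold_accents(text: str) -> str:
    """Map accented letters to their base letters, e.g. 'rápida' -> 'rapida'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


cleaned = clean_text_keep_accents("Parabéns! Entrega super rápida.")
print(cleaned)
print(fold_accents(cleaned))
```

Applying `fold_accents` to both documents and queries would let a query typed without accents match accented review text, at the cost of merging words that differ only by an accent.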


In [6]:
# Enrich reviews with product information.
# Merge with order items to get product_id. An order with several items
# yields one row per item, so the row count grows after this merge.
df = df.merge(order_items_df[['order_id', 'product_id']], on='order_id', how='left')

df

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,review_title_clean,review_message_clean,review_text,product_id
0,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06,,recebi bem antes do prazo estipulado.,recebi bem antes do prazo estipulado.,52c80cedd4e90108bf4fa6a206ef6b03
1,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53,,parab ns lojas lannister adorei comprar pela i...,parab ns lojas lannister adorei comprar pela i...,3880d25d502b15b1de6fddc42ad1d67a
2,8670d52e15e00043ae7de4c01cc2fe06,b9bf720beb4ab3728760088589c62129,4,recomendo,aparelho eficiente. no site a marca do aparelh...,2018-05-22 00:00:00,2018-05-23 16:45:47,recomendo,aparelho eficiente. no site a marca do aparelh...,recomendo aparelho eficiente. no site a marca ...,61a4100ccd6d9c4c808a1fd954ddb8ad
3,4b49719c8a200003f700d3d986ea1a19,9d6f15f95d01e79bd1349cc208361f09,4,,"Mas um pouco ,travando...pelo valor ta Boa.\r\n",2018-02-16 00:00:00,2018-02-20 10:52:22,,"mas um pouco ,travando...pelo valor ta boa.","mas um pouco ,travando...pelo valor ta boa.",acffe5d7cd56e6b564cf6841486644ff
4,3948b09f7c818e2d86c9a546758b2335,e51478e7e277a83743b6f9991dbfa3fb,5,Super recomendo,"Vendedor confiável, produto ok e entrega antes...",2018-05-23 00:00:00,2018-05-24 03:00:01,super recomendo,"vendedor confi vel, produto ok e entrega antes...","super recomendo vendedor confi vel, produto ok...",6871a3c157d6f51697e887f3c3598479
...,...,...,...,...,...,...,...,...,...,...,...
45945,98fffa80dc9acbde7388bef1600f3b15,d398e9c82363c12527f71801bf0e6100,4,,para este produto recebi de acordo com a compr...,2017-11-29 00:00:00,2017-11-30 15:52:51,,para este produto recebi de acordo com a compr...,para este produto recebi de acordo com a compr...,0bdd9a52d192d64770e33ae559512478
45946,df5fae90e85354241d5d64a8955b2b09,509b86c65fe4e2ad5b96408cfef9755e,5,,Entregou dentro do prazo. O produto chegou em ...,2018-02-07 00:00:00,2018-02-19 19:47:23,,entregou dentro do prazo. o produto chegou em ...,entregou dentro do prazo. o produto chegou em ...,b5e2d4b3fa2f9ecad4c1df951afcc8ce
45947,a709d176f59bc3af77f4149c96bae357,d5cb12269711bd1eaf7eed8fd32a7c95,3,,"O produto não foi enviado com NF, não existe v...",2018-05-19 00:00:00,2018-05-20 21:51:06,,"o produto n o foi enviado com nf, n o existe v...","o produto n o foi enviado com nf, n o existe v...",79da264732f717f10ebf5d102aa6c32a
45948,b3de70c89b1510c4cd3d0649fd302472,55d4004744368f5571d1f590031933e4,5,,"Excelente mochila, entrega super rápida. Super...",2018-03-22 00:00:00,2018-03-23 09:10:43,,"excelente mochila, entrega super r pida. super...","excelente mochila, entrega super r pida. super...",8ba0118a487ec8671aed57e5ef846574


In [10]:
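# Note: this repeats the merge with order items from the previous cell.
# Merging again on order_id multiplies rows for multi-item orders and leaves
# suffixed duplicate columns (product_id_x, product_id_y), cleaned up below.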
df = df.merge(order_items_df[['order_id', 'product_id']], on='order_id', how='left')
df

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,review_title_clean,review_message_clean,review_text,product_id_x,product_id_y,product_id
0,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06,,recebi bem antes do prazo estipulado.,recebi bem antes do prazo estipulado.,52c80cedd4e90108bf4fa6a206ef6b03,52c80cedd4e90108bf4fa6a206ef6b03,52c80cedd4e90108bf4fa6a206ef6b03
1,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53,,parab ns lojas lannister adorei comprar pela i...,parab ns lojas lannister adorei comprar pela i...,3880d25d502b15b1de6fddc42ad1d67a,3880d25d502b15b1de6fddc42ad1d67a,3880d25d502b15b1de6fddc42ad1d67a
2,8670d52e15e00043ae7de4c01cc2fe06,b9bf720beb4ab3728760088589c62129,4,recomendo,aparelho eficiente. no site a marca do aparelh...,2018-05-22 00:00:00,2018-05-23 16:45:47,recomendo,aparelho eficiente. no site a marca do aparelh...,recomendo aparelho eficiente. no site a marca ...,61a4100ccd6d9c4c808a1fd954ddb8ad,61a4100ccd6d9c4c808a1fd954ddb8ad,61a4100ccd6d9c4c808a1fd954ddb8ad
3,4b49719c8a200003f700d3d986ea1a19,9d6f15f95d01e79bd1349cc208361f09,4,,"Mas um pouco ,travando...pelo valor ta Boa.\r\n",2018-02-16 00:00:00,2018-02-20 10:52:22,,"mas um pouco ,travando...pelo valor ta boa.","mas um pouco ,travando...pelo valor ta boa.",acffe5d7cd56e6b564cf6841486644ff,acffe5d7cd56e6b564cf6841486644ff,acffe5d7cd56e6b564cf6841486644ff
4,3948b09f7c818e2d86c9a546758b2335,e51478e7e277a83743b6f9991dbfa3fb,5,Super recomendo,"Vendedor confiável, produto ok e entrega antes...",2018-05-23 00:00:00,2018-05-24 03:00:01,super recomendo,"vendedor confi vel, produto ok e entrega antes...","super recomendo vendedor confi vel, produto ok...",6871a3c157d6f51697e887f3c3598479,6871a3c157d6f51697e887f3c3598479,6871a3c157d6f51697e887f3c3598479
...,...,...,...,...,...,...,...,...,...,...,...,...,...
167085,98fffa80dc9acbde7388bef1600f3b15,d398e9c82363c12527f71801bf0e6100,4,,para este produto recebi de acordo com a compr...,2017-11-29 00:00:00,2017-11-30 15:52:51,,para este produto recebi de acordo com a compr...,para este produto recebi de acordo com a compr...,0bdd9a52d192d64770e33ae559512478,0bdd9a52d192d64770e33ae559512478,0bdd9a52d192d64770e33ae559512478
167086,df5fae90e85354241d5d64a8955b2b09,509b86c65fe4e2ad5b96408cfef9755e,5,,Entregou dentro do prazo. O produto chegou em ...,2018-02-07 00:00:00,2018-02-19 19:47:23,,entregou dentro do prazo. o produto chegou em ...,entregou dentro do prazo. o produto chegou em ...,b5e2d4b3fa2f9ecad4c1df951afcc8ce,b5e2d4b3fa2f9ecad4c1df951afcc8ce,b5e2d4b3fa2f9ecad4c1df951afcc8ce
167087,a709d176f59bc3af77f4149c96bae357,d5cb12269711bd1eaf7eed8fd32a7c95,3,,"O produto não foi enviado com NF, não existe v...",2018-05-19 00:00:00,2018-05-20 21:51:06,,"o produto n o foi enviado com nf, n o existe v...","o produto n o foi enviado com nf, n o existe v...",79da264732f717f10ebf5d102aa6c32a,79da264732f717f10ebf5d102aa6c32a,79da264732f717f10ebf5d102aa6c32a
167088,b3de70c89b1510c4cd3d0649fd302472,55d4004744368f5571d1f590031933e4,5,,"Excelente mochila, entrega super rápida. Super...",2018-03-22 00:00:00,2018-03-23 09:10:43,,"excelente mochila, entrega super r pida. super...","excelente mochila, entrega super r pida. super...",8ba0118a487ec8671aed57e5ef846574,8ba0118a487ec8671aed57e5ef846574,8ba0118a487ec8671aed57e5ef846574


In [11]:
# 1. List every product_id column left behind by the repeated merges
print(df.filter(like='product_id').columns.tolist())

# 2. Keep one copy and drop the other suffixed duplicates.
#    This removes duplicate columns only; the duplicated rows remain.
cols_to_keep = ['product_id_x']
cols_to_drop = [col for col in df.columns if col.startswith('product_id') and col not in cols_to_keep]

df = df.drop(columns=cols_to_drop)

# 3. Rename it back to product_id
df = df.rename(columns={'product_id_x': 'product_id'})

['product_id_x', 'product_id_y', 'product_id']
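One way to avoid both the suffixed columns and the extra rows is to reduce the order-items table to one row per order before merging, and to run the merge only once. The sketch below uses small hypothetical tables rather than the Olist files, and keeping the first product per order is an arbitrary choice made for illustration.

```python
import pandas as pd

# Toy stand-ins for the review and order-item tables (hypothetical data).
reviews = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "review_text": ["entrega rapida", "produto bom"],
})
order_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "product_id": ["p1", "p1", "p2"],
})

# Drop repeated (order_id, product_id) pairs before merging.
items_unique = order_items[["order_id", "product_id"]].drop_duplicates()

# Keep one product per order so each review stays a single row.
# Picking the first product is an arbitrary choice for this sketch.
first_item = items_unique.drop_duplicates(subset="order_id", keep="first")

# validate="many_to_one" raises an error if order_id repeats on the right side.
enriched = reviews.merge(first_item, on="order_id", how="left",
                         validate="many_to_one")
print(enriched.shape)
```

The `validate` argument makes an accidental row explosion fail loudly instead of silently growing the dataset.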


In [12]:
df['product_id'].value_counts()

product_id,count
05b515fdc76e888aada3c6d66c201dff,4410
270516a3f41dc035aa87d220228f844c,4410
44a5d24dd383324a421569ca697b13c2,2744
a62e25e09e05e6faf31d90c6ec1aa3d1,2349
d34c07a2d817ac73f4caf8c574215fed,2323
...,...
595e2ffe6a4ab5bc42d5e6cae20f2e10,1
a24e985fc6dbd5f0e50b2a013fd72ed4,1
8e80624d5c2d41d70df307dd52fef6c4,1
8e4359ade56e8a653deb63400dcd994c,1


In [13]:
# Merge with products to get category
df = df.merge(products_df[['product_id', 'product_category_name']],
              on='product_id', how='left')


In [14]:
df

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,review_title_clean,review_message_clean,review_text,product_id,product_category_name
0,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06,,recebi bem antes do prazo estipulado.,recebi bem antes do prazo estipulado.,52c80cedd4e90108bf4fa6a206ef6b03,ferramentas_jardim
1,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53,,parab ns lojas lannister adorei comprar pela i...,parab ns lojas lannister adorei comprar pela i...,3880d25d502b15b1de6fddc42ad1d67a,esporte_lazer
2,8670d52e15e00043ae7de4c01cc2fe06,b9bf720beb4ab3728760088589c62129,4,recomendo,aparelho eficiente. no site a marca do aparelh...,2018-05-22 00:00:00,2018-05-23 16:45:47,recomendo,aparelho eficiente. no site a marca do aparelh...,recomendo aparelho eficiente. no site a marca ...,61a4100ccd6d9c4c808a1fd954ddb8ad,eletroportateis
3,4b49719c8a200003f700d3d986ea1a19,9d6f15f95d01e79bd1349cc208361f09,4,,"Mas um pouco ,travando...pelo valor ta Boa.\r\n",2018-02-16 00:00:00,2018-02-20 10:52:22,,"mas um pouco ,travando...pelo valor ta boa.","mas um pouco ,travando...pelo valor ta boa.",acffe5d7cd56e6b564cf6841486644ff,beleza_saude
4,3948b09f7c818e2d86c9a546758b2335,e51478e7e277a83743b6f9991dbfa3fb,5,Super recomendo,"Vendedor confiável, produto ok e entrega antes...",2018-05-23 00:00:00,2018-05-24 03:00:01,super recomendo,"vendedor confi vel, produto ok e entrega antes...","super recomendo vendedor confi vel, produto ok...",6871a3c157d6f51697e887f3c3598479,informatica_acessorios
...,...,...,...,...,...,...,...,...,...,...,...,...
167085,98fffa80dc9acbde7388bef1600f3b15,d398e9c82363c12527f71801bf0e6100,4,,para este produto recebi de acordo com a compr...,2017-11-29 00:00:00,2017-11-30 15:52:51,,para este produto recebi de acordo com a compr...,para este produto recebi de acordo com a compr...,0bdd9a52d192d64770e33ae559512478,relogios_presentes
167086,df5fae90e85354241d5d64a8955b2b09,509b86c65fe4e2ad5b96408cfef9755e,5,,Entregou dentro do prazo. O produto chegou em ...,2018-02-07 00:00:00,2018-02-19 19:47:23,,entregou dentro do prazo. o produto chegou em ...,entregou dentro do prazo. o produto chegou em ...,b5e2d4b3fa2f9ecad4c1df951afcc8ce,esporte_lazer
167087,a709d176f59bc3af77f4149c96bae357,d5cb12269711bd1eaf7eed8fd32a7c95,3,,"O produto não foi enviado com NF, não existe v...",2018-05-19 00:00:00,2018-05-20 21:51:06,,"o produto n o foi enviado com nf, n o existe v...","o produto n o foi enviado com nf, n o existe v...",79da264732f717f10ebf5d102aa6c32a,telefonia
167088,b3de70c89b1510c4cd3d0649fd302472,55d4004744368f5571d1f590031933e4,5,,"Excelente mochila, entrega super rápida. Super...",2018-03-22 00:00:00,2018-03-23 09:10:43,,"excelente mochila, entrega super r pida. super...","excelente mochila, entrega super r pida. super...",8ba0118a487ec8671aed57e5ef846574,papelaria


In [16]:
# Merge with translation for English category names
df = df.merge(translation_df, on='product_category_name', how='left')


In [17]:
# Fill missing category with 'unknown'
df['product_category_name_english'] = df['product_category_name_english'].fillna('unknown')


In [18]:

print("Data enrichment complete!")
print(f"\nFinal dataset shape: {df.shape}")
print(f"\nCategories available: {df['product_category_name_english'].nunique()}")

Data enrichment complete!

Final dataset shape: (167090, 13)

Categories available: 72


# Sample Enriched Data

Let's inspect a few enriched reviews with product information.

In [19]:
# Display sample enriched data
sample_cols = ['review_text', 'review_score', 'product_category_name_english']
print("Sample enriched reviews:\n")
df[sample_cols].head(10)

Sample enriched reviews:



Unnamed: 0,review_text,review_score,product_category_name_english
0,recebi bem antes do prazo estipulado.,5,garden_tools
1,parab ns lojas lannister adorei comprar pela i...,5,sports_leisure
2,recomendo aparelho eficiente. no site a marca ...,4,small_appliances
3,"mas um pouco ,travando...pelo valor ta boa.",4,health_beauty
4,"super recomendo vendedor confi vel, produto ok...",5,computers_accessories
5,"gostaria de saber o que houve, sempre recebi e...",2,cool_stuff
6,n o chegou meu produto p ssimo,1,luggage_accessories
7,timo loja nota 10,5,health_beauty
8,obrigado pela aten ao amim dispensada,5,auto
9,a compra foi realizada facilmente. a entrega f...,5,electronics


# Stage 3: Building the Search Corpus

We prepare the final search corpus by creating a clean dataset with all necessary fields. We'll also create a subset for faster prototyping if needed.

In [20]:
from SearchCorpus import SearchCorpusBuilder

builder = SearchCorpusBuilder(df)
corpus = builder.build_corpus()

Using a sample of 30000 reviews for the search system

Final corpus size: 30000 documents
Average text length: 85 characters
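The `SearchCorpus` module imported above is not shown in this notebook, so the following is only a guess at what such a builder might do: select the fields needed for search, keep one row per review, and sample down to a fixed size. The class name `SearchCorpusBuilderSketch`, the `sample_size` and `random_state` parameters, and the deduplication by `review_id` are all assumptions; the column list is taken from the corpus displayed in this notebook.

```python
import pandas as pd


class SearchCorpusBuilderSketch:
    """Hypothetical stand-in for the notebook's SearchCorpusBuilder."""

    COLUMNS = ["review_id", "review_text", "review_score",
               "product_category_name_english", "order_id"]

    def __init__(self, dataframe: pd.DataFrame, sample_size: int = 30000,
                 random_state: int = 42):
        self.dataframe = dataframe
        self.sample_size = sample_size
        self.random_state = random_state
        self.corpus = None

    def build_corpus(self) -> pd.DataFrame:
        # Keep only the fields the search system needs.
        corpus = self.dataframe[self.COLUMNS]
        # Assumption: one document per review, so drop repeated review_ids.
        corpus = corpus.drop_duplicates(subset="review_id")
        # Sample down for faster prototyping when the corpus is larger.
        if len(corpus) > self.sample_size:
            corpus = corpus.sample(n=self.sample_size,
                                   random_state=self.random_state)
        self.corpus = corpus.reset_index(drop=True)
        return self.corpus


toy = pd.DataFrame({
    "review_id": ["r1", "r1", "r2", "r3"],
    "review_text": ["entrega rapida", "entrega rapida", "produto bom", "otimo"],
    "review_score": [5, 5, 4, 5],
    "product_category_name_english": ["auto", "auto", "toys", "unknown"],
    "order_id": ["o1", "o1", "o2", "o3"],
    "extra_column": [0, 1, 2, 3],
})
builder_sketch = SearchCorpusBuilderSketch(toy, sample_size=2)
toy_corpus = builder_sketch.build_corpus()
print(toy_corpus.shape)
```

Fixing `random_state` keeps the sampled corpus the same across runs, which matters because document positions are later used as index ids.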



In [21]:
search_corpus=builder.corpus
search_corpus

Unnamed: 0,review_id,review_text,review_score,product_category_name_english,order_id
0,03129dea7c12fa5878b2e629ccdf2ce6,comprei 14 unidades e recebi somente 9. ainda ...,1,telephony,9bdc4d4c71aa1de4606060929dee888c
1,c5b3db66a5535c6eaba777c380dac7cf,"parab ns,chegou muito bem embalado...recomendo.",5,housewares,3158707358b447fc894a497446d1acbe
2,05d3e5fc1e9c0bc96049d5c8373a2bf8,recebi produto errado. necessito urgente que s...,2,health_beauty,ad5d83949a78b7f4a88b59f991bce882
3,8c411799802145765084ae272c6977e4,nao recebi todo material veio somente uma past...,1,sports_leisure,df13cbaf3230c62a4b582317936e8a39
4,bcab607c602aa2ebf22f6889eb896823,super recomendo fiquei muito satisfeita com o ...,5,construction_tools_lights,715b8576b74d53796bcbb107a201eb55
...,...,...,...,...,...
29995,24feefc49d5adfaeaf2d4c24adff98c5,"entrega antes do prazo, produto de acordo com ...",5,furniture_decor,5ea599a9a67fae7fd7120b2b9add8a60
29996,f83706652e0db3c57c07a8ec0807f41f,recomendado recebemos o produto.,5,watches_gifts,8b249044e8ce6377509b32dc840bfe76
29997,2f2b7e1caca93d34d13a39d53f8a6e60,otimo primeira vez que comprei lente de contat...,5,health_beauty,b149acf7e57747cd966a1f1a11e98e05
29998,6cc6312c741405ce7052899d16ddf6ad,paguei por 6 cadeiras e entregaram apenas 5,1,office_furniture,9aec4e1ae90b23c7bf2d2b3bfafbd943


# Stage 4: BM25 Index Creation (Lexical Search)

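BM25 is a probabilistic ranking function that scores each document by how often the query terms occur in it, giving more weight to terms that are rare across the corpus and normalizing for document length. It is fast and effective for keyword matching.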
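To make the scoring concrete, here is a minimal pure-Python sketch of Okapi BM25. It is not the `rank_bm25` implementation used in this notebook, and library implementations may differ in details such as IDF smoothing; the function name `bm25_scores` and the default values `k1 = 1.5` and `b = 0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query_tokens: List[str], corpus_tokens: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # terms absent from the document add nothing
            # Rare terms get a larger weight (smoothed so it stays positive).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["entrega rapida", "produto bom mas entrega demorada", "qualidade otima"]
corpus_tokens = [doc.lower().split() for doc in docs]
pt_scores = bm25_scores("entrega rapida".split(), corpus_tokens)
en_scores = bm25_scores("fast delivery".split(), corpus_tokens)
print(pt_scores)
print(en_scores)
```

Since a document only earns score for query tokens it actually contains, an English query against Portuguese text scores zero for every document in this toy corpus.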

**Tokenizer**

In [22]:
def tokenize(text):
    """Simple whitespace tokenizer"""
    return text.lower().split()


In [23]:
from BM25IndexCreation import BM25SearchEngine

bm25_engine = BM25SearchEngine(search_corpus, tokenize)


Tokenizing documents...
Building BM25 index...
BM25 Index created for 30000 documents!

Sample reviews (checking language):
1. comprei 14 unidades e recebi somente 9. ainda n o foi resolvida a quest o....
2. parab ns,chegou muito bem embalado...recomendo....
3. recebi produto errado. necessito urgente que seja feita a troca. estou sem lente...

Detected: Reviews are in Portuguese
Recommendation: Use Portuguese queries for BM25, or rely more on semantic search


In [24]:
bm25_engine.search("fast delivery", top_n=5)


Unnamed: 0,review_id,review_text,review_score,product_category_name_english,order_id,bm25_score


<pre>
Testing BM25 Search
Quick test to verify BM25 is working correctly.

Important Note: Language Considerations
The Olist dataset contains reviews in Portuguese. This affects each search component differently.

BM25 (Keyword Search):
  - Requires exact word matches
  - English queries get zero scores on Portuguese text
  - Solution: use Portuguese queries for BM25

BERT Semantic Search:
  - Multilingual models can capture cross-language similarity
  - Can work reasonably well even with English queries on Portuguese text
  - Better at capturing meaning across languages
  - Caveat: how well this works depends on the model; all-MiniLM-L6-v2, loaded later in this notebook, is trained mainly on English text

Hybrid System Advantage:
  - Even if BM25 returns zero scores, semantic search can still find relevant results
  - Cross-encoder reranking can work across languages when the reranking model supports them
  - The system remains functional with English queries because of the semantic component

For best results with this Portuguese dataset, use Portuguese queries. The system will still return results for English queries thanks to the semantic search component.
</pre>


In [25]:
# Test BM25 with Portuguese queries (since reviews are in Portuguese)
test_queries_pt = [
    "entrega rapida",  # fast delivery
    "produto bom",     # good product
    "qualidade"        # quality
]

print("Testing BM25 with Portuguese queries:\n")
print("=" * 80)

for query in test_queries_pt:
    print(f"\nQUERY →", query.upper())
    print("—" * 80)

    # Run the search and keep the top 5 matches
    results = bm25_engine.search(query=query, top_n=5)

    # Display the matches as a table, or a message if nothing matched
    if not results.empty:
        display(results[['review_text', 'review_score', 'bm25_score']].round(4))
    else:
        print("   → No results found")

    print("=" * 80)

Testing BM25 with Portuguese queries:


QUERY → ENTREGA RAPIDA
————————————————————————————————————————————————————————————————————————————————


Unnamed: 0,review_text,review_score,bm25_score
4311,entrega rapida,5,13.0038
17927,entrega rapida,3,13.0038
27294,entrega rapida,4,13.0038
12213,entrega rapida,4,13.0038
13000,entrega rapida,4,13.0038



QUERY → PRODUTO BOM
————————————————————————————————————————————————————————————————————————————————


Unnamed: 0,review_text,review_score,bm25_score
20875,bom bom produto,4,6.7081
19550,bom bom produto,4,6.7081
4645,bom bom produto,4,6.7081
22410,bom bom produto,4,6.7081
11693,muito bom produto muito bom,4,6.3014



QUERY → QUALIDADE
————————————————————————————————————————————————————————————————————————————————


Unnamed: 0,review_text,review_score,bm25_score
3343,excelente qualidade,5,5.6982
14211,boa qualidade,5,5.6982
6620,tima qualidade,5,5.6982
24039,qualidade otima.,5,5.6982
1397,produto de boa qualidade produto de boa qualid...,5,5.467




# BERT Semantic Search Setup

We use SentenceTransformers to encode reviews into dense vectors. This captures semantic meaning beyond exact keyword matches.

<pre>
Transformer → Pooling → Sentence Embedding
</pre>
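The pooling step in the diagram above turns one vector per token into a single vector per sentence. A common choice is mean pooling over the non-padding tokens; the sketch below shows that idea with plain Python lists and made-up numbers. The function name `mean_pool` is hypothetical, and whether a given model uses mean pooling should be checked against its configuration.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]],
              attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three token vectors from a hypothetical transformer; the last one is padding.
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)
```

Masking matters because padded positions would otherwise pull the average toward values that carry no meaning for the sentence.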

In [26]:
# Load pre-trained sentence transformer model
print("Loading BERT model for semantic search...")
print("This may take a minute on first run...")

model_name = 'all-MiniLM-L6-v2'  # Fast and efficient model

semantic_model = SentenceTransformer(model_name)

print(f"Model loaded: {model_name}")
print(f"Embedding dimension:{semantic_model.get_sentence_embedding_dimension()}")

Loading BERT model for semantic search...
This may take a minute on first run...


Model loaded: all-MiniLM-L6-v2
Embedding dimension:384


# Creating Document Embeddings
We encode all reviews into vector representations. This is the most time-consuming step but only needs to be done once.

In [27]:
from DocumentEmbedding import SemanticEmbeddingEngine

semantic_engine = SemanticEmbeddingEngine(
    dataframe=search_corpus,
    text_column="review_text",
    model_name=model_name,
    batch_size=32
)



Loading semantic model: all-MiniLM-L6-v2 ...
Encoding documents into embeddings...

Embeddings created!
Shape: (30000, 384)


In [28]:
def embed_query(query):
    """Encode a query string into a dense vector with the corpus model."""
    query_emb = semantic_engine.model.encode(query, convert_to_numpy=True)

    # Fail early if the model returned nothing usable
    if query_emb is None or len(query_emb) == 0:
        raise ValueError("Query embedding is empty; model encoding failed")
    return query_emb

In [29]:
results = semantic_engine.search("good quality product", top_n=5)



# Building FAISS Index for Fast Vector Search

<pre>
FAISS (Facebook AI Similarity Search) is a library designed for efficient similarity search on high-dimensional vectors. It is mostly used for embeddings from NLP models, images, or other feature vectors.

Purpose: Quickly find nearest neighbors for a query vector among millions of embeddings.

Why fast: It uses optimized vector indexing, quantization, and sometimes GPU acceleration.

Typical workflow:

1. Normalize or preprocess embeddings.
2. Build an index.
3. Add vectors to the index.
4. Query the index for nearest neighbors.
</pre>
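
The workflow above can be sketched without FAISS using plain NumPy: after L2 normalization the inner product equals cosine similarity, and an exact (flat) index amounts to a brute-force scan over the corpus. The `FAISSIndex` wrapper used in this notebook is not shown, so this illustrates the idea under those assumptions rather than its implementation; the helper names and toy vectors are made up.

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each row to unit length so inner product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def brute_force_search(corpus, query, top_k=2):
    """Return the indices and cosine scores of the top_k nearest corpus rows."""
    corpus_n = l2_normalize(corpus.astype("float32"))
    query_n = l2_normalize(query.astype("float32").reshape(1, -1))
    scores = (corpus_n @ query_n.T).ravel()
    top = np.argsort(scores)[::-1][:top_k]  # highest score first
    return top, scores[top]

corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([2.0, 0.0])
indices, scores = brute_force_search(corpus, query, top_k=2)
print(indices, scores)
```

Here row 0 points in the same direction as the query (cosine 1.0) and row 2 is at 45 degrees (cosine about 0.707). A dedicated index library does the same job but avoids scanning every vector on large corpora.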

In [30]:
corpus_embeddings = semantic_engine.corpus_embeddings
test_query_semantic = "good quality product"
query_embedding = embed_query(test_query_semantic)

In [31]:
corpus_embeddings.shape[1]

384

In [33]:
import faiss
import numpy as np

from FAISSIndex import FAISSIndex
faiss_index = FAISSIndex(corpus_embeddings)
indices, scores = faiss_index.search(query_embedding, top_k=5)
print(indices[0],scores[0])

for i, (idx, score) in enumerate(zip(indices, scores), 1):
    print(f"\n{i}. Similarity Score: {score:.3f}")
    print(f"   Review: {search_corpus.iloc[idx]['review_text'][:100]}...")

16543 0.37111622

1. Similarity Score: 0.371
   Review: timos produtos super recomendo...

2. Similarity Score: 0.371
   Review: timos produtos super recomendo...

3. Similarity Score: 0.371
   Review: timos produtos super recomendo...

4. Similarity Score: 0.371
   Review: timos produtos super recomendo...

5. Similarity Score: 0.371
   Review: timos produtos super recomendo...


# Stage 6: Cross-Encoder Reranker Setup
Cross-encoders score relevance by encoding the query and the document jointly, which is more accurate than comparing independently computed bi-encoder embeddings.

Because this requires one model pass per query-document pair, a cross-encoder is too slow to run over every document, so we use it only to rerank the top candidates from the retrieval stage.
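
A minimal sketch of this retrieve-then-rerank pattern follows. The `rerank` helper and the `toy_score` function are hypothetical; `toy_score` is a token-overlap stand-in for a real cross-encoder, used only so the example runs without downloading a model.

```python
def toy_score(query, doc):
    """Hypothetical stand-in for a cross-encoder: fraction of query tokens found in doc."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)

def rerank(query, candidates, score_fn, top_k=2):
    """Score each (query, candidate) pair jointly and keep the best top_k."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

candidates = ["entrega rapida e produto bom", "produto ruim", "entrega atrasada"]
top = rerank("entrega rapida", candidates, toy_score, top_k=2)
print(top)
```

The cost is one `score_fn` call per candidate, which is why only a shortlist is reranked rather than the full corpus.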

In [34]:
# Load cross-encoder for reranking
print("Loading Cross-Encoder model for reranking...")

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

print("Cross-Encoder loaded successfully!")

Loading Cross-Encoder model for reranking...


Cross-Encoder loaded successfully!



# Stage 7: Hybrid Search Pipeline Implementation
Now we combine everything into a unified search pipeline with three stages:

1. BM25 retrieval
2. BERT semantic retrieval
3. Fusion and reranking

Reciprocal Rank Fusion (RRF) merges the two ranked lists. As an example, suppose BM25 returns:

| Rank | Doc |
| ---- | --- |
| 1    | A   |
| 2    | B   |

and semantic search returns:

| Rank | Doc |
| ---- | --- |
| 1    | B   |
| 2    | C   |

RRF fusion does this:

- Doc A → only from BM25
- Doc B → from BM25 AND semantic → higher fused score
- Doc C → only from semantic

The final ranking becomes:

1. B (appears in both lists → highest fusion score)
2. A (rank 1 in its only list)
3. C (rank 2 in its only list)
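
The example can be checked numerically. This sketch assumes the first table above is the BM25 ranking and the second is the semantic ranking, as the fusion description implies, and uses k=60; the `rrf` helper is an illustrative variant that takes plain lists of document labels.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Sum 1 / (k + rank) over every ranking in which a document appears."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = ["A", "B"]
semantic_ranking = ["B", "C"]
fused = rrf([bm25_ranking, semantic_ranking])
for doc, score in fused:
    print(doc, round(score, 4))
# B 0.0325   (1/62 + 1/61)
# A 0.0164   (1/61)
# C 0.0161   (1/62)
```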

In [35]:
from collections import defaultdict

def reciprocal_rank_fusion(bm25_results, semantic_results, k=60):
    """
    Combine rankings from BM25 and semantic search using Reciprocal Rank Fusion.

    RRF score = sum(1 / (k + rank_i)) for each system i that returned the document.
    Both inputs are lists of (doc_index, score) pairs sorted best first;
    only the rank positions are used, not the raw scores.
    """
    scores = defaultdict(float)

    # Add BM25 contributions
    for rank, (idx, _score) in enumerate(bm25_results, 1):
        scores[idx] += 1 / (k + rank)

    # Add semantic contributions
    for rank, (idx, _score) in enumerate(semantic_results, 1):
        scores[idx] += 1 / (k + rank)

    # Sort by combined score, highest first
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)


In [49]:
def hybrid_search(query, top_k_retrieval=100, top_k_rerank=10) -> list[dict]:
    """
    Complete hybrid search pipeline:
    1. Retrieve candidates using BM25 and BERT
    2. Fuse results using RRF
    3. Rerank top candidates using Cross-Encoder

    Returns a list of result dicts, best match first.
    """
    # Stage 1a: BM25 retrieval
    query_tokens = tokenize(query)
    bm25_scores = bm25_engine.bm25.get_scores(query_tokens)
    bm25_top_idx = np.argsort(bm25_scores)[-top_k_retrieval:][::-1]
    bm25_results = [(idx, bm25_scores[idx]) for idx in bm25_top_idx]

    # Stage 1b: Semantic retrieval (L2-normalize the query, cast to float32 for FAISS)
    query_embedding = semantic_model.encode([query], convert_to_numpy=True)
    query_embedding = query_embedding / np.linalg.norm(query_embedding, axis=1, keepdims=True)
    query_embedding = query_embedding.astype('float32')

    distances, indices = faiss_index.index.search(query_embedding, top_k_retrieval)
    semantic_results = list(zip(indices[0], distances[0]))

    # Stage 2: Fusion
    fused_results = reciprocal_rank_fusion(bm25_results, semantic_results)

    # Get top candidates for reranking
    candidate_indices = [idx for idx, _ in fused_results[:top_k_retrieval]]
    candidate_texts = [search_corpus.iloc[idx]['review_text'] for idx in candidate_indices]

    # Stage 3: Cross-Encoder reranking
    query_doc_pairs = [[query, doc] for doc in candidate_texts]
    rerank_scores = cross_encoder.predict(query_doc_pairs)
    print(rerank_scores)  # Debug output: raw cross-encoder score for each candidate

    # Combine indices with rerank scores and sort best first
    reranked_results = sorted(zip(candidate_indices, rerank_scores),
                              key=lambda x: x[1], reverse=True)

    # Return top-k final results
    final_results = []
    for idx, score in reranked_results[:top_k_rerank]:
        row = search_corpus.iloc[idx]
        final_results.append({
            'review_id': row['review_id'],
            'review_text': row['review_text'],
            'review_score': row['review_score'],
            'category': row['product_category_name_english'],
            'relevance_score': float(score)
        })

    return final_results

print("Hybrid search pipeline ready!")

Hybrid search pipeline ready!


In [51]:
# Test the complete pipeline with Portuguese queries
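# Note: Semantic search can sometimes match across languages where BM25 cannot,
# but all-MiniLM-L6-v2 is trained mainly on English, so Portuguese results are approximate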

test_queries = [
    "entrega rapida",           # fast delivery
    "produto ruim",             # bad product
    "excelente atendimento",    # excellent service
    "produto com defeito"       # defective product
]

print("Testing Hybrid Search System")
print("Note: Using Portuguese queries since reviews are in Portuguese")
print("Semantic search will still capture meaning even with English queries\n")


for query in test_queries:
    print(f"\n{'=' * 70}")
    print(f"Query: '{query}'")
    print('=' * 70)

    results = hybrid_search(query, top_k_rerank=3)

    for i, result in enumerate(results, 1):
        print(f"\nResult {i}:")
        print(f"  Relevance Score: {result['relevance_score']:.4f}")
        print(f"  Rating: {'*' * int(result['review_score'])}")
        print(f"  Category: {result['category']}")
        print(f"  Review: {result['review_text'][:150]}...")

# Also test with English query to show semantic search capability
print(f"\n{'='*70}")
print("Testing with English query (Semantic search handles this)")
print('='*70)
english_query = "fast delivery"
print(f"\nQuery: '{english_query}'")
results = hybrid_search(english_query, top_k_rerank=3)
for i, result in enumerate(results, 1):
    print(f"\nResult {i}:")
    print(f"  Relevance Score: {result['relevance_score']:.4f}")
    print(f"  Rating: {'*' * int(result['review_score'])}")
    print(f"  Review: {result['review_text'][:100]}...")


Testing Hybrid Search System
Note: Using Portuguese queries since reviews are in Portuguese
Semantic search will still capture meaning even with English queries


Query: 'entrega rapida'
[7.406952  7.406952  7.406952  7.406952  7.406952  7.406952  7.406952
 7.7460957 7.7460957 7.7460957 7.7460957 7.7460957 7.7460957 7.7460957
 7.7460957 7.7460957 7.7460957 7.7460957 7.7460957 7.7460957 7.7460957
 7.7460957 7.7460957 7.7460957 7.7460957 7.7460957 7.7460957 7.7460957
 7.7460957 7.7460957 7.7460957 7.7460957 7.7460957 7.546741  7.7460957
 7.7460957 7.837149  7.4498014 7.170081  7.614563  7.3920584 7.390757
 7.4841433 7.390757  7.390757  7.4684167 7.390757  7.390757  7.390757
 7.4684167 7.4684167 7.390757  7.4684167 7.390757  7.4684167 7.390757
 7.4684167 7.4684167 7.4080925 7.4684167 7.4684167 7.4684167 7.4684167
 7.2611136 7.4684167 7.6425014 7.4684167 7.3206167 7.69081   7.105157
 7.195527  6.697219  7.1115236 7.1115236 7.2002687 6.952218  6.952218
 6.952218  6.952218  6.952218  6.42459


# Stage 8: Interactive Search Interface
A simple interactive UI for testing the search system in real time.

In [52]:
# Create interactive search interface
search_input = widgets.Text(
    value='',
    placeholder='Enter your search query...',
    description='Search:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='70%')
)

search_button = widgets.Button(
    description='Search',
    button_style='primary',
    layout=widgets.Layout(width='15%')
)

num_results_slider = widgets.IntSlider(
    value=5,
    min=1,
    max=20,
    step=1,
    description='Results:',
    style={'description_width': 'initial'}
)

output_area = widgets.Output()

def on_search_click(b):
    """Handle search button click"""
    with output_area:
        clear_output()

        query = search_input.value.strip()
        if not query:
            print("Please enter a search query.")
            return

        print(f"Searching for: '{query}'")
        print("Please wait...\n")

        start_time = time.time()
        results = hybrid_search(query, top_k_rerank=num_results_slider.value)
        search_time = time.time() - start_time

        print(f"Found {len(results)} results in {search_time:.2f} seconds")
        print("\n" + "="*80 + "\n")

        for i, result in enumerate(results, 1):
            stars = '★' * int(result['review_score']) + '☆' * (5 - int(result['review_score']))

            print(f"Result {i}")
            print(f"  Relevance: {result['relevance_score']:.4f} | Rating: {stars} | Category: {result['category']}")
            print(f"  Review: {result['review_text']}")
            print("\n" + "-"*80 + "\n")

search_button.on_click(on_search_click)

# Display interface
print("Interactive Search Interface")
print("="*80)
display(widgets.VBox([
    widgets.HBox([search_input, search_button]),
    num_results_slider,
    output_area
]))

Interactive Search Interface



# Stage 9: Performance Analysis
Let's analyze the performance and characteristics of our search system.
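
Timing a whole query hides which stage dominates. A minimal sketch of per-stage timing is shown here; the `time_stages` helper and the `retrieve` and `rerank_stage` functions are hypothetical placeholders, not functions from this notebook.

```python
import time

def time_stages(stages, payload):
    """Run each (name, fn) stage in order and record its elapsed seconds."""
    timings = {}
    for name, fn in stages:
        start = time.perf_counter()
        payload = fn(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings

def retrieve(query):
    # Placeholder retrieval stage: pretend to fetch 100 candidates
    return [f"{query} doc {i}" for i in range(100)]

def rerank_stage(docs):
    # Placeholder reranking stage: keep 10 candidates
    return sorted(docs)[:10]

result, timings = time_stages(
    [("retrieve", retrieve), ("rerank", rerank_stage)], "entrega"
)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.3f} ms")
```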

In [53]:
# Analyze search performance with multiple queries
analysis_queries = [
    "entrega rápida",
    "boa qualidade",
    "má experiência",
    "altamente recomendado",
    "desperdício de dinheiro"
]


print("Performance Analysis")
print("="*80)

total_time = 0
for query in analysis_queries:
    start = time.time()
    results = hybrid_search(query, top_k_rerank=10)
    elapsed = time.time() - start
    total_time += elapsed

    avg_score = np.mean([r['relevance_score'] for r in results])

    print(f"\nQuery: '{query}'")
    print(f"  Time: {elapsed:.3f}s | Avg Relevance: {avg_score:.4f}")

print(f"\n{'='*80}")
print(f"Average search time: {total_time / len(analysis_queries):.3f} seconds")

Performance Analysis
[ 7.4069519e+00  7.4069519e+00  7.4069519e+00  7.4069519e+00
  7.4069519e+00  7.4069519e+00  7.4069519e+00  7.7460957e+00
  7.7460957e+00  7.7460957e+00  7.7460957e+00  7.7460957e+00
  7.7460957e+00  7.7460957e+00  7.7460957e+00  7.7460957e+00
  7.7460957e+00  7.7460957e+00  7.7460957e+00  7.7460957e+00
  6.9522181e+00  7.7460957e+00  6.9522181e+00  7.5467410e+00
  7.7460957e+00  7.7460957e+00  7.1115236e+00  6.5449409e+00
  7.7460957e+00  6.9522181e+00 -1.3989460e-01  6.9522181e+00
  1.4389207e+00  1.4389207e+00  3.8413890e-02  7.4132673e-02
  1.2002015e+00  7.4132673e-02  7.4132673e-02  1.2002015e+00
  7.7460957e+00  7.4132673e-02  1.2002015e+00  7.7460957e+00
  7.4132673e-02  1.7137241e+00  7.7460957e+00  2.1285021e+00
  1.6384573e+00  7.7460957e+00  2.7932112e+00  7.7460957e+00
  8.9978373e-01  1.9017580e+00  1.0672872e+00  7.7460957e+00
  8.9830875e-01  7.7460957e+00  1.7137241e+00  2.7932112e+00
  2.9179764e+00  1.9017580e+00  7.7460957e+00  2.8415940e+00
  7

# Stage 10: System Summary and Conclusions


<pre>

What We Built
This notebook demonstrates a production-grade hybrid search system that combines:

1. Lexical Search (BM25)
   - Fast keyword-based retrieval
   - Excellent for exact term matching
   - Low computational cost

2. Semantic Search (BERT Embeddings)
   - Understands meaning beyond keywords
   - Captures conceptual similarity
   - Handles synonyms and paraphrases

3. Precision Reranking (Cross-Encoder)
   - Accurate relevance scoring
   - Joint query-document encoding
   - Final quality assurance layer

Key Achievements
   - Indexed and searchable corpus of product reviews
   - Sub-second search response times
   - Hybrid approach captures both keywords and semantics
   - Reranking ensures highest quality results
   - Interactive interface for real-time testing

Real-World Applications
This architecture is used in:
   - E-commerce product search (Amazon, eBay)
   - Document retrieval systems
   - Customer support chatbots
   - Content recommendation engines

Potential Improvements
   - Add filtering by category, rating, date
   - Implement query expansion and spell correction
   - Add personalization based on user history
   - Deploy as a REST API service
   - Scale to millions of documents using distributed systems

Performance Characteristics
   - Retrieval stage: Fast (100ms range)
   - Reranking stage: Moderate (depends on candidate size)
   - Overall: Suitable for real-time applications
   - Accuracy: High precision through multi-stage pipeline
</pre>

In [54]:
# Final statistics
print("System Statistics")
print("="*80)
print(f"Total documents indexed: {len(search_corpus):,}")
print(f"Embedding dimension: {corpus_embeddings.shape[1]}")
print(f"Average document length: {search_corpus['review_text'].str.split().apply(len).mean():.0f} words")
print(f"\nModels used:")
print(f"  - Semantic encoder: {model_name}")
print(f"  - Cross-encoder: cross-encoder/ms-marco-MiniLM-L-6-v2")
print(f"  - BM25: Okapi BM25")
print("\nSearch system ready for queries!")

System Statistics
Total documents indexed: 30,000
Embedding dimension: 384
Average document length: 16 words

Models used:
  - Semantic encoder: all-MiniLM-L6-v2
  - Cross-encoder: cross-encoder/ms-marco-MiniLM-L-6-v2
  - BM25: Okapi BM25

Search system ready for queries!
