<a href="https://colab.research.google.com/github/KJOELJOYSON2427/Fine_Tunning_Bert_Query_Aware_Retrieval_-_Reranking_Olist_Dataset_using_Pytorch/blob/main/Fine_Tunning_Bert_Query_Aware_Retrieval_%2B_Reranking_Olist_Dataset_using_Pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# 🔍 Hybrid Search Engine on Olist E-commerce Reviews

This notebook demonstrates a **hybrid search system** on the **Olist e-commerce reviews dataset**, following the retrieve-then-rerank design used in production search. Rather than stopping at **sentiment analysis** or **exploratory data analysis (EDA)**, it builds a working search engine from the techniques described below.

---

## **Architecture Overview**

Our search system combines three complementary approaches to provide **fast and accurate results**:

1. **BM25 (Lexical Search)**
   - 🔹 Fast keyword-based retrieval
   - 🔹 Excellent for exact matches and surface-level search

2. **BERT Embeddings (Semantic Search)**
   - 🔹 Understands meaning beyond keywords
   - 🔹 Captures context and semantic similarity between queries and reviews

3. **Cross-Encoder Reranking**
   - 🔹 Precision scoring for top results
   - 🔹 Reranks candidate documents from BM25 and BERT for maximum relevance
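
To make the lexical stage concrete, here is a minimal, self-contained sketch of BM25 scoring in plain Python. It is an illustration only, not the `rank_bm25` implementation the notebook imports. The helper name `bm25_scores`, the toy reviews, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for this example, and the IDF term uses the common smoothed form `log(1 + (N - n + 0.5) / (n + 0.5))`.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with BM25 (illustrative)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized, shorter ones boosted
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus: made-up reviews, not taken from the Olist data
toy_reviews = [
    "fast delivery and good packaging",
    "poor quality product broke after one week",
    "delivery was late but product is fine",
]
corpus_tokens = [review.split() for review in toy_reviews]
scores = bm25_scores("fast delivery".split(), corpus_tokens)
best = max(range(len(scores)), key=scores.__getitem__)
print(toy_reviews[best])  # prints "fast delivery and good packaging"
```

The first review wins because it contains both query terms. The second shares no term with the query and scores zero, which is exactly the gap the semantic stage is meant to cover.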

---

## **Why Hybrid Search?**

Large-scale search systems, such as those at **Google** and **Amazon**, are commonly described as using multi-stage pipelines of this kind to **balance speed and accuracy**:

- 🔹 **BM25** for quick filtering
- 🔹 **BERT embeddings** for semantic understanding
- 🔹 **Cross-encoder** for precise ranking

Combining them lets the search engine return results that match **both the query's keywords and its meaning**, while the expensive cross-encoder only has to score a small candidate set.
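
The control flow of such a pipeline can be sketched without any model at all. In the sketch below, `hybrid_search` and `top_k` are hypothetical helper names, and the three scoring functions are deliberately crude stand-ins: word overlap in place of BM25, character-trigram overlap in place of embedding similarity, and their sum in place of a cross-encoder. Only the two-stage structure is the point. In the notebook's setting the three callables would presumably wrap the BM25 index, the sentence-embedding model, and the cross-encoder.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def hybrid_search(query: str, docs: List[str], lexical: Scorer, semantic: Scorer,
                  reranker: Scorer, k_candidates: int = 2, k_final: int = 2) -> List[int]:
    """Two-stage search: cheap candidate retrieval, then precise reranking."""
    # Stage 1: each cheap scorer nominates its own top candidates
    lexical_hits = top_k([lexical(query, d) for d in docs], k_candidates)
    semantic_hits = top_k([semantic(query, d) for d in docs], k_candidates)
    # Merge both candidate lists, dropping duplicates but keeping order
    candidates = list(dict.fromkeys(lexical_hits + semantic_hits))
    # Stage 2: the expensive scorer only sees the merged candidates
    reranked = sorted(candidates, key=lambda i: reranker(query, docs[i]), reverse=True)
    return reranked[:k_final]


# --- Toy stand-ins for illustration; these are NOT the real models ---
def word_overlap(query: str, doc: str) -> float:
    """Stand-in for BM25: number of shared words."""
    return float(len(set(query.split()) & set(doc.split())))


def trigram_jaccard(query: str, doc: str) -> float:
    """Stand-in for embedding similarity: Jaccard overlap of character trigrams."""
    q = {query[i:i + 3] for i in range(len(query) - 2)}
    d = {doc[i:i + 3] for i in range(len(doc) - 2)}
    return len(q & d) / len(q | d)


def combined(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: here simply the sum of the two cheap scores."""
    return word_overlap(query, doc) + trigram_jaccard(query, doc)


# Made-up reviews, not taken from the Olist data
toy_reviews = [
    "fast delivery great seller",
    "product arrived quickly",
    "poor quality product",
    "delivery was slow",
]
result = hybrid_search("fast delivery", toy_reviews, word_overlap, trigram_jaccard, combined)
for rank, idx in enumerate(result, start=1):
    print(rank, toy_reviews[idx])
```

Keeping the scorers as plain callables means each stage can be swapped or benchmarked on its own, and the reranker's cost stays bounded by `2 * k_candidates` documents per query rather than by the corpus size.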

---

Let's dive in and **build a search engine that feels like the real deal!**




**Goal:** Given a user query like "fast delivery" or "poor quality product", retrieve the most relevant product reviews from thousands of customer reviews using a two-stage retrieval system (candidate retrieval, then reranking).

# Stage 0: Environment Setup

```python
import io
import logging
import os
import subprocess
import sys
import warnings
from contextlib import redirect_stderr, redirect_stdout


def run_pip(*args):
    """Run `python -m pip <args>` and discard pip's own output."""
    subprocess.run([sys.executable, "-m", "pip", *args],
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)


# Capture and suppress all Python-level output during installation
with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
    # Fix protobuf by pinning it to 3.20.3
    run_pip("uninstall", "-y", "protobuf")
    run_pip("install", "-q", "protobuf==3.20.3")

    # Install the retrieval packages
    for package in ("sentence-transformers", "rank-bm25", "faiss-cpu"):
        run_pip("install", "-q", package)

    # Suppress warnings and TensorFlow's C++ logging
    warnings.filterwarnings('ignore')
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
    # Note: an empty CUDA_VISIBLE_DEVICES hides every GPU, so the models run on CPU
    os.environ['CUDA_VISIBLE_DEVICES'] = ''

    # Import TensorFlow first so its startup errors land in the suppressed context
    import tensorflow as tf

    logging.getLogger('tensorflow').setLevel(logging.ERROR)
    logging.getLogger('transformers').setLevel(logging.ERROR)

# Now import everything else normally (startup errors are already suppressed)
import re
import time
from collections import defaultdict
from typing import List, Tuple

import faiss
import ipywidgets as widgets
import numpy as np
import pandas as pd
from IPython.display import HTML, clear_output, display
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer, util

print("Setup complete - all libraries loaded successfully!")
```

```
Setup complete - all libraries loaded successfully!
```


# Stage 1: Data Loading and Exploration

We load the Olist reviews dataset and perform initial exploration. The main dataset contains customer reviews with text, ratings, and metadata.

```python
# Load the Olist tables (the paths assume the CSV files sit in Colab's /content directory)
reviews_df = pd.read_csv('/content/olist_order_reviews_dataset.csv')
orders_df = pd.read_csv('/content/olist_orders_dataset.csv')
order_items_df = pd.read_csv('/content/olist_order_items_dataset.csv')
products_df = pd.read_csv('/content/olist_products_dataset.csv')
translation_df = pd.read_csv('/content/product_category_name_translation.csv')

print("Dataset loaded successfully!")
print(f"\nReviews shape: {reviews_df.shape}")
print(f"Orders shape: {orders_df.shape}")
print(f"Order items shape: {order_items_df.shape}")
print(f"Products shape: {products_df.shape}")
```

```
Dataset loaded successfully!

Reviews shape: (99224, 7)
Orders shape: (99441, 8)
Order items shape: (112650, 7)
Products shape: (32951, 9)
```


```python
# Display the first few rows
print("\nSample Reviews:")
reviews_df.head()
```

```
Sample Reviews:
```

| | review_id | order_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp |
|---|---|---|---|---|---|---|---|
| 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 | NaN | NaN | 2018-01-18 00:00:00 | 2018-01-18 21:46:59 |
| 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 | NaN | NaN | 2018-03-10 00:00:00 | 2018-03-11 03:05:13 |
| 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 | NaN | NaN | 2018-02-17 00:00:00 | 2018-02-18 14:36:24 |
| 3 | e64fb393e7b32834bb789ff8bb30750e | 658677c97b385a9be170737859d3511b | 5 | NaN | Recebi bem antes do prazo estipulado. | 2017-04-21 00:00:00 | 2017-04-21 22:02:06 |
| 4 | f7c4243c7fe1938f181bec41a392bdeb | 8e6bfb81e283fa7e4f11123a3fb894f1 | 5 | NaN | Parabéns lojas lannister adorei comprar pela I... | 2018-03-01 00:00:00 | 2018-03-02 10:26:53 |


# Initial Data Inspection


Checking data quality, missing values, and review text distribution.

```python
# Check missing values in reviews
print("Missing values in reviews dataset:")
print(reviews_df.isnull().sum())

print("\n" + "="*50)
print("Review Score Distribution:")
print(reviews_df['review_score'].value_counts().sort_index())

print("\n" + "="*50)
# Count reviews that actually contain comment text
non_null_reviews = reviews_df['review_comment_message'].notna().sum()
print(f"Reviews with text: {non_null_reviews} out of {len(reviews_df)} ({non_null_reviews/len(reviews_df)*100:.2f}%)")
```

```
Missing values in reviews dataset:
review_id                      0
order_id                       0
review_score                   0
review_comment_title       87656
review_comment_message     58247
review_creation_date           0
review_answer_timestamp        0
dtype: int64

==================================================
Review Score Distribution:
review_score
1    11424
2     3151
3     8179
4    19142
5    57328
Name: count, dtype: int64

==================================================
Reviews with text: 40977 out of 99224 (41.30%)
```
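
The raw counts are easier to interpret as shares. The short check below reuses only the numbers printed above (it does not touch the dataframe) and confirms that the figures are consistent with each other. The variable names are illustrative.

```python
# Counts copied from the printed output above
score_counts = {1: 11424, 2: 3151, 3: 8179, 4: 19142, 5: 57328}
missing_messages = 58247

total = sum(score_counts.values())
shares = {score: count / total for score, count in score_counts.items()}
for score, share in shares.items():
    print(f"{score} stars: {share:.1%}")

# Share of clearly positive reviews (4 or 5 stars)
positive_share = shares[4] + shares[5]
print(f"4-5 stars combined: {positive_share:.1%}")

# Reviews with text = all reviews minus those with a missing message
with_text = total - missing_messages
print(f"Reviews with text: {with_text}")
```

Roughly three quarters of the reviews are 4 or 5 stars, and only 40,977 of the 99,224 reviews carry any comment text, so the searchable corpus is both much smaller than the full table and skewed toward positive feedback.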



# Stage 2: Data Preprocessing and Enrichment

We clean the review text, combine title and message, and enrich reviews with product information. This creates a comprehensive search corpus.
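
As a rough idea of what the cleaning and title/message combination can look like, here is a small standard-library sketch. The function names `clean_text` and `build_search_text` are hypothetical, and the specific cleaning choices (lowercasing, collapsing whitespace, joining title and message with a period) are assumptions for illustration rather than a description of the notebook's exact rules.

```python
import re


def clean_text(text) -> str:
    """Lowercase and collapse whitespace; missing values (None/NaN) become ''."""
    if not isinstance(text, str):
        return ""  # pandas represents missing text as NaN, which is a float
    return re.sub(r"\s+", " ", text).strip().lower()


def build_search_text(title, message) -> str:
    """Join the cleaned title and message, skipping whichever part is empty."""
    parts = [clean_text(title), clean_text(message)]
    return ". ".join(part for part in parts if part)


# A message taken from the sample rows shown earlier, with no title
print(build_search_text(None, "Recebi bem antes  do prazo estipulado."))
# A made-up title/message pair
print(build_search_text("Super", " Recomendo "))
# Both fields missing: the review contributes nothing to the corpus
print(repr(build_search_text(float("nan"), None)))
```

Treating non-string values as empty text matters here because most rows have no title and more than half have no message. Rows whose combined text is empty can then be dropped before indexing.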