##**Product Recommendations model using Distilroberta model and FAISS**

In this notebook, we take a readily available product dataset (from Instacart) and see if we can use existing downstream models (transformer based NLP models like DistillRoberta) to generate product embeddings. Then we try to use these embeddings for similar product recommendations and to also search items corresponding to a specific query. 

We use Facebook's FAISS package to find Approximate Nearest Neighbours. 



In [2]:
#importing the libraries

import pandas as pd
import numpy as np
import datetime as dt
import time 

In [3]:
#loading the data into dataframe
df=pd.read_csv("/content/products.csv")

In [4]:
df.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49688 entries, 0 to 49687
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   product_id     49688 non-null  int64 
 1   product_name   49688 non-null  object
 2   aisle_id       49688 non-null  int64 
 3   department_id  49688 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 1.5+ MB


In [6]:
#checking for null values
df.isnull().sum()

product_id       0
product_name     0
aisle_id         0
department_id    0
dtype: int64

In [8]:
#checking for the duplicates
df.duplicated().sum()

0

In [9]:
df['product_name']=df['product_name'].astype(str)

In [None]:
%pip install faiss-gpu ##Installing GPU version of faiss

In [None]:
%pip install sentence_transformers ## For textual similarity, using pretrained models

In [12]:
import faiss
from sentence_transformers import SentenceTransformer, util

In [None]:
# Using pretrained BERT for item embeddings.On GPU Loads the distil roberta model,which was trained on millions of data
model = SentenceTransformer('paraphrase-distilroberta-base-v1') 


In [27]:
# preparing data to feed into the SentenceTransformer model

prod=df.product_name.drop_duplicates()
sentences=prod.tolist()
print("Number of Unique Sentences in Product Name ",len(sentences))

Number of Unique Sentences in Product Name  49688


In [28]:
#Generating embeddings for each product

embeddings=model.encode(sentences)

In [29]:
faiss.normalize_L2(embeddings) ## Normalising the Embeddings
print("Shape of the EMbeddings is ",embeddings.shape)

Shape of the EMbeddings is  (49688, 768)


##Partitioning The Index
Faiss allows us to add multiple steps that can optimize our search
using many different methods. A popular approach is to partition the index into Voronoi cells. 

Using this method, we would take a query vector xq, identify the cell it belongs to, and then use our IndexFlatL2 (or another metric) to search between the query vector and all other vectors belonging to that specific cell.

So, we are reducing the scope of our search, producing an approximate answer, rather than exact (as produced through exhaustive search).



**To implement this, we first initialize our index using IndexFlatL2 — but this time, we are using the L2 index as a quantizer step — which we feed into the partitioning IndexIVFFlat index.**



All of our indexes so far have stored our vectors as full (eg Flat) vectors. Now, in very large datasets this can quickly become a problem.

**Fortunately, Faiss comes with the ability to compress our vectors using Product Quantization (PQ).**


PQ achieves approximated similarity operation by compressing the vectors/embeddings themselves, which consists of three steps.

1. Split the original vector into several subvectors.
2. For each set of subvectors, we perform a clustering operation — creating multiple centroids for each sub-vector set.
3. In our vector of sub-vectors, we replace each sub-vector with the ID of it’s nearest set-specific centroid.

In [30]:
# Initializing hyper parameters 

nlist=16 # How many cells or number of search spaces
dim=768 #We get a 768 dimension vector using Roberta. So we will create FAISS index with dimaensions - 768
m = 50  # number of centroid IDs in final compressed vectors, This is a hyperparameter, and indicates number of clusters of a vector to be split into
bits = 8 # number of bits in each centroid

quantiser = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFPQ (quantiser, dim, nlist, m, bits)
index.train(embeddings) ## This step, will do the clustering and create the clusters
print(index.is_trained)
faiss.write_index(index, "trained.index")

True


In [31]:
#creating ids (index olumn) associated with product_name index

df_prod=pd.DataFrame(prod).reset_index()
df_prod.head()

Unnamed: 0,index,product_name
0,0,Chocolate Sandwich Cookies
1,1,All-Seasons Salt
2,2,Robust Golden Unsweetened Oolong Tea
3,3,Smart Ones Classic Favorites Mini Rigatoni Wit...
4,4,Green Chile Anytime Sauce


In [32]:
# Adding the embeddings to the trained Index

ids=df_prod['index'].tolist()
ids=np.array(ids)

index.add_with_ids(embeddings,ids)
print(index.ntotal)

49688


In [33]:
faiss.write_index(index,"block.index")


**If approximate search with IndexIVFFlat returns suboptimal results, we can improve accuracy by increasing the search scope. We do this by increasing the nprobe attribute value — which defines how many nearby cells to search.**



In [42]:
# defining a function to find top 15 Similar Items for a new query

def searchFAISSIndex(data,id_col_name,query,index,nprobe,model,topk=15):

    ## Convert the query into embeddings
    query_embedding=model.encode([query])[0]
    dim=query_embedding.shape[0]
    query_embedding=query_embedding.reshape(1,dim)
    faiss.normalize_L2(query_embedding)
  
    
    index.nprobe=nprobe
    
    D,I=index.search(query_embedding,topk) 
    ids=[i for i in I][0]
    L2_score=[d for d in D][0]
    inner_product=[calculateInnerProduct(l2) for l2 in L2_score]
    search_result=pd.DataFrame()
    search_result[id_col_name]=ids
    search_result['cosine_sim']=inner_product
    search_result['L2_score']=L2_score
    dat=data[data[id_col_name].isin(ids)]
    dat=pd.merge(dat,search_result,on=id_col_name)
    dat=dat.sort_values('cosine_sim',ascending=False)
    return dat

In [35]:
import math 
def calculateInnerProduct(L2_score):
    return (2-math.pow(L2_score,2))/2

In [None]:
search_result = []
import time
start = time.time()
for query in df_prod.values:
  search_result.append(searchFAISSIndex(df_prod,"index",str(query),index,nprobe=10,model=model,topk=15))
end = time.time()  

In [89]:
print("TIME TAKEN =",(end-start)/60,"mins")

TIME TAKEN = 65.32744178374608 mins


In [83]:
# Testing the results 
search_result[9909]

Unnamed: 0,index,product_name,cosine_sim,L2_score
4,9909,Vitamin Code Raw Prenatal Probiotic Immune Sup...,0.834938,0.574564
14,47377,Women's Prenatal 1 with DHA & Folic Acid Softg...,0.74351,0.716227
13,44720,Men's One Food-Based Multivitamin,0.74004,0.721055
0,4411,Complete Prenatal System Food-Based Multivitamin,0.735882,0.726798
3,9614,Just One Prenatal Advanced Multivitamin,0.73256,0.731354
7,23876,Vitamin Code Raw Prenatal Vegetarian Capsules,0.730357,0.73436
2,6595,Vitamin Code Raw B-12,0.729363,0.735713
11,41522,PreNatal Multivitamin Adult,0.727055,0.738844
1,5401,Immune Support Supplement Vitamin C Effervesce...,0.726097,0.740139
9,32703,Vitamin Code Men's Raw Whole Food Mutlivitamin,0.725446,0.741019


In [69]:
search_result[5371]

Unnamed: 0,index,product_name,cosine_sim,L2_score
0,5371,Oats Apple Cinnamon Blended Low-Fat Greek Yogurt,0.87911,0.491712
5,33675,Blended Apple Cinnamon Greek Yoghurt,0.85314,0.541959
10,43901,Coconut Blended Low-Fat Greek Yogurt,0.815978,0.606667
9,40088,Yogurt Blends Apple Cinnamon,0.813518,0.610708
2,20665,Apple Cinnamon Yogurt,0.813037,0.611496
13,45438,Plenti Greek Apple Cinnamon Oatmeal with Yogurt,0.800053,0.632371
11,45143,Pumpkin Spice Blended Low-Fat Greek Yogurt,0.796527,0.637924
3,21275,Slow Churned Fat Free Yogurt Blends Vanilla,0.796353,0.638196
14,48625,Coconut Blended Greek Yogurt,0.79531,0.639828
4,30142,Mighty Oats Blueberry Blended Low-Fat Greek Yo...,0.795096,0.640163


In [68]:
search_result[7648]

Unnamed: 0,index,product_name,cosine_sim,L2_score
3,7648,Whole Wheat Tandoori Naan,0.826357,0.589309
6,26664,Tandoor Baked Naan,0.785465,0.655034
0,665,Whole Wheat Loaves,0.728162,0.737344
7,27541,Whole Wheat Fajita,0.726005,0.740263
13,45653,Whole Wheat,0.725487,0.740963
10,31085,Whole Wheat Pita,0.724341,0.742508
1,680,Whole Wheat Mini Toasts,0.722334,0.745206
11,34560,Naan,0.71947,0.749039
12,44740,Honey Whole Wheat,0.719045,0.749607
2,2822,Whole Wheat Penne Rigate,0.718864,0.749848


In [80]:
search_result[44335]

Unnamed: 0,index,product_name,cosine_sim,L2_score
11,46279,Light & Fit Greek Blueberry Greek Yogurt,0.83766,0.569806
10,44335,"Light & Fit Greek Yogurt, Cherry",0.837365,0.570324
1,5196,Light & Fit Greek Yogurt,0.830834,0.581664
7,35481,Peach Light & Fit Greek Yogurt,0.813203,0.611223
12,46326,Raspberry Light & Fit Greek Yogurt,0.81049,0.615646
5,19976,Light & Fit Greek Cherry Yogurt,0.810101,0.616277
4,16552,Light & Fit Strawberry Greek Yogurt,0.805795,0.623225
8,37703,"Organic Strawberry, Apple & Beet Greek Yogurt ...",0.798577,0.634702
14,47983,Light & Fit Peach Yogurt,0.796766,0.637548
2,7531,Light & Fit Strawberry Yogurt,0.796728,0.637608


In [98]:
# finding Similar items for a new test query (which was not in the training)

query="Ginger Unsweetened Tea bags"
search=searchFAISSIndex(df_prod,"index",str(query),index,nprobe=10,model=model,topk=10)
print(search[['product_name','cosine_sim','L2_score']])

                                   product_name  cosine_sim  L2_score
2  Unsweetened Family-Sized Black Iced Tea Bags    0.784311  0.656794
9        Green Naturally Decaffeinated Tea Bags    0.758601  0.694836
6   100% Real Tea Leaves Decaffeinated Tea Bags    0.757629  0.696234
0                   Orange Ginger Mint Tea Bags    0.756649  0.697641
1                                Nut Goodie Bag    0.756377  0.698029
8            Honey Lemon Ginseng Green Tea Bags    0.754790  0.700300
3               Iced Tea Decaffeinated Tea Bags    0.752159  0.704047
7                                    Ginger Tea    0.749801  0.707388
4                  Decaffeinated Green Tea Bags    0.745500  0.713443
5   Green Tea with Lemon Decaffeinated Tea Bags    0.743762  0.715874
