Install the Libraries

In [4]:
#Install Packages
%pip install faiss-cpu
%pip install sentence-transformers

Note: you may need to restart the kernel to use updated packages.


In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', 100)

In [2]:
df = pd.read_csv("/workspaces/LLM-Project/Sample_Files/sample_text.csv")
df.shape

(8, 2)

In [3]:
df

Unnamed: 0,text,category
0,Meditation and yoga can improve mental health,Health
1,"Fruits, whole grains and vegetables helps control blood pressure",Health
2,These are the latest fashion trends for this week,Fashion
3,Vibrant color jeans for male are becoming a trend,Fashion
4,The concert starts at 7 PM tonight,Event
5,Navaratri dandiya program at Expo center in Mumbai this october,Event
6,Exciting vacation destinations for your next trip,Travel
7,Maldives and Srilanka are gaining popularity in terms of low budget vacation places,Travel


Now we convert each row of the text column into a vector (embedding).

### Step 1 : Create source embeddings for the text column

In [6]:
from sentence_transformers import SentenceTransformer




In [7]:
encoder = SentenceTransformer("all-mpnet-base-v2")
vectors = encoder.encode(df.text)



In [9]:
vectors.shape

(8, 768)

In [8]:
dim = vectors.shape[1]
dim

768

### Step 2 : Build a FAISS Index for vectors

Refer to the faiss.ai page for more details.

Import the FAISS library

In [12]:
import faiss
# IndexFlatL2 creates an index that compares vectors by Euclidean (L2) distance, similar in role to a database index
# Here we create an index for vectors of dimension 768
index = faiss.IndexFlatL2(dim)

This creates an empty index; no vectors have been added to it yet.

In [13]:
index

<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x79b1a464c690> >
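To make the role of the index concrete, here is a minimal NumPy sketch of the exhaustive search that a flat L2 index performs conceptually: compare the query against every stored vector and keep the k smallest distances. The helper name `brute_force_l2_search` and the toy 3-dimensional vectors are illustrative assumptions, not part of FAISS. The sketch returns squared L2 distances, which is also what `IndexFlatL2` reports.

```python
import numpy as np

def brute_force_l2_search(stored, queries, k):
    """Return (distances, indices) of the k nearest stored vectors per query.

    Hypothetical helper for illustration only; distances are squared L2.
    """
    stored = np.asarray(stored, dtype=np.float32)
    queries = np.asarray(queries, dtype=np.float32)
    # Pairwise differences via broadcasting: shape (n_queries, n_stored, dim)
    diff = queries[:, None, :] - stored[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    # Sort each row of distances and keep the k closest stored vectors
    order = np.argsort(sq_dist, axis=1)[:, :k]
    return np.take_along_axis(sq_dist, order, axis=1), order

# Toy data: four stored vectors and one query (2-D array, one row per query)
stored = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [3, 3, 3]], dtype=np.float32)
query = np.array([[0.9, 0.1, 0.0]], dtype=np.float32)
distances, indices = brute_force_l2_search(stored, query, k=2)
print(distances, indices)
```

Like the FAISS call used later in this notebook, the sketch takes a 2-D query array and returns one row of distances and one row of positions per query.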

### Step 3 : Normalize the source vectors (as we are using L2 distance to measure similarity) and add to the index

In [14]:
index.add(vectors)

In [15]:
index

<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x79b1a464c690> >
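Note that the cell above adds the vectors exactly as the encoder produced them; no explicit normalization call is run. If explicit normalization is wanted, `faiss.normalize_L2` (the call left commented out in Step 4) scales each row to unit length in place. The NumPy sketch below shows the same idea on toy vectors and why it matters for an L2 index; the helper name `l2_normalize_rows` is an assumption made for illustration.

```python
import numpy as np

def l2_normalize_rows(x, eps=1e-12):
    """Return a copy of x with every row scaled to unit L2 norm (hypothetical helper)."""
    x = np.asarray(x, dtype=np.float32)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows
    return x / np.maximum(norms, eps)

a = l2_normalize_rows([[3.0, 4.0]])   # becomes [0.6, 0.8]
b = l2_normalize_rows([[1.0, 0.0]])   # already unit length

sq_l2 = float(((a - b) ** 2).sum())   # squared L2 distance between the unit vectors
cosine = float((a * b).sum())         # cosine similarity (dot product of unit vectors)

# For unit vectors: squared L2 distance == 2 - 2 * cosine similarity
print(sq_l2, 2 - 2 * cosine)
```

When both the stored vectors and the query are unit length, ranking by squared L2 distance gives the same order as ranking by cosine similarity, and a reported distance d corresponds to a cosine similarity of 1 - d/2.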

### Step 4 : Encode search text using the same encoder and normalize the output vector

In [26]:
# search_query = "I want to buy a polo t-shirt"
search_query = "looking for places to visit during the holidays"
# search_query = "An apple a day keeps the doctor away"
vec = encoder.encode(search_query)
vec.shape

(768,)

However, the index search expects a two-dimensional array (one row per query), so we use NumPy to reshape this vector from shape (768,) to shape (1, 768).

In [27]:
import numpy as np
svec = np.array(vec).reshape(1,-1)
svec.shape

(1, 768)
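The reshape can be written in a few equivalent ways. A small NumPy sketch, using a made-up 4-element vector as a stand-in for the 768-dimensional query embedding:

```python
import numpy as np

vec = np.arange(4, dtype=np.float32)   # stand-in for a (768,) embedding

row_a = vec.reshape(1, -1)      # -1 lets NumPy infer the second dimension
row_b = vec[np.newaxis, :]      # add a leading axis by indexing
row_c = np.atleast_2d(vec)      # promote a 1-D array to 2-D

print(row_a.shape, row_b.shape, row_c.shape)
```

All three produce a single-row 2-D array. Another option, as far as the sentence-transformers API goes, is to pass a one-element list such as `encoder.encode([search_query])`, which should return a 2-D array directly.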

In [28]:
# faiss.normalize_L2(svec)

### Step 5: Search for similar vector in the FAISS index created

In [29]:
index.search(svec, k=2)
#it returns a tuple: the first array holds the distances and the second holds the row
#positions in our original data frame. For this query, rows 6 and 7 are the most similar, that is
#6	Exciting vacation destinations for your next trip	Travel
#7	Maldives and Srilanka are gaining popularity in terms of low budget vacation places	Travel

(array([[0.9273729, 1.1601744]], dtype=float32), array([[6, 7]]))

In [30]:
# here k is the number of nearest vectors to return
distances, I = index.search(svec, k=2)
distances
# it returns the distances

array([[0.9273729, 1.1601744]], dtype=float32)

In [31]:
# it returns the indexes
I

array([[6, 7]])

In [32]:
df.loc[I[0]]

Unnamed: 0,text,category
6,Exciting vacation destinations for your next trip,Travel
7,Maldives and Srilanka are gaining popularity in terms of low budget vacation places,Travel
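To read the two result arrays together, each returned position can be paired with its distance and the matching text. The sketch below hard-codes the values shown in the outputs above instead of re-running the encoder; the names `texts` and `results` are illustrative assumptions.

```python
import numpy as np

# Values copied from the search output shown earlier in this notebook
distances = np.array([[0.9273729, 1.1601744]], dtype=np.float32)
I = np.array([[6, 7]])

# The text column of the sample data frame, in row order
texts = [
    "Meditation and yoga can improve mental health",
    "Fruits, whole grains and vegetables helps control blood pressure",
    "These are the latest fashion trends for this week",
    "Vibrant color jeans for male are becoming a trend",
    "The concert starts at 7 PM tonight",
    "Navaratri dandiya program at Expo center in Mumbai this october",
    "Exciting vacation destinations for your next trip",
    "Maldives and Srilanka are gaining popularity in terms of low budget vacation places",
]

# Row 0 of each array belongs to the first (and only) query
results = [(int(i), texts[i], float(d)) for i, d in zip(I[0], distances[0])]
for row, text, dist in results:
    print(f"{row}\t{dist:.4f}\t{text}")
```

Results come back ordered from smallest to largest distance, so the first entry is the closest match.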


In [33]:
search_query

'looking for places to visit during the holidays'

You can see that both results returned from the dataframe are semantically similar to the search_query: each is a Travel row about vacation destinations.