Building a vector-based search engine with Sentence Transformers and Faiss

https://towardsdatascience.com/how-to-build-a-semantic-search-engine-with-transformers-and-faiss-dcbea307a0e8
    
Hugging Face’s Transformers library provides access to state-of-the-art pretrained models. Using pretrained models has many advantages:

* They usually produce high-quality embeddings as they were trained on large amounts of text data.
* They don’t require you to create a custom tokeniser as Transformers come with their own methods.
* It’s straightforward to fine-tune a model to your task.



- The most naive way to retrieve relevant documents is to measure the cosine similarity between the query vector and every document vector in the database and return those with the highest scores. This exhaustive comparison becomes very slow as the collection grows.

- To build our semantic search engine we will use Sentence Transformers, which fine-tunes BERT-based models to produce semantically meaningful embeddings of long text sequences.

- The preferred approach is to use Faiss, a library for efficient similarity search and clustering of dense vectors. Faiss offers a large collection of indexes and composite indexes. Moreover, given a GPU, Faiss scales up to billions of vectors.
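To make the naive approach concrete, here is a minimal pure-Python sketch of brute-force cosine ranking. The helper names (`cosine_similarity`, `brute_force_search`) and the toy 3-dimensional vectors are made up for illustration; real embeddings from the model used below have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query_vec, doc_vecs, num_results=3):
    """Score every document against the query and return the best matches.

    The work grows linearly with the number of documents, which is why
    this approach struggles on large collections.
    """
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:num_results]


# Toy "embeddings" (hypothetical values, for illustration only).
toy_docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.05, 0.0]
top_hits = brute_force_search(toy_query, toy_docs, num_results=2)
print(top_hits)
```

Each query touches every stored vector once, so doubling the collection doubles the search time.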

In [None]:
# pip install sentence_transformers

In [None]:
# pip install faiss-cpu

In [None]:
# %autoreload 2
# Used to import data from local.
import pandas as pd

# Used to create the dense document vectors.
import torch
from sentence_transformers import SentenceTransformer

# Used to create and store the Faiss index.
import faiss
import numpy as np
import pickle
from pathlib import Path

# Used to do vector searches and display the results.
# from vector_engine.utils import vector_search, id2details

In [None]:
import numpy as np


def vector_search(query, model, index, num_results=10):
    """Transforms the query to a vector using a pretrained, sentence-level
    DistilBERT model and finds similar vectors using Faiss.
    Args:
        query (str or list of str): User query, ideally more than a sentence long.
        model (sentence_transformers.SentenceTransformer.SentenceTransformer)
        index (faiss.Index): Faiss index holding the document vectors.
        num_results (int): Number of results to return.
    Returns:
        D (:obj:`numpy.array` of `float`): Distances between results and query.
        I (:obj:`numpy.array` of `int`): IDs of the results.

    """
    # Wrap a bare string so encode() receives a list of texts;
    # list("abc") would split the string into single characters.
    queries = [query] if isinstance(query, str) else list(query)
    vector = model.encode(queries)
    D, I = index.search(np.array(vector).astype("float32"), k=num_results)
    return D, I

def id2details(df, I, column):
    """Returns the paper titles based on the paper index."""
    return [list(df[df.id == idx][column]) for idx in I[0]]
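`vector_search` returns the `D` and `I` arrays from the index unchanged. As a rough illustration of what a flat L2 index computes, the pure-Python stand-in below scans every stored vector and returns the `k` closest. The function `flat_l2_search` and the toy data are hypothetical and are not part of Faiss. Note that Faiss's `IndexFlatL2` reports squared Euclidean distances, which the sketch mirrors.

```python
def flat_l2_search(query_vecs, stored_vecs, stored_ids, k):
    """Exhaustive nearest-neighbour search, mimicking a flat L2 index.

    Returns (D, I): for each query, the k smallest *squared* L2 distances
    and the IDs of the matching stored vectors, closest first.
    """
    D, I = [], []
    for q in query_vecs:
        scored = sorted(
            (sum((a - b) ** 2 for a, b in zip(q, vec)), vec_id)
            for vec, vec_id in zip(stored_vecs, stored_ids)
        )
        D.append([dist for dist, _ in scored[:k]])
        I.append([vec_id for _, vec_id in scored[:k]])
    return D, I


# Toy data (hypothetical): three stored vectors with custom IDs.
stored = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
ids = [10, 11, 12]
D, I = flat_l2_search([[0.0, 0.0]], stored, ids, k=2)
print(D, I)
```

Both results hold one row per query, which is why the cells below read the hits with `I.tolist()[0]` or `I[0]`.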

Preprocess the Data 

In [None]:
df = pd.read_csv("data/ready_for_model.csv")

topic_lst = df.main_topics.unique().tolist()
for topic in topic_lst:
    f_name = "ready_for_model_{}.csv".format(topic)
    df[df.main_topics == topic].to_csv("data/{}".format(f_name))

In [None]:
# import glob

# file_lst = glob.glob("*.csv")
# file_lst

['ready_for_model_bird_flamingo.csv',
 'ready_for_model_bird_tailorbird.csv',
 'ready_for_model_bird_peacock.csv',
 'ready_for_model_insect_ant.csv',
 'ready_for_model_insect_beetle.csv']

In [None]:
df_topic_ant = pd.read_csv('data/ready_for_model_insect_ant.csv')
df_topic_ant.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Title,Video_ID,Category,Age_Restricted,final_corrected_version_sentences_txt,final_corrected_version_txt,sentence_level_timstamp_min_sec,sentence_level_timstamp_max_sec,sentence_level_timstamp_min_minute,sentence_level_timstamp_max_minute,duration,Length_(min),Views_(thous),main_topics,reference_text,reference_text_cleaned
0,0,0,Children Learn About The Ant,cXUCUvcscXs,Ant,False,welcome to my treehouse kind of look at,welcome to my treehouse kind of look at. this ...,0.149,7.259,0.002483,0.120983,7.11,5.333333,911.658,insect_ant,['Ants are one of the most common insects that...,Ants are one of the most common insects that l...
1,1,1,Children Learn About The Ant,cXUCUvcscXs,Ant,False,this this is my ant farm answer amazing,welcome to my treehouse kind of look at. this ...,2.669,9.42,0.044483,0.157,6.751,5.333333,911.658,insect_ant,['Ants are one of the most common insects that...,Ants are one of the most common insects that l...
2,2,2,Children Learn About The Ant,cXUCUvcscXs,Ant,False,creatures lots of insects live by,welcome to my treehouse kind of look at. this ...,7.259,11.28,0.120983,0.188,4.021,5.333333,911.658,insect_ant,['Ants are one of the most common insects that...,Ants are one of the most common insects that l...
3,3,3,Children Learn About The Ant,cXUCUvcscXs,Ant,False,themselves and have to find their food,welcome to my treehouse kind of look at. this ...,9.42,14.19,0.157,0.2365,4.77,5.333333,911.658,insect_ant,['Ants are one of the most common insects that...,Ants are one of the most common insects that l...
4,4,4,Children Learn About The Ant,cXUCUvcscXs,Ant,False,all on their own but ants live with lots,welcome to my treehouse kind of look at. this ...,11.28,16.52,0.188,0.275333,5.24,5.333333,911.658,insect_ant,['Ants are one of the most common insects that...,Ants are one of the most common insects that l...


In [None]:
df_topic_flamingo = pd.read_csv("data/ready_for_model_bird_flamingo.csv")


In [None]:
df_topic_tailorbird = pd.read_csv('data/ready_for_model_bird_tailorbird.csv')


In [None]:
df_topic_peacock = pd.read_csv('data/ready_for_model_bird_peacock.csv')


In [None]:
df_topic_beetle = pd.read_csv('data/ready_for_model_insect_beetle.csv')

In [None]:
def BERT_Search_Engine_Preprocess(df_topic):
  
  reference_text_df = df_topic[['reference_text_cleaned', 'Video_ID','main_topics']].drop_duplicates().reset_index(drop = True)
  reference_text_df = reference_text_df[reference_text_df['Video_ID'].str.endswith("_v")].reset_index(drop = True)
  reference_text_df['Video_ID'] = reference_text_df['main_topics']

  text_lvl_df = df_topic[['final_corrected_version_txt', 'Video_ID', 'Title','main_topics']].drop_duplicates()
  # text_lvl_df = pd.concat([text_lvl_df, text_lvl_df_reference ], axis = 0)
  text_lvl_df = text_lvl_df.reset_index(drop = False).reset_index()
  text_lvl_df =text_lvl_df.rename(columns = {'level_0': 'id_txt'})
  text_lvl_df = text_lvl_df.drop(columns='index')


  sentences_lvl_df = df_topic[['final_corrected_version_sentences_txt', 'Video_ID', 'main_topics',  'sentence_level_timstamp_min_sec', 'sentence_level_timstamp_max_sec']].drop_duplicates()
  # sentences_lvl_df = sentences_lvl_df[~sentences_lvl_df['Video_ID'].str.endswith("_v")]
  # sentences_lvl_df = pd.concat([sentences_lvl_df, sentence_lvl_df_reference ], axis = 0)
  sentences_lvl_df = sentences_lvl_df.reset_index(drop = False).reset_index()
  sentences_lvl_df =sentences_lvl_df.rename(columns = {'level_0': 'index_sentence'})
  sentences_lvl_df = sentences_lvl_df.drop(columns='index')
  # Reference rows (Video_ID ending in "_v") are deliberately kept: the sentence-level search filters on them later.

  return reference_text_df, text_lvl_df, sentences_lvl_df


In [None]:
def BERT_Search_Engine(text_lvl_df, sentences_lvl_df):

  # Instantiate the sentence-level DistilBERT
  model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
  # Check if GPU is available and use it
  if torch.cuda.is_available():
      model = model.to(torch.device("cuda"))
  # print(model.device)
  
  # Convert text to vectors
  embeddings_txt = model.encode(text_lvl_df.final_corrected_version_txt.to_list(), show_progress_bar=True)
  # convert sentence to vectors
  embeddings_sentences = model.encode(sentences_lvl_df.final_corrected_version_sentences_txt.to_list(), show_progress_bar=True)
  
  # index the text embedding:
  # Step 1: Change data type
  embeddings_txt = np.asarray(embeddings_txt, dtype="float32")

  # Step 2: Instantiate the index
  index = faiss.IndexFlatL2(embeddings_txt.shape[1])

  # Step 3: Pass the index to IndexIDMap
  index = faiss.IndexIDMap(index)

  # Step 4: Add vectors and their IDs
  index.add_with_ids(embeddings_txt, text_lvl_df.id_txt.values)

  # print(f"Number of vectors in the Faiss index in Text Lvl: {index.ntotal}")


  # index the sentences embedding:
  # Step 1: Change data type
  embeddings_sentences = np.asarray(embeddings_sentences, dtype="float32")

  # Step 2: Instantiate the index
  index_sentence = faiss.IndexFlatL2(embeddings_sentences.shape[1])

  # Step 3: Pass the index to IndexIDMap
  index_sentence = faiss.IndexIDMap(index_sentence)

  # Step 4: Add vectors and their IDs
  index_sentence.add_with_ids(embeddings_sentences, sentences_lvl_df.index_sentence.values)

  # print(f"Number of vectors in the Faiss index in Sentences Lvl: {index_sentence.ntotal}")

  return model, index, index_sentence
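The introduction describes ranking by cosine similarity, while the indexes built here use L2 distance. The two agree in ranking only when the vectors are unit-length: in that case the squared L2 distance equals 2 − 2·cos. The distances printed later in this notebook are far above 4 (the maximum squared distance between two unit vectors), which suggests these embeddings are not normalized, so an L2 ranking may differ from a cosine ranking. A minimal check of the identity, with hypothetical helper names and toy vectors:

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical), normalized before comparison.
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

lhs = squared_l2(a, b)
rhs = 2.0 - 2.0 * cosine(a, b)
print(lhs, rhs)  # the two values agree up to floating-point error
```

If cosine ranking is what you want, one option is to normalize the embeddings before adding them to the index and to normalize the query vector the same way.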


In [None]:
## Ant

In [None]:
reference_text_df, text_lvl_df, sentences_lvl_df = BERT_Search_Engine_Preprocess(df_topic = df_topic_ant)
model, index, index_sentence = BERT_Search_Engine(text_lvl_df = text_lvl_df, sentences_lvl_df = sentences_lvl_df)
query = reference_text_df["reference_text_cleaned"].iloc[0]


In [None]:
D, I = vector_search([query], model, index, num_results=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nText IDs: {I.flatten().tolist()}')

# Fetch the video titles by their IDs
rank_num = 1
print("Under the main topic: Insect_Ant: \n")
for i in I.tolist()[0]:
  top_txt_result_df = text_lvl_df[text_lvl_df['id_txt']==i]['Title'].values[0]
  print("Rank{} : Video - {}".format(rank_num,top_txt_result_df))
  rank_num = rank_num + 1

L2 distance: [69.26106262207031, 79.56642150878906, 85.86907958984375, 91.19791412353516, 115.78446197509766, 133.18643188476562, 139.6414031982422, 139.88050842285156, 145.55471801757812, 157.7864990234375]

Text IDs: [0, 3, 18, 7, 10, 1, 17, 8, 2, 6]
Under the main topic: Insect_Ant: 

Rank1 : Video - Children Learn About The Ant
Rank2 : Video - Ants | Science for Kids
Rank3 : Video - insects_intro_ant_v
Rank4 : Video - How Do Ants Find Food? | Animal Science for Kids
Rank5 : Video - National Geographic Readers: Ants
Rank6 : Video - 🐜 ALL ABOUT ANTS 🐜 | 5 AMAZING FACTS ABOUT ANTS 🤯 | EXPLORER MAX
Rank7 : Video - What if there were no Ants? + more videos | #aumsum #kids #science #education #children
Rank8 : Video - The Life Cycle Of An Ant - Ant Life Cycle Lesson For Kids
Rank9 : Video - My Animal Friends -The Different Types of Ants | Bugs for Kids | Wizz | TV Shows for Kids
Rank10 : Video - Army Ant 🐜 | Amazing Animals


In [None]:
## Fetch the 100 nearest sentences, then keep only those from the reference video "insects_intro_ant_v"
D, I = vector_search([query], model, index_sentence, num_results=100)
print(f'L2 distance: {D.flatten().tolist()}\n\nSentence IDs: {I.flatten().tolist()}')
# Fetch the video id based on their index
rank_num = 1
print("Under the Video 'insects_intro_ant_v': \n")
for i in I.tolist()[0]:
  top_sentence_result = sentences_lvl_df[sentences_lvl_df['index_sentence']==i]['final_corrected_version_sentences_txt'].values[0]
  top_sentence_video_result = sentences_lvl_df[sentences_lvl_df['index_sentence']==i]['Video_ID'].values[0]
  if top_sentence_video_result == "insects_intro_ant_v":
    print("\nRank{} : Video - {}\nSentence:\n{}".format(rank_num,top_sentence_video_result,top_sentence_result ))
    rank_num = rank_num + 1


L2 distance: [94.9134521484375, 125.75140380859375, 130.35272216796875, 141.81874084472656, 144.03453063964844, 144.4069061279297, 145.5682373046875, 152.12161254882812, 152.85801696777344, 153.13104248046875, 158.89987182617188, 159.02020263671875, 162.29281616210938, 163.1450653076172, 165.9448699951172, 167.31411743164062, 167.5380401611328, 167.62600708007812, 167.92446899414062, 169.4123077392578, 169.8798370361328, 171.6204071044922, 172.45797729492188, 173.6826629638672, 174.26608276367188, 175.33599853515625, 175.40699768066406, 175.9908905029297, 176.941650390625, 177.18211364746094, 177.3096160888672, 178.3369903564453, 178.84349060058594, 179.11492919921875, 180.47044372558594, 180.60462951660156, 181.3361053466797, 182.46067810058594, 182.91844177246094, 183.4300079345703, 184.39605712890625, 186.29959106445312, 187.77862548828125, 188.35816955566406, 188.5408935546875, 188.5891876220703, 188.6169891357422, 189.2292022705078, 190.9556427001953, 190.962890625, 191.2497558593
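The sentence-level cell over-fetches (100 nearest sentences across all videos) and then keeps only the hits that belong to the reference video. A minimal sketch of that "over-fetch, then filter" step, using a plain dictionary instead of a DataFrame; the function `top_hits_for_video`, the `limit` parameter, and the toy rows are hypothetical:

```python
def top_hits_for_video(ranked_ids, id_to_row, video_id, limit=10):
    """Keep the first `limit` hits that belong to `video_id`, in rank order.

    `ranked_ids` is assumed to be sorted from closest to farthest, as the
    IDs returned by the index search are.
    """
    hits = []
    for sentence_id in ranked_ids:
        row = id_to_row[sentence_id]
        if row["Video_ID"] == video_id:
            hits.append((len(hits) + 1, row["sentence"]))
            if len(hits) == limit:
                break
    return hits


# Toy lookup table (hypothetical): sentence ID -> row.
id_to_row = {
    0: {"Video_ID": "vid_a", "sentence": "ants live in colonies"},
    1: {"Video_ID": "ref_v", "sentence": "ants are common insects"},
    2: {"Video_ID": "ref_v", "sentence": "they work together"},
    3: {"Video_ID": "vid_b", "sentence": "look at this ant farm"},
}
ranked_ids = [1, 0, 3, 2]  # closest first

hits = top_hits_for_video(ranked_ids, id_to_row, "ref_v", limit=10)
print(hits)
```

Because the filter runs after the search, fewer than `limit` sentences come back whenever the reference video has few sentences among the fetched neighbours; raising `num_results` in the search is the simplest way to compensate.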

In [None]:
## Flamingo

In [None]:
reference_text_df, text_lvl_df, sentences_lvl_df = BERT_Search_Engine_Preprocess(df_topic = df_topic_flamingo )
model, index, index_sentence = BERT_Search_Engine(text_lvl_df = text_lvl_df, sentences_lvl_df = sentences_lvl_df)
query = reference_text_df["reference_text_cleaned"].iloc[0]

In [None]:
D, I = vector_search([query], model, index, num_results=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nText IDs: {I.flatten().tolist()}')

# Fetch the video titles by their IDs
rank_num = 1
print("Under the main topic: bird_flamingo: \n")
for i in I.tolist()[0]:
  top_txt_result_df = text_lvl_df[text_lvl_df['id_txt']==i]['Title'].values[0]
  print("Rank{} : Video - {}".format(rank_num,top_txt_result_df))
  rank_num = rank_num + 1

In [None]:
## Fetch the 100 nearest sentences, then keep only those from the reference video "birds_intro_flamingo_v"
D, I = vector_search([query], model, index_sentence, num_results=100)
print(f'L2 distance: {D.flatten().tolist()}\n\nSentence IDs: {I.flatten().tolist()}')


# Fetch the video id based on their index
rank_num = 1
print("Under the Video 'birds_intro_flamingo_v': \n")
for i in I.tolist()[0]:
  top_sentence_result = sentences_lvl_df[sentences_lvl_df['index_sentence']==i]['final_corrected_version_sentences_txt'].values[0]
  top_sentence_video_result = sentences_lvl_df[sentences_lvl_df['index_sentence']==i]['Video_ID'].values[0]
  if top_sentence_video_result == "birds_intro_flamingo_v":
    print("\nRank{} : Video - {}\nSentence:\n{}".format(rank_num,top_sentence_video_result,top_sentence_result ))
    rank_num = rank_num + 1


In [None]:
## Tailorbird  

In [None]:

reference_text_df, text_lvl_df, sentences_lvl_df = BERT_Search_Engine_Preprocess(df_topic = df_topic_tailorbird)
model, index, index_sentence = BERT_Search_Engine(text_lvl_df = text_lvl_df, sentences_lvl_df = sentences_lvl_df)
query = reference_text_df["reference_text_cleaned"].iloc[0]

In [None]:
D, I = vector_search([query], model, index, num_results=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nText IDs: {I.flatten().tolist()}')

# Fetch the video id based on their index
rank_num = 1
print("Under the main topic: bird_tailorbird: \n")
for i in I.tolist()[0]:
  top_txt_result_df = text_lvl_df[text_lvl_df['id_txt']==i]['Title'].values[0]
  print("Rank{} : Video - {}".format(rank_num,top_txt_result_df))
  rank_num = rank_num + 1

In [None]:
## Fetch the 100 nearest sentences, then keep only those from the reference video "birds_intro_tailor_bird_v"
D, I = vector_search([query], model, index_sentence, num_results=100)
print(f'L2 distance: {D.flatten().tolist()}\n\nSentence IDs: {I.flatten().tolist()}')


# Fetch the video id based on their index
rank_num = 1
print("Under the Video 'birds_intro_tailor_bird_v': \n")
for i in I.tolist()[0]:
  top_sentence_result = sentences_lvl_df[sentences_lvl_df['index_sentence']==i]['final_corrected_version_sentences_txt'].values[0]
  top_sentence_video_result = sentences_lvl_df[sentences_lvl_df['index_sentence']==i]['Video_ID'].values[0]
  if top_sentence_video_result == "birds_intro_tailor_bird_v":
    print("\nRank{} : Video - {}\nSentence:\n{}".format(rank_num,top_sentence_video_result,top_sentence_result ))
    rank_num = rank_num + 1


In [None]:
## Peacock

In [None]:

reference_text_df, text_lvl_df, sentences_lvl_df = BERT_Search_Engine_Preprocess(df_topic = df_topic_peacock)
model, index, index_sentence = BERT_Search_Engine(text_lvl_df = text_lvl_df, sentences_lvl_df = sentences_lvl_df)
query = reference_text_df["reference_text_cleaned"].iloc[0]

In [None]:
D, I = vector_search([query], model, index, num_results=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nText IDs: {I.flatten().tolist()}')

# Fetch the video id based on their index
rank_num = 1
print("Under the main topic: bird_peacock: \n")
for i in I.tolist()[0]:
  top_txt_result_df = text_lvl_df[text_lvl_df['id_txt']==i]['Title'].values[0]
  print("Rank{} : Video - {}".format(rank_num,top_txt_result_df))
  rank_num = rank_num + 1

In [None]:
## Fetch the 100 nearest sentences, then keep only those from the reference video "birds_intro_peacock_v"
D, I = vector_search([query], model, index_sentence, num_results=100)
print(f'L2 distance: {D.flatten().tolist()}\n\nSentence IDs: {I.flatten().tolist()}')


# Fetch the video id based on their index
rank_num = 1
print("Under the Video 'birds_intro_peacock_v': \n")
for i in I.tolist()[0]:
  top_sentence_result = sentences_lvl_df[sentences_lvl_df['index_sentence']==i]['final_corrected_version_sentences_txt'].values[0]
  top_sentence_video_result = sentences_lvl_df[sentences_lvl_df['index_sentence']==i]['Video_ID'].values[0]
  if top_sentence_video_result == "birds_intro_peacock_v":
    print("\nRank{} : Video - {}\nSentence:\n{}".format(rank_num,top_sentence_video_result,top_sentence_result ))
    rank_num = rank_num + 1


In [None]:
## Beetle

In [None]:

reference_text_df, text_lvl_df, sentences_lvl_df = BERT_Search_Engine_Preprocess(df_topic = df_topic_beetle)
model, index, index_sentence = BERT_Search_Engine(text_lvl_df = text_lvl_df, sentences_lvl_df = sentences_lvl_df)
query = reference_text_df["reference_text_cleaned"].iloc[0]

In [None]:
D, I = vector_search([query], model, index, num_results=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nText IDs: {I.flatten().tolist()}')

# Fetch the video id based on their index
rank_num = 1
print("Under the main topic: insects_beetle: \n")
for i in I.tolist()[0]:
  top_txt_result_df = text_lvl_df[text_lvl_df['id_txt']==i]['Title'].values[0]
  print("Rank{} : Video - {}".format(rank_num,top_txt_result_df))
  rank_num = rank_num + 1

In [None]:
## Fetch the 100 nearest sentences, then keep only those from the reference video "insects_intro_beetle_v"
D, I = vector_search([query], model, index_sentence, num_results=100)
print(f'L2 distance: {D.flatten().tolist()}\n\nSentence IDs: {I.flatten().tolist()}')


# Fetch the video id based on their index
rank_num = 1
print("Under the Video 'insects_intro_beetle_v': \n")
for i in I.tolist()[0]:
  top_sentence_result = sentences_lvl_df[sentences_lvl_df['index_sentence']==i]['final_corrected_version_sentences_txt'].values[0]
  top_sentence_video_result = sentences_lvl_df[sentences_lvl_df['index_sentence']==i]['Video_ID'].values[0]
  if top_sentence_video_result == "insects_intro_beetle_v":
    print("\nRank{} : Video - {}\nSentence:\n{}".format(rank_num,top_sentence_video_result,top_sentence_result ))
    rank_num = rank_num + 1

