# But what is semantic search?

Semantic search is a technique for searching text by meaning and context rather than by exact wording. Unlike traditional search algorithms that rely solely on keywords, semantic search uses Natural Language Processing (NLP) and Machine Learning (ML) to capture the intent behind the query and the context of the text data. This often gives more relevant results than keyword-based search, especially when the query and the matching text share meaning but not vocabulary.
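
To make the contrast concrete, here is a toy sketch. The three-dimensional vectors are made up for illustration (real embeddings come from a model and have hundreds of dimensions), and `keyword_match` and `cosine_similarity` are hypothetical helper names, not part of any library used later.

```python
import math

def keyword_match(query, document):
    """Return True if any query word appears verbatim in the document."""
    doc_words = set(document.lower().split())
    return any(word in doc_words for word in query.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = "superhero film"
document = "A masked vigilante protects Gotham City"
unrelated = "A recipe for apple pie"

# Hand-made vectors standing in for model embeddings (assumption, not real output).
toy_embeddings = {
    query: [0.9, 0.1, 0.2],
    document: [0.8, 0.2, 0.3],
    unrelated: [0.1, 0.9, 0.1],
}

# The query shares no words with the document, so keyword search misses it.
print(keyword_match(query, document))
# The toy vectors still place the query closer to the document than to the recipe.
print(cosine_similarity(toy_embeddings[query], toy_embeddings[document]))
print(cosine_similarity(toy_embeddings[query], toy_embeddings[unrelated]))
```

The rest of the notebook replaces the hand-made vectors with embeddings produced by a pretrained model.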

In this notebook, we will explore the basics of semantic search. If you want to experiment with it yourself, hit the Copy and Edit button.

*Upvote the notebook if you found it useful ❤️!*

# Importing Packages

In [1]:
import numpy as np 
import pandas as pd 

# Installing datasets, evaluate, transformers and faiss (Facebook AI Similarity Search).

In [2]:
!pip install faiss-gpu
!pip install datasets evaluate transformers[sentencepiece]

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.0

## Loading IMDB Movies Dataset

In [3]:
df = pd.read_csv("/kaggle/input/imdb-dataset-of-top-1000-movies-and-tv-shows/imdb_top_1000.csv")

df.columns

Index(['Poster_Link', 'Series_Title', 'Released_Year', 'Certificate',
       'Runtime', 'Genre', 'IMDB_Rating', 'Overview', 'Meta_score', 'Director',
       'Star1', 'Star2', 'Star3', 'Star4', 'No_of_Votes', 'Gross'],
      dtype='object')

In [4]:
# We only need Series_Title, Genre, Overview and Director for search purpose.

df = df[['Series_Title', 'Genre', 'Overview', 'Director']]

df.head()

| | Series_Title | Genre | Overview | Director |
|---|---|---|---|---|
| 0 | The Shawshank Redemption | Drama | Two imprisoned men bond over a number of years... | Frank Darabont |
| 1 | The Godfather | Crime, Drama | An organized crime dynasty's aging patriarch t... | Francis Ford Coppola |
| 2 | The Dark Knight | Action, Crime, Drama | When the menace known as the Joker wreaks havo... | Christopher Nolan |
| 3 | The Godfather: Part II | Crime, Drama | The early life and career of Vito Corleone in ... | Francis Ford Coppola |
| 4 | 12 Angry Men | Crime, Drama | A jury holdout attempts to prevent a miscarria... | Sidney Lumet |


## Converting the pandas DataFrame to a Hugging Face Dataset

In [5]:
from datasets import Dataset

movie_dataset = Dataset.from_pandas(df)

movie_dataset

Dataset({
    features: ['Series_Title', 'Genre', 'Overview', 'Director'],
    num_rows: 1000
})

In [6]:
# Concatenate all the text fields so that we can build a single embedding vector from all the relevant data.

def concatenate_text(data):
    
    return {"text": data['Series_Title'] + '\n' + data['Genre'] + '\n' + data['Overview'] + '\n' + data['Director']}


movie_dataset = movie_dataset.map(concatenate_text)

movie_dataset

Dataset({
    features: ['Series_Title', 'Genre', 'Overview', 'Director', 'text'],
    num_rows: 1000
})

### Result of concatenation

In [7]:
movie_dataset['text'][0]

'The Shawshank Redemption\nDrama\nTwo imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.\nFrank Darabont'
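
The concatenation above assumes every field is a string. Below is a minimal sketch of a variant that tolerates missing values, assuming a missing field arrives as `None` or an empty string. `concatenate_text_safe` and `FIELDS` are hypothetical names, not part of the notebook.

```python
FIELDS = ["Series_Title", "Genre", "Overview", "Director"]

def concatenate_text_safe(data):
    """Join the text fields with newlines, skipping fields that are None or empty."""
    parts = [str(data[name]) for name in FIELDS if data[name]]
    return {"text": "\n".join(parts)}

example = {
    "Series_Title": "The Shawshank Redemption",
    "Genre": "Drama",
    "Overview": None,  # simulated missing value
    "Director": "Frank Darabont",
}

print(concatenate_text_safe(example)["text"])
```

Because it takes one row and returns a dict with a `text` key, it could be passed to `movie_dataset.map` in the same way as `concatenate_text`.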

## Importing Model and Tokenizer from HuggingFace

In [8]:
from transformers import AutoTokenizer, TFAutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = TFAutoModel.from_pretrained(model_ckpt, from_pt=True)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFMPNetModel: ['embeddings.position_ids']
- This IS expected if you are initializing TFMPNetModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMPNetModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFMPNetModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMPNetModel for predictions without further training.


> We need a single vector for each movie, so the per-token embeddings have to be pooled into one. One popular approach is CLS pooling, where we simply take the last hidden state of the special [CLS] token, which sits at position 0 of every sequence. The following function does the trick for us:

In [9]:
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="tf"
    )
    encoded_input = {k: v for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)
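
To see what the slice `[:, 0]` does, here is a small NumPy sketch on a fake hidden-state array. It also shows mean pooling, an alternative that some sentence-embedding models use, purely for contrast. Which pooling suits a given checkpoint depends on how it was trained. `cls_pool` and `mean_pool` are illustrative names, and the numbers are made up.

```python
import numpy as np

def cls_pool(last_hidden_state):
    """Take the vector of the first token of each sequence."""
    return last_hidden_state[:, 0]

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    mask = attention_mask[:, :, None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Fake batch: 2 sequences, 3 token positions, hidden size 2.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [0.0, 0.0]],  # last position is padding
])
mask = np.array([[1, 1, 1], [1, 1, 0]])

print(cls_pool(hidden))         # one vector per sequence, shape (2, 2)
print(mean_pool(hidden, mask))  # one vector per sequence, shape (2, 2)
```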

## Debugging the Output

In [10]:
#embedding = get_embeddings(movie_dataset['text'][0])

#embedding

# Now let's apply the function to the whole dataset.

### This will take some time so be patient 🙃.

In [11]:
embeddings_dataset = movie_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).numpy()[0]}
)

In [12]:
# Debugging

embeddings_dataset

Dataset({
    features: ['Series_Title', 'Genre', 'Overview', 'Director', 'text', 'embeddings'],
    num_rows: 1000
})
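
Embedding one row at a time, as in the `map` call above, runs the model 1000 times. A common way to speed this up is to embed several texts per call. The sketch below shows only the batching logic, with a stand-in `fake_embed` function instead of the real model. Both `embed_in_batches` and `fake_embed` are hypothetical helpers.

```python
def embed_in_batches(texts, embed_fn, batch_size=32):
    """Call embed_fn on slices of texts and collect one vector per text."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))
    return vectors

def fake_embed(batch):
    """Stand-in for a real model: maps each text to a 2-dimensional vector."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]

texts = ["The Godfather", "Joker", "12 Angry Men"]
vectors = embed_in_batches(texts, fake_embed, batch_size=2)

print(len(vectors), vectors[0])
```

In the notebook, the same idea would probably be written as `movie_dataset.map(lambda batch: {"embeddings": get_embeddings(batch["text"]).numpy()}, batched=True, batch_size=32)`. The batch size here is an untested assumption and would need tuning to the available memory.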

In [13]:
# Uncomment to debug the output

#embeddings_dataset['embeddings'][0]

## Using FAISS for efficient similarity search

In [14]:
embeddings_dataset.add_faiss_index(column="embeddings")

Dataset({
    features: ['Series_Title', 'Genre', 'Overview', 'Director', 'text', 'embeddings'],
    num_rows: 1000
})
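
Conceptually, the index answers one question: which stored vectors are nearest to a query vector? The NumPy sketch below does this by brute force for two common metrics, on made-up vectors, and shows that the ranking can differ between them. As far as I can tell, `add_faiss_index` with no extra arguments builds a flat L2 index, in which case a smaller score means a closer match. Treat that as an assumption and check the `metric_type` parameter. The helper names are illustrative.

```python
import numpy as np

def nearest_by_l2(query, vectors, k):
    """Return (distances, indices) of the k vectors with the smallest squared L2 distance."""
    distances = ((vectors - query) ** 2).sum(axis=1)
    order = np.argsort(distances)[:k]
    return distances[order], order

def nearest_by_dot(query, vectors, k):
    """Return (scores, indices) of the k vectors with the largest dot product."""
    scores = vectors @ query
    order = np.argsort(-scores)[:k]
    return scores[order], order

vectors = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
query = np.array([1.0, 0.2])

print(nearest_by_l2(query, vectors, k=2))   # closest first, smaller is better
print(nearest_by_dot(query, vectors, k=2))  # best first, larger is better
```

The long vector `[3.0, 3.0]` wins under the dot product but is the farthest under L2, so the metric matters when reading the scores returned by a search.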

## Testing

In [15]:
question = "Batman"
question_embedding = get_embeddings([question]).numpy()
question_embedding.shape

(1, 768)

In [16]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

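samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
# Note: assuming the default FAISS index uses L2 distance, a lower score means a closer match,
# so this descending sort lists the closest of the k results last.
samples_df.sort_values("scores", ascending=False, inplace=True)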

samples_df

| | Series_Title | Genre | Overview | Director | text | embeddings | scores |
|---|---|---|---|---|---|---|---|
| 4 | Joker | Crime, Drama, Thriller | In Gotham City, mentally troubled comedian Art... | Todd Phillips | Joker\nCrime, Drama, Thriller\nIn Gotham City,... | [0.20968076586723328, -0.3021985590457916, -0.... | 36.938065 |
| 3 | The Dark Knight | Action, Crime, Drama | When the menace known as the Joker wreaks havo... | Christopher Nolan | The Dark Knight\nAction, Crime, Drama\nWhen th... | [0.12353729456663132, 0.12746326625347137, -0.... | 30.461723 |
| 2 | Batman Begins | Action, Adventure | After training with his mentor, Batman begins ... | Christopher Nolan | Batman Begins\nAction, Adventure\nAfter traini... | [-0.11327164620161057, 0.41753411293029785, -0... | 30.283798 |
| 1 | Batman: Mask of the Phantasm | Animation, Action, Crime | Batman is wrongly implicated in a series of mu... | Kevin Altieri | Batman: Mask of the Phantasm\nAnimation, Actio... | [-0.0030361339449882507, -0.05288544297218323,... | 27.874849 |
| 0 | The Dark Knight Rises | Action, Adventure | Eight years after the Joker's reign of anarchy... | Christopher Nolan | The Dark Knight Rises\nAction, Adventure\nEigh... | [0.037607744336128235, 0.4416905641555786, -0.... | 27.826258 |


In [17]:
for _, row in samples_df.iterrows():
    print(f"Series Title: {row.Series_Title}")
    print(f"Overview: {row.Overview}")
    print(f"Genre: {row.Genre}")
    print(f"Scores: {row.scores}")
    print("=" * 50)
    print()

Series Title: Joker
Overview: In Gotham City, mentally troubled comedian Arthur Fleck is disregarded and mistreated by society. He then embarks on a downward spiral of revolution and bloody crime. This path brings him face-to-face with his alter-ego: the Joker.
Genre: Crime, Drama, Thriller
Scores: 36.93806457519531

Series Title: The Dark Knight
Overview: When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.
Genre: Action, Crime, Drama
Scores: 30.46172332763672

Series Title: Batman Begins
Overview: After training with his mentor, Batman begins his fight to free crime-ridden Gotham City from corruption.
Genre: Action, Adventure
Scores: 30.283798217773438

Series Title: Batman: Mask of the Phantasm
Overview: Batman is wrongly implicated in a series of murders of mob bosses actually done by a new vigilante assassin.
Genre: Animation, Action, Crime
Score