<a href="https://colab.research.google.com/github/Data93/dbt-tutorial/blob/main/Semantic_Search_for_beginner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Semantic search is a technique for searching text data that uses meaning and context to match a search query with relevant results. Unlike traditional search algorithms that rely solely on keywords, **semantic search uses Natural Language Processing (NLP) and Machine Learning (ML)** to understand the intent behind the search query and the context of the text data. As a result, it can return relevant results even when the query and the text do not share the same keywords.
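To make the contrast concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are hand-made, hypothetical "embeddings" chosen only for illustration (a real model learns vectors with hundreds of dimensions), and `keyword_overlap` and `cosine_similarity` are illustrative helper names, not part of any library used in this notebook.

```python
import math

def keyword_overlap(query, document):
    """Count how many query words literally appear in the document."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    return len(query_words & document_words)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (close to 1.0 means similar direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = "prison escape film"
document = "Two imprisoned men bond over a number of years"

# Hypothetical, hand-made vectors standing in for model embeddings.
toy_embeddings = {
    query: [0.9, 0.1, 0.2],
    document: [0.8, 0.2, 0.1],
}

keyword_score = keyword_overlap(query, document)
semantic_score = cosine_similarity(toy_embeddings[query], toy_embeddings[document])

print("shared keywords:", keyword_score)
print("toy embedding similarity:", round(semantic_score, 3))
```

The query and the text share no words, so a purely keyword-based match finds nothing, while the toy vectors still point in a similar direction.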



In this notebook, we will explore the basics of semantic search. If you want to experiment with it yourself, make a copy of this notebook and edit it.

## Importing packages

In [2]:
# import packages
import pandas as pd
import numpy as np

## Installing datasets, evaluate, transformers and faiss (Facebook AI Similarity Search).

In [3]:
# install packages
!pip install faiss-gpu
!pip install datasets evaluate transformers[sentencepiece]



## Loading IMDB Movies Dataset

In [6]:
# read dataset
df = pd.read_csv('/content/imdb_top_1000.csv')
df.columns

Index(['Poster_Link', 'Series_Title', 'Released_Year', 'Certificate',
       'Runtime', 'Genre', 'IMDB_Rating', 'Overview', 'Meta_score', 'Director',
       'Star1', 'Star2', 'Star3', 'Star4', 'No_of_Votes', 'Gross'],
      dtype='object')

In [24]:
# keep only the columns we need
df = df[['Series_Title','Genre','Overview','Director']]
df.head()

| | Series_Title | Genre | Overview | Director |
|---|---|---|---|---|
| 0 | The Shawshank Redemption | Drama | Two imprisoned men bond over a number of years... | Frank Darabont |
| 1 | The Godfather | Crime, Drama | An organized crime dynasty's aging patriarch t... | Francis Ford Coppola |
| 2 | The Dark Knight | Action, Crime, Drama | When the menace known as the Joker wreaks havo... | Christopher Nolan |
| 3 | The Godfather: Part II | Crime, Drama | The early life and career of Vito Corleone in ... | Francis Ford Coppola |
| 4 | 12 Angry Men | Crime, Drama | A jury holdout attempts to prevent a miscarria... | Sidney Lumet |


## Converting the pandas DataFrame to a Hugging Face Dataset

In [9]:
from datasets import Dataset
movie_dataset = Dataset.from_pandas(df)
movie_dataset

Dataset({
    features: ['Series_Title', 'Genre', 'Overview', 'Director'],
    num_rows: 1000
})

In [10]:
def concatenate_text(data):
    # Join title, genre, overview and director into a single text field
    return {
        'text': data['Series_Title'] + '\n' + data['Genre'] + '\n' + data['Overview'] + '\n' + data['Director']
    }

movie_dataset = movie_dataset.map(concatenate_text)
movie_dataset


Dataset({
    features: ['Series_Title', 'Genre', 'Overview', 'Director', 'text'],
    num_rows: 1000
})

## Result of concatenation

In [11]:
movie_dataset['text'][0]

'The Shawshank Redemption\nDrama\nTwo imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.\nFrank Darabont'

## Importing Model and Tokenizer from HuggingFace

In [12]:
from transformers import AutoTokenizer, TFAutoModel
model_ckpt = 'sentence-transformers/multi-qa-mpnet-base-dot-v1'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = TFAutoModel.from_pretrained(model_ckpt, from_pt=True)



We need a single vector for each movie, so we need to pool or average the token embeddings in some way. One popular approach is CLS pooling, where we simply take the last hidden state of the special [CLS] token at the start of each sequence. The following functions do the trick for us:

In [13]:
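def cls_pooling(model_output):
    # Take the last hidden state at the first token position of every sequence
    return model_output.last_hidden_state[:, 0]

def get_embeddings(text_list):
    # Tokenize the input text(s), padding and truncating to a common length
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="tf"
    )
    # Convert the tokenizer output to a plain dict of tensors
    encoded_input = {k: v for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    # One embedding vector per input text
    return cls_pooling(model_output)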
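To illustrate what pooling does, the sketch below applies CLS pooling and, for comparison, mean pooling to a tiny hand-written list of token vectors. The numbers are hypothetical, and `cls_pool` and `mean_pool` are illustrative names. Mean pooling is shown only as a common alternative; it is not what this notebook uses.

```python
def cls_pool(token_embeddings):
    """Return the first token's vector (what last_hidden_state[:, 0] does per sequence)."""
    return token_embeddings[0]

def mean_pool(token_embeddings, attention_mask):
    """Average the vectors of real tokens, skipping padding positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Hypothetical hidden states for one sequence: 4 token positions x 2 dimensions.
# The last position is padding, as marked by the attention mask.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

cls_vector = cls_pool(hidden)
mean_vector = mean_pool(hidden, mask)

print("CLS pooling: ", cls_vector)
print("Mean pooling:", mean_vector)
```

Either way, a sequence of token vectors is reduced to one fixed-size vector, which is what the similarity search needs.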

## Inspecting the output

In [14]:
embedding = get_embeddings(movie_dataset['text'][0])
embedding

<tf.Tensor: shape=(1, 768), dtype=float32, numpy=
array([[-7.18399882e-03,  2.33954906e-01, -2.60172129e-01,
        -6.36960268e-02, -1.56177580e-01,  2.67546624e-02,
         9.96265858e-02,  1.24723703e-01, -2.56213173e-02,
        -1.04040280e-01, -5.99312484e-02,  2.91882604e-01,
        -6.50561452e-02, -1.22967556e-01, -2.43148118e-01,
        -2.25370079e-01,  2.35540569e-02,  3.25509429e-01,
         1.09136336e-01,  1.10371508e-01,  1.92035675e-01,
        -3.07481401e-02,  7.17380196e-02, -1.50342107e-01,
        -1.32835597e-01, -8.00719410e-02, -4.46838550e-02,
         3.91451940e-02,  1.96486056e-01,  4.71022986e-02,
         1.23555809e-01,  3.45926106e-01,  5.05600721e-02,
         4.23077554e-01, -9.63609491e-05, -3.80907714e-01,
         3.84394377e-02, -2.01974824e-01,  4.27673943e-02,
         6.32811412e-02, -1.92354351e-01,  2.51721978e-01,
         5.62862530e-02, -2.44347811e-01,  8.79958123e-02,
         4.80005085e-01,  1.22185543e-01,  2.33895555e-02,
      

## Now let's apply the function to the whole dataset.

In [15]:
embeddings_dataset = movie_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).numpy()[0]}
)




In [17]:
# inspect the dataset
embeddings_dataset

Dataset({
    features: ['Series_Title', 'Genre', 'Overview', 'Director', 'text', 'embeddings'],
    num_rows: 1000
})

In [18]:
# Inspect the first embedding vector

embeddings_dataset['embeddings'][0]

[-0.0071839988231658936,
 0.23395490646362305,
 -0.26017212867736816,
 -0.06369602680206299,
 -0.1561775803565979,
 0.02675466239452362,
 0.0996265858411789,
 0.12472370266914368,
 -0.025621317327022552,
 -0.10404027998447418,
 -0.05993124842643738,
 0.29188260436058044,
 -0.06505614519119263,
 -0.12296755611896515,
 -0.24314811825752258,
 -0.22537007927894592,
 0.023554056882858276,
 0.3255094289779663,
 0.10913633555173874,
 0.11037150770425797,
 0.19203567504882812,
 -0.030748140066862106,
 0.07173801958560944,
 -0.15034210681915283,
 -0.13283559679985046,
 -0.08007194101810455,
 -0.04468385502696037,
 0.03914519399404526,
 0.19648605585098267,
 0.047102298587560654,
 0.12355580925941467,
 0.34592610597610474,
 0.05056007206439972,
 0.4230775535106659,
 -9.636094910092652e-05,
 -0.38090771436691284,
 0.03843943774700165,
 -0.20197482407093048,
 0.042767394334077835,
 0.06328114122152328,
 -0.19235435128211975,
 0.25172197818756104,
 0.05628625303506851,
 -0.24434781074523926,
 0.087

## Using FAISS for efficient similarity search

In [19]:
embeddings_dataset.add_faiss_index(column="embeddings")


Dataset({
    features: ['Series_Title', 'Genre', 'Overview', 'Director', 'text', 'embeddings'],
    num_rows: 1000
})
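Conceptually, a nearest-neighbour search compares the query vector with every stored vector and keeps the closest ones. The sketch below does this by brute force in plain Python on a toy index. It assumes squared L2 (Euclidean) distance as the metric, which is an assumption about how the index above is configured; `squared_l2` and `nearest_examples` are illustrative names, not FAISS or `datasets` functions.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors (smaller means closer)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_examples(index_vectors, query, k):
    """Return the k (distance, row_id) pairs closest to the query, closest first."""
    scored = [(squared_l2(vec, query), row_id) for row_id, vec in enumerate(index_vectors)]
    scored.sort()
    return scored[:k]

# Hypothetical 2-dimensional "embeddings" for four rows.
toy_index = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 0.0]]
toy_query = [1.0, 0.5]

hits = nearest_examples(toy_index, toy_query, k=2)
for distance, row_id in hits:
    print(f"row {row_id}: distance {distance}")
```

A library such as FAISS exists because this exhaustive comparison becomes slow as the number of vectors grows; for 1000 movies, brute force would also work.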

## Testing

In [25]:
question = "Drama"
question_embedding = get_embeddings([question]).numpy()
question_embedding.shape

(1, 768)

In [26]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

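samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
# Sort by score in descending order. If the index uses L2 distance (assumed to be
# the default here), a smaller score means a closer match.
samples_df.sort_values("scores", ascending=False, inplace=True)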

samples_df

| | Series_Title | Genre | Overview | Director | text | embeddings | scores |
|---|---|---|---|---|---|---|---|
| 4 | The Wrestler | Drama, Sport | A faded professional wrestler must retire, but... | Darren Aronofsky | The Wrestler\nDrama, Sport\nA faded profession... | [0.0618869811296463, 0.05696062743663788, -0.1... | 38.84132 |
| 3 | I Am Sam | Drama | A mentally handicapped man fights for custody ... | Jessie Nelson | I Am Sam\nDrama\nA mentally handicapped man fi... | [0.16064363718032837, -0.17399075627326965, -0... | 38.419151 |
| 2 | Billy Elliot | Drama, Music | A talented young boy becomes torn between his ... | Stephen Daldry | Billy Elliot\nDrama, Music\nA talented young b... | [-0.09522397071123123, -0.2975098192691803, -0... | 37.304703 |
| 1 | Happiness | Comedy, Drama | The lives of several individuals intertwine as... | Todd Solondz | Happiness\nComedy, Drama\nThe lives of several... | [-0.07972872257232666, 0.013374537229537964, -... | 36.939465 |
| 0 | Detachment | Drama | A substitute teacher who drifts from classroom... | Tony Kaye | Detachment\nDrama\nA substitute teacher who dr... | [-0.035436052829027176, -0.3090512156486511, -... | 36.616508 |
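How to read the `scores` column depends on the metric of the index. The sketch below reuses the five titles and scores from the table above and orders them under the assumption that the scores are L2 distances, so that a smaller score means a closer match. If the index were built with an inner-product metric instead, the opposite ordering would apply.

```python
# Titles and scores copied from the results table above.
results = [
    ("The Wrestler", 38.84132),
    ("I Am Sam", 38.419151),
    ("Billy Elliot", 37.304703),
    ("Happiness", 36.939465),
    ("Detachment", 36.616508),
]

# Assumption: scores are distances, so the smallest score is the best match.
closest_first = sorted(results, key=lambda item: item[1])

for rank, (title, score) in enumerate(closest_first, start=1):
    print(f"{rank}. {title} (score {score})")
```

Under this assumption, the row with index 0, which the search returned first, is the closest match.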


## Result

In [27]:
for _, row in samples_df.iterrows():
    print(f"Series Title: {row.Series_Title}")
    print(f"Overview: {row.Overview}")
    print(f"Genre: {row.Genre}")
    print(f"Scores: {row.scores}")
    print("=" * 50)
    print()

Series Title: The Wrestler
Overview: A faded professional wrestler must retire, but finds his quest for a new life outside the ring a dispiriting struggle.
Genre: Drama, Sport
Scores: 38.8413200378418

Series Title: I Am Sam
Overview: A mentally handicapped man fights for custody of his 7-year-old daughter and in the process teaches his cold-hearted lawyer the value of love and family.
Genre: Drama
Scores: 38.419151306152344

Series Title: Billy Elliot
Overview: A talented young boy becomes torn between his unexpected love of dance and the disintegration of his family.
Genre: Drama, Music
Scores: 37.30470275878906

Series Title: Happiness
Overview: The lives of several individuals intertwine as they go about their lives in their own unique ways, engaging in acts society as a whole might find disturbing in a desperate search for human connection.
Genre: Comedy, Drama
Scores: 36.9394645690918

Series Title: Detachment
Overview: A substitute teacher who drifts from classroom to classroom 

**Conclusion:**

In this notebook, we experimented with semantic search. We learned what semantic search is and how it can be used to search text efficiently. Did you know that YouTube uses semantic search? It helps you get better search results.

**Activity:**

Try searching for a song on Spotify by its lyrics, then search for the same song on YouTube.


Credit : https://www.kaggle.com/code/neesham/semantic-search-for-beginners/notebook