## Semantic Similarity Using TF-IDF + BERT (Pooled)
#### I used the SNLI (Stanford Natural Language Inference) Corpus to score sentence semantic similarity with TF-IDF and a Transformer model (BERT).


### Installing and importing libraries

In [1]:
pip install pandas

Collecting pandas
  Using cached pandas-2.0.3-cp38-cp38-win_amd64.whl (10.8 MB)
Collecting numpy>=1.20.3
  Using cached numpy-1.24.4-cp38-cp38-win_amd64.whl (14.9 MB)
Collecting tzdata>=2022.1
  Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Installing collected packages: tzdata, numpy, pandas
Successfully installed numpy-1.24.4 pandas-2.0.3 tzdata-2023.3
Note: you may need to restart the kernel to use updated packages.



In [2]:
pip install scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.3.2-cp38-cp38-win_amd64.whl (9.3 MB)
Collecting scipy>=1.5.0
  Using cached scipy-1.10.1-cp38-cp38-win_amd64.whl (42.2 MB)
Collecting threadpoolctl>=2.0.0
  Using cached threadpoolctl-3.2.0-py3-none-any.whl (15 kB)
Collecting joblib>=1.1.1
  Using cached joblib-1.3.2-py3-none-any.whl (302 kB)
Installing collected packages: threadpoolctl, scipy, joblib, scikit-learn
Successfully installed joblib-1.3.2 scikit-learn-1.3.2 scipy-1.10.1 threadpoolctl-3.2.0
Note: you may need to restart the kernel to use updated packages.



In [3]:
pip install transformers

Collecting transformers
  Downloading transformers-4.34.1-py3-none-any.whl (7.7 MB)
Collecting regex!=2019.12.17
  Using cached regex-2023.10.3-cp38-cp38-win_amd64.whl (269 kB)
Collecting huggingface-hub<1.0,>=0.16.4
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting filelock
  Using cached filelock-3.12.4-py3-none-any.whl (11 kB)
Collecting tqdm>=4.27
  Using cached tqdm-4.66.1-py3-none-any.whl (78 kB)
Collecting safetensors>=0.3.1
  Downloading safetensors-0.4.0-cp38-none-win_amd64.whl (277 kB)
Collecting tokenizers<0.15,>=0.14
  Downloading tokenizers-0.14.1-cp38-none-win_amd64.whl (2.2 MB)
Collecting fsspec>=2023.5.0
  Downloading fsspec-2023.10.0-py3-none-any.whl (166 kB)
Collecting huggingface-hub<1.0,>=0.16.4
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Installing collected packages: tqdm, fsspec, filelock, huggingface-hub, tokenizers, safetensors, regex, transformers
Successfully installed filelock-3.12.4 fsspec-2023.10.0 huggingface-hub


In [5]:
pip install torch

Collecting torch
  Using cached torch-2.1.0-cp38-cp38-win_amd64.whl (192.3 MB)
Collecting networkx
  Using cached networkx-3.1-py3-none-any.whl (2.1 MB)
Collecting sympy
  Using cached sympy-1.12-py3-none-any.whl (5.7 MB)
Collecting mpmath>=0.19
  Using cached mpmath-1.3.0-py3-none-any.whl (536 kB)
Installing collected packages: mpmath, sympy, networkx, torch
Successfully installed mpmath-1.3.0 networkx-3.1 sympy-1.12 torch-2.1.0
Note: you may need to restart the kernel to use updated packages.



In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

### Download the SNLI data

In [1]:
!curl -LO https://raw.githubusercontent.com/MohamadMerchant/SNLI/master/data.tar.gz
!tar -xvzf data.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0 11.1M    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
 48 11.1M   48 5505k    0     0  5505k      0  0:00:02  0:00:01  0:00:01 2985k
 98 11.1M   98 11.0M    0     0  5656k      0  0:00:02  0:00:02 --:--:-- 3977k
100 11.1M  100 11.1M    0     0  5728k      0  0:00:02  0:00:02 --:--:-- 3984k
x SNLI_Corpus/
x SNLI_Corpus/snli_1.0_dev.csv
x SNLI_Corpus/snli_1.0_train.csv
x SNLI_Corpus/snli_1.0_test.csv


#### Reading the SNLI dataset


In [2]:
# The archive extracts into SNLI_Corpus/; load only the first 1000 training rows.
snli_df = pd.read_csv('SNLI_Corpus/snli_1.0_train.csv', nrows=1000)
snli_df

|  | similarity | sentence1 | sentence2 |
|---|---|---|---|
| 0 | neutral | A person on a horse jumps over a broken down a... | A person is training his horse for a competition. |
| 1 | contradiction | A person on a horse jumps over a broken down a... | A person is at a diner, ordering an omelette. |
| 2 | entailment | A person on a horse jumps over a broken down a... | A person is outdoors, on a horse. |
| 3 | neutral | Children smiling and waving at camera | They are smiling at their parents |
| 4 | entailment | Children smiling and waving at camera | There are children present |
| ... | ... | ... | ... |
| 995 | entailment | A man on a street in a bright t-shirt holds so... | A man is showing a woman something |
| 996 | neutral | A child with a yellow cup and milk all over hi... | The child spilled his milk. |
| 997 | contradiction | A child with a yellow cup and milk all over hi... | The child has a clean face. |
| 998 | entailment | A child with a yellow cup and milk all over hi... | The child had milk all over his face. |


In [3]:
print(f"Total train samples : {snli_df.shape[0]}")

Total train samples : 1000


### Dataset overview:

##### * sentence1: The premise caption that was supplied to the author of the pair.
##### * sentence2: The hypothesis caption that was written by the author of the pair.
##### * similarity: This is the label chosen by the majority of annotators.
##### Note: Where no majority exists, the label "-" is used (we will skip such samples here).


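##### Here are the "similarity" label values in our dataset:
##### * Contradiction: The two sentences contradict each other, so they are treated as dissimilar.
##### * Entailment: The second sentence follows from the first, so they are treated as similar in meaning.
##### * Neutral: The second sentence neither follows from nor contradicts the first.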





In [4]:
print(f"Sentence1: {snli_df.loc[1, 'sentence1']}")
print(f"Sentence2: {snli_df.loc[1, 'sentence2']}")
print(f"Similarity: {snli_df.loc[1, 'similarity']}")


Sentence1: A person on a horse jumps over a broken down airplane.
Sentence2: A person is at a diner, ordering an omelette.
Similarity: contradiction


#### Preprocessing

In [5]:
# Check for NaN entries in the training data and drop any rows that contain them.
print("Number of missing values")
print(snli_df.isnull().sum())
snli_df.dropna(axis=0, inplace=True)

Number of missing values
similarity    0
sentence1     0
sentence2     0
dtype: int64


Distribution of our training targets.


In [6]:
print("Train Target Distribution")
print(snli_df.similarity.value_counts())

Train Target Distribution
similarity
entailment       334
neutral          332
contradiction    332
-                  2
Name: count, dtype: int64


The value "-" appears in two of our training targets. We will skip these samples.


In [7]:
snli_df = (
    snli_df[snli_df.similarity != "-"]
    .sample(frac=1.0, random_state=42)
    .reset_index(drop=True)
)

In [8]:
snli_df

|  | similarity | sentence1 | sentence2 |
|---|---|---|---|
| 0 | neutral | A woman in capri jeans crouches on the edge of... | A woman is very eager to touch the water |
| 1 | entailment | Woman at Walmart check-out having her grocerie... | A woman is in Walmart |
| 2 | neutral | People on bicycles waiting at an intersection. | People are on mountain bikes. |
| 3 | contradiction | A man and a woman are standing next to sculptu... | Three people are looking at painting at a scho... |
| 4 | entailment | People in orange vests and blue pants with a y... | the runners waited to start the race |
| ... | ... | ... | ... |
| 993 | entailment | A skier slides along a metal rail. | A skier is near the rail. |
| 994 | entailment | An Asian woman in a blue top and green headsca... | An Asian woman is smiling at while another lad... |
| 995 | contradiction | Two people wearing blue clothing are making ha... | A man is sitting with his hands in his pockets. |
| 996 | neutral | A man in a gold foils skirt, sitting at a comp... | he is covering up his face |


#### Sample search query


In [9]:
search_query = "A person is walking on a beach."

#### Define the chunk size 


In [10]:
chunk_size = 300

#### Initialize TF-IDF vectorizer


In [11]:
tfidf_vectorizer = TfidfVectorizer()
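
To illustrate what the TF-IDF component contributes to the final score, here is a minimal sketch on a hypothetical three-sentence corpus (the sentences and the `toy_*` names are illustrative and not taken from SNLI). It shows that the TF-IDF cosine similarity is exactly zero when the query shares no vocabulary term with a document, which is why some pairs in the score list further down have a cosine value of 0.0.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, for illustration only.
toy_corpus = [
    "a dog runs on the beach",
    "a man is cooking dinner",
    "children play football outside",
]
toy_query = "a person is walking on a beach"

toy_vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the corpus.
# Note: the default tokenizer ignores single-character tokens such as "a".
toy_doc_vectors = toy_vectorizer.fit_transform(toy_corpus)
# Reuse the fitted vocabulary; unseen words ("person", "walking") are dropped.
toy_query_vector = toy_vectorizer.transform([toy_query])

# One row of similarities: the query against each document.
toy_scores = cosine_similarity(toy_query_vector, toy_doc_vectors)[0]
for sentence, score in zip(toy_corpus, toy_scores):
    print(f"{score:.3f}  {sentence}")
```

The third sentence has no word in common with the query, so its score is 0.0; the first shares "on" and "beach" and ranks highest.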

#### Function to get BERT (pooled) embeddings


In [12]:
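def get_bert_pooled_embedding(text):
    # Tokenize the text, truncating it to the model's maximum input length.
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    # `tokenizer` and `model` are globals; they must be initialized before this function is called.
    outputs = model(**inputs)
    # pooler_output: the [CLS] hidden state passed through a dense layer and tanh, shape (1, 768).
    pooled_embedding = outputs.pooler_output
    return pooled_embedding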
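
For context, `pooler_output` in `BertModel` is the final hidden state of the first (`[CLS]`) token passed through a dense layer and a tanh activation. The sketch below mimics only that last step with NumPy. The arrays `cls_hidden`, `W`, and `b` are random stand-ins chosen for illustration, not the pretrained parameters, so the numbers themselves carry no meaning; the point is the shape and value range of the result.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 768  # hidden size of bert-base-uncased

# Stand-in values: random numbers, NOT the pretrained BERT parameters.
cls_hidden = rng.normal(size=(1, hidden_size))                # final hidden state of [CLS]
W = rng.normal(scale=0.02, size=(hidden_size, hidden_size))   # pooler dense weights
b = np.zeros(hidden_size)                                     # pooler dense bias

# Pooling step: dense layer followed by tanh.
pooled = np.tanh(cls_hidden @ W + b)

print(pooled.shape)                # (1, 768)
print(np.abs(pooled).max() < 1)    # True: every component lies inside (-1, 1)
```

Because each component is bounded by tanh, the Euclidean distance between two pooled vectors of this size cannot exceed 2·sqrt(768) ≈ 55.4.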

#### Initialize BERT tokenizer and model


In [13]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

#### Calculate the number of chunks needed


In [14]:
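# Ceiling division: avoids an empty trailing chunk when len(snli_df) is a multiple of chunk_size.
num_chunks = (len(snli_df) + chunk_size - 1) // chunk_size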
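
Chunk counts are easy to get wrong by one. A count computed as `n // chunk_size + 1` gives one chunk too many whenever `n` is an exact multiple of `chunk_size`, and fitting a TF-IDF vectorizer on the resulting empty slice would fail. The sketch below uses a hypothetical helper, `chunk_bounds`, which is not part of this notebook, to show the slice boundaries produced by ceiling division for the 998 rows used here and for a 900-row case.

```python
def chunk_bounds(n_rows, chunk_size):
    """Return (start, end) index pairs that cover n_rows without empty chunks."""
    n_chunks = -(-n_rows // chunk_size)  # ceiling division
    return [
        (i * chunk_size, min((i + 1) * chunk_size, n_rows))
        for i in range(n_chunks)
    ]

print(chunk_bounds(998, 300))  # [(0, 300), (300, 600), (600, 900), (900, 998)]
print(chunk_bounds(900, 300))  # [(0, 300), (300, 600), (600, 900)]
print(900 // 300 + 1)          # 4, one more than the 3 chunks actually needed
```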

#### Initialize an empty list to store similarity scores


In [15]:
similarity_scores = []

In [20]:
print(type(query_tfidf))
print(type(query_bert_embedding))

<class 'scipy.sparse._csr.csr_matrix'>
<class 'torch.Tensor'>


In [21]:
def weighted_average_similarity(cosine_sim, euclidean_dist, weight=0.5):
    return weight * cosine_sim + (1 - weight) * (1 / (1 + euclidean_dist))
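
The combined score is `weight * cosine_sim + (1 - weight) / (1 + euclidean_dist)`. The second term turns a distance in [0, ∞) into a similarity in (0, 1], so a smaller BERT distance raises the score. As a worked check, the sketch below repeats the function and plugs in the cosine similarity and Euclidean distance from the first entry of the `similarity_scores` output shown in this notebook.

```python
def weighted_average_similarity(cosine_sim, euclidean_dist, weight=0.5):
    return weight * cosine_sim + (1 - weight) * (1 / (1 + euclidean_dist))

# Values copied from the first entry of the similarity_scores output.
cosine_sim = 0.04006722746106942
euclidean_dist = 8.892938

distance_term = 1 / (1 + euclidean_dist)   # distance mapped to a (0, 1] similarity
score = weighted_average_similarity(cosine_sim, euclidean_dist)

print(round(distance_term, 4))  # 0.1011
print(round(score, 4))          # 0.0706
```

The result agrees with the weighted similarity reported for that entry (about 0.0706). With the default `weight=0.5`, both components count equally, so a pair with zero lexical overlap can still rank well if its BERT distance is small.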

#### Process data in chunks


In [22]:
for i in range(num_chunks):
    chunk_start = i * chunk_size
    chunk_end = (i + 1) * chunk_size
    
    # Extract sentence pairs and labels from the dataset chunk
    chunk_data = snli_df[chunk_start:chunk_end]
    chunk_sentence_pairs = chunk_data[['sentence1', 'sentence2']].values.tolist()
    chunk_labels = chunk_data['similarity'].values

    # Compute TF-IDF vectors for the chunk
    tfidf_vectors = tfidf_vectorizer.fit_transform([' '.join(pair) for pair in chunk_sentence_pairs])

    # Compute BERT (pooled) embeddings for the chunk
    bert_embeddings = [get_bert_pooled_embedding(' '.join(pair)) for pair in chunk_sentence_pairs]

    # Preprocess the search query
    query_tfidf = tfidf_vectorizer.transform([search_query])
    query_bert_embedding = get_bert_pooled_embedding(search_query)

    # Calculate similarity scores for the chunk
    for j, pair in enumerate(chunk_sentence_pairs):
        cosine_sim = cosine_similarity(query_tfidf.toarray(), tfidf_vectors[j])
        euclidean_dist = euclidean_distances(query_bert_embedding.detach().numpy(), bert_embeddings[j].detach().numpy())
        weighted_sim = weighted_average_similarity(cosine_sim[0][0], euclidean_dist[0][0])
        similarity_scores.append((j + chunk_start, cosine_sim[0][0], euclidean_dist[0][0], weighted_sim, chunk_labels[j]))
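
One property of the loop above is that `fit_transform` is called once per chunk, so each chunk gets its own vocabulary and IDF weights, and TF-IDF cosine scores from different chunks are not directly comparable. A possible alternative, sketched here on hypothetical toy data (the `pairs` list and all variable names are illustrative), is to fit the vectorizer once on the full corpus and only `transform` each chunk. This is a sketch of one option, not a drop-in replacement for the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the joined sentence pairs of the whole dataset.
pairs = [
    "a dog on the beach a dog is outside",
    "a man cooks food a person is cooking",
    "people walking on a beach people are outdoors",
    "a child drinks milk the child is asleep",
]
query = "a person is walking on a beach"
chunk_size = 2

vectorizer = TfidfVectorizer()
vectorizer.fit(pairs)                       # one shared vocabulary and IDF table
query_vec = vectorizer.transform([query])   # computed once, reused for every chunk

scores = []
for start in range(0, len(pairs), chunk_size):
    chunk_vecs = vectorizer.transform(pairs[start:start + chunk_size])
    # One call scores the query against every row of the chunk.
    chunk_scores = cosine_similarity(query_vec, chunk_vecs)[0]
    scores.extend((start + j, s) for j, s in enumerate(chunk_scores))

best_index, best_score = max(scores, key=lambda item: item[1])
print(best_index, pairs[best_index])
```

The same idea applies to the query's BERT embedding, which does not depend on the chunk and could be computed once before the loop.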


In [43]:
similarity_scores

[(0, 0.04006722746106942, 8.892938, 0.07057471862505582, 'neutral'),
 (1, 0.021418652141788494, 5.7490416, 0.08479391952821155, 'entailment'),
 (2, 0.09605325854230122, 2.427765, 0.19389427646090143, 'neutral'),
 (3, 0.0, 5.0432715, 0.08273664298526574, 'contradiction'),
 (4, 0.0, 10.756921, 0.04252814218011401, 'entailment'),
 (5, 0.0, 3.4979722, 0.11116120158404627, 'neutral'),
 (6, 0.0, 4.8742256, 0.08511760232691493, 'entailment'),
 (7, 0.021865280925434716, 4.093399, 0.10909891318209207, 'neutral'),
 (8, 0.01792310684760326, 2.9625814, 0.13514192677631798, 'contradiction'),
 (9, 0.062309996940060494, 2.936513, 0.15817096583032222, 'entailment'),
 (10, 0.04324958708445728, 3.5824518, 0.1307366881347708, 'neutral'),
 (11, 0.03203408693466361, 4.5985446, 0.10532596854261714, 'entailment'),
 (12, 0.0, 3.9483624, 0.10104353007882076, 'entailment'),
 (13, 0.09232822939146387, 3.2084446, 0.16497285475946366, 'entailment'),
 (14, 0.02343009991348863, 4.5757327, 0.10138936294607423, 'entai

#### Rank and retrieve the top 10 related sentence pairs


In [37]:
top_10_reviews = sorted(similarity_scores, key=lambda x: x[3], reverse=True)[:10]
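
Sorting the full list and slicing works well at this size. For much longer score lists, `heapq.nlargest` from the standard library returns the same top entries without sorting everything. The sketch below uses made-up tuples in the same `(index, cosine, distance, weighted, label)` layout as `similarity_scores`; the numbers are hypothetical.

```python
import heapq

# Hypothetical tuples in the same layout as similarity_scores.
example_scores = [
    (0, 0.04, 8.9, 0.07, "neutral"),
    (1, 0.02, 5.7, 0.08, "entailment"),
    (2, 0.10, 2.4, 0.19, "neutral"),
    (3, 0.00, 5.0, 0.06, "contradiction"),
]

# Same result as sorted(example_scores, key=lambda x: x[3], reverse=True)[:2].
top_2 = heapq.nlargest(2, example_scores, key=lambda x: x[3])
print([row[0] for row in top_2])  # [2, 1]
```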

In [38]:
sentence_pairs = snli_df[['sentence1', 'sentence2']].values.tolist()

#### Display the similarity report with labels


In [41]:
print(f"Search query: {search_query}")
print()
for index, cosine_sim, euclidean_dist, weighted_sim, label in top_10_reviews:
    print(f"Sentence 1: {sentence_pairs[index][0]}")
    print(f"Sentence 2: {sentence_pairs[index][1]}")
    print(f"Label: {label}")
    print(f"Cosine Similarity: {cosine_sim}")
    print(f"Euclidean Distance: {euclidean_dist}")
    print(f"Weighted Similarity: {weighted_sim}")
    print()


Search query: A person is walking on a beach.

Sentence 1: A person on a horse jumps over a broken down airplane.
Sentence 2: A person is at a diner, ordering an omelette.
Label: contradiction
Cosine Similarity: 0.3852066660262905
Euclidean Distance: 1.995300054550171
Weighted Similarity: 0.35953151746680234

Sentence 1: A man, woman, and child enjoying themselves on a beach.
Sentence 2: A family of three is at the beach.
Label: entailment
Cosine Similarity: 0.37232723081343044
Euclidean Distance: 2.7458860874176025
Weighted Similarity: 0.31964338183087193

Sentence 1: A small white dog running on a pebble covered beach.
Sentence 2: A dog on the beach.
Label: entailment
Cosine Similarity: 0.4363744143304071
Euclidean Distance: 4.857994079589844
Weighted Similarity: 0.3035406563504933

Sentence 1: A woman on the side of a street is making food on her cart.
Sentence 2: A person is cooking.
Label: entailment
Cosine Similarity: 0.30726884434619123
Euclidean Distance: 2.342668056488037
Weig