# Install SentenceTransformers

https://www.sbert.net/index.html
Sentence Transformers is a Python framework for computing sentence embeddings. It also provides utilities for finding similar sentences. It is built on PyTorch and is easy to use.

In [None]:
!pip install sentence_transformers

# Find embedding for sentences

In [54]:
from sentence_transformers import SentenceTransformer

# We can use any of the pretrained models 
# https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models/
model = SentenceTransformer('all-MiniLM-L6-v2')

# Let us define the sentences for which we need to find the embeddings
sentence_list = ["The baby cried for milk",
                 "The car drove away.",
                 "Dog lives in kennel",
                 "The baby laughed",
                 "The kid was playing"]

sentence_embeddings = model.encode(sentence_list, convert_to_tensor=True)

# Let us print the embeddings to see what they look like
print(sentence_embeddings)

tensor([[ 0.0231,  0.0091, -0.0034,  ...,  0.0315,  0.1159,  0.0116],
        [ 0.0097,  0.1495,  0.0064,  ...,  0.0608,  0.0304,  0.0521],
        [ 0.0093, -0.0191,  0.0488,  ...,  0.0649,  0.0221,  0.0260],
        [ 0.0338,  0.0065, -0.0824,  ...,  0.0313,  0.1070,  0.0199],
        [-0.0116,  0.0609, -0.0174,  ...,  0.0740,  0.0490,  0.0823]])


## Analyze the size of embeddings

In [53]:
sentence_embeddings.shape

torch.Size([5, 384])

As we can see, there is one row per sentence, and each embedding vector produced by all-MiniLM-L6-v2 has 384 dimensions.

# We will use two approaches to find similar sentences
In Approach 1, we will use the util module from SentenceTransformer.
In Approach 2, we will use the Facebook AI Similarity Search (Faiss) library.

# Approach 1: Find similar sentences using the SentenceTransformer util module

In [None]:
SentenceTransformer provides utilities for finding similar sentences.

Suppose a user enters a query and we want to find the most similar sentences in the sentence list above.

In [58]:
user_query = ["The baby cried for food"]
user_query_embedding = model.encode(user_query)
user_query_embedding.shape

(1, 384)

In [None]:
Find the cosine similarity scores between the user query and the sentences.

In [59]:
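from sentence_transformers import util

# cos_sim returns a tensor with one row per query and one column per sentence
cosine_scores = util.cos_sim(user_query_embedding, sentence_embeddings)
# There is a single query, so take the first row as a plain Python list
cosine_scores = cosine_scores[0].tolist()
cosine_scores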

[0.8192180395126343,
 0.16249854862689972,
 0.0750042274594307,
 0.6280651688575745,
 0.30924367904663086]
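
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python for toy 3-dimensional vectors. `cosine_similarity` is a hypothetical helper written for illustration, not part of sentence_transformers, and the vectors are made up rather than real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, not real embeddings
query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # a scaled copy of query
orthogonal = [0.0, 0.0, 3.0]       # shares no component with query

print(cosine_similarity(query, same_direction))  # close to 1: same direction
print(cosine_similarity(query, orthogonal))      # 0: unrelated directions
```

Because the score depends only on direction, scaling a vector does not change it, which is why it is a common choice for comparing embeddings.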

### combine sentences and cosine_scores

In [60]:
# combine sentences & cosine_scores
sentence_and_scores = list(zip(sentence_list, cosine_scores))
sentence_and_scores

[('The baby cried for milk', 0.8192180395126343),
 ('The car drove away.', 0.16249854862689972),
 ('Dog lives in kennel', 0.0750042274594307),
 ('The baby laughed', 0.6280651688575745),
 ('The kid was playing', 0.30924367904663086)]

### A higher score means more similar; a lower score means less similar

In [61]:

# Sort by descending score, i.e. from most similar to least similar
sentence_and_scores = sorted(sentence_and_scores, key=lambda pair: pair[1], reverse=True)

print('user_query', user_query)

print('sentence_and_scores')
for pair in sentence_and_scores:
    print(pair)



user_query ['The baby cried for food']
sentence_and_scores
('The baby cried for milk', 0.8192180395126343)
('The baby laughed', 0.6280651688575745)
('The kid was playing', 0.30924367904663086)
('The car drove away.', 0.16249854862689972)
('Dog lives in kennel', 0.0750042274594307)


As we can see from above, the sentence 'The baby cried for milk' is the most similar to the user query and 'Dog lives in kennel' is the least similar.
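
When only the best few matches are needed, a full sort is not required. The following is a minimal sketch using the standard library's `heapq.nlargest` on the scores shown above (rounded to four decimals). The cutoff `top_k = 2` is an arbitrary value chosen for illustration.

```python
import heapq

# Sentences and scores copied from the output above, rounded to 4 decimals
sentence_and_scores = [
    ("The baby cried for milk", 0.8192),
    ("The car drove away.", 0.1625),
    ("Dog lives in kennel", 0.0750),
    ("The baby laughed", 0.6281),
    ("The kid was playing", 0.3092),
]

top_k = 2  # hypothetical cutoff, chosen for illustration

# nlargest returns the top_k pairs ordered from highest to lowest score
top_matches = heapq.nlargest(top_k, sentence_and_scores, key=lambda pair: pair[1])

for sentence, score in top_matches:
    print(f"{score:.4f}  {sentence}")
```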


# Approach 2: Find similar sentences using the Facebook AI Similarity Search (Faiss) library

In [None]:
https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/

## Install Faiss

In [67]:
!pip install faiss-cpu -qU



In [68]:
user_query = ["The baby cried for food"]
user_query_embedding = model.encode(user_query)
user_query_embedding.shape

(1, 384)

In [69]:
import faiss

## Create an index and add the sentence embedding vectors to it

In [70]:
dimension = sentence_embeddings.shape[1]  # length of each embedding vector (384 here)
sentence_index = faiss.IndexFlatL2(dimension)
sentence_index

<faiss.swigfaiss.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x2b2e49cc0> >

In [71]:
# Faiss works with float32 NumPy arrays, so convert the PyTorch tensor explicitly
sentence_index.add(sentence_embeddings.cpu().numpy())
sentence_index.ntotal

5

In [None]:
As we can see, all 5 sentence embeddings have been indexed.
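
IndexFlatL2 is an exhaustive index: it stores the vectors as they are and compares a query against every one of them. The sketch below illustrates that idea in plain Python on toy 2-dimensional vectors. `squared_l2` and `flat_l2_search` are hypothetical helpers written for this example and say nothing about how Faiss is implemented internally.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(stored_vectors, query, top_k):
    """Hypothetical helper: brute-force search over every stored vector.

    Returns (distances, indices) for the top_k nearest vectors, nearest first.
    """
    scored = sorted(
        (squared_l2(vec, query), idx) for idx, vec in enumerate(stored_vectors)
    )
    nearest = scored[:top_k]
    distances = [dist for dist, _ in nearest]
    indices = [idx for _, idx in nearest]
    return distances, indices

# Toy vectors, not real embeddings
stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, indices = flat_l2_search(stored, query=[2.0, 2.0], top_k=2)
print(distances, indices)
```

An exhaustive scan like this is exact but its cost grows linearly with the number of stored vectors, which is acceptable for five sentences.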

## Search the index for the vectors most similar to the user query

In [72]:
top_k = 5  # let us say we want the top 5 matches
# Search the index: D holds the distances, I holds the positions of the matching vectors
D, I = sentence_index.search(user_query_embedding, top_k)
print('Distances', D)
print('Index', I)

Distances [[0.36156428 0.74386954 1.3815129  1.6750029  1.8499917 ]]
Index [[0 3 4 1 2]]


The search returns the distances and indices sorted from most similar to least similar. IndexFlatL2 reports squared Euclidean (L2) distances, so a smaller distance means a closer match.
For example, index 0 with distance 0.36156428 is at the top of the list. This corresponds to the original sentence
"The baby cried for milk".
The next most similar one is index 3 with a distance of 0.74386954. This corresponds to the original sentence
"The baby laughed".
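
These distances appear to follow directly from the cosine scores in Approach 1. For unit-length vectors, the squared L2 distance equals 2 - 2 * cosine similarity. The sketch below checks that relation using only the numbers printed in this notebook. It assumes the embeddings are unit-length, which the numbers are consistent with but which is not stated above.

```python
# Cosine scores from Approach 1, in sentence_list order
cosine_scores = [0.8192180395126343, 0.16249854862689972, 0.0750042274594307,
                 0.6280651688575745, 0.30924367904663086]

# Faiss output from the search above: indices in ranked order with their distances
faiss_indices = [0, 3, 4, 1, 2]
faiss_distances = [0.36156428, 0.74386954, 1.3815129, 1.6750029, 1.8499917]

max_gap = 0.0
for idx, dist in zip(faiss_indices, faiss_distances):
    # Squared L2 distance predicted from the cosine score (unit-length assumption)
    predicted = 2.0 - 2.0 * cosine_scores[idx]
    max_gap = max(max_gap, abs(predicted - dist))
    print(idx, round(predicted, 4), round(dist, 4))

print('largest difference:', max_gap)
```

Under this assumption, ranking by increasing L2 distance and ranking by decreasing cosine similarity give the same order.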

In [35]:
type(I[0])

numpy.ndarray

### Combine sentences and index based on similarity index order provided by I

In [73]:
import pandas as pd
similarity_data_df = pd.DataFrame({'similarity_index': I[0]})
similarity_data_df


   similarity_index
0                 0
1                 3
2                 4
3                 1
4                 2


In [74]:

sentence_df = pd.DataFrame(sentence_list, columns= ['sentences'])
sentence_df

                 sentences
0  The baby cried for milk
1      The car drove away.
2      Dog lives in kennel
3         The baby laughed
4      The kid was playing


In [75]:
# Each value in similarity_index is a row position in sentence_df, so join on that column.
# The rows stay in ranked order: row 0 is the best match.
user_query_similar_sentence_df = similarity_data_df.join(sentence_df, on='similarity_index')

user_query_similar_sentence_df

   similarity_index                sentences
0                 0  The baby cried for milk
1                 3         The baby laughed
2                 4      The kid was playing
3                 1      The car drove away.
4                 2      Dog lives in kennel


As we can see from above, the sentence 'The baby cried for milk' is the most similar to the user query and 'Dog lives in kennel' is the least similar.
This matches the result we obtained with the SentenceTransformer util module.
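
The same lookup can also be done without pandas by indexing the sentence list directly. This is a minimal sketch in which a nested Python list stands in for the NumPy array `I` returned by the Faiss search. The values are copied from the output above.

```python
sentence_list = ["The baby cried for milk",
                 "The car drove away.",
                 "Dog lives in kennel",
                 "The baby laughed",
                 "The kid was playing"]

# Stand-in for the Faiss result: one row of ranked sentence positions per query
I = [[0, 3, 4, 1, 2]]

# Each entry of I[0] is a position in sentence_list, already in ranked order
ranked_sentences = [sentence_list[i] for i in I[0]]

for rank, sentence in enumerate(ranked_sentences, start=1):
    print(rank, sentence)
```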