<a href="https://colab.research.google.com/github/MWFK/NLP-from-Zero-to-Hero/blob/main/01.%20SentenceTransformers_PyTorch_Cosine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Objectives

Load a pretrained sentence-embedding model:

model = SentenceTransformer('stsb-roberta-large')

For other pretrained models, see:

https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0

Compare two embeddings with cosine similarity (other metrics can be used as well):

util.pytorch_cos_sim(embedding1, embedding2)


The main library that we are going to use to compute semantic similarity is SentenceTransformers, a simple library that provides an easy way to compute dense vector representations (i.e. embeddings) of texts. It ships with many state-of-the-art pretrained models that are fine-tuned for various applications. One of the primary tasks it supports is Semantic Textual Similarity, which is the one we will focus on in this post.

SentenceTransformers depends on PyTorch and Transformers. Installing it with pip normally pulls both in as dependencies, but they can also be installed explicitly, as done in the cells below.

After defining our model, we can compute the similarity score of two sentences. As discussed in the introduction, the approach is to use the model to encode the two sentences and then calculate the cosine similarity of the two resulting embeddings. The final result is the semantic similarity score.

In general, we can use different formulas to calculate the final similarity score (e.g. dot product, Jaccard, etc.), but in this case we use cosine similarity because it depends only on the angle between the two vectors, not on their magnitudes, and always lies between -1 and 1. The more important factor is the embeddings, which are produced by the model, so it is important to use a decent encoding model.
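
To make the difference between cosine similarity and the dot product concrete, here is a minimal sketch in plain Python. The vectors are made-up toy values rather than model output, and the helper names `dot` and `cosine_similarity` are illustrative only; they are not part of SentenceTransformers.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Toy 3-dimensional "embeddings" (assumed values, not produced by any model).
a = [1.0, 2.0, 3.0]
b = [2.0, 1.0, 3.0]
a_scaled = [10 * x for x in a]  # same direction as a, ten times longer

print("dot(a, b):          ", dot(a, b))
print("dot(a_scaled, b):   ", dot(a_scaled, b))
print("cosine(a, b):       ", cosine_similarity(a, b))
print("cosine(a_scaled, b):", cosine_similarity(a_scaled, b))
```

Scaling `a` multiplies the dot product by the same factor, while the cosine similarity stays the same up to floating-point rounding, since only the direction of the vectors matters.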

### Libs

In [None]:
!pip install sentence-transformers

In [None]:
!pip install transformers

In [None]:
!pip install torch

In [9]:
import numpy as np
from sentence_transformers import SentenceTransformer, util # util.pytorch_cos_sim(embedding_1, embedding_2) takes embeddings, not raw text

### Modeling two sentences

In [6]:
model = SentenceTransformer('stsb-roberta-large')

In [12]:
sentence1 = "I like Python because I can build AI applications"
sentence2 = "I like Python because I can do data analytics"

# encode sentences to get their embeddings
embedding1 = model.encode(sentence1, convert_to_tensor=True)
embedding2 = model.encode(sentence2, convert_to_tensor=True)

# compute similarity scores of two embeddings
cosine_scores = util.pytorch_cos_sim(embedding1, embedding2)

print("Sentence 1:", sentence1)
print("Sentence 2:", sentence2)
print("Similarity score:", cosine_scores.item())

Sentence 1: I like Python because I can build AI applications
Sentence 2: I like Python because I can do data analytics
Similarity score: 0.8015284538269043


### Modeling two lists

In [16]:
sentences1 = ["I like Python because I can build AI applications", "The cat sits on the ground"]
sentences2 = ["I like Python because I can do data analytics", "The cat walks on the sidewalk"]

# encode list of sentences to get their embeddings
embedding1 = model.encode(sentences1, convert_to_tensor=True)
embedding2 = model.encode(sentences2, convert_to_tensor=True)

# compute similarity scores of two embeddings
cosine_scores = util.pytorch_cos_sim(embedding1, embedding2)

for i in range(len(sentences1)):
    for j in range(len(sentences2)):
        print("########## Sentence ", i, " with Sentence ", j," ##########")
        print("Sentence 1:", sentences1[i])
        print("Sentence 2:", sentences2[j])
        print("Similarity Score:", cosine_scores[i][j].item(), "\n")


########## Sentence  0  with Sentence  0  ##########
Sentence 1: I like Python because I can build AI applications
Sentence 2: I like Python because I can do data analytics
Similarity Score: 0.8015284538269043 

########## Sentence  0  with Sentence  1  ##########
Sentence 1: I like Python because I can build AI applications
Sentence 2: The cat walks on the sidewalk
Similarity Score: -0.031109800562262535 

########## Sentence  1  with Sentence  0  ##########
Sentence 1: The cat sits on the ground
Sentence 2: I like Python because I can do data analytics
Similarity Score: 0.11328643560409546 

########## Sentence  1  with Sentence  1  ##########
Sentence 1: The cat sits on the ground
Sentence 2: The cat walks on the sidewalk
Similarity Score: 0.4038149118423462 
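
The output above is the full pairwise score matrix: row `i` corresponds to `sentences1[i]` and column `j` to `sentences2[j]`. One possible way to use it is to pick, for each sentence in the first list, the highest-scoring sentence in the second list. The sketch below does this in plain Python. The scores are copied from the output above and rounded to four decimals, and `best_matches` is an illustrative name, not something the library provides.

```python
sentences1 = ["I like Python because I can build AI applications", "The cat sits on the ground"]
sentences2 = ["I like Python because I can do data analytics", "The cat walks on the sidewalk"]

# Scores copied (rounded) from the output above:
# rows follow sentences1, columns follow sentences2.
scores = [
    [0.8015, -0.0311],
    [0.1133, 0.4038],
]

best_matches = []
for i, row in enumerate(scores):
    # Index of the highest score in this row.
    j = max(range(len(row)), key=row.__getitem__)
    best_matches.append((i, j, row[j]))
    print(f"{sentences1[i]!r} -> {sentences2[j]!r} (score {row[j]:.4f})")
```

With these scores, each sentence is paired with the one at the same index in the other list, which matches the intuition that the two Python sentences and the two cat sentences belong together.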

