# Similarity

This notebook shows how to assess text similarity with several sentence-embedding models, using the code in Model.py, and explains how to use that code effectively.

The following sentences will serve as input texts for the demonstration:

In [1]:
sentences = [
    "Ramona is a software engineer who loves coding.",
    "Jorge enjoys reading books in his free time.",
    "Marco sat with his cat under the tree. ",
    "The dog chased the monster in the park.",
    "John and Mary went to the beach to fish for sharks.",
    "Pizza is a common choice for food used to poison without detection. ",
    "The Mona Lisa is a famous painting by Leonardo da Vinci.",
    "Mount Everest is the tallest mountain in the world.",
    "The Great Wall of China is a UNESCO World Heritage Site.",
    "Beethoven composed Symphony No. 9 in D minor.",
    "The Earth revolves around the Sun.",
    "Water boils at 100 degrees Celsius.",
    "The capital of France is Paris."
]

In [2]:
# Add the parent directory to sys.path so the code in the similarity folder can be imported

import sys
import os
sys.path.append(os.path.dirname(os.getcwd()))

In [3]:
# Requires the sentence-transformers library:
# pip install -U sentence-transformers
from similarity.Model import Model
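
`Model.py` itself is not reproduced in this notebook, so the following is only a minimal sketch of what `calculate_similarity()` might do, assuming it scores every unordered pair of sentence embeddings by cosine similarity. The helper names `cosine` and `pairwise_similarity` and the toy vectors are hypothetical; only the result keys (`text_id1`, `text_id2`, `similarity`) come from the outputs printed in this notebook.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarity(embeddings):
    """Score every unordered pair (i < j) and return one dict per pair."""
    return [
        {"text_id1": i, "text_id2": j,
         "similarity": cosine(embeddings[i], embeddings[j])}
        for i, j in combinations(range(len(embeddings)), 2)
    ]


# Toy 3-dimensional vectors standing in for real sentence embeddings
# (hypothetical values, not produced by any model).
toy_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
toy_results = pairwise_similarity(toy_embeddings)
for row in toy_results:
    print(row)
```

Under this assumption, scores fall between -1 and 1, and a list of n texts yields n(n-1)/2 pairs, so the 13 sentences above would give 78 rows.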




### all-MiniLM-L6-v2

This model is particularly efficient in size and computation. It uses the MiniLM architecture, a compact Transformer distilled from larger BERT-style models.

In [4]:
# approx. 8.4 s; the fastest of the models tried here
# Name of the SentenceTransformer model to use
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
# Create an instance of the class
Model_instance = Model(sentences, model_name)



In [5]:
# approx. 1.2 s
# Calculate and display similarity between the texts (approx. 2.8 s)
similarity_results = Model_instance.calculate_similarity()

# Print similarity results
print("Similarity results:")
for result in similarity_results:
    print(result)

Similarity results:
{'text_id1': 0, 'text_id2': 1, 'similarity': 0.15973473}
{'text_id1': 0, 'text_id2': 2, 'similarity': 0.08184144}
{'text_id1': 0, 'text_id2': 3, 'similarity': 0.055600077}
{'text_id1': 0, 'text_id2': 4, 'similarity': 0.14304018}
{'text_id1': 0, 'text_id2': 5, 'similarity': 0.03466011}
{'text_id1': 0, 'text_id2': 6, 'similarity': 0.046214525}
{'text_id1': 0, 'text_id2': 7, 'similarity': 0.0221178}
{'text_id1': 0, 'text_id2': 8, 'similarity': -0.037078075}
{'text_id1': 0, 'text_id2': 9, 'similarity': 0.065347075}
{'text_id1': 0, 'text_id2': 10, 'similarity': -0.0359989}
{'text_id1': 0, 'text_id2': 11, 'similarity': -0.016028827}
{'text_id1': 0, 'text_id2': 12, 'similarity': 0.022813626}
{'text_id1': 1, 'text_id2': 2, 'similarity': 0.15311188}
{'text_id1': 1, 'text_id2': 3, 'similarity': -0.070045315}
{'text_id1': 1, 'text_id2': 4, 'similarity': 0.1129448}
{'text_id1': 1, 'text_id2': 5, 'similarity': 0.018555589}
{'text_id1': 1, 'text_id2': 6, 'similarity': 0.08810095}
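
The raw list is easier to read once it is sorted and the ids are mapped back to the sentences. A minimal sketch, using three rows copied from the output above; the names `sample_sentences`, `sample_results` and `top_pairs` are hypothetical helpers, not part of Model.py.

```python
sample_sentences = [
    "Ramona is a software engineer who loves coding.",
    "Jorge enjoys reading books in his free time.",
    "Marco sat with his cat under the tree. ",
]

# The three all-MiniLM-L6-v2 rows that involve only texts 0, 1 and 2.
sample_results = [
    {'text_id1': 0, 'text_id2': 1, 'similarity': 0.15973473},
    {'text_id1': 0, 'text_id2': 2, 'similarity': 0.08184144},
    {'text_id1': 1, 'text_id2': 2, 'similarity': 0.15311188},
]


def top_pairs(results, texts, k=3):
    """Return the k most similar pairs as (score, text_a, text_b) tuples."""
    ranked = sorted(results, key=lambda r: r["similarity"], reverse=True)
    return [
        (r["similarity"], texts[r["text_id1"]].strip(), texts[r["text_id2"]].strip())
        for r in ranked[:k]
    ]


for score, text_a, text_b in top_pairs(sample_results, sample_sentences):
    print(f"{score:.3f}  {text_a}  <->  {text_b}")
```

The same function should work on the full `similarity_results` and `sentences` lists from the cells above, provided the result format is as printed.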

### paraphrase-multilingual-MiniLM-L12-v2

The main architectural difference between all-MiniLM-L6-v2 and paraphrase-multilingual-MiniLM-L12-v2 is the number of layers (L6 = 6; L12 = 12), which affects size, speed, and computational requirements. In addition, paraphrase-multilingual-MiniLM-L12-v2 is multilingual and tailored for paraphrase detection and semantic similarity tasks, while all-MiniLM-L6-v2 is a general-purpose model.
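
The timings noted in the cell comments were read off the notebook interface. To measure the computational cost programmatically, a small wrapper around `time.perf_counter` is one option. This is a sketch only: `timed` is a hypothetical helper, and the workload below is a stand-in so the block runs without downloading a model.

```python
import time


def timed(label, fn, *args, **kwargs):
    """Call fn(*args, **kwargs), print the elapsed wall-clock time, return both."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result, elapsed


# Stand-in workload; replace with the real calls when running the notebook.
total, seconds = timed("sum of squares", lambda n: sum(i * i for i in range(n)), 100_000)
```

With the objects from this notebook, the equivalent calls would presumably be `timed("load", Model, sentences, model_name)` and `timed("similarity", Model_instance.calculate_similarity)`. The first instantiation of a model may also include download time, so repeated runs are likely to be faster.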

In [6]:
# approx. 15.1 s
# Name of the SentenceTransformer model to use
model_name = 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'

# Create an instance of the class
Model_instance = Model(sentences, model_name)



In [7]:
# approx. 0.1 s
# Calculate and display similarity between the texts (approx. 2.8 s)
similarity_results = Model_instance.calculate_similarity()

# Print similarity results
print("Similarity results:")
for result in similarity_results:
    print(result)

Similarity results:
{'text_id1': 0, 'text_id2': 1, 'similarity': 0.20741838}
{'text_id1': 0, 'text_id2': 2, 'similarity': 0.22545971}
{'text_id1': 0, 'text_id2': 3, 'similarity': -0.03478617}
{'text_id1': 0, 'text_id2': 4, 'similarity': -0.04152015}
{'text_id1': 0, 'text_id2': 5, 'similarity': 0.10810108}
{'text_id1': 0, 'text_id2': 6, 'similarity': -0.13943344}
{'text_id1': 0, 'text_id2': 7, 'similarity': -0.077267304}
{'text_id1': 0, 'text_id2': 8, 'similarity': -0.08653814}
{'text_id1': 0, 'text_id2': 9, 'similarity': 0.16482526}
{'text_id1': 0, 'text_id2': 10, 'similarity': -0.023441494}
{'text_id1': 0, 'text_id2': 11, 'similarity': -0.08971386}
{'text_id1': 0, 'text_id2': 12, 'similarity': -0.18638454}
{'text_id1': 1, 'text_id2': 2, 'similarity': 0.2598318}
{'text_id1': 1, 'text_id2': 3, 'similarity': -0.122574516}
{'text_id1': 1, 'text_id2': 4, 'similarity': -0.10347986}
{'text_id1': 1, 'text_id2': 5, 'similarity': 0.07109058}
{'text_id1': 1, 'text_id2': 6, 'similarity': 0.105367

### all-mpnet-base-v2

MPNet (Masked and Permuted Pre-training) is a Transformer-based model developed by Microsoft Research. all-mpnet-base-v2 is larger and slower than the MiniLM models above, so it is usually the better choice when embedding quality matters more than speed.

In [8]:
# approx. 43 s
# Name of the SentenceTransformer model to use
model_name = 'sentence-transformers/all-mpnet-base-v2'

# Create an instance of the class
Model_instance = Model(sentences, model_name)



In [9]:
# approx. 0.2 s
# Calculate and display similarity between the texts (approx. 2.8 s)
similarity_results = Model_instance.calculate_similarity()

# Print similarity results
print("Similarity results:")
for result in similarity_results:
    print(result)

Similarity results:
{'text_id1': 0, 'text_id2': 1, 'similarity': 0.20646062}
{'text_id1': 0, 'text_id2': 2, 'similarity': -0.0046621906}
{'text_id1': 0, 'text_id2': 3, 'similarity': -0.08455665}
{'text_id1': 0, 'text_id2': 4, 'similarity': 0.017528454}
{'text_id1': 0, 'text_id2': 5, 'similarity': 0.1032789}
{'text_id1': 0, 'text_id2': 6, 'similarity': 0.13802837}
{'text_id1': 0, 'text_id2': 7, 'similarity': -0.0075252354}
{'text_id1': 0, 'text_id2': 8, 'similarity': 0.017472018}
{'text_id1': 0, 'text_id2': 9, 'similarity': -0.032865025}
{'text_id1': 0, 'text_id2': 10, 'similarity': 0.033637233}
{'text_id1': 0, 'text_id2': 11, 'similarity': -0.020414997}
{'text_id1': 0, 'text_id2': 12, 'similarity': 0.033197947}
{'text_id1': 1, 'text_id2': 2, 'similarity': 0.0645885}
{'text_id1': 1, 'text_id2': 3, 'similarity': -0.010923749}
{'text_id1': 1, 'text_id2': 4, 'similarity': 0.08825572}
{'text_id1': 1, 'text_id2': 5, 'similarity': 0.02201602}
{'text_id1': 1, 'text_id2': 6, 'similarity': 0.128

### paraphrase-multilingual-mpnet-base-v2

paraphrase-multilingual-mpnet-base-v2 is multilingual and specifically tailored for paraphrase detection and semantic similarity tasks, while all-mpnet-base-v2 is a general-purpose model.
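
Because the models differ in training objective, their scores for the same pair of sentences can differ noticeably. A minimal sketch that lines up the values printed above for the first three models; the dictionary layout and the `spread` helper are hypothetical, while the numbers are copied from the outputs.

```python
# Scores for pairs (0, 1), (0, 2) and (1, 2), copied from the outputs above.
scores_by_model = {
    "all-MiniLM-L6-v2": {
        (0, 1): 0.15973473, (0, 2): 0.08184144, (1, 2): 0.15311188},
    "paraphrase-multilingual-MiniLM-L12-v2": {
        (0, 1): 0.20741838, (0, 2): 0.22545971, (1, 2): 0.2598318},
    "all-mpnet-base-v2": {
        (0, 1): 0.20646062, (0, 2): -0.0046621906, (1, 2): 0.0645885},
}


def spread(pair):
    """Difference between the highest and lowest score any model gave the pair."""
    values = [scores[pair] for scores in scores_by_model.values()]
    return max(values) - min(values)


pairs = sorted(scores_by_model["all-MiniLM-L6-v2"])

# Header row: model names truncated to 22 characters to keep columns aligned.
print(f"{'pair':<8}" + "".join(f"{name[:22]:>24}" for name in scores_by_model))
for pair in pairs:
    row = "".join(f"{scores[pair]:>24.3f}" for scores in scores_by_model.values())
    print(f"{str(pair):<8}{row}  spread={spread(pair):.3f}")
```

A large spread suggests that a fixed similarity threshold chosen for one model should not be reused for another without checking.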

In [10]:
# approx. 1 min 11.4 s
# Name of the SentenceTransformer model to use
model_name = 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2'

# Create an instance of the class
Model_instance = Model(sentences, model_name)



In [11]:
# approx. 0.3 s
# Calculate and display similarity between the texts (approx. 2.8 s)
similarity_results = Model_instance.calculate_similarity()

# Print similarity results
print("Similarity results:")
for result in similarity_results:
    print(result)

Similarity results:
{'text_id1': 0, 'text_id2': 1, 'similarity': 0.30339304}
{'text_id1': 0, 'text_id2': 2, 'similarity': 0.089907266}
{'text_id1': 0, 'text_id2': 3, 'similarity': -0.011231708}
{'text_id1': 0, 'text_id2': 4, 'similarity': 0.014406752}
{'text_id1': 0, 'text_id2': 5, 'similarity': 0.07260013}
{'text_id1': 0, 'text_id2': 6, 'similarity': 0.20138864}
{'text_id1': 0, 'text_id2': 7, 'similarity': -0.015593786}
{'text_id1': 0, 'text_id2': 8, 'similarity': 0.012397654}
{'text_id1': 0, 'text_id2': 9, 'similarity': 0.024085607}
{'text_id1': 0, 'text_id2': 10, 'similarity': 0.005255349}
{'text_id1': 0, 'text_id2': 11, 'similarity': 0.030095711}
{'text_id1': 0, 'text_id2': 12, 'similarity': 0.031220898}
{'text_id1': 1, 'text_id2': 2, 'similarity': 0.26790628}
{'text_id1': 1, 'text_id2': 3, 'similarity': 0.014833647}
{'text_id1': 1, 'text_id2': 4, 'similarity': 0.028790228}
{'text_id1': 1, 'text_id2': 5, 'similarity': 0.07778351}
{'text_id1': 1, 'text_id2': 6, 'similarity': 0.06813

### LaBSE

LaBSE (Language-agnostic BERT Sentence Embedding) specializes in producing embeddings that work across languages, which makes it well suited to cross-lingual tasks.

In [16]:
# approx. 18.4 s
# Name of the SentenceTransformer model to use
model_name = 'sentence-transformers/LaBSE'

# Create an instance of the class
Model_instance = Model(sentences, model_name)



In [17]:
# approx. 0.2 s
# Calculate and display similarity between the texts (approx. 2.8 s)
similarity_results = Model_instance.calculate_similarity()

# Print similarity results
print("Similarity results:")
for result in similarity_results:
    print(result)

Similarity results:
{'text_id1': 0, 'text_id2': 1, 'similarity': 0.3342187}
{'text_id1': 0, 'text_id2': 2, 'similarity': 0.27191776}
{'text_id1': 0, 'text_id2': 3, 'similarity': 0.19035715}
{'text_id1': 0, 'text_id2': 4, 'similarity': 0.19265829}
{'text_id1': 0, 'text_id2': 5, 'similarity': 0.20391954}
{'text_id1': 0, 'text_id2': 6, 'similarity': 0.34882292}
{'text_id1': 0, 'text_id2': 7, 'similarity': 0.13913995}
{'text_id1': 0, 'text_id2': 8, 'similarity': 0.08449406}
{'text_id1': 0, 'text_id2': 9, 'similarity': 0.28190082}
{'text_id1': 0, 'text_id2': 10, 'similarity': 0.20073593}
{'text_id1': 0, 'text_id2': 11, 'similarity': 0.1667537}
{'text_id1': 0, 'text_id2': 12, 'similarity': 0.23500073}
{'text_id1': 1, 'text_id2': 2, 'similarity': 0.37539992}
{'text_id1': 1, 'text_id2': 3, 'similarity': 0.25723955}
{'text_id1': 1, 'text_id2': 4, 'similarity': 0.28703022}
{'text_id1': 1, 'text_id2': 5, 'similarity': 0.18691957}
{'text_id1': 1, 'text_id2': 6, 'similarity': 0.20378728}
{'text_id1

### DistilUSE

DistilUSE (distiluse-base-multilingual-cased-v2) is tailored for generating multilingual sentence embeddings, primarily for tasks such as sentence similarity and semantic search.

In [13]:
# approx. 49.4 s
# Name of the SentenceTransformer model to use
model_name = 'sentence-transformers/distiluse-base-multilingual-cased-v2'

# Create an instance of the class
Model_instance = Model(sentences, model_name)

In [15]:
# approx. 0.1 s
# Calculate and display similarity between the texts (approx. 2.8 s)
similarity_results = Model_instance.calculate_similarity()

# Print similarity results
print("Similarity results:")
for result in similarity_results:
    print(result)

Similarity results:
{'text_id1': 0, 'text_id2': 1, 'similarity': 0.18313915}
{'text_id1': 0, 'text_id2': 2, 'similarity': 0.1076456}
{'text_id1': 0, 'text_id2': 3, 'similarity': 0.051257897}
{'text_id1': 0, 'text_id2': 4, 'similarity': 0.026669824}
{'text_id1': 0, 'text_id2': 5, 'similarity': 0.06693362}
{'text_id1': 0, 'text_id2': 6, 'similarity': 0.26562876}
{'text_id1': 0, 'text_id2': 7, 'similarity': 0.2319214}
{'text_id1': 0, 'text_id2': 8, 'similarity': 0.07335695}
{'text_id1': 0, 'text_id2': 9, 'similarity': 0.06130746}
{'text_id1': 0, 'text_id2': 10, 'similarity': 0.05805084}
{'text_id1': 0, 'text_id2': 11, 'similarity': 0.054492675}
{'text_id1': 0, 'text_id2': 12, 'similarity': 0.14478637}
{'text_id1': 1, 'text_id2': 2, 'similarity': 0.1645833}
{'text_id1': 1, 'text_id2': 3, 'similarity': 0.013779677}
{'text_id1': 1, 'text_id2': 4, 'similarity': 0.08498677}
{'text_id1': 1, 'text_id2': 5, 'similarity': 0.098824814}
{'text_id1': 1, 'text_id2': 6, 'similarity': 0.08114214}
{'text