**Sentence similarity** is a core task in natural language processing (NLP): it measures how close two sentences are in meaning. It has many applications, including text classification, summarization, information retrieval, and chatbot development. This tutorial shows how to compute sentence similarity in Python, first with the Sentence Transformers library and then directly with Hugging Face Transformers, and explains each line of code used to calculate sentence embeddings and their cosine similarity.

Step 1: Install the libraries and run a quick example

In [None]:
!pip install -U sentence-transformers

In [None]:
!pip install torch

In [None]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['sorting algorithms', 'That is a very happy person']

model = SentenceTransformer('thenlper/gte-large')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))

tensor([[0.7201]])


**!pip install -U sentence-transformers**: Installs the Sentence Transformers package from PyPI, or upgrades it to the latest version. Sentence Transformers provides pre-trained models that map sentences to vectors (sentence embeddings).

**!pip install torch**: Installs PyTorch if it is not already present. Sentence Transformers runs on PyTorch and lists it as a dependency, so this command usually finds it already installed.

The **'thenlper/gte-large'** model produces sentence embeddings suited to comparing the meaning of sentence pairs. **model.encode(sentences)** returns one embedding per sentence, and **cos_sim** returns the cosine similarity of the two embeddings as a 1×1 tensor.
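To make the similarity score less of a black box, here is a minimal sketch of the cosine similarity formula in plain Python. The helper name `cosine_similarity` and the 2-D vectors are made up for illustration; they are not model output, and this is not the library's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-D "embeddings" (hypothetical values chosen so the result is easy to check)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # -1
print(same_direction, orthogonal, opposite)
```

The score depends only on the angle between the vectors, not on their lengths, which is why `[1, 2]` and `[2, 4]` count as maximally similar.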

Step 2: Import necessary modules

In [None]:
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

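**import torch.nn.functional as F**: Imports PyTorch's functional module under the alias F. This tutorial uses F.normalize() from it.

**Tensor**: PyTorch's multi-dimensional array class, used here for type annotations.

**AutoTokenizer, AutoModel**: Classes from Hugging Face Transformers. AutoTokenizer converts input texts into token IDs and attention masks; AutoModel loads the pre-trained network, which returns one hidden-state vector per token. Turning those token vectors into sentence embeddings is the job of the pooling function defined in the next step.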

Step 3: Define helper function

In [None]:
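def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out the hidden states of padding tokens (attention_mask == 0)
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    # Sum over the sequence dimension and divide by the number of real tokens
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]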

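**last_hidden = last_hidden_states.masked_fill(...)**: Sets the hidden states at padding positions (where attention_mask is 0) to zero, so that padding does not contribute to the sum. attention_mask[..., None] adds a trailing dimension so the mask broadcasts across the hidden dimension.

**return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]**: Sums the token vectors along dimension 1 (the sequence dimension), then divides by the number of real (non-padding) tokens in each sentence. The result is the mean token embedding of each sentence.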
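The same masked averaging can be traced by hand on a tiny batch. The sketch below mirrors the logic on nested Python lists rather than tensors; `masked_mean_pool` is a hypothetical helper and all numbers are made up, so every value can be checked manually.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sentence, skipping positions where mask == 0."""
    pooled = []
    for token_vectors, mask in zip(hidden_states, attention_mask):
        dim = len(token_vectors[0])
        total = [0.0] * dim
        # Add up only the vectors of real tokens (mask == 1)
        for vector, keep in zip(token_vectors, mask):
            if keep:
                total = [t + v for t, v in zip(total, vector)]
        n_real = sum(mask)  # number of non-padding tokens in this sentence
        pooled.append([t / n_real for t in total])
    return pooled

# Made-up batch: 2 sentences, 3 token positions, hidden size 2
hidden_states = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # third position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],  # no padding
]
attention_mask = [
    [1, 1, 0],
    [1, 1, 1],
]
pooled = masked_mean_pool(hidden_states, attention_mask)
print(pooled)  # [[2.0, 3.0], [4.0, 2.0]]
```

Note that the padding vector `[9.0, 9.0]` has no effect on the first result: the sum skips it and the divisor is 2, not 3.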

Step 4: Prepare input data & generate sentence embeddings

In [None]:
input_texts = [
    "what is the capital of Philippines?",
    "how to implement quick sort in python?",
    "Manila",
    "sorting algorithms"
]

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large")
model = AutoModel.from_pretrained("thenlper/gte-large")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

This cell performs the following steps:

* Initialize input_texts with a query (the first element) and three texts to compare it against.
* Load the pre-trained tokenizer and model.
* Tokenize the input texts into a padded batch of tensors, truncating anything longer than 512 tokens.
* Run the model on the batch to obtain one hidden-state vector per token.
* Compute sentence embeddings with the average_pool function defined earlier.
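Padding is what makes the attention mask necessary: sentences of different lengths are padded to a common length so they fit in one tensor, and the mask records which positions are real. The following sketch illustrates the idea only. `pad_batch` is a hypothetical helper, and whitespace splitting stands in for the real tokenizer, which splits text into subwords and typically adds special tokens.

```python
def pad_batch(token_lists, pad_token="[PAD]"):
    """Pad every token list to the longest one and build the matching mask."""
    max_len = max(len(tokens) for tokens in token_lists)
    padded, attention_mask = [], []
    for tokens in token_lists:
        n_pad = max_len - len(tokens)
        padded.append(tokens + [pad_token] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(tokens) + [0] * n_pad)
    return padded, attention_mask

# Assumption: str.split() is only a stand-in for subword tokenization
texts = ["what is the capital of Philippines?", "Manila"]
padded, attention_mask = pad_batch([text.split() for text in texts])
print(padded[1])
print(attention_mask)  # [[1, 1, 1, 1, 1, 1], [1, 0, 0, 0, 0, 0]]
```

With `padding=True`, the tokenizer pads to the longest sequence in the batch in a similar way, and the resulting `attention_mask` is what `average_pool` uses to ignore the padded positions.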

Step 5: Normalizing embeddings

In [None]:
embeddings = F.normalize(embeddings, p=2, dim=1)

Scales each embedding to unit L2 length. The dot product of two unit-length vectors equals their cosine similarity, so this step is required for the scores in Step 6 to be cosine similarities. Without it, the dot products would also depend on the magnitudes of the embeddings.
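One concrete consequence of unit-length embeddings is that a plain dot product becomes a cosine similarity. Here is a minimal sketch in plain Python, with made-up vectors and hypothetical helper names (`l2_normalize`, `dot`); it illustrates the arithmetic and is not how PyTorch implements `F.normalize`.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length without changing its direction."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up vectors with different lengths (norms 5 and 10)
a, b = [3.0, 4.0], [8.0, 6.0]
raw_dot = dot(a, b)                      # 48.0, grows with vector length
unit_a, unit_b = l2_normalize(a), l2_normalize(b)
cosine = dot(unit_a, unit_b)             # about 0.96, depends only on the angle
print(raw_dot, cosine)
```

Doubling the length of `b` would double `raw_dot` but leave `cosine` unchanged, which is the property the similarity scores rely on.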

Step 6: Computing pairwise similarity scores

In [None]:
# Embeddings were L2-normalized in Step 5, so these dot products are cosine similarities
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())


[[63.88489532470703, 87.74205780029297, 66.94505310058594]]


Computes the cosine similarity between the first input sentence ("what is the capital of Philippines?") and each of the remaining three, multiplying the results by 100 for readability. The highest score, 87.74, belongs to "Manila", the correct answer to the query.
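A common next step is to rank the candidate texts by score. The sketch below does this for the scores printed above, rounded to two decimals and hard-coded so that it runs without loading the model. The variable names are illustrative.

```python
# Candidates are input_texts[1:]; scores are the printed values, rounded
candidates = [
    "how to implement quick sort in python?",
    "Manila",
    "sorting algorithms",
]
scores = [63.88, 87.74, 66.95]

# Sort (text, score) pairs from most to least similar to the query
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for text, score in ranked:
    print(f"{score:6.2f}  {text}")

best_text, best_score = ranked[0]
```

In a retrieval setting, `best_text` would be returned as the answer, or the top few entries of `ranked` would be passed on for further processing.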