# Sentence Embeddings 📝

## Introduction
This notebook demonstrates how to use the `sentence-transformers` library to generate sentence embeddings for natural language processing (NLP) tasks such as similarity measurement and clustering. We will set up the environment, load a pre-trained model, encode a few example sentences, and compare the resulting embeddings with cosine similarity.

### Install and Import Libraries

Here is a brief description of the required libraries:

- The `sentence-transformers` library is particularly useful for tasks that require understanding or comparing sentence meanings. Common use cases include sentence embeddings and text similarity.

- The `SentenceTransformer` class transforms text into embeddings that can be used for machine learning models, similarity tasks, or information retrieval.

- The `util` module in `sentence_transformers` provides utility functions for common embedding tasks, such as calculating cosine similarity between embeddings and clustering sentences based on similarity.

- PyTorch is an open-source deep learning framework developed by Facebook's AI Research lab. It provides flexibility and speed for building, training, and deploying deep learning models.

  

In [1]:
# Install the sentence-transformers library
%pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.3.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.3.1-py3-none-any.whl (268 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-3.3.1
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Suppress non-critical log messages
from transformers.utils import logging
logging.set_verbosity_error()

# Import necessary libraries
from sentence_transformers import SentenceTransformer, util
import torch



### Build the sentence embedding pipeline using the `sentence-transformers` library

The `SentenceTransformer` class loads a pre-trained model and exposes an `encode` method that turns sentences into embeddings. These embeddings can then be used for machine learning models, similarity tasks, or information retrieval.

#### Load Sentence Embedding Model

The all-MiniLM-L6-v2 model has been selected to test sentence similarity. More info on [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

In [3]:
# Load the pre-trained model for generating sentence embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")

 - Example 1: Check Similarity Among Sentences with No Similarity

In [4]:
# Define three sentences that have no similarity
sentences1 = [
    'The cat sits outside',
    'A man is playing guitar',
    'The movies are awesome'
]

# Encode the sentences to generate embeddings
embeddings1 = model.encode(sentences1, convert_to_tensor=True)

# Display the embeddings
print(embeddings1)

tensor([[ 0.1392,  0.0030,  0.0470,  ...,  0.0641, -0.0163,  0.0636],
        [ 0.0227, -0.0014, -0.0056,  ..., -0.0225,  0.0846, -0.0283],
        [-0.1043, -0.0628,  0.0093,  ...,  0.0020,  0.0653, -0.0150]])


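**Explanation output**: Each row is the embedding of one sentence, printed in truncated form. The individual values are small, but their size alone says nothing about similarity. Similarity only shows up when two embeddings are compared with each other, for example with cosine similarity.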
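
Similarity is a property of a pair of embeddings, not of the values inside a single one. The sketch below illustrates this in plain Python. The vectors are made-up toy values rather than model output, and `cosine_similarity` is a hypothetical helper written only for this illustration.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Toy vectors (illustrative only, not produced by the model).
small_a = [0.01, 0.02]
small_b = [0.02, 0.04]  # same direction as small_a, different length
large_a = [5.0, 0.0]
large_b = [0.0, 5.0]    # perpendicular to large_a

same_direction = cosine_similarity(small_a, small_b)
perpendicular = cosine_similarity(large_a, large_b)

print(f"small values, same direction: {same_direction:.4f}")
print(f"large values, perpendicular:  {perpendicular:.4f}")
```

The pair with tiny values scores close to 1, while the pair with large values scores close to 0, so the direction of the vectors matters and the size of their entries does not.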

 - Example 2: Check Similarity Among Another Set of Sentences with No Similarity

In [5]:
# Define another set of three sentences that have no similarity
sentences2 = [
    'The dog plays in the garden',
    'A woman watches TV',
    'The new movie is so great'
]

# Encode the sentences to generate embeddings
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

# Display the embeddings
print(embeddings2)

tensor([[ 0.0163, -0.0700,  0.0384,  ...,  0.0447,  0.0254, -0.0023],
        [ 0.0054, -0.0920,  0.0140,  ...,  0.0167, -0.0086, -0.0424],
        [-0.0842, -0.0592, -0.0010,  ..., -0.0157,  0.0764,  0.0389]])


**Explanation output**: As before, each row is the embedding of one sentence. The values themselves do not indicate whether the sentences are similar; that requires comparing the embeddings with each other.

## Cosine Similarity
Here, we will calculate the cosine similarity between the embeddings of the two sets of sentences to measure how similar they are to each other.

* Calculate the cosine similarity between each sentence in `sentences1` (Example 1) and each sentence in `sentences2` (Example 2). Entry `[i][j]` of the result compares `sentences1[i]` with `sentences2[j]`.
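
For reference, the standard definition of cosine similarity for two embedding vectors is given below. The symbols $u$ and $v$ are introduced here only for illustration.

```latex
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
           = \frac{\sum_{k} u_k v_k}{\sqrt{\sum_{k} u_k^{2}} \, \sqrt{\sum_{k} v_k^{2}}}
```

The score lies between -1 and 1. Values near 1 mean the two vectors point in almost the same direction, and values near 0 mean they are close to orthogonal.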

In [6]:
# Calculate the cosine similarity between the embeddings of the two sets of sentences
cosine_scores = util.cos_sim(embeddings1, embeddings2)

# Display the cosine similarity scores
print(cosine_scores)

tensor([[ 0.2838,  0.1310, -0.0029],
        [ 0.2277, -0.0327, -0.0136],
        [-0.0124, -0.0465,  0.6571]])


- Display Cosine Similarity Scores

In [7]:
# Print the cosine similarity score for each same-index pair (the diagonal of the matrix)
for i in range(len(sentences1)):
    print(f"{sentences1[i]} \t\t {sentences2[i]} \t\t Score: {cosine_scores[i][i]:.4f}")

The cat sits outside 		 The dog plays in the garden 		 Score: 0.2838
A man is playing guitar 		 A woman watches TV 		 Score: -0.0327
The movies are awesome 		 The new movie is so great 		 Score: 0.6571


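**Explanation output**: The third pair scores relatively high (0.6571), which fits the fact that both sentences praise movies. The first pair scores 0.2838, suggesting only weak relatedness, and the second pair scores -0.0327, meaning the sentences are essentially unrelated.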
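
The loop above reads only the diagonal of the score matrix, so each sentence is compared with its same-index partner. As a rough sketch of how the full matrix could be used, the snippet below picks the best match in `sentences2` for every sentence in `sentences1`. The scores are copied by hand from the printed output above, and `best_matches` is a name introduced for this illustration.

```python
sentences1 = [
    'The cat sits outside',
    'A man is playing guitar',
    'The movies are awesome',
]
sentences2 = [
    'The dog plays in the garden',
    'A woman watches TV',
    'The new movie is so great',
]

# Scores copied from the printed cosine_scores tensor (rows: sentences1, columns: sentences2).
scores = [
    [0.2838, 0.1310, -0.0029],
    [0.2277, -0.0327, -0.0136],
    [-0.0124, -0.0465, 0.6571],
]

best_matches = []
for i, row in enumerate(scores):
    # Index of the highest score in this row.
    j = max(range(len(row)), key=lambda col: row[col])
    best_matches.append((sentences1[i], sentences2[j], row[j]))

for left, right, score in best_matches:
    print(f"{left!r} -> {right!r} (score {score:.4f})")
```

With these numbers, 'A man is playing guitar' is closest to 'The dog plays in the garden' (0.2277), not to its same-index partner 'A woman watches TV' (-0.0327).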

 - Example 3: Check Similarity Among Sentences with Similarity

In [8]:
# Define two sentences that have similarity
sentences3 = [
    'She loves reading books',
    'She likes to read stories'
]

# Encode the sentences to generate embeddings
embeddings3 = model.encode(sentences3, convert_to_tensor=True)

# Display the embeddings
print(embeddings3)

tensor([[ 8.1847e-02, -1.8762e-02,  6.0246e-02,  4.2736e-02, -1.0077e-01,
          7.7356e-02,  4.2394e-02,  4.8213e-02,  3.7733e-02,  7.1035e-02,
         -2.8980e-02,  3.2012e-02, -1.2292e-02,  1.0678e-02, -3.6665e-02,
          1.9533e-02, -5.3808e-02, -1.1441e-02, -3.5750e-02, -3.6846e-02,
         -3.0612e-02,  3.4660e-02,  6.8313e-02,  3.0411e-03, -4.4332e-02,
          9.1062e-03, -4.4125e-02, -8.7327e-02, -5.4015e-02, -1.8325e-02,
         -2.1602e-02,  2.5704e-02, -7.2077e-02,  9.2353e-03, -2.5856e-02,
          3.8939e-02,  6.9202e-02,  3.0186e-02,  4.0532e-02,  2.1817e-04,
         -4.9267e-02, -6.1422e-02, -3.7412e-02,  1.4324e-02, -4.2883e-02,
         -3.8788e-02,  3.9759e-02,  3.9910e-02,  4.9719e-02, -1.1457e-02,
         -4.7881e-02,  4.4858e-02, -1.5185e-01, -1.5263e-02,  1.6128e-02,
          6.4546e-02, -4.1270e-02, -1.1895e-02,  1.2485e-02, -3.2212e-02,
         -1.2699e-02,  4.1639e-02, -3.3134e-04, -1.4763e-03, -1.8282e-02,
         -1.0024e-02,  3.3911e-02,  2.

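**Explanation output**: The printed values are the raw embeddings of the two sentences. On their own they do not show that the sentences are similar; that requires computing the cosine similarity between the two embeddings, as in the Cosine Similarity section above.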
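
To measure how similar the two sentences actually are, their embeddings have to be compared. A minimal sketch of one way to do this follows. `pair_score` is a hypothetical helper, not part of `sentence-transformers`, and it assumes only that `model.encode` returns one vector per input sentence.

```python
import math


def pair_score(model, first, second):
    """Encode two sentences and return their cosine similarity as a float.

    `model` is assumed to be any object whose `encode(list_of_str)` method
    returns one vector per sentence, such as the SentenceTransformer above.
    """
    vec_a, vec_b = model.encode([first, second])
    a = [float(x) for x in vec_a]
    b = [float(x) for x in vec_b]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# In the notebook, with the model and sentences defined above:
# score = pair_score(model, sentences3[0], sentences3[1])
# print(f"Score: {score:.4f}")
```

The notebook's own tools should give the same comparison via `util.cos_sim(embeddings3[0], embeddings3[1])`. No score is quoted here because this pair was not scored in the cells above.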

### Conclusion
This notebook showed how to use the `sentence-transformers` library to generate sentence embeddings and measure sentence similarity with cosine similarity. Such embeddings are also useful for tasks like clustering and text classification. The all-MiniLM-L6-v2 model was chosen for its balance of performance and efficiency: its speed suits real-time applications and large-scale processing, but domain-specific or high-accuracy tasks may need fine-tuning or a stronger model.

### Next Steps
- Try this model with your own sentences!