# Lesson 4: Sentence Embeddings

A sentence embedding is a numerical vector that represents the meaning of a whole sentence in a form that a model can compare and compute with. It maps natural-language text to a fixed-size list of floats (for example 384 or 768 numbers, depending on the model; the all-MiniLM-L6-v2 model used below produces 384) that captures the sentence's semantic content: not just the individual words, but the overall meaning.

- Our goal is to measure sentence similarity, which is useful for tasks such as grouping related sentences.

In [1]:
from transformers.utils import logging
logging.set_verbosity_error()

### Build the `sentence embedding` pipeline using the 🤗 `sentence-transformers` library

In [2]:
from sentence_transformers import SentenceTransformer

In [3]:
model = SentenceTransformer("all-MiniLM-L6-v2")

More info on [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

In [4]:
sentences1 = ['The cat sits outside',
              'A man is playing guitar',
              'The movies are awesome']

In [5]:
embeddings1 = model.encode(sentences1, convert_to_tensor=True)

In [6]:
embeddings1

tensor([[ 0.1392,  0.0030,  0.0470,  ...,  0.0641, -0.0163,  0.0636],
        [ 0.0227, -0.0014, -0.0056,  ..., -0.0225,  0.0846, -0.0283],
        [-0.1043, -0.0628,  0.0093,  ...,  0.0020,  0.0653, -0.0150]])

In [7]:
sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

In [8]:
embeddings2 = model.encode(sentences2, 
                           convert_to_tensor=True)

In [9]:
print(embeddings2)

tensor([[ 0.0163, -0.0700,  0.0384,  ...,  0.0447,  0.0254, -0.0023],
        [ 0.0054, -0.0920,  0.0140,  ...,  0.0167, -0.0086, -0.0424],
        [-0.0842, -0.0592, -0.0010,  ..., -0.0157,  0.0764,  0.0389]])


* Calculate the cosine similarity between each sentence in `sentences1` and each sentence in `sentences2` as a measure of how similar they are to each other.
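To make the measure concrete, here is a minimal from-scratch sketch of cosine similarity in plain Python. The helper name `cosine_similarity` and the toy 3-dimensional vectors are made up for illustration; they are not real embeddings and this is not the library's implementation, only the same quantity computed by hand.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical, chosen only to show the three extreme cases).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(same_direction, orthogonal, opposite)
```

The score depends only on direction, not length: vectors pointing the same way score close to 1, perpendicular vectors score 0, and opposite vectors score close to -1.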

In [10]:
from sentence_transformers import util

In [11]:
cosine_scores = util.cos_sim(embeddings1, embeddings2)

In [12]:
print(cosine_scores)

tensor([[ 0.2838,  0.1310, -0.0029],
        [ 0.2277, -0.0327, -0.0136],
        [-0.0124, -0.0465,  0.6571]])
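The printed tensor is a pairwise matrix: row `i` corresponds to `sentences1[i]` and column `j` to `sentences2[j]`. As a small sketch of how to read it, the snippet below copies the printed scores into plain Python lists (so it runs without the model) and picks the best-scoring column for each row. The variable names `scores` and `best_matches` are introduced here for illustration only.

```python
sentences1 = ['The cat sits outside',
              'A man is playing guitar',
              'The movies are awesome']
sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

# Scores copied from the printed tensor, as plain lists.
scores = [
    [0.2838, 0.1310, -0.0029],
    [0.2277, -0.0327, -0.0136],
    [-0.0124, -0.0465, 0.6571],
]

best_matches = []
for i, row in enumerate(scores):
    # Index of the highest score in this row.
    j = max(range(len(row)), key=lambda k: row[k])
    best_matches.append((sentences1[i], sentences2[j], row[j]))

for s1, s2, score in best_matches:
    print(f"{s1!r} is closest to {s2!r} (score {score:.4f})")
```

Note that the best match is not always on the diagonal: for 'A man is playing guitar', the highest score in its row belongs to 'The dog plays in the garden', not to 'A woman watches TV'.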


In [13]:
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i],
                                                 sentences2[i],
                                                 cosine_scores[i][i]))

The cat sits outside 		 The dog plays in the garden 		 Score: 0.2838
A man is playing guitar 		 A woman watches TV 		 Score: -0.0327
The movies are awesome 		 The new movie is so great 		 Score: 0.6571


In [14]:
sentence3 = 'I like programming'

In [15]:
sentence4 = "Coding is fun"

In [17]:
embeddings3 = model.encode(sentence3,
                           convert_to_tensor=True)

In [18]:
embeddings4 = model.encode(sentence4,
                           convert_to_tensor=True)

In [20]:
print(embeddings3)

tensor([-5.1719e-02, -2.0952e-02, -8.2500e-03,  2.4515e-02, -9.9439e-03,
        -6.8515e-02,  1.0331e-01,  4.1197e-02,  3.2924e-02,  6.6939e-02,
        -4.8645e-02, -2.8012e-02,  1.4286e-02,  1.2630e-02,  3.2221e-02,
        -2.6675e-02, -5.2256e-02, -5.6366e-02,  1.3948e-02, -7.5305e-02,
        -1.4374e-01,  1.0994e-02, -3.1063e-03, -4.0515e-02,  5.1756e-02,
         1.1957e-01, -5.8505e-03, -2.2785e-02,  2.4844e-02, -6.3212e-02,
        -1.0977e-01,  1.0584e-01,  5.9859e-02,  4.4884e-02,  1.3550e-02,
         4.4947e-02,  8.2325e-02, -5.5439e-02,  2.3372e-02, -2.3878e-03,
        -8.2266e-02,  2.7210e-02,  6.4359e-02, -3.9953e-02, -2.0606e-02,
        -4.2883e-02, -1.6732e-02, -7.3027e-02,  7.1685e-02,  3.6505e-02,
         3.6103e-02, -7.3423e-03, -1.2763e-02, -3.9547e-02, -2.4636e-02,
         6.2700e-03,  2.3235e-02,  4.9753e-02,  8.7373e-03, -5.3747e-02,
        -3.4452e-02, -8.2123e-04, -2.0995e-02,  4.6273e-02,  5.6043e-02,
        -1.8466e-02, -1.3684e-02,  5.2412e-02,  2.5

In [21]:
print(embeddings4)

tensor([-3.4755e-02, -1.7553e-02, -3.2809e-02, -1.0212e-02, -1.7489e-02,
        -2.5481e-02,  6.6068e-02, -6.8483e-03,  1.9094e-02,  7.1942e-02,
         1.4288e-02, -2.4821e-02,  1.2111e-03,  4.4834e-03,  6.2238e-03,
        -5.6727e-02, -7.4205e-02,  1.5211e-03, -4.7156e-03, -5.6913e-02,
        -2.7228e-02, -7.0040e-02, -1.2898e-03, -2.9783e-02,  6.5100e-02,
         7.0677e-02, -4.6424e-02, -3.0245e-02,  6.7596e-02, -2.8450e-02,
        -6.0638e-02,  9.4772e-02,  7.2307e-02,  4.5318e-02, -7.6876e-03,
         4.4793e-02,  7.6149e-02, -2.9679e-02, -2.2681e-02,  2.4527e-02,
        -1.3476e-01,  1.4117e-02,  5.9468e-02,  1.5393e-02,  4.3307e-02,
         2.2789e-02,  4.3088e-03, -5.6026e-02,  4.6085e-03,  2.0330e-03,
         1.3967e-02, -1.7977e-02, -1.8176e-02, -4.2996e-02,  3.9040e-02,
        -4.8974e-02,  5.6025e-02, -2.8895e-02, -8.4968e-03,  1.8888e-02,
        -6.3683e-03,  1.8053e-02, -1.6402e-02,  4.2971e-02,  5.5569e-02,
        -4.5516e-02, -1.6897e-02,  1.6301e-02, -1.6

In [22]:
cosine_scores = util.cos_sim(embeddings3, embeddings4)

In [23]:
print(cosine_scores)

tensor([[0.6548]])


## If differently worded sentences score as similar, what about two nearly identical sentences? Let's try it

In [25]:
sentence5 = "Prensu is a good guy"

In [26]:
sentence6 = "Prensu is decent guy"

In [28]:
embeddings5 = model.encode(sentence5,
                           convert_to_tensor=True)

In [27]:
embeddings6 = model.encode(sentence6,
                           convert_to_tensor=True)

In [31]:
cosine_scores = util.cos_sim(embeddings5, embeddings6)

In [32]:
print(cosine_scores)

tensor([[0.9632]])


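Yeah, a cosine similarity of about 0.96: as far as the model is concerned, the two sentences mean almost the same thing.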
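Since the stated goal of this lesson is grouping, here is a rough sketch of how pair scores could drive it. It reuses the pair scores printed in this notebook and merges two sentences into one group when their score reaches a cut-off. The `THRESHOLD` value of 0.5 and the helper `group_by_threshold` are assumptions made for illustration, not part of the library, and only the pairs that were actually scored above are considered.

```python
# Pair scores copied from the outputs printed in this lesson.
scored_pairs = [
    ("The cat sits outside", "The dog plays in the garden", 0.2838),
    ("A man is playing guitar", "A woman watches TV", -0.0327),
    ("The movies are awesome", "The new movie is so great", 0.6571),
    ("I like programming", "Coding is fun", 0.6548),
    ("Prensu is a good guy", "Prensu is decent guy", 0.9632),
]

THRESHOLD = 0.5  # hypothetical cut-off; tune it for real data


def group_by_threshold(pairs, threshold):
    """Put two sentences in the same group when their score >= threshold."""
    groups = []  # list of disjoint sets of sentences
    for a, b, score in pairs:
        # A similar pair enters as one set, a dissimilar pair as two singletons.
        candidates = [{a, b}] if score >= threshold else [{a}, {b}]
        for new_group in candidates:
            # Absorb any existing group that shares a sentence with this one.
            for existing in [g for g in groups if g & new_group]:
                groups.remove(existing)
                new_group = new_group | existing
            groups.append(new_group)
    return groups


groups = group_by_threshold(scored_pairs, THRESHOLD)
for group in groups:
    print(sorted(group))
```

With this cut-off, the movie, programming, and Prensu pairs each end up grouped together, while the other four sentences stay on their own. A real grouping would score every sentence against every other one rather than a handful of pairs.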