# Embeddings With Sentence-Transformers

I will work through two ways of producing sentence embeddings: first with only the `transformers` library, building the approach from scratch, and then with the `sentence-transformers` library, which does the same thing in a couple of lines.

The purpose is to show that it's not too difficult to recreate the sentence-transformers functionality from scratch, and that a fine-tuned BERT model can be applied to similarity tasks relatively easily.

* Inspired by this video: https://www.youtube.com/watch?v=Ey81KfQ3PQU

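# BERT Background:
* The tokenizer produces sequences of up to 512 tokens
    * The first token is `[CLS]`; empty positions are filled with `[PAD]`
* 12 encoder blocks
    * Each block takes an input of shape 1x512x768 (batch x tokens x features)
* The output is 512x768: one 768-dimensional vector per token, which pooling later reduces to a single 768-dimensional sentence vector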
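To make the `[CLS]` / `[PAD]` layout concrete, here is a minimal pure-Python sketch of how a token list gets wrapped and padded to a fixed length. The helper `pad_tokens` is hypothetical (it is not part of `transformers`), it works on words rather than real WordPiece tokens, and it assumes `max_length` is at least 2. It also adds the `[SEP]` token that BERT places at the end of each sentence.

```python
def pad_tokens(words, max_length, cls="[CLS]", sep="[SEP]", pad="[PAD]"):
    """Hypothetical helper: wrap tokens in [CLS]/[SEP] and pad to max_length."""
    # keep room for the two special tokens, truncating if needed
    tokens = [cls] + list(words)[: max_length - 2] + [sep]
    # 1 marks a real token, 0 marks padding
    attention_mask = [1] * len(tokens)
    n_pad = max_length - len(tokens)
    return tokens + [pad] * n_pad, attention_mask + [0] * n_pad


tokens, attention_mask = pad_tokens(["the", "fish", "dreamed"], max_length=8)
print(tokens)
# ['[CLS]', 'the', 'fish', 'dreamed', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
print(attention_mask)
# [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask is what lets us ignore the padded positions later when we pool the token vectors.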


# Using the `transformers` library

In [1]:
sentences = [
    "Three years later, the coffin was still full of Jello.",
    "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
    "The person box was packed with jelly many dozens of months later.",
    "Standing on one's head at job interviews forms a lasting impression.",
    "It took him a month to finish the meal.",
    "He found a leprechaun in his walnut shell."
]


In [2]:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')

# initialize dictionary that will contain tokenized sentences
tokens = {'input_ids': [], 'attention_mask': []}

for sentence in sentences:
    # tokenize the sentence, truncating or padding it to exactly 128 tokens
    # (far more than these short sentences need)
    # encode_plus returns a dict-like object of tensors with shape [1, 128];
    # we pull out the input IDs and the attention mask
    new_tokens = tokenizer.encode_plus(sentence, max_length=128, truncation=True,
                                       padding='max_length', return_tensors='pt')
    # append new tokens to dictionary
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    # append new attention mask to dictionary
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])

# reformat list of tensors into single tensor
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])



In [3]:
# 6 sentences, 128 tokens per sentence
tokens['input_ids'].shape

torch.Size([6, 128])

In [4]:
outputs = model(**tokens)
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

In [21]:
# last_hidden_state holds the token embeddings from the final encoder layer
embeddings = outputs.last_hidden_state
embeddings

torch.Size([6, 128, 768])


tensor([[[-6.9230e-02,  6.2300e-01,  3.5369e-02,  ...,  8.0334e-01,
           1.6314e+00,  3.2812e-01],
         [ 3.6729e-02,  6.8419e-01,  1.9460e-01,  ...,  8.4759e-02,
           1.4747e+00, -3.0080e-01],
         [-1.2140e-02,  6.5431e-01, -7.2718e-02,  ..., -3.2600e-02,
           1.7717e+00, -6.8121e-01],
         ...,
         [ 1.9532e-01,  1.1085e+00,  3.3905e-01,  ...,  1.2826e+00,
           1.0114e+00, -7.2754e-02],
         [ 9.0217e-02,  1.0288e+00,  3.2973e-01,  ...,  1.2940e+00,
           9.8651e-01, -1.1125e-01],
         [ 1.2404e-01,  9.7365e-01,  3.9329e-01,  ...,  1.1359e+00,
           8.7685e-01, -1.0435e-01]],

        [[-3.2124e-01,  8.2512e-01,  1.0554e+00,  ..., -1.8555e-01,
           1.5169e-01,  3.9366e-01],
         [-7.1457e-01,  1.0297e+00,  1.1217e+00,  ...,  3.3118e-02,
           2.3820e-01, -1.5632e-01],
         [-2.3522e-01,  1.1353e+00,  8.5941e-01,  ..., -4.3096e-01,
          -2.7241e-02, -2.9676e-01],
         ...,
(output truncated)

In [6]:
# 6 sentences, 128 tokens per sentence, 768 features per token
embeddings.shape

torch.Size([6, 128, 768])

In [22]:
# build a mask that will zero out the padding tokens
# the attention mask is [6, 128], so expand it across the 768 features per token
attention_mask = tokens['attention_mask']
print('attention mask size is: ', attention_mask.shape)
mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
print('mask size is: ', mask.shape)

attention mask size is:  torch.Size([6, 128])
mask size is:  torch.Size([6, 128, 768])
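The `unsqueeze(-1).expand(...)` step can be pictured on a toy batch. The sketch below uses NumPy instead of PyTorch so that it runs without a model; the sizes (2 sentences, 4 tokens, 3 features) are made up for illustration.

```python
import numpy as np

# Toy attention mask: 2 sentences, 4 token positions (1 = real token, 0 = padding)
attention_mask = np.array([[1, 1, 1, 0],
                           [1, 1, 0, 0]])

# Add a trailing axis -> (2, 4, 1), then broadcast across 3 features -> (2, 4, 3).
# This mirrors attention_mask.unsqueeze(-1).expand(embeddings.size()) in PyTorch.
mask = np.broadcast_to(attention_mask[..., None], (2, 4, 3)).astype(np.float32)

print(mask.shape)  # (2, 4, 3)
print(mask[0])     # rows of ones for real tokens, a row of zeros for the padded position
```

Each token's 0 or 1 is simply repeated across its feature dimension, so multiplying by the mask keeps real token vectors and zeroes padded ones.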


In [9]:
mask

tensor([[[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
(output truncated)

In [24]:
# apply mask to embeddings to zero out the padded tokens
masked_embeddings = embeddings * mask
print('masked embeddings size: ', masked_embeddings.shape)
print('we need to sum across the 128 tokens to get a single vector for each sentence')
masked_embeddings

masked embeddings size:  torch.Size([6, 128, 768])
we need to sum across the 128 tokens to get a single vector for each sentence


tensor([[[-0.0692,  0.6230,  0.0354,  ...,  0.8033,  1.6314,  0.3281],
         [ 0.0367,  0.6842,  0.1946,  ...,  0.0848,  1.4747, -0.3008],
         [-0.0121,  0.6543, -0.0727,  ..., -0.0326,  1.7717, -0.6812],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000, -0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000, -0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000, -0.0000]],

        [[-0.3212,  0.8251,  1.0554,  ..., -0.1855,  0.1517,  0.3937],
         [-0.7146,  1.0297,  1.1217,  ...,  0.0331,  0.2382, -0.1563],
         [-0.2352,  1.1353,  0.8594,  ..., -0.4310, -0.0272, -0.2968],
         ...,
         [-0.0000,  0.0000,  0.0000,  ...,  0.0000, -0.0000,  0.0000],
         [-0.0000,  0.0000,  0.0000,  ...,  0.0000, -0.0000,  0.0000],
         [-0.0000,  0.0000,  0.0000,  ...,  0.0000, -0.0000,  0.0000]],

        [[-0.7576,  0.8399, -0.3792,  ...,  0.1271,  1.2514,  0.1365],
(output truncated)

In [12]:
# first half of mean pooling: sum over the token axis (dim 1)
# padded positions were zeroed out, so they add nothing to the sum
summed = torch.sum(masked_embeddings, 1)
summed.shape

torch.Size([6, 768])

In [25]:
# number of non-padding tokens per sentence, repeated across the 768 features
# clamp guards against division by zero for an all-padding row
counts = torch.clamp(mask.sum(1), min=1e-9)
counts.shape

torch.Size([6, 768])

In [26]:
counts

tensor([[15., 15., 15.,  ..., 15., 15., 15.],
        [22., 22., 22.,  ..., 22., 22., 22.],
        [15., 15., 15.,  ..., 15., 15., 15.],
        [16., 16., 16.,  ..., 16., 16., 16.],
        [12., 12., 12.,  ..., 12., 12., 12.],
        [14., 14., 14.,  ..., 14., 14., 14.]])

In [29]:
mean_pooled = summed / counts
mean_pooled.shape

torch.Size([6, 768])
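The mask, sum, count, and divide steps can be collected into one function. This is a minimal NumPy sketch of masked mean pooling, not the library's own implementation; the name `mean_pool` and the toy values are assumptions for illustration. It keeps the mask at shape (batch, tokens, 1) and relies on broadcasting instead of expanding it, which gives the same result.

```python
import numpy as np


def mean_pool(embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    embeddings: (batch, tokens, features); attention_mask: (batch, tokens).
    """
    mask = attention_mask[..., None].astype(embeddings.dtype)  # (batch, tokens, 1)
    summed = (embeddings * mask).sum(axis=1)                    # (batch, features)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)              # (batch, 1), never zero
    return summed / counts


# Toy batch: 1 sentence, 3 token positions, 2 features; the last position is padding
embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])

pooled = mean_pool(embeddings, attention_mask)
print(pooled)  # [[2. 3.]]
```

The large values in the padded position do not affect the result, which is the whole point of masking before averaging.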

In [28]:
mean_pooled

tensor([[ 0.0745,  0.8637,  0.1795,  ...,  0.7734,  1.7247, -0.1803],
        [-0.3715,  0.9729,  1.0840,  ..., -0.2552, -0.2759,  0.0358],
        [-0.5030,  0.7950, -0.1240,  ...,  0.1441,  0.9704, -0.1791],
        [-0.0132,  0.9773,  1.4516,  ..., -0.8462, -1.4004, -0.4118],
        [-0.2019,  0.0597,  0.8603,  ..., -0.0100,  0.8431, -0.0841],
        [-0.2131,  1.0175, -0.8833,  ...,  0.7371,  0.1947, -0.3011]],
       grad_fn=<DivBackward0>)

In [17]:
from sklearn.metrics.pairwise import cosine_similarity
# convert from PyTorch tensor to numpy array
mean_pooled = mean_pooled.detach().numpy()

# compare the first sentence with the other five
cosine_similarity(
    [mean_pooled[0]],
    mean_pooled[1:]
)

array([[0.3308892 , 0.7219259 , 0.17475471, 0.44709635, 0.5548363 ]],
      dtype=float32)

These similarities translate to:

| Index | Sentence | Similarity |
| --- | --- | --- |
| 1 | "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go." | 0.3309 |
| 2 | "The person box was packed with jelly many dozens of months later." | 0.7219 |
| 3 | "Standing on one's head at job interviews forms a lasting impression." | 0.1748 |
| 4 | "It took him a month to finish the meal." | 0.4471 |
| 5 | "He found a leprechaun in his walnut shell." | 0.5548 |


So, as intended, the most similar sentence is the one at index **2**, which expresses the same meaning as our first sentence without reusing its words. For reference, the first sentence is:

`"Three years later, the coffin was still full of Jello."`
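Cosine similarity itself is also easy to write from scratch: it is the dot product of two vectors divided by the product of their lengths. Below is a small NumPy sketch; the function name `cosine_sim` and the toy vectors are made up for illustration, and it assumes no vector is all zeros.

```python
import numpy as np


def cosine_sim(query, candidates):
    """Cosine similarity between one vector and each row of a 2-D array."""
    query = np.asarray(query, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    dots = candidates @ query                                        # one dot product per row
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
    return dots / norms


query = [1.0, 0.0]
candidates = [[2.0, 0.0],    # same direction as the query
              [0.0, 3.0],    # orthogonal to the query
              [-1.0, 0.0]]   # opposite direction

scores = cosine_sim(query, candidates)
print(scores)                  # similarities of 1, 0 and -1
print(int(np.argmax(scores)))  # 0, the index of the most similar candidate
```

Because the score depends only on direction, the first candidate scores 1 even though it is twice as long as the query.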

# Using the `sentence-transformers` library

In [18]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')
sentence_embeddings = model.encode(sentences)
sentence_embeddings



array([[ 0.07446156,  0.86369616,  0.17946291, ...,  0.77344   ,
         1.7247493 , -0.1802747 ],
       [-0.37146357,  0.97290134,  1.0839922 , ..., -0.25521314,
        -0.27593705,  0.03575896],
       [-0.50298285,  0.79498583, -0.12402609, ...,  0.14406338,
         0.9703752 , -0.179116  ],
       [-0.01324293,  0.97728604,  1.4515941 , ..., -0.84616524,
        -1.4004319 , -0.41184407],
       [-0.20192575,  0.05970386,  0.8602744 , ..., -0.01000801,
         0.84306234, -0.08407753],
       [-0.21311863,  1.017493  , -0.88327694, ...,  0.7371028 ,
         0.1946914 , -0.30111343]], dtype=float32)

In [19]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(
    [sentence_embeddings[0]],
    sentence_embeddings[1:]
)

array([[0.3308892 , 0.7219259 , 0.17475471, 0.44709635, 0.5548363 ]],
      dtype=float32)

These similarities are the same values we calculated before:

| Index | Sentence | Similarity (before) | New similarity |
| --- | --- | --- | --- |
| 1 | "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go." | 0.3309 | 0.3309 |
| 2 | "The person box was packed with jelly many dozens of months later." | 0.7219 | 0.7219 |
| 3 | "Standing on one's head at job interviews forms a lasting impression." | 0.1748 | 0.1748 |
| 4 | "It took him a month to finish the meal." | 0.4471 | 0.4471 |
| 5 | "He found a leprechaun in his walnut shell." | 0.5548 | 0.5548 |

So, using `sentence-transformers` can make life much easier. But either option produces the same outcome.
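One way to check that the two approaches agree is to compare the embeddings numerically rather than by eye. The sketch below does this with the first three values of row 0, copied from the two outputs printed in this notebook. The tolerance of `1e-4` is my own choice, to allow for the rounding in the printed tensor.

```python
import numpy as np

# First three values of row 0, copied from the two outputs shown in this notebook
manual = np.array([0.0745, 0.8637, 0.1795])               # mean_pooled (printed to 4 decimals)
library = np.array([0.07446156, 0.86369616, 0.17946291])  # sentence_embeddings

# atol=1e-4 allows for the rounding in the printed tensor
print(np.allclose(manual, library, atol=1e-4))  # True
```

In a live session you could pass the full `mean_pooled` and `sentence_embeddings` arrays to the same call.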