<a href="https://colab.research.google.com/github/Azizkhaled/NLP_with_Aziz/blob/main/Similarity_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
pip install transformers

## Building Dense Vectors Using Transformers

We will use the [sentence-transformers/stsb-distilbert-base](https://huggingface.co/sentence-transformers/stsb-distilbert-base) model from Hugging Face to build our dense vectors.

### Initialize our model and tokenizer:

In [4]:
from transformers import AutoTokenizer, AutoModel
import torch

In [5]:
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/stsb-distilbert-base')
model = AutoModel.from_pretrained('sentence-transformers/stsb-distilbert-base')

### Tokenize a sentence

In [6]:
text = 'Hello universe, glad to be alive'

In [7]:
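# Pad or truncate to 128 tokens and return PyTorch tensors.
# Calling the tokenizer directly replaces the older encode_plus method.
tokens = tokenizer(text, max_length=128,
                   truncation=True, padding='max_length',
                   return_tensors='pt')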

In [8]:
outputs = model(**tokens)
outputs

BaseModelOutput(last_hidden_state=tensor([[[-0.7525,  0.4197, -0.7203,  ...,  0.5391,  0.3917, -0.4467],
         [-0.7132,  0.6708, -0.4577,  ...,  0.7561,  0.5422, -0.6433],
         [-0.4035,  0.6712,  0.0242,  ...,  0.5391,  0.2289, -0.0792],
         ...,
         [-1.0886,  0.3778, -0.6074,  ...,  0.5937,  0.4756, -0.2495],
         [-0.9791,  0.2046, -0.6026,  ...,  0.8414,  0.4806, -0.2560],
         [-0.5645,  0.2576, -0.6433,  ...,  0.6893,  0.2765, -0.0449]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)

### 1. Get the dense vector embeddings

The dense vector representations of our text are contained in the last_hidden_state tensor of outputs, which we access like so:

In [12]:
embeddings = outputs.last_hidden_state
embeddings

tensor([[[-0.7525,  0.4197, -0.7203,  ...,  0.5391,  0.3917, -0.4467],
         [-0.7132,  0.6708, -0.4577,  ...,  0.7561,  0.5422, -0.6433],
         [-0.4035,  0.6712,  0.0242,  ...,  0.5391,  0.2289, -0.0792],
         ...,
         [-1.0886,  0.3778, -0.6074,  ...,  0.5937,  0.4756, -0.2495],
         [-0.9791,  0.2046, -0.6026,  ...,  0.8414,  0.4806, -0.2560],
         [-0.5645,  0.2576, -0.6433,  ...,  0.6893,  0.2765, -0.0449]]],
       grad_fn=<NativeLayerNormBackward0>)

In [11]:
embeddings.shape

torch.Size([1, 128, 768])

### 2. Perform *mean pooling*

The embeddings tensor holds one 768-dimensional vector for each of the 128 token positions, so we need a mean pooling operation to collapse them into a single vector encoding (the sentence embedding). To do this, we multiply each value in our embeddings tensor by its respective attention_mask value, so that padding tokens are ignored when we average.
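To make the idea concrete before working through the real tensors, here is a minimal sketch on a toy example. The tensors `toy_embeddings` and `toy_attention_mask` are made up for illustration and are not produced by the model; the third position plays the role of a padding token.

```python
import torch

# Toy "token embeddings": 1 sentence, 3 token positions, 2 dimensions.
# The third position stands in for a padding token with arbitrary values.
toy_embeddings = torch.tensor([[[1.0, 2.0],
                                [3.0, 4.0],
                                [100.0, 100.0]]])
toy_attention_mask = torch.tensor([[1, 1, 0]])  # 1 = real token, 0 = padding

# Zero out the padding position, then divide by the number of real tokens.
toy_mask = toy_attention_mask.unsqueeze(-1).float()  # shape [1, 3, 1]
masked_mean = (toy_embeddings * toy_mask).sum(1) / toy_mask.sum(1)

# A plain mean over all positions lets the padding row distort the result.
naive_mean = toy_embeddings.mean(1)

print(masked_mean)  # tensor([[2., 3.]])
print(naive_mean)
```

The masked mean averages only the two real rows, while the naive mean is pulled far away by the padding row. The steps below do the same thing on the full [1, 128, 768] tensor.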

#### a. Get our attention_mask tensor

To perform mean pooling, we first retrieve our attention_mask tensor and check its shape:

In [13]:
attention_mask = tokens['attention_mask']
attention_mask.shape

torch.Size([1, 128])

#### b. Expand our attention mask

We need to expand our attention mask to the same size as our embeddings.

In [14]:
mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
mask.shape

torch.Size([1, 128, 768])

In [18]:
attention_mask, attention_mask.shape

(tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0]]),
 torch.Size([1, 128]))

In [20]:
mask, mask[0][0].shape

(tensor([[[1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          ...,
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.]]]),
 torch.Size([768]))
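The unsqueeze and expand calls are easier to follow on a small tensor. The sketch below uses made-up toy shapes (3 positions, 4 dimensions) rather than the notebook's tensors, and also checks that relying on PyTorch broadcasting with only the unsqueezed mask gives the same product as the fully expanded mask.

```python
import torch

toy_attention_mask = torch.tensor([[1, 1, 0]])  # shape [1, 3]
toy_embeddings = torch.ones(1, 3, 4)            # shape [1, 3, 4]

# unsqueeze(-1) adds a trailing axis of size 1: [1, 3] -> [1, 3, 1]
step1 = toy_attention_mask.unsqueeze(-1)
# expand repeats that axis to match the embeddings: [1, 3, 1] -> [1, 3, 4]
step2 = step1.expand(toy_embeddings.size())
toy_mask = step2.float()

print(step1.shape, step2.shape)
print(toy_mask[0])

# Broadcasting handles the size-1 axis automatically, so multiplying by the
# unsqueezed mask gives the same result as multiplying by the expanded mask.
same = torch.equal(toy_embeddings * toy_mask, toy_embeddings * step1)
print(same)
```

Each row of the expanded mask is all ones for a real token and all zeros for a padding token, which is exactly the pattern in the output above. The expanded form is still needed later in this notebook, because the per-dimension token counts are taken from it.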

#### c. Multiply the two tensors to apply the attention mask

In [21]:
masked_embeddings = embeddings * mask
masked_embeddings.shape

torch.Size([1, 128, 768])

In [22]:
masked_embeddings

tensor([[[-0.7525,  0.4197, -0.7203,  ...,  0.5391,  0.3917, -0.4467],
         [-0.7132,  0.6708, -0.4577,  ...,  0.7561,  0.5422, -0.6433],
         [-0.4035,  0.6712,  0.0242,  ...,  0.5391,  0.2289, -0.0792],
         ...,
         [-0.0000,  0.0000, -0.0000,  ...,  0.0000,  0.0000, -0.0000],
         [-0.0000,  0.0000, -0.0000,  ...,  0.0000,  0.0000, -0.0000],
         [-0.0000,  0.0000, -0.0000,  ...,  0.0000,  0.0000, -0.0000]]],
       grad_fn=<MulBackward0>)

#### d. Sum the remaining embeddings along axis 1

In [23]:
summed = torch.sum(masked_embeddings, 1)
summed.shape

torch.Size([1, 768])

In [24]:
summed_mask = torch.clamp(mask.sum(1), min=1e-9)
summed_mask.shape

torch.Size([1, 768])
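Summing the mask along axis 1 counts the real tokens for each dimension, and torch.clamp keeps that count from ever being exactly zero. The sketch below illustrates why that matters, using a made-up batch in which the second row is all padding (a hypothetical edge case that does not occur in this notebook's single sentence).

```python
import torch

# Two "sentences", 3 positions, 2 dimensions; the second row is all padding.
toy_mask = torch.tensor([[1, 1, 0],
                         [0, 0, 0]]).unsqueeze(-1).expand(2, 3, 2).float()

token_counts = toy_mask.sum(1)                     # real tokens per dimension
safe_counts = torch.clamp(token_counts, min=1e-9)  # never exactly zero

# Stand-in for the masked sums: the all-padding row sums to zero.
toy_summed = torch.zeros(2, 2)
toy_summed[0] = torch.tensor([4.0, 6.0])

unsafe_mean = toy_summed / token_counts  # 0 / 0 gives nan in the second row
safe_mean = toy_summed / safe_counts     # second row stays at zero

print(token_counts)
print(unsafe_mean)
print(safe_mean)
```

For rows that contain at least one real token the clamp changes nothing, since the count is already 1 or more.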

#### e. Calculate the mean

In [25]:
mean_pooled = summed / summed_mask

And that is how we calculate our dense vector for measuring similarity (the sentence embedding).
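Steps a through e can be collected into one reusable function. This is a minimal sketch, not part of the original notebook: the name `mean_pool` is hypothetical, and the random `fake_hidden` tensor and `fake_mask` stand in for real model outputs so that the example runs without downloading the model.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor,
              attention_mask: torch.Tensor) -> torch.Tensor:
    """Masked mean over the token axis; returns shape [batch, hidden]."""
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    summed = torch.sum(last_hidden_state * mask, 1)
    summed_mask = torch.clamp(mask.sum(1), min=1e-9)
    return summed / summed_mask

torch.manual_seed(0)
fake_hidden = torch.randn(2, 5, 8)  # 2 sentences, 5 positions, 8 dimensions
fake_mask = torch.tensor([[1, 1, 1, 0, 0],   # 3 real tokens, 2 padding
                          [1, 1, 1, 1, 1]])  # no padding

pooled = mean_pool(fake_hidden, fake_mask)
print(pooled.shape)  # torch.Size([2, 8])

# Cross-check: the first row should match the mean of its 3 real positions.
matches = torch.allclose(pooled[0], fake_hidden[0, :3].mean(0), atol=1e-5)
print(matches)
```

With the notebook's own variables, calling `mean_pool(outputs.last_hidden_state, tokens['attention_mask'])` should give the same tensor as `mean_pooled`.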

In [26]:
mean_pooled

tensor([[-3.8108e-01,  5.8393e-01, -5.3531e-01, -1.0056e+00, -7.8539e-01,
          2.3249e-01, -1.8562e-01,  4.8350e-01, -7.9765e-01, -4.1529e-01,
         -2.0013e-01,  1.2518e-01, -3.9587e-01,  5.9414e-01, -9.2893e-02,
         -2.7468e-01,  7.0809e-01,  3.7565e-02, -8.2779e-01,  3.0793e-01,
         -8.3559e-02, -4.4371e-01, -1.3940e-01, -1.2848e-01, -7.6873e-01,
         -2.9797e-01,  7.9975e-01,  4.9578e-01,  1.1253e-01,  2.9063e-01,
         -7.0664e-01, -5.8204e-01,  2.4675e-01,  8.6831e-02, -5.0094e-01,
          1.2004e+00, -2.1038e+00, -2.3079e-01, -2.4776e-01, -7.4685e-01,
          6.2815e-01, -1.1186e-01,  9.1576e-01, -9.7953e-02, -1.4730e+00,
          2.0458e-01, -9.5463e-01, -7.6667e-02, -4.0136e-01,  7.5325e-01,
          3.4341e-01, -3.4113e-01, -2.0175e-01, -4.9423e-01,  2.3378e-01,
          5.1864e-01, -1.8927e+00,  1.1531e+00,  8.5056e-01, -2.2447e-01,
         -3.1461e-01, -9.9360e-01,  9.3596e-01,  6.8447e-01,  1.5344e-01,
          5.5445e-02, -3.8706e-01, -4.