# Building Dense Vectors Using Transformers

We will be using the [`sentence-transformers/stsb-distilbert-base`](https://huggingface.co/sentence-transformers/stsb-distilbert-base) model to build our dense vectors.

In [1]:
from transformers import AutoTokenizer, AutoModel
import torch

First we initialize our model and tokenizer:

In [2]:
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/stsb-distilbert-base')
model = AutoModel.from_pretrained('sentence-transformers/stsb-distilbert-base')

Then we tokenize a sentence just as we have done before, padding (or truncating) it to a fixed length of 128 tokens:

In [4]:
text = "hello world what a time to be alive!"

tokens = tokenizer(text, max_length=128,
                   truncation=True, padding='max_length',
                   return_tensors='pt')

We process these tokens through our model:

In [14]:
outputs = model(**tokens)
outputs

BaseModelOutput(last_hidden_state=tensor([[[-0.9489,  0.6905, -0.2188,  ...,  0.0161,  0.5874, -0.1449],
         [-0.6643,  1.1984, -0.1346,  ...,  0.4839,  0.6338, -0.5003],
         [-0.3289,  0.6412,  0.2473,  ..., -0.0965,  0.4298,  0.0515],
         ...,
         [-0.7853,  0.8094, -0.2639,  ...,  0.2177,  0.3335,  0.1107],
         [-0.7528,  0.6285, -0.0088,  ...,  0.1024,  0.4585,  0.1720],
         [-1.0754,  0.4878, -0.3458,  ...,  0.2764,  0.5604,  0.1236]]],
       grad_fn=<NativeLayerNormBackward>), hidden_states=None, attentions=None)
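The `grad_fn=<NativeLayerNormBackward>` in the output above shows that autograd is tracking the forward pass. When the vectors are only needed for inference, one common option is to run the model inside `torch.no_grad()`. The sketch below is an illustration only: it uses a small `torch.nn.Linear` layer (named `stand_in_model`, a hypothetical stand-in) in place of the transformer so that it runs without downloading anything.

```python
import torch

# Hypothetical stand-in for the transformer: any module with trainable weights
# behaves the same way with respect to gradient tracking.
stand_in_model = torch.nn.Linear(4, 3)
inputs = torch.randn(1, 4)

# Default call: autograd records the operation, so the output carries a grad_fn.
tracked = stand_in_model(inputs)

# Inference-only call: no graph is recorded inside the no_grad block.
with torch.no_grad():
    untracked = stand_in_model(inputs)

print(tracked.requires_grad, untracked.requires_grad)  # True False
```

The values of the two outputs are identical; only the gradient bookkeeping differs.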

The dense vector representations of our `text` (one vector per token) are contained within the `last_hidden_state` tensor of `outputs`, which we access like so:

In [15]:
embeddings = outputs.last_hidden_state
embeddings

tensor([[[-0.9489,  0.6905, -0.2188,  ...,  0.0161,  0.5874, -0.1449],
         [-0.6643,  1.1984, -0.1346,  ...,  0.4839,  0.6338, -0.5003],
         [-0.3289,  0.6412,  0.2473,  ..., -0.0965,  0.4298,  0.0515],
         ...,
         [-0.7853,  0.8094, -0.2639,  ...,  0.2177,  0.3335,  0.1107],
         [-0.7528,  0.6285, -0.0088,  ...,  0.1024,  0.4585,  0.1720],
         [-1.0754,  0.4878, -0.3458,  ...,  0.2764,  0.5604,  0.1236]]],
       grad_fn=<NativeLayerNormBackward>)

In [16]:
embeddings.shape

torch.Size([1, 128, 768])

After we have produced our token-level dense vectors `embeddings`, we need to perform a *mean pooling* operation on them to create a single vector encoding (the **sentence embedding**). To do this, we multiply each value in our `embeddings` tensor by its respective `attention_mask` value, so that padding tokens are ignored when we take the mean.
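The arithmetic of a masked mean can be worked through by hand on a toy example. The sketch below uses plain Python lists and invented values (three token vectors of size 2, the last of which is padding), not real model output.

```python
# Toy values invented for illustration: 3 token vectors, each of size 2.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]  # the last position is padding

dim = len(token_vectors[0])

# Multiply each vector by its mask value and accumulate per dimension.
summed = [0.0] * dim
for vec, m in zip(token_vectors, attention_mask):
    for i in range(dim):
        summed[i] += vec[i] * m

# Divide by the number of real tokens (guarded against division by zero).
real_tokens = max(sum(attention_mask), 1e-9)
mean_pooled = [s / real_tokens for s in summed]

# For comparison: a plain mean that wrongly includes the padding vector.
naive_mean = [sum(col) / len(token_vectors) for col in zip(*token_vectors)]

print(mean_pooled)  # [2.0, 3.0]
```

The naive mean is pulled toward the padding vector, which is why the mask is applied before averaging.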

To perform this operation, we first extract our `attention_mask` tensor and check its shape:

In [17]:
attention_mask = tokens['attention_mask']
attention_mask.shape

torch.Size([1, 128])

The mask holds a single value per token, so we add a trailing dimension and expand it to match the shape of `embeddings`. Each token then has a vector of size 768 representing its *attention_mask* status. We then multiply the two tensors to apply the attention mask:

In [25]:
mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
mask.shape

torch.Size([1, 128, 768])

In [36]:
masked_embeddings = embeddings * attention_mask.unsqueeze(-1)
masked_embeddings.shape

torch.Size([1, 128, 768])

In [37]:
masked_embeddings

tensor([[[-0.9489,  0.6905, -0.2188,  ...,  0.0161,  0.5874, -0.1449],
         [-0.6643,  1.1984, -0.1346,  ...,  0.4839,  0.6338, -0.5003],
         [-0.3289,  0.6412,  0.2473,  ..., -0.0965,  0.4298,  0.0515],
         ...,
         [-0.0000,  0.0000, -0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [-0.0000,  0.0000, -0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [-0.0000,  0.0000, -0.0000,  ...,  0.0000,  0.0000,  0.0000]]],
       grad_fn=<MulBackward0>)
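The cells above build an explicitly expanded `mask` but multiply `embeddings` by `attention_mask.unsqueeze(-1)`. Thanks to PyTorch broadcasting, the two give the same result. A minimal sketch on toy tensors (values invented for illustration) shows the equivalence:

```python
import torch

# Toy batch: 1 sentence, 3 tokens, hidden size 2. The last token is padding.
embeddings = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])  # (1, 3, 2)
attention_mask = torch.tensor([[1, 1, 0]])                         # (1, 3)

# Add a trailing dimension so the mask lines up with the token axis.
column = attention_mask.unsqueeze(-1)                # (1, 3, 1)

# Explicitly repeat the mask value across the hidden dimension.
expanded = column.expand(embeddings.size()).float()  # (1, 3, 2)

# Broadcasting stretches the size-1 dimension automatically.
via_broadcast = embeddings * column
via_expand = embeddings * expanded

print(torch.equal(via_broadcast, via_expand))  # True
```

The expanded mask is still useful later, because summing it along the token axis gives the per-position token counts used as the divisor.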

In [38]:
mask[0][0].shape

torch.Size([768])

Then we sum the masked embeddings along axis `1` (the token axis):

In [39]:
summed = torch.sum(masked_embeddings, 1)
summed.shape

torch.Size([1, 768])

Then we count the values that must be given attention in each position of the tensor, clamping the count to a tiny minimum so that we never divide by zero:

In [40]:
summed_mask = torch.clamp(mask.sum(1), min=1e-9)
summed_mask.shape

torch.Size([1, 768])

In [41]:
summed_mask

tensor([[11., 11., 11.,  ..., 11., 11., 11.]])

Finally, we calculate the mean as the sum of the embedding activations `summed` divided by the number of values that should be given attention in each position `summed_mask`:

In [42]:
mean_pooled = summed / summed_mask

In [44]:
mean_pooled.shape

torch.Size([1, 768])
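The steps above can be collected into a single helper. The sketch below is one possible arrangement; the function name `mean_pool` is hypothetical, and random tensors stand in for real model output so the block runs without downloading the model.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over real (non-padding) tokens only."""
    # Expand the mask from (batch, tokens) to (batch, tokens, hidden).
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    # Zero out padding positions, then sum along the token axis.
    summed = torch.sum(last_hidden_state * mask, 1)
    # Count real tokens per position; clamp to avoid division by zero.
    counts = torch.clamp(mask.sum(1), min=1e-9)
    return summed / counts

# Random stand-in for model output: 2 sentences, 5 tokens, hidden size 8.
torch.manual_seed(0)
hidden = torch.randn(2, 5, 8)
attn = torch.tensor([[1, 1, 1, 0, 0],   # 3 real tokens, 2 padding
                     [1, 1, 1, 1, 1]])  # no padding

pooled = mean_pool(hidden, attn)
print(pooled.shape)  # torch.Size([2, 8])
```

With the notebook's variables, the equivalent call would be `mean_pool(outputs.last_hidden_state, tokens['attention_mask'])`.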

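And that is how we calculate our dense vector: a single 768-dimensional sentence embedding that can be compared with other sentence embeddings to measure similarity.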
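Sentence embeddings like this are commonly compared with cosine similarity. The sketch below is an illustration rather than part of the original walkthrough: it uses small invented vectors in place of real mean-pooled embeddings and PyTorch's `torch.nn.functional.cosine_similarity`.

```python
import torch
import torch.nn.functional as F

# Invented stand-ins for mean-pooled sentence embeddings of shape (1, dim).
vec_a = torch.tensor([[1.0, 2.0, 3.0]])
vec_b = torch.tensor([[2.0, 4.0, 6.0]])   # same direction as vec_a
vec_c = torch.tensor([[3.0, 0.0, -1.0]])  # orthogonal to vec_a

# Cosine similarity along the embedding dimension; one score per row.
sim_ab = F.cosine_similarity(vec_a, vec_b, dim=1)
sim_ac = F.cosine_similarity(vec_a, vec_c, dim=1)

print(sim_ab.item(), sim_ac.item())
```

Vectors pointing in the same direction score close to 1, while orthogonal vectors score close to 0.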