# An introduction to inference with BERT

This notebook gives an example of how [BERT](https://arxiv.org/abs/1810.04805) can be used to extract
sentence embeddings while at the same time giving some information about the model. Note that it does not try
to be exhaustive. In some places, links are given as suggestions for further reading. Also note that these days,
BERT isn't state of the art anymore. However, the methodology used here can be applied to other models such as RoBERTa
with minimal changes. Be careful, though, because the differences between model APIs, however small, are incredibly
important. For instance, the position of the classification token is not the same for all models. Read the paper,
the documentation, or - if you're up for it - the source code! The latter might be a challenge at first, but you 
learn a lot from it.

In [1]:
import torch
from transformers import BertModel, BertTokenizer



## The tokenizer

A deep learning model works with tensors. Tensors are (basically) vectors. Vectors are (basically) a bunch of
numbers. To get started, then, the input text (string) needs to be converted into some data type (numbers)
that the model can use. This is done by the tokenizer.

In [2]:
# Initialize the tokenizer with a pretrained model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

During pretraining, the tokenizer has been "trained" as well. It has generated a vocabulary that it "knows". Each word 
has been assigned an index (a number) and that number can then be used in the model. To counter the annoying problem of 
words that the tokenizer doesn't know yet (out-of-vocabulary or OOV), a special technique is used that ensures that the
tokenizer has learnt "subword units". That should mean that when using the pretrained models, you won't run into OOV
problems. When the tokenizer does not recognize a word (it is not in its vocabulary) it will try to split that word up 
into smaller parts that it does know. The BERT tokenizer uses [WordPiece](https://arxiv.org/pdf/1609.08144.pdf)
to split tokens. As an example, you'll see that `granola` is split into `gran` and `##ola`, where `##` marks a piece
that continues the preceding piece rather than starting a new word.

In [3]:
# Convert the string "granola bars" to tokenized vocabulary IDs
granola_ids = tokenizer.encode('granola bars')
# Print the IDs
print('granola_ids', granola_ids)
print('type of granola_ids', type(granola_ids))
# Convert the IDs to the actual vocabulary item
# Notice how the subword unit (suffix) starts with "##" to indicate 
# that it is part of the previous string
print('granola_tokens', tokenizer.convert_ids_to_tokens(granola_ids))

granola_ids [101, 12604, 6030, 6963, 102]
type of granola_ids <class 'list'>
granola_tokens ['[CLS]', 'gran', '##ola', 'bars', '[SEP]']
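
To make the splitting behaviour more concrete, here is a minimal sketch of greedy longest-match-first subword splitting,
which is the strategy WordPiece tokenization follows at inference time. The function name `wordpiece_tokenize` and the
tiny `toy_vocab` are made up for illustration; the real tokenizer also handles punctuation, casing and very long words,
which this sketch ignores.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the "##" prefix
                candidate = "##" + candidate
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            # No known piece fits: the whole word maps to the unknown token
            return [unk_token]
        pieces.append(current)
        start = end
    return pieces


# Hypothetical toy vocabulary (assumption); the real vocabulary is far larger
toy_vocab = {"gran", "##ola", "bars", "bar", "##s"}

print(wordpiece_tokenize("granola", toy_vocab))  # ['gran', '##ola']
print(wordpiece_tokenize("bars", toy_vocab))     # ['bars']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```

Because the longest match is tried first, `bars` stays whole even though `bar` and `##s` are also in the toy vocabulary.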


You will probably have noticed the so-called "special tokens" [CLS] and [SEP]. These tokens are added automatically by
the `.encode()` method so we don't have to worry about them. The first one is a classification token which has been
pretrained. It is specifically inserted for any sort of classification task. So instead of having to average over all
tokens and use that as a sentence representation, it is recommended to just take the output of the [CLS] token, which
then represents the whole sentence. [SEP], on the other hand, is inserted as a separator between multiple instances. We
will not use that here, but it is used for things like next sentence prediction where it is a separator between the
current and the next sentence. It is especially important to remember the [CLS] token as it can play a great role in
classification and regression tasks.

We almost have the correct data type to get started. As we saw above, the data type of the token IDs is a list of
integers. In this notebook we use the `transformers` library in combination with PyTorch, which works with tensors.
A tensor is a special type of optimised list which is typically used in deep learning. To convert our token IDs to a
tensor, we can simply put the list in a tensor constructor. Here, we use a `LongTensor` which is used for integers.
For floating-point numbers, we'd typically use a `FloatTensor` or just `Tensor`. The `.encode()` method of the 
tokenizer can return a tensor instead of a list by passing the parameter `return_tensors='pt'` but for illustrative
purposes, we will do the conversion from a list to a tensor manually.

In [4]:
# Convert the list of IDs to a tensor of IDs 
granola_ids = torch.LongTensor(granola_ids)
# Print the IDs
print('granola_ids', granola_ids)
print('type of granola_ids', type(granola_ids))

granola_ids tensor([  101, 12604,  6030,  6963,   102])
type of granola_ids <class 'torch.Tensor'>
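
As a small aside, the same conversion can be written with `torch.tensor`, which infers the data type from the list. The
sketch below only assumes the five IDs printed earlier; the variable names are illustrative.

```python
import torch

ids = [101, 12604, 6030, 6963, 102]

# torch.tensor infers a 64-bit integer dtype from a list of Python ints
inferred = torch.tensor(ids)
# Being explicit about the dtype gives the same result
explicit = torch.tensor(ids, dtype=torch.long)
# The constructor used in the cell above
legacy = torch.LongTensor(ids)

print(inferred.dtype, explicit.dtype, legacy.dtype)
print(torch.equal(inferred, legacy))
```

All three tensors hold the same values with the same integer type, so which spelling you use is mostly a matter of taste.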


## The model
Now that we have preprocessed our input string into a tensor of IDs, we can feed this to the model. Remember that the 
IDs are the IDs of a token in the tokenizer's vocabulary. The model "knows" which words are being processed because it
"knows" which token belongs to which ID. In BERT, and in most - if not all - current transformer language models, the
first layer is an embedding layer. Each token ID has an embedding assigned to it. In BERT, the embeddings are the sum
of three types of embeddings: the token embedding, the segment embedding, and the position embedding. The token
embedding is a vector for the given token, the segment embedding indicates whether the segment is the first or the
(optional) second one, and the position embedding distinguishes the position in the input. Below you find a figure
from the BERT paper. (See how "playing" is split into "play" and "##ing"?) Note that in our case, where we just use BERT
for inference of a single sentence, the segment embedding is of no importance. For more information, see
[this Medium article](https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a).  We'll come back to the model architecture later on.

![BERT embeddings visualization](img/bert-embeddings.png)

(For a very detailed and visual explanation of the whole BERT model, have a look at the explanations on
[Jay Alammar's homepage](http://jalammar.github.io/). In particular the "Illustrated transformer" is very interesting.)
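
The sum of the three embedding types can be sketched with plain `torch.nn.Embedding` layers. This is a toy illustration
under assumed sizes and random weights, not the actual `transformers` implementation; all variable names are made up.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy sizes (assumptions); the pretrained model is much larger
vocab_size, max_positions, n_segments, hidden_size = 100, 16, 2, 8

token_emb = nn.Embedding(vocab_size, hidden_size)
position_emb = nn.Embedding(max_positions, hidden_size)
segment_emb = nn.Embedding(n_segments, hidden_size)
layer_norm = nn.LayerNorm(hidden_size)

input_ids = torch.tensor([[1, 42, 7, 99, 2]])        # (batch_size, seq_len)
seq_len = input_ids.size(1)
position_ids = torch.arange(seq_len).unsqueeze(0)    # [[0, 1, 2, 3, 4]]
segment_ids = torch.zeros_like(input_ids)            # single sentence: all segment 0

# Element-wise sum of the three embeddings, followed by layer normalisation
summed = token_emb(input_ids) + position_emb(position_ids) + segment_emb(segment_ids)
embeddings = layer_norm(summed)
print(embeddings.shape)  # torch.Size([1, 5, 8])
```

Every token ends up with a single vector of size `hidden_size`, regardless of how many embedding types were added up.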

To get started, we first need to initialize the model. Just like the tokenizer, the model is pretrained which makes it
very easy for us to just use the pretrained language model to get some token or sentence representations out of it.
Note how we use the same pretrained model as the tokenizer uses (`bert-base-uncased`). This is the smaller BERT model
that has been trained on lower case text. Because the model has been trained on lower case text, it does not know cased
text. You may have noticed that the tokenizer automatically lowercases the text for us. Whether to use a cased or
uncased language model really depends on the task. If you think that casing matters (e.g. for NER), you may want to
opt for a cased model, otherwise casing might just add noise.

In the example below, an additional argument has been given to the model initialisation. `output_hidden_states` will
give us more output information. In the version of `transformers` used here, a `BertModel` returns a tuple but the
contents of that tuple differ depending on the configuration of the model. (More recent versions return a
dictionary-like output object by default, which can still be indexed in the same order.) When passing
`output_hidden_states=True`, the output will contain (in order; shape in brackets):

1. the last hidden state `(batch_size, sequence_length, hidden_size)`
2. the `pooler_output` of the classification token `(batch_size, hidden_size)`
3. the `hidden_states`, a tuple holding the initial embedding output followed by the output of each layer of the model,
   each of shape `(batch_size, sequence_length, hidden_size)`
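
To check these shapes without downloading any weights, one option is to build a tiny, randomly initialised `BertModel`
from a `BertConfig`. The sizes below are arbitrary assumptions chosen to keep the model small, and the outputs are
meaningless as embeddings; the sketch only illustrates the structure of the output.

```python
import torch
from transformers import BertConfig, BertModel

# Toy configuration (assumption): small sizes, random weights, no download
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    output_hidden_states=True,
)
toy_model = BertModel(config)
toy_model.eval()

toy_ids = torch.tensor([[1, 5, 6, 7, 2]])  # (batch_size=1, sequence_length=5)
with torch.no_grad():  # inference only
    toy_out = toy_model(input_ids=toy_ids)

# Index-based access works for both tuple and dictionary-like outputs
last_hidden_state, pooler_output, toy_hidden_states = toy_out[0], toy_out[1], toy_out[2]
print(last_hidden_state.shape)     # (batch_size, sequence_length, hidden_size)
print(pooler_output.shape)         # (batch_size, hidden_size)
print(len(toy_hidden_states))      # embedding output + one entry per layer
print(toy_hidden_states[0].shape)  # (batch_size, sequence_length, hidden_size)
```

With two layers, `toy_hidden_states` has three entries: the embedding output plus one per layer.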

Graphics cards (GPUs) are much better at doing operations on tensors than CPUs are. Therefore, we wish to run our
computations on the GPU if it is available. Note that you need to have a GPU available as well as CUDA, and a
GPU-accelerated torch version. To increase the calculation speed, we have to move our model to the correct device:
if it's available we'll move the model `.to()` the GPU, otherwise it'll stay on the CPU. It is important to remember 
that the model and the data to process need to be on the same device. This means that we will have to move our 
`granola_ids` to the same device as the model, too.

Finally, we also set the model to evaluation mode (`.eval()`) in contrast to training mode (`.train()`). In evaluation
mode, layers such as dropout and batch normalisation behave differently than during training. Dropout, for instance, is
disabled, because you only want it during training.
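
The effect of the two modes can be seen on a single dropout layer. This is a small standalone sketch, separate from the
BERT model itself; `p=0.1` is chosen to match the dropout probability that shows up in the model printout.

```python
import torch
from torch import nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.1)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)
# In training mode roughly 10% of the values are zeroed,
# and the remaining ones are scaled by 1 / (1 - p)
print((train_out == 0).sum().item())

dropout.eval()
eval_out = dropout(x)
# In evaluation mode dropout passes the input through unchanged
print(torch.equal(eval_out, x))
```

This is why forgetting `model.eval()` gives slightly different embeddings every time you run the same input.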

In [5]:
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
# Set the device to GPU (cuda) if available, otherwise stick with CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = model.to(device)
granola_ids = granola_ids.to(device)

model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

## Inference
The model has been initialized, and the input string has been converted into a tensor. A language model (such as 
`BertModel` above) has a `forward()` method that is called automatically when calling the object. The forward method 
basically pushes a given input tensor forward through the model and then returns the output. Since we're only doing
inference and not training or fine-tuning the model, this is the only step that involves the model directly to get 
output. So we don't need to optimize the model (calculating gradients, backpropagating). That's quite simple, isn't it?
One peculiarity is that we use `torch.no_grad()`. This tells PyTorch that we won't be doing any gradient
calculation/backpropagation. Ultimately, it makes inference faster and more memory-efficient. You would typically use
`model.eval()` (see above) and `torch.no_grad()` together for evaluation and testing of your model. When training, the
model should be set to `model.train()` and `torch.no_grad()` should *not* be used.
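
What `torch.no_grad()` changes can be illustrated with a single linear layer standing in for a full model (an
assumption made to keep the sketch small). Inside the context manager, PyTorch does not record the operations needed
for backpropagation.

```python
import torch
from torch import nn

layer = nn.Linear(4, 2)  # stand-in for a full model
x = torch.randn(1, 4)

# Outside no_grad, the output is tracked for gradient computation
y_tracked = layer(x)
print(y_tracked.requires_grad)

# Inside no_grad, the same computation is not tracked
with torch.no_grad():
    y_untracked = layer(x)
print(y_untracked.requires_grad)
```

The computed values are the same in both cases; only the bookkeeping for gradients is skipped.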

In the cell below, you'll see that there's a strange method called `.unsqueeze()`. It "unsqueezes" a tensor by adding 
an extra dimension. In our case, you'll see that our granola tensor of size `(5,)` turns into a different shape of
`(1, 5)` where `1` is the batch dimension (we have a single sentence). These two dimensions are required by the model:
it is designed to process *batches* of inputs. The next paragraph goes into a bit more technical detail but is not
required to understand this
notebook.

A batch consists of multiple input texts processed at "the same time" (the batch size is typically a power of two, e.g.
64). With a batch size of 64 (64 sentences at once), the input shape would be `(64, n)` where `64` is the number of
sentences, and `n` the sequence length. In this notebook, where we only ever use one input, the following is not
important, but if you ever want to fine-tune a model, you'll want to work with batches since gradient estimates tend to
be more stable with larger batches. In such cases, `n` needs to be the same for all entries; you cannot have one
sequence of 5 items and one of 12 items. That is where padding comes in - but that is a story for another day. For now,
you can remember that the input size of the model needs to be `(n_input_sentences, seq_len)` where `seq_len` can be
determined in different ways. Two popular choices are: using the longest text in the batch as `seq_len` (e.g. 12) and
padding shorter texts up to this length, or setting a fixed maximal sequence length for the model (typically 512) and
padding all items up to this length. The latter approach is easier to implement but is not memory-efficient and is
computationally heavier. The choice, as always, is yours.
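
Although padding is not needed in this notebook, here is a minimal sketch of the first strategy (padding to the longest
text in the batch) using PyTorch's `pad_sequence`. The second, shorter sequence is a made-up input, and the padding
value `0` is assumed to match the `padding_idx=0` shown in the model printout.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two tokenized inputs of different lengths; the second one is illustrative
seqs = [
    torch.tensor([101, 12604, 6030, 6963, 102]),
    torch.tensor([101, 6963, 102]),
]

# Pad every sequence with 0 up to the length of the longest one
batch = pad_sequence(seqs, batch_first=True, padding_value=0)
# 1 for real tokens, 0 for padding positions
attention_mask = (batch != 0).long()

print(batch.size())
print(batch)
print(attention_mask)
```

A mask like this is what you would pass along with the IDs (as `attention_mask`) so that the model can ignore the
padded positions.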

In [6]:
print(granola_ids.size())
# unsqueeze IDs to get batch size of 1 as added dimension
granola_ids = granola_ids.unsqueeze(0)
print(granola_ids.size())

print(type(granola_ids))
with torch.no_grad():
    out = model(input_ids=granola_ids)

# the output is a tuple
print(type(out))
# the tuple contains three elements (as explained above)
print(len(out))
# we only want the hidden_states
hidden_states = out[2]
print(len(hidden_states))

torch.Size([5])
torch.Size([1, 5])
<class 'torch.Tensor'>
<class 'tuple'>
3
13


As discussed above, `hidden_states` is a tuple of the output of each layer in the model for each token. In the previous
cell we saw that the tuple contains 13 items. When you execute the cell below, the architecture of the BertModel is
shown (from top to bottom). The `hidden_states` include the output of the `embeddings` layer and the output of
all 12 `BertLayer`s in the encoder. The output of each layer has a size of `(batch_size, sequence_length, 768)`.
In our case, that is `(1, 5, 768)` because we only have one input string (batch size of 1), and our input string was
tokenized into five IDs (sequence length of 5). `768` is the number of hidden dimensions.

The critical reader will notice that there is still one more layer after the encoder, called `pooler`, which is not
part of `hidden_states`. This layer is used to "pool" the output of the classification token but we will not use that 
here. Its output is returned in the second item of the output tuple `out`, as discussed before.

For an in-depth analysis of BERT's architecture, I'd recommend reading [the paper](https://arxiv.org/abs/1810.04805).
However, if you like a more visual explanation, [The Illustrated BERT](http://jalammar.github.io/illustrated-bert/)
might be a better place to start.

In [7]:
print(model)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

Now that we have all `hidden_states`, we may want to get a usable value out of them. Let's say that we want to retrieve a
sentence embedding by averaging over all tokens. In other words, we want to reduce the size of `(1, 5, 768)` to
`(1, 768)` where `1` is the batch size and `768` is the number of hidden dimensions. (One could also call `768` the 
features that you wish to use in another task.) There are many ways to make a sentence abstraction of tokens, and it 
often depends on the given task. Here, we will take the mean. For now, we will only use the output of the last layer in
the encoder, that is, `hidden_states[-1]`. It is important to indicate that we want to take the `torch.mean`
_over a given axis_. Since the size of the output of the layers is `(1, 5, 768)`, we want to average over the five 
tokens, which are in the second dimension (`dim=1`). 

In [8]:
sentence_embedding = torch.mean(hidden_states[-1], dim=1).squeeze()
print(sentence_embedding)
print(sentence_embedding.size())

tensor([ 2.7497e-01,  1.8313e-01, -8.8652e-02,  2.1698e-01,  3.1942e-01,
        -1.1412e-01,  7.4039e-02,  3.7655e-01, -4.1821e-01,  9.9971e-02,
        -9.0241e-02, -2.4298e-01,  1.5542e-01,  4.2042e-01, -2.5547e-01,
         2.9753e-01, -2.9643e-01, -2.5810e-02,  8.5306e-02,  1.0182e-01,
         3.0401e-01, -4.4263e-01,  3.1249e-02,  1.4435e-01,  3.0189e-01,
         7.3913e-02, -2.5580e-01,  3.1384e-01, -1.4688e-01, -1.5202e-01,
         7.0785e-02,  4.0448e-01, -1.1769e-01,  3.1848e-01,  2.8022e-02,
        -1.6934e-01,  3.5639e-01, -2.2931e-01, -1.1899e-01, -1.1182e-01,
        -1.6003e-01,  7.9355e-02,  5.1107e-01,  5.2223e-02, -1.5481e-01,
         2.8229e-02, -1.4365e-01, -4.7737e-01, -5.6638e-01, -4.8802e-01,
        -1.1429e-01,  2.8087e-01, -5.7160e-02,  2.3862e-01,  3.5440e-01,
         5.8237e-01,  1.2777e-01,  1.0363e-01,  3.0538e-01,  2.0989e-01,
         1.1693e-01,  2.6346e-01, -1.5832e-01, -1.1380e-01,  1.7189e-02,
        -3.4662e-02,  1.1470e-01,  3.2023e-02, -1.9
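
One caveat if you move from a single sentence to a padded batch: a plain `torch.mean` over the token dimension would
also average over the padding positions. A possible way around this is a masked mean, sketched below with random toy
values in place of real hidden states; the variable names and sizes are assumptions for illustration.

```python
import torch

torch.manual_seed(0)
# Toy stand-in for a layer output: (batch_size, seq_len, hidden_size)
hidden = torch.randn(2, 5, 8)
# Second sentence has three real tokens and two padding positions
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])

mask = attention_mask.unsqueeze(-1).float()  # (2, 5, 1), broadcasts over hidden_size
summed = (hidden * mask).sum(dim=1)          # padded positions contribute zero
counts = mask.sum(dim=1)                     # number of real tokens per sentence, (2, 1)
masked_mean = summed / counts                # (2, 8)
print(masked_mean.size())
```

For a sentence without padding, this gives exactly the same result as `torch.mean(hidden, dim=1)`.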

**We now have a vector of 768 features representing our input sentence.** But we can do more! The BERT paper reports
that, when BERT is used as a feature extractor, the best results were reached by concatenating the output of the last
four layers.

![BERT embeddings visualization](img/bert-feature-extraction-contextualized-embeddings.png)

In our example, that means that we need to get the last four layers of `hidden_states` and concatenate them, after
which we can take the mean. We want
to concatenate across the axis of the hidden dimensions of `768`. As a consequence, our concatenated output vector will
be of size `(1, 5, 3072)` where `3072=4*768`, i.e. the concatenation of four layers with a hidden dimension of 768. The
concatenated vector is much larger than the output of only a single layer, meaning that it contains a lot more features.
Do note, as usual, that it depends on your specific task whether these `3072` features perform better than `768`.

Having a vector of shape `(1, 5, 3072)`, we still need to take the mean over the token dimension, as we did before. We
end up with one feature vector of size `(3072,)`. 

In [9]:
# get last four layers
last_four_layers = [hidden_states[i] for i in (-1, -2, -3, -4)]
# cast layers to a tuple and concatenate over the last dimension
cat_hidden_states = torch.cat(tuple(last_four_layers), dim=-1)
print(cat_hidden_states.size())

# take the mean of the concatenated vector over the token dimension
cat_sentence_embedding = torch.mean(cat_hidden_states, dim=1).squeeze()
print(cat_sentence_embedding)
print(cat_sentence_embedding.size())

torch.Size([1, 5, 3072])
tensor([ 0.2750,  0.1831, -0.0887,  ...,  0.2894, -0.0034,  0.0764],
       device='cuda:0')
torch.Size([3072])


## Saving and loading results

It is likely that you want to use your generated feature vectors in another model or task and just save them to your
hard drive. You can easily save a tensor with `torch.save` and load it in another script with `torch.load`. Typically,
the `.pt` (PyTorch) extension is used. Note that you cannot read the saved file with a text editor: it is a binary file
in which the object is serialized (pickled), which makes saving and loading efficient. If you do want to save your
tensors in a readable format, you can convert a tensor to numpy and use something like
`np.savetxt('tensor.txt', your_tensor.numpy())`. I do not recommend that approach (I'd stick with `torch.save` or
another binary format) but it is possible.

See how we use `.cpu()`? `cpu()` tells PyTorch that we want to move the output tensor back from the GPU to the CPU. 
This is not a required step, but I think it is good practice when doing feature extraction to move your data to CPU so
that when you load it, it is also loaded as a CPU tensor rather than a CUDA tensor. Afterwards you can still move 
things to GPU if need be, but using CPU by default seems like a good idea. Note that a tensor has to be on CPU if you
want to convert it to `.numpy()`, though.

In [10]:
# save our created sentence representation
torch.save(cat_sentence_embedding.cpu(), 'my_sent_embed.pt')

# load it again
loaded_tensor = torch.load('my_sent_embed.pt')
print(loaded_tensor)
print(loaded_tensor.size())

# convert it to numpy to use in e.g. sklearn
np_loaded_tensor = loaded_tensor.numpy()
print(np_loaded_tensor)
print(type(np_loaded_tensor))


tensor([ 0.2750,  0.1831, -0.0887,  ...,  0.2894, -0.0034,  0.0764])
torch.Size([3072])
[ 0.2749733   0.18313345 -0.0886516  ...  0.2893934  -0.00340437
  0.07635471]
<class 'numpy.ndarray'>
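
For completeness, the sketch below contrasts the binary format with the readable text format mentioned earlier, and
shows `map_location='cpu'` as another way to make sure a loaded tensor ends up on the CPU. It uses a random tensor as a
stand-in for `cat_sentence_embedding` and writes to a temporary directory; the file names are arbitrary.

```python
import os
import tempfile

import numpy as np
import torch

embedding = torch.randn(3072)  # stand-in for cat_sentence_embedding

with tempfile.TemporaryDirectory() as tmp_dir:
    # Binary format: map_location places the loaded tensor on the CPU
    pt_path = os.path.join(tmp_dir, 'embed.pt')
    torch.save(embedding, pt_path)
    from_pt = torch.load(pt_path, map_location='cpu')

    # Human-readable text format: one value per line
    txt_path = os.path.join(tmp_dir, 'embed.txt')
    np.savetxt(txt_path, embedding.numpy())
    # np.loadtxt returns 64-bit floats, so cast back to 32-bit
    from_txt = torch.from_numpy(np.loadtxt(txt_path)).float()

print(torch.equal(embedding, from_pt))
print(torch.allclose(embedding, from_txt))
print(from_pt.device)
```

The text file is much larger than the `.pt` file for the same data, which is one more reason to prefer the binary
format unless you really need to inspect the values by eye.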
