# An introduction to inference with BERT

> This notebook tries to give an example of how BERT can be used to extract contextual embeddings while at the same time
giving some information about the model. Note that it does not try to be exhaustive. In some places, links are given as
suggestions for further reading.

In [1]:
import torch
from transformers import BertModel, BertTokenizer


## The tokenizer

A deep learning model works with tensors, which are (basically) multi-dimensional arrays of numbers. To get
started, then, the input text (a string) needs to be converted into numbers that the model can use.
This is done by the tokenizer.

In [2]:
# Initialize the tokenizer with a pretrained model
# We'll come back to the `do_lower_case` parameter
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

During pre-training, the tokenizer was "trained" as well: it built a fixed vocabulary, and every item in that
vocabulary was assigned an index (a number) that the model can use. To counter the problem of words that are not in
the vocabulary (out-of-vocabulary or OOV), the vocabulary also contains "subword units". When the tokenizer does not
recognize a word, it splits that word into smaller parts that it does know, so with the pretrained models you should
rarely run into OOV problems. The BERT tokenizer uses the WordPiece algorithm to split tokens. As an example:

In [3]:
# Convert the string "granola bars" to tokenized vocabulary IDs
granola_ids = tokenizer.encode('granola bars')
# Print the IDs
print('granola_ids', granola_ids)
print('type of granola_ids', type(granola_ids))
# Convert the IDs to the actual vocabulary item
# Notice how the subword unit (suffix) starts with "##" to indicate 
# that it is part of the previous string
print('granola_tokens', tokenizer.convert_ids_to_tokens(granola_ids))


granola_ids [101, 12604, 6030, 6963, 102]
type of granola_ids <class 'list'>
granola_tokens ['[CLS]', 'gran', '##ola', 'bars', '[SEP]']
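
To make the splitting behaviour above more concrete, here is a minimal sketch of greedy longest-match-first splitting,
which is the general idea behind WordPiece tokenization. It is not the library's implementation: the function name
`wordpiece_tokenize` and the five-item `toy_vocab` are made up for illustration, and the real tokenizer does more
(text normalization, punctuation splitting, a length limit per word).

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercased word into the longest pieces found in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real one has roughly 30,000 entries.
toy_vocab = {"gran", "##ola", "bars", "play", "##ing"}

for word in ["granola", "bars", "playing", "xyz"]:
    print(word, wordpiece_tokenize(word, toy_vocab))
```

With this toy vocabulary, "granola" comes out as `['gran', '##ola']`, matching the tokens printed above, while a
word with no matching pieces falls back to `[UNK]`.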


You will probably have noticed the so-called "special tokens" [CLS] and [SEP]. These tokens are added automatically by
the `.encode()` method, so we don't have to worry about them. The first one is a classification token which has been
pretrained. It is specifically inserted for any sort of classification task. So instead of having to average all
tokens and use that as a sentence representation, it is recommended to just take the output of [CLS], which then
represents the whole sentence. [SEP], on the other hand, is inserted as a separator between multiple instances. We will
not rely on it here, but it is used for things like next sentence prediction, where it separates the current and
the next sentence. It is especially important to remember the [CLS] token, as it can play a great role in classification
and regression tasks.
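
The two sentence representations mentioned above (the [CLS] output versus an average over all tokens) can be sketched
as follows. This is only an illustration under assumptions: the random tensor `last_hidden_state` is a stand-in for
what a model would return, with one vector per token and a hidden size of 768 as in `bert-base-uncased`.

```python
import torch

# Stand-in for a model output: 1 sentence, 5 tokens, hidden size 768.
# Random values are used only so the sketch runs without a model.
last_hidden_state = torch.randn(1, 5, 768)

# Option 1: take the vector at position 0, which is where [CLS] sits.
cls_vector = last_hidden_state[:, 0, :]

# Option 2: average the vectors of all tokens in the sentence.
mean_vector = last_hidden_state.mean(dim=1)

print(cls_vector.shape)   # torch.Size([1, 768])
print(mean_vector.shape)  # torch.Size([1, 768])
```

Both options yield one vector per sentence; which one works better depends on the task and on whether the model is
fine-tuned.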

We almost have the correct data type to get started. As we saw above, the token IDs are a list of integers. In this
notebook we use the `transformers` library in combination with PyTorch, which works with tensors. A tensor is a
multi-dimensional array that is optimised for numerical computation and is the standard data type in deep learning.
To convert our token IDs to a tensor, we can simply pass the list to a tensor constructor. Here, we use a
`LongTensor`, which holds 64-bit integers. For floating-point numbers, we'd typically use a `FloatTensor`.

In [4]:
# Convert the list of IDs to a tensor of IDs 
granola_ids = torch.LongTensor(granola_ids)
# Print the IDs
print('granola_ids', granola_ids)
print('type of granola_ids', type(granola_ids))

granola_ids tensor([  101, 12604,  6030,  6963,   102])
type of granola_ids <class 'torch.Tensor'>


## The model
Now that we have preprocessed our input string into a tensor of IDs, we can feed this to the model. Remember that
each ID refers to a token in the tokenizer's vocabulary. The model "knows" which words are being processed because it
"knows" which token belongs to which ID. In BERT, and in most - if not all - current transformer language models, the
first layer is an embedding layer: each token ID is mapped to an embedding vector. In BERT, that vector is the sum of
three types of embeddings: the token embedding, the segment embedding, and the position embedding. The token embedding
is a learned vector for the given token, the segment embedding indicates whether the token belongs to the first or the
(optional) second segment, and the position embedding encodes the token's position in the input. Below you find a
figure from the BERT paper. (See how "playing" is split into "play" and "##ing"?) Note that in our case, where we just
use BERT for inference on a single sentence, the segment embedding carries no distinguishing information because every
token belongs to the first segment.
For more information, see [this Medium article](https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a).

![BERT embeddings visualization](img/bert-embeddings.png)

(For a very detailed and visual explanation of the whole BERT model, have a look at the explanations on
[Jay Alammar's homepage](http://jalammar.github.io/). In particular the "Illustrated transformer" is very interesting.)
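
The sum of token, segment, and position embeddings described above can be sketched in a few lines of PyTorch. This is
a simplified illustration, not BERT's actual code: the sizes and token IDs are made up, the weights are random, and
the real model additionally applies layer normalization and dropout to the sum.

```python
import torch
from torch import nn

# Toy sizes chosen for readability; bert-base-uncased uses a vocabulary of
# 30,522 tokens, up to 512 positions, 2 segments and a hidden size of 768.
vocab_size, max_positions, num_segments, hidden_size = 100, 16, 2, 8

token_embeddings = nn.Embedding(vocab_size, hidden_size)
position_embeddings = nn.Embedding(max_positions, hidden_size)
segment_embeddings = nn.Embedding(num_segments, hidden_size)

# Made-up token IDs with shape (batch_size, sequence_length)
input_ids = torch.tensor([[1, 42, 7, 13, 2]])
# Positions 0, 1, 2, ... for every token in the sequence
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)
# A single sentence: every token belongs to segment 0
segment_ids = torch.zeros_like(input_ids)

embeddings = (
    token_embeddings(input_ids)
    + position_embeddings(position_ids)
    + segment_embeddings(segment_ids)
)
print(embeddings.shape)  # torch.Size([1, 5, 8])
```

Because all three lookups return tensors of the same shape, they can simply be added element-wise.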

To get started, we first need to initialize the model. Just like the tokenizer, the model is pretrained, which makes it
very easy to get token or sentence representations out of it. Note how we use the same pretrained model as the
tokenizer uses (`bert-base-uncased`). This is the smaller BERT model, trained on lowercased text. That is also the
reason that we passed the argument `do_lower_case`. Because the model has only seen lowercased text, it does not know
cased text. If we pass `do_lower_case=True` to the tokenizer, it takes care of casing for us. Whether to use a cased or
uncased language model really depends on the task. If you think that casing matters (e.g. for NER), you may want to opt
for a cased model.
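
As a rough sketch of what "taking care of casing" means: for uncased models, the BERT tokenizer lowercases the text
and strips accents before splitting it into tokens. The helper `normalize_uncased` below is a hypothetical name that
only mimics this normalization step; the real tokenizer also handles whitespace, punctuation, and special tokens.

```python
import unicodedata


def normalize_uncased(text):
    """Lowercase `text` and strip accents (illustrative helper only)."""
    text = text.lower()
    # NFD splits an accented character into a base character plus a
    # combining mark (Unicode category "Mn"), which is then dropped.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(normalize_uncased("Granola BARS"))  # granola bars
print(normalize_uncased("Café"))          # cafe
```

This also shows why an uncased model cannot distinguish "Apple" from "apple", which can matter for tasks like NER.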

In this example, an additional argument has been given: `output_hidden_states` gives us more output information.
In the version of `transformers` used here, a `BertModel` returns a tuple, but the contents of that tuple differ
depending on the configuration of the model. (More recent releases return a dict-like output object by default, which
can still be indexed like a tuple.) When passing `output_hidden_states=True`, the output contains (in order; shape in
brackets):

1. the last hidden state (batch_size, sequence_length, hidden_size)
2. the pooler_output, i.e. the last hidden state of the [CLS] token passed through a linear layer and a tanh activation (batch_size, hidden_size)
3. the hidden_states, a tuple with the output of the embedding layer followed by the output of each layer; every element has shape (batch_size, sequence_length, hidden_size)

In [5]:
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)

The model has been initialized, and the input string has been converted into a tensor. A language model (such as
`BertModel` above) has a `forward()` method that is called automatically when calling the object. The forward method
pushes a given input tensor forward through the model and then returns the output.

In [6]:
# The model expects a batch, so add a batch dimension: (seq_len,) -> (1, seq_len)
out = model(input_ids=granola_ids.unsqueeze(0))
print(type(out))
print(len(out))

<class 'tuple'>
3
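
To see what the three elements of such an output look like, the sketch below unpacks them and prints their shapes. To
keep it independent of the pretrained weights, it assumes a tiny, randomly initialized `BertModel` built from a
made-up `BertConfig` (the sizes and token IDs are arbitrary), so the values are meaningless and only the structure is
of interest. Integer indexing is used because it works whether the output is a plain tuple or a dict-like object.

```python
import torch
from transformers import BertConfig, BertModel

# Hypothetical tiny configuration with random weights, so nothing is downloaded.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    output_hidden_states=True,
)
tiny_model = BertModel(config)
tiny_model.eval()  # disable dropout for inference

# Made-up token IDs; unsqueeze(0) adds the batch dimension.
input_ids = torch.tensor([1, 42, 7, 13, 2]).unsqueeze(0)
with torch.no_grad():  # no gradients are needed for inference
    tiny_out = tiny_model(input_ids=input_ids)

last_hidden_state, pooler_output, hidden_states = tiny_out[0], tiny_out[1], tiny_out[2]
print(last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
print(pooler_output.shape)      # (batch_size, hidden_size)
print(len(hidden_states))       # embedding output + one entry per layer
```

With two layers, `hidden_states` has three entries; for `bert-base-uncased`, which has twelve layers, it would have
thirteen.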
