# An introduction to BERT with Transformers

In [1]:
import torch
from transformers import BertModel, BertTokenizer



A model works with tensors. Tensors are (basically) multi-dimensional arrays of numbers. To get started, then, the 
input text (a string) needs to be converted into a form (numbers) that the model can use. This is done by the 
tokenizer.
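
To make the idea of "text in, numbers out" concrete, here is a minimal sketch in plain Python. It is not how the BERT tokenizer works internally: the vocabulary `toy_vocab` and the helper `toy_encode` are made up for illustration, and the splitting is just whitespace-based.

```python
# Toy vocabulary: every known word gets an integer index.
# The words and indices are invented for this illustration.
toy_vocab = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}


def toy_encode(text):
    """Lowercase, split on whitespace, and look each word up in toy_vocab."""
    words = text.lower().split()
    # Words that are not in the vocabulary fall back to the [UNK] index
    return [toy_vocab.get(word, toy_vocab["[UNK]"]) for word in words]


print(toy_encode("The cat sat on the mat"))  # [1, 2, 3, 4, 1, 5]
print(toy_encode("The dog sat"))             # [1, 0, 3]
```

Note how "dog" collapses into the catch-all index 0. With a word-level vocabulary like this one, any unseen word loses all of its information.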

In [None]:
# Initialize the tokenizer that is already pretrained
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

During pre-training, the tokenizer was "trained" as well: it built a vocabulary of items that it "knows". Each 
vocabulary item has been assigned an index (a number), and that number can then be used in the model. To counter the 
annoying problem of words that the tokenizer doesn't know (out-of-vocabulary or OOV), the vocabulary also contains 
"subword units". In practice that means that you will rarely run into OOV problems when using the pretrained models. 
When the tokenizer does not recognize a word (it is not in its vocabulary), it will try to split that word up into 
smaller parts that it does know. The BERT tokenizer uses the WordPiece algorithm to split tokens. As an example:

In [21]:
# Convert the string "granola bars" to tokenized vocabulary IDs
granola_ids = tokenizer.encode('granola bars')
# Print the IDs
print('granola_ids', granola_ids)
# Convert the IDs back to the actual vocabulary items
# Notice how the subword unit "##ola" starts with "##" to indicate
# that it continues the previous token rather than starting a new word
print('granola_tokens', tokenizer.convert_ids_to_tokens(granola_ids))


granola_ids [101, 12604, 6030, 6963, 102]
granola_tokens ['[CLS]', 'gran', '##ola', 'bars', '[SEP]']
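
The split of "granola" into `gran` and `##ola` can be reproduced with a greedy longest-match-first search, which is the strategy WordPiece uses when splitting a word. The following is a simplified sketch under stated assumptions: `wordpiece_split` and `toy_vocab` are hypothetical names with a made-up vocabulary, and the real tokenizer additionally lowercases, splits on punctuation, and so on.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest pieces found in vocab, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word is unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Made-up vocabulary for illustration only
toy_vocab = {"gran", "##ola", "bars", "bar", "##s"}

print(wordpiece_split("granola", toy_vocab))  # ['gran', '##ola']
print(wordpiece_split("bars", toy_vocab))     # ['bars']
print(wordpiece_split("xyz", toy_vocab))      # ['[UNK]']
```

Because the longest match wins, "bars" stays a single token even though `bar` and `##s` are also in the toy vocabulary.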


You will probably have noticed the so-called "special tokens" [CLS] and [SEP]. These tokens are added automatically by 
the `.encode()` method so we don't have to worry about them. The first one is a classification token whose 
representation has been pretrained. It is specifically inserted for any sort of classification task. So instead of 
having to average the outputs of all tokens and use that as a sentence representation, it is common to just take the 
output of [CLS], which then stands in for the whole sentence. [SEP], on the other hand, is inserted as a separator 
between multiple segments. We will not rely on it here, but it is used for things like next sentence prediction, where 
it separates the current and the next sentence. It is especially important to remember the [CLS] token, as it can play 
a great role in classification and regression tasks. 
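
The sketch below illustrates both points in plain Python: where [CLS] and [SEP] end up for one or two segments, and the difference between taking the [CLS] output and averaging over all tokens. All helper names and the two-dimensional `toy_outputs` vectors are made up for illustration; a real model returns one much larger vector per token.

```python
def add_special_tokens(tokens_a, tokens_b=None):
    """Wrap one or two token lists in [CLS] ... [SEP] (... [SEP])."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
    return tokens


def cls_vector(token_vectors):
    """Use the output at position 0, i.e. the [CLS] token."""
    return token_vectors[0]


def mean_vector(token_vectors):
    """Average the outputs of all tokens, dimension by dimension."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


single = add_special_tokens(["gran", "##ola", "bars"])
pair = add_special_tokens(["i", "am", "hungry"], ["pass", "the", "bars"])
print(single)  # ['[CLS]', 'gran', '##ola', 'bars', '[SEP]']
print(pair)    # ['[CLS]', 'i', 'am', 'hungry', '[SEP]', 'pass', 'the', 'bars', '[SEP]']

# Invented output vectors, one per token in `single`
toy_outputs = [[1.0, 0.0], [0.0, 2.0], [2.0, 2.0], [4.0, 0.0], [3.0, 1.0]]
print(cls_vector(toy_outputs))   # [1.0, 0.0]
print(mean_vector(toy_outputs))  # [2.0, 1.0]
```

Which of the two representations works better depends on the task and on whether the model is fine-tuned, so it is worth trying both.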