BERT: an overview of the model and its tokenizer.

First read this paper: https://arxiv.org/abs/1810.04805.

# BERT

If you want to quickly try different models without having to import their corresponding classes, you can use HuggingFace’s AutoModel instead:

In [1]:
from transformers import AutoModel
auto_model = AutoModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


We can check that the loaded model is the correct one:

In [2]:
print(auto_model.__class__)

<class 'transformers.models.bert.modeling_bert.BertModel'>


You can look at the architecture of the model:

In [3]:
auto_model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

We see that there are three main blocks: embeddings, encoder, and pooler (the pooler is cut off in the truncated printout above).
Also notice that the hidden size is 768 and that there are 12 hidden layers; the configuration shown later also reports 12 attention heads.
We will discuss the other components later.
The printout above can be hard to read, so I wrote the function below. (It is the first time I have used widgets, so the output is not pretty, but it helps to see the hierarchy of modules in the model.)

In [4]:
import ipywidgets as widgets
from IPython.display import display, clear_output
import torch

class ModulePrinter:
    def __init__(self, depth):
        self.depth = depth
        self.current_depth = 0
 
    def print_module(self, module, name):
        if self.current_depth <= self.depth:
            print("  " * self.current_depth + name + ":")
            self.current_depth += 1
            for child_name, child_module in module.named_children():
                self.print_module(child_module, child_name)
            self.current_depth -= 1

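# Create the output area first so the callback below can write into it.
# It is not called `out`, because that name is reused later for the model outputs.
output_area = widgets.Output()

def print_model_structure(model, depth):
    with output_area:
        clear_output()
        mp = ModulePrinter(depth)
        mp.print_module(model, "model")

depth_selector = widgets.IntSlider(min=0, max=7, description='Depth:')
display(depth_selector)
display(output_area)

def on_depth_change(change):
    # Redraw the module tree whenever the slider value changes
    print_model_structure(auto_model, change.new)

depth_selector.observe(on_depth_change, names='value')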

IntSlider(value=0, description='Depth:', max=7)

Output()

Let’s create a BERT model by loading the pre-trained weights for bert-base-uncased:

In [5]:
from transformers import BertModel
bert_model = BertModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Let's look at the pre-trained model’s configuration:

In [6]:
bert_model.config

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.30.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

Again, notice that the hidden size is 768 and that there are 12 attention heads and 12 hidden layers.
We can also check the configuration of our auto_model; it will be the same.

Do you remember that the model needs inputs? :)
This is where the tokenizer comes in.
Tokenization is a pre-processing step, and, since we’ll be using a pre-trained BERT model, we need to use the same tokenizer that
was used during pre-training. For each pre-trained model available in HuggingFace there is an accompanying pre-trained tokenizer as well.


In [6]:
from transformers import BertTokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

We can check the size of the vocabulary in the following way:

In [7]:
bert_tokenizer.vocab_size

30522

or in this way:

In [9]:
len(bert_tokenizer.vocab)

30522

Let's tokenize something:

In [8]:
sentence1 = 'I love my cat'
sentence2 = 'The sky is blue'
tokens = bert_tokenizer(sentence1, sentence2, return_tensors='pt')
tokens

{'input_ids': tensor([[ 101, 1045, 2293, 2026, 4937,  102, 1996, 3712, 2003, 2630,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

'token_type_ids' shows which sentence each token belongs to: 0 for the first sentence and 1 for the second.
The IDs 101 and 102 are special tokens ([CLS] and [SEP]).
Let's tokenize new sentences.

In [9]:
sentence1 = 'I love my dog'
sentence2 = 'Kim is very good'
tokens1 = bert_tokenizer(sentence1, sentence2, return_tensors='pt')
tokens1

{'input_ids': tensor([[ 101, 1045, 2293, 2026, 3899,  102, 5035, 2003, 2200, 2204,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Note that the mapping from a word (or word piece) to a token ID is consistent: 'I', 'love', 'my' and 'is' get the same IDs as in the previous example.

Now let's run the next code to look at the vocab. 

In [None]:
bert_tokenizer.vocab

What is [unused###]? Unused tokens are helpful if you want to introduce specific words to your fine-tuning or further pre-training procedure; they allow you to treat words that are relevant only in your context just like you want, and avoid subword splitting that would occur with the original vocabulary of BERT. 

Now let's tokenize one sentence:

In [10]:
tokens3 = bert_tokenizer('I am boy', return_tensors='pt')
tokens3

{'input_ids': tensor([[ 101, 1045, 2572, 2879,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

In [11]:
tokens4 = bert_tokenizer('I am boy','I am boy', return_tensors='pt')
tokens4 

{'input_ids': tensor([[ 101, 1045, 2572, 2879,  102, 1045, 2572, 2879,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}

So our tokenizer handles one sentence and two sentences. What about three?

In [12]:
tokens5 = bert_tokenizer('I am boy','I am boy','I am boy', return_tensors='pt')
tokens5

{'input_ids': tensor([[ 101, 1045, 2572, 2879,  102, 1045, 2572, 2879,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[ 101, 1045, 2572, 2879,  102]])}

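Note that the tokenizer encodes only one sentence or one pair of sentences as model input. The third sentence is not dropped, though: the third positional argument is interpreted as text_target, which is why it comes back encoded separately under 'labels'. We will talk about this later. Also notice that we can tokenize a list of three sentences: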

In [13]:
tokens6= bert_tokenizer(['I am boy','I am girl','I am man'], return_tensors='pt')
tokens6

{'input_ids': tensor([[ 101, 1045, 2572, 2879,  102],
        [ 101, 1045, 2572, 2611,  102],
        [ 101, 1045, 2572, 2158,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]])}

We can convert IDs back to tokens:

In [14]:
sentence1 = 'I love my dog'
sentence2 = 'Kim is very good'
tokens1 = bert_tokenizer(sentence1, sentence2, return_tensors='pt')
print(tokens1)
print(bert_tokenizer.convert_ids_to_tokens(tokens1['input_ids'][0]))

{'input_ids': tensor([[ 101, 1045, 2293, 2026, 3899,  102, 5035, 2003, 2200, 2204,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
['[CLS]', 'i', 'love', 'my', 'dog', '[SEP]', 'kim', 'is', 'very', 'good', '[SEP]']


In [15]:
tokens3 = bert_tokenizer('I am boy', return_tensors='pt')
print(tokens3)
print(bert_tokenizer.convert_ids_to_tokens(tokens3['input_ids'][0]))

{'input_ids': tensor([[ 101, 1045, 2572, 2879,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
['[CLS]', 'i', 'am', 'boy', '[SEP]']


In [16]:
tokens4 = bert_tokenizer('I am boy','I am boy', return_tensors='pt')
print(tokens4)
print(bert_tokenizer.convert_ids_to_tokens(tokens4['input_ids'][0]))

{'input_ids': tensor([[ 101, 1045, 2572, 2879,  102, 1045, 2572, 2879,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
['[CLS]', 'i', 'am', 'boy', '[SEP]', 'i', 'am', 'boy', '[SEP]']


In [17]:
tokens5 = bert_tokenizer('I am boy','I am boy','I am boy', return_tensors='pt')
print(tokens5)
print(bert_tokenizer.convert_ids_to_tokens(tokens5['input_ids'][0]))  # The third sentence is not part of input_ids; it is encoded separately under 'labels'.

{'input_ids': tensor([[ 101, 1045, 2572, 2879,  102, 1045, 2572, 2879,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[ 101, 1045, 2572, 2879,  102]])}
['[CLS]', 'i', 'am', 'boy', '[SEP]', 'i', 'am', 'boy', '[SEP]']


Notice that in all the examples above, tokenizing one sentence gives the structure:
[CLS]###[SEP].
With two sentences we get:
[CLS]###[SEP]###[SEP].
What about a list of sentences?

In [18]:
tokens6= bert_tokenizer(['I am boy','I am girl','I am man'], return_tensors='pt')
print(tokens6)
print(bert_tokenizer.convert_ids_to_tokens(tokens6['input_ids'][0]))

{'input_ids': tensor([[ 101, 1045, 2572, 2879,  102],
        [ 101, 1045, 2572, 2611,  102],
        [ 101, 1045, 2572, 2158,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]])}
['[CLS]', 'i', 'am', 'boy', '[SEP]']


In [19]:
tokens6= bert_tokenizer(['I am boy','I am girl','I am man'], return_tensors='pt')
print(tokens6)
print(bert_tokenizer.convert_ids_to_tokens(tokens6['input_ids'][1]))

{'input_ids': tensor([[ 101, 1045, 2572, 2879,  102],
        [ 101, 1045, 2572, 2611,  102],
        [ 101, 1045, 2572, 2158,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]])}
['[CLS]', 'i', 'am', 'girl', '[SEP]']


In [20]:
tokens6= bert_tokenizer(['I am boy','I am girl','I am man'], return_tensors='pt')
print(tokens6)
print(bert_tokenizer.convert_ids_to_tokens(tokens6['input_ids'][2]))

{'input_ids': tensor([[ 101, 1045, 2572, 2879,  102],
        [ 101, 1045, 2572, 2611,  102],
        [ 101, 1045, 2572, 2158,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]])}
['[CLS]', 'i', 'am', 'man', '[SEP]']


We get the structure:
[CLS]###[SEP] for each sentence.
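
The single-sentence and sentence-pair layouts seen so far can be reproduced by hand. This is a minimal sketch, not the tokenizer's actual implementation: `build_inputs` is a hypothetical helper, and the IDs 101 and 102 for [CLS] and [SEP] are copied from the outputs above.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs seen in the outputs above

def build_inputs(ids_a, ids_b=None):
    """Assemble input_ids, token_type_ids and attention_mask for one sentence or a pair."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)          # segment 0: [CLS] A [SEP]
    if ids_b is not None:
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)   # segment 1: B [SEP]
    attention_mask = [1] * len(input_ids)          # no padding here, so all ones
    return {'input_ids': input_ids,
            'token_type_ids': token_type_ids,
            'attention_mask': attention_mask}

# 'i am boy' -> 1045, 2572, 2879 (IDs copied from the outputs above)
single = build_inputs([1045, 2572, 2879])
pair = build_inputs([1045, 2572, 2879], [1045, 2572, 2879])
print(single['input_ids'])
print(pair['input_ids'])
print(pair['token_type_ids'])
```

The printed lists should match the 'input_ids' and 'token_type_ids' that the real tokenizer returned for 'I am boy' alone and for the pair.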

So we can expect that two lists with the same number of sentences will be paired up element by element. We see this below:

In [21]:
tokens6= bert_tokenizer(['I am boy','I am girl','I am man'],['Am I boy?','Am I girl?','Am I man?'], return_tensors='pt')
print(tokens6)
print(bert_tokenizer.convert_ids_to_tokens(tokens6['input_ids'][0]))

{'input_ids': tensor([[ 101, 1045, 2572, 2879,  102, 2572, 1045, 2879, 1029,  102],
        [ 101, 1045, 2572, 2611,  102, 2572, 1045, 2611, 1029,  102],
        [ 101, 1045, 2572, 2158,  102, 2572, 1045, 2158, 1029,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
['[CLS]', 'i', 'am', 'boy', '[SEP]', 'am', 'i', 'boy', '?', '[SEP]']


In [22]:
tokens6= bert_tokenizer(['I am boy','I am girl','I am man'],['Am I boy?','Am I girl?','Am I man?'], return_tensors='pt')
print(tokens6)
print(bert_tokenizer.convert_ids_to_tokens(tokens6['input_ids'][1]))

{'input_ids': tensor([[ 101, 1045, 2572, 2879,  102, 2572, 1045, 2879, 1029,  102],
        [ 101, 1045, 2572, 2611,  102, 2572, 1045, 2611, 1029,  102],
        [ 101, 1045, 2572, 2158,  102, 2572, 1045, 2158, 1029,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
['[CLS]', 'i', 'am', 'girl', '[SEP]', 'am', 'i', 'girl', '?', '[SEP]']


In [23]:
tokens6= bert_tokenizer(['I am boy','I am girl','I am man'],['Am I boy?','Am I girl?','Am I man?'], return_tensors='pt')
print(tokens6)
print(bert_tokenizer.convert_ids_to_tokens(tokens6['input_ids'][2]))

{'input_ids': tensor([[ 101, 1045, 2572, 2879,  102, 2572, 1045, 2879, 1029,  102],
        [ 101, 1045, 2572, 2611,  102, 2572, 1045, 2611, 1029,  102],
        [ 101, 1045, 2572, 2158,  102, 2572, 1045, 2158, 1029,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
['[CLS]', 'i', 'am', 'man', '[SEP]', 'am', 'i', 'man', '?', '[SEP]']


What happens if the two lists have different numbers of sentences? (We get an error.)

In [None]:
tokens6= bert_tokenizer(['I am boy','I am girl','I am man'],['Am I boy?','Am I girl?'], return_tensors='pt')
print(tokens6)
print(bert_tokenizer.convert_ids_to_tokens(tokens6['input_ids'][0]))

What happens if we tokenize three lists? (As with the third sentence, the third list is not added to input_ids; it is encoded separately under 'labels'.)

In [24]:
tokens6= bert_tokenizer(['I am boy','I am girl','I am man'],['Am I boy?','Am I girl?','Am I man?'],['Am I boy?','Am I girl?','Am I man?'], return_tensors='pt')
print(tokens6)
print(bert_tokenizer.convert_ids_to_tokens(tokens6['input_ids'][0]))

{'input_ids': tensor([[ 101, 1045, 2572, 2879,  102, 2572, 1045, 2879, 1029,  102],
        [ 101, 1045, 2572, 2611,  102, 2572, 1045, 2611, 1029,  102],
        [ 101, 1045, 2572, 2158,  102, 2572, 1045, 2158, 1029,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[ 101, 2572, 1045, 2879, 1029,  102],
        [ 101, 2572, 1045, 2611, 1029,  102],
        [ 101, 2572, 1045, 2158, 1029,  102]])}
['[CLS]', 'i', 'am', 'boy', '[SEP]', 'am', 'i', 'boy', '?', '[SEP]']


Apart from the discussion above, note that the number of tokens in every sentence in the lists was not chosen at random: all rows have the same length. If we add some extra letters to any sentence, we get an error. Why? We will talk about it.

In [None]:
tokens6= bert_tokenizer(['I am boyEXTRA','I am girl','I am man'],['Am I boy?','Am I girl?','Am I man?'], return_tensors='pt')
print(tokens6)
print(bert_tokenizer.convert_ids_to_tokens(tokens6['input_ids'][0]))
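
The reason is that return_tensors='pt' asks for one rectangular tensor, and rows of different lengths cannot be stacked into one. Padding solves this. Below is a minimal pure-Python sketch of the idea: `pad_batch` is a hypothetical helper, the pad ID 0 is taken from pad_token_id in the configuration above, and the extra IDs in the second row are made up for illustration. With the real tokenizer, passing padding=True has the same effect.

```python
PAD_ID = 0  # pad_token_id from the configuration above

def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the attention masks."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask

# Two sequences of different length (4654 and 6494 are made-up word-piece IDs)
batch = [[101, 1045, 2572, 2879, 102],
         [101, 1045, 2572, 2879, 4654, 6494, 102]]
input_ids, attention_mask = pad_batch(batch)
for row, mask in zip(input_ids, attention_mask):
    print(row, mask)
```

The attention mask marks real tokens with 1 and padding with 0, so the model can ignore the padded positions.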

Now we have tokenized sentences. What is next? Besides token embeddings, BERT also uses position embeddings (not positional encoding) and segment embeddings.

Let's check it:

In [25]:
bert_model.embeddings

BertEmbeddings(
  (word_embeddings): Embedding(30522, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (token_type_embeddings): Embedding(2, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

What is the difference between positional encoding and Position Embeddings?

While the positional encoding has fixed values for each position,
the position embeddings are learned by the model (like any other embedding
layer). The number of entries in this lookup table is given by the maximum length
of the sequence.
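
To make the contrast concrete, here is a minimal pure-Python sketch. The sinusoidal formula is the fixed encoding from the original Transformer paper, included only for comparison since BERT does not use it. The learned table is merely imitated with random numbers, and the tiny sizes (d_model=8, max_len=16) are assumptions chosen for readability.

```python
import math
import random

def sinusoidal_encoding(position, d_model):
    """Fixed encoding: the value depends only on the position and the dimension index."""
    enc = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))   # even dimension
        enc.append(math.cos(angle))   # odd dimension
    return enc[:d_model]

d_model, max_len = 8, 16  # toy sizes; BERT base uses 768 and 512

# Learned position embeddings are just a lookup table with max_len rows.
# Random values stand in for the initial state; training would update them.
rng = random.Random(0)
position_table = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                  for _ in range(max_len)]

print(sinusoidal_encoding(0, d_model))               # position 0: sin(0) and cos(0) alternate
print(len(position_table), len(position_table[0]))   # rows = max_len, columns = d_model
# A position >= max_len has no row in the table, which is why the input length is capped.
```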

So the maximum number of tokens in an input sequence is 512 (special tokens included).

What happens if the length is more than 512?

Let's check. (I deleted that cell. The first time it ran, there was a warning that the length is greater than 512 and that this can cause problems. After the first attempt there was no warning.)

Then, since there can only be either one or two sentences in the input, the segment
embedding layer has only two entries:

In [26]:
segment_embeddings = bert_model.embeddings.token_type_embeddings
segment_embeddings

Embedding(2, 768)

These three embeddings are simply added together; as the printout above shows, the sum then goes through LayerNorm and dropout.
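
As a minimal sketch of that sum (the toy sizes, random tables and token IDs are assumptions, not BERT's real weights): each token takes one row from each of the three lookup tables, and the rows are added element-wise. The LayerNorm and dropout that follow in the real BertEmbeddings module are left out here.

```python
import random

rng = random.Random(0)
hidden = 4                      # toy size; BERT base uses 768

def make_table(n_rows):
    """A stand-in for an embedding layer: n_rows vectors of length `hidden`."""
    return [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(n_rows)]

word_table = make_table(10)     # toy vocabulary of 10 tokens
position_table = make_table(6)  # at most 6 positions
segment_table = make_table(2)   # segment 0 or 1

input_ids = [1, 5, 2, 7, 2]         # made-up toy IDs
token_type_ids = [0, 0, 0, 1, 1]

embeddings = []
for pos, (tok, seg) in enumerate(zip(input_ids, token_type_ids)):
    rows = zip(word_table[tok], position_table[pos], segment_table[seg])
    embeddings.append([w + p + s for w, p, s in rows])

print(len(embeddings), len(embeddings[0]))  # one vector of size `hidden` per token
```

Note that the same token ID (2 here) ends up with different vectors at positions 2 and 4, because its position and segment rows differ.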

## Pretraining Tasks

BIG INSIGHT FOR ME!!!
We know that the goal of a language model is to estimate the probability of a token or a sequence of tokens or, simply put, to predict the tokens more likely to fill in a blank. 

But who said the blank must be at the end? In the continuous bag-of-words (CBoW) model, the blank was the word in the center, and the remaining
words were the context. In a way, that’s what the MLM (masked-language-model) task is doing: It is randomly choosing words to be masked as blanks in a sentence. BERT then tries to predict
the correct words that fill in the blanks.

BERT is said to be an autoencoding model because it is a Transformer encoder and
because it was trained to "reconstruct" sentences from corrupted inputs (it does
not reconstruct the entire input but predicts the corrected words instead). That’s
the masked language model (MLM) pre-training task.


Actually, it’s a bit more structured than that. First, 15% of the tokens are selected at random for prediction. Each selected token is then handled in one of three ways:
1) 80% of the time, it is replaced with [MASK]: "A B C [MASK] E."
2) 10% of the time, it is replaced with some other random token: "A B C X E."
3) The remaining 10% of the time, it is left unchanged: "A B C D E."

The target is the original sentence: "A B C D E." This way, the model effectively learns to reconstruct the original sentence from corrupted inputs
(containing masked or randomly replaced words).

Also, notice that only the selected positions count toward the loss. The remaining positions are not even considered when computing it.

"OK, but how can we randomly replace tokens like that?"

One alternative, similar to the way we do data augmentation for images, would be
to implement a custom dataset that performs the replacements on the fly in the
__getitem__() method. There is a better alternative, though: using a collate
function or, better yet, a data collator. There’s a data collator that performs the
replacement procedure prescribed by BERT: DataCollatorForLanguageModeling.

In [27]:
sentence = 'A B C D E F G H J K'
tokens = bert_tokenizer(sentence)
tokens['input_ids']

[101, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1046, 1047, 102]

In [28]:
from transformers import DataCollatorForLanguageModeling
torch.manual_seed(41)
data_collator = DataCollatorForLanguageModeling(tokenizer=bert_tokenizer, mlm_probability=0.15)
mlm_tokens = data_collator([tokens])
mlm_tokens

{'input_ids': tensor([[ 101, 1037, 1038, 1039, 1040, 1041,  103, 1043, 1044, 1046, 1047,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[-100, -100, -100, -100, -100, -100, 1042, -100, -100, -100, -100, -100]])}

In [29]:
print(bert_tokenizer.convert_ids_to_tokens(
  mlm_tokens['input_ids'][0]
))


['[CLS]', 'a', 'b', 'c', 'd', 'e', '[MASK]', 'g', 'h', 'j', 'k', '[SEP]']
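
What the collator does can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the code of DataCollatorForLanguageModeling: `mask_tokens` is a hypothetical helper, the IDs 101, 102 and 103 for [CLS], [SEP] and [MASK] and the vocabulary size 30522 are taken from the outputs in this notebook, and -100 is the label value that marks positions ignored by the loss. Unlike a careful implementation, the random replacement here may occasionally pick a special token or the original token itself.

```python
import random

CLS_ID, SEP_ID, MASK_ID = 101, 102, 103
VOCAB_SIZE = 30522
IGNORE = -100  # label for positions that do not contribute to the loss

def mask_tokens(input_ids, mlm_probability=0.15, seed=0):
    rng = random.Random(seed)
    corrupted, labels = list(input_ids), [IGNORE] * len(input_ids)
    for i, tok in enumerate(input_ids):
        if tok in (CLS_ID, SEP_ID):           # never select special tokens
            continue
        if rng.random() >= mlm_probability:   # about 85% of tokens are not selected
            continue
        labels[i] = tok                       # the target is the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_ID            # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token
    return corrupted, labels

ids = [101, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1046, 1047, 102]
corrupted, labels = mask_tokens(ids, seed=3)
print(corrupted)
print(labels)
```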


The second pre-training task is a binary classification task: BERT was trained to
predict if a second sentence is actually the next sentence in the original text or
not. The purpose of this task is to give BERT the ability to understand the
relationship between sentences, which can be useful for some of the tasks BERT
can be fine-tuned for, like question answering.

So, BERT takes two sentences as inputs (with the special separator token [SEP]
between them):
• 50% of the time, the second sentence is indeed the next sentence (the positive class).
• 50% of the time, the second sentence is a randomly chosen one (the negative class).
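
A minimal sketch of how such training pairs could be built from an ordered list of sentences. `make_nsp_pairs` is a hypothetical helper and the sentences are placeholders reused from earlier cells; a real pre-training pipeline would draw the random sentence from a large corpus, which this toy version does not do.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Return (sentence_a, sentence_b, is_next) triples; needs at least 3 sentences."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # positive class: the true next sentence
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # negative class: any sentence except the current one and the true next one
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), 0))
    return pairs

sentences = ['I love my cat', 'The sky is blue', 'I love my dog', 'Kim is very good']
for a, b, is_next in make_nsp_pairs(sentences):
    print(is_next, '|', a, '|', b)
```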

This task uses the special classifier token [CLS], taking the values of the
corresponding final hidden state as features for a classifier. 

The final hidden state is actually further processed by a pooler (composed of a linear
layer and a hyperbolic tangent activation function) before being fed to the classifier
(FFN, feed-forward network):

In [30]:
bert_model.pooler

BertPooler(
  (dense): Linear(in_features=768, out_features=768, bias=True)
  (activation): Tanh()
)
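
What the pooler computes can be written out by hand. This is a toy sketch with a hidden size of 3 and made-up weights (assumptions for illustration only): it takes the hidden state of the first token, applies a linear layer, and then tanh, mirroring the dense and activation modules printed above.

```python
import math

def pooler(hidden_states, weight, bias):
    """tanh(W @ h_cls + b), where h_cls is the hidden state of the first token."""
    h_cls = hidden_states[0]
    return [math.tanh(sum(w * h for w, h in zip(row, h_cls)) + b)
            for row, b in zip(weight, bias)]

hidden_states = [[0.5, -1.0, 2.0],   # [CLS]
                 [0.1, 0.2, 0.3],    # some token
                 [0.0, 0.0, 0.0]]    # [SEP]
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]           # identity, so the output is tanh of the [CLS] state
bias = [0.0, 0.0, 0.0]

pooled = pooler(hidden_states, weight, bias)
print(pooled)
```

Because of the tanh, every value of the pooled vector lies strictly between -1 and 1.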

In [31]:
bert_model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

## Outputs

In [121]:
sentence = 'Of course you don'
sentence

'Of course you don'

First we need to tokenize the sentence.

In [122]:
tokens = bert_tokenizer(sentence,
                        padding='max_length',
                        max_length=50,
                        truncation=True,
                        return_tensors="pt")
tokens

{'input_ids': tensor([[ 101, 1997, 2607, 2017, 2123,  102,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]])}

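Notice that we can get the token IDs directly by using the tokenizer's encode method. (The 11 IDs below do not match the 6 non-padding IDs above: this cell was apparently run on a longer version of the sentence that still had its quotation marks, the "'t" and the "!", and the 9-row result further down comes from that same longer input.)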

In [46]:
bert_tokenizer.encode(sentence)

[101, 1036, 1997, 2607, 2017, 2123, 1005, 1056, 999, 1005, 102]

Let's look at the output:

In [47]:
bert_model.eval()
out = bert_model(input_ids=tokens['input_ids'],
                 attention_mask=tokens['attention_mask'],
                 output_attentions=True,
                 output_hidden_states=True,
                 return_dict=True)
out.keys()

odict_keys(['last_hidden_state', 'pooler_output', 'hidden_states', 'attentions'])

Let’s see what’s inside each of these four outputs:
1) last_hidden_state is returned by default and is the most important output of
    all: It contains the final hidden states for each and every token in the input,
    which can be used as contextual word embeddings.
    
(Don’t forget that the first token is the special classifier token
[CLS] and that there may be padding ([PAD]) and separator
([SEP]) tokens as well!)


Let's look at the shape of last_hidden_state:

In [48]:
out['last_hidden_state'].shape

torch.Size([1, 50, 768])

In [50]:
out['last_hidden_state']

tensor([[[ 0.1413,  0.4983,  0.0346,  ...,  0.0015,  0.4425,  0.4796],
         [ 0.2524,  1.1373,  0.6587,  ...,  0.3395,  0.4608,  0.4859],
         [-0.6004,  0.4385,  0.0305,  ...,  0.4747,  0.1893,  0.3341],
         ...,
         [ 0.6018,  0.2764,  0.3000,  ...,  0.4480,  0.4790,  0.1072],
         [ 0.6166,  0.3411,  0.2570,  ...,  0.5106,  0.4919,  0.1703],
         [-0.1183,  0.1018,  0.1330,  ...,  0.7263,  0.3697, -0.0943]]],
       grad_fn=<NativeLayerNormBackward0>)

In [51]:
last_hidden_batch=out['last_hidden_state'][0]
last_hidden_batch

tensor([[ 0.1413,  0.4983,  0.0346,  ...,  0.0015,  0.4425,  0.4796],
        [ 0.2524,  1.1373,  0.6587,  ...,  0.3395,  0.4608,  0.4859],
        [-0.6004,  0.4385,  0.0305,  ...,  0.4747,  0.1893,  0.3341],
        ...,
        [ 0.6018,  0.2764,  0.3000,  ...,  0.4480,  0.4790,  0.1072],
        [ 0.6166,  0.3411,  0.2570,  ...,  0.5106,  0.4919,  0.1703],
        [-0.1183,  0.1018,  0.1330,  ...,  0.7263,  0.3697, -0.0943]],
       grad_fn=<SelectBackward0>)

To remove the embeddings of the PAD tokens, we do the following:

In [52]:
tokens['attention_mask'].shape

torch.Size([1, 50])

In [53]:
mask = tokens['attention_mask'].squeeze().bool()

In [54]:
embeddings = last_hidden_batch[mask]


In [55]:
embeddings

tensor([[ 0.1413,  0.4983,  0.0346,  ...,  0.0015,  0.4425,  0.4796],
        [ 0.2524,  1.1373,  0.6587,  ...,  0.3395,  0.4608,  0.4859],
        [-0.6004,  0.4385,  0.0305,  ...,  0.4747,  0.1893,  0.3341],
        ...,
        [-0.1614,  0.1552, -0.2573,  ...,  0.0814,  0.4423, -0.1207],
        [-0.0507, -0.1088,  0.5689,  ...,  0.1001,  0.2240, -0.5466],
        [ 1.0030,  0.3613,  0.0170,  ...,  0.2430, -0.2788, -0.1007]],
       grad_fn=<IndexBackward0>)

In [56]:
emb_of_sent=embeddings[1:-1] #To remove embeddings for CLS and SEP tokens
emb_of_sent

tensor([[ 0.2524,  1.1373,  0.6587,  ...,  0.3395,  0.4608,  0.4859],
        [-0.6004,  0.4385,  0.0305,  ...,  0.4747,  0.1893,  0.3341],
        [ 0.9868,  0.0997, -0.1441,  ..., -0.5819,  0.1935,  0.5088],
        ...,
        [ 0.6412,  0.0380,  0.0665,  ..., -0.3176,  0.5659, -0.0630],
        [-0.1614,  0.1552, -0.2573,  ...,  0.0814,  0.4423, -0.1207],
        [-0.0507, -0.1088,  0.5689,  ...,  0.1001,  0.2240, -0.5466]],
       grad_fn=<SliceBackward0>)

In [57]:
emb_of_sent.shape

torch.Size([9, 768])

2) hidden_states returns the hidden states for every "layer" in BERT’s encoder architecture, including the last one (returned as last_hidden_state), and the input embeddings as well.

So we might expect 'hidden_states' to contain 12 tensors, one per encoder layer, each with the same shape as 'last_hidden_state'.

Let's check:

In [62]:
out.keys()

odict_keys(['last_hidden_state', 'pooler_output', 'hidden_states', 'attentions'])

In [63]:
out['last_hidden_state'].shape

torch.Size([1, 50, 768])

In [72]:
print(len(out['hidden_states']))
print(out['hidden_states'][0].shape)

13
torch.Size([1, 50, 768])


Oops! There are 13!

What is wrong? Nothing: the first hidden state is simply the input embedding!

In [78]:
(out['hidden_states'][0] == bert_model.embeddings(tokens['input_ids'])).all()

tensor(True)

The last entry is 'last_hidden_state', which is included in the output by default.

In [74]:
(out['hidden_states'][-1] == out['last_hidden_state']).all()

tensor(True)

3) pooler_output is the last hidden state after it has been processed by the pooler. We can check it as below.

In [79]:
(out['pooler_output'] == bert_model.pooler(out['last_hidden_state'])).all()

tensor(True)

4) attentions contains one attention tensor per encoder layer, each of shape (batch size, number of attention heads, sequence length, sequence length):

In [80]:
print(len(out['attentions']))
print(out['attentions'][0].shape)

12
torch.Size([1, 12, 50, 50])


## Let's implement the BERT Classifier

The idea is to use BERT as an encoder and send its output through a classifier (MLP).

In [82]:
from torch import nn
from torch.utils.data import TensorDataset, DataLoader
from transformers import AutoTokenizer

In [92]:
bert_model = AutoModel.from_pretrained("distilbert-base-uncased")
auto_tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
tokenizer_kwargs = dict(truncation=True, padding=True, max_length=30, add_special_tokens=True)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [93]:
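class BERTClassifier(nn.Module):
    def __init__(self, bert_model, ff_units, n_outputs, dropout=0.3):
        super().__init__()
        # DistilBERT's config calls the hidden size `dim`;
        # a BertModel config exposes it as `hidden_size` instead.
        self.d_model = bert_model.config.dim
        self.n_outputs = n_outputs
        self.encoder = bert_model
        # Classifier head applied to the [CLS] hidden state
        self.mlp = nn.Sequential(
            nn.Linear(self.d_model, ff_units),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(ff_units, n_outputs)
        )

    def encode(self, source, source_mask=None):
        # Index 0 of the encoder output is the last hidden state
        states = self.encoder(input_ids=source,
                              attention_mask=source_mask)[0]
        # Keep only the hidden state of the first token ([CLS])
        cls_state = states[:, 0]
        return cls_state

    def forward(self, X):
        # Attention mask: assumes the padding token ID is 0
        source_mask = (X > 0)
        # Featurizer
        cls_state = self.encode(X, source_mask)
        # Classifier
        out = self.mlp(cls_state)
        return out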