In [1]:
import torch
import torch.nn as nn

# BERT

Materials by Francesco Periti and Elisabetta Rocchetti

## Introduction

Transformers recap:
<center><img src="https://pytorch.org/tutorials/_images/transformer_architecture.jpg" width="30%"/></center>
Both the encoder and the decoder learn useful representations of language on their own. Idea: build a model from only one of the two!

Examples of Transformer models in the real world:

- Generative Pre-Training or GPT (Radford et al., “Improving Language Understanding by Generative Pre-Training.”)
- Bidirectional Encoder Representations from Transformers or BERT (Devlin et al., “BERT.”)

GPT looks like this:
<center><img src="img/gpt.png" width="30%"/></center>

Main properties of GPT:

- transfer learning paradigm
- based on Transformers decoder

BERT looks like this:
<center><img src="img/bert.png" width="30%"/></center>

Main properties of BERT:

- transfer learning paradigm
- based on Transformers encoder

## Transfer learning

But... what is transfer learning? A training procedure in two steps:

1. Pre-training: learn general language knowledge $\rightarrow$ high computational cost
2. Fine-tuning: learn to solve a specific task, starting from the language knowledge already acquired $\rightarrow$ low computational cost

## BERT

### Motivation

Why BERT? Unlike GPT, it is bidirectional: each token can learn from both its left and its right context! c:

... but we lose the benefit of masked multi-head attention, namely left-to-right text generation :c

### Architecture

<center><img src="img/bert.svg"/></center>

### Input/Output Representations

Expected input: either a single sentence or a pair of sentences. 

Tokenization: WordPiece tokenizer [Wu et al., “Google’s Neural Machine Translation System.”].

The sequence must start with $\text{[CLS]}$, and each sentence must end with $\text{[SEP]}$.
$$\text{Raccoons love eating. They are playful.} \rightarrow \text{[CLS] Raccoons love eating. [SEP] They are playful. [SEP]}$$
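
As a minimal sketch of this formatting rule, the snippet below wraps one or two sentences with the special tokens. The helper name `format_bert_input` is an assumption made for illustration, and whitespace splitting stands in for the WordPiece tokenizer.

```python
def format_bert_input(sentence_a, sentence_b=None):
    """Wrap one or two sentences with BERT's special tokens.

    Hypothetical helper: whitespace splitting stands in for WordPiece.
    """
    # The sequence always starts with [CLS]; every sentence ends with [SEP]
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"]
    if sentence_b is not None:
        tokens += sentence_b.split() + ["[SEP]"]
    return tokens


single = format_bert_input("Raccoons love eating.")
pair = format_bert_input("Raccoons love eating.", "They are playful.")
print(single)
print(pair)
```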

### Details on WordPiece tokenizer

This implementation of WordPiece tokenizer is similar to Byte-Pair Encoding [Sennrich, Haddow, and Birch, “Neural Machine Translation of Rare Words with Subword Units.”] and is described in more detail in [Schuster and Nakajima, “Japanese and Korean Voice Search.”].

Steps:

1. Extract all the words from the dataset along with their count
2. Split all the words into character sequences
3. Define a vocabulary size
4. Add all the unique characters present in the character sequences to the vocabulary
5. Compute a score for every pair of adjacent symbols, merge the pair with the highest score, and add the merged symbol to the vocabulary
6. Repeat Step 5 until you reach the vocabulary size
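
The steps do not spell out the score used in Step 5. The implementation in this notebook scores a pair as count(pair) / (count(first) × count(second)); the sketch below applies that formula once, on single characters only, to pick the first merge for a toy word list. The function name `best_pair` is an assumption made for illustration.

```python
from collections import Counter


def best_pair(word_counts):
    """Return the highest-scoring adjacent character pair and its score."""
    char_counts = Counter()
    pair_counts = Counter()
    for word, count in word_counts.items():
        # Every occurrence of the word contributes its characters and pairs
        for ch in word:
            char_counts[ch] += count
        for left, right in zip(word, word[1:]):
            pair_counts[(left, right)] += count
    # Score = count(pair) / (count(first) * count(second))
    scores = {
        pair: n / (char_counts[pair[0]] * char_counts[pair[1]])
        for pair, n in pair_counts.items()
    }
    pair = max(scores, key=scores.get)
    return pair, scores[pair]


words = Counter(["this", "is", "home", "house", "horse", "universe", "university"])
pair, score = best_pair(words)
print(pair, score)
```

On this word list the winning pair is `('t', 'y')` with score 0.5: the pair is rare, but so are its two characters, which is what this score rewards compared with plain pair frequency.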

In [2]:
import torchtext
from torch.utils.data import DataLoader
import datasets

dataset = torchtext.datasets.AG_NEWS(split = 'train')
batch_size = 30
data_loader = iter(DataLoader(dataset, batch_size))

In [3]:
for s in data_loader:
    print(s)
    break 

[tensor([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,


In [4]:
import nltk
from collections import Counter

class WordPieceTokenizer:
    def __init__(self, vocabulary_size):
        self.word_counter = Counter()
        # Step 3: target vocabulary size
        self.vocab_size = vocabulary_size
        # Steps 4 and 5: symbols (characters and merged pairs) with their counts
        self.vocab = Counter()
        
    # Step 1: extract the words of each sentence and count them
    def extract_words(self, batch):
        res_batch = []
        for sentence in batch:
            tokenized = nltk.word_tokenize(sentence.lower().strip())
            self.word_counter.update(tokenized)
            res_batch.append(tokenized)
        return res_batch
    
    # Step 2: split every word into its characters
    def split_char(self, batch):
        res_batch = []
        for sentence in batch:
            res_batch.append([[c for c in word] for word in sentence])
        return res_batch
    
    # Step 4: add all the unique characters to the vocabulary
    def add_unique_chars(self, batch):
        batch_split = self.split_char(batch)
        for sentence in batch_split:
            for word in sentence:
                self.vocab.update(word)
    
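    # Step 5: score every pair of adjacent symbols
    def count_likelihood(self):
        pair_counter = Counter()
        for word, word_count in self.word_counter.items():
            # Split the word with the current vocabulary (greedy longest match)
            word = self.word_tokenize(word)
            if len(word) < 2: continue
            for i in range(len(word)-1):
                # Each occurrence of the word contributes one occurrence of the pair
                pair_counter[(word[i], word[i+1])] += word_count
        # Score = count(pair) / (count(first) * count(second))
        likelihoods = {k: v/(self.vocab[k[0]]*self.vocab[k[1]]) for k, v in pair_counter.items()}
        # Sort in ascending order so that pop() returns the highest-scoring pair
        likelihoods = sorted(likelihoods.items(), key=lambda x: x[1])
        return likelihoods, pair_counter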
    
    def add_frequent(self, logging = False):
        while len(self.vocab) < self.vocab_size:
            likelihoods, counter = self.count_likelihood()
            found_next = False
            while not found_next:
                if not likelihoods: return len(self.vocab) < self.vocab_size
                new_frequent_split = likelihoods.pop()
                if f'{new_frequent_split[0][0]}{new_frequent_split[0][1]}' not in self.vocab: found_next = True
            self.vocab.update({f'{new_frequent_split[0][0]}{new_frequent_split[0][1]}':counter[new_frequent_split[0]]})
            if logging: 
                print(f'The pair "{new_frequent_split[0][0]}{new_frequent_split[0][1]}" having score {new_frequent_split[1]} has been added')
                print(f'Updated vocabulary: {self.vocab}')
        return len(self.vocab) == self.vocab_size
    
    def word_tokenize(self, word):
        if len(word)==1:
            return [word]
        splits = []
        i = 0
        word_temp = word
        while ''.join(splits) !=  word:
            if i < len(word_temp):
                split = word_temp[0:len(word_temp)-i]
                if split in self.vocab:
                    splits.append(split)
                    word_temp = word_temp[len(split):]
                    i=0
                else:
                    i = i+1
            else:
                splits = splits + [word_temp]
        return splits
    
    def tokenize(self, sentence):
        sentence_tokenized = []
        for word in nltk.word_tokenize(sentence.lower().strip()):
            sentence_tokenized.append(self.word_tokenize(word))
        return sentence_tokenized

In [49]:
wptokenizer = WordPieceTokenizer(vocabulary_size=20)
b = ['this is home', 'house', 'horse', 'universe', 'university']
d = wptokenizer.extract_words(b)
print(d)
wptokenizer.add_unique_chars(d)
print(wptokenizer.vocab)
wptokenizer.add_frequent(logging = True)
print(len(wptokenizer.vocab))

[['this', 'is', 'home'], ['house'], ['horse'], ['universe'], ['university']]
Counter({'s': 6, 'e': 6, 'i': 5, 'h': 4, 'o': 3, 'u': 3, 'r': 3, 't': 2, 'n': 2, 'v': 2, 'm': 1, 'y': 1})
The pair "ty" having score 0.5 has been added
Updated vocabulary: Counter({'s': 6, 'e': 6, 'i': 5, 'h': 4, 'o': 3, 'u': 3, 'r': 3, 't': 2, 'n': 2, 'v': 2, 'm': 1, 'y': 1, 'ty': 1})
The pair "un" having score 0.3333333333333333 has been added
Updated vocabulary: Counter({'s': 6, 'e': 6, 'i': 5, 'h': 4, 'o': 3, 'u': 3, 'r': 3, 't': 2, 'n': 2, 'v': 2, 'un': 2, 'm': 1, 'y': 1, 'ty': 1})
The pair "om" having score 0.3333333333333333 has been added
Updated vocabulary: Counter({'s': 6, 'e': 6, 'i': 5, 'h': 4, 'o': 3, 'u': 3, 'r': 3, 't': 2, 'n': 2, 'v': 2, 'un': 2, 'm': 1, 'y': 1, 'ty': 1, 'om': 1})
The pair "hom" having score 0.25 has been added
Updated vocabulary: Counter({'s': 6, 'e': 6, 'i': 5, 'h': 4, 'o': 3, 'u': 3, 'r': 3, 't': 2, 'n': 2, 'v': 2, 'un': 2, 'm': 1, 'y': 1, 'ty': 1, 'om': 1, 'hom': 1})
The pa

In [52]:
wptokenizer.word_tokenize('universe')

['univ', 'e', 'r', 's', 'e']

In [53]:
wptokenizer = WordPieceTokenizer(vocabulary_size=10000)
_, b = next(data_loader)
d = wptokenizer.extract_words(b)
wptokenizer.add_unique_chars(d)
wptokenizer.add_frequent()

True

In [8]:
example_sentence = next(data_loader)[1][2]
tokenized = wptokenizer.tokenize(example_sentence)
print(f'Example sentence: {example_sentence}')
print(f'Tokenized:{tokenized}')

Example sentence: Downhome Pinoy Blues, Intersecting Life Paths, and Heartbreak Songs The Blues is alive and well in the Philippines, as evidenced by this appreciation of the Pinoy Blues band 'Lampano Alley', penned by columnist Clarence Henderson as a counterpoint to his usual economics, business, and culture fare.
Tokenized:[['down', 'ho', 'm', 'e'], ['pin', 'o', 'y'], ['b', 'l', 'u', 'es'], [','], ['inter', 's', 'e', 'c', 't', 'ing'], ['l', 'if', 'e'], ['p', 'at', 'h', 's'], [','], ['and'], ['heart', 'break'], ['s', 'on', 'g', 's'], ['the'], ['b', 'l', 'u', 'es'], ['is'], ['al', 'ive'], ['and'], ['well'], ['in'], ['the'], ['philippines'], [','], ['as'], ['e', 'vi', 'de', 'n', 'ced'], ['by'], ['this'], ['a', 'p', 'pre', 'c', 'i', 'at', 'ion'], ['of'], ['the'], ['pin', 'o', 'y'], ['b', 'l', 'u', 'es'], ['band'], ["'", 'l', 'a', 'mp', 'an', 'o'], ['all', 'e', 'y'], ["'"], [','], ['p', 'e', 'n', 'ned'], ['by'], ['columnist'], ['c', 'l', 'are', 'n', 'c', 'e'], ['he', 'nd', 'ers', 'on'], 

In [9]:
example_sentence = next(data_loader)[1][0]
tokenized = wptokenizer.tokenize(example_sentence)
print(f'Example sentence: {example_sentence}')
print(f'Tokenized:{tokenized}')

Example sentence: Science, Politics Collide in Election Year (AP) AP - With more than 4,000 scientists, including 48 Nobel Prize winners, having signed a statement opposing the Bush administration's use of scientific advice, this election year is seeing a new development in the uneasy relationship between science and politics.
Tokenized:[['s', 'c', 'i', 'e', 'n', 'c', 'e'], [','], ['politic', 's'], ['co', 'll', 'id', 'e'], ['in'], ['election'], ['year'], ['('], ['a', 'p'], [')'], ['a', 'p'], ['-'], ['with'], ['more'], ['than'], ['4', ',000'], ['scientists'], [','], ['including'], ['4', '8'], ['no', 'be', 'l'], ['pri', 'z', 'e'], ['winn', 'ers'], [','], ['having'], ['s', 'i', 'g', 'ned'], ['a'], ['state', 'm', 'e', 'n', 't'], ['op', 'po', 'sing'], ['the'], ['bush'], ['ad', 'm', 'in', 'ist', 'r', 'at', 'ion'], ["'s"], ['use'], ['of'], ['s', 'c', 'i', 'e', 'n', 't', 'if', 'i', 'c'], ['ad', 'vic', 'e'], [','], ['this'], ['election'], ['year'], ['is'], ['s', 'e', 'ein', 'g'], ['a'], ['new']

### Token embeddings

... with Hugging Face

In [10]:
from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [11]:
#sentences = example_sentence.split('.')
example_sentence = 'Do you like raccons? Yes, I love them!'
example_sentence_1, example_sentence_2 = 'Do you like raccons?', 'Yes, I love them!'
tokenized_1 = bert_tokenizer.tokenize(example_sentence_1)
tokenized_2 = bert_tokenizer.tokenize(example_sentence_2)
print(f'Example sentence: {example_sentence_1}')
print(f'Tokenized:{tokenized_1}')

Example sentence: Do you like raccons?
Tokenized:['do', 'you', 'like', 'ra', '##cco', '##ns', '?']


Now that we have our tokens, we can look up an embedding for each of them (as in the original Transformer!).

In [12]:
tokenizer_output = bert_tokenizer(example_sentence_1, example_sentence_2, padding = 'max_length')
print(f'Input ids: {tokenizer_output["input_ids"]}')

Input ids: [101, 2079, 2017, 2066, 10958, 21408, 3619, 1029, 102, 2748, 1010, 1045, 2293, 2068, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [13]:
tokens = bert_tokenizer.convert_ids_to_tokens(tokenizer_output["input_ids"])
print(f'Input tokens ({len(tokens)}): {tokens}')

Input tokens (512): ['[CLS]', 'do', 'you', 'like', 'ra', '##cco', '##ns', '?', '[SEP]', 'yes', ',', 'i', 'love', 'them', '!', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '

In [14]:
d_model = 768 #as in the original paper
max_seq_len = 512

embedding_layer = nn.Embedding(bert_tokenizer.vocab_size, d_model)
input_embedding = embedding_layer(torch.IntTensor(tokenizer_output["input_ids"]))
input_embedding.size()

torch.Size([512, 768])

### Position embeddings

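Add position embeddings. BERT itself learns its position embeddings; for simplicity, this notebook reuses the fixed sinusoidal encoding of the original Transformer.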
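
For reference, the sinusoidal encoding of the original Transformer is

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

The following is a minimal scalar sketch of that formula for a single position, meant only to make the indexing explicit. The function name `sinusoidal_encoding` is an assumption made for illustration.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Sinusoidal encoding of one position, following the formula above."""
    encoding = []
    for k in range(d_model):
        # Dimensions 2i and 2i+1 share the same frequency 1 / 10000^(2i / d_model)
        two_i = k - (k % 2)
        angle = position / (10000 ** (two_i / d_model))
        # Even dimensions use the sine, odd dimensions the cosine
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding


first = sinusoidal_encoding(0, 4)
second = sinusoidal_encoding(1, 4)
print(first)
print(second)
```

Position 0 always encodes to alternating zeros and ones, since sin(0) = 0 and cos(0) = 1.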

In [15]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_sequence_length):
        super().__init__()
        self.max_sequence_length = max_sequence_length
        self.d_model = d_model

    def forward(self):
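        # Even dimension index shared by each sin/cos pair: 0, 0, 2, 2, 4, 4, ...
        two_i = torch.arange(0,self.d_model,2, dtype=torch.float).repeat_interleave(2)[:self.d_model]
        # two_i already equals 2i in the formula, so the exponent is two_i / d_model
        denominator = torch.pow(10000, two_i/self.d_model)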
        position = torch.arange(self.max_sequence_length).reshape(self.max_sequence_length, 1)
        sin_cos_argument = position/denominator
        PE = torch.zeros(size = sin_cos_argument.shape)
        PE[:, 0::2] = torch.sin(sin_cos_argument[:, 0::2])
        PE[:, 1::2] = torch.cos(sin_cos_argument[:, 1::2])
        return PE

In [16]:
tokenizer_output = bert_tokenizer(example_sentence_1, example_sentence_2, truncation=True, padding='max_length', add_special_tokens=True)
input_embedding = embedding_layer(torch.IntTensor(tokenizer_output["input_ids"]))
print(input_embedding.size())

positional_encoding_layer = PositionalEncoding(max_sequence_length=max_seq_len, d_model=d_model)
input_embedding = input_embedding + positional_encoding_layer()
input_embedding

torch.Size([512, 768])


tensor([[ 0.4902,  0.1315,  0.8269,  ...,  0.4319,  0.3296,  0.1364],
        [ 0.6519, -0.3569, -1.5178,  ..., -0.4685,  1.4497,  1.8214],
        [ 1.0745,  0.3352,  0.6594,  ...,  1.5334, -0.4668,  1.6931],
        ...,
        [-2.7681,  0.7737,  0.9482,  ...,  1.4937,  0.7936,  1.9153],
        [-1.9567,  0.2627,  0.7124,  ...,  1.4937,  0.7936,  1.9153],
        [-1.9482, -0.6961, -0.1470,  ...,  1.4937,  0.7936,  1.9153]],
       grad_fn=<AddBackward0>)

### Sentence embeddings

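Add sentence (segment) embeddings. BERT learns one embedding per segment; as a simplification, this notebook adds the segment id itself (0 or 1) to every dimension.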
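
To make the segment ids concrete, here is a small sketch that derives them from a token list. It assumes the convention visible in the tokenizer output, where the first sentence (through its $\text{[SEP]}$) gets id 0, the second gets id 1, and padding gets id 0. The function name `segment_ids` is hypothetical.

```python
def segment_ids(tokens):
    """Assign 0 to the first segment (through its [SEP]) and 1 to the second."""
    ids, current = [], 0
    for token in tokens:
        # Padding positions are assigned segment 0
        ids.append(0 if token == "[PAD]" else current)
        if token == "[SEP]":
            current = 1
    return ids


tokens = ["[CLS]", "do", "you", "like", "ra", "##cco", "##ns", "?", "[SEP]",
          "yes", ",", "i", "love", "them", "!", "[SEP]", "[PAD]", "[PAD]"]
ids = segment_ids(tokens)
print(ids)
```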

In [17]:
sentence_embeddings = torch.tile(torch.IntTensor(tokenizer_output['token_type_ids']).unsqueeze(1), (1, d_model))
print(sentence_embeddings.size())
sentence_embeddings

torch.Size([512, 768])


tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]], dtype=torch.int32)

In [18]:
input_embedding = input_embedding + sentence_embeddings
input_embedding

tensor([[ 0.4902,  0.1315,  0.8269,  ...,  0.4319,  0.3296,  0.1364],
        [ 0.6519, -0.3569, -1.5178,  ..., -0.4685,  1.4497,  1.8214],
        [ 1.0745,  0.3352,  0.6594,  ...,  1.5334, -0.4668,  1.6931],
        ...,
        [-2.7681,  0.7737,  0.9482,  ...,  1.4937,  0.7936,  1.9153],
        [-1.9567,  0.2627,  0.7124,  ...,  1.4937,  0.7936,  1.9153],
        [-1.9482, -0.6961, -0.1470,  ...,  1.4937,  0.7936,  1.9153]],
       grad_fn=<AddBackward0>)

### Pre-training

Problem: BERT is bidirectional. In a multi-layer model, conditioning on both left and right context would let each word indirectly "see itself", so the model could trivially predict the target word.

*SOLUTION*: mask some percentage of the input tokens at random, and then predict those masked tokens! c:

This is called **Masked Language Modeling (MLM)**, and it is one of the two pre-training objectives BERT has.

Usually, language models are trained to predict each word in a sentence; here, BERT must predict only the masked tokens.

**Masking procedure**

0. Select 15% of the tokens in the input sequence at random
1. $\rightarrow$ 80% of those tokens will be replaced with the special token $\text{[MASK]}$
2. $\rightarrow$ 10% of those tokens will be replaced with a random token
3. $\rightarrow$ 10% of those tokens will remain unchanged
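
The procedure above can be sketched token by token in plain Python. This is an illustrative variant, not the notebook's implementation: each token draws its fate independently, so the 15% and 80/10/10 proportions hold only in expectation. The function name `mask_tokens`, the toy vocabulary, and the `select_prob` parameter are assumptions.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def mask_tokens(tokens, vocabulary, rng, select_prob=0.15):
    """Apply the 15% / 80-10-10 rule token by token.

    Returns the corrupted tokens and the list of selected positions.
    """
    corrupted = list(tokens)
    selected = []
    for position, token in enumerate(tokens):
        # Special tokens are never selected; others are selected with select_prob
        if token in SPECIAL_TOKENS or rng.random() >= select_prob:
            continue
        selected.append(position)
        draw = rng.random()
        if draw < 0.8:
            corrupted[position] = "[MASK]"                 # 80%: mask
        elif draw < 0.9:
            corrupted[position] = rng.choice(vocabulary)   # 10%: random token
        # remaining 10%: keep the original token
    return corrupted, selected


rng = random.Random(0)
tokens = ["[CLS]", "do", "you", "like", "raccoons", "?", "[SEP]"]
corrupted, selected = mask_tokens(tokens, ["cat", "tree", "blue"], rng, select_prob=0.5)
print(corrupted, selected)
```

The selected positions are returned because the model is asked to predict the original token at every one of them, including those left unchanged.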

**Why 80-10-10?**

There may be a mismatch between pre-training and fine-tuning, because most downstream tasks (e.g. sentiment analysis) do not involve predicting masked words. The model should be good not only at predicting masked tokens, but also as a pre-trained model for other tasks. The advantage of this procedure is that the Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token.

- Just using the $\text{[MASK]}$ token resulted in the model learning very little about the context of the surrounding words. This seems to occur since the model knows it can “forget” all the information about the surrounding words and focus only on the target word.

- If you use the $\text{[MASK]}$ token 80% of the time and the right word 20% of the time, the model will know that when the [MASK] is not there, then the word is correct. The network has to predict the token, but it actually gets the answer already as input. Thus, it needs to learn nothing, since it knows the non-masked token is always correct.

- If you use the $\text{[MASK]}$ token 80% of the time and a wrong word 20% of the time, the model will know that whenever $\text{[MASK]}$ does not appear, the selected token is wrong. It will then treat that token like another $\text{[MASK]}$, and you would likely run into the same problem as before (100% $\text{[MASK]}$).

In [19]:
import random

def mask_sentence(sentence1, sentence2 = None, add_special_tokens = True, padding = 'max_length'):
    if sentence2:
        inputs = bert_tokenizer(sentence1, sentence2, return_tensors='pt', add_special_tokens=add_special_tokens, truncation=True, padding=padding)
    else:
        inputs = bert_tokenizer(sentence1, return_tensors='pt', add_special_tokens=add_special_tokens, truncation=True, padding=padding)
    
    inputs['labels'] = inputs.input_ids.detach().clone() #for training
    
    ids = inputs.input_ids.squeeze()
    # One uniform random number in [0, 1) per position
    rand = torch.rand(ids.shape)
    # Select about 15% of the positions, never touching [CLS], [SEP] or [PAD]
    mask_arr = (rand < 0.15) & (ids != bert_tokenizer.cls_token_id) & (ids != bert_tokenizer.sep_token_id) & (ids != bert_tokenizer.pad_token_id)
    index_to_mask = torch.flatten(mask_arr.nonzero()).tolist()

    # 90% of the selected positions are altered; the remaining 10% stay unchanged
    index_to_mask_shuffle = random.sample(index_to_mask, int(len(index_to_mask)-(0.1 * len(index_to_mask))))
    # 80% of the selected positions are replaced with [MASK] ...
    to_mask = int(0.8*len(index_to_mask))
    inputs.input_ids[0, index_to_mask_shuffle[:to_mask]] = bert_tokenizer.mask_token_id
    # ... and the other altered positions are replaced with a random vocabulary token
    inputs.input_ids[0, index_to_mask_shuffle[to_mask:]] = torch.LongTensor(random.sample(list(bert_tokenizer.vocab.values()), len(index_to_mask_shuffle)-to_mask))
    return inputs


In [20]:
masked_sentence = mask_sentence(example_sentence_1, example_sentence_2)
print(f'Original sentence: {example_sentence_1} {example_sentence_2}')
print(f'Masked sentence: {" ".join(bert_tokenizer.convert_ids_to_tokens(*masked_sentence.input_ids))}')

Original sentence: Do you like raccons? Yes, I love them!
Masked sentence: [CLS] do you like ra [MASK] ##ns ? [SEP] yes , [MASK] love them ! [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [

#### Pipeline with masking

In [21]:
tokenized_input_masked = mask_sentence(example_sentence_1, example_sentence_2)
token_embeddings = embedding_layer(tokenized_input_masked['input_ids'])
position_embeddings = positional_encoding_layer()
sentence_embeddings = torch.tile(tokenized_input_masked['token_type_ids'].permute(1,0), (1, d_model)).unsqueeze(0)
embeddings = token_embeddings + position_embeddings + sentence_embeddings
print(embeddings.size())
embeddings

torch.Size([1, 512, 768])


tensor([[[ 0.4902,  0.1315,  0.8269,  ...,  0.4319,  0.3296,  0.1364],
         [ 0.6519, -0.3569, -1.5178,  ..., -0.4685,  1.4497,  1.8214],
         [ 0.6947, -1.9195,  0.1595,  ...,  1.7897, -0.2654,  0.0901],
         ...,
         [-2.7681,  0.7737,  0.9482,  ...,  1.4937,  0.7936,  1.9153],
         [-1.9567,  0.2627,  0.7124,  ...,  1.4937,  0.7936,  1.9153],
         [-1.9482, -0.6961, -0.1470,  ...,  1.4937,  0.7936,  1.9153]]],
       grad_fn=<AddBackward0>)

During pre-training, the MLM loss is computed on the tokens selected by the masking procedure and then backpropagated.
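
One common way to restrict the loss to the selected tokens is to give every other position the label -100, which is the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The sketch below builds such a label list in plain Python. The function name `build_mlm_labels` is an assumption, and `mask_sentence` above does not do this: it keeps the full copy of the input ids as labels.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_mlm_labels(original_ids, selected_positions):
    """Keep the original id at selected positions and ignore all the others."""
    selected = set(selected_positions)
    return [
        token_id if position in selected else IGNORE_INDEX
        for position, token_id in enumerate(original_ids)
    ]


# Ids of "[CLS] do you like ra ##cco ##ns ? [SEP]" from the tokenizer output above
original_ids = [101, 2079, 2017, 2066, 10958, 21408, 3619, 1029, 102]
labels = build_mlm_labels(original_ids, selected_positions=[5])
print(labels)
```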

The second pre-training objective is called **Next Sentence Prediction (NSP)**. 

Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling.

When pre-training for the binarized NSP task, we choose the sentences A and B for each pre-training example so that 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext).
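
A minimal sketch of this sampling scheme, assuming the corpus is a plain list of consecutive sentences. The function name `make_nsp_pairs` is hypothetical, and excluding A and its true successor from the random draw is a simplification chosen here so that NotNext labels are never wrong by accident. Labels follow the convention 0 = IsNext, 1 = NotNext.

```python
import random


def make_nsp_pairs(sentences, rng):
    """Build (sentence_a, sentence_b, label) triples for NSP.

    Label 0 means IsNext, label 1 means NotNext.
    """
    pairs = []
    for index in range(len(sentences) - 1):
        sentence_a = sentences[index]
        if rng.random() < 0.5:
            # 50% of the time B is the sentence that really follows A
            pairs.append((sentence_a, sentences[index + 1], 0))
        else:
            # Otherwise B is a random sentence that is neither A nor its successor
            candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
            pairs.append((sentence_a, rng.choice(candidates), 1))
    return pairs


corpus = [
    "Raccoons love eating.",
    "They are playful.",
    "The sky is blue.",
    "Pizza is served unsliced.",
]
pairs = make_nsp_pairs(corpus, random.Random(0))
for sentence_a, sentence_b, label in pairs:
    print(label, "|", sentence_a, "|", sentence_b)
```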

The final hidden state of $\text{[CLS]}$ is the input to the NSP classifier.

<center><img src="img/bert_pretrain_2.svg"/></center>

### Fine-tuning

Just use task-specific data and add output layers that take BERT's output as input.

**Question Answering with Stanford Question Answering Dataset (SQuAD)**
<center><img src="img/bert_finetune_squad.svg"/></center>

**Entailment with MNLI**
<center><img src="img/bert_finetune_mnli.svg"/></center>

### Hands-on: BERT!

This section is (finally) dedicated to some real-world BERT use cases

#### MLM

In [22]:
from transformers import BertForMaskedLM

model_mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")
print(bert_tokenizer.vocab_size)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


30522


In [23]:
with torch.no_grad():
    logits = model_mlm(**masked_sentence).logits
logits.size()

torch.Size([1, 512, 30522])

In [24]:
# retrieve index of [MASK]
mask_token_index = (masked_sentence.input_ids == bert_tokenizer.mask_token_id)[0].nonzero().squeeze()
mask_token_index

tensor([ 5, 11])

In [26]:
predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
print(predicted_token_id)
print(bert_tokenizer.convert_ids_to_tokens([predicted_token_id[0]]))

tensor([22414,  1045])
['##bba']


In [27]:
prediction = torch.where(masked_sentence.input_ids == bert_tokenizer.mask_token_id, logits[0, :].argmax(axis=-1), masked_sentence.input_ids)[0]
print(bert_tokenizer.decode(masked_sentence.labels[0]))
print(bert_tokenizer.decode(masked_sentence.input_ids[0]))
print(bert_tokenizer.decode(prediction))

[CLS] do you like raccons? [SEP] yes, i love them! [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [

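Pipeline version. Note that the decoded string passed to the pipeline already contains $\text{[CLS]}$ and $\text{[SEP]}$, and the pipeline adds them again, which is why they appear twice in the output.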

In [28]:
from transformers import pipeline

mlm_pipeline = pipeline('fill-mask', model=model_mlm, tokenizer=bert_tokenizer)

In [31]:
pipeline_mask_sentence = mask_sentence(example_sentence_1, example_sentence_2, add_special_tokens = True, padding = False)
pipeline_mask_sentence = bert_tokenizer.decode(pipeline_mask_sentence.input_ids[0])
pipeline_mask_sentence

'[CLS] do you like raccons? [SEP] yes, i [MASK] [MASK]! [SEP]'

In [32]:
mlm_pipeline(pipeline_mask_sentence)

[[{'score': 0.7961465716362,
   'token': 2079,
   'token_str': 'do',
   'sequence': '[CLS] [CLS] do you like raccons? [SEP] yes, i do [MASK]! [SEP] [SEP]'},
  {'score': 0.07182539254426956,
   'token': 2572,
   'token_str': 'am',
   'sequence': '[CLS] [CLS] do you like raccons? [SEP] yes, i am [MASK]! [SEP] [SEP]'},
  {'score': 0.017079079523682594,
   'token': 2031,
   'token_str': 'have',
   'sequence': '[CLS] [CLS] do you like raccons? [SEP] yes, i have [MASK]! [SEP] [SEP]'},
  {'score': 0.015040510334074497,
   'token': 2113,
   'token_str': 'know',
   'sequence': '[CLS] [CLS] do you like raccons? [SEP] yes, i know [MASK]! [SEP] [SEP]'},
  {'score': 0.013145454227924347,
   'token': 2066,
   'token_str': 'like',
   'sequence': '[CLS] [CLS] do you like raccons? [SEP] yes, i like [MASK]! [SEP] [SEP]'}],
 [{'score': 0.5938681960105896,
   'token': 2079,
   'token_str': 'do',
   'sequence': '[CLS] [CLS] do you like raccons? [SEP] yes, i [MASK] do! [SEP] [SEP]'},
  {'score': 0.104746617

#### NSP

In [33]:
from transformers import BertForNextSentencePrediction

model_nsp = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see input_ids docstring). Indices should be in [0, 1]:
- 0 indicates sequence B is a continuation of sequence A,
- 1 indicates sequence B is a random sequence.

In [34]:
masked_sentence['labels'] = torch.LongTensor([0])
outputs = model_nsp(**masked_sentence)
logits = outputs.logits
print(logits)
print(torch.argmax(logits)) #isNext

tensor([[ 5.0871, -4.1504]], grad_fn=<AddmmBackward0>)
tensor(0)
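
The two logits can be turned into probabilities with a softmax. The sketch below does this in plain Python for the values printed above; the function name `softmax` is defined here for illustration and is not taken from a library.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


nsp_logits = [5.0871, -4.1504]  # logits printed by the cell above
probabilities = softmax(nsp_logits)
predicted_label = probabilities.index(max(probabilities))
print(probabilities, predicted_label)
```

The gap between the two logits is large, so almost all of the probability mass goes to label 0 (sequence B is a continuation of sequence A).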


In [35]:
prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
next_sentence = "The sky is blue due to the shorter wavelength of blue light."
#next_sentence = "This is what happens in Italian restaurants."
encoding = bert_tokenizer(prompt, next_sentence, return_tensors="pt")

outputs = model_nsp(**encoding, labels=torch.LongTensor([1]))
logits = outputs.logits
print(logits)
print(torch.argmax(logits)) # NotNext: the second sentence is random

tensor([[-3.0729,  5.9056]], grad_fn=<AddmmBackward0>)
tensor(1)


#### Sequence Classification with MNLI

In [37]:
from datasets import load_dataset

mnli_dataset = load_dataset('multi_nli', split='train[:100]')

In [38]:
example_mnli = mnli_dataset[9] #entailment (0), neutral (1), contradiction (2)
example_mnli

{'promptID': 60529,
 'pairID': '60529c',
 'premise': "At the end of Rue des Francs-Bourgeois is what many consider to be the city's most handsome residential square, the Place des Vosges, with its stone and red brick facades.",
 'premise_binary_parse': "( ( At ( ( the end ) ( of Rue ) ) ) ( ( des Francs-Bourgeois ) ( ( is ( what ( many ( consider ( to ( ( be ( ( ( ( ( the ( city 's ) ) ( ( most handsome ) ( residential square ) ) ) , ) ( the ( Place ( des Vosges ) ) ) ) , ) ) ( with ( its ( stone ( and ( red ( brick facades ) ) ) ) ) ) ) ) ) ) ) ) . ) ) )",
 'premise_parse': "(ROOT (S (PP (IN At) (NP (NP (DT the) (NN end)) (PP (IN of) (NP (NNP Rue))))) (NP (NNP des) (NNP Francs-Bourgeois)) (VP (VBZ is) (SBAR (WHNP (WP what)) (S (NP (DT many)) (VP (VBP consider) (S (VP (TO to) (VP (VB be) (NP (NP (NP (DT the) (NN city) (POS 's)) (ADJP (RBS most) (JJ handsome)) (JJ residential) (NN square)) (, ,) (NP (DT the) (NNP Place) (NNP des) (NNPS Vosges)) (, ,)) (PP (IN with) (NP (PRP$ its) (NN st

In [39]:
print(example_mnli['premise'])
example_mnli['hypothesis'] = "Place des Vosges is constructed entirely of stone and bricks."
print(example_mnli['hypothesis'])

At the end of Rue des Francs-Bourgeois is what many consider to be the city's most handsome residential square, the Place des Vosges, with its stone and red brick facades.
Place des Vosges is constructed entirely of stone and bricks.


In [40]:
def mnli_preprocess_function(examples):
    return bert_tokenizer(examples['premise'], examples['hypothesis'], truncation=True, padding = True, return_tensors="pt")

In [41]:
print(f'Premise: {example_mnli["premise"]}')
print(f'Hypothesis: {example_mnli["hypothesis"]}')
example_mnli_processed = mnli_preprocess_function(example_mnli)
print(f'After preprocessing: {bert_tokenizer.decode(example_mnli_processed["input_ids"][0])}')

Premise: At the end of Rue des Francs-Bourgeois is what many consider to be the city's most handsome residential square, the Place des Vosges, with its stone and red brick facades.
Hypothesis: Place des Vosges is constructed entirely of stone and bricks.
After preprocessing: [CLS] at the end of rue des francs - bourgeois is what many consider to be the city's most handsome residential square, the place des vosges, with its stone and red brick facades. [SEP] place des vosges is constructed entirely of stone and bricks. [SEP]


In [42]:
encoded_mnli_dataset = mnli_dataset.map(mnli_preprocess_function, batched=True)

In [43]:
from transformers import AutoModelForSequenceClassification

model_mnli = AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-MNLI") #entailment (1), neutral (2), contradiction (0)


In [44]:
outputs = model_mnli(**example_mnli_processed)
logits = outputs.logits
print(logits)
print(torch.argmax(logits))

tensor([[-2.2080,  0.2117,  1.4080]], grad_fn=<AddmmBackward0>)
tensor(2)
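
The dataset and the fine-tuned model use different label orders, so the predicted index has to be translated before it is compared with the dataset label. The sketch below uses the two mappings given in the code comments of this section; they are copied from those comments, not verified against the model configuration, and the function name is hypothetical.

```python
# Label order of the multi_nli dataset (from the comment in the dataset cell)
DATASET_LABELS = {"entailment": 0, "neutral": 1, "contradiction": 2}
# Label order of the fine-tuned model (from the comment in the model cell)
MODEL_LABELS = {0: "contradiction", 1: "entailment", 2: "neutral"}


def model_to_dataset_label(model_label_id):
    """Translate a model prediction into its name and the dataset's label id."""
    name = MODEL_LABELS[model_label_id]
    return name, DATASET_LABELS[name]


result = model_to_dataset_label(2)
print(result)
```

Under these mappings, the prediction `tensor(2)` above reads as "neutral".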


#### Question Answering with SQuAD

In [45]:
squad_dataset = load_dataset("squad", split = 'train[:100]')

In [46]:
squad_example = squad_dataset[0]
squad_example

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

In [47]:
from transformers import pipeline

qa_pipeline = pipeline(
  "question-answering",
  model="csarron/bert-base-uncased-squad-v1",
  tokenizer="csarron/bert-base-uncased-squad-v1"
)

Some weights of the model checkpoint at csarron/bert-base-uncased-squad-v1 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [48]:
predictions = qa_pipeline({
  'context': squad_example['context'],
  'question': squad_example['question']
})

print(predictions)

{'score': 0.9795625805854797, 'start': 515, 'end': 541, 'answer': 'Saint Bernadette Soubirous'}
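
The `start` and `end` values in the prediction are character offsets into the context, which is why the predicted `start` (515) matches the gold `answer_start` of the example. The sketch below illustrates the relationship on a shortened excerpt of the context, so the offsets it computes differ from those above.

```python
# Excerpt of the SQuAD context shown above (shortened for illustration).
context = (
    "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
    "reputedly appeared to Saint Bernadette Soubirous in 1858."
)
answer = "Saint Bernadette Soubirous"

# `start` is the character offset of the answer; `end` is exclusive.
start = context.find(answer)
end = start + len(answer)

# Slicing the context with these offsets recovers the answer span.
print(start, end, context[start:end])
```

The same check can be applied to a real prediction by slicing the full context with the returned `start` and `end`.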


## BERT Variants

<center><img src="img/bert_models.png" width="70%"/></center>

## BERT Extensions

<center><img src="img/bert_variants.png" width="70%"/></center>

## References

Papers:
- Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. “Improving Language Understanding by Generative Pre-Training.” OpenAI, 2018.
- Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–86. Minneapolis, Minnesota: Association for Computational Linguistics, 2019. https://doi.org/10.18653/v1/N19-1423.
- Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, et al. “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.” arXiv:1609.08144 [cs], October 8, 2016. http://arxiv.org/abs/1609.08144.
- Sennrich, Rico, Barry Haddow, and Alexandra Birch. “Neural Machine Translation of Rare Words with Subword Units.” arXiv, June 10, 2016. https://doi.org/10.48550/arXiv.1508.07909.
- Schuster, Mike, and Kaisuke Nakajima. “Japanese and Korean Voice Search.” In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5149–52, 2012. https://doi.org/10.1109/ICASSP.2012.6289079.
- Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” arXiv:1907.11692 [cs], July 26, 2019. http://arxiv.org/abs/1907.11692.
- Sanh, Victor, Lysandre Debut, Julien Chaumond, and Thomas Wolf. “DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter.” arXiv, February 29, 2020. https://doi.org/10.48550/arXiv.1910.01108.

Online resources / tutorials:
- https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/bert#bert
