# Yet another practical introduction to BERT

### Raphaël Sourty

#### Toulouse School of Economics Master's degree

### What is the purpose of BERT? 🤖

BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream tasks.

### What is a downstream task? 🧐

A downstream task is a new task on which a pre-trained model is reused. For example, you could fine-tune BERT on question answering. The new task then benefits from BERT's pre-trained weights instead of starting from a random initialization.

### Why does BERT outperform other models? 🎯

BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP models.

`I made a bank deposit` 

In this example, a unidirectional (left-to-right) representation of the token `bank` is based only on `I made a`, not on `deposit`.

BERT represents `bank` using both its left and right context starting from the very bottom of a deep neural network, so it is deeply bidirectional.

The ELMo and OpenAI GPT models are other high-performance models that provide contextual latent representations. ELMo concatenates independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. OpenAI GPT uses a left-to-right Transformer.

### How is BERT trained? 🔧

BERT is pre-trained on **two unsupervised tasks**.

#### Task 1: Masked LM

**15%** of the tokens in the input are selected at random and hidden. The model must then predict the hidden tokens from their context.

```python
input = 'the man went to the [MASK1]. he bought a [MASK2] of milk'
label = {'[MASK1]': 'store', '[MASK2]': 'gallon'}
```


Tricks used in the Masked LM task 😎. Each selected token is handled in one of three ways:

- 80% of the time: replace the word with the `[MASK]` token, e.g., `my dog is hairy` → `my dog is [MASK]`

- 10% of the time: replace the word with a random word, e.g., `my dog is hairy` → `my dog is apple`

- 10% of the time: keep the word unchanged, e.g., `my dog is hairy` → `my dog is hairy`.

The model is forced to keep a distributional contextual representation of every input token because the Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words.
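The 80/10/10 rule above can be sketched in a few lines of plain Python. This is only an illustration: the function name `mask_tokens` and the toy vocabulary are hypothetical, each token is selected independently with the given probability rather than picking exactly 15% of the positions, and special tokens are ignored.

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) following the 80/10/10 rule."""
    rng = random.Random(seed)
    corrupted, labels = [], {}
    for position, token in enumerate(tokens):
        if rng.random() >= select_prob:
            corrupted.append(token)  # not selected: kept as-is, nothing to predict
            continue
        labels[position] = token  # the model must recover the original token
        draw = rng.random()
        if draw < 0.8:
            corrupted.append('[MASK]')           # 80%: mask the token
        elif draw < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: random replacement
        else:
            corrupted.append(token)              # 10%: leave unchanged
    return corrupted, labels

tokens = 'my dog is hairy and he likes to play'.split()
vocab = ['apple', 'store', 'milk', 'bird']  # toy vocabulary (assumption)
corrupted, labels = mask_tokens(tokens, vocab, select_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

`labels` maps each selected position to its original token, which is what the masked LM loss is computed on. The other positions are left untouched and do not contribute to that loss.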


#### Task 2: Next sentence prediction

This task is designed to provide BERT an understanding of the relationship between two sentences. This information is not captured by the masked language model task. When choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext).

```python

input = '[CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]'
label = 'IsNext'

input = '[CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]'
label = 'NotNext'

```

To help the model distinguish between the two sentences in training, the input is processed in the following way before entering the model:

- **[SEP]**: Separator between sentences.
- **[CLS]**: Token dedicated to classification tasks.

**The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood.**
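As a rough sketch of how such sentence pairs could be produced, the snippet below builds one example at a time with the `[CLS]` and `[SEP]` tokens in place. The function name `make_nsp_example` and the toy sentences are assumptions made for illustration. A real pipeline would also make sure that the random sentence comes from a different document, and would apply masking afterwards.

```python
import random

def make_nsp_example(sentences, index, corpus, rng):
    """Build one next sentence prediction example starting at sentences[index]."""
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        # Half of the time, B is the sentence that really follows A.
        sentence_b, label = sentences[index + 1], 'IsNext'
    else:
        # Otherwise, B is drawn at random from the corpus.
        sentence_b, label = rng.choice(corpus), 'NotNext'
    return f'[CLS] {sentence_a} [SEP] {sentence_b} [SEP]', label

document = ['the man went to the store', 'he bought a gallon of milk']
corpus = ['penguins are flightless birds', 'the sky was clear that night']
rng = random.Random(42)

examples = [make_nsp_example(document, 0, corpus, rng) for _ in range(5)]
for text, label in examples:
    print(label, '|', text)
```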


### Which corpora were used to train BERT?

- BooksCorpus (800M words) 📖

- English Wikipedia (2,500M words) 📚

It is critical to use a document-level corpus rather than a shuffled sentence-level corpus in order to extract long contiguous sequences.

### What is fine-tuning?

Fine-tuning consists of continuing to train the pre-trained BERT model on a new task in order to specialize it. Fine-tuning BERT is relatively inexpensive compared with pre-training. Additional layers must be added on top of BERT, for example to turn the model into a classifier or a regressor.

### How to pre-process text to feed BERT?

BERT uses WordPiece tokenization. The vocabulary is initialized with all the individual characters in the language, and then the most frequent/likely combinations of the existing symbols in the vocabulary are iteratively added.

### How does BERT handle OOV words?

Any word that does not occur in the vocabulary is broken down into sub-words greedily. 

For example, if `play`, `##ing`, and `##ed` are present in the vocabulary but `playing` and `played` are OOV words, then they will be broken down into `play` + `##ing` and `play` + `##ed` respectively.

The `##` prefix marks a sub-word that continues the previous piece.
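The greedy split can be sketched as a longest-match-first search over the vocabulary. The function below is a simplified illustration for a single word, with a hypothetical name and a toy vocabulary. It is not the tokenizer shipped with BERT, which also handles lower-casing, punctuation and a maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk_token='[UNK]'):
    """Greedy longest-match-first split of a single word into sub-words."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation of the word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no sub-word matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

vocab = {'play', '##ing', '##ed', 'the'}  # toy vocabulary (assumption)
print(wordpiece_tokenize('playing', vocab))  # ['play', '##ing']
print(wordpiece_tokenize('played', vocab))   # ['play', '##ed']
print(wordpiece_tokenize('xyz', vocab))      # ['[UNK]']
```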

### Number of BERT parameters


#### BERT-base

- Number of transformer blocks: 12
- Hidden size: 768
- Number of self-attention heads: 12
- Total parameters: 110M

#### 🤯 BERT-large

- Number of transformer blocks: 24
- Hidden size: 1024
- Number of self-attention heads: 16
- Total parameters: 340M
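These totals can be approximately recovered from the sizes listed above. The sketch below counts weights and biases layer by layer. It relies on values that are not stated in this notebook and should be read as assumptions: a vocabulary of 30,522 WordPiece tokens, 512 positions, 2 segment types, a feed-forward layer 4 times wider than the hidden size, and a final pooling layer.

```python
def count_bert_parameters(layers, hidden, vocab_size=30522,
                          max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT encoder (weights and biases)."""
    intermediate = 4 * hidden  # width of the feed-forward layer (assumption)
    layer_norm = 2 * hidden    # scale and bias of one layer normalization

    # Token, position and segment embeddings, followed by a layer norm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, followed by a layer norm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> intermediate -> hidden) and a layer norm.
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden + layer_norm)
    # Dense layer applied to the [CLS] representation.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler

print(f'BERT-base : {count_bert_parameters(12, 768) / 1e6:.1f}M')
print(f'BERT-large: {count_bert_parameters(24, 1024) / 1e6:.1f}M')
```

Under these assumptions the count lands at roughly 109M and 335M, close to the rounded 110M and 340M figures. The number of attention heads does not appear in the formula because the heads split the hidden size without adding parameters.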


<img src="curve_parameters_models.png" alt="drawing" width="700"/>

The performance of language models depends on, among other things:

- the quality of the training data
- the volume of training data
- the quality of the model architecture
- the number of model parameters
    
   
Some interesting lines of work aim to compress the knowledge of language models. The objective of this area of research is to build lighter models, with fewer parameters, while losing as little accuracy as possible. **DistilBERT** is one of these models: it aims to reproduce BERT's results with far fewer parameters than its predecessor.

## Knowledge distillation

Knowledge distillation is a compression method. It consists of transferring knowledge from a large model (the teacher) to a smaller one (the student). 👩‍🏫

Neural networks typically produce class probabilities by using a “softmax” output layer that converts the logit, $z_i$, computed for each class into a probability, $q_i$, by comparing $z_i$ with the other logits. $T$ is a temperature, normally set to 1. A higher value of $T$ produces a softer probability distribution over classes. `Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).`

$$q_{i}=\frac{\exp \left(z_{i} / T\right)}{\sum_{j} \exp \left(z_{j} / T\right)}$$

$$L_s = (1 - \alpha) \cdot KL(q^{t} \| q^{s}) + \alpha \cdot \mathcal{H}_{s}$$

<p style="text-align: center;"> 
    $\mathcal{H}_{s}$ is the student's own training loss, such as the cross-entropy with the true labels.
</p> 

<p style="text-align: center;"> 
    $q^{t}$ is the probability distribution output by the teacher.
</p> 
 
<p style="text-align: center;"> 
    $q^{s}$ is the probability distribution output by the student.
</p> 

<p style="text-align: center;"> 
    $\alpha \in [0, 1]$ weights the student's own loss against the imitation of the teacher.
</p> 

<p style="text-align: center;"> 
    $KL$ denotes the Kullback-Leibler divergence, $KL(P \| Q)=\sum_{i} P(i) \log \frac{P(i)}{Q(i)}$.
</p> 
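To make the formulas above concrete, here is a minimal pure-Python sketch of the temperature softmax and of the student loss for a single example. The function names, the default values of `alpha` and `temperature`, and the choice of evaluating the cross-entropy at $T = 1$ are assumptions made for illustration. They are not taken from the papers cited here.

```python
import math

def softmax(logits, temperature=1.0):
    """q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtracting the max avoids overflow
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(P || Q) = sum_i P(i) * log(P(i) / Q(i))."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def distillation_loss(teacher_logits, student_logits, target,
                      alpha=0.5, temperature=2.0):
    """L_s = (1 - alpha) * KL(q_t || q_s) + alpha * H_s for one example."""
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    soft_loss = kl_divergence(q_teacher, q_student)
    # Cross-entropy of the student with the true class, computed at T = 1.
    hard_loss = -math.log(softmax(student_logits)[target])
    return (1 - alpha) * soft_loss + alpha * hard_loss

teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.5, 1.5, 0.2]
print(softmax(teacher_logits, temperature=1.0))
print(softmax(teacher_logits, temperature=5.0))  # flatter distribution
print(distillation_loss(teacher_logits, student_logits, target=0))
```

Raising the temperature flattens the teacher's distribution, so the student also sees how the teacher ranks the classes it considers unlikely.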

Tips:

- By making the coefficient $\alpha$ evolve linearly as the training progresses, the student can overtake the teacher.

---

The HuggingFace organization used knowledge distillation to concentrate BERT's knowledge into a much smaller model.

#### Concretely, what are the results obtained?

$$
\begin{array}{lcc}
\hline \text { Model } & \begin{array}{c}
\text { IMDb } \\
\text { (acc.) }
\end{array} & \begin{array}{c}
\text { SQuAD } \\
\text { (EM/F1) }
\end{array} \\
\hline \text { BERT-base } & 93.46 & 81.2 / 88.5 \\
\text { DistilBERT } & 92.82 & 77.7 / 85.8 \\
\text { DistilBERT (D) } & - & 79.1 / 86.9 \\
\hline
\end{array}
$$

$$
\begin{array}{lcc}
\hline \text { Model } & \begin{array}{c}
\# \text { param. } \\
\text { (Millions) }
\end{array} & \begin{array}{c}
\text { Inf. time } \\
\text { (seconds) }
\end{array} \\
\hline \text { ELMo } & 180 & 895 \\
\text { BERT-base } & 110 & 668 \\
\text { DistilBERT } & 66 & 410 \\
\hline
\end{array}
$$

DistilBERT yields comparable performance on downstream tasks while being smaller and faster than BERT-base.

**IMDb**: Binary sentiment classification dataset.

**SQuAD**: The Stanford Question Answering Dataset.

`Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).`
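A quick way to read the two tables is to express DistilBERT relative to BERT-base. The short computation below only reuses the figures reported above. The variable names are arbitrary.

```python
# Figures copied from the two tables above.
bert_base  = {'imdb_acc': 93.46, 'params_m': 110, 'inference_s': 668}
distilbert = {'imdb_acc': 92.82, 'params_m': 66,  'inference_s': 410}

size_ratio   = distilbert['params_m'] / bert_base['params_m']
speedup      = bert_base['inference_s'] / distilbert['inference_s']
acc_retained = distilbert['imdb_acc'] / bert_base['imdb_acc']

print(f'Parameters kept: {size_ratio:.0%}')
print(f'Inference speed-up: {speedup:.2f}x')
print(f'IMDb accuracy retained: {acc_retained:.1%}')
```

On these numbers, DistilBERT keeps about 60% of the parameters, runs roughly 1.6 times faster, and retains about 99.3% of BERT-base's IMDb accuracy.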

-----

## Toxic Comment Classification Challenge

#### Identify and classify toxic online comments


In [None]:
device = 'cpu'

In [None]:
import transformers
from transformers import DistilBertModel
from transformers import DistilBertTokenizer

In [None]:
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

In [None]:
from creme import metrics
from creme import stats

In [None]:
from sklearn import model_selection

In [None]:
import zipfile
import torch
import pandas as pd
import re
import collections
import sklearn.metrics
import tqdm

#### Load data:

In [None]:
labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [None]:
zf = zipfile.ZipFile('./jigsaw-toxic-comment-classification-challenge.zip')

In [None]:
train = pd.read_csv(zf.open('train.csv.zip'), compression = 'zip')

In [None]:
train.head()

In [None]:
X_test  = pd.read_csv(zf.open('test.csv.zip'), compression = 'zip')

In [None]:
X_test.head()

In [None]:
sample_submission = pd.read_csv(zf.open('sample_submission.csv.zip'), compression = 'zip')

In [None]:
sample_submission.head()

In [None]:
test_labels = pd.read_csv(zf.open('test_labels.csv.zip'), compression = 'zip')

#### Pre-process train and test set

In [None]:
train['comment_text'] = train['comment_text'].map(lambda x: re.sub(r'\W+', ' ', x))
X_test['comment_text'] = X_test['comment_text'].map(lambda x: re.sub(r'\W+', ' ', x))
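To see what this cleaning step does, the small example below applies the same regular expression to a made-up comment (the sample string is an assumption, not taken from the dataset). `\W+` matches any run of characters that are neither letters, digits nor underscores, and each run is replaced by a single space.

```python
import re

def clean(text):
    """Replace every run of non-word characters with a single space."""
    return re.sub(r'\W+', ' ', text)

sample = "You're WRONG!!! See http://example.com :-)"
print(repr(clean(sample)))  # 'You re WRONG See http example com '
```

Note that punctuation, apostrophes included, is removed and that a trailing space may remain. Casing is left untouched.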

#### Create validation set

In [None]:
X_train, X_valid, y_train, y_valid = model_selection.train_test_split(
    train, 
    train[labels], 
    test_size=0.20, 
    random_state=42
)

X_train.reset_index(drop=True, inplace=True)
X_valid.reset_index(drop=True, inplace=True)

y_train.reset_index(drop=True, inplace=True)
y_valid.reset_index(drop=True, inplace=True)

y_train = torch.tensor(y_train.values, dtype = torch.float)
y_valid = torch.tensor(y_valid.values, dtype = torch.float)

### Custom dataset

In [None]:
class CustomDataset(Dataset):
    
    def __init__(self, X, tokenizer, max_len, y):
        self.len = len(X)
        self.X = X
        self.y = y
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __getitem__(self, index):
        
        comment_text = str(self.X['comment_text'][index])
        
        inputs = self.tokenizer.encode_plus(
            comment_text,
            text_pair             = None,
            add_special_tokens    = True,
            truncation            = True,
            max_length            = self.max_len,
            padding               = 'max_length',
            return_token_type_ids = True,
        )
        
        ids  = inputs['input_ids']
        mask = inputs['attention_mask']

        return {
            'index': torch.tensor(ids, dtype=torch.long),
            'mask' : torch.tensor(mask, dtype=torch.long),
            'y'    : self.y[index]
        } 
    
    def __len__(self):
        return self.len

In [None]:
# The tokenizer must match the checkpoint loaded by the model (uncased).
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
max_len   = tokenizer.max_model_input_sizes['distilbert-base-uncased']

In [None]:
train_dataset = CustomDataset(
    X = X_train, 
    y = y_train,
    tokenizer = tokenizer, 
    max_len = max_len,
)

valid_dataset = CustomDataset(
    X = X_valid, 
    y = y_valid,
    tokenizer = tokenizer, 
    max_len = max_len,
)

test_dataset = CustomDataset(
    X = X_test, 
    y = torch.tensor(test_labels[labels].values),
    tokenizer = tokenizer, 
    max_len = max_len,
)

In [None]:
batch_size      = 1
test_batch_size = 1

params = {
    'batch_size' : batch_size,
    'shuffle'    : True,
    'num_workers': 0,
}

params_test = {
    'batch_size' : test_batch_size,
    'shuffle'    : False,
    'num_workers': 0,
}


train_loader = DataLoader(
    train_dataset, 
    **params
)

valid_loader = DataLoader(
    valid_dataset, 
    **params_test
)

test_loader = DataLoader(
    test_dataset, 
    **params_test
)

In [None]:
class DistillBert(torch.nn.Module):
    
    def __init__(self):
        super(DistillBert, self).__init__()
        self.l1 = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.l2 = torch.nn.Dropout(0.3)
        self.l3 = torch.nn.Linear(768, 6)

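    def forward(self, input_ids, attention_mask):
        output = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        # output[0] is the last hidden state: (batch size, sequence length, 768).
        hidden_state = output[0]
        # Keep the representation of the first token ([CLS]) as the comment summary.
        pooler = hidden_state[:, 0]
        pooler = self.l2(pooler)
        # One logit per label; the sigmoid is applied inside BCEWithLogitsLoss.
        pooler = self.l3(pooler)
        return pooler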

In [None]:
model = DistillBert()
model = model.to(device)

#### Define loss function

In [None]:
loss_function = torch.nn.BCEWithLogitsLoss()

optimizer = torch.optim.Adam(
    params = model.parameters(), 
    lr     = 1e-04,
)
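`BCEWithLogitsLoss` treats each of the six labels as an independent binary problem and works directly on the raw logits. The pure-Python sketch below reproduces the computation for one comment, using the usual numerically stable formulation. It is meant as an illustration of what the loss measures, not as the PyTorch implementation, and the function name is hypothetical.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    losses = [
        # Stable form of -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))].
        max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        for z, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)

# One comment, six labels: toxic, severe_toxic, obscene, threat, insult, identity_hate.
logits  = [2.0, -3.0, 1.0, -4.0, 0.5, -2.0]
targets = [1.0,  0.0, 1.0,  0.0, 0.0,  0.0]
print(bce_with_logits(logits, targets))
```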

#### Define metric

In [None]:
class RollingF1:
    
    def __init__(self, k):
        self.score = stats.RollingMean(k)
        
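    def update(self, y_true, y_pred):

        # The model outputs logits: map them to probabilities before thresholding.
        y_true = y_true.cpu().detach().numpy().astype(int)
        y_pred = torch.sigmoid(y_pred.cpu().detach())
        
        # Binarize at 0.5 without modifying the tensor in place.
        y_pred = (y_pred >= 0.5).int().numpy()
        
        for i, pred in enumerate(y_pred):

            # One F1 score per comment, computed over its six labels.
            self.score.update(
                sklearn.metrics.f1_score(y_true = y_true[i], y_pred = pred, zero_division=1)
            )

        return self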

    def get(self):
        return self.score.get()
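The metric above averages one F1 score per comment, each computed over the six labels of that comment. The sketch below shows the computation for a single comment in plain Python, assuming the model outputs logits (as it does when trained with `BCEWithLogitsLoss`). The function name is hypothetical, and returning 1 when there is nothing to find and nothing predicted is meant to mirror `zero_division=1`.

```python
import math

def sample_f1(y_true, logits, threshold=0.5):
    """F1 score of one comment over its labels, from raw logits."""
    # Sigmoid turns each logit into a probability, then we binarize.
    y_pred = [1 if 1 / (1 + math.exp(-z)) >= threshold else 0 for z in logits]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fp + fn == 0:
        return 1.0  # no positive label and none predicted
    return 2 * tp / (2 * tp + fp + fn)

# Two true labels, only the first one is predicted: F1 = 2 / 3.
print(sample_f1([1, 0, 0, 0, 1, 0], [2.0, -3.0, -1.0, -2.0, -0.5, -4.0]))
# A clean comment correctly predicted as clean: F1 = 1.
print(sample_f1([0, 0, 0, 0, 0, 0], [-1.0] * 6))
```

Since most comments in this kind of dataset carry no toxic label at all, the convention chosen for the empty case has a large influence on the averaged score.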

#### Train DistilBERT:

In [None]:
epochs = 1

model = model.train()

for epoch in range(epochs):

    bar = tqdm.tqdm(enumerate(train_loader), position = 0, total = len(train_loader))

    metric_loss = stats.RollingMean(len(train_loader))

    metric_f1 = RollingF1(len(train_loader))

    for step, data in bar:
        
        index = data['index'].to(device)
        mask  = data['mask'].to(device)
        y     = data['y'].to(device)
        
        y_pred = model(index, mask)
        
        loss = loss_function(y_pred, target = y)
        
        loss.backward()
        
        optimizer.step()
        
        optimizer.zero_grad()
        
        with torch.no_grad():

            metric_f1.update(y_pred = y_pred, y_true = y)
            
        metric_loss.update(loss.item())
        
        if step % 3 == 0:
        
            bar.set_description(
                f'Epoch: {epoch + 1}, Loss: {metric_loss.get():6f}, F1: {metric_f1.get():6f}'
            )
    
    # Evaluate on the validation set at the end of each epoch.
    model = model.eval()
    
    with torch.no_grad():

        metric_f1 = RollingF1(len(valid_loader))
        
        for data in valid_loader:

            index = data['index'].to(device)
            mask  = data['mask'].to(device)
            y     = data['y'].to(device)

            y_pred = model(index, mask)

            metric_f1.update(y_pred = y_pred, y_true = y)

        print(f'\n Epoch: {epoch + 1}, Valid - F1: {metric_f1.get():6f}\n')
        
    model = model.train()

#### Make prediction:

In [None]:
y_pred = []

model = model.eval()

bar = tqdm.tqdm(test_loader, position = 0, total = len(test_loader))

with torch.no_grad():
    
    for data in bar:

        index = data['index'].to(device)
        mask  = data['mask'].to(device)

        # Convert logits to probabilities, as expected in the submission file.
        y_pred.append(torch.sigmoid(model(index, mask)).cpu())

y_pred = pd.DataFrame(torch.cat(y_pred).numpy())
y_pred.columns = labels

In [None]:
submission = pd.concat([sample_submission['id'], y_pred], axis = 'columns', sort = False)

In [None]:
submission.sample()

In [None]:
submission.to_csv('./../data/submission.csv', index = False)

## What about the energy consumption of these models?

<img src="energy.png" alt="drawing" width="500"/>

**To give us an idea...** ✈️

**100 lbs ≈ 45 kg**

<img src="bert_energy.png" alt="drawing" width="750"/>


NAS: neural architecture search for machine translation and language modeling. According to Strubell et al., the full architecture search ran for a total of 979M training steps, and the NAS base model requires 10 hours to train for 300k steps on one TPUv2 core. This equates to 32,623 hours of TPU or 274,120 hours on 8 P100 GPUs.

`Strubell, Emma, Ananya Ganesh, and Andrew McCallum. "Energy and policy considerations for deep learning in NLP." arXiv preprint arXiv:1906.02243 (2019).`

#### TIPS:

- How to speed up BERT in production?

- How to effectively fine-tune BERT?

#### References:

---

Understanding LSTM Networks

url: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

---

Publication introducing BERT.

title: "Bert: Pre-training of deep bidirectional transformers for language understanding"

author: Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina

year: 2018

url: https://arxiv.org/abs/1810.04805

---

The Illustrated Transformer.

url: http://jalammar.github.io/illustrated-transformer/

---

Transformers from scratch.

url: http://peterbloem.nl/blog/transformers

---

Huggingface, transformers library:

url: https://github.com/huggingface/transformers

---

Source code of Bert.

url: https://github.com/google-research/bert

---

Bert explained.

url: https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270

---

Energy consumption of deep neural network models:

title: "Energy and policy considerations for deep learning in NLP."

author: Strubell, Emma, Ananya Ganesh, and Andrew McCallum.

year: 2019

url: https://arxiv.org/abs/1906.02243

---

BERT Explained – A list of Frequently Asked Questions

url: https://yashuseth.blog/2019/06/12/bert-explained-faqs-understand-bert-working/

---

Wordpiece tokenizer:

title: "Japanese and korean voice search."

author: Schuster, Mike, and Kaisuke Nakajima.

year: 2012 

url: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf

---

title: "Distilling the knowledge in a neural network."

author: Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 

year: 2015

url: https://arxiv.org/abs/1503.02531

---

title: "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter."

author: Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas

year: 2019

url: https://arxiv.org/abs/1910.01108


---