# BERT Pre-training and Fine-tuning

In this notebook, you will learn how to implement the BERT model for pre-training and how to fine-tune a pre-trained BERT model for sentence pair classification.


## Preparation

First, let's import necessary modules.

In [3]:
import io, random, math

import numpy as np
import mxnet as mx
from mxnet import gluon, nd
import gluonnlp as nlp
# `model` is needed later for model.classification.BERTClassifier
from bert import data, model

# utils.py includes some Blocks defined in the previous transformer notebook
from utils import PositionalEncoding, MultiHeadAttention, AddNorm, PositionWiseFFN, EncoderBlock

### Encoder

Unlike the Transformer encoder, the BERT encoder has an additional embedding for segment information, which is added to the word embedding before the positional encoding is applied.

![segment embedding](bert-embed.png)

In [4]:
class BERTEncoder(gluon.nn.Block):
    def __init__(self, vocab_size, units, hidden_size,
                 num_heads, num_layers, dropout, **kwargs):
        super(BERTEncoder, self).__init__(**kwargs)
        self.segment_embed = gluon.nn.Embedding(2, units)
        self.word_embed = gluon.nn.Embedding(vocab_size, units)
        self.pos_encoding = PositionalEncoding(units, dropout)
        self.blks = gluon.nn.Sequential()
        for i in range(num_layers):
            self.blks.add(EncoderBlock(units, hidden_size, num_heads, dropout))

    def forward(self, words, segments, mask, *args):
        X = self.word_embed(words) + self.segment_embed(segments)
        X = self.pos_encoding(X)
        for blk in self.blks:
            X = blk(X, mask)
        return X

In [36]:
encoder = BERTEncoder(vocab_size=30000, units=768, hidden_size=3072,
                      num_heads=12, num_layers=12, dropout=0.1)
encoder.initialize()
words = nd.random.randint(low=0, high=30000, shape=(2,8))
segments = nd.array([[0,0,0,0,1,1,1,1],[0,0,0,1,1,1,1,1]])
encodings = encoder(words, segments, None)
print(encodings.shape)

(2, 8, 768)


### Masked Language Model Decoder

Masked language modeling is one of the two pre-training tasks: random positions in the input are masked, and the model has to reconstruct the masked words.
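
As a rough illustration of the masking step, the sketch below picks a fraction of positions at random and replaces them with a `[MASK]` token, keeping the original tokens as reconstruction targets. The helper name `mask_tokens`, the 15% rate, and the always-replace policy are assumptions made for this example; the original BERT recipe additionally leaves some selected tokens unchanged or swaps them for random words.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token='[MASK]', seed=0):
    """Replace a random subset of tokens with mask_token.

    Returns the masked sequence, the masked positions, and the original
    tokens at those positions (the reconstruction targets).
    """
    rng = random.Random(seed)
    # Mask at least one position so every sequence yields a training signal.
    num_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    masked = list(tokens)
    targets = []
    for pos in positions:
        targets.append(masked[pos])
        masked[pos] = mask_token
    return masked, positions, targets

tokens = 'the quick brown fox jumps over the lazy dog'.split()
masked, positions, targets = mask_tokens(tokens)
print(masked)
print(positions, targets)
```

The returned positions play the role of the masked positions that the decoder later gathers from the encoder output.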

In the masked language model decoder, we first use `gather_nd` to pick the dense vectors representing the words at the masked positions. A feed-forward network is then applied to them, followed by a fully-connected layer that predicts the unnormalized scores for all words in the vocabulary.
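
The index layout that `gather_nd` expects is easy to get wrong. For indices of shape `(2, N)` applied to a `(batch, length, units)` array, the first row selects the batch entry and the second row selects the token position. The following NumPy sketch reproduces that selection with fancy indexing; NumPy stands in for MXNet here, and the toy sizes are assumptions chosen for readability.

```python
import numpy as np

batch_size, seq_len, units = 2, 8, 4
# Toy encoder output: one vector of size `units` per token.
X = np.arange(batch_size * seq_len * units, dtype=np.float32)
X = X.reshape(batch_size, seq_len, units)

# Row 0 holds batch indices, row 1 holds token positions, so column i
# addresses the vector X[batch[i], position[i]].
masked_positions = np.array([[0, 1],
                             [4, 7]])

picked = X[masked_positions[0], masked_positions[1]]
print(picked.shape)  # (2, 4)
```

Each row of `picked` is the encoding of one masked token, which is why the decoder output has one row of vocabulary scores per masked position.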

In [26]:
class MLMDecoder(gluon.nn.Block):
    def __init__(self, vocab_size, units, **kwargs):
        super(MLMDecoder, self).__init__(**kwargs)
        self.decoder = gluon.nn.Sequential()
        self.decoder.add(gluon.nn.Dense(units, flatten=False))
        self.decoder.add(gluon.nn.GELU())
        self.decoder.add(gluon.nn.LayerNorm())
        self.decoder.add(gluon.nn.Dense(vocab_size, flatten=False))

    def forward(self, X, masked_positions, *args):
        X = nd.gather_nd(X, masked_positions)
        pred = self.decoder(X)
        return pred

In [40]:
decoder = MLMDecoder(vocab_size=30000, units=768)
decoder.initialize()

# Row 0: batch indices; row 1: token positions (valid range is 0..7 for length 8)
masked_positions = nd.array([[0,1],[4,7]])
mlm_pred = decoder(encodings, masked_positions)
print(mlm_pred.shape)

(2, 30000)


### Next Sentence Classifier

Next sentence prediction is the second pre-training task: given a pair of sentences, the model predicts whether the second one follows the first in the original text. The classifier takes the encoding of the first token (`[CLS]`) and passes it through a `tanh` dense layer followed by a two-class output layer.

In [41]:
class NSClassifier(gluon.nn.Block):
    def __init__(self, vocab_size, units, **kwargs):
        super(NSClassifier, self).__init__(**kwargs)
        self.classifier = gluon.nn.Sequential()
        self.classifier.add(
            gluon.nn.Dense(units=units, flatten=False, activation='tanh'))
        self.classifier.add(
            gluon.nn.Dense(units=2, flatten=False))

    def forward(self, X, *args):
        X = X[:, 0, :]
        pred = self.classifier(X)
        return pred

In [42]:
ns_classifier = NSClassifier(vocab_size=30000, units=768)
ns_classifier.initialize()

pred = ns_classifier(encodings)
print(pred.shape)

(2, 2)
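
Training data for next sentence prediction pairs each sentence either with its true successor or with a randomly chosen sentence. The sketch below builds such pairs from a single list of sentences. The helper `make_nsp_pairs`, the 50/50 split, and the convention that label 1 means "is next" are assumptions for illustration; a real pipeline would typically draw negatives from other documents.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) examples from one document."""
    rng = random.Random(seed)
    examples = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: B really follows A.
            examples.append((sentences[i], sentences[i + 1], 1))
        else:
            # Negative example: B is neither A itself nor its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            examples.append((sentences[i], rng.choice(candidates), 0))
    return examples

sentences = ['s0', 's1', 's2', 's3']
examples = make_nsp_pairs(sentences)
for example in examples:
    print(example)
```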


## Using the pre-trained BERT model

The list of pre-trained BERT models available in GluonNLP can be found
[here](../../model_zoo/bert/index.rst).

In this tutorial, we use the BERT BASE model from the GluonNLP model zoo,
trained on an uncased corpus of books and the English Wikipedia dataset.

### Get BERT

Let's first take a look at the BERT model architecture for sentence pair classification below:
<div style="width: 500px;">![bert-sentence-pair](bert-sentence-pair.png)</div>
where the model takes a pair of sequences and pools the representation of the
first token in the sequence.
Note that the original BERT model was trained on the masked language modeling and
next-sentence prediction tasks, so it includes layers for language model decoding and
classification. These layers are not used when fine-tuning for sentence pair classification.

We can load the pre-trained BERT model using the model API in GluonNLP, which returns
the vocabulary along with the model. The pooler layer of the pre-trained model is kept
(`use_pooler` defaults to `True`), while `use_decoder=False` and `use_classifier=False`
drop the pre-training layers mentioned above.

In [9]:
ctx = mx.gpu(0) if mx.test_utils.list_gpus() else mx.cpu()
bert_base, vocabulary = nlp.model.get_model('bert_12_768_12',
                                             dataset_name='book_corpus_wiki_en_uncased',
                                             pretrained=True, ctx=ctx,
                                             use_decoder=False, use_classifier=False)
print(bert_base)

BERTModel(
  (encoder): BERTEncoder(
    (layer_norm): BERTLayerNorm(scale=True, center=True, eps=1e-12, axis=-1, in_channels=768)
    (transformer_cells): HybridSequential(
      (0): BERTEncoderCell(
        (layer_norm): BERTLayerNorm(scale=True, center=True, eps=1e-12, axis=-1, in_channels=768)
        (ffn): BERTPositionwiseFFN(
          (layer_norm): BERTLayerNorm(scale=True, center=True, eps=1e-12, axis=-1, in_channels=768)
          (activation): GELU()
          (ffn_2): Dense(3072 -> 768, linear)
          (ffn_1): Dense(768 -> 3072, linear)
          (dropout_layer): Dropout(p = 0.1, axes=())
        )
        (attention_cell): MultiHeadAttentionCell(
          (_base_cell): DotProductAttentionCell(
            (_dropout_layer): Dropout(p = 0.1, axes=())
          )
          (proj_key): Dense(768 -> 768, linear)
          (proj_value): Dense(768 -> 768, linear)
          (proj_query): Dense(768 -> 768, linear)
        )
        (proj): Dense(768 -> 768, linear)
        (dr

### Transform the model for `SentencePair` classification

Now that we have loaded the BERT model, we only need to attach an additional layer for
classification. The `BERTClassifier` class uses a BERT base model to encode the sentence
representation, followed by a `nn.Dense` layer for classification.

In [4]:
bert_classifier = model.classification.BERTClassifier(bert_base, num_classes=2, dropout=0.1)
# only need to initialize the classifier layer.
bert_classifier.classifier.initialize(init=mx.init.Normal(0.02), ctx=ctx)
bert_classifier.hybridize(static_alloc=True)

# softmax cross entropy loss for classification
loss_function = mx.gluon.loss.SoftmaxCELoss()
loss_function.hybridize(static_alloc=True)

metric = mx.metric.Accuracy()
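
To make the loss and metric concrete, here is a NumPy sketch of what softmax cross-entropy and accuracy compute for a batch of two-class scores. It is an illustration with made-up logits, not the MXNet implementation, and the helper name `softmax_cross_entropy` is hypothetical.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Per-sample softmax cross-entropy for integer class labels."""
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the correct class for each sample.
    return -log_probs[np.arange(len(labels)), labels]

logits = np.array([[2.0, 0.5], [0.1, 1.2]])
labels = np.array([0, 0])
loss = softmax_cross_entropy(logits, labels)
predictions = logits.argmax(axis=1)
accuracy = (predictions == labels).mean()
print(loss, accuracy)
```

The first sample is classified correctly and gets a small loss; the second is misclassified and gets a larger one, so the accuracy on this toy batch is 0.5.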

## Data preprocessing for BERT

For this tutorial, we need to do a bit of preprocessing before feeding our data into
the BERT model. Here we want to leverage the dataset included in the downloaded archive at the
beginning of this tutorial.

### Loading the dataset

We use the dev set of the Microsoft Research Paraphrase Corpus dataset. The file is
named 'dev.tsv'. Let's take a look at the first few lines of the raw dataset.

In [5]:
# Open the file in a `with` block so that it is closed after reading
with io.open('dev.tsv', encoding='utf-8') as tsv_file:
    for i in range(5):
        print(tsv_file.readline())

﻿Quality	#1 ID	#2 ID	#1 String	#2 String

1	1355540	1355592	He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .	" The foodservice pie business does not fit our long-term growth strategy .

0	2029631	2029565	Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war .	His wife said he was " 100 percent behind George Bush " and looked forward to using his years of training in the war .

0	487993	487952	The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat .	The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent .

1	1989515	1989458	The AFL-CIO is waiting until October to decide if it will endorse a candidate .	The AFL-CIO announced Wednesday that it will decide in October whether to endorse a candidate before the primaries .



The file contains 5 columns, separated by tabs.
The header of the file names each of these columns; an explanation of each is included here:

0. The label indicating whether the two sentences are semantically equivalent
1. The id of the first sentence in this sample
2. The id of the second sentence in this sample
3. The content of the first sentence
4. The content of the second sentence

For our task, we are interested in the 0th, 3rd and 4th columns.
To load this dataset, we can use the `TSVDataset` API and skip the first line because
it's just the schema:

In [6]:
# Skip the first line, which is the schema
num_discard_samples = 1
# Split fields by tabs
field_separator = nlp.data.Splitter('\t')
# Fields to select from the file
field_indices = [3, 4, 0]
data_train_raw = nlp.data.TSVDataset(filename='dev.tsv',
                                     field_separator=field_separator,
                                     num_discard_samples=num_discard_samples,
                                     field_indices=field_indices)
sample_id = 0
# Sentence A
print(data_train_raw[sample_id][0])
# Sentence B
print(data_train_raw[sample_id][1])
# 1 means equivalent, 0 means not equivalent
print(data_train_raw[sample_id][2])

He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .
" The foodservice pie business does not fit our long-term growth strategy .
1
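
For intuition, the three loading options used here (discarding the header, splitting on tabs, and selecting and reordering fields) can be mimicked with plain Python. This is only a sketch of the behaviour, not how `TSVDataset` is implemented; the helper `load_tsv` and the two toy rows are made up for illustration.

```python
import io

# Toy stand-in for dev.tsv: a header line followed by two made-up samples.
raw = (
    "Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n"
    "1\t101\t102\tA cat sat .\tA cat was sitting .\n"
    "0\t201\t202\tIt rained .\tThe sun shone .\n"
)

def load_tsv(handle, field_indices, num_discard_samples=0):
    """Read tab-separated lines, drop leading lines, keep selected fields."""
    samples = []
    for line_no, line in enumerate(handle):
        if line_no < num_discard_samples:
            continue
        fields = line.rstrip('\n').split('\t')
        # Reorder the fields as requested, e.g. [sentence A, sentence B, label].
        samples.append([fields[i] for i in field_indices])
    return samples

samples = load_tsv(io.StringIO(raw), field_indices=[3, 4, 0], num_discard_samples=1)
print(samples[0])  # ['A cat sat .', 'A cat was sitting .', '1']
```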


To use the pre-trained BERT model, we need to pre-process the data in the same
way it was trained. The following figure shows the input representation in BERT:
<div style="width: 500px;">![bert-embed](bert-embed.png)</div>

We will use `BERTDatasetTransform` to perform the following transformations:

- tokenize the input sequences
- insert [CLS] at the beginning
- insert [SEP] between sentence A and sentence B, and at the end
- generate segment ids to indicate whether a token belongs to the first sequence or the second sequence
- generate valid length
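
The steps listed above can be sketched in plain Python for already-tokenized sentences. This is a simplified illustration, not the `BERTDatasetTransform` code: the helper `build_pair_input` is hypothetical, it works on token strings rather than ids, it assigns segment id 0 to padding (an assumption), and it raises an error instead of truncating pairs that exceed `max_len`.

```python
def build_pair_input(tokens_a, tokens_b, max_len, pad_token='[PAD]'):
    """Assemble [CLS] A [SEP] B [SEP], segment ids, and the valid length."""
    tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # The valid length counts the real tokens, before any padding.
    valid_length = len(tokens)
    if valid_length > max_len:
        raise ValueError('pair is longer than max_len; truncate it first')
    padding = max_len - valid_length
    tokens = tokens + [pad_token] * padding
    segment_ids = segment_ids + [0] * padding
    return tokens, segment_ids, valid_length

tokens, segment_ids, valid_length = build_pair_input(
    ['he', 'said', 'hi'], ['he', 'greeted', 'us'], max_len=12)
print(tokens)
print(segment_ids)
print(valid_length)  # 9
```

The valid length lets the model ignore the padded tail of each sequence.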

In [7]:
# Use the vocabulary from pre-trained model for tokenization
bert_tokenizer = nlp.data.BERTTokenizer(vocabulary, lower=True)

# The maximum length of an input sequence
max_len = 128

# The labels for the two classes [(0 = not similar) or  (1 = similar)]
all_labels = ["0", "1"]

# whether to transform the data as sentence pairs.
# for single sentence classification, set pair=False
# for regression task, set class_labels=None
# for inference without label available, set has_label=False
pair = True
transform = data.transform.BERTDatasetTransform(bert_tokenizer, max_len,
                                                class_labels=all_labels,
                                                has_label=True,
                                                pad=True,
                                                pair=pair)
data_train = data_train_raw.transform(transform)

print('vocabulary used for tokenization = \n%s'%vocabulary)
print('%s token id = %s'%(vocabulary.padding_token, vocabulary[vocabulary.padding_token]))
print('%s token id = %s'%(vocabulary.cls_token, vocabulary[vocabulary.cls_token]))
print('%s token id = %s'%(vocabulary.sep_token, vocabulary[vocabulary.sep_token]))
print('token ids = \n%s'%data_train[sample_id][0])
print('valid length = \n%s'%data_train[sample_id][1])
print('segment ids = \n%s'%data_train[sample_id][2])
print('label = \n%s'%data_train[sample_id][3])

vocabulary used for tokenization = 
Vocab(size=30522, unk="[UNK]", reserved="['[CLS]', '[SEP]', '[MASK]', '[PAD]']")
[PAD] token id = 1
[CLS] token id = 2
[SEP] token id = 3
token ids = 
[    2  2002  2056  1996  9440  2121  7903  2063 11345  2449  2987  1005
  1056  4906  1996  2194  1005  1055  2146  1011  2744  3930  5656  1012
     3  1000  1996  9440  2121  7903  2063 11345  2449  2515  2025  4906
  2256  2146  1011  2744  3930  5656  1012     3     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1]
valid length = 
44
segment ids = 

## Fine-tuning the model

Now we have all the pieces to put together, and we can finally start fine-tuning the
model for a few epochs. For demonstration, we use a fixed learning rate and
skip the validation steps. For the optimizer, we use Adam, which is commonly used
to fine-tune BERT models.

In [8]:
# The hyperparameters
batch_size = 32
lr = 5e-6

# The FixedBucketSampler and the DataLoader for making the mini-batches
train_sampler = nlp.data.FixedBucketSampler(lengths=[int(item[1]) for item in data_train],
                                            batch_size=batch_size,
                                            shuffle=True)
bert_dataloader = mx.gluon.data.DataLoader(data_train, batch_sampler=train_sampler)

trainer = mx.gluon.Trainer(bert_classifier.collect_params(), 'adam',
                           {'learning_rate': lr, 'epsilon': 1e-9})

# Collect all differentiable parameters
# `grad_req == 'null'` indicates no gradients are calculated (e.g. constant parameters)
# The gradients for these params are clipped later
params = [p for p in bert_classifier.collect_params().values() if p.grad_req != 'null']
grad_clip = 1

# Training the model with only three epochs
log_interval = 4
num_epochs = 3
for epoch_id in range(num_epochs):
    metric.reset()
    step_loss = 0
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(bert_dataloader):
        # Move the batch to the target context (GPU if available, otherwise CPU)
        token_ids = token_ids.as_in_context(ctx)
        valid_length = valid_length.as_in_context(ctx)
        segment_ids = segment_ids.as_in_context(ctx)
        label = label.as_in_context(ctx)

        with mx.autograd.record():
            # Forward computation
            out = bert_classifier(token_ids, segment_ids, valid_length.astype('float32'))
            ls = loss_function(out, label).mean()

        # And backwards computation
        ls.backward()

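        # Gradient clipping: aggregate gradients across devices, then rescale
        # them so that their global norm does not exceed grad_clip
        trainer.allreduce_grads()
        nlp.utils.clip_grad_global_norm(params, grad_clip)
        # The loss is already averaged over the batch, so normalize by 1
        trainer.update(1)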

        step_loss += ls.asscalar()
        metric.update([label], [out])

        # Printing vital information
        if (batch_id + 1) % (log_interval) == 0:
            print('[Epoch {} Batch {}/{}] loss={:.4f}, lr={:.7f}, acc={:.3f}'
                         .format(epoch_id, batch_id + 1, len(bert_dataloader),
                                 step_loss / log_interval,
                                 trainer.learning_rate, metric.get()[1]))
            step_loss = 0

[Epoch 0 Batch 4/18] loss=0.7212, lr=0.0000050, acc=0.458
[Epoch 0 Batch 8/18] loss=0.7397, lr=0.0000050, acc=0.430
[Epoch 0 Batch 12/18] loss=0.7349, lr=0.0000050, acc=0.442
[Epoch 0 Batch 16/18] loss=0.7559, lr=0.0000050, acc=0.437
[Epoch 1 Batch 4/18] loss=0.8057, lr=0.0000050, acc=0.400
[Epoch 1 Batch 8/18] loss=0.7505, lr=0.0000050, acc=0.443
[Epoch 1 Batch 12/18] loss=0.6790, lr=0.0000050, acc=0.488
[Epoch 1 Batch 16/18] loss=0.6854, lr=0.0000050, acc=0.499
[Epoch 2 Batch 4/18] loss=0.6563, lr=0.0000050, acc=0.638
[Epoch 2 Batch 8/18] loss=0.5937, lr=0.0000050, acc=0.676
[Epoch 2 Batch 12/18] loss=0.5657, lr=0.0000050, acc=0.690
[Epoch 2 Batch 16/18] loss=0.6428, lr=0.0000050, acc=0.674
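
The gradient clipping step in the training loop rescales all gradients together whenever their joint norm exceeds a threshold. The NumPy sketch below shows the idea on two made-up gradient arrays. It is not GluonNLP's implementation; the helper `clip_by_global_norm` and the small epsilon are assumptions for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is <= max_norm."""
    # The global norm treats all gradients as one long concatenated vector.
    global_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    # Only shrink gradients; leave them untouched if they are already small.
    scale = min(1.0, max_norm / (global_norm + 1e-8))
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float((g ** 2).sum()) for g in clipped))
print(norm_before, norm_after)
```

Because every gradient is multiplied by the same factor, the direction of the overall update is preserved and only its magnitude is limited.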


## Conclusion

In this tutorial, we showed how to fine-tune a sentence pair
classification model with pre-trained BERT parameters. In GluonNLP, this takes
only a few simple steps: apply a BERT-style data transformation to
pre-process the data, automatically download the pre-trained model, and feed the
transformed data into the model, all within 50 lines of code!

For demonstration purposes, we skipped the warmup learning rate
schedule and validation on the dev dataset used in the original
implementation. Please visit the
[BERT model zoo webpage](../../model_zoo/bert/index.rst), or the scripts/bert folder
in the GitHub repository for the complete fine-tuning scripts.
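
As a pointer to what such a warmup schedule could look like, the sketch below ramps the learning rate up linearly and then decays it linearly to zero. The function `warmup_linear_decay` and the 10% warmup ratio are assumptions for illustration, not the exact schedule of the fine-tuning scripts.

```python
def warmup_linear_decay(step, total_steps, base_lr, warmup_ratio=0.1):
    """Learning rate at `step` (0-based) for linear warmup then linear decay."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp up from 0 towards base_lr during warmup.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr towards 0 at total_steps.
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - warmup_steps)

total_steps = 54  # 3 epochs x 18 batches, matching the training log
schedule = [warmup_linear_decay(s, total_steps, base_lr=5e-6)
            for s in range(total_steps)]
print(schedule[0], max(schedule), schedule[-1])
```

In a Gluon training loop, a value like this could be applied with `trainer.set_learning_rate(...)` before each call to `trainer.update`.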

## References

[1] Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

[2] Dolan, William B., and Chris Brockett. "Automatically constructing a corpus of sentential paraphrases." Proceedings of the Third International Workshop on Paraphrasing (IWP2005). 2005.

[3] Peters, Matthew E., et al. "Deep contextualized word representations." arXiv preprint arXiv:1802.05365 (2018).