# Fine-tuning BERT for the MLM task with the Cranfield set
Generic MLM training, inspired by [this post](https://towardsdatascience.com/masked-language-modelling-with-bert-7d49793e5d2c) by James Briggs.

This notebook creates an MLM training set, fine-tunes a pre-trained BERT model on the MLM task, and finally saves the model in the Models directory.

## Import dependencies and configure settings

In [None]:
# the next line requires that this notebook is connected to google drive
# %cd /content/drive/MyDrive/COMPUTING SCIENCE/THESIS_PROJECT/bert-meets-cranfield/Code
%cd /home/jupyter/BERT-BM25-Thesis-Project/bert-meets-cranfield-enrich/Code

/home/jupyter/BERT-BM25-Thesis-Project/bert-meets-cranfield-enrich/Code


In [None]:
!pip install transformers



In [None]:
from transformers import BertTokenizer, BertForMaskedLM
# from pytorch_pretrained_bert import WEIGHTS_NAME, CONFIG_NAME
import torch
import torch.nn as nn
# import os

import data_utils

from transformers import AdamW
from tqdm import tqdm, tqdm_notebook, notebook
import numpy as np

## Set Hyperparameters and Flags

In [None]:
output_dir = "/home/jupyter/BERT-BM25-Thesis-Project/Models/" #@param {type:"string"}

In [None]:
MAX_LENGTH = 128
BATCH_SIZE = 16
LEARNING_RATE = 5e-5
EPOCHS = 2

# GLOBAL TESTING FLAG
ALLOW_CONSECUTIVE_MASKING = False

# model_dir = '/content/drive/MyDrive/COMPUTING SCIENCE/THESIS_PROJECT/bert-meets-cranfield/Models/'
# custom
output_model_file = output_dir + 'BERT_Cranfield_MLM_model-' + str(MAX_LENGTH) + '-' + str(BATCH_SIZE) + '-' + str(LEARNING_RATE) + '-' + str(EPOCHS) + '.bin'
output_config_file = output_dir + 'BERT_Cranfield_MLM_config-' + str(MAX_LENGTH) + '-' + str(BATCH_SIZE) + '-' + str(LEARNING_RATE) + '-' +  str(EPOCHS) + '.bin'
output_vocab_file = output_dir + 'BERT_Cranfield_MLM_vocab-' + str(MAX_LENGTH) + '-' + str(BATCH_SIZE) + '-' + str(LEARNING_RATE) + '-' +  str(EPOCHS) + '.bin'

print("# =========MLM=TRAINING===============================")
print("#               Hyper-Parameters")
print(LEARNING_RATE)
print(MAX_LENGTH)
print(BATCH_SIZE)
print(EPOCHS)
print("#               Experiment-Settings")
print('ALLOW_CONSECUTIVE_MASKING ', ALLOW_CONSECUTIVE_MASKING)

print("#               Other")
print(torch.cuda.get_device_name())
print("# ========================================")

#               Hyper-Parameters
5e-05
128
16
2
#               Experiment-Settings
ALLOW_CONSECUTIVE_MASKING  False
#               Other
Tesla T4
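
The three output paths above repeat the same hyperparameter suffix. A minimal sketch of building them with one helper, assuming the same naming scheme; `run_suffix` and `output_path` are hypothetical names that do not appear in the notebook:

```python
from pathlib import PurePosixPath

MAX_LENGTH = 128
BATCH_SIZE = 16
LEARNING_RATE = 5e-5
EPOCHS = 2

def run_suffix(max_length, batch_size, learning_rate, epochs):
    # hypothetical helper: joins the hyperparameters into one suffix
    # 5e-5 is rendered as '5e-05', which is the form seen in the file names
    return f"{max_length}-{batch_size}-{learning_rate}-{epochs}"

def output_path(output_dir, kind, suffix):
    # hypothetical helper: kind is one of 'model', 'config' or 'vocab'
    return str(PurePosixPath(output_dir) / f"BERT_Cranfield_MLM_{kind}-{suffix}.bin")

suffix = run_suffix(MAX_LENGTH, BATCH_SIZE, LEARNING_RATE, EPOCHS)
model_path = output_path("/home/jupyter/BERT-BM25-Thesis-Project/Models/", "model", suffix)
print(model_path)
```

Keeping the suffix in one place makes it harder for the model, config and vocab names to drift apart when a hyperparameter is added.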


## Get tokenized corpus
These cells import the Cranfield corpus and tokenize it. They also add the labels used for training and testing.

In [None]:
corpus = data_utils.get_corpus('../Data/cran/cran.all.1400')
# create tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
# do_lower_case: lowercase the input when tokenizing
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
inputs = tokenizer(corpus, return_tensors='pt', max_length=MAX_LENGTH, 
                   truncation=True, padding='max_length')
# inputs has no labels yet, so we create them by copying the input_ids
inputs['labels'] = inputs.input_ids.detach().clone()

## Masking the training set
The basic task is to mask a certain percentage of the tokens, around 15%, without masking the special tokens (`[CLS]` = 101, `[SEP]` = 102, `[PAD]` = 0). Based on ongoing insights, there is also an option that prevents consecutive tokens from being masked.

In [None]:
torch.random.manual_seed(234)
# draw one uniform random value per token; both branches below need it
rand = torch.rand(inputs.input_ids.shape)
# special tokens are off limits: 101 = [CLS], 102 = [SEP], 0 = [PAD]
not_special = (inputs.input_ids != 101) * \
              (inputs.input_ids != 102) * (inputs.input_ids != 0)
if ALLOW_CONSECUTIVE_MASKING:
  # mask at random by thresholding the random values
  mask_arr = (rand < 0.15) * not_special
else:
  # only odd indices may be masked, so two neighbouring tokens are never both masked
  basis = [False, True]
  chessboard = np.tile(basis, inputs.input_ids.shape)
  # np.tile doubles the width; cut it back to the sequence length
  chessboard = chessboard[:, 0:MAX_LENGTH]
  chessboard[:, -1] = False
  chessboard = torch.tensor(chessboard)
  # a threshold around 0.3 should be best, but given the fixed seed we can tune toward the target fraction
  rand_chessboard = (rand < .35) * chessboard
  mask_arr = rand_chessboard * not_special
  print(mask_arr[0])
  # check which fraction of the first document is masked
  print(sum(mask_arr[0] == True) / MAX_LENGTH)

tensor([False, False, False,  True, False, False, False,  True, False, False,
        False,  True, False, False, False, False, False, False, False,  True,
        False, False, False,  True, False,  True, False, False, False, False,
        False, False, False, False, False, False, False, False, False,  True,
        False, False, False,  True, False,  True, False, False, False, False,
        False, False, False, False, False, False, False,  True, False,  True,
        False,  True, False,  True, False, False, False, False, False, False,
        False,  True, False, False, False, False, False, False, False,  True,
        False, False, False, False, False, False, False, False, False,  True,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,  True,
        False, False, False, False, False, False, False, False, False, False,
        False,  True, False, False, False, False, False, False])
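
To confirm that the chessboard scheme does what it promises, one can check that no two neighbouring positions are both masked. A minimal NumPy sketch on toy masks; `has_consecutive_masks` and `masked_fraction` are hypothetical helpers, and the toy arrays stand in for `mask_arr`:

```python
import numpy as np

def has_consecutive_masks(mask):
    # mask: 2-D boolean array, one row per document
    mask = np.asarray(mask, dtype=bool)
    # True if some position and its right-hand neighbour are both masked
    return bool(np.any(mask[:, :-1] & mask[:, 1:]))

def masked_fraction(mask):
    # fraction of masked positions per document
    return np.asarray(mask, dtype=bool).mean(axis=1)

# toy stand-ins for mask_arr
chessboard_style = np.array([[False, True, False, False, False, True, False, False]])
free_style = np.array([[False, True, True, False, False, False, False, False]])

print(has_consecutive_masks(chessboard_style))
print(has_consecutive_masks(free_style))
print(masked_fraction(chessboard_style))
```

For the real data, `mask_arr.numpy()` could be passed to both helpers.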

Collect the masked positions per document; these indices are used in the next step to overwrite the tokens.

In [None]:
selection = []

for i in range(inputs.input_ids.shape[0]):
  selection.append(
      torch.flatten(mask_arr[i].nonzero()).tolist()
  )

The randomly chosen `input_ids` are overwritten with 103, the id of the `[MASK]` token.

In [None]:
for i in range(inputs.input_ids.shape[0]):
  inputs.input_ids[i, selection[i]] = 103
# testing:
inputs.input_ids

tensor([[  101,  6388,  4812,  ...,  4297, 28578,   102],
        [  101,   103, 18330,  ...,   103,  6895,   102],
        [  101,  1996,  6192,  ...,     0,     0,     0],
        ...,
        [  101,  9211,  1997,  ...,   103,  2011,   102],
        [  101, 10131,  2989,  ...,     0,     0,     0],
        [  101,   103, 10131,  ...,     0,     0,     0]])

In [None]:
# make an illustration
input0 = inputs.input_ids[0]
tokenizer.decode(input0)

'[CLS] experimental investigation [MASK] the aerodynamics [MASK] a wing in [MASK] slipstream an experimental study of a [MASK] in a propeller [MASK]tream [MASK] made in order to determine the spanwise distribution of the lift increase [MASK] to slipstream [MASK] different [MASK] of attack of the wing and at different free stream to [MASK]tream [MASK] ratios [MASK] results [MASK] intended in part as an evaluation basis [MASK] different theoretical treatments of this problem the [MASK] span loading curves, together with supporting evidence, [MASK] that a substantial part of the lift increment produced by the slipstream was due to a [MASK] destalling / or boundary - layer - control effect [MASK] integrated remaining lift increm [SEP]'

Here any consecutive masking can be seen, if present; the `ALLOW_CONSECUTIVE_MASKING` flag set above prevents it.
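
The labels created earlier are a full copy of `input_ids`, so the loss returned by the model covers every position, including unmasked and padding tokens. A common variant restricts the loss to the masked positions by setting all other labels to -100, the index that PyTorch's cross-entropy loss ignores by default. A minimal sketch on toy tensors, which stand in for `inputs.input_ids` (before masking) and `mask_arr`; this is an alternative, not what the notebook does:

```python
import torch

# toy stand-ins for inputs.input_ids (before masking) and mask_arr
input_ids = torch.tensor([[101, 2023, 2003, 1037, 3231, 102, 0, 0]])
mask_arr = torch.tensor([[False, True, False, False, True, False, False, False]])

labels = input_ids.detach().clone()
# positions that were not masked get -100 and are skipped by the loss
labels[~mask_arr] = -100

masked_inputs = input_ids.clone()
# 103 is the id of [MASK] in bert-base-uncased
masked_inputs[mask_arr] = 103

print(labels)
print(masked_inputs)
```

With labels built this way, the reported loss reflects only how well the masked tokens are recovered, so it is not directly comparable to the loss values of the full-copy setup.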

### Create dataloader
A custom PyTorch `Dataset`, wrapped in a `DataLoader` for use during training.

In [None]:
class CranfieldDataset(torch.utils.data.Dataset):
  def __init__(self, encodings):
    self.encodings = encodings
  def __getitem__(self, idx):
    # the encodings already hold tensors, so copy them instead of re-wrapping with torch.tensor
    return {key: val[idx].clone().detach() for key, val in self.encodings.items()}
  def __len__(self):
    return len(self.encodings.input_ids)

In [None]:
# initialize the data
dataset = CranfieldDataset(inputs)
# split the input in train and test
train_dataset, test_dataset = torch.utils.data.random_split(dataset, [1000, 400], generator=torch.Generator().manual_seed(88))
# and initialize the dataloader
# keep the settings
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=True)
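
With 1000 training and 400 test documents and a batch size of 16, the loaders yield 63 and 25 batches respectively, and the last training batch holds only 8 documents. A small sketch of that arithmetic, assuming the `DataLoader` default of keeping the incomplete last batch; `batch_layout` is a hypothetical helper:

```python
import math

def batch_layout(n_items, batch_size):
    # number of batches when the incomplete last batch is kept
    n_batches = math.ceil(n_items / batch_size)
    # size of the final batch
    last = n_items - (n_batches - 1) * batch_size
    return n_batches, last

train_layout = batch_layout(1000, 16)
test_layout = batch_layout(400, 16)
print(train_layout)
print(test_layout)
```

This matters for the evaluation loop further down, which stops at any batch smaller than `BATCH_SIZE`; with 400 test documents every batch is full.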

## Set up training / evaluation
* set the device to GPU if one is available
* initialize the optimizer

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
if torch.cuda.is_available():
  print('GPU Type:', torch.cuda.get_device_name(0))

GPU Type: Tesla T4


In [None]:
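# move the model to the selected device
model.to(device)
# put the model in training mode
model.train()
optim = AdamW(model.parameters(), lr=LEARNING_RATE)
# not used below: the model computes its own loss when labels are passed
criterion = nn.CrossEntropyLoss()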

In [None]:
# check input keys
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])

## Train / Test

In [None]:
for epoch in range(EPOCHS):
  loop = notebook.tqdm(train_loader, leave=True)
  for batch in loop:
    optim.zero_grad()
    # define input tensors
    input_ids       = batch['input_ids'].to(device)
    attention_mask  = batch['attention_mask'].to(device)
    labels          = batch['labels'].to(device)
    # process
    output = model(input_ids, attention_mask=attention_mask, labels=labels)
    loss = output.loss
    loss.backward()
    # update parameters
    optim.step()
    # print relevant info to progress bar
    loop.set_description(f'Epoch {epoch}')
    loop.set_postfix(loss=loss.item())


## Save trained model
The following describes the technical implications and the solutions found. It takes the PyTorch approach, since the method was implemented in that framework. One could also use the TensorFlow implementation, with the potential benefit of integration with Google's Tensor Processing Unit (TPU).

In both cases (MLM and NSP) the model was saved through the PyTorch method `state_dict`, because the BERT model used is a PyTorch `torch.nn.Module` subclass; an elaborate description can be found on the [huggingface.co website](https://huggingface.co/transformers/model_doc/bert.html#bertmodel).
A PyTorch model can be saved as a `state_dict`, a 'Python dictionary object that maps each layer to its parameter tensor'; more can be read in the [manual](https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict).
This enables easy inspection and selection of keys, so that only the encoder is saved, i.e. only the keys starting with `bert.encoder` are kept.

For the ranking, a standard pre-trained BERT model is loaded with a `BertForSequenceClassification` task head on top of it; after that, the encoder parameters from either the MLM task or the NSP task are loaded. Two options will then be researched: in one the encoder parameters are frozen, in the other no parameters are frozen. With these options the BERT re-ranker is trained with various settings, such as the number of epochs and the learning rate.
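
The load-and-freeze step described above can be sketched with tiny stand-in modules, so that it runs without downloading BERT. `TinyBert`, `TinyMLM` and `TinyClassifier` are hypothetical classes that only mimic the `bert.encoder` key layout; in the real setting the target would be a `BertForSequenceClassification` model and the dictionary would come from `torch.load` on the saved file:

```python
import torch
import torch.nn as nn

class TinyBert(nn.Module):
    # stand-in for the BERT body: embeddings plus an encoder
    def __init__(self):
        super().__init__()
        self.embeddings = nn.Embedding(10, 4)
        self.encoder = nn.Linear(4, 4)

class TinyMLM(nn.Module):
    # stand-in for BertForMaskedLM
    def __init__(self):
        super().__init__()
        self.bert = TinyBert()
        self.cls = nn.Linear(4, 10)

class TinyClassifier(nn.Module):
    # stand-in for BertForSequenceClassification
    def __init__(self):
        super().__init__()
        self.bert = TinyBert()
        self.classifier = nn.Linear(4, 2)

mlm = TinyMLM()
# keep only the encoder parameters, as in the save step
encoder_state = {k: v for k, v in mlm.state_dict().items()
                 if k.startswith('bert.encoder')}

ranker = TinyClassifier()
# strict=False tolerates the keys that are absent from encoder_state
result = ranker.load_state_dict(encoder_state, strict=False)
print(result.missing_keys)

# option 1: freeze the encoder so that only the other parameters are trained
for name, param in ranker.named_parameters():
    if name.startswith('bert.encoder'):
        param.requires_grad = False

trainable = [n for n, p in ranker.named_parameters() if p.requires_grad]
print(trainable)
```

For the second option (nothing frozen), the freezing loop is simply left out. Checking `missing_keys` and `unexpected_keys` after loading is a cheap way to verify that the prefix filter matched the intended parameters.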

In [None]:
# the next line takes care of potentially distributed / parallel training
# common practice on the Hugging Face forums (Thom Wolf posts)
model_to_save = model.module if hasattr(model, 'module') else model
to_save_dict = model_to_save.state_dict()
to_save_with_prefix = {}
# keep only the encoder parameters
for key, value in to_save_dict.items():
    if key.startswith('bert.encoder'):
        to_save_with_prefix[key] = value
torch.save(to_save_with_prefix, output_model_file)
print("saved to: ", output_model_file)

saved to:  /home/jupyter/BERT-BM25-Thesis-Project/Models/BERT_Cranfield_MLM_model-128-16-5e-05-2.bin


Inspect the state dict; it should contain only encoder parameters.

In [None]:
# Check the state_dict 
# for the MLM-task it would look like this
print("MLM-Models state dict:\n")
for param_tensor in to_save_with_prefix:
  print(param_tensor, "\t", to_save_with_prefix[param_tensor].size())

MLM-Models state dict:

bert.encoder.layer.0.attention.self.query.weight 	 torch.Size([768, 768])
bert.encoder.layer.0.attention.self.query.bias 	 torch.Size([768])
bert.encoder.layer.0.attention.self.key.weight 	 torch.Size([768, 768])
bert.encoder.layer.0.attention.self.key.bias 	 torch.Size([768])
bert.encoder.layer.0.attention.self.value.weight 	 torch.Size([768, 768])
bert.encoder.layer.0.attention.self.value.bias 	 torch.Size([768])
bert.encoder.layer.0.attention.output.dense.weight 	 torch.Size([768, 768])
bert.encoder.layer.0.attention.output.dense.bias 	 torch.Size([768])
bert.encoder.layer.0.attention.output.LayerNorm.weight 	 torch.Size([768])
bert.encoder.layer.0.attention.output.LayerNorm.bias 	 torch.Size([768])
bert.encoder.layer.0.intermediate.dense.weight 	 torch.Size([3072, 768])
bert.encoder.layer.0.intermediate.dense.bias 	 torch.Size([3072])
bert.encoder.layer.0.output.dense.weight 	 torch.Size([768, 3072])
bert.encoder.layer.0.output.dense.bias 	 torch.Size([768])

## Evaluate MLM training
The accuracy is measured simply as the fraction of masked tokens that is predicted correctly.

In [None]:
def acc(predictions, input_ids, labels, batch_size):
  # returns an accuracy estimate, i.e. the fraction of masked tokens counted as correct,
  # averaged over the documents in the batch
  # predictions     [MaskedLMOutput]
  # input_ids       [torch.Tensor on device]
  # labels          [torch.Tensor on device]
  # batch_size      [int]

  batch_accuracy = 0
  for bi in range(batch_size):
    # predicted token id per position of this 'document'
    pbi = torch.argmax(predictions.logits[bi], dim=1)
    pbi = pbi.to('cpu').numpy()
    # get ground truth, i.e. labels
    gt = labels[bi].to('cpu').numpy()
    # get the number of masked items
    mk = input_ids[bi].to('cpu').numpy()
    masked = np.sum(mk == 103)
    # label token ids that appear nowhere in the prediction
    # note: this compares token ids as sets rather than position by position,
    # so it approximates the per-position accuracy
    missed = np.setdiff1d(gt, pbi, assume_unique=True)

    batch_accuracy += ((1 - (len(missed) / masked)) / batch_size)
  return batch_accuracy
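
The function above compares token ids as sets. A position-wise alternative looks only at the masked positions and checks whether the predicted id equals the label there. A minimal NumPy sketch, in which `masked_accuracy` is a hypothetical helper and the arrays are toy data; unlike `acc`, it averages over all masked tokens in the batch rather than per document:

```python
import numpy as np

MASK_ID = 103  # id of [MASK] in bert-base-uncased

def masked_accuracy(pred_ids, input_ids, label_ids):
    # all arguments: integer arrays of shape (batch, seq_len)
    pred_ids, input_ids, label_ids = map(np.asarray, (pred_ids, input_ids, label_ids))
    masked = input_ids == MASK_ID
    if masked.sum() == 0:
        # nothing was masked, so the accuracy is undefined
        return float('nan')
    # compare prediction and label only where the input was masked
    return float((pred_ids[masked] == label_ids[masked]).mean())

toy_labels = np.array([[101, 7, 8, 9, 102], [101, 5, 6, 102, 0]])
toy_inputs = np.array([[101, 103, 8, 103, 102], [101, 5, 103, 102, 0]])
toy_preds = np.array([[101, 7, 8, 4, 102], [101, 5, 6, 102, 0]])

toy_acc = masked_accuracy(toy_preds, toy_inputs, toy_labels)
print(toy_acc)
```

For a model output, the predictions could be obtained with `torch.argmax(output.logits, dim=-1).cpu().numpy()`.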

In [None]:
model.eval()
sum_loss = 0
total_acc = 0
count = 0
loop = notebook.tqdm(test_loader, leave=True)
for batch in loop:
  with torch.no_grad():
    input_ids       = batch['input_ids'].to(device)
    attention_mask  = batch['attention_mask'].to(device)
    labels          = batch['labels'].to(device)
    # stop at an incomplete final batch, since acc() assumes BATCH_SIZE documents
    if input_ids.shape[0] != BATCH_SIZE:
      break
    print('input_ids shape ', input_ids.shape)
    output = model(input_ids, attention_mask=attention_mask, labels=labels)
    # calculate the accuracy
    batch_acc = acc(output, input_ids, labels, BATCH_SIZE)
    total_acc += batch_acc
    print('Batch accuracy: ', batch_acc, ' for batch ', count)

    sum_loss += output.loss.item()
    count += 1
# mean loss over the evaluated batches
print(sum_loss / count)
# loss of the last evaluated batch
print(output.loss.item())
# mean accuracy over the evaluated batches
print('batch_acc :', total_acc / count)


input_ids shape  torch.Size([16, 128])
Batch accuracy:  0.6653903388278388  for batch  0
input_ids shape  torch.Size([16, 128])
Batch accuracy:  0.7343068351609205  for batch  1
input_ids shape  torch.Size([16, 128])
Batch accuracy:  0.7467894142894141  for batch  2
input_ids shape  torch.Size([16, 128])
Batch accuracy:  0.7956691167122838  for batch  3
input_ids shape  torch.Size([16, 128])
Batch accuracy:  0.7277105231359042  for batch  4
input_ids shape  torch.Size([16, 128])
Batch accuracy:  0.7962176270644877  for batch  5
input_ids shape  torch.Size([16, 128])
Batch accuracy:  0.7044602283839982  for batch  6
input_ids shape  torch.Size([16, 128])
Batch accuracy:  0.7585587546984606  for batch  7
input_ids shape  torch.Size([16, 128])
Batch accuracy:  0.7800759809412555  for batch  8
input_ids shape  torch.Size([16, 128])
Batch accuracy:  0.7121062551525937  for batch  9
input_ids shape  torch.Size([16, 128])
Batch accuracy:  0.7093934235272112  for batch  10
input_ids shape  tor

## Analysis

In [None]:
### illustrative analysis

In [None]:
# what is item 0 of the test dataset?
test0 = test_dataset[0]
print(test0['input_ids'])
a = np.array(test0['input_ids'])
# check the decoded version
tokenizer.decode(a)

tensor([  101,  1996,  4895,   103,  2100,  6336,  1997,  1037,  3358,   103,
        10713,   103,  6463,   103, 25647,   103,  1011,  6336,  4972,   103,
         4777,  1997, 10713,   103,  6463,  2031,  2042, 10174,  2011,   103,
         2075,  1996, 28033,  1999,  8743,   103,  1998,  1996,  6466,   103,
         2886,  1997,  1996, 10709,  3358,  1996, 16268,  2024,  2241,  2006,
         1996,   103,  4118,   103,  3225,  6336,  1997,  1996, 10713,  3358,
         2003,  2179,  2000,   103,  2069,   103,  2625,  2084,  2008,  1997,
         1996,   103,  3358,  1010,  6168,  1996,  2345,  6336,  2089,   103,
         9839,  2625,  1996,  3399,  7127,  2008,  1996,  3988,  4353,  1997,
         6336,  2003,  2714,   103,  1996,   103,  4353, 10543,  4760,   103,
         8386,   103,  6336,  2044,  1037,  5573,  3131,  2689,  1999,   103,
         1997,  2886,  1010,   103, 20015,  1997,  1037, 22147, 11818, 26903,
         1010,  1998,  2076,  1037,  7142,  9808,  6895,   102])

'[CLS] the un [MASK]y lift of a wing [MASK] finite [MASK] ratio [MASK]stead [MASK] - lift functions [MASK] wings of finite [MASK] ratio have been calculated by [MASK]ing the aerodynamic inert [MASK] and the angle [MASK] attack of the infinite wing the calculations are based on the [MASK] method [MASK] starting lift of the finite wing is found to [MASK] only [MASK] less than that of the [MASK] wing, whereas the final lift may [MASK] considerably less the theory indicates that the initial distribution of lift is similar [MASK] the [MASK] distribution curves showing [MASK] variation [MASK] lift after a sudden unit change in [MASK] of attack, [MASK] penetration of a sharpedge gust, and during a continuous osci [SEP]'

In [None]:
test01 = test_dataset[0:2]
print(test01['input_ids'].size())

torch.Size([2, 128])

In [None]:
model.eval()
with torch.no_grad():
  input_ids       = test01['input_ids'].to(device)
  attention_mask  = test01['attention_mask'].to(device)
  predictions= model(input_ids, attention_mask=attention_mask)

In [None]:
predictions.logits[0].size()

# predicted_index = torch.argmax(predictions.logits[0], dim=0).item()
p0 = torch.argmax(predictions.logits[0], dim=1)
np0 = p0.to('cpu').numpy()

# p0np = np.array(p0)
tokenizer.decode(np0)


'[CLS] the unsteady lift of a wing of finite aspect ratio unsteady - lift functions for wings of finite aspect ratio have been calculated by neglecting the aerodynamic inertia and the angle of attack of the infinite wing the calculations are based on the classical method the starting lift of the finite wing is found to be only slightly less than that of the finite wing, whereas the final lift may be considerably less the theory indicates that the initial distribution of lift is similar to the initial distribution curves showing the variation in lift after a sudden unit change in angle of attack, the penetration of a sharpedge gust, and during a continuous osci [SEP]'

The original text is shown below, with some of the masked tokens in bold.
__NOTE__ the bold words are only correct when the `ALLOW_CONSECUTIVE_MASKING` flag is set (for completeness, the label is printed in the cell after this one).

"the __unstead__y lift of a wing __of__ finite aspect ratio __.
unsteady__-lift functions __for__ wings of finite __aspect__ ratio have been
calculated by __correct__ing the aerodynamic inert __ia__ and the angle of __attack__ of
 the infinite wing . the __calculations__ are based on the operational
method .
the starting lift of the finite wing is found to be only slightly less
than that of the infinite wing,. whereas the final lift may be
considerably less . the theory indicates that the initial distribution of
lift is similar to the final distribution .
curves showing the variation of lift after a sudden unit change in angle
 of attack, during penetration of a sharpedge gust, and during a
continuous oscillation "

In [None]:
print(tokenizer.decode(test01['labels'][0]))

[CLS] the unsteady lift of a wing of finite aspect ratio unsteady - lift functions for wings of finite aspect ratio have been calculated by correcting the aerodynamic inertia and the angle of attack of the infinite wing the calculations are based on the operational method the starting lift of the finite wing is found to be only slightly less than that of the infinite wing, whereas the final lift may be considerably less the theory indicates that the initial distribution of lift is similar to the final distribution curves showing the variation of lift after a sudden unit change in angle of attack, during penetration of a sharpedge gust, and during a continuous osci [SEP]


This illustration gives a good idea of how the model 'works' and performs, but how well does it do in numbers, i.e. which fraction of the masked tokens does it recover?

In [None]:
# get the ground truth and prediction tokens, for this example
gt = test0['labels']
mk = test0['input_ids']
mk = mk.to('cpu').numpy()
pr = torch.argmax(predictions.logits[0], dim=1)
pr = pr.to('cpu').numpy()

masked = np.sum(mk == 103)
missed = np.setdiff1d(gt, pr, assume_unique=True)

print('masked: ', masked)
print('missed: ', missed) # token ids
perc = (masked - len(missed)) / masked
print('percentage guessed right: ', perc)

masked:  22
missed:  [6149 6515]
percentage guessed right:  0.9090909090909091
