### MLM Logic

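Here, we prepare a single example for Masked Language Modelling (MLM) and pass it through BERT to obtain a loss.

First, import the libraries and load the pretrained tokenizer and model.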

In [None]:
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Example Text

In [None]:
# text = ("After Abrahm Lincoln won the November 1860 Presidential [MASK] on an "
#         "anti-slavery platform, an initial seven slave states declared their "
#         "secession from the country to form the Confederacy. War broke out in "
#         "1861 when sessionist forces [MASK] Fort Sumter in South "
#         "Carolina, just over a month after Lincoln's inauguration")
text = ("After Abrahm Lincoln won the November 1860 Presidential election on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy. War broke out in "
        "1861 when sessionist forces attacked Fort Sumter in South "
        "Carolina, just over a month after Lincoln's inauguration")

First, we will tokenize our text.

In [None]:
# to pad/truncate every sequence to a fixed length instead, use:
# inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding='max_length')
# return_tensors='pt' makes the tokenizer return PyTorch tensors
inputs = tokenizer(text, return_tensors='pt')
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

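Special Tokens

101 = [CLS] (classifier) token, added at the start of the sequence

102 = [SEP] (separator) token, added at the end of the sequence

103 = [MASK] token

0 = [PAD] token (absent here, because padding is not enabled)

Every other ID is an actual token from the text.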
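The special-token IDs are specific to the bert-base-uncased vocabulary. As a minimal sketch, the snippet below hard-codes those IDs and flags which positions of a toy tensor hold special tokens. The names `SPECIAL_IDS`, `is_special`, and `toy_ids` are illustrative and not part of the notebook; with a loaded tokenizer, the same IDs can be read from its `cls_token_id`, `sep_token_id`, `pad_token_id`, and `mask_token_id` attributes rather than typed by hand.

```python
import torch

# Special-token IDs in the bert-base-uncased vocabulary (hard-coded for illustration).
SPECIAL_IDS = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "[MASK]": 103}

def is_special(input_ids: torch.Tensor) -> torch.Tensor:
    """Return a boolean tensor that is True wherever input_ids holds a special token."""
    flags = torch.zeros_like(input_ids, dtype=torch.bool)
    for token_id in SPECIAL_IDS.values():
        flags |= input_ids == token_id
    return flags

# Toy sequence: [CLS], a word, [MASK], a word, [SEP], then two [PAD] tokens.
toy_ids = torch.tensor([[101, 2044, 103, 5367, 102, 0, 0]])
special_positions = is_special(toy_ids)
print(special_positions)
```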

In [None]:
inputs.input_ids

tensor([[  101,  2044, 11113, 10404,  2213,  5367,  2180,  1996,  2281,  7313,
          4883,  1031,  2602,  2006,  2019,  3424,  1011,  8864,  4132,  1010,
          2019,  3988,  2698,  6658,  2163,  4161,  2037, 22965,  2013,  1996,
          2406,  2000,  2433,  1996, 18179,  1012,  2162,  3631,  2041,  1999,
          6863,  2043,  5219,  2923,  2749,  4457,  3481,  7680,  3334,  1999,
          2148,  3792,  1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,
          1055, 17331,   102]])

Create the target labels under the key 'labels'. The labels are a copy of the original input_ids tensor, so we simply clone it before any masking is applied.

In [None]:
inputs['labels'] = inputs.input_ids.detach().clone()
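The reason for `clone()` rather than a plain assignment is that the masking step later overwrites `input_ids` in place. The sketch below uses a small hand-written tensor (not the notebook's data) to show that a cloned tensor keeps the original IDs while an alias does not.

```python
import torch

input_ids = torch.tensor([[101, 2044, 5367, 2180, 102]])

# clone() allocates new memory; detach() drops any autograd history.
labels = input_ids.detach().clone()

# A plain assignment only creates another reference to the same data.
alias = input_ids

# Overwrite position 2 with the [MASK] id, as the masking step does later.
input_ids[0, 2] = 103

print(labels[0, 2].item())  # 5367 (unchanged)
print(alias[0, 2].item())   # 103 (changed together with input_ids)
```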

In [None]:
inputs

{'input_ids': tensor([[  101,  2044, 11113, 10404,  2213,  5367,  2180,  1996,  2281,  7313,
          4883,  1031,  2602,  2006,  2019,  3424,  1011,  8864,  4132,  1010,
          2019,  3988,  2698,  6658,  2163,  4161,  2037, 22965,  2013,  1996,
          2406,  2000,  2433,  1996, 18179,  1012,  2162,  3631,  2041,  1999,
          6863,  2043,  5219,  2923,  2749,  4457,  3481,  7680,  3334,  1999,
          2148,  3792,  1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,
          1055, 17331,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labe

Now we mask a random selection of tokens in the input_ids tensor, leaving the labels tensor untouched.

Each token is selected with the same 15% probability we used before, provided it is not a [CLS] (101) or [SEP] (102) token. We also exclude [PAD] tokens (ID 0). No padding is applied in this example, so that last condition has no effect here, but it is required whenever padding is enabled.

In [None]:
# create random array of floats with equal dimensions to input_ids tensor
rand = torch.rand(inputs.input_ids.shape)
rand.shape

torch.Size([1, 63])

In [None]:
rand

tensor([[0.5781, 0.3307, 0.1340, 0.4611, 0.3069, 0.1520, 0.1894, 0.1903, 0.4399,
         0.6487, 0.2077, 0.2646, 0.1158, 0.1014, 0.1104, 0.7378, 0.2717, 0.3029,
         0.5183, 0.5930, 0.5438, 0.7892, 0.3562, 0.6088, 0.4990, 0.6191, 0.6214,
         0.1239, 0.3377, 0.5458, 0.4842, 0.8023, 0.9957, 0.0618, 0.8546, 0.5215,
         0.6935, 0.7516, 0.9613, 0.8956, 0.0335, 0.0840, 0.4051, 0.1205, 0.9481,
         0.3516, 0.0479, 0.8222, 0.5770, 0.6068, 0.0263, 0.5486, 0.8625, 0.4247,
         0.9496, 0.8863, 0.7352, 0.3732, 0.3446, 0.4884, 0.2773, 0.9050, 0.2130]])
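Because `torch.rand` draws uniformly from [0, 1), comparing against 0.15 selects each position independently with probability 0.15. On a short sequence the selected share can differ noticeably from 15%, but on a large tensor it settles close to it. The sketch below checks this; the fixed seed and the tensor shape are arbitrary choices made for the illustration.

```python
import torch

# A seeded generator makes the draw reproducible (the seed value is arbitrary).
gen = torch.Generator().manual_seed(0)

# Many more values than a single sentence, so the average is stable.
rand = torch.rand((1000, 512), generator=gen)

# True wherever the random value falls below the masking probability.
selected = rand < 0.15

# Share of selected positions; should be close to 0.15.
fraction = selected.float().mean().item()
print(f"selected fraction: {fraction:.4f}")
```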

In [None]:
# create mask array: select ~15% of positions,
# excluding [CLS] (101), [SEP] (102) and [PAD] (0) tokens
mask_arr = (rand < 0.15) & (inputs.input_ids != 101) & (inputs.input_ids != 102) & (inputs.input_ids != 0)

In [None]:
mask_arr

tensor([[False, False,  True, False, False, False, False, False, False, False,
         False, False,  True,  True,  True, False, False, False, False, False,
         False, False, False, False, False, False, False,  True, False, False,
         False, False, False,  True, False, False, False, False, False, False,
          True,  True, False,  True, False, False,  True, False, False, False,
          True, False, False, False, False, False, False, False, False, False,
         False, False, False]])

This gives a boolean array in which True marks the positions to be masked.

The first and last tokens are the special [CLS] and [SEP] tokens, and we never want to mask them (or any other special token). That is why the mask includes the extra conditions on the input IDs:

In [None]:
# 101 = Classifier token
# 102 = Separator token
# (inputs.input_ids != 101) * (inputs.input_ids != 102) * (inputs.input_ids != 0)
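The mask construction can be wrapped in a small helper so it works on a padded batch as well as on a single sequence. This is a sketch under assumptions: the function name `build_mask`, its parameters, and the toy batch are hypothetical, and the default special IDs are those of bert-base-uncased.

```python
import torch

def build_mask(input_ids, mlm_prob=0.15, special_ids=(0, 101, 102), generator=None):
    """Return a boolean mask selecting ~mlm_prob of the non-special positions."""
    rand = torch.rand(input_ids.shape, generator=generator)
    mask = rand < mlm_prob
    # Never select [PAD], [CLS] or [SEP] positions.
    for token_id in special_ids:
        mask &= input_ids != token_id
    return mask

# Toy batch: the first row is padded with three [PAD] tokens (id 0).
batch = torch.tensor([
    [101, 2044, 5367, 2180, 1996, 2602, 102, 0, 0, 0],
    [101, 2162, 3631, 2041, 1999, 6863, 2043, 2749, 4457, 102],
])

# A high probability is used here only so the tiny batch gets some True values.
mask = build_mask(batch, mlm_prob=0.5, generator=torch.Generator().manual_seed(42))
print(mask)
```

Whatever the random draw, special-token positions stay False because the exclusion is applied after the probability test.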

In [None]:
mask_arr[0]

tensor([False, False,  True, False, False, False, False, False, False, False,
        False, False,  True,  True,  True, False, False, False, False, False,
        False, False, False, False, False, False, False,  True, False, False,
        False, False, False,  True, False, False, False, False, False, False,
         True,  True, False,  True, False, False,  True, False, False, False,
         True, False, False, False, False, False, False, False, False, False,
        False, False, False])

In [None]:
# nonzero() returns the indices of the True values, one index per row (shape [n, 1])
mask_arr[0].nonzero()

tensor([[ 2],
        [12],
        [13],
        [14],
        [27],
        [33],
        [40],
        [41],
        [43],
        [46],
        [50]])

In [None]:
# convert vector to list
mask_arr[0].nonzero().tolist()

[[2], [12], [13], [14], [27], [33], [40], [41], [43], [46], [50]]

In [None]:
# each index is wrapped in its own list, so flatten the tensor before converting
selection = torch.flatten(mask_arr[0].nonzero()).tolist()

In [None]:
selection

[2, 12, 13, 14, 27, 33, 40, 41, 43, 46, 50]

In [None]:
# 103 = [MASK] token id
# index row 0 (the only sequence in the batch) and overwrite the selected positions
inputs.input_ids[0, selection] = 103

We now have [MASK] tokens in roughly 15% of the positions (11 of 63 here, because the selection is random).

We can see the values 103 have been assigned in the same positions as we found True values in the mask_arr tensor.

In [None]:
inputs.input_ids

tensor([[  101,  2044,   103, 10404,  2213,  5367,  2180,  1996,  2281,  7313,
          4883,  1031,   103,   103,   103,  3424,  1011,  8864,  4132,  1010,
          2019,  3988,  2698,  6658,  2163,  4161,  2037,   103,  2013,  1996,
          2406,  2000,  2433,   103, 18179,  1012,  2162,  3631,  2041,  1999,
           103,   103,  5219,   103,  2749,  4457,   103,  7680,  3334,  1999,
           103,  3792,  1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,
          1055, 17331,   102]])
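Selecting indices with `nonzero()` on `mask_arr[0]` handles one sequence at a time. For a batch, a boolean mask can be applied to every row at once. The sketch below uses `Tensor.masked_fill` on hand-written toy tensors (not the notebook's data) as one possible alternative, not the notebook's own method.

```python
import torch

input_ids = torch.tensor([
    [101, 2044, 5367, 2180, 102],
    [101, 2162, 3631, 102, 0],
])
mask_arr = torch.tensor([
    [False, True, False, False, False],
    [False, False, True, False, False],
])

# Replace every selected position with the [MASK] id (103) in one call.
# masked_fill returns a new tensor; input_ids itself is left unchanged.
masked_ids = input_ids.masked_fill(mask_arr, 103)

print(masked_ids.tolist())
# [[101, 103, 5367, 2180, 102], [101, 2162, 103, 102, 0]]
```

The in-place equivalent is `input_ids[mask_arr] = 103`, which matches the notebook's behaviour of modifying `input_ids` directly.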

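Now we can pass this to the model, which returns the loss and the logits. The logits hold, for every position, a score for each token in the vocabulary; at a [MASK] position, the highest-scoring token is the model's prediction.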

In [None]:
outputs = model(**inputs)

The output contains two tensors: loss and logits.

In [None]:
outputs.keys()

odict_keys(['loss', 'logits'])
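The logits tensor has shape (batch, sequence length, vocabulary size), so taking the argmax over the last dimension gives one predicted token ID per position. The sketch below uses random stand-in logits and a tiny made-up vocabulary so it runs without the model; the mask ID of 3 and all variable names are assumptions for the illustration. With the real model, `outputs.logits` would take the place of `logits`, and the predicted IDs could be turned back into tokens with `tokenizer.convert_ids_to_tokens`.

```python
import torch

vocab_size = 8   # toy vocabulary; bert-base-uncased has 30522 entries
seq_len = 5
toy_mask_id = 3  # plays the role of the [MASK] id (103) in this toy setup

gen = torch.Generator().manual_seed(0)
# Stand-in for outputs.logits: shape (batch, seq_len, vocab_size).
logits = torch.randn((1, seq_len, vocab_size), generator=gen)

input_ids = torch.tensor([[1, 4, toy_mask_id, 6, 2]])

# Positions in the first sequence that hold the mask id.
mask_positions = (input_ids[0] == toy_mask_id).nonzero().flatten()

# Highest-scoring vocabulary id at every position: shape (batch, seq_len).
predicted_ids = logits.argmax(dim=-1)

# Keep only the predictions at the masked positions.
predicted_at_mask = predicted_ids[0, mask_positions]
print(mask_positions.tolist(), predicted_at_mask.tolist())
```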

In [None]:
outputs.loss

tensor(1.6917, grad_fn=<NllLossBackward0>)
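Because the labels in this notebook are a full copy of the original input IDs, the loss is averaged over every position, not only the masked ones. A common variant, shown here as a sketch rather than as the notebook's method, sets the labels to -100 at unmasked positions; PyTorch's cross-entropy skips that value by default (`ignore_index=-100`), and the Hugging Face masked-LM head relies on the same convention. The toy logits, vocabulary size, and variable names below are assumptions for the illustration.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 8, 5
gen = torch.Generator().manual_seed(0)
# Stand-in for the model's logits: shape (batch, seq_len, vocab_size).
logits = torch.randn((1, seq_len, vocab_size), generator=gen)

labels = torch.tensor([[1, 4, 5, 6, 2]])  # original token ids
mask_arr = torch.tensor([[False, False, True, False, False]])

# Loss averaged over all positions (what a full copy of the labels gives).
loss_all = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))

# Loss restricted to masked positions: other labels are set to -100 and ignored.
masked_labels = labels.clone()
masked_labels[~mask_arr] = -100
loss_masked = F.cross_entropy(logits.view(-1, vocab_size), masked_labels.view(-1))

# With one masked position, this equals the loss at that position alone.
loss_single = F.cross_entropy(logits[0, 2:3], labels[0, 2:3])
print(loss_all.item(), loss_masked.item(), loss_single.item())
```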