
In [1]:
pip install transformers

Installing collected packages: tokenizers, safetensors, huggingface-hub, transformers
Successfully installed huggingface-hub-0.16.4 safetensors-0.3.2 tokenizers-0.13.3 transformers-4.32.0


In [4]:
from transformers import BertTokenizer, BertForMaskedLM
import torch

# Example 1: simple text and concept explanation

## Initialize tokenizer and model

In [12]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

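# Note: adjacent string literals are joined without separating spaces, so the
# tokenizer sees "deadlyattacks", "andsocial", and "human[MASK]" as written.
# The outputs shown in the following cells were produced from this exact string.
text = ("On social [MASK], thousands of people from around the globe condemned Israel’s latest deadly"
        "attacks on the Palestinians. Despite relentless attempts by Israel andsocial media companies to silence them,"
        "they raised awareness about Israel’s illegal [MASK] as well as its repeated violations of Palestinian human"
        "[MASK] and international law.")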

inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [13]:
outputs.keys()

odict_keys(['logits'])

The logits are the MLM output: for every input position, one score per token in the vocabulary.

In [14]:
outputs.logits.shape

torch.Size([1, 63, 30522])
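
The three dimensions of this shape are batch size, sequence length, and vocabulary size. As a rough illustration, the sketch below uses random values in place of real model output (`fake_logits` is a stand-in, not something produced by the notebook) to show how a softmax over the last dimension turns each position's scores into a probability distribution, and how `argmax` picks the most likely token ID per position.

```python
import torch

# Stand-in for outputs.logits: (batch, sequence length, vocabulary size).
# Random values are used here only to illustrate the shapes.
torch.manual_seed(0)
fake_logits = torch.randn(1, 63, 30522)

# Softmax over the vocabulary dimension gives one distribution per position.
probs = torch.softmax(fake_logits, dim=-1)
print(probs.shape)             # torch.Size([1, 63, 30522])
print(probs[0, 3].sum())       # close to 1.0

# The highest-scoring token ID at every position.
predicted_ids = fake_logits.argmax(dim=-1)
print(predicted_ids.shape)     # torch.Size([1, 63])
```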

To find the positions of the **[MASK]** tokens, we can search the `input_ids` tensor for the value `103`, the ID of `[MASK]` in this vocabulary (also available as `tokenizer.mask_token_id`).

In [15]:
mask_pos = torch.flatten((inputs.input_ids[0] == 103).nonzero()).tolist()
mask_pos

[3, 47, 57]

These are the positions for which we calculate the loss when training the model. How does that work? At each of those three positions, we compare the true token, represented as a one-hot encoding, with the predicted output, represented as a probability distribution over the vocabulary.
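
The comparison between a one-hot target and a predicted distribution is the cross-entropy loss. The following is a minimal sketch on made-up data (a toy vocabulary of 6 tokens and hypothetical positions, not the notebook's tensors), showing that the one-hot formulation and PyTorch's built-in `F.cross_entropy` agree.

```python
import torch
import torch.nn.functional as F

# Toy setup: a sequence of 5 positions and a vocabulary of 6 tokens.
# All values here are made up for illustration.
torch.manual_seed(0)
logits = torch.randn(5, 6)                  # one row of scores per position
true_ids = torch.tensor([4, 1, 3, 0, 2])    # original (unmasked) token IDs
mask_pos = [1, 3]                           # positions that were masked

# Predicted log-distribution and one-hot target at the masked positions only.
log_probs = F.log_softmax(logits[mask_pos], dim=-1)
one_hot = F.one_hot(true_ids[mask_pos], num_classes=6).float()

# Cross-entropy: negative log-probability of the true token, averaged.
manual_loss = -(one_hot * log_probs).sum(dim=-1).mean()

# The built-in function takes class indices directly and gives the same value.
builtin_loss = F.cross_entropy(logits[mask_pos], true_ids[mask_pos])
print(manual_loss, builtin_loss)
```

Because the target is one-hot, only the log-probability assigned to the true token contributes to the loss at each position.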

## Randomly mask tokens

In practice, we mask tokens at random. The original (unmasked) token IDs are passed to the model as `labels`, and the masked token IDs are passed as `input_ids`.

To test this, let's unmask our text first.

In [33]:
text = ("On social media, thousands of people from around the globe condemned Israel’s latest deadly"
 "attacks on the Palestinians. Despite relentless attempts by Israel andsocial media companies to silence them,"
  "they raised awareness about Israel’s illegal occupation as well as its repeated violations of Palestinian human"
   "rights and international law.")

And now convert our text using the tokenizer:

In [34]:
inputs = tokenizer(text, return_tensors='pt')

Before masking, clone `input_ids` into `labels` to serve as the ground truth:

In [36]:
inputs['labels'] = inputs.input_ids.detach().clone()
inputs

{'input_ids': tensor([[  101,  2006,  2591,  2865,  1010,  5190,  1997,  2111,  2013,  2105,
          1996,  7595, 10033,  3956,  1521,  1055,  6745,  9252, 19321,  8684,
          2015,  2006,  1996, 21524,  1012,  2750, 21660,  4740,  2011,  3956,
          1998,  6499, 13247,  2865,  3316,  2000,  4223,  2068,  1010,  2027,
          2992,  7073,  2055,  3956,  1521,  1055,  6206,  6139,  2004,  2092,
          2004,  2049,  5567, 13302,  1997,  9302,  2529, 15950,  2015,  1998,
          2248,  2375,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [37]:
# create a random tensor of floats with the same shape as input_ids
rand = torch.rand(inputs.input_ids.shape)
rand

tensor([[0.6923, 0.8197, 0.7476, 0.0119, 0.0370, 0.9191, 0.3837, 0.9546, 0.1439,
         0.5977, 0.8921, 0.1550, 0.1029, 0.6082, 0.8608, 0.7427, 0.9155, 0.4344,
         0.1660, 0.0449, 0.5158, 0.3124, 0.1385, 0.1940, 0.7310, 0.8536, 0.8797,
         0.7057, 0.8805, 0.1790, 0.7181, 0.0762, 0.3261, 0.2381, 0.4109, 0.1451,
         0.0927, 0.5450, 0.7416, 0.3041, 0.9239, 0.5813, 0.7397, 0.1297, 0.4223,
         0.5359, 0.4041, 0.3880, 0.8068, 0.2648, 0.2100, 0.5945, 0.5983, 0.5385,
         0.4520, 0.5081, 0.3499, 0.5901, 0.2262, 0.1826, 0.3845, 0.3542, 0.1545,
         0.8378]])

In [38]:
# mark positions where the random value is below 0.15; these become [MASK] tokens
# exclude the CLS (101) and SEP (102) tokens so they are never masked
mask_arr = (rand < 0.15) * (inputs.input_ids != 101) * (inputs.input_ids != 102)
mask_arr

tensor([[False, False, False,  True,  True, False, False, False,  True, False,
         False, False,  True, False, False, False, False, False, False,  True,
         False, False,  True, False, False, False, False, False, False, False,
         False,  True, False, False, False,  True,  True, False, False, False,
         False, False, False,  True, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False]])

In [39]:
# get the indices to be masked

selection = torch.flatten((mask_arr[0]).nonzero()).tolist()
selection

[3, 4, 8, 12, 19, 22, 31, 35, 36, 43]

In [40]:
inputs.input_ids[0, selection] = 103
inputs

{'input_ids': tensor([[  101,  2006,  2591,   103,   103,  5190,  1997,  2111,   103,  2105,
          1996,  7595,   103,  3956,  1521,  1055,  6745,  9252, 19321,   103,
          2015,  2006,   103, 21524,  1012,  2750, 21660,  4740,  2011,  3956,
          1998,   103, 13247,  2865,  3316,   103,   103,  2068,  1010,  2027,
          2992,  7073,  2055,   103,  1521,  1055,  6206,  6139,  2004,  2092,
          2004,  2049,  5567, 13302,  1997,  9302,  2529, 15950,  2015,  1998,
          2248,  2375,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
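
The masking steps above can be gathered into one helper. The sketch below is one possible way to do it; the function name `mask_tokens` and its parameters are hypothetical, and the default IDs (101, 102, 103) are the ones used in this notebook. It uses a boolean mask for assignment, so it also works on batches with more than one sequence.

```python
import torch

def mask_tokens(input_ids, mask_prob=0.15, mask_id=103, special_ids=(101, 102)):
    """Return (masked_ids, labels, mask_arr) for a batch of token IDs.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    labels = input_ids.detach().clone()       # ground truth, left untouched
    rand = torch.rand(input_ids.shape)        # one random float per token
    mask_arr = rand < mask_prob               # candidate positions to mask
    for special in special_ids:
        mask_arr &= input_ids != special      # never mask CLS / SEP
    masked_ids = input_ids.clone()
    masked_ids[mask_arr] = mask_id            # replace selected tokens
    return masked_ids, labels, mask_arr

torch.manual_seed(0)
# A short made-up sequence framed by CLS (101) and SEP (102).
example_ids = torch.tensor([[101, 2006, 2591, 2865, 1010, 5190, 1997, 2111, 102]])
masked_ids, labels, mask_arr = mask_tokens(example_ids, mask_prob=0.5)
print(masked_ids)
print(mask_arr)
```

A higher `mask_prob` is used in the usage lines only so that the effect is visible on such a short sequence.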

## Pass these inputs into our model

In [41]:
outputs = model(**inputs)
outputs.keys()


odict_keys(['loss', 'logits'])

In [42]:
outputs.loss

tensor(1.8564, grad_fn=<NllLossBackward0>)

This is the loss we optimize during training.
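
As a rough sketch of what a single optimization step on such a loss could look like, the example below uses a tiny stand-in model (`TinyMLM`, a hypothetical embedding plus linear layer, not BERT) and made-up token IDs so that it stays self-contained. It also sets the labels at unmasked positions to -100, the default `ignore_index` of PyTorch's cross-entropy loss, so that only masked positions contribute. In the cells above the labels are a full copy of the input, so the reported loss appears to be averaged over every position rather than only the masked ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMLM(nn.Module):
    """Hypothetical stand-in for a masked language model (not BERT)."""
    def __init__(self, vocab_size=200, hidden=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, input_ids):
        return self.head(self.embed(input_ids))   # (batch, seq_len, vocab)

torch.manual_seed(0)
toy_model = TinyMLM()
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-2)

# Made-up IDs: 1 = CLS, 2 = SEP, 3 = MASK in this toy vocabulary.
original_ids = torch.tensor([[1, 10, 11, 12, 13, 14, 2]])
mask_arr = torch.tensor([[False, False, True, False, True, False, False]])

input_ids = original_ids.clone()
input_ids[mask_arr] = 3            # replace the selected tokens with MASK
labels = original_ids.clone()
labels[~mask_arr] = -100           # unmasked positions are ignored by the loss

def mlm_loss():
    logits = toy_model(input_ids)
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)

loss_before = mlm_loss()
optimizer.zero_grad()
loss_before.backward()             # gradients of the loss w.r.t. the weights
optimizer.step()                   # one parameter update
loss_after = mlm_loss().item()
print(loss_before.item(), loss_after)
```

With `BertForMaskedLM`, the same idea would amount to setting `inputs['labels']` to -100 wherever `mask_arr` is false before calling the model, then calling `backward()` on `outputs.loss`.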