# MLM Training

In this notebook we'll cover the training process for masked-language modeling (MLM). First we import and initialize everything required.

In [34]:
from transformers import BertTokenizer, BertForMaskedLM, AutoTokenizer
import torch
import csv, re, json
from tqdm import tqdm

old_tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = BertForMaskedLM.from_pretrained('mmaguero/gn-bert-small-cased')

Some weights of the model checkpoint at mmaguero/gn-bert-small-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


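We'll be using annotated Tupi sentences as our training data, loaded from two JSON files. `anotated_results.json` holds pairs of a plain sentence (`label`) and its annotated form (`anotated`), and `anotated_tokens.json` holds the tokens that are later registered as special tokens. Beyond the normalization in `normalize_tupi` (apostrophes become `ç` and runs of whitespace collapse to a single space), no further processing is required.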

In [35]:
# raw string avoids the invalid-escape warning for '\s'
normalize_tupi = lambda x: re.sub(r'\s+', ' ', x.replace("'","ç"))
with open('../nhe-enga/anotated_results.json', 'r') as fp:
    text = [{"input":normalize_tupi(x['label']), 'output':normalize_tupi(x['anotated'])} for x in json.load(fp)]
with open('../nhe-enga/anotated_tokens.json', 'r') as fp:
    tokens = [normalize_tupi(x) for x in json.load(fp) if x]
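
To make the effect of `normalize_tupi` concrete, here is a small standalone sketch of the same normalization applied to a made-up string. The sample input is an assumption chosen for illustration and is not taken from the data files.

```python
import re

def normalize_tupi(text):
    """Replace apostrophes with 'ç' and collapse whitespace runs to one space."""
    return re.sub(r'\s+', ' ', text.replace("'", "ç"))

# hypothetical input with an apostrophe and irregular spacing
sample = "xe   n'oroeî"
normalized = normalize_tupi(sample)
print(normalized)  # xe nçoroeî
```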

In [36]:
text[:5]

[{'input': 'xe oroeî',
  'output': 'xe[SUBJECT:1ps] oro[OBJECT:2ps:SUBJECT_1P]eî[VERB]'},
 {'input': 'xe nçoroeî',
  'output': 'xe[SUBJECT:1ps] nç[NEGATION_PREFIX]oro[OBJECT:2ps:SUBJECT_1P]e[VERB]î[NEGATION_SUFFIX:VOWEL_ENDING]'},
 {'input': 'xe opoeî',
  'output': 'xe[SUBJECT:1ps] opo[OBJECT:2pp:SUBJECT_1P]eî[VERB]'},
 {'input': 'xe nçopoeî',
  'output': 'xe[SUBJECT:1ps] nç[NEGATION_PREFIX]opo[OBJECT:2pp:SUBJECT_1P]e[VERB]î[NEGATION_SUFFIX:VOWEL_ENDING]'},
 {'input': 'ixé açe aîoseî',
  'output': 'ixé[SUBJECT:1ps] açe[OBJECT:3p] a[SUBJECT_PREFIX:1ps]îos[OBJECT_MARKER:3p:PLURIFORM_PREFIX:MONOSYLLABIC]eî[VERB]'}]

In [37]:
def get_training_corpus():
    dataset = text
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield [x["output"] for x in samples]

special_tokens_dict = {'additional_special_tokens': tokens, 'pad_token': '[PAD]'}
num_added_toks = old_tokenizer.add_special_tokens(special_tokens_dict)
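
`get_training_corpus` is a generator so that the tokenizer trainer in the next cell can consume the corpus one batch at a time. The batching pattern on its own can be sketched as follows; `iter_batches` and the toy records are hypothetical names and data used only for illustration.

```python
def iter_batches(records, batch_size=1000, key="output"):
    """Yield lists of at most `batch_size` strings taken from `records[i][key]`."""
    for start in range(0, len(records), batch_size):
        yield [r[key] for r in records[start:start + batch_size]]

# toy stand-in for `text`: 2500 records shaped like the real ones
toy = [{"input": f"in {i}", "output": f"out {i}"} for i in range(2500)]
batch_sizes = [len(batch) for batch in iter_batches(toy)]
print(batch_sizes)  # [1000, 1000, 500]
```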

In [38]:
tokenizer = old_tokenizer.train_new_from_iterator(get_training_corpus(), 7000)






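Next, we resize the model's embedding matrix to the new vocabulary and tokenize the plain `input` sentences, padding or truncating each one to `MAX_INPUT_LENGTH` tokens.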

In [41]:
MAX_INPUT_LENGTH=20
# special_tokens_dict = {'additional_special_tokens': tokens}
# num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))
inputs = tokenizer([x['input'] for x in text], return_tensors='pt', max_length=MAX_INPUT_LENGTH, truncation=True, padding='max_length')


In [42]:
inputs['input_ids'][0]


tensor([ 18, 318,  16,   1,   7, 106, 106, 106, 106, 106, 106, 106, 106, 106,
        106, 106, 106, 106, 106, 106])
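
The tensor above shows what `padding='max_length'` does: the five non-padding ids are followed by the padding id (*106* here) until the row reaches `MAX_INPUT_LENGTH`. A plain-Python sketch of that padding and truncation logic follows; `pad_or_truncate` is a hypothetical helper and not part of the tokenizer API.

```python
def pad_or_truncate(ids, max_length, pad_id):
    """Cut `ids` to `max_length`, then right-pad with `pad_id` up to `max_length`."""
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

# the non-padding ids of the first example printed above
row = pad_or_truncate([18, 318, 16, 1, 7], max_length=20, pad_id=106)
print(row)
```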

Then we create our *labels* tensor by tokenizing the annotated `output` sentences with the same length settings.

In [43]:
inputs['labels'] = tokenizer([x['output'] for x in text], return_tensors='pt', max_length=MAX_INPUT_LENGTH, truncation=True, padding='max_length').input_ids.detach().clone()

In [44]:
inputs['labels'][0]

tensor([ 18,  22, 318,  16,  93,   1,   7,  42, 106, 106, 106, 106, 106, 106,
        106, 106, 106, 106, 106, 106])
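
Note that the labels are padded with the same id (*106*) as the inputs, so padded positions also contribute to the loss. An optional refinement, not applied in this notebook, is to replace padding ids in the labels with *-100*, the index that PyTorch's cross-entropy loss ignores by default. A sketch on plain lists, with `ignore_padding` as a hypothetical helper:

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def ignore_padding(label_rows, pad_id):
    """Return a copy of `label_rows` with every `pad_id` replaced by IGNORE_INDEX."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in row]
            for row in label_rows]

# the first label row printed above: 8 real ids followed by 12 padding ids
labels = [[18, 22, 318, 16, 93, 1, 7, 42] + [106] * 12]
cleaned = ignore_padding(labels, pad_id=106)
print(cleaned[0][:10])
```

On the actual tensor, the equivalent would be along the lines of `labels[labels == pad_id] = -100`.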

In [45]:
inputs.keys()

dict_keys(['input_ids', 'attention_mask', 'labels'])

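In the standard MLM recipe we would now mask tokens in the *input_ids* tensor with a 15% probability, skipping *CLS*, *SEP* and *PAD* tokens. In this notebook the labels are the annotated sentences rather than copies of the inputs, and the masking cells below are commented out. To enable them, the ids they test against (*2*, *3* and *0*) would need to be replaced with this tokenizer's special-token ids; the padding id, for example, is *106* in the outputs above.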

In [46]:
# # create random array of floats with equal dimensions to input_ids tensor
# rand = torch.rand(inputs.input_ids.shape)
# # create mask array
# mask_arr = (rand < 0.15) * (inputs.input_ids != 2) * \
#            (inputs.input_ids != 3) * (inputs.input_ids != 0)

In [47]:
# mask_arr

And now we take the indices of each `True` value within each individual vector.

In [48]:
# selection = []

# for i in range(inputs.input_ids.shape[0]):
#     selection.append(
#         torch.flatten(mask_arr[i].nonzero()).tolist()
#     )

In [49]:
# selection[:5]

Then apply these indices to each respective row in *input_ids*, setting the values at these indices to the mask token id (*4* in the commented-out code below).

In [50]:
# for i in range(inputs.input_ids.shape[0]):
#     inputs.input_ids[i, selection[i]] = 4

In [51]:
# inputs.input_ids

With these cells enabled, the mask token id would appear in the same positions as the *True* values in the `mask_arr` tensor.
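
The three commented-out steps (draw random numbers, exclude special tokens, overwrite the selected positions) can be sketched without tensors as follows. This is a minimal illustration on plain lists: `mask_tokens` is a hypothetical helper, and the ids *4* (mask) and *0*, *2*, *3* (special tokens) are copied from the commented-out cells, so they are assumptions that may not match the tokenizer trained above.

```python
import random

def mask_tokens(rows, mask_id, special_ids, prob=0.15, seed=None):
    """Return (masked_rows, selection): a masked copy of `rows` and the chosen indices per row."""
    rng = random.Random(seed)
    masked_rows, selection = [], []
    for row in rows:
        # pick each non-special position independently with probability `prob`
        picked = [i for i, tok in enumerate(row)
                  if tok not in special_ids and rng.random() < prob]
        new_row = list(row)  # copy so the original row is left untouched
        for i in picked:
            new_row[i] = mask_id
        masked_rows.append(new_row)
        selection.append(picked)
    return masked_rows, selection

# toy rows; ids 2, 3 and 0 play the role of special tokens (assumed values)
rows = [[2, 18, 318, 16, 3, 0, 0], [2, 7, 42, 93, 22, 3, 0]]
masked, selection = mask_tokens(rows, mask_id=4, special_ids={0, 2, 3}, prob=0.15, seed=0)
print(selection)
print(masked)
```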

The `inputs` tensors are now ready, and we can begin setting them up to be fed into our model during training. We create a PyTorch dataset from our data.

In [52]:
class MeditationsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __getitem__(self, idx):
        # the encodings already hold tensors, so copy them instead of calling torch.tensor()
        return {key: val[idx].clone().detach() for key, val in self.encodings.items()}
    def __len__(self):
        return len(self.encodings.input_ids)
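
The dataset class stores the encodings column-wise (one tensor per key) and returns a single row as a dictionary when indexed. The same indexing logic on plain lists can be sketched as follows, with `ColumnDataset` as a hypothetical stand-in that avoids the PyTorch dependency and toy values in place of real encodings.

```python
class ColumnDataset:
    """Column-wise storage, row-wise access: dataset[i] -> {key: column[i]}."""
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: column[idx] for key, column in self.encodings.items()}

    def __len__(self):
        return len(self.encodings["input_ids"])

# toy encodings with two examples of length two
toy = ColumnDataset({
    "input_ids": [[18, 318], [18, 7]],
    "attention_mask": [[1, 1], [1, 1]],
    "labels": [[18, 22], [18, 42]],
})
print(len(toy), toy[1])
```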

Initialize our data using the `MeditationsDataset` class.

In [53]:
dataset = MeditationsDataset(inputs)

And initialize the dataloader, which we'll be using to load our data into the model during training.

In [54]:
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)

Now we can move on to setting up the training loop. First we select the device to train on.

In [55]:
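# use Apple's MPS backend when available (as in this run), otherwise fall back to CUDA or CPU
device = torch.device('mps' if torch.backends.mps.is_available() else 'cuda' if torch.cuda.is_available() else 'cpu')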
# and move our model over to the selected device
model.to(device)

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(5247, 768)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-5): 6 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      

Activate the training mode of our model, and initialize our optimizer (Adam with decoupled weight decay, a form of regularization that reduces the chance of overfitting).

In [56]:
from torch.optim import AdamW  # the AdamW class in transformers is deprecated in favor of this one

# activate training mode
model.train()
# initialize optimizer
optim = AdamW(model.parameters(), lr=5e-5)
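
To show what "weight decay" means in AdamW, here is a single update for one scalar parameter. It is a sketch of the standard decoupled-weight-decay update, not code from the library. The values for `beta1`, `beta2`, `eps` and `weight_decay` are the defaults of `torch.optim.AdamW` and are assumptions here, since the notebook only sets `lr`.

```python
import math

def adamw_step(param, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar; returns (new_param, new_m, new_v)."""
    # decoupled weight decay: shrink the parameter directly, independent of the gradient
    param = param * (1 - lr * weight_decay)
    # exponential moving averages of the gradient and the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # bias correction for the zero-initialized averages (t counts steps from 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# with a zero gradient only the decay term acts on the parameter
decayed, m0, v0 = adamw_step(1.0, grad=0.0, m=0.0, v=0.0, t=1)
# with a gradient of 1 the first step additionally moves by roughly lr
stepped, m1, v1 = adamw_step(1.0, grad=1.0, m=0.0, v=0.0, t=1)
print(decayed, stepped)
```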



Now we can move on to the training loop. We'll train for two epochs (change `epochs` to modify this).

In [57]:
from tqdm import tqdm  # for our progress bar

epochs = 2

for epoch in range(epochs):
    # setup loop with TQDM and dataloader
    loop = tqdm(loader, leave=True)
    for batch in loop:
        # reset gradients accumulated in the previous step
        optim.zero_grad()
        # pull all tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # process
        outputs = model(input_ids, attention_mask=attention_mask,
                        labels=labels)
        # extract loss
        loss = outputs.loss
        # backpropagate: compute gradients of the loss for every parameter that requires grad
        loss.backward()
        # update parameters
        optim.step()
        # print relevant info to progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

Epoch 0:  59%|█████▌    | 8601/14469 [07:49<05:19, 18.34it/s, loss=0.0192]


KeyboardInterrupt: 

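The run above was interrupted partway through the first epoch (at 8601 of 14469 batches, with a batch loss of about 0.019), so the model is only partially fine-tuned on the annotated Tupi data. Below we feed it a sample sentence and decode the most likely token at every position.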

In [65]:
# outputs differ between iterations, which suggests dropout is still active
# (no model.eval() call is shown after model.train())
for i in range(5):
    # normalize and lowercase a sample sentence
    sentence = normalize_tupi("Ixé îagûara Aîuká").lower()
    print(sentence)
    # tokenize it the same way as the training inputs
    inputs = tokenizer(sentence, return_tensors='pt', max_length=MAX_INPUT_LENGTH, truncation=True, padding='max_length')
    # print(inputs)
    # # alternative test with an explicit [MASK] token
    # sentence = normalize_tupi("..[MASK] endé ruba. ixé nde ra'yra").replace('[mask]', '[MASK]')
    # inputs = tokenizer(sentence, return_tensors='pt')
    # print(inputs)
    inputs.to(device)
    # get logits of the prediction
    logits = model(**inputs).logits
    # most likely token at position 3 only (kept for inspection)
    predicted_index = torch.argmax(logits[0, 3]).item()
    pred = tokenizer.convert_ids_to_tokens([predicted_index])[0]
    # print(pred)
    # now do that for every position
    predicted_indices = torch.argmax(logits, dim=2)
    toks = tokenizer.convert_ids_to_tokens(predicted_indices[0])
    # print(toks)
    # join the tokens and undo the apostrophe -> ç normalization
    print(" ".join(toks).replace(' ##', '').replace(' \' ', "'").replace('ç', "'"))

ixé îagûara aîuká
ixé [SUBJECT:1ps] Ġ îa [SUBJECT_PREFIX:1ppi] g û a r a [VERB] a î î [VERB] [PAD] [PAD] [PAD] [PAD] [PAD]
ixé îagûara aîuká
ixé [SUBJECT:1ps] Ġ îa [GERUND_SUBJECT_PREFIX:1ps] g û a r a a a î î [VERB] [PAD] [PAD] [PAD] [PAD] [PAD]
ixé îagûara aîuká
ixé [SUBJECT:1ps] Ġ îa [GERUND_SUBJECT_PREFIX:1ps] g û a r a [VERB] a î î [VERB] [PAD] [PAD] [PAD] [PAD] [PAD]
ixé îagûara aîuká
ixé [SUBJECT:1ps] Ġ îa [GERUND_SUBJECT_PREFIX:1ps] g û a r a a a î ukÃ¡ [VERB] [PAD] [PAD] [PAD] [PAD] [PAD]
ixé îagûara aîuká
ixé [SUBJECT:1ps] Ġ îa [GERUND_SUBJECT_PREFIX:1ps] g û a r a [VERB] a î î [VERB] [PAD] [PAD] [PAD] [PAD] [PAD]
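
The decoding step takes, for every position, the vocabulary index with the highest logit and maps it back to a token. A plain-list sketch of that step follows, using a made-up four-token vocabulary and made-up logits (both are assumptions for illustration only, and `greedy_decode` is a hypothetical helper).

```python
def greedy_decode(logits, id_to_token):
    """For each position, return the token whose logit is largest."""
    return [id_to_token[max(range(len(row)), key=row.__getitem__)]
            for row in logits]

# hypothetical vocabulary: list index = token id
vocab = ["ixé", "[SUBJECT:1ps]", "[VERB]", "[PAD]"]
# one row of logits per position, one column per vocabulary entry
logits = [
    [2.0, 0.1, -1.0, 0.0],
    [0.3, 1.7, 0.2, -0.5],
    [-0.2, 0.0, 0.9, 0.4],
    [0.0, 0.1, 0.2, 3.0],
]
decoded = greedy_decode(logits, vocab)
print(" ".join(decoded))  # ixé [SUBJECT:1ps] [VERB] [PAD]
```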

