# In Code
We know how fine-tuning with NSP and MLM works, but how exactly do we apply that in code?

Well, we can start by importing transformers, PyTorch, and our training data — Meditations (find a copy of the training data here).

In [2]:
!pip install -U "transformers[torch]" accelerate



In [3]:
from transformers import BertTokenizer, BertForPreTraining
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForPreTraining.from_pretrained('bert-base-uncased')

with open('/content/drive/MyDrive/data.txt', 'r') as fp:
    text = fp.read().split('\n')



Now we have a list of paragraphs in text. Some, but not all, contain multiple sentences, which is what we need when building our NSP training data.

# Preparing For NSP

To prepare our data for NSP, we need to create a mix of non-random sentence pairs (where the two sentences were originally adjacent) and random sentence pairs.

For this, we’ll create a bag of sentences extracted from text, from which we can randomly select a sentence when creating a random NotNextSentence pair.

In [4]:
bag = [sentence for para in text for sentence in para.split('.') if sentence != '']
bag_size = len(bag)
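
Splitting on '.' leaves a leading space on most sentences and keeps whitespace-only fragments, such as the text after a final period that is followed by a space. A minimal sketch of a slightly stricter split, assuming we are happy to strip surrounding whitespace; `split_sentences` and the toy `paragraphs` list are hypothetical names, not part of the notebook:

```python
def split_sentences(paragraph):
    """Split a paragraph on '.' and drop empty or whitespace-only pieces."""
    return [piece.strip() for piece in paragraph.split('.') if piece.strip()]


# toy stand-in for the `text` list loaded from the file
paragraphs = [
    'First sentence. Second sentence. ',
    'One sentence only.',
]

toy_bag = [sentence for para in paragraphs for sentence in split_sentences(para)]
print(toy_bag)  # ['First sentence', 'Second sentence', 'One sentence only']
```

The BERT tokenizer ignores leading whitespace anyway, so this mainly tidies the printed examples and avoids blank "sentences" in the bag.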

In [5]:
text[:14]

['From my grandfather Verus I learned good morals and the government of my temper.',
 'From the reputation and remembrance of my father, modesty and a manly character.',
 'From my mother, piety and beneficence, and abstinence, not only from evil deeds, but even from evil thoughts; and further, simplicity in my way of living, far removed from the habits of the rich.',
 'From my great-grandfather, not to have frequented public schools, and to have had good teachers at home, and to know that on such things a man should spend liberally.',
 "From my governor, to be neither of the green nor of the blue party at the games in the Circus, nor a partizan either of the Parmularius or the Scutarius at the gladiators' fights; from him too I learned endurance of labour, and to want little, and to work with my own hands, and not to meddle with other people's affairs, and not to be ready to listen to slander.",
 'From Diognetus, not to busy myself about trifling things, and not to give credit to what 

In [6]:
text[9:14]

['From Alexander the grammarian, to refrain from fault-finding, and not in a reproachful way to chide those who uttered any barbarous or solecistic or strange-sounding expression; but dexterously to introduce the very expression which ought to have been used, and in the way of answer or giving confirmation, or joining in an inquiry about the thing itself, not about the word, or by some other fit suggestion.',
 'From Fronto I learned to observe what envy, and duplicity, and hypocrisy are in a tyrant, and that generally those among us who are called Patricians are rather deficient in paternal affection.',
 'From Alexander the Platonic, not frequently nor without necessity to say to any one, or to write in a letter, that I have no leisure; nor continually to excuse the neglect of duties required by our relation to those with whom we live, by alleging urgent occupations.',
 'From Catulus, not to be indifferent when a friend finds fault, even if he should find fault without reason, but to t

After creating our bag we can go ahead and create our 50/50 random/non-random NSP training data. For this, we will create a list of sentence As, sentence Bs, and their respective IsNextSentence or NotNextSentence labels.



In [7]:
import random

sentence_a = []
sentence_b = []
label = []

for paragraph in text:
    sentences = [
        sentence for sentence in paragraph.split('.') if sentence != ''
    ]
    num_sentences = len(sentences)
    if num_sentences > 1:
        start = random.randint(0, num_sentences-2)
        # 50/50 whether the pair is IsNextSentence or NotNextSentence
        if random.random() >= 0.5:
            # this is IsNextSentence
            sentence_a.append(sentences[start])
            sentence_b.append(sentences[start+1])
            label.append(0)
        else:
            index = random.randint(0, bag_size-1)
            # this is NotNextSentence
            sentence_a.append(sentences[start])
            sentence_b.append(bag[index])
            label.append(1)


Let's see what we have:



In [8]:
for i in range(3):
  print(sentence_a[i])
  print(sentence_b[i])
  print(label[i])

 I observed that everybody believed that he thought as he spoke, and that in all that he did he never had any bad intention; and he never showed amazement and surprise, and was never in a hurry, and never put off doing a thing, nor was perplexed nor dejected, nor did he ever laugh to disguise his vexation, nor, on the other hand, was he ever passionate or suspicious
 Again, remove to the times of Trajan
1
 He was most ready to give way without envy to those who possessed any particular faculty, such as that of eloquence or knowledge of the law or of morals, or of anything else; and he gave them his help, that each might enjoy reputation according to his deserts; and he always acted conformably to the institutions of his country, without showing any affectation of doing so
 Strive to continue to be such as philosophy wished to make thee
1
 Further, I owe it to the gods that I was not hurried into any offence against any of them, though I had a disposition which, if opportunity had offer

We can see in the console output that we have label 1 representing random sentences (NotNextSentence) and 0 representing non-random sentences (IsNextSentence).
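
The pairing logic can also be wrapped in a function and checked on toy data. The sketch below is one possible variant, not the notebook's code: it takes a seeded `random.Random` for reproducibility and resamples when the random pick happens to be the true next sentence, which the loop above does not guard against. `make_nsp_pairs` and the toy paragraphs are hypothetical.

```python
import random


def make_nsp_pairs(paragraphs, bag, rng):
    """Return (sentence_a, sentence_b, label); 0 = IsNextSentence, 1 = NotNextSentence."""
    sentence_a, sentence_b, label = [], [], []
    for paragraph in paragraphs:
        sentences = [s.strip() for s in paragraph.split('.') if s.strip()]
        if len(sentences) < 2:
            continue  # a single sentence cannot form a pair
        start = rng.randint(0, len(sentences) - 2)
        sentence_a.append(sentences[start])
        if rng.random() >= 0.5:
            sentence_b.append(sentences[start + 1])
            label.append(0)
        else:
            # assumes the bag holds at least one sentence other than the true next one
            candidate = rng.choice(bag)
            while candidate == sentences[start + 1]:
                candidate = rng.choice(bag)
            sentence_b.append(candidate)
            label.append(1)
    return sentence_a, sentence_b, label


toy_paragraphs = ['A one. A two. A three.', 'B one. B two.', 'Lonely.']
toy_bag = [s.strip() for p in toy_paragraphs for s in p.split('.') if s.strip()]

a, b, y = make_nsp_pairs(toy_paragraphs, toy_bag, random.Random(0))
for pair in zip(a, b, y):
    print(pair)
```

On a large bag the chance of drawing the true next sentence is small, so the simpler loop is usually fine; the resampling just removes that source of label noise.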



# Tokenization
We can now tokenize our data. As is typical with BERT models, we truncate/pad our sequences to a length of 512 tokens.



In [9]:
inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt',
                   max_length=512, truncation=True, padding='max_length')

In [10]:
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [11]:
inputs

{'input_ids': tensor([[  101,  1045,  5159,  ...,     0,     0,     0],
        [  101,  2002,  2001,  ...,     0,     0,     0],
        [  101,  2582,  1010,  ...,     0,     0,     0],
        ...,
        [  101,  3459,  2185,  ...,     0,     0,     0],
        [  101,  2043, 15223,  ...,     0,     0,     0],
        [  101,  7887,  3288,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

There are a few things we should take note of here. First, because we tokenized two sentences, the tokenizer automatically assigned 0 to the sentence A tokens and 1 to the sentence B tokens in the token_type_ids tensor. The trailing zeros correspond to the padding tokens. (The 1 values are hidden behind the ellipsis in the printed output, which only shows the first and last few columns.)

Secondly, in the input_ids tensor, the tokenizer automatically placed a SEP token (102) between the two sentences, marking the boundary between them.


BERT needs to see both of these when performing NSP.
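
To make the layout concrete, here is a plain-Python sketch of how a sentence pair is arranged, assuming both sentences are already converted to token IDs and the pair fits within `max_length`. `encode_pair` is a hypothetical helper written for illustration (the real work is done by the tokenizer call above), and the word IDs are placeholders.

```python
CLS, SEP, PAD = 101, 102, 0  # special token IDs used by bert-base-uncased


def encode_pair(ids_a, ids_b, max_length):
    """Lay out two already-tokenized sentences as one padded BERT input."""
    input_ids = [CLS] + ids_a + [SEP] + ids_b + [SEP]
    if len(input_ids) > max_length:
        raise ValueError('pair is longer than max_length; truncation is not sketched here')
    # segment 0: [CLS], sentence A, first [SEP]; segment 1: sentence B, final [SEP]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return {
        'input_ids': input_ids + [PAD] * n_pad,
        'token_type_ids': token_type_ids + [0] * n_pad,
        'attention_mask': attention_mask + [0] * n_pad,
    }


encoded = encode_pair([1045, 5159], [2002, 2001], max_length=10)
for key, values in encoded.items():
    print(key, values)
# input_ids [101, 1045, 5159, 102, 2002, 2001, 102, 0, 0, 0]
# token_type_ids [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
# attention_mask [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

Note that padding positions get 0 in token_type_ids as well, which is why they cannot be told apart from sentence A without the attention_mask.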






# NSP Labels
Our NSP labels must be placed within a tensor called `next_sentence_label`. We create this by wrapping our label list in a `torch.LongTensor` and transposing it with `.T`, which turns the single row of shape (1, N) into a column of shape (N, 1) so that each sample gets its own label:

In [12]:
inputs['next_sentence_label'] = torch.LongTensor([label]).T

In [13]:
inputs.next_sentence_label[:10]

tensor([[1],
        [1],
        [1],
        [0],
        [0],
        [1],
        [0],
        [1],
        [0],
        [0]])

# Masking For MLM
For MLM we need to clone our current input_ids tensor to create an MLM labels tensor. Then we move on to masking ~15% of the tokens in the input_ids tensor.

In [14]:
inputs['labels'] = inputs.input_ids.detach().clone()

In [15]:
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'next_sentence_label', 'labels'])

Now that we have a clone for our labels, we mask tokens in input_ids.


In [16]:
# create random array of floats with equal dimensions to input_ids tensor
rand = torch.rand(inputs.input_ids.shape)
# create mask array
mask_arr = (rand < 0.15) * (inputs.input_ids != 101) * \
           (inputs.input_ids != 102) * (inputs.input_ids != 0)


Next, we take the indices of each True value within each row.



In [17]:
selection = []

for i in range(inputs.input_ids.shape[0]):
    selection.append(
        torch.flatten(mask_arr[i].nonzero()).tolist()
    )


In [18]:
selection[:2]

[[13, 19, 49, 53, 61, 73, 75, 86], [2, 32, 37, 39, 44, 63, 65, 67, 71, 76, 85]]


Then apply these indices to each row in input_ids, assigning each value at these indices a value of 103.

In [19]:
for i in range(inputs.input_ids.shape[0]):
    inputs.input_ids[i, selection[i]] = 103

In [20]:
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'next_sentence_label', 'labels'])

In [21]:
inputs.input_ids


tensor([[  101,  1045,  5159,  ...,     0,     0,     0],
        [  101,  2002,   103,  ...,     0,     0,     0],
        [  101,  2582,  1010,  ...,     0,     0,     0],
        ...,
        [  101,  3459,  2185,  ...,     0,     0,     0],
        [  101,  2043, 15223,  ...,     0,     0,     0],
        [  101,  7887,  3288,  ...,     0,     0,     0]])

Note the extra conditions we added when creating mask_arr: they ensure that we never mask special tokens such as CLS (101), SEP (102), and PAD (0).
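
The notebook's labels tensor keeps every original token, so the MLM loss is computed at every position, padding included. A common variation, sketched below in plain Python for a single sequence, keeps the original ID only where a token was masked and writes -100 elsewhere, a value that PyTorch's cross-entropy loss ignores by default. `mask_sequence`, its `prob` argument, and the toy IDs are hypothetical; this illustrates the selection rule and is not the notebook's tensor code.

```python
import random

CLS, SEP, PAD, MASK = 101, 102, 0, 103  # bert-base-uncased special token IDs
IGNORE = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_sequence(input_ids, rng, prob=0.15):
    """Return (masked_ids, labels) for one sequence of token IDs."""
    masked_ids, labels = [], []
    for token in input_ids:
        is_special = token in (CLS, SEP, PAD)
        if not is_special and rng.random() < prob:
            masked_ids.append(MASK)
            labels.append(token)   # the model must recover the original token
        else:
            masked_ids.append(token)
            labels.append(IGNORE)  # position does not contribute to the MLM loss
    return masked_ids, labels


ids = [101, 1045, 5159, 2008, 102, 2002, 2001, 102, 0, 0]
# prob is raised to 0.5 here only so that a short toy sequence is likely to get masks
masked, labels = mask_sequence(ids, random.Random(0), prob=0.5)
print(masked)
print(labels)
```

With labels built this way, the reported loss reflects only how well the model fills in masked positions, so it is not directly comparable to the loss values produced by the full-clone labels used in this walkthrough.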


# Dataloader
All of our input and label tensors are ready. All we need to do now is wrap them in a PyTorch Dataset object so that they can be loaded into a PyTorch DataLoader, which will feed batches of data into our model during training.



In [22]:
class OurDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __getitem__(self, idx):
        # values are already tensors, so copy them instead of re-wrapping with torch.tensor()
        return {key: val[idx].clone().detach() for key, val in self.encodings.items()}
    def __len__(self):
        return len(self.encodings.input_ids)


In [23]:
dataset = OurDataset(inputs)


In [24]:
loader = torch.utils.data.DataLoader(dataset, batch_size=2, shuffle=True)

The dataloader expects the `__len__` method for checking the total number of samples within our dataset, and the `__getitem__` method for extracting samples.
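
A rough plain-Python picture of that protocol, assuming list-backed data instead of tensors: `ToyDataset` mirrors the two methods of OurDataset, and `iterate_batches` is a hypothetical stand-in for what a non-shuffling loader does before it collates samples into tensors.

```python
class ToyDataset:
    """Same interface as OurDataset, but backed by plain lists."""

    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: values[idx] for key, values in self.encodings.items()}

    def __len__(self):
        return len(self.encodings['input_ids'])


def iterate_batches(dataset, batch_size):
    """Yield lists of samples using only len() and indexing."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


toy = ToyDataset({
    'input_ids': [[101, 7, 102], [101, 8, 102], [101, 9, 102]],
    'next_sentence_label': [[0], [1], [0]],
})
batches = list(iterate_batches(toy, batch_size=2))
print(len(toy), [len(batch) for batch in batches])  # 3 [2, 1]
```

The last batch is smaller because three samples do not divide evenly by a batch size of two; the real DataLoader behaves the same way unless it is told to drop the remainder.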

# Setup For Training
The last step before moving on to our training loop is preparing our model training setup.

We first check whether a GPU is available and, if so, move the model over to it for training. Then we put the model into training mode and initialize an Adam optimizer with weight decay (AdamW).



In [25]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# and move our model over to the selected device
model.to(device)

BertForPreTraining(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elemen

In [26]:
from torch.optim import AdamW  # transformers' own AdamW class is deprecated

# activate training mode
model.train()
# initialize optimizer (torch's AdamW applies weight_decay=0.01 by default)
optim = AdamW(model.parameters(), lr=5e-6)



# Training

Finally, we’re onto training our model. We train for two epochs, and use tqdm to create a progress bar for our training loop.

In [27]:
from tqdm import tqdm  # for our progress bar

epochs = 2

for epoch in range(epochs):
    # setup loop with TQDM and dataloader
    loop = tqdm(loader, leave=True)
    for batch in loop:
        # reset gradients accumulated in the previous step
        optim.zero_grad()
        # pull all tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        next_sentence_label = batch['next_sentence_label'].to(device)
        labels = batch['labels'].to(device)
        # process
        outputs = model(input_ids, attention_mask=attention_mask,
                        token_type_ids=token_type_ids,
                        next_sentence_label=next_sentence_label,
                        labels=labels)
        # extract loss
        loss = outputs.loss
        # backpropagate: compute the gradient of the loss for every trainable parameter
        loss.backward()
        # update parameters
        optim.step()
        # print relevant info to progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())


Epoch 0: 100%|██████████| 159/159 [37:34<00:00, 14.18s/it, loss=6.52]
Epoch 1: 100%|██████████| 159/159 [37:58<00:00, 14.33s/it, loss=2.02]


Within the loop we:

- Reset the gradients, so that we do not accumulate on top of the gradients calculated in the previous step.
- Move all batch tensors to the selected device (GPU or CPU).
- Feed everything into the model and extract the loss.
- Use loss.backward() to calculate the gradient of the loss with respect to each parameter.
- Update the parameter weights based on the calculated gradients.
- Print relevant information to the progress bar (loop).
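
The same sequence of steps can be seen in a toy one-parameter model written in plain Python. This is only an illustration under simplifying assumptions: it uses plain gradient descent rather than AdamW, a hand-derived gradient rather than autograd, and hypothetical names (`train_step`, `weight`).

```python
def train_step(weight, x, target, lr):
    """One update for the model y = weight * x with a squared-error loss."""
    grad = 0.0                               # reset the gradient (optim.zero_grad())
    prediction = weight * x                  # forward pass
    loss = (prediction - target) ** 2        # extract the loss
    grad += 2 * (prediction - target) * x    # d(loss)/d(weight) (loss.backward())
    weight -= lr * grad                      # update the parameter (optim.step())
    return weight, loss


weight = 0.0
losses = []
for _ in range(20):
    weight, loss = train_step(weight, x=2.0, target=4.0, lr=0.05)
    losses.append(loss)

print(round(weight, 3), losses[0], losses[-1])
```

The weight approaches 2.0 (since 2.0 * 2.0 matches the target of 4.0) and the loss shrinks at every step, which is the behavior the progress bar above reports for the real model on a much noisier scale.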

And that’s it! With that, we’ve fine-tuned our model using both MLM and NSP.

