In [1]:
# NSP Training Logic

Next sentence prediction (NSP) is the other half of BERT pretraining. It consists of taking two sentences, A and B, and classifying whether sentence B comes directly after sentence A.

So, where MLM encourages BERT to build a contextual understanding of the relationships between words, NSP encourages BERT to learn longer-term contextual relationships between sentences.

Let's take a look at how this works in code. First, we import and initialize everything we need.

In [2]:
from transformers import BertTokenizer, BertForNextSentencePrediction
import torch


In [3]:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

text = ("After Abraham Lincoln won the November 1860 presidential election on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy. War broke out in "
        "April 1861 when secessionist forces attacked Fort Sumter in South "
        "Carolina, just over a month after Lincoln's inauguration.")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)
outputs.keys()


odict_keys(['logits'])

In [5]:
outputs.logits

tensor([[ 4.4646, -3.6635]], grad_fn=<AddmmBackward0>)

In [7]:
#Then we apply softmax to convert these logits into a probability distribution.

In [8]:
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
probs

tensor([[9.9970e-01, 2.9502e-04]], grad_fn=<SoftmaxBackward0>)
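To make the softmax step concrete, here is a minimal plain-Python sketch that recomputes the probabilities from the two logits printed earlier. The helper name `softmax` is our own, and the logits are copied by hand from the output above, so the result should land close to the tensor shown rather than match it digit for digit.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtracting the max improves numerical stability without changing the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the `outputs.logits` cell above.
probs = softmax([4.4646, -3.6635])
print(probs)
```

Almost all of the probability mass sits on index 0 because the gap between the two logits is roughly 8, and softmax depends only on that gap.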

In [9]:

#And finally, take the argmax to get our prediction:

In [10]:

torch.argmax(probs)


tensor(0)

We are getting 0, which is IsNextSentence - however, we haven't actually specified two sentences - so this prediction is meaningless.
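As a small illustration of the label convention, the sketch below maps a predicted class index to its name. The dictionary and the `describe_prediction` helper are hypothetical names introduced here; only the 0/1 meaning comes from the text.

```python
# Label convention for NSP: index 0 means sentence B follows sentence A,
# index 1 means it does not.
NSP_LABELS = {0: "IsNextSentence", 1: "NotNextSentence"}

def describe_prediction(class_index):
    """Return the readable name for a predicted class index."""
    return NSP_LABELS[int(class_index)]

print(describe_prediction(0))  # IsNextSentence
```

Because the helper calls `int(...)`, it should also accept a zero-dimensional tensor such as the one returned by `torch.argmax(probs)`.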

There are two parts we're missing. We need to specify two sentences in our input_ids tensor - and we need to create the labels tensor too.

Let's start by splitting the two sentences.

In [11]:
text = ("After Abraham Lincoln won the November 1860 presidential election on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy.")
text2 = ("War broke out in April 1861 when secessionist forces attacked Fort "
         "Sumter in South Carolina, just over a month after Lincoln's "
         "inauguration.")

In [12]:
# Tokenize

In [13]:
inputs = tokenizer(text, text2, return_tensors='pt')

In [14]:
inputs


{'input_ids': tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,  2602,
          2006,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,   102,  2162,  3631,  2041,  1999,  2258,  6863,
          2043, 22965,  2923,  2749,  4457,  3481,  7680,  3334,  1999,  2148,
          3792,  1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055,
         17331,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Because we passed the two sentences to the tokenizer separately, it handles them a little differently. First, we can see a SEP token (token id 102) inside the input_ids tensor, which marks the boundary between sentence A and sentence B; a second SEP closes sentence B.

Second, we can distinguish the sentences using token_type_ids: sentence A tokens are assigned 0, whereas sentence B tokens are assigned 1.
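The relationship between the SEP token and token_type_ids can be sketched in plain Python. This is only an illustration of the pattern visible in the output above, not the tokenizer's actual code; `segment_ids` is a hypothetical helper and it ignores padding.

```python
SEP_ID = 102  # id of [SEP] in bert-base-uncased, as seen in the output above

def segment_ids(input_ids, sep_id=SEP_ID):
    """Return 0 for tokens up to and including the first [SEP], 1 afterwards."""
    first_sep = input_ids.index(sep_id)
    return [0 if i <= first_sep else 1 for i in range(len(input_ids))]

# Toy pair: [CLS] a a [SEP] b b [SEP]
toy_ids = [101, 2044, 8181, 102, 2162, 3631, 102]
print(segment_ids(toy_ids))  # [0, 0, 0, 0, 1, 1, 1]
```

Note that the first SEP counts as part of sentence A, which matches the 34 zeros at the start of token_type_ids in the real output.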

Now, we still need to add our labels tensor - but how should it be formatted? Well, we use a torch.LongTensor format, and it must contain a single value [0] if sentence B is the next sentence, else it should be [1]. Here, sentence B is the next sentence so we set it to [0].

In [15]:
labels = torch.LongTensor([0])
labels

tensor([0])

In [16]:
#Now we process everything as we did before, including our labels tensor.

In [17]:
outputs = model(**inputs, labels=labels)
outputs.keys()

odict_keys(['loss', 'logits'])

In [18]:
outputs.loss

tensor(3.2186e-06, grad_fn=<NllLossBackward0>)

In [20]:
#Now we return the loss tensor - which we use for training our model with NSP.
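For a single example, this loss is the cross-entropy between the softmax of the logits and the label, which is the negative log of the probability assigned to the correct class. The sketch below shows that calculation in plain Python. The function name `nsp_loss` and the example logits are assumptions made for illustration, since the logits for the sentence pair are not printed above.

```python
import math

def nsp_loss(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    # log-sum-exp with the max subtracted for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

# Hypothetical logits, chosen only to illustrate the two regimes.
example_logits = [6.5, -6.1]
print(nsp_loss(example_logits, 0))  # tiny: confident and correct
print(nsp_loss(example_logits, 1))  # large: confident but wrong
```

A loss in the 1e-6 range, like the one above, therefore means the model put nearly all of its probability on the label we supplied.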

### Data preparation

In [22]:
with open('clean.txt', 'r') as fp:
    text = fp.read().split('\n')

We need to build both consecutive and non-consecutive sentence pairs.

We have to deal with edge cases too: for example, a paragraph may contain only a single sentence, as with the three examples above (in comparison to the paragraph below, which we can easily split into multiple sentences).
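One way to handle these edge cases is sketched below, using made-up paragraphs in place of the lines of clean.txt. The helper `split_sentences` is hypothetical, and unlike the notebook code it also strips surrounding whitespace, so treat it as a variation rather than the method used here.

```python
def split_sentences(paragraph):
    """Split on '.', trim whitespace, and drop empty fragments."""
    return [s.strip() for s in paragraph.split('.') if s.strip()]

# Made-up paragraphs standing in for lines of clean.txt.
paragraphs = [
    "A single sentence.",
    "First thought. Second thought. Third thought.",
    "",
]

# Only paragraphs with at least two sentences can yield a consecutive pair.
usable = [p for p in paragraphs if len(split_sentences(p)) > 1]
print(usable)
```

Single-sentence and empty paragraphs are filtered out, which mirrors the `num_sentences > 1` check used when building the pairs.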



In [23]:
text[51].split('.')

['Body, soul, intelligence: to the body belong sensations, to the soul appetites, to the intelligence principles',
 ' To receive the impressions of forms by means of appearances belongs even to animals; to be pulled by the strings of desire belongs both to wild beasts and to men who have made themselves into women, and to a Phalaris and a Nero: and to have the intelligence that guides to the things which appear suitable belongs also to those who do not believe in the gods, and who betray their country, and do their impure deeds when they have shut the doors',
 ' If then everything else is common to all that I have mentioned, there remains that which is peculiar to the good man, to be pleased and content with what happens, and with the thread which is spun for him; and not to defile the divinity which is planted in his breast, nor disturb it by a crowd of images, but to preserve it tranquil, following it obediently as a god, neither saying anything contrary to the truth, nor doing anyth

In [24]:
bag = [item for sentence in text for item in sentence.split('.') if item != '']
bag_size = len(bag)

In [25]:
bag

['From my grandfather Verus I learned good morals and the government of my temper',
 'From the reputation and remembrance of my father, modesty and a manly character',
 'From my mother, piety and beneficence, and abstinence, not only from evil deeds, but even from evil thoughts; and further, simplicity in my way of living, far removed from the habits of the rich',
 'From my great-grandfather, not to have frequented public schools, and to have had good teachers at home, and to know that on such things a man should spend liberally',
 "From my governor, to be neither of the green nor of the blue party at the games in the Circus, nor a partizan either of the Parmularius or the Scutarius at the gladiators' fights; from him too I learned endurance of labour, and to want little, and to work with my own hands, and not to meddle with other people's affairs, and not to be ready to listen to slander",
 'From Diognetus, not to busy myself about trifling things, and not to give credit to what was s

In [29]:
import random 
sentence_a = []
sentence_b = []
label = []

for paragraph in text:
    sentences = [
        sentence for sentence in paragraph.split(".") if sentence != ''
    ]
    num_sentences = len(sentences)
    if num_sentences > 1:
        start = random.randint(0, num_sentences-2)
        # 50/50 whether this pair is IsNextSentence or NotNextSentence
        if random.random() >= 0.5:
            # this is IsNextSentence
            sentence_a.append(sentences[start])
            sentence_b.append(sentences[start+1])
            label.append(0)
        else:
            # this is NotNextSentence: pair with a random sentence from the bag
            index = random.randint(0, bag_size-1)
            sentence_a.append(sentences[start])
            sentence_b.append(bag[index])
            label.append(1)
            
        

In [30]:
for i in range(3):
    print(label[i])
    print(sentence_a[i] + '\n---')
    print(sentence_b[i] + '\n')

1
 I observed, too, that no man could ever think that he was despised by Maximus, or ever venture to think himself a better man
---
 Is it not then strange that thy intelligent part only should be disobedient and discontented with its own place? And yet no force is imposed on it, but only those things which are conformable to its nature: still it does not submit, but is carried in the opposite direction

1
 Besides this, he honoured those who were true philosophers, and he did not reproach those who pretended to be philosophers, nor yet was he easily led by them
---
The intelligence of the universe is social

0
 Further, I am thankful to the gods that I was not longer brought up with my grandfather's concubine, and that I preserved the flower of my youth, and that I did not make proof of my virility before the proper season, but even deferred the time; that I was subjected to a ruler and a father who was able to take away all pride from me, and to bring me to the knowledge that it is p
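One caveat with drawing the negative example uniformly from the whole bag: the draw can occasionally return the true next sentence, which would mislabel that pair as NotNextSentence. A minimal sketch of one way to guard against this is shown below. The `sample_negative` helper is hypothetical and not part of the notebook.

```python
import random

def sample_negative(bag, true_next, rng=random):
    """Draw a random sentence from bag that differs from the true next sentence."""
    if not any(s != true_next for s in bag):
        raise ValueError("bag has no sentence other than the true next one")
    while True:
        candidate = bag[rng.randint(0, len(bag) - 1)]
        if candidate != true_next:
            return candidate

toy_bag = ["a", "b", "c"]
print(sample_negative(toy_bag, true_next="b"))
```

With a bag of several hundred sentences such collisions are rare, so this is a refinement rather than a required fix.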

In [42]:
inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt', max_length=512, truncation=True, padding='max_length')

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
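The warning mentions the 'longest_first' truncation strategy. The sketch below is a simplified approximation of the idea behind it, not the tokenizer's actual implementation: tokens are removed one at a time from whichever sequence is currently longer, until the pair plus its special tokens fits in max_length. The function `truncate_pair` and its `num_special` parameter are assumptions made for illustration.

```python
def truncate_pair(tokens_a, tokens_b, max_length, num_special=3):
    """Trim a token pair so it fits max_length, one token at a time.

    num_special=3 accounts for [CLS] and the two [SEP] tokens.
    """
    budget = max_length - num_special
    if budget < 0:
        raise ValueError("max_length is too small to hold the special tokens")
    a, b = list(tokens_a), list(tokens_b)
    while len(a) + len(b) > budget:
        # Drop the last token of whichever sequence is currently longer.
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    return a, b

a, b = truncate_pair(range(10), range(4), max_length=10)
print(len(a), len(b))  # 4 3
```

The practical consequence is that a very long sentence A is shortened first, so a short sentence B is usually preserved in full.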


In [43]:
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [44]:
inputs.token_type_ids

tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 1, 1, 1],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]])

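We can see that the token_type_ids tensor has been built correctly: 1 marks sentence B tokens (e.g. at the end of the third row), while the trailing 0s in the other rows are padding positions. Next, we add the labels to inputs as a column tensor with one label per pair: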

In [45]:

inputs['labels'] = torch.LongTensor([label]).T

In [46]:
inputs.labels[:10]

tensor([[1],
        [1],
        [0],
        [1],
        [1],
        [0],
        [1],
        [1],
        [0],
        [0]])

In [47]:
# Create torch dataset
class MeditationDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

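    def __getitem__(self, idx):
        # values are already tensors (return_tensors='pt'), so copy them with
        # clone().detach() instead of re-wrapping them in torch.tensor()
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        # compare with None explicitly: truth-testing a multi-element tensor raises an error
        if self.labels is not None:
            item["labels"] = torch.as_tensor(self.labels[idx])
        return item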

    def __len__(self):
        return len(self.encodings["input_ids"])

In [49]:
dataset2 = MeditationDataset(inputs)

In [51]:
loader = torch.utils.data.DataLoader(dataset2, batch_size=16, shuffle=True)
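To illustrate what a shuffling loader with batch_size=16 does to the dataset, here is a plain-Python sketch that groups example indices into batches. The `make_batches` helper and the dataset size of 317 are hypothetical and chosen only for illustration; this does not reproduce DataLoader internals.

```python
import random

def make_batches(num_examples, batch_size, shuffle=True, seed=0):
    """Return lists of example indices, roughly as a shuffling loader would group them."""
    indices = list(range(num_examples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, num_examples, batch_size)]

# 317 is a hypothetical dataset size.
batches = make_batches(317, batch_size=16)
print(len(batches), len(batches[-1]))  # 20 13
```

The last batch is smaller whenever the dataset size is not a multiple of the batch size, so the number of steps per epoch is the dataset size divided by 16, rounded up.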

In [52]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# and move our model over to the selected device
model.to(device)

BertForNextSentencePrediction(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [53]:
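# the AdamW exported by transformers is deprecated, so we use PyTorch's
# implementation instead (note: its default weight_decay is 0.01, not 0.0)
from torch.optim import AdamW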

# activate training mode
model.train()
# initialize optimizer
optim = AdamW(model.parameters(), lr=5e-6)



In [54]:
from tqdm import tqdm  # for our progress bar

epochs = 2

for epoch in range(epochs):
    # setup loop with TQDM and dataloader
    loop = tqdm(loader, leave=True)
    for batch in loop:
        # reset gradients left over from the previous step
        optim.zero_grad()
        # pull all tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        labels = batch['labels'].to(device)
        # process
        outputs = model(input_ids, attention_mask=attention_mask,
                        token_type_ids=token_type_ids,
                        labels=labels)
        # extract loss
        loss = outputs.loss
        # backpropagate: compute the gradient of the loss for every parameter that requires grad
        loss.backward()
        # update parameters
        optim.step()
        # print relevant info to progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

Epoch 0: 100%|███████████████████████| 20/20 [05:33<00:00, 16.65s/it, loss=1.65]
Epoch 1: 100%|██████████████████████| 20/20 [05:33<00:00, 16.69s/it, loss=0.526]
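The loss shown in each progress bar is the loss of the most recent batch only, so it can be noisy. A small sketch of tracking the running mean instead is shown below. The `RunningMean` class is a hypothetical helper and the per-batch losses are made-up values, not results from this run.

```python
class RunningMean:
    """Accumulate values and report their average so far."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += float(value)
        self.count += 1
        return self.total / self.count

# Hypothetical per-batch losses, not taken from the run above.
tracker = RunningMean()
for batch_loss in [0.9, 0.7, 0.5]:
    mean_loss = tracker.update(batch_loss)
print(round(mean_loss, 3))  # 0.7
```

In the training loop, one could create a tracker at the start of each epoch and pass `tracker.update(loss.item())` to `loop.set_postfix` in place of the raw batch loss.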
