## NSP Training Logic

**Next sentence prediction (NSP)** *is the other half of BERT pretraining. It takes two sentences, A and B, and asks the model to classify whether sentence B comes directly after sentence A.*

So, where MLM encourages BERT to build up a contextual understanding between words, NSP encourages BERT to learn longer-term relationships between sentences.

Let's take a look at how this works in code. First, we import and initialize everything we need.

In [None]:
from transformers import BertTokenizer, BertForNextSentencePrediction
import torch

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

In [None]:
# consecutive sentences from Wikipedia page on American Civil War
# so we expect the model to output label 0 (IsNext)
text = ("After Abraham Lincoln won the November 1860 presidential election on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy. War broke out in "
        "April 1861 when secessionist forces attacked Fort Sumter in South "
        "Carolina, just over a month after Lincoln's inauguration.")

Now, if we tokenize this text and pass it through our model as is, we get the logits tensor as output:

In [None]:
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)
outputs.keys()

odict_keys(['logits'])

The logits tensor is our NSP output prediction, which looks like:

In [None]:
outputs.logits

tensor([[ 4.4646, -3.6635]], grad_fn=<AddmmBackward0>)

Then we apply softmax to convert these logits into a probability distribution.

In [None]:
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
probs

tensor([[9.9970e-01, 2.9502e-04]], grad_fn=<SoftmaxBackward0>)

And finally, take the argmax to get our prediction:

In [None]:
torch.argmax(probs)

tensor(0)
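To make the softmax and argmax steps concrete, the sketch below redoes them in plain Python on the logits printed above. It is only an illustration of the arithmetic, not how the library computes anything internally; the `softmax` helper and the `LABELS` lookup are names introduced here for readability.

```python
import math

# logits copied from the output above: index 0 = IsNext, index 1 = NotNext
logits = [4.4646, -3.6635]

def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical lookup table, used only to print a readable label
LABELS = {0: "IsNext", 1: "NotNext"}

probs = softmax(logits)
# argmax: the index holding the largest probability
prediction = max(range(len(probs)), key=probs.__getitem__)
print(probs, LABELS[prediction])
```

Since softmax preserves the ordering of the scores, taking the argmax of the raw logits would give the same prediction.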

We are getting 0, which is IsNextSentence - however, we haven't actually specified two sentences - so this prediction is meaningless.

There are two parts we're missing. We need to specify two sentences in our input_ids tensor - and we need to create the labels tensor too.

Let's start by splitting the two sentences.

In [None]:
text = ("After Abraham Lincoln won the November 1860 presidential election on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy.")
text2 = ("War broke out in April 1861 when secessionist forces attacked Fort "
         "Sumter in South Carolina, just over a month after Lincoln's "
         "inauguration.")

Then we tokenize.

In [None]:
# we are using PyTorch, so we ask the tokenizer for PyTorch tensors
inputs = tokenizer(text, text2, return_tensors='pt')

In [None]:
inputs.keys()

In [None]:
inputs
# both sentences sit in the same input_ids tensor, separated by 102 (the SEP token)
# in token_type_ids, we have 0 for sentence A and 1 for sentence B

{'input_ids': tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,  2602,
          2006,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,   102,  2162,  3631,  2041,  1999,  2258,  6863,
          2043, 22965,  2923,  2749,  4457,  3481,  7680,  3334,  1999,  2148,
          3792,  1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055,
         17331,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Because we passed the two sentences to the tokenizer as separate arguments, it handles them a little differently. First, we can see the SEP token in the input_ids tensor with token id 102 - this marks the boundary between sentence A and sentence B (a second SEP closes sentence B).

Second, we can distinguish between the sentences using token_type_ids - sentence A tokens are assigned 0, whereas sentence B tokens are assigned 1.
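As a rough sketch of how these two markers relate, the token_type_ids shown above can be reproduced from input_ids alone: every token up to and including the first SEP (id 102) belongs to sentence A, and everything after it belongs to sentence B. The helper name `segment_ids` is made up for this illustration; in practice the tokenizer builds this tensor for us.

```python
SEP_ID = 102  # SEP token id, as seen in the input_ids above

# input_ids copied from the tokenizer output above
input_ids = [
    101, 2044, 8181, 5367, 2180, 1996, 2281, 7313, 4883, 2602,
    2006, 2019, 3424, 1011, 8864, 4132, 1010, 2019, 3988, 2698,
    6658, 2163, 4161, 2037, 22965, 2013, 1996, 2406, 2000, 2433,
    1996, 18179, 1012, 102, 2162, 3631, 2041, 1999, 2258, 6863,
    2043, 22965, 2923, 2749, 4457, 3481, 7680, 3334, 1999, 2148,
    3792, 1010, 2074, 2058, 1037, 3204, 2044, 5367, 1005, 1055,
    17331, 1012, 102,
]

def segment_ids(ids, sep_id=SEP_ID):
    """Return 0 for tokens up to and including the first SEP, 1 afterwards."""
    segment = 0
    result = []
    for token_id in ids:
        result.append(segment)
        if token_id == sep_id:
            segment = 1  # everything after the first SEP is sentence B
    return result

token_type_ids = segment_ids(input_ids)
print(token_type_ids.count(0), "tokens in sentence A,",
      token_type_ids.count(1), "tokens in sentence B")
```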

Now, we still need to add our labels tensor - but how should it be formatted? We use a torch.LongTensor built from a list containing a single value: [0] if sentence B is the next sentence, otherwise [1]. We know our two sentences are consecutive, so we pass [0].

> 0 = IsNext (sentence B follows sentence A)

> 1 = NotNext (sentence B does not follow sentence A)

In [None]:
labels = torch.LongTensor([0])
labels

tensor([0])
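When building a full training set, each label would come from how the pair was constructed rather than being typed in by hand. The sketch below is one possible way to do that in plain Python. It assumes the 50/50 split between true and random next sentences used in the original BERT pretraining setup; `make_nsp_pairs` and the placeholder sentences are hypothetical and not part of this notebook.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, label) triples from an ordered list.

    label 0 = IsNext, label 1 = NotNext. Needs at least three sentences
    so that a non-consecutive sentence B is always available.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # keep the true next sentence
            pairs.append((sentences[i], sentences[i + 1], 0))
        else:
            # pick any sentence other than A itself and its true successor
            others = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(others), 1))
    return pairs

# placeholder text, for illustration only
sentences = ["First sentence.", "Second sentence.",
             "Third sentence.", "Fourth sentence."]
pairs = make_nsp_pairs(sentences)
for sent_a, sent_b, label in pairs:
    print(label, "|", sent_a, "->", sent_b)
```

Each triple could then be tokenized as a sentence pair, with its label wrapped in a torch.LongTensor as shown above.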

Now we run the model as before, this time passing our labels tensor as well, so that the output includes the loss alongside the logits.

In [None]:
outputs = model(**inputs, labels=labels)
# the output now contains two tensors: loss and logits
outputs.keys()

odict_keys(['loss', 'logits'])

In [None]:
outputs.logits
# logits has 2 values -> index 0 for isNext and index 1 for notNext

In [None]:
# taking argmax of logits
torch.argmax(outputs.logits)
# we get 0 that is isNext Sentence

In [None]:
# need to pass labels to calculate loss in the outputs
# or else we only get the logits
outputs.loss

tensor(3.2186e-06, grad_fn=<NllLossBackward0>)

The model now returns the loss tensor, which is what we optimize when training our model with NSP.
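The grad_fn on the loss (NllLossBackward0) points to a negative log-likelihood, that is, a cross-entropy between the two logits and the label. A minimal plain-Python sketch of that calculation follows. As an assumption for illustration, it reuses the logits printed in the first, single-input example, so the numbers will not match the 3.2186e-06 shown above; `nsp_loss` is a hypothetical helper name.

```python
import math

def nsp_loss(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return log_sum_exp - logits[label]

# illustrative logits: index 0 = IsNext, index 1 = NotNext
logits = [4.4646, -3.6635]

loss_if_next = nsp_loss(logits, 0)      # label agrees with the prediction
loss_if_not_next = nsp_loss(logits, 1)  # label contradicts the prediction
print(loss_if_next, loss_if_not_next)
```

The loss is close to zero when the label matches a confident prediction and large when it contradicts one, which is the signal backpropagation uses during NSP training.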