<a href="https://colab.research.google.com/github/Azizkhaled/NLP_with_Aziz/blob/main/MLM_and_NSP_examples.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

BERT is pre-trained with two heads: NSP (next sentence prediction) and MLM (masked language modeling).

In [1]:
pip install transformers

Successfully installed huggingface-hub-0.16.4 safetensors-0.3.2 tokenizers-0.13.3 transformers-4.32.0


# MLM and NSP

In [2]:
from transformers import BertTokenizer, BertForPreTraining
import torch

In [31]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForPreTraining.from_pretrained('bert-base-uncased')

text = ("After Abraham Lincoln won the November 1860 presidential [MASK] on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy. War broke out in "
        "April 1861 when secessionist forces [MASK] Fort Sumter in South "
        "Carolina, just over a month after Lincoln's inauguration.")
# 60 tokens; the model has to guess the [MASK] after "presidential"
# and the [MASK] after "forces"


inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

In [32]:
outputs.keys()

odict_keys(['prediction_logits', 'seq_relationship_logits'])

In [33]:
outputs.prediction_logits.shape

torch.Size([1, 62, 30522])

There are 62 tokens (60 from the text plus `[CLS]` and `[SEP]`), which is reflected in the second dimension of the shape above.

In [34]:
outputs.seq_relationship_logits

tensor([[ 2.8257, -1.6897]], grad_fn=<AddmmBackward0>)

- `outputs.prediction_logits` is the output of the MLM head: for each token position, one score per vocabulary entry (a softmax over the vocabulary turns these scores into probabilities over words)

- `outputs.seq_relationship_logits` is the output of the NSP head: two scores, IsNext (index 0) and NotNext (index 1), for whether the second sentence follows the first
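
The two shapes follow from what each head reads: the MLM head is applied to every token vector, while the NSP head is applied to a single pooled vector per input. Here is a rough sketch of that, assuming simplified single-layer heads and random stand-in tensors. The names `sequence_output`, `pooled_output`, `mlm_head` and `nsp_head` are illustrative and are not the library's internals; the hidden size of 768 is that of `bert-base-uncased`.

```python
import torch

torch.manual_seed(0)
batch, seq_len, hidden, vocab = 1, 62, 768, 30522

# Stand-ins for the encoder outputs (random values, used for shapes only)
sequence_output = torch.randn(batch, seq_len, hidden)  # one vector per token
pooled_output = torch.randn(batch, hidden)             # one vector per input

# Simplified heads (assumption): one linear layer each; the real MLM head has extra layers
mlm_head = torch.nn.Linear(hidden, vocab)
nsp_head = torch.nn.Linear(hidden, 2)

prediction_logits = mlm_head(sequence_output)
seq_relationship_logits = nsp_head(pooled_output)
print(prediction_logits.shape)        # torch.Size([1, 62, 30522])
print(seq_relationship_logits.shape)  # torch.Size([1, 2])
```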

## MLM

Convert the predictions into words

In [35]:
token2idx = tokenizer.get_vocab()

In [None]:
token2idx.keys()

In [37]:
token2idx['yes']

2748

`token2idx` has tokens as keys and ids as values. Let's invert it so we can look up tokens by id.

In [38]:
# invert the vocab mapping: id -> token
idx2token = {idx: token for token, idx in token2idx.items()}


In [39]:
idx2token[2748]

'yes'

In [40]:
outputs.prediction_logits[0].shape

torch.Size([62, 30522])

`outputs.prediction_logits[0]` holds, for each of the 62 token positions in our text, one logit (unnormalized score) for each of the 30522 entries in the tokenizer vocab.

Next, extract the highest-scoring token at each position using an argmax:

In [42]:
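# Note: dim=0 normalizes across the 62 positions rather than across the vocabulary;
# dim=-1 would give a probability distribution over the vocab at each position.
softmax = torch.nn.functional.softmax(outputs.prediction_logits[0], dim=0)
argmax = torch.argmax(softmax, dim=1)  # highest-scoring vocab id at each position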

In [43]:
argmax

tensor([28191,  2348,  8181, 16628,  2180,  3882,  2281,  7313,  4883, 27419,
         2006,  2010,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
         8914,  2163, 13520,  2037,  4336,  2013,  1996,  2406,  2000,  2433,
        28775, 18179, 16363,  2162,  3631,  2041,  1999,  2258,  6863,  2043,
        18232,  2923,  2749,  4548,  3481,  7680,  5017,  2005,  2148,  3792,
        24901,  2074,  2058,  1037,  3204,  2077,  3946,  1005,  1055, 17331,
         1025, 25656])

Now let's use these ids to get the tokens:

In [44]:
for i in argmax:
    print(idx2token[i.item()], end=' ')

##ecin although abraham lincolnshire won 1948 november 1860 presidential primaries on his anti - slavery platform , an initial seven tributary states declare their independence from the country to form ##ici confederacy ##yre war broke out in april 1861 when ##oya ##ist forces occupied fort sum ##mer for south carolina ##trip just over a month before grant ' s inauguration ; ##tson 

The guesses for the two `[MASK]` positions were "primaries" and "occupied".

The original words were "election" and "attacked", so both guesses are close.
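
Only the `[MASK]` positions are real predictions, so it can be cleaner to read just those. The sketch below does that by taking the argmax over the vocabulary axis (the last dimension) of the raw logits. Softmax is monotonic, so this picks the same id as an argmax over a softmax taken along that axis. A softmax along `dim=0` instead normalizes each vocabulary entry across positions, which can change which id wins at a position. The helper name `predict_masked` and the toy tensors are assumptions for illustration and are not part of the notebook.

```python
import torch

def predict_masked(logits, input_ids, mask_token_id):
    """Return the highest-scoring vocab id at each masked position.

    logits:    (seq_len, vocab_size) scores from the MLM head
    input_ids: (seq_len,) token ids that were fed to the model
    """
    mask_positions = (input_ids == mask_token_id).nonzero(as_tuple=True)[0]
    # argmax over the vocabulary axis; softmax is not needed to pick the top id
    predicted_ids = logits[mask_positions].argmax(dim=-1)
    return mask_positions, predicted_ids

# Toy data (hypothetical): 4 positions, a vocabulary of 5 ids, id 3 plays the role of [MASK]
toy_input_ids = torch.tensor([0, 3, 1, 3])
toy_logits = torch.tensor([
    [9.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.2, 5.0, 0.3, 0.0],
    [0.0, 9.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.2, 4.0],
])
positions, ids = predict_masked(toy_logits, toy_input_ids, mask_token_id=3)
print(positions.tolist(), ids.tolist())  # [1, 3] [2, 4]
```

With the notebook's objects, the call would be `predict_masked(outputs.prediction_logits[0], inputs["input_ids"][0], tokenizer.mask_token_id)`, and `tokenizer.convert_ids_to_tokens(ids.tolist())` turns the resulting ids into tokens.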

In [45]:
text

"After Abraham Lincoln won the November 1860 presidential [MASK] on an anti-slavery platform, an initial seven slave states declared their secession from the country to form the Confederacy. War broke out in April 1861 when secessionist forces [MASK] Fort Sumter in South Carolina, just over a month after Lincoln's inauguration."

## NSP

Next sentence prediction is slightly different. First, we need to define the two sequences. They are separated by a **[SEP]** token and distinguished by the `token_type_ids` tensor; the tokenizer handles both when it is given two texts.

In [67]:
text0 = ("After Abraham Lincoln won the November 1860 presidential [MASK] on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy.")
text1 = ("War broke out in April 1861 when secessionist forces [MASK] Fort "
         "Sumter in South Carolina, just over a month after Lincoln's "
         "inauguration.")
text2 = ("A random text to test the NSP algorithim")

In [68]:
inputs = tokenizer(text0, text1, return_tensors="pt")


In [69]:
inputs

{'input_ids': tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,   103,
          2006,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,   102,  2162,  3631,  2041,  1999,  2258,  6863,
          2043, 22965,  2923,  2749,   103,  3481,  7680,  3334,  1999,  2148,
          3792,  1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055,
         17331,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

`token_type_ids` = 0 marks `text0` (including `[CLS]` and the first `[SEP]`).

`token_type_ids` = 1 marks `text1` (including the final `[SEP]`).

Also notice the additional 102, which is a `[SEP]` token, between the two sequences.
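
The layout above can be sketched in plain Python. This is only an illustration of the `[CLS] A [SEP] B [SEP]` pattern on already-split tokens; `build_pair` is a hypothetical helper, not a tokenizer function.

```python
def build_pair(tokens_a, tokens_b):
    """Lay out two token lists as [CLS] A [SEP] B [SEP] with segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP]
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids

tokens, token_type_ids = build_pair(["war", "broke", "out"], ["it", "ended"])
for tok, seg in zip(tokens, token_type_ids):
    print(f"{seg}  {tok}")
```

This matches the tensor printed above, where the zeros run up to and including the first 102 and the ones run through the final 102.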

In [70]:
outputs = model(**inputs)

In [71]:
outputs.seq_relationship_logits


tensor([[ 6.0843, -5.6813]], grad_fn=<AddmmBackward0>)

Great, now we pass them through an argmax to get 0 or 1, indicating whether sentence B (marked by 1 in `token_type_ids`) follows sentence A (marked by 0).

In [72]:
argmax = torch.argmax(outputs.seq_relationship_logits)  # get index of the max activation
'NotNext' if argmax.item() else 'IsNext'


'IsNext'

Index 0 represents BERT's IsNext class, meaning that sentence B is the next sentence after A. Index 1 represents the NotNext class, meaning sentence B is not the next sentence after A.
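
The argmax only gives the winning class. To see how confident the model is, the two logits can be turned into probabilities with a softmax. The sketch below does this in plain Python so the arithmetic is visible; the logits are copied from the cell output above, and the `softmax` helper is written here for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits from the notebook output: index 0 = IsNext, index 1 = NotNext
nsp_logits = [6.0843, -5.6813]
probs = softmax(nsp_logits)
label = "IsNext" if probs[0] >= probs[1] else "NotNext"
print(label, [round(p, 6) for p in probs])
```

On tensors, `torch.softmax(outputs.seq_relationship_logits, dim=-1)` performs the same normalization.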

Let's test it with a random text:

In [74]:
inputs = tokenizer(text0, text2, return_tensors="pt")
outputs = model(**inputs)
argmax = torch.argmax(outputs.seq_relationship_logits)  # get index of the max activation
'NotNext' if argmax.item() else 'IsNext'


'NotNext'

Perfect!