<a href="https://colab.research.google.com/github/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/masked_language_modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT for masked language modelling


In [1]:
!pip3 -q install datasets transformers

In [2]:
import transformers
import datasets
import torch

# BERT: masked language modelling output

* BERT is pre-trained to predict masked words, so let's try it!

In [21]:
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-cased") #you can also use the trusty "TurkuNLP/bert-base-finnish-cased-v1"
bert = transformers.AutoModelForPreTraining.from_pretrained("bert-base-cased")

In [23]:
# Put the model in evaluation mode (this disables dropout, for example).
# This is probably redundant, since the docs say the model is put in eval mode upon load, but let's play it safe: https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.from_pretrained.config
bert = bert.eval()

...now let's see how well this works for the masked word prediction...
* we need to find the most likely predicted words
* which can be achieved by arg-sorting the predictions and picking top N words
* this is easy and we have done this kind of stuff before
* now let's try straight in torch without a roundtrip to numpy
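
To make the arg-sorting step concrete, here is a minimal sketch in plain Python. The five-word vocabulary and its scores are invented for illustration only; `torch.argsort` with `descending=True` performs the same kind of index sort on a tensor.

```python
# Toy illustration of arg-sorting: the vocabulary and scores are made up.
vocab = ["chase", "eat", "table", "like", "see"]
scores = [4.1, 3.7, -2.0, 1.2, 2.9]  # one score (logit) per vocabulary entry

def top_n_indices(scores, n):
    """Return the indices of the n highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]

top3 = top_n_indices(scores, 3)
print(top3)                      # [0, 1, 4]
print([vocab[i] for i in top3])  # ['chase', 'eat', 'see']
```

The result is a list of positions in the vocabulary, not the scores themselves, which is why the ids can be handed directly to the tokenizer for decoding.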

In [18]:
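texts = ["Dogs like to [MASK] cats. They are cute."]

t = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad(): # inference only, no gradients needed
  bert_out = bert(**t)
# prediction_logits has shape (batch, sequence, vocabulary): sort along the vocabulary axis and keep the 20 best token ids
top20 = torch.argsort(bert_out["prediction_logits"], dim=2, descending=True)[:,:,:20]

# [MASK] is at position 4 of the tokenized input (position 0 is [CLS])
print("Guesses:",tokenizer.decode(top20[0,4]))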

Guesses: have eat chase see pet keep play watch get scare be hunt ride like visit kill feed fight lick catch


In [19]:
print(t)
print(tokenizer.mask_token_id)

{'input_ids': tensor([[  101, 16406,  1176,  1106,   103, 11771,   119,  1220,  1132, 10509,
           119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
103


# TASKS

As an exercise, you can try to solve the following:

1. How good is BERT at the masked language modelling (MLM) task? Feed it random texts, e.g. from the IMDB dataset, mask one random token at a time, and check whether BERT predicted it correctly.
2. If you did (1), was the correct token among BERT's top-5 predictions?
3. See if you can do better. Write a program that picks random texts from one of the datasets we used in this course and produces two files: one with text segments that each contain one [MASK], and one with the correct answers. Then try to guess the words without looking at the latter file, and compare your answers with the correct ones. How well did you do?
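
For task 3, one possible starting point is sketched below. It is an assumption about how such a program could look, not a reference solution: the helper names `mask_random_word` and `write_quiz`, the file layout (one item per line), and the whitespace-based word splitting are all hypothetical choices.

```python
import random

def mask_random_word(text, rng):
    """Replace one randomly chosen whitespace-separated word with [MASK]."""
    words = text.split()
    idx = rng.randrange(len(words))  # position of the word to hide
    answer = words[idx]
    words[idx] = "[MASK]"
    return " ".join(words), answer

def write_quiz(texts, questions_path, answers_path, seed=0):
    """Write masked texts and their answers to two parallel files, one item per line."""
    rng = random.Random(seed)  # seeded so the quiz is reproducible
    with open(questions_path, "w", encoding="utf-8") as q, \
         open(answers_path, "w", encoding="utf-8") as a:
        for text in texts:
            masked, answer = mask_random_word(text, rng)
            q.write(masked + "\n")
            a.write(answer + "\n")

masked, answer = mask_random_word("Dogs like to chase cats.", random.Random(0))
print(masked)
print(answer)
```

Note that this masks whole whitespace-separated words, punctuation included, whereas BERT masks tokens from its own vocabulary, so the two guessing tasks are not exactly the same. Empty texts would need to be filtered out first, since there is nothing to mask in them.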


# General notes:

* Some of you had problems (or really bad performance) because you failed to select predictions for the correct token
* In the example code, the index of the masked token was hard-coded: **top20[0,4]**
* A typical error was to manually select the index based on the original text, not the tokenized text
  - _Dogs like to [MASK] cats. They are cute._ --> fourth token, so index=3
  - _[CLS] Dogs like to [MASK] cats . They are cute . [SEP]_ --> fifth token, so index=4
  - Note that the tokenizer can also split words into subwords, so you cannot assume a simple +1 for the inserted [CLS]!
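
Rather than hard-coding the position, the masked index can be looked up from the token ids. The sketch below is plain Python and uses the `input_ids` that the tokenizer produces for the example sentence and for its all-caps variant, together with the printed value of `tokenizer.mask_token_id` (103). The helper name `find_mask_positions` is made up for this illustration.

```python
MASK_TOKEN_ID = 103  # value of tokenizer.mask_token_id printed for bert-base-cased

def find_mask_positions(input_ids, mask_token_id=MASK_TOKEN_ID):
    """Return every position in input_ids that holds the mask token id."""
    return [i for i, tok_id in enumerate(input_ids) if tok_id == mask_token_id]

# "Dogs like to [MASK] cats. They are cute."
ids_normal = [101, 16406, 1176, 1106, 103, 11771, 119, 1220, 1132, 10509, 119, 102]
# "DOGS LIKE TO [MASK] CATS. THEY ARE CUTE." (split into many more subwords)
ids_upper = [101, 141, 2346, 13472, 149, 2240, 22441, 16972, 103, 8784, 11365,
             119, 7462, 3663, 22133, 2036, 140, 16830, 2036, 119, 102]

print(find_mask_positions(ids_normal))  # [4]
print(find_mask_positions(ids_upper))   # [8]
```

The same sentence gives a different mask index once the words are split into subwords, which is exactly why counting words in the original text goes wrong. With tensor inputs, a comparison such as `t["input_ids"] == tokenizer.mask_token_id` followed by `.nonzero()` should give the equivalent positions.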




In [45]:
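#print(dir(tokenizer)) # uncomment to list the tokenizer's attributes and methods
text = "Dogs like to [MASK] cats. They are cute."
tok = tokenizer(text)
print(tok)
detok = tokenizer.convert_ids_to_tokens(tok["input_ids"]) # map ids back to token strings
print(detok)
print(detok[4]) # the masked position in the tokenized text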

{'input_ids': [101, 16406, 1176, 1106, 103, 11771, 119, 1220, 1132, 10509, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'Dogs', 'like', 'to', '[MASK]', 'cats', '.', 'They', 'are', 'cute', '.', '[SEP]']
[MASK]


In [41]:
text = "DOGS LIKE TO [MASK] CATS. THEY ARE CUTE."
tok = tokenizer(text)
print(tok)
detok = tokenizer.convert_ids_to_tokens(tok["input_ids"])
print(detok)
print(detok[8])

{'input_ids': [101, 141, 2346, 13472, 149, 2240, 22441, 16972, 103, 8784, 11365, 119, 7462, 3663, 22133, 2036, 140, 16830, 2036, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'D', '##O', '##GS', 'L', '##I', '##KE', 'TO', '[MASK]', 'CA', '##TS', '.', 'THE', '##Y', 'AR', '##E', 'C', '##UT', '##E', '.', '[SEP]']
[MASK]


# Generalize to mask a random token and get predictions for it

In [57]:
#Get some data
dataset = datasets.load_dataset("imdb", split="train") # we need only training section
dataset = dataset.shuffle()
dataset = dataset.select(range(100)) # downsample
print(dataset)
print(dataset[0])

Dataset({
    features: ['text', 'label'],
    num_rows: 100
})
{'text': 'I rented this movie because it sounded pretty interesting but to my Horror this movie was the worst movie I had ever seen! I read the comment from Gumby-8 and he has to be a part of the cast or the crew. Unless Gumby-8 is a 4 year old child or some demented cult member no one in their right mind would think that this movie had any potential at all.<br /><br />I couldn\'t believe Gumby-8\'s comments. <br /><br />Quoting: "From the "Dune" inspired opening animation to the quick pace...this film keeps the eye moving and works so well that repeat viewing is not unexpected."<br /><br />The Animation is the only aspect of the movie that was interesting and the fact of that the film keeps the eye moving, well that\'s because you keep looking for any type of suspense. I mean give me a break Halloween was shot with a budget of $100,000.00 and a painted mask and also by the way became a cult classic. As far as "repeat view

In [58]:
# truncate review texts to make it easier to spot things, this step is not necessary!
# Note that this can make the task easier / more difficult compared to full reviews

def truncate_review(example):
  text = example["text"]
  tokens = text.split() # naive tokenization, split on whitespace
  tokens = tokens[:30] # let's keep first 30 tokens
  short_text = " ".join(tokens) # convert back to string
  return {"text": short_text}

dataset = dataset.map(truncate_review)
print(dataset)
print(dataset[0])

Dataset({
    features: ['text', 'label'],
    num_rows: 100
})
{'text': 'I rented this movie because it sounded pretty interesting but to my Horror this movie was the worst movie I had ever seen! I read the comment from Gumby-8 and', 'label': 0}


In [133]:
# Let's test how masking of a random token works

import random

#help(tokenizer.decode)

text = dataset[0]["text"]

t = tokenizer(text) # prepare normal tokenized input
print("input_ids:", t["input_ids"])
print("tokens:", tokenizer.decode(t["input_ids"]))

random_idx = random.randint(1, len(t["input_ids"])-2) # sample a random position; randint is inclusive at both ends, so this skips [CLS] (index 0) and [SEP] (last index)
print("random index:", random_idx, tokenizer.decode(t["input_ids"][random_idx]))

t["input_ids"][random_idx] = tokenizer.mask_token_id # mask the token in input
print("masked tokens:", tokenizer.decode(t["input_ids"]))




input_ids: [101, 146, 12765, 1142, 2523, 1272, 1122, 4234, 2785, 5426, 1133, 1106, 1139, 16709, 1142, 2523, 1108, 1103, 4997, 2523, 146, 1125, 1518, 1562, 106, 146, 2373, 1103, 7368, 1121, 144, 1818, 2665, 118, 129, 1105, 102]
tokens: [CLS] I rented this movie because it sounded pretty interesting but to my Horror this movie was the worst movie I had ever seen! I read the comment from Gumby - 8 and [SEP]
random index: 4 movie
masked tokens: [CLS] I rented this [MASK] because it sounded pretty interesting but to my Horror this movie was the worst movie I had ever seen! I read the comment from Gumby - 8 and [SEP]
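
The bounds passed to `random.randint` matter because both ends are inclusive. As a quick sanity check, sketched here in plain Python with an arbitrary sequence length, sampling from `1` to `n - 2` never returns the position of `[CLS]` (index 0) or of `[SEP]` (index `n - 1`).

```python
import random

n = 12  # arbitrary sequence length, counting [CLS] at index 0 and [SEP] at index n - 1
rng = random.Random(0)  # seeded only to make the check repeatable

# Draw many indices the same way the masking code does and collect the distinct values.
samples = {rng.randint(1, n - 2) for _ in range(10_000)}

print(min(samples), max(samples))
print(0 in samples, (n - 1) in samples)  # False False
```

Using `len(...) - 1` as the upper bound instead would occasionally mask `[SEP]`, which would distort the accuracy numbers.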


In [114]:
# define function to tokenize and mask
def tokenize_and_mask_random(text):
  t = tokenizer(text, return_tensors="pt")
  random_token_idx = random.randint(1, t["input_ids"].shape[1]-2) # sample random token
  original_token = tokenizer.decode(t["input_ids"][0, random_token_idx]) # get the original token
  t["input_ids"][0, random_token_idx] = tokenizer.mask_token_id # mask the token
  return t, random_token_idx, original_token

# define function to get predictions for masked token
def get_predictions(tokenized, random_token_idx):
  bert_out = bert(**tokenized)
  top5 = torch.argsort(bert_out["prediction_logits"], dim=2, descending=True)[:,:,:5] # take the top 5 so we can answer both task 1 and task 2
  pred_for_mask = top5[0, random_token_idx] # from first input text, select the predictions for masked token
  return [tokenizer.decode(t) for t in pred_for_mask]

# test with one example
tokenized, random_token_idx, original_token = tokenize_and_mask_random(dataset[0]["text"])
pred_for_mask = get_predictions(tokenized, random_token_idx)

print(dataset[0]["text"])
print(tokenizer.decode(tokenized["input_ids"][0,:]))
print("Original token:", original_token)
print("Predictions:", pred_for_mask)


I rented this movie because it sounded pretty interesting but to my Horror this movie was the worst movie I had ever seen! I read the comment from Gumby-8 and
[CLS] I rented this [MASK] because it sounded pretty interesting but to my Horror this movie was the worst movie I had ever seen! I read the comment from Gumby - 8 and [SEP]
Original token: movie
Predictions: ['movie', 'one', 'film', ',', 'thing']


In [141]:
# run all
# Note: not optimized for speed!

top1_acc = 0
top5_acc = 0

for i,review in enumerate(dataset):
  tokenized, random_token_idx, original_token = tokenize_and_mask_random(review["text"])
  pred_for_mask = get_predictions(tokenized, random_token_idx)
  if pred_for_mask[0]==original_token:
    top1_acc+=1
  if original_token in pred_for_mask:
    top5_acc+=1
  if i < 5: # print first 5 examples for visualization
    print(i)
    print(tokenizer.decode(tokenized["input_ids"][0]))
    print(original_token)
    print(pred_for_mask)
    print()

print("Top-1 accuracy:", top1_acc/len(dataset)*100)
print("Top-5 accuracy:", top5_acc/len(dataset)*100)

0
[CLS] I rented this movie because it sounded pretty interesting but to my Horror this movie was the worst movie [MASK] had ever seen! I read the comment from Gumby - 8 and [SEP]
I
['I', 'anyone', 'he', 'we', 'they']

1
[CLS] Moonwalker is absolutely incredible!!!!!!! What [MASK] can I say!? Michael Jackson is the true King of pop, rock and soul!!! Moonwalker has everything! Great story line, [SEP]
else
['else', 'more', 'exactly', 'words', 'other']

2
[CLS] To Die For ( 1989 ) was just another d. t. v. feature that made an appearance on cable ad nasuem during the early nineties. The only thing notable [MASK] this feature was the [SEP]
about
['about', 'in', 'for', 'of', 'from']

3
[CLS] All Dogs Go To [MASK] is on a par with Watership Down for scary kiddies films. Both were dark and pretty sinister, but at the same time the most mesmerising [SEP]
Heaven
['Hell', 'School', 'Heaven', 'Sleep', 'War']

4
[CLS] I got to see this [MASK] at a preview and was dazzled by it. It's not the typica