
# Basic BERT operations


In [29]:
!pip3 -q install datasets transformers

In [30]:
import transformers
import datasets
import torch

In [31]:
tokenizer=transformers.AutoTokenizer.from_pretrained("bert-base-cased") #you can also use the trusty "TurkuNLP/bert-base-finnish-cased-v1"

In [32]:
# We will be running the model directly, so let's use return_tensors="pt" to get torch tensors rather than Python lists
texts=["Dogs like to [MASK] cats. They taste good.","Bad joke!"]
t=tokenizer(texts,padding=True, truncation=True, return_tensors="pt")
print("Input ids",t["input_ids"])
print("Token type ids",t["token_type_ids"])
print("Attention mask",t["attention_mask"])

Input ids tensor([[  101, 16406,  1176,  1106,   103, 11771,   119,  1220,  5080,  1363,
           119,   102],
        [  101,  6304,  8155,   106,   102,     0,     0,     0,     0,     0,
             0,     0]])
Token type ids tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Attention mask tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])
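To make the padding in the second row concrete, here is a minimal pure-Python sketch of what `padding=True` amounts to, assuming the pad id is 0 as in the output above. The helper name `pad_batch` is hypothetical and not part of the `transformers` API; the real tokenizer does more (truncation, special tokens, tensor conversion).

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to the longest sequence and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)              # real ids, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Token ids copied from the tokenizer output above (special tokens already included)
first = [101, 16406, 1176, 1106, 103, 11771, 119, 1220, 5080, 1363, 119, 102]
second = [101, 6304, 8155, 106, 102]

ids, mask = pad_batch([first, second])
print(ids[1])
print(mask[1])
```

The attention mask is what lets the model ignore the padded positions of the shorter sequence.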


In [33]:
# This is what the first sequence looks like
tokenizer.decode(t["input_ids"][0])

'[CLS] Dogs like to [MASK] cats. They taste good. [SEP]'

# BERT: bare model
* How to use the bare model
* What does it give us?

In [34]:
bert=transformers.AutoModel.from_pretrained("bert-base-cased") #"TurkuNLP/bert-base-finnish-cased-v1" if you run this in Finnish


* In torch, calling a model as if it were a function goes through `__call__()`, which in turn runs the model's `forward()` function


In [35]:
bert_out=bert(
    input_ids=t["input_ids"],
    attention_mask=t["attention_mask"],
    token_type_ids=t["token_type_ids"])
#an easy way to say the above would be bert(**t)
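The call above relies on two general mechanisms: calling an `nn.Module` instance runs its `forward()`, and `**t` unpacks a dictionary into keyword arguments. A minimal sketch with a toy module; the class `ToyEncoder` and its sizes are made up for illustration and say nothing about BERT's internals.

```python
import torch
from torch import nn

class ToyEncoder(nn.Module):
    def __init__(self, vocab_size=200, hidden_size=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask):
        hidden = self.embed(input_ids)                  # [batch, seq, hidden]
        return hidden * attention_mask.unsqueeze(-1)    # zero out padded positions

toy = ToyEncoder()
batch = {
    "input_ids": torch.tensor([[101, 5, 7, 102], [101, 9, 102, 0]]),
    "attention_mask": torch.tensor([[1, 1, 1, 1], [1, 1, 1, 0]]),
}

explicit = toy(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
unpacked = toy(**batch)   # same call: the dict keys become keyword arguments

print(unpacked.shape)
print(torch.equal(explicit, unpacked))
```

Both calls go through `forward()`, so the two results are identical.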


That's it, this is how you call BERT. Now let's see what it gave us. The output object behaves like a dictionary, and its entries can also be read as attributes:

In [36]:
bert_out.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

* last_hidden_state: the output of the last encoder layer, one vector per input token
* pooler_output: the output of a dense layer with `tanh` activation, applied on top of the `[CLS]` vector

In [37]:
# Before you run this, stop to think:
# What will the shape be? How many dimensions? 1? 2? 3? more? And their approximate sizes?
# make a guess, see if it matches
bert_out.last_hidden_state.shape

torch.Size([2, 12, 768])

In [38]:
# And here? What will the shape be?
bert_out.pooler_output.shape

torch.Size([2, 768])
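The relation between the two shapes can be sketched in plain torch. In BERT, the pooler takes the final hidden state at position 0 (`[CLS]`) and passes it through a dense layer followed by `tanh`. The sketch below uses random tensors and a freshly initialised layer as stand-ins, so only the shapes and the value range are meaningful, not the numbers.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, seq_len, hidden_size = 2, 12, 768

# Stand-in for bert_out.last_hidden_state (random values, same shape)
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

# Stand-in for the pooler's dense layer (untrained weights)
dense = nn.Linear(hidden_size, hidden_size)

cls_vectors = last_hidden_state[:, 0]       # [batch, hidden]: one vector per sequence
pooled = torch.tanh(dense(cls_vectors))     # [batch, hidden], bounded by tanh to [-1, 1]

print(cls_vectors.shape)
print(pooled.shape)
```

Selecting position 0 removes the sequence dimension, which is why `pooler_output` has two dimensions where `last_hidden_state` has three.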

# BERT: masked language modelling output

* The bare model only gives us vectors, and there is not much we can do with them directly
* But BERT is trained to predict masked words, let's try!

In [39]:
# Have a look at HuggingFace automodels documentation to see what types of automodels there are
bert=transformers.AutoModelForPreTraining.from_pretrained("bert-base-cased")

In [40]:
# Put the model into evaluation mode (this disables dropout, for example)
# This is probably not needed, since the docs say from_pretrained() already puts the model in eval mode, but let's play it safe: https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.from_pretrained.config
bert=bert.eval()
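What `eval()` changes can be seen on a single dropout layer. A small sketch, independent of BERT: in training mode dropout zeroes random elements and rescales the rest, while in eval mode it passes the input through unchanged.

```python
import torch
from torch import nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()          # training mode: dropout is active
train_out = dropout(x)

dropout.eval()           # eval mode: dropout is a no-op
eval_out = dropout(x)

print("zeros in train mode:", int((train_out == 0).sum()))
print("identical in eval mode:", torch.equal(eval_out, x))
print("training flag:", dropout.training)
```

Calling `eval()` on a whole model sets the same flag on all of its submodules.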

Now we can again run the model, and we will see the output is quite different!

In [41]:
bert_out=bert(**t)
bert_out.keys()

odict_keys(['prediction_logits', 'seq_relationship_logits'])

In [42]:
# What are these? https://huggingface.co/transformers/v3.0.2/model_doc/bert.html#transformers.BertForPreTraining
#What do you think these shapes will be?
print("Logits",bert_out["prediction_logits"].shape)
print("Seq relationship logits",bert_out["seq_relationship_logits"].shape)

Logits torch.Size([2, 12, 28996])
Seq relationship logits torch.Size([2, 2])


In [43]:
#cross-check
tokenizer.vocab_size

28996
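The last dimension of `prediction_logits` is the vocabulary: for each position in each sequence, the model outputs one unnormalised score per vocabulary item. If probabilities are wanted, a softmax over that dimension does the conversion. A sketch on a random tensor, with a small made-up vocabulary size rather than the real 28996:

```python
import torch

torch.manual_seed(0)
batch_size, seq_len, vocab_size = 2, 12, 50   # vocab_size is made up to keep the example small

logits = torch.randn(batch_size, seq_len, vocab_size)   # stand-in for prediction_logits
probs = torch.softmax(logits, dim=-1)                    # normalise over the vocabulary

print(probs.shape)
print(probs[0, 4].sum())   # probabilities at one position sum to (approximately) 1
```

The ranking of the vocabulary items is the same for logits and probabilities, so picking the top words works on either.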

...now let's see how well this works for the masked word prediction...
* we need to find the most likely predicted words
* which can be achieved by arg-sorting the predictions and picking top N words
* this is easy and we have done this kind of stuff before
* now let's try straight in torch without a roundtrip to numpy

In [44]:
predictions = bert_out["prediction_logits"]
print(predictions.shape)
top20=torch.argsort(predictions,dim=2,descending=True)[:,:,:20] #why dim=2? what does [:,:,:20] do?
print(top20)

torch.Size([2, 12, 28996])
tensor([[[  119,   117,   107,   114,  1103,  1105,   136,  1104,  1106,   118,
           1107,  1116,   170,   112,  1108,   113,   146,  1122,  1115,   188],
         [  119,   107,   117,  1103,   132,   114,  1105,  1104,  1106,   136,
            112,   118,   170,  1107,   146,  1108,   113,  1109,  1112,   188],
         [ 1176,  1567,  3851,  4819,  1328,  9353,  2409,  6613,  5548,  3097,
           2037,  3940,  1920, 13054,  1329,  1132,  7871, 20662,  1215,  7407],
         [ 1106, 27629,  1128,  1152,  1103,  1195,   146,  1122,  1136,   170,
           1115,  1706,  1105,  1143,   117,  1184,  6513,  1315,   189,  1505],
         [ 3940,  9839,  1138, 11109,  1267,  8263,  2824,  2311, 13671,  1505,
           1712,  1243,  4877,  4176,  1129,  2147, 19676,  1176,  3963,  3644],
         [11771,  5855, 17408,  3551,  1172, 23463,  6363,  1122,  8892,  1234,
          14986, 11260, 12237, 25164,  1128,  4067, 21235, 13475,  1152,   117],
         ...

In [45]:
print(texts[0])

print("Guesses:",tokenizer.decode(top20[0,4]))

Dogs like to [MASK] cats. They taste good.
Guesses: eat chase have pet see hunt watch kill scare play keep get feed ride be fight lick like catch avoid
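The indexing in `torch.argsort(predictions,dim=2,descending=True)[:,:,:20]` can be checked on a tiny tensor. `dim=2` sorts along the vocabulary axis, and `[:,:,:k]` keeps the first `k` indices at every position of every sequence. `torch.topk` returns the same indices without sorting the whole vocabulary. The scores below are hand-made and contain no ties.

```python
import torch

# Shape [batch=1, seq=2, vocab=5]; values chosen by hand, all distinct per position
scores = torch.tensor([[[0.1, 2.0, -1.0, 0.7, 1.5],
                        [3.0, 0.2,  0.9, 4.0, -2.0]]])
k = 3

top_by_argsort = torch.argsort(scores, dim=2, descending=True)[:, :, :k]
top_by_topk = torch.topk(scores, k=k, dim=2).indices

print(top_by_argsort)
print(torch.equal(top_by_argsort, top_by_topk))
```

With tied scores the two functions may order the tied indices differently, which is why the example avoids ties.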


# ...in one block...

In [46]:
texts=["Dogs like to [MASK] cats. They are cute."]
t=tokenizer(texts,padding=True, truncation=True, return_tensors="pt")
bert_out=bert(**t)
top20=torch.argsort(bert_out["prediction_logits"],dim=2,descending=True)[:,:,:20]
print("Guesses:",tokenizer.decode(top20[0,4]))

Guesses: have eat chase see pet keep play watch get scare be hunt ride like visit kill feed fight lick catch


In [47]:
print(t)
print(tokenizer.mask_token_id)

{'input_ids': tensor([[  101, 16406,  1176,  1106,   103, 11771,   119,  1220,  1132, 10509,
           119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
103
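Printing `tokenizer.mask_token_id` suggests a way to avoid the hard-coded position `4`: look up where the mask id occurs in `input_ids`. A sketch using the ids printed above; the random `logits` tensor is only a stand-in for `bert_out["prediction_logits"]`, with a made-up vocabulary size.

```python
import torch

torch.manual_seed(0)
mask_token_id = 103   # value of tokenizer.mask_token_id printed above
input_ids = torch.tensor([[101, 16406, 1176, 1106, 103, 11771, 119, 1220, 1132, 10509, 119, 102]])

# (row, column) index pairs of every [MASK] in the batch
mask_positions = (input_ids == mask_token_id).nonzero()
print(mask_positions)

# Stand-in logits: [batch, seq, vocab] with a small made-up vocabulary
logits = torch.randn(1, input_ids.shape[1], 50)
rows, cols = mask_positions[:, 0], mask_positions[:, 1]
mask_logits = logits[rows, cols]   # [number_of_masks, vocab]
print(mask_logits.shape)
```

This also works when a batch contains several masked positions, since each one gets its own row in `mask_logits`.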


# TASKS

As an exercise, you can try to solve the following:

1. How good is BERT at the masked language modelling (MLM) task? Feed it random texts, e.g. from the IMDB dataset, mask one random token at a time, and check: did BERT predict it correctly?
2. If you did (1), can you also check whether the correct token was among BERT's top-5 predictions?
3. See if you can do better. Write a program which picks random texts from one of the datasets we used in this course and produces two files: one with text segments containing one [MASK] each, and one with the correct answers. Then try to guess the words without looking at the latter file, and compare your answers with the correct ones. How well did you do?
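For tasks 1 and 2, the check itself is small. One possible sketch, where the function name `in_top_k` is hypothetical and the logits are hand-made rather than model output: given the logits at the masked position and the id of the original token, test whether that id is among the `k` highest-scoring ids. Choosing the texts, masking the tokens and running BERT are still left to do.

```python
import torch

def in_top_k(position_logits, gold_id, k):
    """True if gold_id is among the k highest-scoring vocabulary ids."""
    top_ids = torch.topk(position_logits, k=k).indices
    return bool((top_ids == gold_id).any())

# Hand-made scores over a vocabulary of 6 ids
position_logits = torch.tensor([0.3, 2.5, 1.1, -0.4, 1.9, 0.0])

print(in_top_k(position_logits, gold_id=1, k=1))
print(in_top_k(position_logits, gold_id=4, k=1))
print(in_top_k(position_logits, gold_id=4, k=2))
```

With `k=1` the function corresponds to the question in task 1, and with `k=5` to the one in task 2.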
