# AdvNLP BERTlab

## Session goal
In this session we use pre-trained BERT to predict missing words in English and German and to solve cloze tests (fill-in-the-blank questions).

We only run inference with pre-trained BERT and do no training, so you don't need a GPU to run this notebook.
To avoid platform-specific problems, we recommend running it on Colab.

In [1]:
# Uncomment the next line if pytorch-pretrained-bert is not installed yet.
#!pip install pytorch-pretrained-bert
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

Choose a model below. Use a **multilingual** model for German or any language other than English.
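
The `do_lower_case` flag passed to the tokenizer should agree with the checkpoint: uncased models expect lowercased input, cased models do not. Below is a minimal sketch of deriving the flag from the checkpoint name, assuming the name contains `uncased` for lowercased models (true for the four names listed in the next cell). `infer_do_lower_case` is a hypothetical helper, not part of the library.

```python
def infer_do_lower_case(model_name):
    """Guess the tokenizer's do_lower_case flag from a checkpoint name.

    Assumption: lowercased checkpoints have 'uncased' in their name.
    """
    return 'uncased' in model_name.lower()


for name in ['bert-base-uncased', 'bert-base-multilingual-cased']:
    print(name, infer_do_lower_case(name))
```

This avoids having to remember to flip the flag by hand when switching between the English and the multilingual model.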

In [2]:
import numpy as np

#model = 'bert-base-multilingual-uncased'
#model = 'bert-base-multilingual-cased'
model = 'bert-base-uncased'
#model = 'bert-large-uncased'
do_lower_case = True  # True for uncased models, False for cased ones

tokenizer = BertTokenizer.from_pretrained(model, do_lower_case=do_lower_case)
language_model = BertForMaskedLM.from_pretrained(model)
language_model.eval()

100%|██████████| 231508/231508 [00:00<00:00, 644831.45B/s]
100%|██████████| 407873900/407873900 [01:34<00:00, 4329004.05B/s] 


BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
         

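Enter your text below and mark the missing word with a single underscore (`_`). Populate the **candidates** list if you want BERT to choose from a set of predefined words; leave it empty to see every token BERT scores above 0.01.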

In [3]:
def ask_BERT(text, candidates, language_model, tokenizer):
    """Print BERT's predictions for the blank ('_') in `text`.

    Assumes `text` contains exactly one '_'. If `candidates` is non-empty,
    only those words are ranked; otherwise every vocabulary token with a
    probability above 0.01 is listed.
    """
    tokenized_text = tokenizer.tokenize(text)
    masked_index = []

    # Replace the '_' placeholder with BERT's [MASK] token and remember its position.
    for i, token in enumerate(tokenized_text):
        if token == '_':
            masked_index.append(i)
            tokenized_text[i] = '[MASK]'

    # Each candidate must be a single token in the model's vocabulary.
    candidates_ids = tokenizer.convert_tokens_to_ids(candidates)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

    # Single-segment input: every token gets segment id 0.
    segments_ids = [0] * len(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    # Inference only, so no gradients are needed.
    with torch.no_grad():
        predictions = language_model(tokens_tensor, segments_tensors)

    if len(candidates) > 0:
        # Softmax over the candidates' logits only, so the scores sum to 1
        # across the candidate set rather than across the whole vocabulary.
        predictions_candidates = predictions[0, masked_index[0], candidates_ids]
        probs = predictions_candidates.softmax(dim=0)

        probabilities = probs.detach().numpy()
        tokens = tokenizer.convert_ids_to_tokens(candidates_ids)

        tuple_list = [(x, probabilities[i]) for i, x in enumerate(tokens)]

        sorted_tuple_list = sorted(tuple_list, key=lambda x: x[1], reverse=True)
        print(text)
        print(sorted_tuple_list)

        answer_idx = torch.argmax(predictions_candidates).item()
        print(f'The most likely word is "{candidates[answer_idx]}".')

    else:
        # Softmax over the whole vocabulary at the masked position.
        predictions_candidates = predictions[0, masked_index[0]]
        probs = predictions_candidates.softmax(dim=0)
        threshold = 0.01  # only report tokens above this probability
        indices = np.where(probs > threshold)

        probabilities = probs[probs > threshold].detach().numpy()
        tokens = tokenizer.convert_ids_to_tokens(indices[0])

        tuple_list = [(x, probabilities[i]) for i, x in enumerate(tokens)]
        sorted_tuple_list = sorted(tuple_list, key=lambda x: x[1], reverse=True)
        print(text)
        print(sorted_tuple_list)
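
The first step of `ask_BERT` can be tried without loading a model. The sketch below uses a simple regular-expression tokenizer as a stand-in for `BertTokenizer` (an assumption for illustration only; the real tokenizer also splits words into WordPiece units) and shows how the `_` placeholder becomes `[MASK]` while its position is recorded. `simple_tokenize` and `mask_blanks` are hypothetical names.

```python
import re


def simple_tokenize(text):
    """Stand-in tokenizer: words, a lone underscore, or single punctuation marks."""
    return re.findall(r"[^\W_]+|_|[^\w\s]", text)


def mask_blanks(tokens, blank='_', mask='[MASK]'):
    """Return a masked copy of `tokens` and the positions that were masked."""
    masked = list(tokens)
    positions = []
    for i, token in enumerate(masked):
        if token == blank:
            masked[i] = mask
            positions.append(i)
    return masked, positions


tokens = simple_tokenize('Wir gehen durch _ Wald.')
masked, positions = mask_blanks(tokens)
print(masked)     # ['Wir', 'gehen', 'durch', '[MASK]', 'Wald', '.']
print(positions)  # [3]
```

Returning all masked positions makes it easy to check that a text contains exactly one blank before it is passed to the model.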

In [4]:
text = '”I am the first to arrive.” She thought and came to her desk.'
text = text + ' She was surprised to find a bunch of flowers on it.'
text = text + ' They were fresh. She _ them and they were sweet. She looked around for a vase to put them in.'
candidates = ['smelled', 'ate', 'took', 'held']

ask_BERT (text, candidates, language_model, tokenizer)

”I am the first to arrive.” She thought and came to her desk. She was surprised to find a bunch of flowers on it. They were fresh. She _ them and they were sweet. She looked around for a vase to put them in.
[('smelled', 0.8809953), ('ate', 0.0591424), ('took', 0.053083885), ('held', 0.0067784367)]
The most likely word is "smelled".
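
The four scores above add up to 1 because `ask_BERT` applies the softmax to the candidates' logits only, so each value is a probability relative to the candidate set rather than to the full vocabulary. Below is a minimal pure-Python sketch of that step. The logit values are made up for illustration and are not BERT outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


candidates = ['smelled', 'ate', 'took', 'held']
logits = [7.1, 4.4, 4.3, 2.2]  # hypothetical logits at the masked position

probs = softmax(logits)
ranked = sorted(zip(candidates, probs), key=lambda pair: pair[1], reverse=True)
print(ranked)
print(sum(probs))  # 1.0 up to floating-point rounding
```

A high score therefore means "best among the candidates", not necessarily a word BERT would have proposed on its own.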


In [5]:
text = 'Nancy had just got a new job in a company.\
        Monday was the first day she went to work, so she \
        was very _ and arrived early.'
candidates = ['depressed', 'encouraged', 'excited', 'surprised']
ask_BERT (text, candidates, language_model, tokenizer)

Nancy had just got a new job in a company.        Monday was the first day she went to work, so she         was very _ and arrived early.
[('excited', 0.8247292), ('surprised', 0.14804865), ('encouraged', 0.021110376), ('depressed', 0.0061116396)]
The most likely word is "excited".


In [6]:
text = 'Nancy had just got a new job in a company.\
        Monday was the first day she went to work, so she \
        was very excited and arrived early.'
text = text + ' I am the _ to arrive.” She thought and came to her desk.'
candidates = ['last', 'second', 'third', 'first']
ask_BERT (text, candidates, language_model, tokenizer)

Nancy had just got a new job in a company.        Monday was the first day she went to work, so she         was very excited and arrived early. I am the _ to arrive.” She thought and came to her desk.
[('first', 0.916222), ('last', 0.07019155), ('second', 0.011083992), ('third', 0.0025024929)]
The most likely word is "first".


For German, we need Multilingual BERT.

In [7]:
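model = 'bert-base-multilingual-cased'

# do_lower_case is still True from the cell above; for a cased model the
# library overrides it to False and prints the warning shown below.
tokenizer = BertTokenizer.from_pretrained(model, do_lower_case=do_lower_case)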
language_model = BertForMaskedLM.from_pretrained(model)
language_model.eval()

The pre-trained model you are loading is a cased model but you have not set `do_lower_case` to False. We are setting `do_lower_case=False` for you but you may want to check this behavior.
100%|██████████| 995526/995526 [00:00<00:00, 1769855.51B/s]
100%|██████████| 662804195/662804195 [00:40<00:00, 16261131.17B/s]


BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
        

In [8]:
text = 'Wir gehen durch _ Wald.'
ask_BERT (text, [], language_model, tokenizer)

Wir gehen durch _ Wald.
[('den', 0.85803), ('einen', 0.054593645), ('ein', 0.016730534), ('diesen', 0.015647667)]


In [10]:
text = 'Die Trophäe würde nicht in den braunen Koffer passen, weil _ zu groß sei.'
ask_BERT (text, [], language_model, tokenizer)

Die Trophäe würde nicht in den braunen Koffer passen, weil _ zu groß sei.
[('er', 0.38623446), ('sie', 0.20512679), ('es', 0.18469486), ('diese', 0.026795492), ('dieser', 0.017069455), ('man', 0.012291713)]
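
Without candidates, `ask_BERT` reports only tokens whose probability exceeds 0.01, so the printed values need not sum to 1. The small sketch below uses the numbers printed above to show how much probability mass is spread over tokens under that threshold.

```python
# Probabilities copied from the output above (tokens above the 0.01 threshold).
reported = [('er', 0.38623446), ('sie', 0.20512679), ('es', 0.18469486),
            ('diese', 0.026795492), ('dieser', 0.017069455), ('man', 0.012291713)]

shown_mass = sum(prob for _, prob in reported)
hidden_mass = 1.0 - shown_mass

print(f'shown:  {shown_mass:.4f}')   # shown:  0.8322
print(f'hidden: {hidden_mass:.4f}')  # hidden: 0.1678
```

Roughly a sixth of the probability mass falls on tokens that each score below 1%, and the top prediction `er` receives well under half of the total.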


Below you can find all the text used in the slides. Pass each string to *ask_BERT* to see whether Multilingual BERT knows enough about German grammar. Each assignment overwrites the previous one, so keep only the example you want to run. Be sure to switch back to the monolingual model if you wish to use the English-language examples below.

In [None]:
# Examples from the slides (mostly German, with some French and English)
text = 'Bitte legen Sie es auf _ Schreibtisch.'
text = 'Es ist auf _ Schreibtisch.'
text = 'Wir gehen durch _ Wald.'
text = 'Sie kommen aus _ Schweiz.'
text = 'Ich ging trotz _ Erkältung zur Arbeit.'

text = 'Die Trophäe würde nicht in den braunen Koffer passen, weil _ zu groß war.'
text = 'Le trophée ne rentrait pas dans la valise marron parce qu\'_ était trop grande.'
text = 'The trophy would not fit in the brown suitcase because it was too big. What was too big, the trophy or the suitcase? The _.'

text = 'Mark visited Janet\'s grave in 1765. At that time, _ had been traveling for five years.'
text = 'Mark visited Janet\'s grave in 1765. Mark was alive, Janet was dead. At that time, _ had been traveling for five years.'

text = 'Nancy had just got a new job in a company.\
        Monday was the first day she went to work, so she \
        was very _ and arrived early.'
candidates = ['depressed', 'encouraged', 'excited', 'surprised']

text = 'She _ the door open and found nobody there.'
candidates = ['turned', 'pushed', 'knocked', 'forced']

text = 'Ich ging trotz _ Erkältung zur Arbeit.'
text = 'Ich träume von _ sprechenden Delphin.'
text = 'Ich träume _, mit einem Delfin zu sprechen.'
text = 'Die Trophäe würde nicht in den braunen Koffer passen, weil _ zu klein ist.'

text = 'Mark visited Janet\'s grave in 1765. At that time, _ had been dead for five years.'
text = 'Janet had died. Mark visited Janet\'s grave in 1765 (Janet had died). At that time, _ had been traveling for five years.'

text = 'Die Trophäe würde nicht in den braunen Koffer passen, weil _ zu klein ist.'

# English examples
text = 'She _ the door open and found nobody there.'
candidates = ['turned', 'pushed', 'knocked', 'forced']

text = 'Mark visited Janet\'s grave in 1765. Mark was alive, Janet was dead. At that time, _ had been traveling for five years.'
candidates = []

text = 'Nancy had just got a new job in a company.\
        Monday was the first day she went to work, so she \
        was very excited and arrived early.'
candidates = ['depressed', 'encouraged', 'excited', 'surprised']
text = ' She pushed the door open and found nobody there.'
candidates = ['turned', 'pushed', 'knocked', 'forced']
text = '”I am the _ to arrive.” She thought and came to her desk.'
candidates = ['last', 'second', 'third', 'first']
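# Fill in the first blank before appending more sentences, so the text keeps a single '_'.
text = '”I am the first to arrive.” She thought and came to her desk.'
text = text + ' She was surprised to find a bunch of flowers on it.'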
text = text + ' They were fresh. She _ them and they were sweet. She looked around for a vase to put them in.'
candidates = ['smelled', 'ate', 'took', 'held']
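
Since each assignment in the cell above overwrites `text` (and `candidates`), running it as-is keeps only the last example. One way to try several at once is to collect (text, candidates) pairs in a list and loop over them. In this sketch, `run_examples` is a hypothetical helper and `ask` stands for any function that takes a text and a candidate list.

```python
def run_examples(examples, ask):
    """Call `ask(text, candidates)` once per (text, candidates) pair."""
    for text, candidates in examples:
        ask(text, candidates)


examples = [
    ('Bitte legen Sie es auf _ Schreibtisch.', []),
    ('Es ist auf _ Schreibtisch.', []),
    ('Sie kommen aus _ Schweiz.', []),
]

# Stand-in that only records its input. In the notebook, pass
# lambda t, c: ask_BERT(t, c, language_model, tokenizer) instead.
calls = []
run_examples(examples, lambda text, candidates: calls.append((text, candidates)))
print(len(calls))  # 3
```

Keeping the German and English examples in separate lists makes it easier to run each list with the matching model loaded.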

In [11]:
text = 'Bitte legen Sie es auf _ Schreibtisch.'
ask_BERT (text, [], language_model, tokenizer)

Bitte legen Sie es auf _ Schreibtisch.
[('den', 0.5549784), ('dem', 0.18432878), ('einen', 0.09112396), ('einem', 0.087149836), ('der', 0.0155886095)]
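
One way to read the output above is to group the predicted articles by grammatical case. *Schreibtisch* is masculine, so `den`/`einen` are accusative forms and `dem`/`einem` are dative forms. With *legen ... auf* (placing something onto the desk), the accusative is the expected case. The sketch below sums the printed probabilities per case. The mapping `case_of` is written by hand for this example and is an assumption of the sketch, not something BERT returns.

```python
# Predictions copied from the output above.
predictions = [('den', 0.5549784), ('dem', 0.18432878), ('einen', 0.09112396),
               ('einem', 0.087149836), ('der', 0.0155886095)]

# Hand-written mapping for masculine singular articles (illustration only).
case_of = {'den': 'accusative', 'einen': 'accusative',
           'dem': 'dative', 'einem': 'dative'}

mass = {}
for token, prob in predictions:
    case = case_of.get(token, 'other')
    mass[case] = mass.get(case, 0.0) + prob

for case, total in sorted(mass.items(), key=lambda item: item[1], reverse=True):
    print(f'{case}: {total:.3f}')
```

On these numbers the accusative forms collect about 0.65 of the probability and the dative forms about 0.27, so the model prefers the expected case but still gives the dative noticeable weight.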


In [12]:
text = 'Die grössten Einbussen bei den Grossunternehmen gab es neben dem Personalvermittler Adecco bei _ Finanzwerten Partners Group, UBS und Julius Bär, die zeitweise zwischen 8 und 9 Prozent einbrachen.'
ask_BERT (text, [], language_model, tokenizer)

Die grössten Einbussen bei den Grossunternehmen gab es neben dem Personalvermittler Adecco bei _ Finanzwerten Partners Group, UBS und Julius Bär, die zeitweise zwischen 8 und 9 Prozent einbrachen.
[('den', 0.9849413)]


In [13]:
text = 'Die grössten Einbussen bei den Grossunternehmen gab es neben _ Personalvermittler Adecco bei den Finanzwerten Partners Group, UBS und Julius Bär, die zeitweise zwischen 8 und 9 Prozent einbrachen.'
ask_BERT (text, [], language_model, tokenizer)

Die grössten Einbussen bei den Grossunternehmen gab es neben _ Personalvermittler Adecco bei den Finanzwerten Partners Group, UBS und Julius Bär, die zeitweise zwischen 8 und 9 Prozent einbrachen.
[('dem', 0.92739695), ('der', 0.030345194), ('den', 0.02083853)]
