# Infer masked token with BERT

The task is simple: given a sentence with a masked token, the model should predict which token can replace the masked position. For example, given `The car is [MASK]`, a possible solution could be `red`. This task is part of the BERT pretaining described by its authors (see: https://arxiv.org/pdf/1810.04805.pdf), thus we do not need to fine-tune the model.

## Prepare and import modules

With your environment configured, you can now prepare and import the BERT modules.

In [1]:
!pip install torchvision



In [2]:
!pip install transformers



In [3]:
import numpy as np
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Get model

Visit https://huggingface.co/transformers/pretrained_models.html to see the full list of pretrained models.

In [4]:
BERT_MODEL = 'bert-base-multilingual-cased'

In [5]:
tokenizer = BertTokenizer.from_pretrained(BERT_MODEL)
model = BertForMaskedLM.from_pretrained(BERT_MODEL)

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMaskedLM were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Tokenize & encode

BERT expects 3 different inputs:
- Token IDs: the tokens transformed into numbers.
- Mask: sequence of `0` (if there are PAD tokens in that position) and `1` (otherwise).
- Segments or Type IDs: sequence of `0` and `1` to distinguish between the first and the second sentence in NSP tasks. In this notebook, we do not need this input, so it will be always `0`.

The only constraint is that the **maximum number of tokens is 512**. Please, note that there are extra tokens which we are going to add (see the next section).

For example:
```
Text:       Is this jacksonville?
---------------------------------------------------------------------------------
Tokens:     [CLS] Is    this  ja    ##cks ##on  ##ville ?   [SEP] [PAD] [PAD] ...
Token IDs:  101   12034 10531 10201 18676 10263 12043   136 102   100   100   ...
Mask:       1     1     1     1     1     1     1       1   1     0     0     ...
Type IDs:   0     0     0     0     0     0     0       0   0     0     0     ...
```

Note: Token IDs may be different depending on the tokenizer.

For further details, see BERT implementation `convert_single_example()` at https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/run_classifier.py#L377 (permalink).

### Manual encoding

This section is an example of how to get the encoded input of BERT.

We use the BERT Tokenizer to split the text into tokens. Then, according to the paper, it is necessary to add two tokens at the beginning and the ending: `[CLS]` and `[SEP]`.

In [6]:
def get_masks(tokens, max_len=128):
    """Mask for padding"""
    if len(tokens) > max_len:
        raise IndexError("Token length more than max length!")
    return [1] * len(tokens) + [0] * (max_len - len(tokens))


def get_segments(tokens, max_len=128):
    """Segments: 0 for the first sequence, 1 for the second"""
    if len(tokens) > max_len:
        raise IndexError("Token length more than max length!")
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (max_len - len(tokens))


def get_ids(tokens, tokenizer, max_len=128):
    """Token ids from Tokenizer vocab"""
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [tokenizer.pad_token_id] * (max_len - len(token_ids))
    return input_ids


def encode_input(sentence, max_len=128):
    stokens = tokenize(sentence)
    
    input_ids = get_ids(stokens, tokenizer, max_len)
    input_masks = get_masks(stokens, max_len)
    input_segments = get_segments(stokens, max_len)
    
    model_input = {
        'input_word_ids': np.array([input_ids]),
        'input_masks': np.array([input_masks]),
        'input_segments': np.array([input_segments])
    }
    
    mask_pos = np.array(np.array(stokens) == '[MASK]', dtype='int')
    mask_pos = np.concatenate((mask_pos, np.zeros(max_len - len(mask_pos))))
    mask_pos = mask_pos.astype('int')
    
    return model_input, mask_pos

In [7]:
def tokenize(sentence):
    stokens = tokenizer.tokenize(sentence)
    i = 0
    while i < len(stokens) - 2:
        if stokens[i] == '[' and stokens[i+1] == 'mask' and stokens[i+2] == ']':
            stokens[i] = '[MASK]'
            stokens.pop(i+2)
            stokens.pop(i+1)
        i = i + 1
    
    stokens = ['[CLS]'] + stokens + ['[SEP]']
    return stokens

In [8]:
text = "Hello, I'm a [MASK] model."
tokenize(text)

['[CLS]', 'Hello', ',', 'I', "'", 'm', 'a', '[MASK]', 'model', '.', '[SEP]']

In [9]:
# Example of BERT inputs
text = "Hello, I'm a [MASK] model."
encode_input(text, max_len=15) # max_len=15 for display purpose

({'input_word_ids': array([[  101, 31178,   117,   146,   112,   181,   169,   103, 13192,
            119,   102,     0,     0,     0,     0]]),
  'input_masks': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]),
  'input_segments': array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])},
 array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]))

### Torch encoder

Torch has already implemented similar code as above, thus you should use these functions if you do not need any custom behaviour.

Here you can see the result of `tokenize`. Note that it does not include `[CLS]` and `[SEP]`, but the result of `encoded_input` does include them.

In [10]:
text = "Hello, I'm a [MASK] model."
tokenizer.tokenize(text)

['Hello', ',', 'I', "'", 'm', 'a', '[MASK]', 'model', '.']

In [11]:
text = "Hello, I'm a [MASK] model."

encoded_input = tokenizer(text, return_tensors='pt')
encoded_input

{'input_ids': tensor([[  101, 31178,   117,   146,   112,   181,   169,   103, 13192,   119,
           102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

## Infer masked word

Now, we can use the described encoded data as input of our BERT model and, then, we apply softmax to the output. The result are the probabilities that a token fit in a certain position, so we should focus on the position of the masked token and get the top probabilities.

In [12]:
def get_topk_predictions(model, tokenizer, text, topk=5):
    encoded_input = tokenizer(text, return_tensors='pt')
    logits = model(encoded_input['input_ids'],
                   encoded_input['token_type_ids'],
                   encoded_input['attention_mask'],
                   masked_lm_labels=None)[0]

    logits = logits.squeeze(0)
    probs = torch.softmax(logits, dim=-1)

    mask_cnt = 0
    token_ids = encoded_input['input_ids'][0]
    
    top_preds = []

    for idx, _ in enumerate(token_ids):
        if token_ids[idx] == tokenizer.mask_token_id:
            mask_cnt += 1
            
            topk_prob, topk_indices = torch.topk(probs[idx, :], topk)
            topk_indices = topk_indices.cpu().numpy()
            topk_tokens = tokenizer.convert_ids_to_tokens(topk_indices)
            for prob, tok_str, tok_id in zip(topk_prob, topk_tokens, topk_indices):
                top_preds.append({'token_str': tok_str,
                                  'token_id': tok_id,
                                  'probability': float(prob)})
    
    return top_preds

In [13]:
def display_topk_predictions(model, tokenizer, text, pretty_prob=False):
    top_preds = get_topk_predictions(model, tokenizer, text)
    
    print(text)
    print('=' * 40)
    for item in top_preds:
        if not pretty_prob:
            print('%s %.4f' % (item['token_str'], item['probability']))
        else:
            probability = item['probability'] * 100
            print('%s %.2f%%' % (item['token_str'], probability))

In [20]:
text = "Hello, I'm a [MASK] model."
display_topk_predictions(model, tokenizer, text, pretty_prob=True)

Hello, I'm a [MASK] model.
model 6.66%
real 4.59%
business 3.30%
mathematical 3.16%
new 2.78%


In [27]:
text = 'The doctor ran to the emergency room to see a [MASK].'
display_topk_predictions(model, tokenizer, text, pretty_prob=True)

The doctor ran to the emergency room to see a [MASK].
doctor 7.59%
woman 3.52%
fire 2.40%
patient 2.38%
problem 1.87%


The predictions in English seems to work quite good.

Since I am using a multilingual model, it should have nice predictions in other languages too, e.g. Spanish.

In [22]:
text = 'Este coche es [MASK].'
display_topk_predictions(model, tokenizer, text, pretty_prob=True)

Este coche es [MASK].
perdido 2.83%
coupé 2.72%
retirado 2.69%
motor 1.57%
abandonado 1.56%


Despite I introduced a very simple sentence (`This car is [MASK]`), the outputs do not make sense in a sentence (although they are related to cars). I think that the model is confusing the two Spanish verbs `ser` and `estar` (both mean `to be` in English) or it does not have enough context to output a good result.

Here I added the adverb "*very*" to the previous sentence (`This car is very [MASK].`), so I expect that it outputs better tokens since the context is better too.

In [23]:
text = 'Este coche es muy [MASK].'
display_topk_predictions(model, tokenizer, text, pretty_prob=True)

Este coche es muy [MASK].
popular 59.34%
sencillo 3.47%
raro 2.98%
vendido 1.42%
pequeño 1.41%


In this case, the model success to return tokens which make sense.

Here is one more example but, in this case, the sentence is written in Russian: `I think that Nastya is a very [MASK] person`. Spoiler: the output words make sense :)

In [18]:
text = 'Я считаю, что Настя очень [MASK] человек.'
display_topk_predictions(model, tokenizer, text, pretty_prob=True)

Я считаю, что Настя очень [MASK] человек.
молодой 54.84%
большой 20.15%
великий 5.03%
лучший 4.91%
новый 1.64%
