<a href="https://colab.research.google.com/github/Onamihoang/NLP-BERT/blob/master/BERT_MLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
pip install pytorch_transformers

# MLM
This is a demo of BERT's masked language modeling (MLM) task.

This notebook is based on Eyal Gruss's [tutorial](https://github.com/eyaler/workshop/blob/master/bert.ipynb).


In [0]:
import torch
from pytorch_transformers import BertTokenizer, BertModel, BertForMaskedLM

In [0]:
# Init tokenizer. BERT uses its own WordPiece (subword) tokenizer.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load the pre-trained model (weights) together with its masked-LM head
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Set the model to evaluation mode to deactivate the dropout modules.
# This is IMPORTANT to get reproducible results during evaluation!
model.eval()

# Use the GPU when one is available, otherwise stay on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)

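To give a feel for what a WordPiece-style tokenizer does, here is a minimal sketch of greedy longest-match-first splitting of a single word. It is not the `BertTokenizer` implementation: the function name `wordpiece_tokenize` and the tiny vocabulary are invented for illustration, and the real tokenizer also handles lowercasing, punctuation and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word (illustrative sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # nothing matched: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary, made up for this example (the real one has about 30k entries).
toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}

print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("unplayable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Splitting rare words into known pieces is what lets a fixed vocabulary cover text it has never seen as whole words.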

The function below runs masked language modeling.

It first preprocesses the text by adding the mandatory BERT tokens **[CLS]** and **[SEP]**.
It then converts the tokens into their respective IDs in the vocabulary.
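The preprocessing step can be sketched without the real tokenizer. The vocabulary, its IDs and the helper name `preprocess` below are hypothetical; the actual mapping comes from `tokenizer.convert_tokens_to_ids`.

```python
# Toy vocabulary: every token and ID here is invented for illustration.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
             "the": 5, "boy": 6, "to": 7, "school": 8}

def preprocess(tokens, vocab):
    """Wrap tokens in [CLS] ... [SEP] and map each one to its vocabulary ID."""
    wrapped = ["[CLS]"] + tokens + ["[SEP]"]
    # Tokens missing from the vocabulary fall back to the [UNK] ID.
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in wrapped]
    return wrapped, ids

tokens, ids = preprocess(["the", "boy", "[MASK]", "to", "the", "school"], toy_vocab)
masked_index = tokens.index("[MASK]")  # position whose prediction we care about

print(tokens)
print(ids)
print(masked_index)
```

Note that adding **[CLS]** shifts every position by one, so the mask index must be computed on the wrapped token list.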

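For every token fed into the model, we get back a vector of scores (logits) with one entry per vocabulary word. Applying a softmax to the vector at the masked position turns those scores into probabilities indicating which words best fit that position in the sentence.

We then take the tokens corresponding to the highest-probability indices.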

<img src='http://jalammar.github.io/images/BERT-language-modeling-masked-lm.png' width="600px"/>


(image source: [The Illustrated BERT](http://jalammar.github.io/illustrated-bert/))
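The scoring step described above can be sketched in plain Python: turn the scores at the masked position into probabilities with a softmax, then keep the best few. The four-word vocabulary and the logit values are made up; in the real model the vector has one entry per vocabulary word.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]

def top_candidates(logits, id_to_token, topn=3):
    """Return the topn (token, probability) pairs, most likely first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id_to_token[i], round(probs[i], 4)) for i in ranked[:topn]]

# Hypothetical scores for the masked position in 'The boy [MASK] to the school'.
toy_tokens = ["went", "walked", "banana", "ran"]
toy_logits = [2.0, 1.0, -1.0, 0.5]

candidates = top_candidates(toy_logits, toy_tokens, topn=3)
print(candidates)
```

Because softmax preserves the order of the scores, ranking by logit or by probability gives the same candidates; the softmax only makes the values interpretable.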






In [0]:
"""
Find the best matching guesses for the masked word.
text: a sentence that must include exactly one [MASK] token to predict.
      For example:
      'Alex likes to have [MASK] with his best friend'
model: a BertForMaskedLM
tokenizer: a BERT tokenizer
topn: number of candidates to return for the mask

Returns a list of (token, probability) pairs, most likely first
"""
def predict_word(text, model, tokenizer, topn=10):
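  # Prepare text: add [CLS] and [SEP] unless they are already present.
  # (str.lstrip/rstrip remove a set of characters, not a prefix or suffix,
  # so they could eat real letters such as a leading 'C' or 'S'.)
  text = text.strip()
  if not text.startswith('[CLS]'):
    text = '[CLS] ' + text
  if not text.endswith('[SEP]'):
    text = text + ' [SEP]'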
  # Tokenize input
  tokenized_text = tokenizer.tokenize(text)

  # Locate the [MASK] token that we will try to predict with `BertForMaskedLM`
  masked_index = -1
  for i, token in enumerate(tokenized_text):
    if token=='[MASK]':
      masked_index = i
      break
  assert masked_index >= 0, 'text must contain a [MASK] token'

  # Convert tokens to vocabulary indices
  indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
  # Segment IDs mark sentence A (0) vs. sentence B (1), see the paper.
  # The input is a single sentence, so every position gets 0.
  segments_ids = [0]*len(tokenized_text)

  # Convert inputs to PyTorch tensors
  tokens_tensor = torch.tensor([indexed_tokens])
  segments_tensors = torch.tensor([segments_ids])  

  # Put the inputs on the same device as the model (GPU or CPU)
  device = next(model.parameters()).device
  tokens_tensor = tokens_tensor.to(device)
  segments_tensors = segments_tensors.to(device)

  # Predict scores (logits) for all tokens
  with torch.no_grad():
      outputs = model(tokens_tensor, token_type_ids=segments_tensors)
      predictions = outputs[0]
  print("Predictions shape: " + str(predictions[0].shape))
  # Rank the vocabulary by score at the masked position, best first
  predicted_inds = torch.argsort(-predictions[0, masked_index])
  # Softmax turns the logits into probabilities, reordered to match the ranking
  predicted_probs = [round(p.item(),4) for p in torch.softmax(predictions[0, masked_index], 0)[predicted_inds]]
  predicted_tokens = tokenizer.convert_ids_to_tokens([ind.item() for ind in predicted_inds])
  
  return list(zip(predicted_tokens, predicted_probs))[:topn]



In [0]:
predict_word('The boy [MASK] to the school', model, tokenizer)

In [0]:
predict_word('My friend [MASK] is a mother', model, tokenizer)

In [0]:
predict_word('My friend [MASK] is a programmer', model, tokenizer)

In [0]:
predict_word('My friend [MASK] is a doctor', model, tokenizer)