# BERT Masked Language Modeling

This notebook demonstrates masked language modeling (MLM) with BERT. In MLM, some tokens in a sentence are replaced with a `[MASK]` placeholder and the model is trained to predict the original tokens from the surrounding context. BERT was pre-trained with this objective, so a pre-trained checkpoint can fill in masked words without further training, and it can also be fine-tuned for other NLP tasks.

This tutorial is based on Eyal Gruss's [tutorial](https://github.com/eyaler/workshop/blob/master/bert.ipynb).

In [None]:
!pip install transformers==4.31.0

In [None]:
import torch
from transformers import BertTokenizer, BertForMaskedLM

## Loading the Model and Tokenizer

This section loads the pre-trained BERT model and its tokenizer. We use `bert-base-uncased`, the smaller of the two original BERT sizes (12 layers, roughly 110M parameters), which lowercases its input. It is faster and lighter than the BERT-large variants.

In [None]:
# Initialize the tokenizer; BERT uses its own WordPiece (subword) tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load pre-trained model (weights) and set to evaluation mode
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# Move model to CUDA if available, otherwise keep it on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
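
To make the WordPiece idea concrete, here is a minimal sketch of greedy longest-match-first subword splitting, the matching strategy WordPiece tokenizers use when tokenizing text. The vocabulary `TOY_VOCAB` and the function `wordpiece_split` are hypothetical names made up for this illustration; the real `bert-base-uncased` vocabulary has about 30,000 entries, and the library's tokenizer also handles lowercasing, punctuation and special tokens.

```python
# Toy vocabulary (an assumption for illustration, not BERT's real vocabulary).
# Pieces that continue a word carry the '##' prefix, as in BERT's vocabulary.
TOY_VOCAB = {"play", "##ing", "##ed", "un", "##play", "##able", "the", "[UNK]"}

def wordpiece_split(word, vocab=TOY_VOCAB):
    """Split one lowercase word into subword pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces

print(wordpiece_split("playing"))     # ['play', '##ing']
print(wordpiece_split("unplayable"))  # ['un', '##play', '##able']
print(wordpiece_split("xyz"))         # ['[UNK]']
```

Splitting rare words into known pieces is what lets a fixed vocabulary cover open-ended text, and it is also why some predictions returned later in this notebook may start with `##`.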

## Running Masked Language Modeling

This section shows how to run masked language modeling using the `predict_word` function.

### The `predict_word` Function

The `predict_word` function takes a text input with a masked token (`[MASK]`) and predicts the most likely words to fill the mask. It uses the BERT model and tokenizer to process the input and generate predictions.

**Inputs:**

* `text`: A sentence containing exactly one `[MASK]` token to predict. For example: 'Alex likes to have [MASK] with his best friend'.
* `model`: A `BertForMaskedLM` model.
* `tokenizer`: A BERT tokenizer.
* `topn`: The number of candidates to return (default: 10).

**Outputs:**

* A list of `(token, probability)` tuples, ordered from the most to the least likely candidate.

**Example Usage:**


```
predictions = predict_word('The boy [MASK] to the school', model, tokenizer)
print(predictions)
```

This will print the top 10 predicted words and their probabilities for the masked token in the sentence "The boy [MASK] to the school".
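
Because the result is a plain list of `(token, probability)` tuples, it is easy to post-process. The sketch below uses a hand-written `example_predictions` list with made-up values (not real model output) and a hypothetical helper, `whole_words`, that drops WordPiece continuation pieces, which start with `##`, before printing.

```python
# Made-up values standing in for the return value of predict_word (not real model output)
example_predictions = [("went", 0.41), ("##s", 0.08), ("walked", 0.2), ("ran", 0.05)]

def whole_words(predictions):
    """Keep only candidates that are whole words, i.e. not '##' continuation pieces."""
    return [(token, prob) for token, prob in predictions if not token.startswith("##")]

# Print one candidate per line with its probability as a percentage
for token, prob in whole_words(example_predictions):
    print(f"{token:<10} {prob:.2%}")
```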

In [None]:
"""
Find the best matching guesses for the masked word
text: a sentence that must include exactly one [MASK] token to predict.
      For example:
      'Alex likes to have [MASK] with his best friend'
model: a BertForMaskedLM
tokenizer: a BERT tokenizer
topn: number of candidates to return for the mask

Returns a list of (token, probability) tuples
"""
def predict_word(text, model, tokenizer, topn=10):
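  # Prepare text: drop literal [CLS]/[SEP] markers if the caller included them, then add them back
  text = text.strip()
  if text.startswith('[CLS]'):
    text = text[len('[CLS]'):]
  if text.endswith('[SEP]'):
    text = text[:-len('[SEP]')]
  text = '[CLS] ' + text.strip() + ' [SEP]'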
  # Tokenize input
  tokenized_text = tokenizer.tokenize(text)

  # Locate the [MASK] token whose original word we will predict with `BertForMaskedLM`
  masked_index = -1
  for i, token in enumerate(tokenized_text):
    if token=='[MASK]':
      masked_index = i
      break
  assert masked_index >= 0, 'text must contain a [MASK] token'

  # Convert tokens to vocabulary indices
  indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
  # Segment ids distinguish sentence A (0) from sentence B (1) (see paper); we have one sentence, so all zeros
  segments_ids = [0]*len(tokenized_text)

  # Convert inputs to PyTorch tensors
  tokens_tensor = torch.tensor([indexed_tokens])
  segments_tensors = torch.tensor([segments_ids])

  # Move the input tensors to the same device as the model (GPU if available)
  tokens_tensor = tokens_tensor.to(model.device)
  segments_tensors = segments_tensors.to(model.device)

  # Predict all tokens; the logits have shape (batch, sequence length, vocabulary size)
  with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs.logits
  print("Predictions shape: " + str(predictions[0].shape))

  # Turn the logits at the masked position into probabilities and keep the topn best
  probs = torch.softmax(predictions[0, masked_index], dim=0)
  top_probs, top_inds = torch.topk(probs, topn)
  predicted_probs = [round(p, 4) for p in top_probs.tolist()]
  predicted_tokens = tokenizer.convert_ids_to_tokens(top_inds.tolist())

  return list(zip(predicted_tokens, predicted_probs))
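
One detail worth care in the text-preparation step of `predict_word`: Python's `str.lstrip` and `str.rstrip` remove any run of the given *characters*, not a literal prefix or suffix. The sketch below shows the difference, along with a hypothetical helper, `wrap_with_special_tokens` (a name introduced here, not part of `transformers`), that removes the markers only when they appear literally.

```python
# str.lstrip treats its argument as a set of characters to remove
print("She [MASK] home".lstrip("[CLS] "))  # 'he [MASK] home' (the leading 'S' is lost)

def wrap_with_special_tokens(text):
    """Return text wrapped as '[CLS] ... [SEP]', without duplicating existing markers."""
    text = text.strip()
    if text.startswith("[CLS]"):
        text = text[len("[CLS]"):]
    if text.endswith("[SEP]"):
        text = text[:-len("[SEP]")]
    return "[CLS] " + text.strip() + " [SEP]"

print(wrap_with_special_tokens("She [MASK] home"))
print(wrap_with_special_tokens("[CLS] She [MASK] home [SEP]"))
```

Both calls to the helper should yield the same wrapped sentence, whether or not the caller already added the markers.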

## Masked Language Modeling with BERT

The `predict_word` function above performs masked language modeling with BERT. Here's a breakdown of the process:

1. **Preprocessing:** The input text is wrapped with BERT's special tokens: `[CLS]` at the beginning and `[SEP]` at the end. BERT was pre-trained on inputs in this format, so predictions are most reliable when the same markers are present.
2. **Tokenization and Conversion:** The text is split into WordPiece tokens (whole words or subword pieces), and each token is converted to its numerical ID in BERT's vocabulary.
3. **Prediction:** For every position in the input, the model outputs a vector of scores (logits) with one entry per vocabulary item. Applying a softmax to the vector at the masked position turns it into a probability distribution over the vocabulary.
4. **Selection:** The function ranks that distribution and returns the `topn` tokens with the highest probabilities as candidates for the masked word.
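
Steps 3 and 4 can be sketched without the model at all. The example below, in plain Python, applies a softmax to a small vector of scores and picks the highest-probability entries. The five-word `toy_vocab` and the `toy_logits` values are invented for illustration; in the real function the vector has one entry per item in BERT's vocabulary.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_n(vocab, logits, n):
    """Return the n (token, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return [(token, round(prob, 4)) for token, prob in ranked[:n]]

# Invented scores for a masked position (assumption: not real BERT output)
toy_vocab = ["went", "walked", "ran", "is", "banana"]
toy_logits = [4.0, 3.0, 2.0, 0.5, -3.0]

print(top_n(toy_vocab, toy_logits, 3))
```

Because softmax preserves the ordering of the scores, ranking by logits and ranking by probabilities select the same tokens; the softmax only makes the numbers interpretable as probabilities.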

<img src='http://jalammar.github.io/images/BERT-language-modeling-masked-lm.png' width="600px"/>


(image source: [The Illustrated BERT](http://jalammar.github.io/illustrated-bert/))






In [None]:
predict_word('The boy [MASK] to the school', model, tokenizer)

In [None]:
predict_word('My friend [MASK] is a mother', model, tokenizer)

In [None]:
predict_word('My friend [MASK] is a programmer', model, tokenizer)

In [None]:
predict_word('My friend [MASK] is a doctor', model, tokenizer)
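
The last three cells differ only in the final word, so it can be convenient to look at them side by side. Below is a minimal sketch of a hypothetical helper, `compare_prompts`, that runs any prediction function over several sentences and reports the probability each one assigns to a few chosen tokens. `predict_fn` is assumed to behave like `lambda s: predict_word(s, model, tokenizer)`; the stub `fake_predict` and its numbers are invented only so that the snippet runs without downloading a model.

```python
def compare_prompts(prompts, tokens_of_interest, predict_fn):
    """Map each prompt to {token: probability}, using 0.0 for tokens missing from the candidates."""
    table = {}
    for prompt in prompts:
        candidates = dict(predict_fn(prompt))
        table[prompt] = {tok: candidates.get(tok, 0.0) for tok in tokens_of_interest}
    return table

# Stub with invented numbers (assumption: stands in for predict_word)
def fake_predict(prompt):
    return [("who", 0.5), ("he", 0.2), ("she", 0.1)]

prompts = [
    "My friend [MASK] is a mother",
    "My friend [MASK] is a programmer",
    "My friend [MASK] is a doctor",
]
result = compare_prompts(prompts, ["he", "she"], fake_predict)
for prompt, probs in result.items():
    print(prompt, probs)

# With the notebook's model loaded, one would instead pass:
# compare_prompts(prompts, ["he", "she"], lambda s: predict_word(s, model, tokenizer))
```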