#  **Masked Language Modeling with BERT**

This notebook uses a pre-trained BERT model for masked language modeling (MLM): given a sentence with one word hidden behind a mask token, the model uses the surrounding context to predict the missing word.

##  Setup and Installation

Install the Hugging Face `transformers` library, which provides the model and tokenizer used below.

In [None]:
!pip install -U transformers



##  Importing Libraries

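Import the model and tokenizer classes from `transformers`, NumPy for sorting the scores, and SciPy's `softmax` for converting logits into probabilities.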

In [None]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
import numpy as np
from scipy.special import softmax

##  Model Setup

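Load the pre-trained `bert-base-cased` checkpoint with a masked-language-modeling head, along with its matching tokenizer. The warning printed after the cell is expected: the checkpoint also contains pooler and next-sentence-prediction weights that `BertForMaskedLM` does not use.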

In [None]:
model_name = "bert-base-cased"

# Loading the pre-trained model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


##  Defining the Mask Token

Identify the mask token used by BERT to signify where predictions are needed in the sentence.

##  Creating the Input Sentence

Craft a sentence with a missing word indicated by the mask token, to test the model's predictive power.

In [None]:
# Defining the mask token
mask = tokenizer.mask_token

# Defining the sentence
sentence = f"I want to {mask} pizza for tonight."

# Tokenizing the sentence
tokens = tokenizer.tokenize(sentence)

##  Tokenization and Encoding

Tokenize and encode the sentence to format it properly for the model.

##  Model Prediction

Feed the encoded inputs to the model and extract logits for predictions.

In [None]:
# Encoding the input sentence and getting model predictions
encoded_inputs = tokenizer(sentence, return_tensors="pt")
output = model(**encoded_inputs)

In [None]:
# Detaching the logits from the computation graph, converting to a numpy array,
# and taking the first (and only) sentence in the batch
logits = output.logits.detach().numpy()[0]
logits

array([[ -7.3722925,  -7.2488613,  -7.4421444, ...,  -6.311862 ,
         -5.936892 ,  -6.425681 ],
       [ -7.9311185,  -8.2282095,  -8.032589 , ...,  -6.7387457,
         -6.4877234,  -6.9525247],
       [-12.050008 , -11.797209 , -12.577608 , ...,  -8.451776 ,
         -6.7310185,  -8.258566 ],
       ...,
       [-10.22041  , -10.4314785,  -9.999257 , ...,  -7.9569917,
         -6.7193975,  -9.361793 ],
       [-12.447125 , -12.536707 , -12.561406 , ...,  -9.908555 ,
         -9.421911 , -11.176952 ],
       [-14.365711 , -14.522715 , -15.001671 , ..., -11.971546 ,
        -11.65692  , -13.449785 ]], dtype=float32)

In [None]:
logits.shape

(10, 28996)
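The first dimension of 10 is the sequence length and the second is the vocabulary size. The sketch below illustrates where the 10 comes from, and why the mask sits one position later in the encoded input than in `tokens`. The token list is written by hand for illustration, assuming each word maps to a single wordpiece (which is consistent with the shape above); `encoded_tokens`, `mask_in_tokens`, and `mask_in_encoded` are hypothetical names, not part of the notebook.

```python
# Hand-written token list for illustration; in the notebook it comes from
# tokenizer.tokenize(sentence), which does not add special tokens
tokens = ["I", "want", "to", "[MASK]", "pizza", "for", "tonight", "."]

# Encoding with tokenizer(sentence) wraps the sequence in BERT's special tokens
encoded_tokens = ["[CLS]"] + tokens + ["[SEP]"]

# 8 wordpieces + [CLS] + [SEP] = 10 positions, one row of logits per position
print(len(encoded_tokens))

# The leading [CLS] shifts every position by one
mask_in_tokens = tokens.index("[MASK]")
mask_in_encoded = encoded_tokens.index("[MASK]")
print(mask_in_tokens, mask_in_encoded)
```

Because of this shift, any index computed on `tokens` needs an offset of 1 before it is used to index `logits`.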


##  Analyzing Predictions

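Select the row of logits at the masked position and apply softmax to turn it into a probability for each of the 28,996 tokens in the vocabulary. These probabilities serve as confidence scores for the possible replacements.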

In [None]:
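# Extracting the logits for the masked token. `tokens` has no special tokens,
# but the encoded input starts with [CLS], hence the offset of 1.
masked_logits = logits[tokens.index(mask) + 1]

# Calculating the confidence scores: softmax maps logits to probabilities
confidence_scores = softmax(masked_logits)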

In [None]:
masked_logits

array([-6.714628 , -6.379109 , -6.1184893, ..., -5.651309 , -3.6572778,
       -4.9947314], dtype=float32)

In [None]:
masked_logits.shape

(28996,)

In [None]:
confidence_scores

array([2.9159888e-10, 4.0784978e-10, 5.2928079e-10, ..., 8.4446000e-10,
       6.2026344e-09, 1.6282734e-09], dtype=float32)
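The scores shown are tiny because they are spread over the whole vocabulary and must sum to 1. As a rough illustration of what `softmax` computes, here is a minimal NumPy sketch on a toy four-element vector. The helper `stable_softmax` and the values in `toy_logits` are assumptions made up for this example; the notebook itself uses `scipy.special.softmax`.

```python
import numpy as np

def stable_softmax(logits):
    """Convert a 1-D array of logits into probabilities that sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow in exp
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical logits for a four-token vocabulary (not taken from BERT)
toy_logits = np.array([2.0, 1.0, 0.1, -3.0], dtype=np.float32)
toy_probs = stable_softmax(toy_logits)

print(toy_probs)
print(toy_probs.sum())       # approximately 1.0
print(np.argmax(toy_probs))  # position of the largest logit
```

Softmax preserves the ranking of the logits, so the most likely token can be found from either array; the probabilities are simply easier to read as confidence scores.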

##  Displaying Top Predictions

Cycle through the top 5 predicted tokens, substituting the masked token in the original sentence to show the model's suggestions.


In [None]:
# Iterating over the top 5 predicted tokens and printing the sentences with the masked token replaced
for i in np.argsort(confidence_scores)[::-1][:5]:
    pred_token = tokenizer.decode(i)
    score = confidence_scores[i]

    # print(pred_token, score)
    print(sentence.replace(mask, pred_token))

I want to have pizza for tonight.
I want to get pizza for tonight.
I want to eat pizza for tonight.
I want to make pizza for tonight.
I want to order pizza for tonight.
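To see how the `np.argsort(...)[::-1][:5]` idiom picks the highest-scoring tokens, and how each score could be printed next to its sentence, consider the small sketch below. The vocabulary `toy_vocab` and the probabilities in `toy_scores` are invented for illustration and are not BERT's actual output.

```python
import numpy as np

# Hypothetical vocabulary and probabilities (not BERT output)
toy_vocab = ["have", "get", "eat", "make", "order", "sing"]
toy_scores = np.array([0.30, 0.25, 0.20, 0.12, 0.10, 0.03])

k = 3
# argsort sorts ascending, so reverse the result and keep the first k indices
top_k = np.argsort(toy_scores)[::-1][:k]

sentence = "I want to [MASK] pizza for tonight."
for i in top_k:
    # Substitute the candidate token for the mask and show its score
    filled = sentence.replace("[MASK]", toy_vocab[i])
    print(f"{toy_scores[i]:.2f}  {filled}")
```

Sorting the full array is simple and fast enough for a single vocabulary-sized vector; for repeated top-k queries over many positions, `np.argpartition` is one possible alternative.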
