# Homework: Implementing and Training a Masked Language Model (MLM)

## Overview
In this assignment, you will implement a basic Masked Language Model (MLM) from scratch using PyTorch. This exercise will help you understand the architecture and training process of MLMs, which are crucial for many NLP tasks. You will use the `bookcorpus` dataset from the Hugging Face `datasets` library for training.

## Objectives
- Implement a Transformer-based MLM.
- Preprocess and prepare a dataset for MLM training.
- Train your model on the `bookcorpus` dataset.
- Evaluate the model on masked sentence predictions.

## Getting Started
First, let's import all necessary libraries and set up our environment.


In [1]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m542.0/542.0 kB[0m [31m4.4 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m4.4 MB/s[0m eta [36m0:00:00[0m
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m5.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets-2

In [2]:
# Import libraries
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizer
from datasets import load_dataset
from tqdm import tqdm
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Check device availability

# TODO :
# Check whether the cuda is available (if not assign cpu)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(f"Using device: {device}")

Using device: cuda


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Dataset Preparation
For this assignment, we will use the `bookcorpus` dataset available through Hugging Face's datasets library. You need to load the dataset, preprocess it for MLM, and create a PyTorch dataset class.

### Load the Dataset
First, load the `bookcorpus` dataset.


In [3]:
# TO DO:
# load bookcorpus dataset
# we just need train split of it.
dataset = load_dataset("bookcorpus", split="train")
print("Dataset loaded successfully.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading data:   0%|          | 0.00/312M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/312M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/313M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/313M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/311M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/312M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/313M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/313M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/312M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/209M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/74004228 [00:00<?, ? examples/s]

Dataset loaded successfully.


In [4]:
# Shuffle the dataset and reduce its size
dataset = dataset.shuffle(seed=42).select(range(100000))
print("Reduced dataset loaded successfully. Total samples:", len(dataset))

Reduced dataset loaded successfully. Total samples: 100000


### Preprocess the Data
To train our MLM, we need to preprocess our textual data appropriately. This preprocessing involves tokenizing our text into tokens that the MLM can understand and randomly masking some of these tokens to create input-output pairs for our model to learn from.

In [5]:
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove special characters and punctuation
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    text = ' '.join(word for word in text.split() if word not in stop_words)

    return text


### TextDataset Class
Below is the `TextDataset` class where you will implement the tokenization, and specifically, the logic to create the masking array. The class should take a list of texts and a tokenizer as input. It should tokenize the texts, apply the masking randomly to 15% of the tokens, and prepare the input and label pairs for the MLM.


In [6]:
# TextDataset Class
class TextDataset(Dataset):
    def __init__(self, dataset, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.dataset = dataset
        self.max_len = max_len
        self.inputs = []

        for index, text in tqdm(enumerate(self.dataset['text']), total = len(self.dataset['text'])):
            # clean the text
            text = preprocess_text(text)

            # Tokenize the text
            tokenized_text = self.tokenizer.encode_plus(
                text,
                max_length=self.max_len,
                padding='max_length',
                truncation=True,
                return_tensors='pt'
            )
            input_ids = tokenized_text['input_ids'].squeeze(0)
            labels = input_ids.clone()

            # Create random array to determine which tokens to mask
            rand = torch.rand(input_ids.shape)

            # To Do: Implement the masking logic here
            # Create a mask array - mask 15% of tokens that are not special tokens
            # special tokens are: Pad Token, CLS Token, SEP Token
            # Your code here: mask_arr = ...
            # Masking logic
            mask_arr = (rand < 0.15) & (input_ids != self.tokenizer.pad_token_id) & \
                       (input_ids != self.tokenizer.cls_token_id) & (input_ids != self.tokenizer.sep_token_id)

            labels[~mask_arr] = -100  # We only compute loss on masked tokens

            # 80% of the time, we replace masked input tokens with the mask token
            indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & mask_arr
            input_ids[indices_replaced] = self.tokenizer.mask_token_id

            # 10% of the time, replace masked input tokens with random words
            indices_random = torch.bernoulli(torch.full(labels.shape, 0.1)).bool() & mask_arr & ~indices_replaced
            random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
            input_ids[indices_random] = random_words[indices_random]

            # Add the prepared inputs and labels to the list
            self.inputs.append({"input_ids": input_ids, "labels": labels})

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx]


## Understanding the Masking Strategy in Masked Language Models

### Question Overview

In the training process of Masked Language Models (MLMs) such as BERT, a specific strategy for masking tokens is commonly employed:
- **80%** of the masked tokens are replaced with the `[MASK]` token.
- **10%** are replaced with random words.
- **10%** are left unchanged.

This methodical approach to token masking plays a crucial role in how the model learns during the pre-training phase.

### Detailed Questions

Please provide a comprehensive explanation addressing the rationale behind this masking strategy. Your response should cover the following aspects:

1. **80% Masked with `[MASK]` Token:**
   - **Why are 80% of the masked tokens replaced with the `[MASK]` token?**
   - Discuss how this percentage influences the model's focus during training and its ability to learn contextual information from surrounding tokens.

2. **10% Replaced with Random Words:**
   - **Why are 10% of the masked tokens randomly replaced with other words from the vocabulary?**
   - Analyze the impact of this strategy on the model's robustness and its handling of unexpected or novel input during real-world applications.

3. **10% Left Unchanged:**
   - **Why are the remaining 10% of the masked tokens left as is, unchanged?**
   - Consider how leaving some masked tokens unchanged might help the model generalize better and avoid overfitting to the `[MASK]` token specifically.




Answers are in the HW's doc!

## Transformer Model Components Explanation

In this section of the assignment, you will implement the core components of a Transformer model tailored for Masked Language Modeling (MLM). Understanding the functionality of each component is crucial for your implementation. Below, we detail the roles and responsibilities of each component within the Transformer architecture.

### Transformer Block

The Transformer Block is the fundamental building unit of a Transformer model. Each block consists of two main parts: a multi-head self-attention mechanism and a position-wise feed-forward network.

- **Multi-Head Self-Attention:** This component allows the model to dynamically focus on different parts of the input sequence, learning nuanced dependencies between words regardless of their positional distance from each other. It helps the model understand the context and relationships within the text.
- **Feed-Forward Network:** Following the attention mechanism, each position is passed through the same feed-forward network independently. This network transforms the attended features to help in predicting the correct output tokens.
- **Normalization and Dropout:** Each sub-layer (attention and feed-forward) in the block includes a residual connection followed by layer normalization. Dropout is applied for regularization to prevent overfitting.

### Encoder

The Encoder aggregates multiple Transformer Blocks to process the input tokens. Its main responsibilities include:

- **Embedding Inputs:** Initially, input tokens are converted into embeddings that represent them in a continuous vector space. Positional embeddings are added to these token embeddings to retain positional information.
- **Processing through Transformer Blocks:** The embedded input tokens are then sequentially passed through multiple Transformer Blocks. Each block processes the input and passes its output to the next block, iteratively refining the representations.
- **Output:** The final output of the Encoder is a sequence of vectors, where each vector is a rich representation of the corresponding input token, considering the entire input sequence context.

### MLM Model

The MLM Model is the overarching architecture that utilizes the Encoder for the MLM task. It is specifically designed for predicting masked tokens, replicating the pre-training objective of models like BERT.

- **Encoder Utilization:** The MLM Model embeds the input tokens and passes them through the Encoder to obtain contextualized token representations.
- **Prediction Layer:** On top of the Encoder's output, a linear layer is used to map the high-dimensional token representations back to the vocabulary space. This setup predicts the original token from its masked version or context.
- **Training Objective:** The primary goal during training is to minimize the prediction error of the masked tokens, encouraging the model to understand and generate language effectively.

### Implementation Notes

As you implement these components:
- Focus on how each part contributes to handling and transforming the input data into useful representations.
- Consider the flow of data through the model and how each component’s output serves as input to the next.
- Ensure your implementation supports backpropagation, as this will be crucial for training the model.

In the upcoming coding tasks, you will implement these components based on the provided templates and hints. This hands-on experience will deepen your understanding of how modern NLP models leverage Transformer architectures for complex language understanding tasks.


In [7]:
# Transformer Block
class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super(TransformerBlock, self).__init__()
        self.attention = nn.MultiheadAttention(embed_dim=embed_size, num_heads=heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

        # Feed-forward layer
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            # To Do: Add the final linear layer (hint: the output size should be the same as embed_size)
            nn.Linear(forward_expansion * embed_size, embed_size) # Add the final linear layer

        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attention = self.attention(query, key, value, attn_mask=mask)[0]
        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        out = self.dropout(self.norm2(forward + x))
        return out

# Encoder
class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, num_layers, heads, device, forward_expansion, dropout, max_length):
        super(Encoder, self).__init__()
        self.embed_size = embed_size
        self.device = device
        self.word_embedding = nn.Embedding(vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)

        self.layers = nn.ModuleList(
            [TransformerBlock(embed_size, heads, dropout, forward_expansion) for _ in range(num_layers)]
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length).unsqueeze(0).repeat(N, 1).to(self.device)
        out = self.dropout(self.word_embedding(x) + self.position_embedding(positions))

        for layer in self.layers:
            out = layer(out, out, out, mask)

        return out

# MLM Model
class MLM(nn.Module):
    def __init__(self, vocab_size, embed_size, num_layers, heads, forward_expansion, dropout, device, max_length):
        super(MLM, self).__init__()
        self.encoder = Encoder(vocab_size, embed_size, num_layers, heads, device, forward_expansion, dropout, max_length)
        # To Do: Initialize an output layer for predicting masked tokens (hint: use a linear layer)
        self.fc_out = nn.Linear(embed_size, vocab_size)  # Initialize the output layer for predicting masked tokens
        self.device = device
        self.vocab_size = vocab_size

    def forward(self, x, mask):
        out = self.encoder(x, mask)
        # To Do: Apply the output layer to 'out' (hint: remember to reshape if needed before applying it)
        out = out.view(out.size(0) * out.size(1), out.size(2))  # Reshape the output for the linear layer
        out = self.fc_out(out)  # Apply the output layer to 'out'
        return out


## Training the Model

Training the MLM is a crucial step where the model learns to predict the masked tokens based on the context provided by the surrounding words. This process involves passing batches of preprocessed data through the model, calculating the loss, and updating the model parameters to minimize this loss.

### Setting Up the Training Loop

The objective of the training loop is to iteratively improve the model's predictions. During each epoch, the model will process all batches of data, calculate the loss for each batch, and update its weights. You will complete parts of the training function to ensure proper loss calculation and parameter optimization.

**To Do:**
- Complete the loss calculation using the appropriate loss function.
- Implement the steps for backpropagation and updating model parameters.
- Monitor and print the average loss after each epoch to track the training progress.

In the next section, you will find a partially completed training function. Fill in the missing parts as instructed to complete the training loop.


In [8]:
def train(model, data_loader, optimizer, device, epochs=10):
    model = model.to(device)
    model.train()  # Set the model to training mode
    total_loss = 0

    # To Do : Initialize the loss function with ignore_index set to -100
    # Hint: CrossEntropyLoss might be appropriate here.
    loss_function = nn.CrossEntropyLoss(ignore_index=-100)

    for epoch in range(epochs):
        progress_bar = tqdm(enumerate(data_loader), total=len(data_loader), desc=f"Epoch {epoch + 1}/{epochs}")
        for i, batch in progress_bar:
            input_ids, labels = batch['input_ids'].to(device), batch['labels'].to(device)

            # Forward pass
            outputs = model(input_ids, None)  # Assuming mask=None for simplicity

            # To Do: Calculate loss. Attention to shapes of the outputs and labels
            loss = loss_function(outputs.view(-1, model.vocab_size), labels.view(-1))

            # Backpropagation
            optimizer.zero_grad()  # Clear existing gradients
            loss.backward()  # Compute gradient of loss with respect to model parameters
            optimizer.step()  # Update model parameters

            total_loss += loss.item()

        average_loss = total_loss / len(data_loader)
        print(f"Epoch {epoch + 1}/{epochs}, Average Loss: {average_loss}")
        total_loss = 0  # Reset total loss for the next epoch


## Model Evaluation

Evaluating the performance of your Masked Language Model (MLM) is essential to understand how well it has learned to predict masked tokens. This evaluation typically involves using the model to predict tokens in place of `[MASK]` and comparing these predictions to the actual tokens. This step is crucial for assessing the model's ability to generalize to unseen data and for verifying its learning efficacy.

### Setting Up the Evaluation Function

The evaluation function will test the model's ability to fill in masked tokens correctly within given text examples. You will complete parts of this function to ensure the model can perform forward passes on input data and process the output to generate human-readable predictions.

**To Do:**
- Implement the logic to convert model predictions to token IDs.
- Translate these token IDs back to words using the tokenizer.




In [9]:
def evaluate(model, text, tokenizer, device):
    model.eval()  # Set the model to evaluation mode

    # Tokenize the input text where `[MASK]` is the token to predict
    tokenized_input = tokenizer.encode_plus(text, return_tensors='pt', max_length=512, truncation=True, padding='max_length')
    input_ids = tokenized_input['input_ids'].to(device)

    # Identify the position of the `[MASK]` token
    mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

    with torch.no_grad():  # No need to calculate gradients for evaluation
        # Forward pass
        outputs = model(input_ids, None)

        # Get the logits and find the 10 tokens with the highest probability at the mask position
        # To Do: Implement the logic to extract the 10 token IDs with the highest probability for the mask position
        # use torch.topk function for mask token logits
        mask_token_logits = outputs[0]
        top_k_probabilities, top_k_indices = torch.topk(mask_token_logits[0, mask_token_index], k=10, dim=-1)

        # Convert the predicted token IDs to words using the tokenizer
        top_k_tokens = [tokenizer.convert_ids_to_tokens(indices.cpu().numpy()) for indices in top_k_indices]

    return top_k_tokens



## Starting the MLM Training and Evaluation with bookcorpus Dataset

In [10]:
# To Do:  Load the BERT tokenizer (bert base uncased)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text_dataset = TextDataset(dataset, tokenizer, max_len=512)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]



config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

100%|██████████| 100000/100000 [12:23<00:00, 134.51it/s]


In [11]:
data_loader = DataLoader(text_dataset, batch_size=32, shuffle=True)

In [12]:
# To Do: Initialize the MLM model with the following hyper-paramaters
# vocab size = (comes from tokenizer)
# embed size = 256
# num of layers = 2
# heads = 8
# forward expansion = 4
# dropout = 0.1
# max_length = 512

vocab_size = len(tokenizer)

# Hyperparameters
embed_size = 256
num_layers = 2
num_heads = 8
forward_expansion = 4
dropout = 0.1
max_length = 512


model = MLM(vocab_size, embed_size, num_layers, num_heads, forward_expansion, dropout, device, max_length)

optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

In [13]:
# Training the model
train(model, data_loader, optimizer, device, epochs=3)

Epoch 1/3: 100%|██████████| 3125/3125 [16:19<00:00,  3.19it/s]


Epoch 1/3, Average Loss: 8.76510227279663


Epoch 2/3: 100%|██████████| 3125/3125 [16:25<00:00,  3.17it/s]


Epoch 2/3, Average Loss: 8.149457678375244


Epoch 3/3: 100%|██████████| 3125/3125 [16:26<00:00,  3.17it/s]

Epoch 3/3, Average Loss: 7.852560283813476





In [14]:
# Evaluate the model
test_sentences = ["Hello, my name is [MASK].", "The capital of France is [MASK].", "I love to [MASK] a song.", '[MASK]']
for sentence in test_sentences:
    predicted_masks = evaluate(model, sentence, tokenizer, device)
    print(f"Original: {sentence}")
    print("10 most probable words for Mask token: ", predicted_masks)

IndexError: too many indices for tensor of dimension 1

## Improving Model Performance

### Evaluation Results
As you can see, the output of the evaluation is quite poor. Why? Because we started training the MLM from scratch. If we want to achieve an acceptable performance similar to a pretrained BERT model, we need to perform several steps.
### Question
**What steps can you take to improve the performance of your Masked Language Model (MLM)?**
