# Fine-tuning a Model for Masked Language Modeling (MLM) Exam

In this exam, you will preprocess a dataset and fine-tune a model for a masked language modeling task. Complete each step carefully according to the instructions provided.

### Model and Dataset Information

For this task, you will be working with the following:

- **Model Checkpoint**: Use the pre-trained model checkpoint `bert-base-uncased` for both the model and tokenizer.
- **Dataset**: You will be using the `CUTD/math_df` dataset. Make sure you load and preprocess the dataset correctly for training and evaluation.

**Note:**
- Any additional steps or methods you include that improve or enhance the results will be rewarded with bonus points if they are justified.
- The steps outlined here are suggestions. You are free to implement alternative methods or approaches to achieve the task, as long as you explain the reasoning and the process at the bottom of the notebook.
- You can use either PyTorch or TensorFlow with Hugging Face Transformers for this task.
- The number of data samples you choose to work with is flexible. However, if you select a very low number of samples and the training time is too short, this could affect the evaluation of your work.

In [1]:
!pip install transformers
!pip install datasets



In [2]:
from datasets import load_dataset
from transformers import BertTokenizer, BertForMaskedLM, Trainer, TrainingArguments
import torch
import re

## Step 1: Load the Dataset

Load the dataset and split it into training and test sets. Use 20% of the data for testing.
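
As a rough illustration of what an 80/20 split does, the sketch below shuffles row indices with a fixed seed and cuts off the test fraction. The helper `split_indices` is hypothetical and is not how the `datasets` library implements its split; it only shows why a seed makes the split repeatable. `Dataset.train_test_split` accepts a `seed` argument for the same purpose.

```python
import random

def split_indices(num_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and cut off the test fraction."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)      # same seed -> same order every run
    num_test = int(round(num_rows * test_size))
    return indices[num_test:], indices[:num_test]

train_idx, test_idx = split_indices(10_000)
print(len(train_idx), len(test_idx))  # 8000 2000
```

Without a seed, every run yields a different test set, which makes results harder to compare between runs.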

In [3]:
dataset = load_dataset("CUTD/math_df")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'text'],
        num_rows: 10000
    })
})

In [5]:
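# Hold out 20% of the rows for testing. The result gets its own name so it
# does not shadow any imported function called train_test_split.
split_dataset = dataset['train'].train_test_split(test_size=0.2)

train_dataset = split_dataset['train']
test_dataset = split_dataset['test']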

print(f"Training set size: {len(train_dataset)}")
print(f"Test set size: {len(test_dataset)}")

Training set size: 8000
Test set size: 2000


In [24]:
for example in train_dataset:
  print(example)

Streaming output truncated to the last 5000 lines.
{'Unnamed: 0': 7109, 'text': 'A professional esports team manager looking for data-driven insights to gain a competitive edge'}
{'Unnamed: 0': 9472, 'text': 'A jewelry enthusiast and blogger who is passionate about supporting queer artists.'}
{'Unnamed: 0': 4462, 'text': 'A doctor who gives first-hand experiences of the importance of disease prevention'}
{'Unnamed: 0': 743, 'text': 'A graduate student studying the impact of the Cold War on developing countries'}
{'Unnamed: 0': 5095, 'text': 'A sports memorabilia expert who has an extensive collection of autographed items from the former player'}
{'Unnamed: 0': 2317, 'text': 'An experienced catcher who provides valuable advice on improving throwing accuracy and pitch selection'}
{'Unnamed: 0': 2561, 'text': 'A retired executive with a rich background in corporate social responsibility and fair trade practices'}
{'Unnamed: 0': 1743, 'text': 'A product manager collaborating 

## Step 2: Load the Pretrained Model and Tokenizer

Use a pre-trained model and tokenizer for this task. Initialize both in this step.

In [6]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Step 3: Preprocess the Dataset

Define a preprocessing function that tokenizes the text data and prepares the inputs for the model. Ensure that you truncate the sequences to a maximum length of 512 tokens and pad them appropriately.
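
To make the truncation and padding step concrete, here is a minimal pure-Python sketch of what happens to one sequence of token ids. The helper `pad_or_truncate` is hypothetical and simplified: unlike the real tokenizer, it does not keep the closing `[SEP]` token when it truncates. The example token ids are illustrative.

```python
PAD_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary

def pad_or_truncate(input_ids, max_length=512, pad_id=PAD_ID):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = list(input_ids[:max_length])        # truncate long sequences
    attention_mask = [1] * len(ids)           # 1 = real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, attention_mask + [0] * padding  # 0 = padding

ids, mask = pad_or_truncate([101, 7592, 102], max_length=6)
print(ids)   # [101, 7592, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0]
```

The texts in this dataset are short persona descriptions, so padding every row to 512 tokens produces mostly padding. Padding to the longest sequence in each batch is likely to be cheaper, although that is a suggestion and not something the exam requires.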

**Bonus**: Extra points are awarded for more comprehensive preprocessing, such as removing links, converting text to lowercase, or applying additional preprocessing techniques.
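
One possible way to apply such cleaning to every row, sketched with the standard library only. The names `clean_text` and `clean_batch` are hypothetical; the rules (strip links, lowercase, collapse whitespace) are examples, not a required recipe.

```python
import re

URL_PATTERN = re.compile(r"http\S+")
WHITESPACE_PATTERN = re.compile(r"\s+")

def clean_text(text):
    """Remove links, lowercase, and collapse repeated whitespace."""
    text = URL_PATTERN.sub("", text)
    text = text.lower()
    return WHITESPACE_PATTERN.sub(" ", text).strip()

def clean_batch(examples):
    """Batch version, shaped for use with dataset.map(clean_batch, batched=True)."""
    return {"text": [clean_text(t) for t in examples["text"]]}

print(clean_text("See https://example.com  NOW"))  # see now
```

The `bert-base-uncased` tokenizer already lowercases its input, so the lowercasing step mainly keeps the stored text consistent with what the model sees.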

In [7]:
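def preprocess_data(examples):
  # Tokenize a batch of texts, truncating/padding every sequence to 512 tokens.
  inputs = tokenizer(
      examples['text'],
      truncation=True,
      padding='max_length',
      max_length=512,
      return_tensors='pt'
  )

  # Labels are an unmasked copy of the inputs: no [MASK] tokens are inserted
  # here, so masking must be applied elsewhere for a true MLM objective.
  inputs['labels'] = inputs['input_ids'].clone()
  return inputs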
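
Because the labels above are a plain copy of `input_ids`, the model can see every token it is asked to predict. The sketch below shows the usual BERT-style masking rule in pure Python, as an illustration only: about 15% of non-special tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. The function `mask_tokens` is hypothetical. In Hugging Face Transformers, `DataCollatorForLanguageModeling` passed to the `Trainer` as `data_collator` applies this kind of masking at batch time.

```python
import random

# Special token ids and vocabulary size of bert-base-uncased
CLS_ID, SEP_ID, PAD_ID, MASK_ID = 101, 102, 0, 103
VOCAB_SIZE = 30522
IGNORE_INDEX = -100  # label value that the loss skips

def mask_tokens(input_ids, mlm_probability=0.15, seed=None):
    """Return (masked_ids, labels) following the 80/10/10 masking rule."""
    rng = random.Random(seed)
    special = {CLS_ID, SEP_ID, PAD_ID}
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token in enumerate(input_ids):
        if token in special or rng.random() >= mlm_probability:
            continue                      # position not selected for prediction
        labels[i] = token                 # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: leave the token unchanged
    return masked, labels

example_ids = [CLS_ID] + list(range(2000, 2020)) + [SEP_ID] + [PAD_ID] * 4
masked_ids, labels = mask_tokens(example_ids, seed=0)
print(masked_ids)
print(labels)
```

Only positions whose label is not `-100` contribute to the loss, which is what makes the task a genuine fill-in-the-blank problem.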


In [8]:
text = re.sub(r'http\S+', '', dataset['train'][0]['text'])
text = text.lower()
text

"a software engineer who disagrees with the established computer scientist's methodologies and approaches"

## Step 4: Define Training Arguments

Set up the training configuration, including parameters like learning rate, batch size, number of epochs, and weight decay.
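
To give a feel for how the learning rate setting plays out during training, the sketch below computes a linear decay schedule in pure Python. It assumes the `Trainer` default of a linear schedule with no warmup, and it assumes 1250 total steps (10,000 rows, batch size 8, one epoch). The function `linear_lr` is a hypothetical helper written for this illustration.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at a given step under linear warmup then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Start, midpoint, and end of an assumed 1250-step run
for step in (0, 625, 1250):
    print(step, linear_lr(step, total_steps=1250))
```

The rate starts at the configured value and reaches zero on the last step, so a very short run spends little time near the peak learning rate.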

In [9]:
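training_args = TrainingArguments(
    output_dir='./results2',
    evaluation_strategy="no",        # named eval_strategy in newer transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=8,   # TrainingArguments has no `batch_size` argument
    num_train_epochs=1               # TrainingArguments has no `epochs` argument
)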



## Step 5: Initialize the Trainer

Initialize the Trainer using the model, training arguments, and datasets (both training and evaluation).

In [11]:
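# Tokenize the full 10,000-row 'train' split. This bypasses the 80/20 split
# from Step 1; map `train_dataset` / `test_dataset` instead to hold out the test set.
train_data = dataset['train'].map(preprocess_data, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data  # no eval_dataset is passed, so no evaluation runs
)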

## Step 6: Fine-tune the Model

Run the training process using the initialized Trainer to fine-tune the model on the masked language modeling task.

In [12]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 0.4644        |
| 1000 | 0.0004        |


TrainOutput(global_step=1250, training_loss=0.18594475184679032, metrics={'train_runtime': 272.9584, 'train_samples_per_second': 36.636, 'train_steps_per_second': 4.579, 'total_flos': 2632048128000000.0, 'train_loss': 0.18594475184679032, 'epoch': 1.0})
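
The `global_step=1250` in the output is consistent with one epoch over all 10,000 rows at batch size 8, which suggests training used the full dataset and not the 8,000-row split. The arithmetic can be checked with a small sketch. It assumes a single device, no gradient accumulation, and a checkpoint interval of 500 steps (the `TrainingArguments` default); both helper functions are hypothetical.

```python
import math

def total_steps(num_rows, batch_size, epochs=1):
    """Optimizer steps for a run on one device without gradient accumulation."""
    return math.ceil(num_rows / batch_size) * epochs

def interval_checkpoints(steps, save_steps=500):
    """Steps at which interval-based checkpoints are written."""
    return list(range(save_steps, steps + 1, save_steps))

print(total_steps(10_000, 8))        # 1250, matches global_step in the output
print(total_steps(8_000, 8))         # 1000, what the 80% split would give
print(interval_checkpoints(1250))    # [500, 1000]
```

Under these assumptions the last interval checkpoint is written at step 1000, which is 250 steps before the end of training. Calling `trainer.save_model(...)` after training is one way to keep the final weights. Some library versions may also write a checkpoint at the final step.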

In [13]:
tokenizer.save_pretrained('/content/tokenizer_directory')

('/content/tokenizer_directory/tokenizer_config.json',
 '/content/tokenizer_directory/special_tokens_map.json',
 '/content/tokenizer_directory/vocab.txt',
 '/content/tokenizer_directory/added_tokens.json')

## Step 7: Inference

Use the fine-tuned model for inference. Create a pipeline for masked language modeling and test it with a sample sentence.

In [14]:
tokenizer = BertTokenizer.from_pretrained('/content/tokenizer_directory')
# checkpoint-1000 is an intermediate checkpoint (step 1000 of the 1250 training steps)
model = BertForMaskedLM.from_pretrained('/content/results2/checkpoint-1000')

In [19]:
text = "The capital of Egypt is [MASK]."

In [20]:
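inputs = tokenizer(text, return_tensors='pt')

# Forward pass without gradient tracking; logits have shape (batch, seq_len, vocab_size).
with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits

# Vocabulary id of the [MASK] token
mask_token_id = tokenizer.mask_token_id

# (batch_indices, sequence_positions) of every [MASK] token in the input
mask_token_positions = (inputs['input_ids'] == mask_token_id).nonzero(as_tuple=True)

# Highest-scoring vocabulary id at each masked position
predicted_token_id = predictions[0, mask_token_positions[1]].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)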

print(f"Original text: {text}")
print(f"Predicted token: {predicted_token}")

Original text: The capital of Egypt is [MASK].
Predicted token: cairo
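
The cell above reports only the single best token. A ranked list with probabilities is often more informative. The sketch below shows the idea on a toy example in pure Python: the vocabulary and scores are made up for illustration and do not come from the fine-tuned model, and `top_k_predictions` is a hypothetical helper.

```python
import math

def top_k_predictions(logits, vocab, k=3):
    """Apply softmax to one row of logits and return the k best (token, prob) pairs."""
    max_logit = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

toy_vocab = ["cairo", "alexandria", "giza", "paris"]   # made-up vocabulary
toy_logits = [5.0, 2.0, 1.0, -1.0]                     # made-up scores at the [MASK] position

for token, prob in top_k_predictions(toy_logits, toy_vocab):
    print(f"{token}: {prob:.3f}")
```

The `fill-mask` pipeline in Transformers returns a similar ranked list of candidate tokens with scores, which would match the step description's request for a pipeline.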
