# Fine-tuning a Model for Masked Language Modeling (MLM) Exam

In this exam, you will preprocess a dataset and fine-tune a model for a masked language modeling task. Complete each step carefully according to the instructions provided.

### Model and Dataset Information

For this task, you will be working with the following:

- **Model Checkpoint**: Use the pre-trained model checkpoint `bert-base-uncased` for both the model and tokenizer.
- **Dataset**: Use the `CUTD/math_df` dataset. Make sure to load and preprocess it correctly for training and evaluation.

**Note:**
- Any additional steps or methods you include that improve or enhance the results will be rewarded with bonus points if they are justified.
- The steps outlined here are suggestions. You are free to implement alternative methods or approaches to achieve the task, as long as you explain the reasoning and the process at the bottom of the notebook.
- You can use either TensorFlow or PyTorch with Hugging Face Transformers for this task.
- The number of data samples you choose to work with is flexible. However, if you select a very low number of samples and the training time is too short, this could affect the evaluation of your work.

In [1]:
!pip install transformers datasets evaluate



In [2]:
from sklearn.model_selection import train_test_split
from datasets import load_dataset
from transformers import BertTokenizer
from transformers import TFAutoModelForMaskedLM

from transformers import TrainingArguments
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer
from transformers import create_optimizer, AdamWeightDecay
from transformers import pipeline

## Step 1: Load the Dataset

Load the dataset and split it into training and test sets. Use 20% of the data for testing.

In [3]:
dataset = load_dataset("CUTD/math_df", split="train[:50]") # take the first 50 samples
dataset

Dataset({
    features: ['Unnamed: 0', 'text'],
    num_rows: 50
})

In [4]:
dataset = dataset.train_test_split(test_size=0.2) # splits the dataset into 80% for training and 20% for testing
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'text'],
        num_rows: 40
    })
    test: Dataset({
        features: ['Unnamed: 0', 'text'],
        num_rows: 10
    })
})
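
Because `train_test_split` shuffles the rows before splitting, rerunning the cell above gives a different 40/10 partition each time. The following is a minimal, library-free sketch of what a seeded 80/20 split does. `split_indices` is a hypothetical helper written only for illustration (it is not part of `datasets`), and the rounding it uses is a simplification. In the notebook itself, passing a `seed` argument to `train_test_split` should give the same kind of reproducibility.

```python
import random

def split_indices(num_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then split them into train/test."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)      # seeded shuffle: same order on every run
    num_test = int(round(num_rows * test_size))
    return indices[num_test:], indices[:num_test]

train_idx, test_idx = split_indices(50)       # 50 rows, as loaded in this notebook
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the rows shown by `dataset["train"][0]` and `dataset["test"][0]` stable across runs, which helps when comparing experiments.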

In [5]:
dataset["train"][0] # show the first row of train

{'Unnamed: 0': 28,
 'text': 'A veteran fine arts instructor known for pushing students to explore various mediums and techniques'}

In [6]:
dataset["test"][0] # show the first row of test

{'Unnamed: 0': 20,
 'text': 'A retired statesman known for their diplomatic skills and strategic thinking'}

## Step 2: Load the Pretrained Model and Tokenizer

Use a pre-trained model and tokenizer for this task. Initialize both in this step.

In [7]:
# initialize the pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

model = TFAutoModelForMaskedLM.from_pretrained("bert-base-uncased")

All PyTorch model weights were used when initializing TFBertForMaskedLM.

All the weights of TFBertForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


## Step 3: Preprocess the Dataset

Define a preprocessing function that tokenizes the text data and prepares the inputs for the model. Ensure that you truncate the sequences to a maximum length of 512 tokens and pad them appropriately.
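
To make the truncation and padding requirement concrete, here is a minimal sketch of what happens to a single sequence of token IDs. `pad_or_truncate` is a hypothetical helper, not the tokenizer's implementation: the real tokenizer also keeps the closing `[SEP]` token when it truncates, which this sketch ignores. The IDs 101, 102, and 0 are assumed to be `[CLS]`, `[SEP]`, and `[PAD]` in `bert-base-uncased`; the middle IDs are arbitrary placeholders.

```python
def pad_or_truncate(token_ids, max_length=512, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])             # truncation: drop anything past max_length
    num_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * num_pad  # 1 = real token, 0 = padding
    return ids + [pad_id] * num_pad, attention_mask

# Small max_length so the result is easy to read.
input_ids, attention_mask = pad_or_truncate([101, 1037, 3836, 102], max_length=8)
print(input_ids)        # [101, 1037, 3836, 102, 0, 0, 0, 0]
print(attention_mask)   # [1, 1, 1, 1, 0, 0, 0, 0]
```

The attention mask is what lets the model ignore the padded positions. For short texts, padding everything to 512 tokens costs memory and time, so padding to the longest sequence in each batch may be worth considering.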

**Bonus**: Extra credit is given for more comprehensive preprocessing, such as removing links, converting text to lowercase, or applying additional cleaning techniques.
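
One possible shape for such a cleaning step is sketched below using only the standard library. `clean_text` and the two regular expressions are illustrative assumptions, not a prescribed solution, and the URL pattern is deliberately simple. Note that the `bert-base-uncased` tokenizer already lowercases its input, so the explicit `.lower()` is redundant but harmless.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")   # simple link matcher (assumption)
WHITESPACE_PATTERN = re.compile(r"\s+")

def clean_text(text):
    """Remove links, collapse repeated whitespace, and lowercase the text."""
    text = URL_PATTERN.sub(" ", text)
    text = WHITESPACE_PATTERN.sub(" ", text)
    return text.strip().lower()

print(clean_text("A  Veteran instructor, see https://example.com/bio for details"))
# a veteran instructor, see for details
```

A function like this could be applied to each entry of `examples['text']` inside the preprocessing function before tokenization.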

In [8]:
# preprocessing function: lowercase the text, then tokenize it
def preprocess_function(examples):
    examples['text'] = [x.lower() for x in examples['text']]  # convert text to lowercase
    # tokenize, truncating to at most 512 tokens and padding every sequence to that length
    tokenized_inputs = tokenizer(examples['text'], max_length=512, truncation=True, padding='max_length')
    return tokenized_inputs

dataset = dataset.map(preprocess_function, batched=True)  # apply the preprocessing to both splits

# randomly select 15% of the tokens in each batch as masking targets
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf")

Map:   0%|          | 0/40 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]
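
The `mlm_probability=0.15` setting above means that roughly 15% of the non-special tokens are selected as prediction targets. The sketch below illustrates the BERT-style masking rule in plain Python: a selected token becomes `[MASK]` 80% of the time, a random token 10% of the time, and stays unchanged 10% of the time, while every unselected position gets the label `-100` so the loss skips it. `mask_tokens` is a hypothetical helper for illustration, not the collator's actual code; the IDs 0, 101, 102, and 103 and the vocabulary size 30522 are assumed to match `bert-base-uncased`.

```python
import random

MASK_ID = 103         # assumed [MASK] id in bert-base-uncased
IGNORE_INDEX = -100   # label value the loss function ignores

def mask_tokens(token_ids, special_ids, vocab_size, mlm_probability=0.15, seed=0):
    """Return (masked_inputs, labels) following the 80/10/10 masking rule."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        # Never mask special tokens; select other tokens with probability 0.15.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok                         # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID                 # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels

# [CLS] + 20 placeholder token ids + [SEP] + 4 [PAD] tokens
token_ids = [101] + list(range(2000, 2020)) + [102] + [0] * 4
masked_inputs, labels = mask_tokens(token_ids, special_ids={0, 101, 102}, vocab_size=30522)
print(masked_inputs)
print(labels)
```

Because the masking is random, the collator produces different masked positions for each batch and epoch, so training and evaluation losses vary slightly between runs.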

## Step 4: Define Training Arguments

Set up the training configuration, including parameters like learning rate, batch size, number of epochs, and weight decay.
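
It can help to work out how many optimizer steps a configuration implies before training. The sketch below uses the values from this notebook (40 training samples, batch size 8, 3 epochs, learning rate 2e-5) and shows an optional linear learning-rate decay. Both functions are hypothetical helpers for illustration; the notebook itself trains with a constant learning rate, and Transformers' TensorFlow `create_optimizer` helper may be one way to obtain a decaying schedule.

```python
import math

def steps_per_epoch(num_samples, batch_size, drop_remainder=False):
    """Number of batches in one pass over the data."""
    if drop_remainder:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

def linear_decay_lr(step, total_steps, init_lr=2e-5):
    """Learning rate that falls linearly from init_lr to 0 over total_steps."""
    return init_lr * (1 - step / total_steps)

train_steps = steps_per_epoch(40, 8)     # 40 training samples, batch size 8
total_steps = train_steps * 3            # 3 epochs
print(train_steps, total_steps)          # 5 15
print(linear_decay_lr(0, total_steps), linear_decay_lr(total_steps, total_steps))
```

With only 15 update steps in total, the model sees very little data, which is worth keeping in mind when interpreting the results.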

In [9]:
# prepare the training and test sets as batched TensorFlow datasets
tf_train_set = model.prepare_tf_dataset(
    dataset["train"],  # training split
    shuffle=True,  # shuffle the sample order
    batch_size=8,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    dataset["test"],  # test split
    shuffle=False,  # keep the evaluation order fixed
    batch_size=8,
    collate_fn=data_collator,
)

In [10]:
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
model.compile(optimizer=optimizer)

## Step 5: Initialize the Trainer

Initialize the Trainer using the model, training arguments, and datasets (both training and evaluation).

In [11]:
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tf_keras.src.callbacks.History at 0x7d3940ca2ec0>
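
The output above does not show loss values, so it is hard to tell how well training went. A common way to summarize a masked language model is perplexity, the exponential of the mean cross-entropy loss; lower is better. The sketch below shows only the conversion. The loss values are placeholders chosen for illustration, NOT results from this run; in the notebook, a real value could come from something like `model.evaluate(tf_test_set)`.

```python
import math

def perplexity(mean_cross_entropy_loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy_loss)

example_losses = [2.0, 1.5, 1.0]   # placeholder values, not measured in this notebook
for loss in example_losses:
    print(f"loss={loss:.2f} -> perplexity={perplexity(loss):.2f}")
```

Comparing perplexity on the test set before and after fine-tuning gives a simple check on whether the model adapted to the dataset.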

## Step 6: Fine-tune the Model

Run the training process using the initialized Trainer to fine-tune the model on the masked language modeling task.

In [12]:
model.save_pretrained("mlm_model")
tokenizer.save_pretrained("mlm_tokenizer")

('mlm_tokenizer/tokenizer_config.json',
 'mlm_tokenizer/special_tokens_map.json',
 'mlm_tokenizer/vocab.txt',
 'mlm_tokenizer/added_tokens.json')

## Step 7: Inference

Use the fine-tuned model for inference. Create a pipeline for masked language modeling and test it with a sample sentence.

In [15]:
mask_filler = pipeline("fill-mask", model="mlm_model", tokenizer="mlm_tokenizer")

text = "The Milky Way is a [MASK] galaxy."

mask_filler(text)

All model checkpoint layers were used when initializing TFBertForMaskedLM.

All the layers of TFBertForMaskedLM were initialized from the model checkpoint at mlm_model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


[{'score': 0.7072992324829102,
  'token': 12313,
  'token_str': 'spiral',
  'sequence': 'the milky way is a spiral galaxy.'},
 {'score': 0.04892466589808464,
  'token': 11229,
  'token_str': 'dwarf',
  'sequence': 'the milky way is a dwarf galaxy.'},
 {'score': 0.031989093869924545,
  'token': 27213,
  'token_str': 'elliptical',
  'sequence': 'the milky way is a elliptical galaxy.'},
 {'score': 0.02805796079337597,
  'token': 5871,
  'token_str': 'satellite',
  'sequence': 'the milky way is a satellite galaxy.'},
 {'score': 0.0156062301248312,
  'token': 2235,
  'token_str': 'small',
  'sequence': 'the milky way is a small galaxy.'}]
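
The pipeline returns a list of candidate fills sorted by score. A small post-processing sketch is shown below, using the scores from the output above. `best_fill` and the 0.5 threshold are illustrative assumptions, not part of the `pipeline` API.

```python
# Scores and tokens copied from the pipeline output above.
predictions = [
    {"score": 0.7072992324829102, "token_str": "spiral"},
    {"score": 0.04892466589808464, "token_str": "dwarf"},
    {"score": 0.031989093869924545, "token_str": "elliptical"},
    {"score": 0.02805796079337597, "token_str": "satellite"},
    {"score": 0.0156062301248312, "token_str": "small"},
]

def best_fill(predictions, min_score=0.5):
    """Return the highest-scoring token, or None if the model is not confident enough."""
    top = max(predictions, key=lambda p: p["score"])
    return top["token_str"] if top["score"] >= min_score else None

covered = sum(p["score"] for p in predictions)   # probability mass of the top 5 candidates
print(best_fill(predictions))                    # spiral
print(round(covered, 3))                         # 0.832
```

Here the top candidate carries most of the probability mass, so the model is fairly confident about this sentence.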