# Fine-tuning a Model for Masked Language Modeling (MLM) Exam

In this exam, you will be tasked with performing dataset preprocessing and fine-tuning a model for a masked language modeling task. Complete each step carefully according to the instructions provided.

### Model and Dataset Information

For this task, you will be working with the following:

- **Model Checkpoint**: Use the pre-trained model checkpoint `bert-base-uncased` for both the model and tokenizer.
- **Dataset**: You will be using the `CUTD/math_df` dataset. Ensure to load and preprocess the dataset correctly for training and evaluation.

**Note:**
- Any additional steps or methods you include that improve or enhance the results will be rewarded with bonus points if they are justified.
- The steps outlined here are suggestions. You are free to implement alternative methods or approaches to achieve the task, as long as you explain the reasoning and the process at the bottom of the notebook.
- You can use either TensorFlow or PyTorch for this task. If you prefer TensorFlow, feel free to use it when working with Hugging Face Transformers.
- The number of data samples you choose to work with is flexible. However, if you select a very low number of samples and the training time is too short, this could affect the evaluation of your work.

## Step 1: Load the Dataset

Load the dataset and split it into training and test sets. Use 20% of the data for testing.

In [25]:
!pip install datasets



Importting libraries

In [26]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForMaskedLM
import nltk
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorForLanguageModeling

nltk.download('stopwords')
nltk.download('english')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Error loading english: Package 'english' not found in
[nltk_data]     index
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [27]:
ds = load_dataset("CUTD/math_df", )

In [28]:
ds

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'text'],
        num_rows: 10000
    })
})

In [29]:
ds= ds["train"]

In [30]:
ds = ds.remove_columns(["Unnamed: 0"])

## Step 2: Load the Pretrained Model and Tokenizer

Use a pre-trained model and tokenizer for this task. Initialize both in this step.

In [31]:
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-uncased")

Some weights of the model checkpoint at google-bert/bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Step 3: Preprocess the Dataset

Define a preprocessing function that tokenizes the text data and prepares the inputs for the model. Ensure that you truncate the sequences to a maximum length of 512 tokens and pad them appropriately.

**Bonus**: If you performed more comprehensive preprocessing, such as removing links, converting text to lowercase, or applying additional preprocessing techniques.

In [32]:
ds[1]

{'text': 'A German literature college student who is interested in historical contexts of works and the personalities of authors, but normally dislikes theatrical works.'}

In [33]:
stop_words = set(stopwords.words('english'))
punctuations = '''`÷×؛<>_()*&^%][.ـ،/:"!?..,'{}~¦+|!”…“–ـ'''
stemmer = PorterStemmer()

def text_cleaning(text):

  # Removing links (URLs)
  text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)

  # Removing special characters and punctuations
  text = re.sub(r'[^\w\s]', '', text)
  pun = str.maketrans('', '', punctuations)
  text = text.translate(pun)

  # Lowercasing
  text = text.lower()

  # Removing English stopwords
  word_tokens = word_tokenize(text)
  filtered_text = [word for word in word_tokens if word not in stop_words]

  # Stemming
  stemmed_words = []
  for word in word_tokens:
    stemmed_words.append(stemmer.stem(word))
  stemmed_words = ' '.join(filtered_text)

  return stemmed_words

In [34]:
text= '''Natural language processing (NLP) is an interdisciplinary subfield of computer science and artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and'''

In [35]:
text_cleaning(text)

'natural language processing nlp interdisciplinary subfield computer science artificial intelligence primarily concerned providing computers ability process data encoded natural language thus closely related information retrieval knowledge representation'

In [36]:
cleaned_text = ds.map(lambda x: {'text': text_cleaning(x['text'])})

In [37]:
cleaned_text[1]

{'text': 'german literature college student interested historical contexts works personalities authors normally dislikes theatrical works'}

In [38]:
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["text"]] , max_length=512, padding="max_length", truncation=True)

tokenized_ds = cleaned_text.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=cleaned_text.column_names,
)

In [39]:
tokenized_ds = tokenized_ds.train_test_split(test_size= 0.2)

In [40]:
train_ds = tokenized_ds['train']
val_ds =tokenized_ds['test']

In [41]:
train_ds

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 8000
})

In [43]:
val_ds

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 2000
})

## Step 4: Define Training Arguments

Set up the training configuration, including parameters like learning rate, batch size, number of epochs, and weight decay.

In [48]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="pt")

In [49]:
training_args = TrainingArguments(
    output_dir="our_model",
    eval_strategy="epoch",
    learning_rate=5e-5,
    num_train_epochs=1,
    weight_decay=0.01,
)

## Step 5: Initialize the Trainer

In [50]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset = train_ds,
    eval_dataset = val_ds,
    data_collator=data_collator,
)

Initialize the Trainer using the model, training arguments, and datasets (both training and evaluation).

## Step 6: Fine-tune the Model

In [51]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,1.4852,1.335223


TrainOutput(global_step=1000, training_loss=1.5730953369140626, metrics={'train_runtime': 982.9487, 'train_samples_per_second': 8.139, 'train_steps_per_second': 1.017, 'total_flos': 2105638502400000.0, 'train_loss': 1.5730953369140626, 'epoch': 1.0})

In [52]:
trainer.evaluate()

{'eval_loss': 1.3439737558364868,
 'eval_runtime': 73.3912,
 'eval_samples_per_second': 27.251,
 'eval_steps_per_second': 3.406,
 'epoch': 1.0}

Run the training process using the initialized Trainer to fine-tune the model on the masked language modeling task.

## Step 7: Inference

Use the fine-tuned model for inference. Create a pipeline for masked language modeling and test it with a sample sentence.

In [54]:
model.save_pretrained("ourModel")
tokenizer.save_pretrained("ourModel")

('ourModel/tokenizer_config.json',
 'ourModel/special_tokens_map.json',
 'ourModel/vocab.txt',
 'ourModel/added_tokens.json',
 'ourModel/tokenizer.json')

In [55]:
from transformers import pipeline

MLM = pipeline("fill-mask", model="/content/ourModel")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [56]:
MLM("Hello I'm a [MASK] model.")

[{'score': 0.26639196276664734,
  'token': 2535,
  'token_str': 'role',
  'sequence': "hello i'm a role model."},
 {'score': 0.08713365346193314,
  'token': 1039,
  'token_str': 'c',
  'sequence': "hello i'm a c model."},
 {'score': 0.0815446600317955,
  'token': 1038,
  'token_str': 'b',
  'sequence': "hello i'm a b model."},
 {'score': 0.07847759872674942,
  'token': 2047,
  'token_str': 'new',
  'sequence': "hello i'm a new model."},
 {'score': 0.024538952857255936,
  'token': 1050,
  'token_str': 'n',
  'sequence': "hello i'm a n model."}]

In [57]:
MLM('T5 is a [MASK] bootcamp')

[{'score': 0.1572318822145462,
  'token': 1040,
  'token_str': 'd',
  'sequence': 't5 is a d bootcamp'},
 {'score': 0.10642322152853012,
  'token': 1048,
  'token_str': 'l',
  'sequence': 't5 is a l bootcamp'},
 {'score': 0.10559210926294327,
  'token': 1056,
  'token_str': 't',
  'sequence': 't5 is a t bootcamp'},
 {'score': 0.09486913681030273,
  'token': 1050,
  'token_str': 'n',
  'sequence': 't5 is a n bootcamp'},
 {'score': 0.0886613205075264,
  'token': 1039,
  'token_str': 'c',
  'sequence': 't5 is a c bootcamp'}]

In [58]:
MLM('A food manufacturer aim to [MASK] consumer trust')


[{'score': 0.1837501972913742,
  'token': 1041,
  'token_str': 'e',
  'sequence': 'a food manufacturer aim to e consumer trust'},
 {'score': 0.14127621054649353,
  'token': 1037,
  'token_str': 'a',
  'sequence': 'a food manufacturer aim to a consumer trust'},
 {'score': 0.07786485552787781,
  'token': 1051,
  'token_str': 'o',
  'sequence': 'a food manufacturer aim to o consumer trust'},
 {'score': 0.05086292698979378,
  'token': 1050,
  'token_str': 'n',
  'sequence': 'a food manufacturer aim to n consumer trust'},
 {'score': 0.04155458137392998,
  'token': 1055,
  'token_str': 's',
  'sequence': 'a food manufacturer aim to s consumer trust'}]