<a href="https://colab.research.google.com/github/Mahdi-Golizadeh/Natural-Language-Processing/blob/main/transformers/Masked_LM/fine_tuning_mlm_CL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning a Masked Language Model
In this notebook, a RoBERTa model is fine-tuned on the GLUE CoLA dataset with the masked language modeling (MLM) objective.
* The Hugging Face transformers library is used.
* To save computation, a small dataset (GLUE CoLA) is used.
* The same process can be applied to any suitable model and dataset.
* The training loop is customizable.
* For efficient training, the Hugging Face accelerate library is used.

## Installing and Importing necessary libraries

We have to install the transformers, datasets, and accelerate libraries in Colab, since they are not installed by default.

In [1]:
!pip install -q transformers
!pip install -q datasets
!pip install -q accelerate


In [2]:
import transformers
import datasets
import torch
import accelerate
import math
from tqdm.auto import tqdm

## Selecting Model

The roberta-base checkpoint is used for fine-tuning.

In [3]:
checkpoint = "roberta-base"

In [4]:
model = transformers.RobertaForMaskedLM.from_pretrained(checkpoint)


In [5]:
model.num_parameters()

124697433

Before fine-tuning, here is a quick test case to see how unmasking works with the pretrained model.
The mask token in RoBERTa is `<mask>`.

In [6]:
text = "tomorrow the weather will be <mask>."

In [7]:
tokenizer = transformers.RobertaTokenizerFast.from_pretrained(checkpoint)


In [8]:
inputs = tokenizer(text, return_tensors= "pt")

In [9]:
token_logits = model(**inputs).logits

In [10]:
token_logits.shape

torch.Size([1, 10, 50265])

In [11]:
inputs, tokenizer.mask_token_id

({'input_ids': tensor([[    0, 33063, 24805,     5,  1650,    40,    28, 50264,     4,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])},
 50264)

In [12]:
tokenizer.mask_token

'<mask>'

In [13]:
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

In [14]:
mask_token_index

tensor([7])

In [15]:
mask_token_logits = token_logits[0, mask_token_index, :]

In [16]:
top_3_tokens = torch.topk(mask_token_logits, 3, dim= -1).indices[0].tolist()

In [17]:
top_3_tokens

[1099, 699, 357]

In [18]:
for token in top_3_tokens:
    print(f"{text.replace(tokenizer.mask_token, tokenizer.decode([token]))}")

tomorrow the weather will be  bad.
tomorrow the weather will be  clear.
tomorrow the weather will be  better.
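
The top-k step above works directly on raw logits. The scores reported by a fill-mask pipeline are softmax probabilities over the vocabulary, so the ranking is the same but the numbers are normalized. A minimal pure-Python sketch of that conversion, using a made-up five-word vocabulary and made-up logits (both are illustrative assumptions, not values from roberta-base):

```python
import math

def softmax(logits):
    # subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    # indices of the k largest probabilities, highest first
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

# hypothetical toy vocabulary and logits for the masked position
toy_vocab = [" bad", " clear", " better", " purple", " seven"]
toy_logits = [4.0, 3.5, 3.2, -1.0, -2.0]

toy_probs = softmax(toy_logits)
for i in top_k(toy_probs, 3):
    print(f"{toy_vocab[i]!r}: {toy_probs[i]:.3f}")
```

Because softmax is monotonic, taking the top k of the logits and the top k of the probabilities selects the same tokens.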


## Selecting and Downloading DataSet

Because of the computation limits in Colab, a small dataset is used; it can be swapped for any suitable dataset.

In [19]:
raw_dataset = datasets.load_dataset("glue", "cola")

Downloading builder script:   0%|          | 0.00/28.8k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/28.7k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/27.8k [00:00<?, ?B/s]

Downloading and preparing dataset glue/cola to /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Downloading data:   0%|          | 0.00/377k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8551 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1043 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1063 [00:00<?, ? examples/s]

Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [20]:
def tokenize(example):
    result = tokenizer(example["sentence"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

## Preprocessing Dataset

In [21]:
tokenized_dataset = raw_dataset.map(tokenize, batched= True, remove_columns= raw_dataset["train"].column_names)

  0%|          | 0/9 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [22]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 1063
    })
})

In [23]:
chunk_size = 128

In [24]:
tokenizer.model_max_length

512

For MLM we need to concatenate all the tokenized examples and then split the result into fixed-size chunks.

In [25]:
def group_texts(example):
    concat_example = {k: sum(example[k], []) for k in example.keys()}
    total_length = len(concat_example[list(example.keys())[0]])
    total_length = (total_length // chunk_size) * chunk_size
    result = {k: [t[i: i + chunk_size] for i in range(0, total_length, chunk_size)] for k, t in concat_example.items()}
    # copy input_ids as labels: in MLM some of these ids are randomly masked later,
    # and the original ids at the masked positions are what the model must predict
    result["labels"] = result["input_ids"].copy()
    return result
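
To see what `group_texts` does, here is a small stand-alone sketch of the same concatenate-then-chunk logic on toy token ids. The function name `group_texts_toy`, the chunk size of 4, and the ids themselves are made up for illustration.

```python
def group_texts_toy(examples, chunk_size):
    # flatten every column (a list of lists) into one long list
    concatenated = {k: sum(v, []) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # drop the trailing remainder that does not fill a whole chunk
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # labels start out as a copy of the unmasked input ids
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result

toy_batch = {
    "input_ids": [[0, 11, 12, 2], [0, 21, 2], [0, 31, 32, 33, 34, 2]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1, 1, 1]],
}
toy_chunks = group_texts_toy(toy_batch, chunk_size=4)
print(toy_chunks["input_ids"])
```

Note that chunks can span sentence boundaries, and the last few tokens are dropped when they do not fill a whole chunk.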

In [26]:
# note: this rebinds the name `datasets`, shadowing the imported `datasets` module from here on
datasets = tokenized_dataset.map(group_texts, batched= True)

  0%|          | 0/9 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [27]:
datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 753
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 94
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 95
    })
})

## Selecting DataCollator

In [28]:
data_collator = transformers.DataCollatorForLanguageModeling(
    tokenizer= tokenizer,
    mlm_probability= .15,
)
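
With `mlm_probability= .15`, the collator selects roughly 15% of the non-special tokens as prediction targets. In the standard BERT-style recipe that `DataCollatorForLanguageModeling` follows by default, about 80% of the selected tokens are replaced by the mask token, about 10% by a random token, and the rest are left unchanged. Labels at unselected positions are set to -100 so the loss ignores them. The sketch below imitates that rule in plain Python; `mask_tokens_toy` and the toy ids are hypothetical, and this is not the library's implementation.

```python
import random

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def mask_tokens_toy(input_ids, mask_id, vocab_size, special_ids,
                    mlm_probability=0.15, seed=0):
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for pos, tok in enumerate(input_ids):
        # never select special tokens such as <s> or </s>
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[pos] = tok  # the model must recover the original id here
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = mask_id                    # replace with <mask>
        elif roll < 0.9:
            masked[pos] = rng.randrange(vocab_size)  # replace with a random id
        # otherwise keep the original token unchanged
    return masked, labels

# toy sequence: <s> (id 0), thirty made-up token ids, </s> (id 2)
toy_ids = [0] + list(range(100, 130)) + [2]
toy_masked, toy_labels = mask_tokens_toy(
    toy_ids, mask_id=50264, vocab_size=50265, special_ids={0, 2}
)
print(toy_masked)
print(toy_labels)
```

Because the masking is random, passing the training set through the collator gives a different masking pattern on every epoch.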

In [29]:
def insert_random_mask(batch):
    features = [dict(zip(batch,t)) for t in zip(*batch.values())]
    masked_inputs = data_collator(features)
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}
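
`insert_random_mask` first turns the batch from a dict of columns (the format `map` passes in when `batched= True`) into a list of per-example dicts, which is what the collator expects. A toy illustration of that one-liner, with made-up values:

```python
toy_batch = {
    "input_ids": [[0, 5, 2], [0, 7, 2]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
# zip(*values) walks the columns in parallel, one example at a time;
# zip(toy_batch, row) pairs each column name with that example's value
toy_features = [dict(zip(toy_batch, row)) for row in zip(*toy_batch.values())]
print(toy_features)
```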

In [30]:
datasets = datasets.remove_columns(["word_ids"])

In [31]:
eval_dataset = datasets["test"].map(insert_random_mask, batched= True, remove_columns= datasets["test"].column_names)

  0%|          | 0/1 [00:00<?, ?ba/s]

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [32]:
eval_dataset = eval_dataset.rename_columns({'masked_input_ids': 'input_ids', 'masked_attention_mask': 'attention_mask', 'masked_labels': 'labels'})

In [33]:
batch_size= 8

## Preparing Data for Custom Loop

In [34]:
train_dataloader = torch.utils.data.DataLoader(
    datasets["train"],
    shuffle= True,
    batch_size= batch_size,
    collate_fn= data_collator,
)

In [35]:
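# the evaluation set was already masked once by insert_random_mask, so use the
# default collator here; the MLM collator would re-mask it on every pass
eval_dataloader = torch.utils.data.DataLoader(
    eval_dataset,
    batch_size= batch_size,
    collate_fn= transformers.default_data_collator,
)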

## Defining Optimizer

In [36]:
optimizer = torch.optim.AdamW(model.parameters(), lr= 5e-5)

Initialize the accelerator and let it prepare the model, optimizer, and dataloaders for the available device(s).

In [37]:
accelerator = accelerate.Accelerator()

In [38]:
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [39]:
output_dir = "."
num_training_epochs = 5
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_training_epochs * num_update_steps_per_epoch

In [40]:
lr_scheduler = transformers.get_scheduler("linear", optimizer= optimizer,
                                          num_warmup_steps= 0,
                                          num_training_steps= num_training_steps)
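
With no warmup, the "linear" schedule decays the learning rate from its initial value to zero over `num_training_steps`. The sketch below reproduces that shape in plain Python, assuming 753 training chunks, a batch size of 8, and 5 epochs on a single device as in this notebook; `linear_lr` is a hypothetical helper, not a transformers function.

```python
import math

base_lr = 5e-5
num_train_rows = 753      # chunks in the training split
batch_size = 8
num_training_epochs = 5

# the last batch of each epoch is smaller, hence the ceiling
steps_per_epoch = math.ceil(num_train_rows / batch_size)
num_training_steps = num_training_epochs * steps_per_epoch

def linear_lr(step, num_warmup_steps=0):
    # linear warmup (unused here), then linear decay to zero
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step))
```

Under these assumptions there are 95 steps per epoch and 475 steps in total, which is the length of the progress bar in the training loop.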

## Training Loop

In [43]:
progress_bar = tqdm(range(num_training_steps))
for epoch in range(num_training_epochs):
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
    model.eval()
    losses = []
    for steps, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)
        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))
    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")
    print(f"Epoch {epoch}: perplexity: {perplexity}")
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function= accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)

  0%|          | 0/475 [00:00<?, ?it/s]

Epoch 0: perplexity: 10.583824440778177
Epoch 1: perplexity: 11.70735174861322
Epoch 2: perplexity: 11.4341746298693
Epoch 3: perplexity: 9.965140052847232
Epoch 4: perplexity: 10.22415000316489
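
Perplexity here is the exponential of the mean cross-entropy loss over the evaluation set, so lower is better. A small sketch with made-up per-batch losses (the values are illustrative, not taken from this run; `perplexity_from_losses` is a hypothetical helper):

```python
import math

def perplexity_from_losses(losses):
    mean_loss = sum(losses) / len(losses)
    try:
        return math.exp(mean_loss)
    except OverflowError:
        # a very large loss overflows exp(); report it as infinite perplexity
        return float("inf")

toy_losses = [2.41, 2.30, 2.36]
print(perplexity_from_losses(toy_losses))
```

Read the other way, the perplexity of about 10.58 reported for epoch 0 corresponds to a mean loss of ln(10.58) ≈ 2.36.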


## Testing Model

To test the fine-tuned model, a fill-mask pipeline is used for simplicity.

In [45]:
unmasker = transformers.pipeline("fill-mask", model=".")

In [48]:
unmasker(text)

[{'score': 0.11665277928113937,
  'token': 1099,
  'token_str': ' bad',
  'sequence': 'tomorrow the weather will be bad.'},
 {'score': 0.07154655456542969,
  'token': 2569,
  'token_str': ' cold',
  'sequence': 'tomorrow the weather will be cold.'},
 {'score': 0.05540602281689644,
  'token': 699,
  'token_str': ' clear',
  'sequence': 'tomorrow the weather will be clear.'},
 {'score': 0.042571429163217545,
  'token': 2051,
  'token_str': ' fine',
  'sequence': 'tomorrow the weather will be fine.'},
 {'score': 0.04048221558332443,
  'token': 2579,
  'token_str': ' nice',
  'sequence': 'tomorrow the weather will be nice.'}]

In [49]:
for pred in unmasker(text):
    print(pred["sequence"])

tomorrow the weather will be bad.
tomorrow the weather will be cold.
tomorrow the weather will be clear.
tomorrow the weather will be fine.
tomorrow the weather will be nice.
