<a href="https://colab.research.google.com/github/Mahdi-Golizadeh/Natural-Language-Processing/blob/main/transformers/Masked_LM/fine_tuning_MLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning a masked language model with 🤗 transformers

## Installing and importing necessary libraries

In [1]:
!pip install -q transformers
!pip install -q datasets

In [2]:
import transformers
import datasets
import torch

We want to fine-tune `bert-base-uncased` in this notebook, so we initialize the model from that checkpoint.

In [3]:
checkpoint = "bert-base-uncased"

In [4]:
mlm_model = transformers.BertForMaskedLM.from_pretrained(checkpoint)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


To see the number of parameters in the loaded model:

In [5]:
mlm_model.num_parameters()

109514298
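
As a sanity check, this count can be reproduced by hand. The sketch below is a back-of-the-envelope calculation, not library code. It assumes the standard BERT-base sizes (hidden size 768, intermediate size 3072, 12 layers, 512 positions, 2 token types, vocabulary of 30522), that `BertForMaskedLM` carries no pooler, and that the output projection of the MLM head shares its weight matrix with the word embeddings so only its bias adds parameters.

```python
# Assumed BERT-base configuration values
hidden, intermediate, layers = 768, 3072, 12
vocab, positions, token_types = 30522, 512, 2

def linear(n_in, n_out):
    # weight matrix plus bias vector of a dense layer
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

# word, position and token-type embeddings, followed by a LayerNorm
embeddings = (vocab + positions + token_types) * hidden + layer_norm
# query, key, value and output projections, followed by a LayerNorm
attention = 4 * linear(hidden, hidden) + layer_norm
# two dense layers, followed by a LayerNorm
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = layers * (attention + feed_forward)
# transform layer + LayerNorm + output bias (output weight assumed tied to the embeddings)
mlm_head = linear(hidden, hidden) + layer_norm + vocab

total = embeddings + encoder + mlm_head
print(total)  # 109514298
```

Most of the parameters sit in the 12 encoder layers; the embedding matrix accounts for roughly a fifth of the total.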

## Use case example

To see how the model performs before fine-tuning, we need a sentence with a `[MASK]` token in the position we want the model to fill.

In [6]:
text = "tomorrow the weather will be [MASK]."

We need to tokenize the sentence before passing it to the model.

### Tokenizer

In [7]:
mlm_tokenizer = transformers.BertTokenizerFast.from_pretrained(checkpoint)

In [8]:
inputs = mlm_tokenizer(text, return_tensors= "pt")

Now get the logits from the model:

In [9]:
token_logits = mlm_model(**inputs).logits

The last dimension of the logits has one entry per token in the tokenizer's vocabulary:

In [10]:
len(mlm_tokenizer)

30522

Find the position of the mask token in the input:

In [11]:
mask_token_index = torch.where(inputs["input_ids"] == mlm_tokenizer.mask_token_id)[1]

Extract the logits at the mask position:

In [12]:
mask_token_logits = token_logits[0, mask_token_index, :]

Next, select the top k candidates (`k = 3`):

In [13]:
top_3_tokens = torch.topk(mask_token_logits, 3, dim= -1).indices[0].tolist()

In [14]:
top_3_tokens

[2488, 2204, 2986]
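
To make the top-k step concrete, here is a minimal sketch in plain Python. The five-word vocabulary and the logit values are made up for illustration; only the idea (rank the scores at the mask position and keep the best `k`) mirrors the `torch.topk` call above. The softmax is included to show how logits would be turned into probabilities like the scores a fill-mask pipeline reports.

```python
import math

def softmax(logits):
    # subtract the maximum for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, k):
    # return (index, probability) pairs for the k highest-scoring entries
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

# hypothetical vocabulary and logits at the [MASK] position
toy_vocab = ["better", "good", "fine", "cold", "purple"]
toy_logits = [4.0, 3.5, 3.0, 1.0, -2.0]

for idx, prob in top_k(toy_logits, 3):
    print(toy_vocab[idx], round(prob, 3))
```

Because softmax preserves ordering, ranking by logits (as the notebook does) and ranking by probabilities select the same tokens.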

Now print the prediction results:

In [15]:
for token in top_3_tokens:
    print(f"{text.replace(mlm_tokenizer.mask_token, mlm_tokenizer.decode([token]))}")

tomorrow the weather will be better.
tomorrow the weather will be good.
tomorrow the weather will be fine.


## Dataset

In [16]:
raw_dataset = datasets.load_dataset("glue", "cola")

Downloading and preparing dataset glue/cola to /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...

Generating train split:   0%|          | 0/8551 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1043 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1063 [00:00<?, ? examples/s]

Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.

In [17]:
raw_dataset["train"][0]

{'sentence': "Our friends won't buy this analysis, let alone the next one we propose.",
 'label': 1,
 'idx': 0}

In [18]:
mlm_tokenizer(raw_dataset["train"][1]["sentence"]).word_ids()[2]

1
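
The `word_ids()` call maps each token back to the index of the word it came from, with `None` for special tokens. The toy function below only imitates that mapping for WordPiece-style tokens; it is a hypothetical helper that relies on the `##` continuation prefix, whereas the real tokenizer tracks this internally. The token list is illustrative, not actual tokenizer output.

```python
def toy_word_ids(tokens, special=("[CLS]", "[SEP]", "[PAD]")):
    # hypothetical helper: special tokens map to None,
    # "##" pieces continue the current word, anything else starts a new word
    ids, current = [], -1
    for tok in tokens:
        if tok in special:
            ids.append(None)
        elif tok.startswith("##"):
            ids.append(current)
        else:
            current += 1
            ids.append(current)
    return ids

# illustrative tokens; a long word is split into several pieces
tokens = ["[CLS]", "the", "figures", "are", "getting", "mu", "##rk", "##ier", ".", "[SEP]"]
print(toy_word_ids(tokens))
# [None, 0, 1, 2, 3, 4, 4, 4, 5, None]
```

All three pieces of the split word share the same word index, which is what lets later steps treat them as one unit.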

In [19]:
def tokenize(example):
    result = mlm_tokenizer(example["sentence"])
    if mlm_tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

In [20]:
tokenized_dataset = raw_dataset.map(tokenize, batched= True, remove_columns= raw_dataset["train"].column_names)

In [21]:
tokenized_dataset["train"][0]

{'input_ids': [101,
  2256,
  2814,
  2180,
  1005,
  1056,
  4965,
  2023,
  4106,
  1010,
  2292,
  2894,
  1996,
  2279,
  2028,
  2057,
  16599,
  1012,
  102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'word_ids': [None,
  0,
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9,
  10,
  11,
  12,
  13,
  14,
  15,
  16,
  None]}

Because of compute limits, we use a chunk size of 128 tokens instead of the model's maximum of 512.

In [22]:
# model max size
mlm_tokenizer.model_max_length

512

In [23]:
# custom max size
chunk_size = 128

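Instead of truncating, we concatenate the tokenized sentences and split the result into fixed-size chunks, so that the ends of sentences are not lost.

An example of the procedure on the first three training examples, using a chunk size of 16 for illustration: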

In [24]:
examples = tokenized_dataset["train"][:3]

In [25]:
concat_examples = {k: sum(examples[k], []) for k in examples.keys()}

In [26]:
total_length = len(concat_examples["input_ids"])

In [27]:
total_length

47

In [28]:
chunks = {k: [t[i: i + 16] for i in range(0, total_length, 16)] for k, t in concat_examples.items()}
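
A quick arithmetic check of what this slicing produces for 47 tokens and a chunk size of 16. This is only a sketch of the index math, with no tokenizer involved:

```python
total_length = 47  # length of the concatenated example above
size = 16          # illustrative chunk size

# start offsets generated by range(0, total_length, size)
starts = list(range(0, total_length, size))
# each chunk holds `size` tokens, except possibly the last one
lengths = [min(size, total_length - s) for s in starts]

print(starts)   # [0, 16, 32]
print(lengths)  # [16, 16, 15]
```

The last chunk is shorter than the others, so a batch-level version of this step has to either pad or drop that remainder.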

In [29]:
# len(list(concat_examples.keys())[0])

Now we define a function that applies this process to a whole batch:

In [30]:
def group_texts(example):
    # concatenate every column (input_ids, attention_mask, ...) across the batch
    concat_example = {k: sum(example[k], []) for k in example.keys()}
    total_length = len(concat_example[list(example.keys())[0]])
    # drop the last partial chunk so every chunk has exactly chunk_size tokens
    total_length = (total_length // chunk_size) * chunk_size
    result = {k: [t[i: i + chunk_size] for i in range(0, total_length, chunk_size)] for k, t in concat_example.items()}
    # copy input_ids as labels: in MLM some input ids are masked at random, and the labels keep the original ids
    result["labels"] = result["input_ids"].copy()
    return result
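
To see the effect of dropping the remainder, here is a self-contained toy version of the same idea on plain lists. The function name `group_into_chunks`, the token ids, and the chunk size of 5 are all made up for illustration; it is a sketch of the technique, not a replacement for `group_texts`.

```python
def group_into_chunks(batch, size):
    # flatten each column (a list of lists) into one long list
    flat = {k: [tok for seq in v for tok in seq] for k, v in batch.items()}
    total = len(flat["input_ids"])
    usable = (total // size) * size  # discard the incomplete tail
    chunks = {k: [v[i:i + size] for i in range(0, usable, size)] for k, v in flat.items()}
    # labels start as a copy of the inputs
    chunks["labels"] = [list(c) for c in chunks["input_ids"]]
    return chunks

# three short "sentences" with made-up ids (12 tokens in total)
toy_batch = {"input_ids": [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 102]]}
out = group_into_chunks(toy_batch, size=5)
print(out["input_ids"])
# [[101, 7, 8, 102, 101], [9, 102, 101, 5, 6]]
```

Twelve tokens give two full chunks of five; the last two tokens are discarded. Note that chunks cut across sentence boundaries, which is expected with this approach.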

In [31]:
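# note: this assignment rebinds the name `datasets`, shadowing the imported module for the rest of the notebook
datasets = tokenized_dataset.map(group_texts, batched= True)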

In [32]:
datasets["train"][0]

{'input_ids': [101,
  2256,
  2814,
  2180,
  1005,
  1056,
  4965,
  2023,
  4106,
  1010,
  2292,
  2894,
  1996,
  2279,
  2028,
  2057,
  16599,
  1012,
  102,
  101,
  2028,
  2062,
  18404,
  2236,
  3989,
  1998,
  1045,
  1005,
  1049,
  3228,
  2039,
  1012,
  102,
  101,
  2028,
  2062,
  18404,
  2236,
  3989,
  2030,
  1045,
  1005,
  1049,
  3228,
  2039,
  1012,
  102,
  101,
  1996,
  2062,
  2057,
  2817,
  16025,
  1010,
  1996,
  13675,
  16103,
  2121,
  2027,
  2131,
  1012,
  102,
  101,
  2154,
  2011,
  2154,
  1996,
  8866,
  2024,
  2893,
  14163,
  8024,
  3771,
  1012,
  102,
  101,
  1045,
  1005,
  2222,
  8081,
  2017,
  1037,
  4392,
  1012,
  102,
  101,
  5965,
  27129,
  1996,
  4264,
  4257,
  1012,
  102,
  101,
  3021,
  19055,
  2010,
  2126,
  2041,
  1997,
  1996,
  4825,
  1012,
  102,
  101,
  2057,
  1005,
  2128,
  5613,
  1996,
  2305,
  2185,
  1012,
  102,
  101,
  11458,
  25756,
  1996,
  3384,
  4257,
  1012,
  102,
  101,
  1996,
  440

## Data Collator

We need a special data collator that randomly masks some of the tokens in each batch.

In [33]:
data_collator = transformers.DataCollatorForLanguageModeling(
    tokenizer= mlm_tokenizer,
    mlm_probability= .15,
)

In [34]:
data_collator

DataCollatorForLanguageModeling(tokenizer=PreTrainedTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}), mlm=True, mlm_probability=0.15, pad_to_multiple_of=None, tf_experimental_compile=False, return_tensors='pt')

To test it on two chunks (we drop the `word_ids` column first, since the collator does not use it):

In [35]:
samples = [datasets["train"][i] for i in range(2)]

In [36]:
for s in samples:
    _ = s.pop("word_ids")
for chunk in data_collator(samples)["input_ids"]:
    print(mlm_tokenizer.decode(chunk))

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


[CLS] our friends [MASK]'t [MASK] this analysis, let [MASK] the next [MASK] we propose. [SEP] [CLS] [MASK] more pseudo generalization and i'm [MASK] [MASK]. [SEP] [CLS] one more [MASK] generalization or i'm giving [MASK]. [SEP] [CLS] the more we [MASK] verbs, the crazier they get. [SEP] [CLS] day by day the [MASK] are [MASK] [MASK]rk [MASK] [MASK] [SEP] [CLS] i [MASK] ll fix you a drink. [SEP] [CLS] fred [MASK] the plants flat. [SEP] [CLS] bill coughed [MASK] way out [MASK] the [MASK]. [SEP] [CLS] we're dancing the night away. [SEP] [CLS] herman hammered the metal flat. [SEP] [CLS] the [MASK] laughed the play
[MASK] [MASK] stage. [SEP] [CLS] the pond froze solid. [SEP] [CLS] bill rolled [MASK] of the room. [SEP] [CLS] the gardener watered the flowers [MASK]. [SEP] [CLS] the gardener watered the flowers. [SEP] [CLS] bill [MASK] the bathtub into pieces. [SEP] [CLS] bill broke the bath [MASK]. [SEP] [CLS] they drank the pub dry. [SEP] [CLS] they drank [MASK] [MASK]. [SEP] [CLS] the profes

As you can see, `[MASK]` tokens have been inserted at random positions in the input text.
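
The masking itself can be sketched in a few lines of plain Python. This is a simplified illustration, not the collator's actual implementation (which works on tensors). It assumes the usual BERT recipe, which to my understanding is also the collator's default: each non-special token is selected with probability 0.15; a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged otherwise; labels are `-100` (ignored by the loss) everywhere except at selected positions. The function name `mask_tokens` is hypothetical.

```python
import random

MASK_ID, CLS_ID, SEP_ID = 103, 101, 102  # special token ids in bert-base-uncased
VOCAB_SIZE = 30522
IGNORE = -100  # label value skipped by the cross-entropy loss

def mask_tokens(input_ids, mlm_probability=0.15, rng=None):
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE] * len(input_ids)
    for pos, tok in enumerate(input_ids):
        if tok in (CLS_ID, SEP_ID):
            continue  # never mask special tokens
        if rng.random() >= mlm_probability:
            continue  # token not selected for prediction
        labels[pos] = tok  # the model must recover the original id
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = MASK_ID
        elif roll < 0.9:
            masked[pos] = rng.randrange(VOCAB_SIZE)
        # otherwise the token is kept as it is
    return masked, labels

ids = [CLS_ID] + list(range(1000, 1100)) + [SEP_ID]
masked, labels = mask_tokens(ids, rng=random.Random(0))
print(sum(label != IGNORE for label in labels), "of", len(ids), "tokens selected")
```

Since the selection is random, a different set of positions is masked every time a batch is built, which is why evaluation losses vary slightly from run to run.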

## Training

In [37]:
batch_size = 64
log_step = len(datasets["train"]) // batch_size
training_args = transformers.TrainingArguments(
    output_dir= "finetuned-bert-base-cased-glue",
    overwrite_output_dir= True,
    evaluation_strategy= "epoch",
    learning_rate= 2e-5,
    weight_decay= .01,
    per_device_train_batch_size= batch_size,
    per_device_eval_batch_size= batch_size,
    logging_steps= log_step,
    num_train_epochs= 5,
    fp16= True,
    save_strategy= "epoch",
    save_total_limit= 1,
    optim= "adamw_torch"
)

In [47]:
trainer = transformers.Trainer(
    model= mlm_model,
    tokenizer= mlm_tokenizer,
    args= training_args,
    train_dataset= datasets["train"],
    eval_dataset= datasets["validation"],
    data_collator= data_collator,
)

Using cuda_amp half precision backend


In [48]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 753
  Num Epochs = 5
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 60
  Number of trainable parameters = 109514298


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 1.6298 | 2.214781 |
| 2 | 1.5691 | 2.170288 |
| 3 | 1.5442 | 2.077898 |
| 4 | 1.652 | 2.150103 |
| 5 | 1.6476 | 1.951814 |


The following columns in the evaluation set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 94
  Batch size = 64
Saving model checkpoint to finetuned-bert-base-cased-glue/checkpoint-12
Configuration saved in finetuned-bert-base-cased-glue/checkpoint-12/config.json
Model weights saved in finetuned-bert-base-cased-glue/checkpoint-12/pytorch_model.bin
tokenizer config file saved in finetuned-bert-base-cased-glue/checkpoint-12/tokenizer_config.json
Special tokens file saved in finetuned-bert-base-cased-glue/checkpoint-12/special_tokens_map.json
Deleting older checkpoint [finetuned-bert-base-cased-glue/checkpoint-60] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are

TrainOutput(global_step=60, training_loss=1.6145898183186849, metrics={'train_runtime': 57.9171, 'train_samples_per_second': 65.007, 'train_steps_per_second': 1.036, 'total_flos': 247741530048000.0, 'train_loss': 1.6145898183186849, 'epoch': 5.0})
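
The step counts in the training log follow from simple arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept rather than dropped; the numbers 753, 64 and 5 are taken from the log above.

```python
import math

num_examples = 753  # "Num examples" in the training log
batch_size = 64
num_epochs = 5

# the last, smaller batch still counts as one optimization step
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
# value used for logging_steps in the training arguments
log_step = num_examples // batch_size

print(steps_per_epoch)  # 12
print(total_steps)      # 60
print(log_step)         # 11
```

This matches `Total optimization steps = 60` and the `checkpoint-12` saved after the first epoch. Because `log_step` is 11 rather than 12, the training loss is logged slightly more often than once per epoch.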

## Evaluation

In [49]:
eval_result = trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 94
  Batch size = 64


In [50]:
eval_result

{'eval_loss': 2.1994807720184326,
 'eval_runtime': 0.3491,
 'eval_samples_per_second': 269.29,
 'eval_steps_per_second': 5.73,
 'epoch': 5.0}

In [51]:
import math
print(f"perplexity is : {math.exp(eval_result['eval_loss']):4.2f}")

perplexity is : 9.02
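
Perplexity is the exponential of the average negative log-likelihood per predicted token, which is what `eval_loss` measures. The sketch below illustrates the definition on made-up probabilities; the helper name `perplexity` is hypothetical, and the last line simply repeats the notebook's calculation on the reported loss.

```python
import math

def perplexity(token_probs):
    # mean negative log-likelihood over the predicted tokens, then exponentiate
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# a model that always spreads its belief evenly over 4 candidates
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# a more confident model gets a lower perplexity
confident = perplexity([0.9, 0.8, 0.95, 0.7])
print(uniform, confident)

# the evaluation loss reported above
print(round(math.exp(2.1994807720184326), 2))  # 9.02
```

A perplexity of about 9 can be read as the model being, on average, as uncertain as if it were choosing uniformly among roughly nine tokens at each masked position.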


Testing the model with `pipeline`:

In [52]:
test_model = transformers.pipeline("fill-mask", model= "/content/finetuned-bert-base-cased-glue/checkpoint-60")

loading configuration file /content/finetuned-bert-base-cased-glue/checkpoint-60/config.json
Model config BertConfig {
  "_name_or_path": "/content/finetuned-bert-base-cased-glue/checkpoint-60",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [54]:
test_model(text, top_k= 2)

[{'score': 0.21149860322475433,
  'token': 2204,
  'token_str': 'good',
  'sequence': 'tomorrow the weather will be good.'},
 {'score': 0.1700924187898636,
  'token': 2488,
  'token_str': 'better',
  'sequence': 'tomorrow the weather will be better.'}]