<a href="https://colab.research.google.com/github/Mahdi-Golizadeh/Natural-Language-Processing/blob/main/transformers/Masked_LM/fine_tuning_MLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning a masked language model with 🤗 transformers

## Installing and importing necessary libraries

In [1]:
!pip install -q transformers
!pip install -q datasets

[K     |████████████████████████████████| 5.8 MB 7.8 MB/s 
[K     |████████████████████████████████| 7.6 MB 46.5 MB/s 
[K     |████████████████████████████████| 182 kB 78.0 MB/s 
[K     |████████████████████████████████| 452 kB 8.1 MB/s 
[K     |████████████████████████████████| 212 kB 75.1 MB/s 
[K     |████████████████████████████████| 132 kB 80.3 MB/s 
[K     |████████████████████████████████| 127 kB 56.1 MB/s 
[?25h

In [16]:
import transformers
import datasets
import torch

we want to fine-tune "bert-base-uncased" in this notebook so we will initialize our model with proper checkpoint

In [3]:
checkpoint = "bert-base-uncased"

In [4]:
mlm_model = transformers.BertForMaskedLM.from_pretrained(checkpoint)

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


to see the number of parameters in loaded model

In [5]:
mlm_model.num_parameters()

109514298

## Usecase example

to see the performance of the model before training 

we need a sentence with `[MASK]` token in the place we want our model to fill it

In [6]:
text = "tomorrow the weather will be [MASK]."

we need to tokenize the sentence in order to give it to the model

### tokenizer

In [7]:
mlm_tokenizer = transformers.BertTokenizerFast.from_pretrained(checkpoint)

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [9]:
inputs = mlm_tokenizer(text, return_tensors= "pt")

now to get the logits out of model

In [10]:
token_logits = mlm_model(**inputs).logits

number of tokens in tokenizer

In [13]:
len(mlm_tokenizer)

30522

to find where in the input mask token is placed

In [20]:
mask_token_index = torch.where(inputs["input_ids"] == mlm_tokenizer.mask_token_id)[1]

extracting target token output vector

In [36]:
mask_token_logits = token_logits[0, mask_token_index, :]

next to select the top k best answers (`k = 3`)

In [37]:
top_3_tokens = torch.topk(mask_token_logits, 3, dim= -1).indices[0].tolist()

In [38]:
top_3_tokens

[2488, 2204, 2986]

now print out the predicion result

In [39]:
for token in top_3_tokens:
    print(f"{text.replace(mlm_tokenizer.mask_token, mlm_tokenizer.decode([token]))}")

tomorrow the weather will be better.
tomorrow the weather will be good.
tomorrow the weather will be fine.


## dataset

In [48]:
raw_dataset = datasets.load_dataset("glue", "cola")

Downloading builder script:   0%|          | 0.00/28.8k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/28.7k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/27.8k [00:00<?, ?B/s]

Downloading and preparing dataset glue/cola to /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Downloading data:   0%|          | 0.00/377k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8551 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1043 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1063 [00:00<?, ? examples/s]

Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [49]:
raw_dataset["train"][0]

{'sentence': "Our friends won't buy this analysis, let alone the next one we propose.",
 'label': 1,
 'idx': 0}

In [61]:
def tokenize(example):
    result = mlm_tokenizer(example["sentence"])
    if mlm_tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

In [62]:
tokenized_dataset = raw_dataset.map(tokenize)

  0%|          | 0/8551 [00:00<?, ?ex/s]

IndexError: ignored

In [58]:
tokenized_dataset["train"][0]

{'sentence': "Our friends won't buy this analysis, let alone the next one we propose.",
 'label': 1,
 'idx': 0,
 'input_ids': [101,
  2256,
  2814,
  2180,
  1005,
  1056,
  4965,
  2023,
  4106,
  1010,
  2292,
  2894,
  1996,
  2279,
  2028,
  2057,
  16599,
  1012,
  102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

because of computation limits we set max_length to be 128

In [51]:
# model max size
mlm_tokenizer.model_max_length

512

In [52]:
# custom max size
chunk_size = 128

concatination dataset sentences into each other in order to not lose the sentences end by truncation 

an example of procedure

In [None]:
examples = 