# RoBERTa Model Pre-Training

In this notebook we will train a transformer based model more specifically `KantaiBERT` model using a data set based on three books by Immanuel Kant. 

[Orignal Sourcecode](https://github.com/PacktPublishing/Transformers-for-Natural-Language-Processing/blob/main/Chapter03/KantaiBERT.ipynb)

We will first create the dataset, train a tokenizer(mehh basically word2tokenid, tokenid2word functionality) and then we will pretrain the model that can be used on downstream tasks.

The model we are trying to build is a `Robustly Optimized BERT Pretraining Approach (RoBERTa)` like model which is based on a BERT architecture.

The initial BERT models are udertrained compared to the data it was given. RoBERTa models were introduced to fix that issue by providing improved mechanisms to the pretraining process. Also it increase the training performance of downstreaming tasks as well. The techniques used by RoBERTa is interesting and it is worth reading!


### General Details:

In this notebook, KantaiBERT will be trained as a small model with 6 layers, 12 heads of attention. Also it will utilize a Byte-Level Byte-Pair encoding tokenizer like the GPT2 models. Also the sequences will be segmented using a separation token \</s> rather than using token type ids.

We will train the model with Masked Language Modelling technique. 


#### 1. Loading the dataset

We are using a dataset consist of 3 books data. It is available in the data folder of the project. Original data is from [Project Gutenberg](https://www.gutenberg.org).

#### 2. Training the tokenizer

We will use the HuggingFace provided ByteLevelBPETokenizer as the tokenizer for our model. This also will break the words in to substring/ subword. But the approach is bit different. 
[A medium article on BytePair Encoding](https://towardsdatascience.com/byte-pair-encoding-subword-based-tokenization-algorithm-77828a70bee0)


Below are some parameter values we will use for the tokenizer.

    - files: Path for the data

    - vocab_size: This is the required vocabulary size. We will use 52000.

    - min_frequency: This is the threshold for selecting a token. We will use 2.
    
    - special_tokens: The list of special tokens we need to have. In our case \<s>, \<pad>, \</s>, \<unk>, \<mask>.

In [1]:
from lib2to3.pgen2 import token
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

tokenizer.train(files='data/kant.txt', vocab_size=52000, min_frequency=2, 
                special_tokens=[
                    "<s>",
                    "<pad>",
                    "</s>",
                    "<unk>",
                    "<mask>"
                ])

So now our tokenizer is trained. We can use it to our later works.
But is is beneficial to save the tokenizer to disk, so we can use it later any time. When we save the tokenizers, it usually creates 2 files.

1. merges.txt --> which contains merged tokenized sub-strings
2. vocab.json --> which contains the indices of the tokenized sub-strings.

(Apparently the weird 'Ġ' symbol denotes the space character.)

In [4]:
tokenizer.save_model('models/kantaiBERT')

['models/kantaiBERT\\vocab.json', 'models/kantaiBERT\\merges.txt']

Below include some samples tokenized using the above tokenizer.

In [6]:
tokenizer.encode('My name is Dilan and I am old').tokens

['My', 'Ġname', 'Ġis', 'ĠD', 'ilan', 'Ġand', 'ĠI', 'Ġam', 'Ġold']

But this is not enough for BERT processing. We need to make sure that sentences get seperated with related tokens (\<\s>, \<s>) as well. We can do this manually or we can use the transformer libnary provided BERTProcessing class to do that.

In [7]:
from tokenizers.processors import BertProcessing


tokenizer._tokenizer.post_processor = BertProcessing(
                            ("</s>", tokenizer.token_to_id("</s>")),
                            ("<s>", tokenizer.token_to_id("<s>"))
                        )

Now if we tokenize the same sentence as before, output would be like this.

In [8]:
tokenizer.encode('My name is Dilan and I am old').tokens

['<s>',
 'My',
 'Ġname',
 'Ġis',
 'ĠD',
 'ilan',
 'Ġand',
 'ĠI',
 'Ġam',
 'Ġold',
 '</s>']

Now tokenizer is in hand, we can move on to model configuration.

#### 3. Model Configuration

As we said earlier we will be training a RoBERTa type transformer with same number of layers and heads as DistilBERT model. So we will define our configurations accordingly.

In [10]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52000,
    max_postion_embedding=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1
)

We will go into more details about these little later. 

But as you remember we earlier trained a tokenizer with base BytePairlevel class. Instead we will load the saved data of it using the RobertaTokenizer class.

In [8]:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('models/kantaiBERT/', max_length=512)
tokenizer.convert_ids_to_tokens(tokenizer.encode("My name is Dilan and I am old."))

['<s>',
 'My',
 'Ġname',
 'Ġis',
 'ĠD',
 'ilan',
 'Ġand',
 'ĠI',
 'Ġam',
 'Ġold',
 '.',
 '</s>']

Now lets load the Roberta Model specified for the MLM task.

In [15]:
from transformers import RobertaForMaskedLM

# Initializing the model with previously created config data.
model = RobertaForMaskedLM(config=config)

In [16]:
print(model)

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(52000, 768, padding_idx=1)
      (position_embeddings): Embedding(512, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNor

#### 4. Prepare the Dataset

Now we need to prepare our dataset for training. We already have the tokenizer ready. So all we need to do is read the dataset and encode them in a proper manner.

In [None]:
from dataclasses import dataclass
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(tokenizer, 'data/kant.txt', block_size=128)

Above code returns a dataset with encoded ids like below in Tensor format.

In [11]:
dataset.examples[1]

{'input_ids': tensor([   0, 1536, 2574,  300,  348,  267,  787,  270, 3525, 4460,  435,  512,
         3716,  305,  359,    2])}

In [12]:
tokenizer.convert_ids_to_tokens(dataset.examples[1]['input_ids'])

['<s>',
 'This',
 'ĠeBook',
 'Ġis',
 'Ġfor',
 'Ġthe',
 'Ġuse',
 'Ġof',
 'Ġanyone',
 'Ġanywhere',
 'Ġat',
 'Ġno',
 'Ġcost',
 'Ġand',
 'Ġwith',
 '</s>']

But we need to get this data as a batch for the training process. To do that we can either manually type the code or use the transformer provided DataCollator class like below.

In [24]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)


Note the paremeters mlm and mlm_probability. These define the DataCollator to use MLM like data preparation and sets the token masking probability to 15%.

With those details Now we can move onto the actual model training.

In [25]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='models/kantaiBERT/',
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_eval_batch_size=64,
    save_steps=10000,
    save_total_limit=2
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)

In [26]:
trainer.train()

***** Running training *****
  Num examples = 170964
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 21371
  2%|▏         | 503/21371 [00:37<22:16, 15.61it/s] 

{'loss': 7.3093, 'learning_rate': 4.8830190444995554e-05, 'epoch': 0.02}


  5%|▍         | 1003/21371 [01:09<22:10, 15.31it/s]

{'loss': 6.4482, 'learning_rate': 4.766038088999111e-05, 'epoch': 0.05}


  7%|▋         | 1503/21371 [01:42<21:59, 15.05it/s]

{'loss': 6.0866, 'learning_rate': 4.649057133498666e-05, 'epoch': 0.07}


  9%|▉         | 2003/21371 [02:16<22:09, 14.57it/s]

{'loss': 6.0697, 'learning_rate': 4.532076177998222e-05, 'epoch': 0.09}


 12%|█▏        | 2501/21371 [02:49<21:36, 14.56it/s]

{'loss': 5.9185, 'learning_rate': 4.415095222497778e-05, 'epoch': 0.12}


 14%|█▍        | 3003/21371 [03:23<22:42, 13.48it/s]

{'loss': 5.9707, 'learning_rate': 4.298114266997333e-05, 'epoch': 0.14}


 16%|█▋        | 3501/21371 [03:58<20:52, 14.27it/s]

{'loss': 5.8269, 'learning_rate': 4.181133311496889e-05, 'epoch': 0.16}


 19%|█▊        | 4003/21371 [04:32<18:03, 16.02it/s]

{'loss': 5.6643, 'learning_rate': 4.064152355996444e-05, 'epoch': 0.19}


 21%|██        | 4501/21371 [05:02<16:59, 16.55it/s]

{'loss': 5.4834, 'learning_rate': 3.947171400495999e-05, 'epoch': 0.21}


 23%|██▎       | 5003/21371 [05:35<17:06, 15.95it/s]

{'loss': 5.4193, 'learning_rate': 3.830190444995555e-05, 'epoch': 0.23}


 26%|██▌       | 5503/21371 [06:06<16:00, 16.53it/s]

{'loss': 5.219, 'learning_rate': 3.7132094894951105e-05, 'epoch': 0.26}


 28%|██▊       | 6003/21371 [06:38<15:36, 16.40it/s]

{'loss': 5.1452, 'learning_rate': 3.596228533994666e-05, 'epoch': 0.28}


 30%|███       | 6503/21371 [07:09<15:44, 15.75it/s]

{'loss': 5.046, 'learning_rate': 3.4792475784942214e-05, 'epoch': 0.3}


 33%|███▎      | 7001/21371 [07:40<16:13, 14.76it/s]

{'loss': 4.9376, 'learning_rate': 3.3622666229937766e-05, 'epoch': 0.33}


 35%|███▌      | 7503/21371 [08:11<13:48, 16.73it/s]

{'loss': 4.9074, 'learning_rate': 3.2452856674933324e-05, 'epoch': 0.35}


 37%|███▋      | 8003/21371 [08:42<13:41, 16.26it/s]

{'loss': 4.8558, 'learning_rate': 3.1283047119928875e-05, 'epoch': 0.37}


 40%|███▉      | 8503/21371 [09:13<13:08, 16.33it/s]

{'loss': 4.6967, 'learning_rate': 3.011323756492443e-05, 'epoch': 0.4}


 42%|████▏     | 9003/21371 [09:44<12:24, 16.62it/s]

{'loss': 4.7142, 'learning_rate': 2.8943428009919987e-05, 'epoch': 0.42}


 44%|████▍     | 9503/21371 [10:17<14:25, 13.71it/s]

{'loss': 4.6807, 'learning_rate': 2.7773618454915538e-05, 'epoch': 0.44}


 47%|████▋     | 10000/21371 [10:49<13:34, 13.95it/s]Saving model checkpoint to models/kantaiBERT/checkpoint-10000
Configuration saved in models/kantaiBERT/checkpoint-10000\config.json


{'loss': 4.605, 'learning_rate': 2.66038088999111e-05, 'epoch': 0.47}


Model weights saved in models/kantaiBERT/checkpoint-10000\pytorch_model.bin
 49%|████▉     | 10503/21371 [11:28<11:53, 15.23it/s]  

{'loss': 4.5966, 'learning_rate': 2.543399934490665e-05, 'epoch': 0.49}


 51%|█████▏    | 11003/21371 [12:01<10:52, 15.89it/s]

{'loss': 4.5584, 'learning_rate': 2.4264189789902205e-05, 'epoch': 0.51}


 54%|█████▍    | 11503/21371 [12:32<10:43, 15.33it/s]

{'loss': 4.5885, 'learning_rate': 2.309438023489776e-05, 'epoch': 0.54}


 56%|█████▌    | 12003/21371 [13:02<09:17, 16.81it/s]

{'loss': 4.4376, 'learning_rate': 2.1924570679893314e-05, 'epoch': 0.56}


 59%|█████▊    | 12503/21371 [13:33<09:10, 16.11it/s]

{'loss': 4.4295, 'learning_rate': 2.075476112488887e-05, 'epoch': 0.58}


 61%|██████    | 13003/21371 [14:04<08:48, 15.84it/s]

{'loss': 4.4385, 'learning_rate': 1.9584951569884423e-05, 'epoch': 0.61}


 63%|██████▎   | 13503/21371 [14:35<08:04, 16.26it/s]

{'loss': 4.3853, 'learning_rate': 1.841514201487998e-05, 'epoch': 0.63}


 66%|██████▌   | 14003/21371 [15:05<07:36, 16.14it/s]

{'loss': 4.2875, 'learning_rate': 1.7245332459875532e-05, 'epoch': 0.66}


 68%|██████▊   | 14501/21371 [15:36<07:31, 15.22it/s]

{'loss': 4.2697, 'learning_rate': 1.6075522904871087e-05, 'epoch': 0.68}


 70%|███████   | 15003/21371 [16:08<06:52, 15.45it/s]

{'loss': 4.357, 'learning_rate': 1.4905713349866643e-05, 'epoch': 0.7}


 73%|███████▎  | 15503/21371 [16:41<05:54, 16.54it/s]

{'loss': 4.2639, 'learning_rate': 1.3735903794862197e-05, 'epoch': 0.73}


 75%|███████▍  | 16003/21371 [17:12<05:35, 16.02it/s]

{'loss': 4.221, 'learning_rate': 1.256609423985775e-05, 'epoch': 0.75}


 77%|███████▋  | 16503/21371 [17:42<05:01, 16.16it/s]

{'loss': 4.2503, 'learning_rate': 1.1396284684853306e-05, 'epoch': 0.77}


 80%|███████▉  | 17003/21371 [18:13<04:32, 16.00it/s]

{'loss': 4.1409, 'learning_rate': 1.0226475129848861e-05, 'epoch': 0.8}


 82%|████████▏ | 17501/21371 [18:46<04:40, 13.78it/s]

{'loss': 4.2199, 'learning_rate': 9.056665574844416e-06, 'epoch': 0.82}


 84%|████████▍ | 18003/21371 [19:21<04:06, 13.69it/s]

{'loss': 4.1898, 'learning_rate': 7.88685601983997e-06, 'epoch': 0.84}


 87%|████████▋ | 18503/21371 [19:57<03:27, 13.84it/s]

{'loss': 4.1788, 'learning_rate': 6.7170464648355245e-06, 'epoch': 0.87}


 89%|████████▉ | 19003/21371 [20:33<02:52, 13.72it/s]

{'loss': 4.189, 'learning_rate': 5.54723690983108e-06, 'epoch': 0.89}


 91%|█████████▏| 19503/21371 [21:07<01:54, 16.25it/s]

{'loss': 4.1341, 'learning_rate': 4.377427354826634e-06, 'epoch': 0.91}


 94%|█████████▎| 20000/21371 [21:38<01:26, 15.90it/s]Saving model checkpoint to models/kantaiBERT/checkpoint-20000
Configuration saved in models/kantaiBERT/checkpoint-20000\config.json


{'loss': 4.1325, 'learning_rate': 3.2076177998221893e-06, 'epoch': 0.94}


Model weights saved in models/kantaiBERT/checkpoint-20000\pytorch_model.bin
 96%|█████████▌| 20501/21371 [22:14<00:56, 15.42it/s]

{'loss': 4.083, 'learning_rate': 2.037808244817744e-06, 'epoch': 0.96}


 98%|█████████▊| 21003/21371 [22:46<00:23, 15.64it/s]

{'loss': 4.075, 'learning_rate': 8.679986898132984e-07, 'epoch': 0.98}


100%|██████████| 21371/21371 [23:09<00:00, 16.26it/s]

Training completed. Do not forget to share your model on huggingface.co/models =)


100%|██████████| 21371/21371 [23:09<00:00, 15.38it/s]

{'train_runtime': 1389.9396, 'train_samples_per_second': 123.001, 'train_steps_per_second': 15.375, 'train_loss': 4.833053845191425, 'epoch': 1.0}





TrainOutput(global_step=21371, training_loss=4.833053845191425, metrics={'train_runtime': 1389.9396, 'train_samples_per_second': 123.001, 'train_steps_per_second': 15.375, 'train_loss': 4.833053845191425, 'epoch': 1.0})

Once the training is completed, we can save the model to the disk.

In [27]:
trainer.save_model('models/kantaiBERT/')

Saving model checkpoint to models/kantaiBERT/
Configuration saved in models/kantaiBERT/config.json
Model weights saved in models/kantaiBERT/pytorch_model.bin


#### 5. Evaluating the model



Now we can use our trained model for some basic inferencing tasks. For that we will use the trained model using transformer pipeline.

In [1]:
from cmath import pi
from transformers import pipeline

fill_mask = pipeline('fill-mask', 
                     model='models/kantaiBERT/', 
                     tokenizer='models/kantaiBERT/')

Once we get the relevant pipeline we can use it for our usecase like below.

In [7]:
fill_mask("Our thinking is based on our <mask>.")

[{'score': 0.021723872050642967,
  'token': 605,
  'token_str': ' conceptions',
  'sequence': 'Our thinking is based on our conceptions.'},
 {'score': 0.0203399658203125,
  'token': 418,
  'token_str': ' conception',
  'sequence': 'Our thinking is based on our conception.'},
 {'score': 0.018971821293234825,
  'token': 670,
  'token_str': ' principles',
  'sequence': 'Our thinking is based on our principles.'},
 {'score': 0.013673394918441772,
  'token': 600,
  'token_str': ' understanding',
  'sequence': 'Our thinking is based on our understanding.'},
 {'score': 0.013549803756177425,
  'token': 604,
  'token_str': ' existence',
  'sequence': 'Our thinking is based on our existence.'}]

The output is really interesting as you can see. It should be noted that this is trained only on very small set of data compared to what original transformers were trained. But still it yields a great result.

RoBERTa models are intersting as they provide more improved method for pretraining a BERT model. Read the techniques!