<h1 style="text-align:center">Pre-Training BERT from scratch</h1>

<h3 style="text-align:center">This notebook is part of multilingual codemixed NLP research I did at SSN</h3>
<p style="text-align:center">Since there is not much documentation on pre-training BERT, I decided to put it all in one place.</p>

BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art NLP neural network developed by Google and is even used for Google searches. Recurrent neural networks (RNNs) were the standard in NLP before transformers, but they struggle to remember long-term dependencies: in layman's terms, a word occurring early in a sentence loses its connection to a word occurring much later. One answer to this was the ELMo architecture, which runs two separate LSTMs, one from the left and one from the right, and shallowly concatenates their outputs (something similar can be done with the Keras `Bidirectional` wrapper).

BERT, on the other hand, computes the dependency of each word on every other word in the sentence by performing "self-attention". The attention mechanism makes transformer models deeply bidirectional, because the network can capture dependencies between words that are far apart in a sentence.
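To make the idea of self-attention concrete, here is a minimal sketch of scaled dot-product attention for a single head, written in plain Python. It is an illustration only: the function names and the toy vectors are made up, and a real BERT layer first projects the inputs through learned query, key and value matrices and uses several heads.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    Each argument is a list of per-token vectors (lists of floats).
    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j.
    """
    d_k = len(keys[0])
    weights = []
    for q in queries:
        # similarity of this token's query with every token's key
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights.append(softmax(scores))
    # each output vector is a weighted average of all value vectors
    outputs = [
        [sum(w * v[dim] for w, v in zip(row, values)) for dim in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights

# Toy example: 3 tokens with 2-dimensional vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)  # a real model would use learned projections of x
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of `weights` sums to 1 and covers all tokens in the sentence, which is why each word can draw on context from both its left and its right in a single step.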
    
BERT requires a pre-training task to capture dependencies between words (in other words, to learn the basics of the language) based on their position or occurrence in a sentence. Transformer models are pre-trained on huge datasets to build language understanding and are then fine-tuned on a downstream task such as classification or part-of-speech tagging.
When working with new or different types of data, such as multilingual or code-mixed text, we may need to train a BERT model from scratch. We'll see how to do that in this notebook.

### Training tokenizer for BERT

In [None]:
# install required dependencies for transformers and dataset
!pip install transformers
!pip install datasets

In [None]:
# The dataset used in this notebook is part of a CodaLab competition and can be found
# here: https://competitions.codalab.org/competitions/31146
# for creating the BERT vocabulary
from tokenizers import BertWordPieceTokenizer

# Initialize an empty BERT tokenizer
tokenizer = BertWordPieceTokenizer(
  clean_text=False,
  handle_chinese_chars=False,
  strip_accents=False,
  lowercase=False,
)

# prepare text files to train vocab on them
files = ['OffensiveLanguage/Task1/tamil_offensive.txt','OffensiveLanguage/Task1/tamil_offensive1.txt']

# train BERT tokenizer
tokenizer.train(
  files,
  vocab_size=30000,
  min_frequency=2,
  show_progress=True,
  special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'],
  limit_alphabet=1000,
  wordpieces_prefix="##"
)
# save the vocab
tokenizer.save('OffensiveLanguage/Task1/tamil_offensive_bert.json', pretty=True)


### Loading the Tokenizer

In [None]:
import json

# load the saved tokenizer file and pull out its vocabulary
with open("tamil_offensive_bert.json", encoding="utf-8") as f:
    k = json.load(f)
d = k["model"]["vocab"]  # dictionary mapping each token to its id

In [None]:
from tokenizers.implementations import BertWordPieceTokenizer
from tokenizers.processors import BertProcessing
# create a BERT tokenizer object and pass it the vocabulary we loaded above
tokenizer = BertWordPieceTokenizer(
    d,
    lowercase=False,
)

In [None]:
# set the tokenizer's post-processor to BERT's processor, which adds [CLS] and [SEP] around each sentence
# Note: BertProcessing takes 2 arguments: the separator token with its id from our vocab, and the cls token with its id
tokenizer._tokenizer.post_processor = BertProcessing(
    ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ("[CLS]", tokenizer.token_to_id("[CLS]")),
)
tokenizer.enable_truncation(max_length=128)

In [None]:
# tokenizer in action
# here we test our tokenizer on a sentence that mixes two languages, namely Tamil and English
tokenizer.encode("தலைவா STR இதுக்குதான் கதுருந்தோம் மாஸ் தலைவா").tokens

['[CLS]',
 'தலைவா',
 'STR',
 'இதுக்கு',
 '##தான்',
 'கத',
 '##ுர',
 '##ுந்த',
 '##ோம்',
 'மாஸ்',
 'தலைவா',
 '[SEP]']
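The `##` pieces in the output above come from WordPiece's way of splitting a word that is not in the vocabulary as a whole. A rough sketch of that lookup, assuming the usual greedy longest-match-first rule, is shown below. The function `wordpiece` and the three-entry `toy_vocab` are hypothetical and only cover splitting with an existing vocabulary, not how the vocabulary is learned during training.

```python
def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Split one word into sub-words by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary built from pieces seen in the output above (illustrative only).
toy_vocab = {"இதுக்கு", "##தான்", "தலைவா"}
print(wordpiece("இதுக்குதான்", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

The first call can be compared with the `'இதுக்கு', '##தான்'` pair in the tokenizer output, and the second shows the fallback to `[UNK]` when nothing in the vocabulary matches.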

In [2]:
# check for cuda availability
import torch
torch.cuda.is_available()

True

<h2 style="text-align:center">Creating model from a BERT configuration</h2>
<p>Below is the configuration table provided by Google for different BERT configurations</p>
We'll train models based on BERT-Tiny and BERT-Mini, as they train much faster and should be more than enough for this dataset.

|   |H=128|H=256|H=512|H=768|
|---|:---:|:---:|:---:|:---:|
| **L=2**  |**2/128 (BERT-Tiny)**|2/256|2/512|2/768|
| **L=4**  |4/128|**4/256 (BERT-Mini)**|**4/512 (BERT-Small)**|4/768|
| **L=6**  |6/128|6/256|6/512|6/768|
| **L=8**  |8/128|8/256|**8/512 (BERT-Medium)**|8/768|
| **L=10** |10/128|10/256|10/512|10/768|
| **L=12** |12/128|12/256|12/512|**12/768 (BERT-Base)**|


In [19]:
from transformers import BertConfig
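# max_position_embeddings is the longest sequence the model can handle, so we set it to our max sentence length
# (A in the BERT paper is the number of attention heads, not the sequence length)
# note: num_hidden_layers is not set below, so every config keeps the BertConfig default of 12 layers

#tiny_bert=BertConfig(hidden_size=128,num_attention_heads=2,max_position_embeddings=128) # perf 80.66%
#mini_bert=BertConfig(hidden_size=128,num_attention_heads=4,max_position_embeddings=128) # perf 80.38%
tiny_bert=BertConfig(hidden_size=256,num_attention_heads=2,max_position_embeddings=128) # perf 81.2%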

In [20]:
from transformers import BertTokenizerFast
tokenizer=BertTokenizerFast("tamil_offensive_bert.json",tokenizer_file="tamil_offensive_bert.json",do_lower_case=False)

## Create a model for BERT pre-training

In [21]:
# In the original paper, BERT models were pre-trained on 2 tasks, namely masked language modelling (MLM) and next sentence prediction
# here we'll train the model using masked language modelling only
from transformers import BertForMaskedLM #BertForPreTraining,BertForMaskedLM
model=BertForMaskedLM(config=tiny_bert)

In [22]:
# check the number of parameters in our model
model.num_parameters()

30028858
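As a sanity check, the parameter count above can be reproduced by hand. The sketch below assumes the standard BERT layout (embeddings, identical encoder layers, and an MLM head whose output matrix is shared with the word embeddings). The function name `bert_mlm_param_count` is made up for this illustration, and its defaults are meant to mirror the `BertConfig` defaults.

```python
def bert_mlm_param_count(vocab_size=30522, hidden_size=768, num_hidden_layers=12,
                         intermediate_size=3072, max_position_embeddings=512,
                         type_vocab_size=2):
    """Count trainable parameters of a BERT encoder with a masked-LM head."""
    h = hidden_size
    layer_norm = 2 * h  # one scale and one shift vector

    # word, position and segment embeddings, followed by a LayerNorm
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * h + layer_norm

    # query, key, value and output projections (weights + biases), then a LayerNorm
    attention = 4 * (h * h + h) + layer_norm

    # two dense layers (h -> intermediate -> h), then a LayerNorm
    feed_forward = (h * intermediate_size + intermediate_size) \
        + (intermediate_size * h + h) + layer_norm

    encoder = num_hidden_layers * (attention + feed_forward)

    # MLM head: dense + LayerNorm + output bias; the output weights are
    # assumed to be tied to the word embeddings, so they are not counted again
    mlm_head = (h * h + h) + layer_norm + vocab_size

    return embeddings + encoder + mlm_head

# The configuration used in this notebook: only hidden_size and
# max_position_embeddings differ from the defaults.
print(bert_mlm_param_count(hidden_size=256, max_position_embeddings=128))  # 30028858

# For comparison, a configuration shaped like BERT-Tiny from the table (L=2, H=128),
# assuming a feed-forward size of 4*H and the 30000-token vocabulary trained earlier.
print(bert_mlm_param_count(vocab_size=30000, hidden_size=128, num_hidden_layers=2,
                           intermediate_size=512, max_position_embeddings=128))
```

The number of attention heads does not appear in the formula because the heads only split the hidden dimension. The match with the reported count also indicates that the model keeps the default 12 layers and the default vocabulary size of 30522, as the saved `config.json` printed later in the notebook confirms.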

In [None]:
from transformers import LineByLineTextDataset
# creating input pipeline for our model
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="tamil_offensive.txt",
    block_size=128, # must not exceed max_position_embeddings in the BERT config
)

In [8]:
from transformers import DataCollatorForLanguageModeling
# Initialize the data collator, which batches the tokenized examples and randomly masks 15% of the tokens for the MLM task
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
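To show what `mlm_probability=0.15` means in practice, here is a small sketch of the masking rule described in the BERT paper: each non-special token is selected with probability 0.15, and a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged otherwise. The function `mask_tokens` and the token ids are hypothetical. The collator works on padded tensor batches rather than Python lists, so treat this as an approximation of its behaviour.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, rng=random):
    """Return (inputs, labels) for masked language modelling on one sequence."""
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        # special tokens are never masked; other tokens are picked at random
        if tok in special_ids or rng.random() >= mlm_probability:
            labels.append(-100)  # -100 marks positions the loss should ignore
            continue
        labels.append(tok)  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels

# Hypothetical ids: 2 = [CLS], 3 = [SEP], 4 = [MASK]; the rest are ordinary tokens.
sentence = [2, 157, 981, 45, 3302, 78, 1204, 3]
inputs, labels = mask_tokens(sentence, mask_id=4, vocab_size=30000,
                             special_ids={2, 3}, rng=random.Random(0))
print(inputs)
print(labels)
```

Only the selected positions contribute to the loss, which is part of why MLM pre-training needs many passes over a small corpus.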

In [24]:
from transformers import Trainer, TrainingArguments
# initialize the trainer with training arguments
training_args = TrainingArguments(
    output_dir="task1/tamilBERT",
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_gpu_train_batch_size=64, # deprecated name; per_device_train_batch_size is preferred (see the warning in the training log)
    save_steps=10_000,
    save_total_limit=3,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [27]:
trainer.train() #begin training

Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
***** Running training *****
  Num examples = 6534
  Num Epochs = 20
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 2060


Step,Training Loss
500,8.0973
1000,7.3541
1500,7.1739
2000,7.0837




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=2060, training_loss=7.4172653050098605, metrics={'train_runtime': 833.0329, 'train_samples_per_second': 156.873, 'train_steps_per_second': 2.473, 'total_flos': 2212364461953096.0, 'train_loss': 7.4172653050098605, 'epoch': 20.0})
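The "Total optimization steps" figure in the log follows directly from the dataset size, batch size and number of epochs. A quick check, assuming the last partial batch of each epoch is kept and gradient accumulation is 1 (as the log reports), with a helper name made up for this sketch:

```python
import math

def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer updates when the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# Values taken from the training log above.
print(total_optimization_steps(num_examples=6534, batch_size=64, num_epochs=20))  # 2060
```

With checkpoints saved every 10,000 steps (`save_steps=10_000`), a run of 2060 steps never writes an intermediate checkpoint, so the explicit `save_model` call is what persists the weights.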

In [28]:
trainer.save_model("task1/tamilBERT") # save the model for fine-tuning on a downstream task

Saving model checkpoint to task1/tamilBERT
Configuration saved in task1/tamilBERT/config.json
Model weights saved in task1/tamilBERT/pytorch_model.bin


In [None]:
from datasets import load_dataset
# 'test':['tam_offesive_withoutlabels_test.tsv']
# use the first 70% of train.csv for training and the remaining 30% for evaluation
train_dataset=load_dataset('csv',data_files={'train':['train.csv']},split='train[:70%]')
eval_dataset=load_dataset('csv',data_files={'train':['train.csv']},split='train[70%:]')

In [None]:
train_dataset[0]

{'id': 'tam1',
 'label': 0,
 'text': 'திருமலை நாயக்கர் பேரவை சார்பாக படம் வெற்றி பெற வாழ்த்துக்கள்'}

In [None]:
# tokenize the text column, padding or truncating every example to 22 tokens
def tokenize_dataset(data):
    return tokenizer(data["text"],padding="max_length",max_length=22,truncation=True)

train_dataset=train_dataset.map(tokenize_dataset,batched=True)
eval_dataset=eval_dataset.map(tokenize_dataset,batched=True)
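The `padding="max_length"` and `truncation=True` options make every example the same length. The sketch below shows roughly what that produces for one sentence. The function `encode_fixed_length` and all token ids are hypothetical, and the real tokenizer also returns `token_type_ids`.

```python
def encode_fixed_length(token_ids, max_length, cls_id, sep_id, pad_id):
    """Wrap a sequence in [CLS]/[SEP], then truncate or pad it to max_length."""
    body = list(token_ids)[: max_length - 2]  # leave room for [CLS] and [SEP]
    ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(ids)           # 1 = real token
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 = padding, ignored by attention
    }

# Hypothetical ids: 0 = [PAD], 2 = [CLS], 3 = [SEP].
short = encode_fixed_length([157, 981, 45], max_length=22, cls_id=2, sep_id=3, pad_id=0)
long_ = encode_fixed_length(list(range(100, 140)), max_length=22, cls_id=2, sep_id=3, pad_id=0)
print(short["input_ids"])
print(long_["input_ids"])
```

A small `max_length` such as 22 keeps batches cheap, at the cost of cutting off the end of longer comments.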

In [None]:
train_dataset[0]

### Fine-tune the previously trained MLM model

In [31]:
from transformers import AutoModelForSequenceClassification
model=AutoModelForSequenceClassification.from_pretrained("task1/tamilBERT",num_labels=2)

loading configuration file task1/tamilBERT/config.json
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 128,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.8.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file task1/tamilBERT/pytorch_model.bin
Some weights of the model checkpoint at task1/tamilBERT were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias',

In [32]:
train_args=TrainingArguments("test_trainer")

trainer=Trainer(model=model,
                args=train_args,
                train_dataset=train_dataset,
                eval_dataset=eval_dataset)
trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 4116
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1545


Step,Training Loss
500,0.4404
1000,0.3503
1500,0.2668


Saving model checkpoint to test_trainer/checkpoint-500
Configuration saved in test_trainer/checkpoint-500/config.json
Model weights saved in test_trainer/checkpoint-500/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-1000
Configuration saved in test_trainer/checkpoint-1000/config.json
Model weights saved in test_trainer/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-1500
Configuration saved in test_trainer/checkpoint-1500/config.json
Model weights saved in test_trainer/checkpoint-1500/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1545, training_loss=0.35003116076818175, metrics={'train_runtime': 73.0205, 'train_samples_per_second': 169.103, 'train_steps_per_second': 21.158, 'total_flos': 48895371046368.0, 'train_loss': 0.35003116076818175, 'epoch': 3.0})

In [17]:
# here we'll create a function that turns the logits returned by the evaluator into predictions and reports the accuracy of our trained model
import numpy as np
from datasets import load_metric
metric=load_metric("accuracy")
def compute_metrics(pred):
    logits,labels=pred
    model_preds=np.argmax(logits,axis=-1)
    return metric.compute(predictions=model_preds,references=labels)
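For readers who want to see what the metric does without the `datasets` dependency, the same computation can be written in plain Python. This is only an illustrative equivalent: the helper names and the toy logits are made up.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return {"accuracy": correct / len(labels)}

# Toy logits for 4 examples and 2 classes (made-up numbers).
toy_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 1.4], [-0.2, 0.9]]
toy_labels = [0, 1, 1, 1]
print(accuracy(toy_logits, toy_labels))  # {'accuracy': 0.75}
```

Accuracy alone can be misleading if the two classes are imbalanced, so it may be worth reporting a per-class metric next to it.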





In [33]:
# initialize the trainer with compute_metrics and evaluate
trainer=Trainer(model=model,
                args=train_args,
                train_dataset=train_dataset,
                eval_dataset=eval_dataset,
                compute_metrics=compute_metrics)
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 1764
  Batch size = 8


{'eval_accuracy': 0.8117913832199547,
 'eval_loss': 0.6345648169517517,
 'eval_runtime': 2.3954,
 'eval_samples_per_second': 736.404,
 'eval_steps_per_second': 92.259}

In [26]:
!rm -r test_trainer