**Initialization**
- I put these three lines at the top of each of my notebooks. The first two enable automatic reloading of imported modules, which prevents problems when re-running the same project, and the third renders plots inline in the notebook.

In [35]:
#@ INITIALIZATION: 
%reload_ext autoreload
%autoreload 2
%matplotlib inline

#@ IGNORING WARNINGS: 
import warnings
warnings.filterwarnings("ignore")

**Step 1: Loading Dataset**

In [4]:
#@ LOADING THE DATASET: UNCOMMENT BELOW: 
# !curl -L https://raw.githubusercontent.com/PacktPublishing/Transformers-for-Natural-Language-Processing/master/Chapter03/kant.txt --output "kant.txt"

**Step 2: Downloading Libraries and Dependencies**
- All the libraries and dependencies required for the project are installed in a single cell.

In [6]:
#@ IMPORTING MODULES: UNCOMMENT BELOW:
# !pip install transformers                               # Installing transformers. 
# !pip list | grep -E "transformers|tokenizers"           # Inspecting versions. 

**Step 3: Training a Tokenizer**
- We will train Hugging Face's `ByteLevelBPETokenizer` on `kant.txt`. A byte-level BPE tokenizer splits a string into sub-word units. Training starts from individual bytes and repeatedly merges the most frequent adjacent pairs into new tokens until the vocabulary reaches the requested size.
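
To make the merge idea concrete, here is a minimal pure-Python sketch of a single BPE merge step. It is not the `tokenizers` implementation: the toy corpus and the helper names `most_frequent_pair` and `merge_pair` are made up for illustration, and the real trainer works on bytes and repeats this step many times.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word (as a tuple of symbols) -> frequency.
toy_words = {tuple("reason"): 5, tuple("reasons"): 2, tuple("season"): 3}
best_pair, count = most_frequent_pair(toy_words)
toy_words_after = merge_pair(toy_words, best_pair)
print(best_pair, count)
print(toy_words_after)
```

Each merge adds one new symbol to the vocabulary, so `vocab_size` effectively bounds how many merges are learned.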

In [17]:
%%time
#@ TRAINING A TOKENIZER: 
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
# Note: the recursive glob picks up every .txt file, including merges.txt from an earlier run.
paths = [str(x) for x in Path(".").glob("**/*.txt")]                            # Collecting training files. 
tokenizer = ByteLevelBPETokenizer()                                             # Initializing a tokenizer. 
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, 
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])      # Training the tokenizer. 

CPU times: user 6.87 s, sys: 262 ms, total: 7.13 s
Wall time: 3.76 s


**Step 4: Saving Files**

In [18]:
#@ SAVING FILES:
import os 
token_dir = "KantaiBERT"                                    # Relative path; this is /content/KantaiBERT when the working directory is /content.
os.makedirs(token_dir, exist_ok=True)                       # Creating the directory if it is missing.
tokenizer.save_model(token_dir)                             # Saving the tokenizer (vocab.json and merges.txt).

['KantaiBERT/vocab.json', 'KantaiBERT/merges.txt']

**Step 5: Loading Trained Tokenizer**
- We will load our trained tokenizer files.

In [19]:
#@ LOADING TRAINED TOKENIZER: 
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer(
    "/content/KantaiBERT/vocab.json",
    "/content/KantaiBERT/merges.txt"
)                                                                       # Initializing trained tokenizer.
tokenizer.encode("The Critique of Pure Reason.").tokens                 # Inspecting the tokens.

['The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.']
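
The `Ġ` prefix in the output above marks a leading space. As far as I know, byte-level BPE tokenizers reuse the byte-to-character table published with GPT-2, which shifts non-printable bytes into a printable Unicode range. The sketch below rebuilds that table in plain Python; the helper `visible` is a hypothetical name used only for this illustration.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (space, control characters, ...) are moved above code point 255.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

BYTE_TO_CHAR = bytes_to_unicode()

def visible(text):
    """Hypothetical helper: show text the way a byte-level vocabulary stores it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

print(visible(" Critique"))
print(visible(" of"))
```

Because every byte has a visible stand-in, any input string can be tokenized without falling back to an unknown token.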

In [20]:
#@ IMPLEMENTATION OF TRAINED TOKENIZER:
tokenizer.encode("The Critique of Pure Reason.")                        # Inspecting the Encoding object. 

Encoding(num_tokens=6, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [22]:
#@ PROCESSING THE TOKENS:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),                            # Separator (end) token comes first.
    ("<s>", tokenizer.token_to_id("<s>")),                              # Classifier (start) token comes second.
)
tokenizer.enable_truncation(max_length=512)                             # Truncating sequences to 512 tokens. 
tokenizer.encode("The Critique of Pure Reason.").tokens                 # Inspecting tokens with <s> and </s> added. 

['<s>', 'The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.', '</s>']
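
The effect of the post-processor plus truncation can be mimicked with a few lines of plain Python. This is only a sketch under the assumption that the 512-token limit includes the two special tokens; `add_special_tokens` is a hypothetical helper, not a `tokenizers` function.

```python
def add_special_tokens(tokens, max_length=512, bos="<s>", eos="</s>"):
    """Wrap a token list in start/end markers, keeping the total within max_length."""
    # Assumption: two slots are reserved for the special tokens before truncating.
    body = tokens[: max_length - 2]
    return [bos] + body + [eos]

tokens = ["The", "ĠCritique", "Ġof", "ĠPure", "ĠReason", "."]
print(add_special_tokens(tokens))

# A sequence that is too long is cut so that the wrapped result still fits.
long_tokens = ["x"] * 1000
print(len(add_special_tokens(long_tokens)))
```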

**Step 6: Checking Resource Constraints**

In [24]:
#@ RESOURCE CONSTRAINTS: 
# !nvidia-smi

In [25]:
#@ RESOURCE CONSTRAINTS: 
import torch
torch.cuda.is_available()

True

**Step 7: Model Configuration**
- We will be pretraining a **RoBERTa** type transformer model using the same number of layers (6) and attention heads (12) as a **DistilBERT** transformer.

In [26]:
#@ DEFINING MODEL CONFIGURATIONS: 
from transformers import RobertaConfig

config = RobertaConfig(vocab_size=52_000, 
                       max_position_embeddings=514,
                       num_attention_heads=12,
                       num_hidden_layers=6,
                       type_vocab_size=1)                      # Initializing model configurations. 
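
One detail of this configuration that is easy to miss is why `max_position_embeddings` is 514 rather than 512. To my understanding, RoBERTa numbers positions starting just after the padding index (1), so a 512-token sequence uses position ids 2 through 513. The sketch below reproduces that numbering in plain Python; `roberta_position_ids` is a hypothetical name, and the hidden size of 768 is the library default that also appears in the printed model.

```python
def roberta_position_ids(input_ids, padding_idx=1):
    """Number non-padding tokens from padding_idx + 1; padding keeps padding_idx."""
    position_ids, seen = [], 0
    for token_id in input_ids:
        if token_id == padding_idx:
            position_ids.append(padding_idx)
        else:
            seen += 1
            position_ids.append(seen + padding_idx)
    return position_ids

# A short sequence followed by two padding tokens (token ids are made up).
print(roberta_position_ids([0, 394, 535, 2, 1, 1]))

# A full 512-token sequence needs position ids up to 513, hence 514 embedding rows.
full_sequence = [5] * 512
rows_needed = max(roberta_position_ids(full_sequence)) + 1
print(rows_needed)

# Each of the 12 heads attends over an equal slice of the 768-dimensional hidden state.
head_dim = 768 // 12
print(head_dim)
```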

**Step 8: Loading Tokenizer in Transformers**

In [27]:
#@ LOADING TOKENIZER IN TRANSFORMERS:
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("./KantaiBERT", model_max_length=512)     # Loading the trained tokenizer (model_max_length sets the length limit). 

**Step 9: Initializing Model**

In [28]:
#@ INITIALIZING MODEL:
from transformers import RobertaForMaskedLM
model = RobertaForMaskedLM(config=config)                                        # Initializing model. 
print(model)                                                                     # Inspecting model.

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(52000, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNor

In [29]:
#@ EXPLORING NUMBER OF PARAMETERS:
print(model.num_parameters())

83504416
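
The total above can be cross-checked by hand from the configuration. The following sketch adds up the embedding tables, the six encoder layers, and the language-modeling head, assuming the output projection shares its weights with the word embeddings (so only its bias is counted). The function name is hypothetical.

```python
def roberta_mlm_param_count(vocab=52_000, positions=514, types=1,
                            hidden=768, intermediate=3072, layers=6):
    """Closed-form parameter count for a RoBERTa masked-LM with tied output weights."""
    def linear(n_in, n_out):
        return n_in * n_out + n_out            # Weight matrix plus bias.

    layer_norm = 2 * hidden                    # Scale and shift vectors.
    embeddings = (vocab + positions + types) * hidden + layer_norm
    per_layer = (
        4 * linear(hidden, hidden)             # Query, key, value, attention output.
        + layer_norm
        + linear(hidden, intermediate)         # Feed-forward expansion.
        + linear(intermediate, hidden)         # Feed-forward projection.
        + layer_norm
    )
    lm_head = linear(hidden, hidden) + layer_norm + vocab   # Dense, LayerNorm, output bias.
    return embeddings + layers * per_layer + lm_head

total = roberta_mlm_param_count()
print(total)
```

More than 47% of the parameters sit in the 52,000 x 768 word-embedding table, which is why the vocabulary size matters so much for a small model like this one.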


In [30]:
#@ EXPLORING PARAMETERS:
LP = list(model.parameters())           # Collecting parameter tensors.
lp = len(LP)                            # Number of parameter tensors.
print(lp)                               # Inspection. 

106
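
The figure 106 is the number of parameter tensors, not the number of weights. A rough way to see where it comes from is to list the tensors per block. The names below are simplified and illustrative, not the exact names reported by `model.named_parameters()`.

```python
# Embedding block: three lookup tables plus one LayerNorm (weight and bias).
embedding_tensors = [
    "word_embeddings.weight", "position_embeddings.weight",
    "token_type_embeddings.weight", "LayerNorm.weight", "LayerNorm.bias",
]

# Each encoder layer has eight sub-modules, each with a weight and a bias.
per_layer_modules = [
    "attention.self.query", "attention.self.key", "attention.self.value",
    "attention.output.dense", "attention.output.LayerNorm",
    "intermediate.dense", "output.dense", "output.LayerNorm",
]
layer_tensors = [
    f"layer.{i}.{module}.{kind}"
    for i in range(6)
    for module in per_layer_modules
    for kind in ("weight", "bias")
]

# LM head: the decoder weight is shared with the word embeddings, so it is not listed again.
lm_head_tensors = [
    "lm_head.bias", "lm_head.dense.weight", "lm_head.dense.bias",
    "lm_head.layer_norm.weight", "lm_head.layer_norm.bias",
]

tensor_count = len(embedding_tensors) + len(layer_tensors) + len(lm_head_tensors)
print(tensor_count)
```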


In [31]:
#@ EXPLORING PARAMETERS:
for p in range(0, lp):
    print(LP[p])                        # Inspection. 

Parameter containing:
tensor([[ 0.0140,  0.0033,  0.0230,  ..., -0.0140,  0.0324,  0.0189],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0149,  0.0065, -0.0136,  ..., -0.0156, -0.0033, -0.0037],
        ...,
        [-0.0081,  0.0404, -0.0095,  ..., -0.0051,  0.0008,  0.0136],
        [-0.0104,  0.0105, -0.0119,  ..., -0.0085, -0.0063,  0.0025],
        [ 0.0049,  0.0075,  0.0276,  ..., -0.0164,  0.0030, -0.0439]],
       requires_grad=True)
Parameter containing:
tensor([[-9.4639e-03,  3.0284e-02,  2.8622e-02,  ..., -2.9630e-02,
         -2.3601e-02, -1.0316e-02],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [ 6.0179e-04, -1.9458e-02, -3.9265e-03,  ..., -4.9365e-02,
         -2.0553e-02, -1.8126e-02],
        ...,
        [ 2.7487e-02,  3.0826e-02,  5.7874e-03,  ...,  4.2850e-04,
          2.1950e-03, -1.2396e-02],
        [ 2.8772e-02, -4.4883e-02, -9.7859e-03,  ...,  6.3186e-04,
   

In [32]:
#@ COUNTING PARAMETERS:
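total_params = 0                        # Avoiding the name `np`, which usually refers to NumPy.
for p in range(lp):
    L1 = len(LP[p])                     # Size of the first dimension.
    try:
        L2 = len(LP[p][0])              # Size of the second dimension for 2-D weights.
        is_2d = True
    except TypeError:                   # 1-D tensors (biases, LayerNorm) have scalar rows.
        L2 = 1
        is_2d = False
    L3 = L1 * L2                        # Number of elements in this tensor.
    total_params += L3
    if is_2d:
        print(p, L1, L2, L3)
    else:
        print(p, L1, L3)
print(total_params)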

0 52000 768 39936000
1 514 768 394752
2 1 768 768
3 768 768
4 768 768
5 768 768 589824
6 768 768
7 768 768 589824
8 768 768
9 768 768 589824
10 768 768
11 768 768 589824
12 768 768
13 768 768
14 768 768
15 3072 768 2359296
16 3072 3072
17 768 3072 2359296
18 768 768
19 768 768
20 768 768
21 768 768 589824
22 768 768
23 768 768 589824
24 768 768
25 768 768 589824
26 768 768
27 768 768 589824
28 768 768
29 768 768
30 768 768
31 3072 768 2359296
32 3072 3072
33 768 3072 2359296
34 768 768
35 768 768
36 768 768
37 768 768 589824
38 768 768
39 768 768 589824
40 768 768
41 768 768 589824
42 768 768
43 768 768 589824
44 768 768
45 768 768
46 768 768
47 3072 768 2359296
48 3072 3072
49 768 3072 2359296
50 768 768
51 768 768
52 768 768
53 768 768 589824
54 768 768
55 768 768 589824
56 768 768
57 768 768 589824
58 768 768
59 768 768 589824
60 768 768
61 768 768
62 768 768
63 3072 768 2359296
64 3072 3072
65 768 3072 2359296
66 768 768
67 768 768
68 768 768
69 768 768 589824
70 768 768
71 768 768

**Step 10: Building Dataset**

In [36]:
%%time
#@ BUILDING THE DATASET:
from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="./kant.txt",
                                block_size=128)                                 # Preparing dataset. 

CPU times: user 35 s, sys: 987 ms, total: 36 s
Wall time: 35.6 s
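
Conceptually, a line-by-line dataset treats every non-empty line of the file as one training example and cuts it to `block_size` tokens. The sketch below shows that idea with a whitespace split standing in for the trained tokenizer; `line_by_line_examples` is a hypothetical helper, and the real class also adds special tokens and returns token ids.

```python
def line_by_line_examples(text, tokenize, block_size=128):
    """Turn each non-blank line into one example of at most block_size tokens."""
    lines = [line for line in text.splitlines() if line and not line.isspace()]
    return [tokenize(line)[:block_size] for line in lines]

# Made-up sample text with an empty line and a whitespace-only line.
sample_text = "Reason is the faculty of principles.\n\n   \nExperience is empirical cognition."
examples = line_by_line_examples(sample_text, tokenize=str.split, block_size=4)
print(examples)
```

One consequence of this scheme is that the number of examples equals the number of non-blank lines, which is where the 170,964 examples reported during training come from.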


**Step 11: Data Collator**
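- A data collator takes samples from the dataset and collates them into batches. With `mlm=True` and `mlm_probability=0.15`, it also selects about 15% of the tokens in each batch as masked-language-modeling targets.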

In [37]:
#@ DEFINING DATA COLLATOR:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, 
                                                mlm_probability=0.15)           # Initializing data collator. 
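
For intuition, the masking rule can be sketched in plain Python. To my understanding, the collator follows the BERT recipe: of the selected tokens, roughly 80% become `<mask>`, 10% become a random token, and 10% stay unchanged, while unselected positions get the label -100 so the loss ignores them. `mask_tokens` is a hypothetical helper, and the mask id of 4 is an assumption based on the order of the special tokens passed to the tokenizer trainer.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling on one sequence."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)           # -100 marks positions the loss should skip.
    for i, token_id in enumerate(token_ids):
        if token_id in special_ids or rng.random() >= mlm_probability:
            continue                           # Not selected for prediction.
        labels[i] = token_id                   # The model must recover the original token.
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                # Replace with <mask>.
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)   # Replace with a random token.
        # Otherwise leave the token unchanged.
    return inputs, labels

# Made-up token ids wrapped in <s> (0) and </s> (2).
sequence = [0] + list(range(100, 140)) + [2]
masked_inputs, mlm_labels = mask_tokens(
    sequence, mask_id=4, vocab_size=52_000, special_ids={0, 1, 2}, rng=random.Random(0)
)
print(masked_inputs)
print(mlm_labels)
```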

**Step 12: Initializing Trainer**

In [38]:
#@ INITIALIZING TRAINER:
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(output_dir="./KantaiBERT", 
                                  overwrite_output_dir=True, 
                                  num_train_epochs=1,
                                  per_device_train_batch_size=64,
                                  save_steps=10_000,
                                  save_total_limit=2)                           # Initializing training arguments. 
trainer = Trainer(model=model, args=training_args, 
                  data_collator=data_collator, train_dataset=dataset)           # Initializing trainer. 

**Step 13: Pretraining Model**

In [39]:
%%time
#@ PRETRAINING THE MODEL:
trainer.train()

***** Running training *****
  Num examples = 170964
  Num Epochs = 1
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 2672


| Step | Training Loss |
|------|---------------|
| 500  | 6.5902 |
| 1000 | 5.7344 |
| 1500 | 5.2794 |
| 2000 | 5.0218 |
| 2500 | 4.8702 |




Training completed. Do not forget to share your model on huggingface.co/models =)




CPU times: user 18min 43s, sys: 5.62 s, total: 18min 49s
Wall time: 18min 59s


TrainOutput(global_step=2672, training_loss=5.455814338729767, metrics={'train_runtime': 1139.4871, 'train_samples_per_second': 150.036, 'train_steps_per_second': 2.345, 'total_flos': 873620128952064.0, 'train_loss': 5.455814338729767, 'epoch': 1.0})
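
The numbers in the training log are consistent with each other, which is a quick sanity check worth doing. The sketch below recomputes the step count and throughput from the reported example count, batch size, and runtime, assuming a single device and no gradient accumulation as the log indicates.

```python
import math

num_examples = 170_964          # "Num examples" from the log.
batch_size = 64                 # per_device_train_batch_size on one device.
train_runtime_s = 1139.4871     # "train_runtime" from TrainOutput.

# The last batch is partial, so the step count is rounded up.
steps_per_epoch = math.ceil(num_examples / batch_size)
samples_per_second = num_examples / train_runtime_s
steps_per_second = steps_per_epoch / train_runtime_s

print(steps_per_epoch)
print(round(samples_per_second, 3))
print(round(steps_per_second, 3))
```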

**Step 14: Saving Model**

In [40]:
#@ SAVING THE FINAL MODEL:
trainer.save_model("./KantaiBERT")

Saving model checkpoint to ./KantaiBERT
Configuration saved in ./KantaiBERT/config.json
Model weights saved in ./KantaiBERT/pytorch_model.bin


**Step 15: Language Modeling**

In [41]:
#@ LANGUAGE MODELING WITH FILL MASK PIPELINE:
from transformers import pipeline
fill_mask = pipeline("fill-mask", model="./KantaiBERT", tokenizer="./KantaiBERT")   # Initializing fill mask pipeline.
fill_mask("Human thinking involves human <mask>.")

loading configuration file ./KantaiBERT/config.json
Model config RobertaConfig {
  "_name_or_path": "./KantaiBERT",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.18.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}

[{'score': 0.04941876232624054,
  'sequence': 'Human thinking involves human reason.',
  'token': 394,
  'token_str': ' reason'},
 {'score': 0.018178889527916908,
  'sequence': 'Human thinking involves human experience.',
  'token': 535,
  'token_str': ' experience'},
 {'score': 0.012516315095126629,
  'sequence': 'Human thinking involves human conceptions.',
  'token': 610,
  'token_str': ' conceptions'},
 {'score': 0.011297235265374184,
  'sequence': 'Human thinking involves human law.',
  'token': 448,
  'token_str': ' law'},
 {'score': 0.009799298830330372,
  'sequence': 'Human thinking involves human object.',
  'token': 396,
  'token_str': ' object'}]
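
The `score` values are probabilities over the whole vocabulary, so the five candidates shown cover only a small share of the probability mass. The sketch below sums the reported scores and shows, with made-up logits, how a softmax followed by a top-k selection produces a ranking of this kind. The helpers are illustrative and not the pipeline's internals.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return the k (index, probability) pairs with the highest probability."""
    return sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:k]

# Scores reported by the fill-mask pipeline above.
reported_scores = [0.04941876232624054, 0.018178889527916908, 0.012516315095126629,
                   0.011297235265374184, 0.009799298830330372]
covered = sum(reported_scores)
print(round(covered, 4))

# Hypothetical logits for a six-token vocabulary.
toy_probs = softmax([2.0, 0.5, 1.0, -1.0, 0.0, 1.5])
print(top_k(toy_probs, 3))
```

The top five candidates together account for only about 10% of the probability, which fits a model trained for a single epoch.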