<a href="https://colab.research.google.com/github/Sihan-A/transformers_NLP/blob/main/03_RoBERTa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 3 RoBERTa

1. RoBERTa- and DistilBERT-like models
1. How to train a tokenizer from scratch
1. Byte-level byte-pair encoding
1. Saving the trained tokenizer to files
1. Recreating the tokenizer for the pretraining process
1. Initializing a RoBERTa model from scratch
1. Exploring the configuration of the model
1. Exploring the model's roughly 84 million parameters
1. Building the dataset for the trainer
1. Initializing the trainer
1. Pretraining the model
1. Saving the model
1. Applying the model to the downstream task of masked language modeling

## Step 1: Loading the dataset

In [None]:
!curl -L https://raw.githubusercontent.com/PacktPublishing/Transformers-for-Natural-Language-Processing/master/Chapter03/kant.txt --output "kant.txt"

## Step 2: Installing Hugging Face transformers

In [2]:
!pip uninstall -y tensorflow
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'

Found existing installation: tensorflow 2.6.0
Uninstalling tensorflow-2.6.0:
  Successfully uninstalled tensorflow-2.6.0
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-j33qp6g9
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-j33qp6g9
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
     |████████████████████████████████| 3.3 MB 6.1 MB/s
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
     |████████████████████████████████| 636 kB 66.6 MB/s
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)

## Step 3: Training a tokenizer

In [3]:
%%time
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
paths = [str(x) for x in Path(".").glob("**/*.txt")]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

CPU times: user 5.22 s, sys: 187 ms, total: 5.41 s
Wall time: 2.81 s
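The `ByteLevelBPETokenizer` trains on bytes and is implemented in Rust, so the cell above hides the merge loop itself. The following is only a toy, character-level sketch of how byte-pair encoding learns merges. The names `most_frequent_pair`, `merge_pair`, `train_bpe` and the four-word corpus are hypothetical and are not part of the `tokenizers` API.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return ((left, right), count) for the most frequent adjacent pair, or None."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0] if pairs else None


def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus_words, num_merges, min_frequency=2):
    """Learn up to `num_merges` merges, stopping when pairs get rarer than `min_frequency`."""
    words = dict(Counter(tuple(word) for word in corpus_words))
    merges = []
    for _ in range(num_merges):
        best = most_frequent_pair(words)
        if best is None or best[1] < min_frequency:
            break
        pair, _count = best
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Made-up corpus, for illustration only.
toy_corpus = ["reason", "reason", "reasons", "pure"]
merges, words = train_bpe(toy_corpus, num_merges=10)
print(merges)
print(words)
```

In this sketch `min_frequency` plays the same role as the `min_frequency=2` argument passed to `tokenizer.train`: pairs seen fewer times than that are never merged.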


## Step 4: Saving the files to disk

In [5]:
import os
# Use one relative path for both creating the directory and saving,
# so the files land in the directory that was just created.
token_dir = "KantaiBERT"
os.makedirs(token_dir, exist_ok=True)
tokenizer.save_model(token_dir)

['KantaiBERT/vocab.json', 'KantaiBERT/merges.txt']

## Step 5: Loading the trained tokenizer files

In [6]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer("./KantaiBERT/vocab.json",
                                  "./KantaiBERT/merges.txt")

In [8]:
tokenizer.encode("The Critique of Pure Reason.").tokens

['The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.']
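The `Ġ` at the start of most tokens stands for a leading space. Byte-level BPE first maps every byte to a printable character, and the space byte ends up as `Ġ`. The sketch below re-implements the GPT-2 style byte-to-character table in plain Python, assuming the tokenizer follows that scheme. `bytes_to_unicode` and `to_byte_level` are illustrative names, not functions exported by `tokenizers`.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable character."""
    # Bytes that are already printable, non-space characters keep their code point:
    # '!'..'~' (0x21-0x7E), 0xA1-0xAC and 0xAE-0xFF.
    keep = set(range(0x21, 0x7F)) | set(range(0xA1, 0xAD)) | set(range(0xAE, 0x100))
    mapping = {}
    shift = 0
    for byte in range(256):
        if byte in keep:
            mapping[byte] = chr(byte)
        else:
            # Control bytes, the space and a few others are moved above U+00FF.
            mapping[byte] = chr(256 + shift)
            shift += 1
    return mapping


byte_map = bytes_to_unicode()


def to_byte_level(text):
    """Encode text as UTF-8 and render each byte with its printable stand-in."""
    return "".join(byte_map[b] for b in text.encode("utf-8"))


print(to_byte_level(" Critique"))  # ĠCritique
```

Because the table covers all 256 byte values, any input string can be represented without falling back to `<unk>`.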

In [9]:
tokenizer.encode("The Critique of Pure Reason.")

Encoding(num_tokens=6, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [11]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

In [12]:
tokenizer.encode("The Critique of Pure Reason.")

Encoding(num_tokens=8, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [13]:
tokenizer.encode("The Critique of Pure Reason.").tokens

['<s>', 'The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.', '</s>']

## Step 6: Checking resource constraints: GPU and CUDA

In [14]:
!nvidia-smi

Fri Sep 17 05:04:15 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [15]:
import torch
torch.cuda.is_available()

True

## Step 7: Defining the configuration of the model

In [16]:
from transformers import RobertaConfig
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

In [19]:
print(config)

RobertaConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.11.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}



In [22]:
# create config.json
config.save_pretrained("./KantaiBERT")

## Step 8: Reloading the tokenizer in transformers

In [23]:
from transformers import RobertaTokenizer
# model_max_length is the tokenizer argument that sets the maximum sequence length
tokenizer = RobertaTokenizer.from_pretrained("./KantaiBERT", model_max_length=512)

## Step 9: Initializing a model from scratch

In [24]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

In [29]:
print(model)

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(52000, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNor

### Exploring the parameters

In [30]:
print(model.num_parameters())

83504416
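The total can be reproduced by hand from the configuration values printed in Step 7. The sketch below assumes that the masked-LM head shares its output weights with the word embeddings (so only its bias is counted) and that the model has no pooling layer. The helper `linear` and the variable names are for illustration only.

```python
vocab_size = 52_000
max_positions = 514
type_vocab_size = 1
hidden = 768
intermediate = 3072
num_layers = 6


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Word, position and token-type embeddings followed by a LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab_size) * hidden + layer_norm

encoder_layer = (
    4 * linear(hidden, hidden)      # query, key, value and attention output
    + layer_norm                    # LayerNorm after attention
    + linear(hidden, intermediate)  # feed-forward expansion
    + linear(intermediate, hidden)  # feed-forward projection
    + layer_norm                    # LayerNorm after feed-forward
)

# Head: dense + LayerNorm + output bias (output weights assumed tied to the embeddings).
lm_head = linear(hidden, hidden) + layer_norm + vocab_size

total = embeddings + num_layers * encoder_layer + lm_head
print(total)  # 83504416
```

Nearly half of the parameters (39,936,000) sit in the word-embedding matrix, a direct consequence of the 52,000-token vocabulary.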


In [32]:
LP = list(model.parameters())
lp = len(LP)
print(lp)

106


In [None]:
for p in range(0,lp):
    print(LP[p])

In [35]:
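# Count the parameters tensor by tensor. Weight matrices print
# (index, rows, cols, count); 1-D tensors such as biases print (index, size, count).
total = 0
for i, param in enumerate(LP):
    if param.dim() == 2:
        rows, cols = param.shape
        count = rows * cols
        print(i, rows, cols, count)
    else:
        count = param.shape[0]
        print(i, count, count)
    total += count
print(total)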

0 52000 768 39936000
1 514 768 394752
2 1 768 768
3 768 768
4 768 768
5 768 768 589824
6 768 768
7 768 768 589824
8 768 768
9 768 768 589824
10 768 768
11 768 768 589824
12 768 768
13 768 768
14 768 768
15 3072 768 2359296
16 3072 3072
17 768 3072 2359296
18 768 768
19 768 768
20 768 768
21 768 768 589824
22 768 768
23 768 768 589824
24 768 768
25 768 768 589824
26 768 768
27 768 768 589824
28 768 768
29 768 768
30 768 768
31 3072 768 2359296
32 3072 3072
33 768 3072 2359296
34 768 768
35 768 768
36 768 768
37 768 768 589824
38 768 768
39 768 768 589824
40 768 768
41 768 768 589824
42 768 768
43 768 768 589824
44 768 768
45 768 768
46 768 768
47 3072 768 2359296
48 3072 3072
49 768 3072 2359296
50 768 768
51 768 768
52 768 768
53 768 768 589824
54 768 768
55 768 768 589824
56 768 768
57 768 768 589824
58 768 768
59 768 768 589824
60 768 768
61 768 768
62 768 768
63 3072 768 2359296
64 3072 3072
65 768 3072 2359296
66 768 768
67 768 768
68 768 768
69 768 768 589824
70 768 768
71 768 768

## Step 10: Building the dataset

In [37]:
%%time
from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./kant.txt",
    block_size=128,
)



CPU times: user 26.2 s, sys: 467 ms, total: 26.7 s
Wall time: 26.7 s


## Step 11: Defining a data collator

In [39]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)
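With `mlm=True` and `mlm_probability=0.15`, the collator picks about 15% of the non-special tokens in each batch as prediction targets. The sketch below illustrates the commonly documented corruption rule (80% replaced by the mask token, 10% by a random token, 10% left unchanged) in plain Python. It is a simplified stand-in, not the library implementation. The function `mask_tokens` and its arguments are hypothetical, and the IDs assume the special-token order used when training the tokenizer (`<s>`=0, `<pad>`=1, `</s>`=2, `<mask>`=4).

```python
import random

IGNORE_INDEX = -100  # label value that the loss function skips


def mask_tokens(input_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, rng=None):
    """Return (corrupted_inputs, labels) for one sequence of token IDs."""
    rng = rng or random.Random()
    inputs = list(input_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, token in enumerate(input_ids):
        # Special tokens are never selected; other tokens are selected with p = 0.15.
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


# Made-up sequence: <s>, 200 ordinary token IDs, </s>.
ids = [0] + list(range(5, 205)) + [2]
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=52_000,
                             special_ids={0, 1, 2}, rng=random.Random(0))
selected = sum(label != IGNORE_INDEX for label in labels)
print(f"{selected} of {len(ids)} positions selected for prediction")
```

Only the selected positions contribute to the loss; every other label is set to `-100` and ignored.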

## Step 12: Initializing the trainer

In [40]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./KantaiBERT",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

## Step 13: Pretraining the model

In [41]:
%%time
trainer.train()

***** Running training *****
  Num examples = 170964
  Num Epochs = 1
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 2672


| Step | Training Loss |
|------|---------------|
| 500  | 6.6095        |
| 1000 | 5.7504        |
| 1500 | 5.2707        |
| 2000 | 5.0062        |
| 2500 | 4.8507        |




Training completed. Do not forget to share your model on huggingface.co/models =)




CPU times: user 9min 38s, sys: 2.56 s, total: 9min 41s
Wall time: 9min 42s


TrainOutput(global_step=2672, training_loss=5.452941757476259, metrics={'train_runtime': 582.9666, 'train_samples_per_second': 293.266, 'train_steps_per_second': 4.583, 'total_flos': 873620128952064.0, 'train_loss': 5.452941757476259, 'epoch': 1.0})
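The figures in the training log can be cross-checked with a little arithmetic. The sketch below assumes a single device with no gradient accumulation (as the log reports) and that the reported loss is a mean cross-entropy in nats, so that its exponential can be read as a perplexity. The variable names are illustrative.

```python
import math

num_examples = 170_964
batch_size = 64
num_epochs = 1
train_runtime = 582.9666  # seconds, from TrainOutput

# The last partial batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 2672

print(round(total_steps / train_runtime, 3))   # steps per second
print(round(num_examples / train_runtime, 1))  # samples per second

# Loss of a model that guesses uniformly over the 52,000-token vocabulary.
uniform_loss = math.log(52_000)
# Perplexity implied by the last logged training loss.
last_logged_loss = 4.8507
perplexity = math.exp(last_logged_loss)
print(round(uniform_loss, 2), round(perplexity, 1))
```

Read this way, one epoch takes the model from near-uniform guessing over the vocabulary to an effective choice among a little over a hundred candidates per masked token, which is still far from a well-trained language model.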

## Step 14: Saving the final model (weights + config) to disk

In [42]:
trainer.save_model("./KantaiBERT")

Saving model checkpoint to ./KantaiBERT
Configuration saved in ./KantaiBERT/config.json
Model weights saved in ./KantaiBERT/pytorch_model.bin


## Step 15: Language modeling with FillMaskPipeline

In [None]:
from transformers import pipeline
fill_mask = pipeline(
    "fill-mask",
    model="./KantaiBERT",
    tokenizer="./KantaiBERT",
)

In [44]:
fill_mask("Human thinking involves human <mask>.")

[{'score': 0.03740358352661133,
  'sequence': 'Human thinking involves human reason.',
  'token': 393,
  'token_str': ' reason'},
 {'score': 0.015837745741009712,
  'sequence': 'Human thinking involves human experience.',
  'token': 531,
  'token_str': ' experience'},
 {'score': 0.013494659215211868,
  'sequence': 'Human thinking involves human it.',
  'token': 306,
  'token_str': ' it'},
 {'score': 0.009716475382447243,
  'sequence': 'Human thinking involves human conceptions.',
  'token': 605,
  'token_str': ' conceptions'},
 {'score': 0.009529100731015205,
  'sequence': 'Human thinking involves human law.',
  'token': 446,
  'token_str': ' law'}]
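Each `score` above is a probability for a candidate token at the `<mask>` position. As a rough sketch of that last step, assuming the pipeline applies a softmax to the vocabulary logits at the masked position and keeps the highest-scoring entries, the ranking can be written in plain Python. The `toy_logits` values are made up and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k=5):
    """Return the k (token_id, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up logits for a six-token vocabulary, for illustration only.
toy_logits = [0.1, 2.0, -1.0, 1.5, 0.3, 0.0]
for token_id, score in top_k(toy_logits, k=3):
    print(token_id, round(score, 4))
```

The low top score in the real output (about 0.037 for ` reason`) shows that the probability mass is still spread thinly over the vocabulary after a single epoch.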