# Distillation training on the DBpedia dataset with the BERT TINY model
In this notebook, BERT TINY is trained on both the original and a reduced DBpedia dataset; the teacher model is a BERT fine-tuned on the same dataset. Augmentation is not explored here, given the size of the dataset and the excellent results the models reach even with normal training. Instead, an experiment with a reduced dataset (using only 10 % of it) is carried out.

No hyperparameter search outputs are available in this case, because of the size of the dataset and the training time it implies. The hyperparameters are essentially left at their default values, and excellent results are achieved nonetheless.

For completeness, the model size and inference speed are computed at the end of the notebook, although these outputs are of limited use for this dataset.

## Library imports and basic setup

In [2]:
from transformers import Trainer, BertForSequenceClassification, BertTokenizer, EarlyStoppingCallback
from torch.utils.data import DataLoader
from datasets import load_from_disk
import torch
import base
import copy
import os

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/jovyan/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


Resetting the random seed so that the results are reproducible.

In [3]:
base.reset_seed()
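
`base.reset_seed` is a local helper whose source is not shown in this notebook. A minimal sketch of what such a helper commonly does, assuming a default seed of 42 and that NumPy and PyTorch are seeded only when they are installed (the name `reset_seed_sketch` and both assumptions are illustrative, not the author's implementation):

```python
import random


def reset_seed_sketch(seed: int = 42) -> None:
    """Seed the random number generators that are available (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np  # optional dependency
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch  # optional dependency
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


# Re-seeding restores the same random stream.
reset_seed_sketch(0)
first = random.random()
reset_seed_sketch(0)
second = random.random()
print(first == second)
```

Calling the helper before every model initialization, as this notebook does, keeps the randomly initialized classifier head comparable across runs.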

Checking whether a GPU is available.

In [4]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available and will be used:", torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("GPU is not available, using CPU.")

GPU is available and will be used: NVIDIA A100 80GB PCIe MIG 2g.20gb


Loading the dataset and basic preprocessing.

In [5]:
DATASET = "dbpedia"

In [6]:
train = load_from_disk(f"~/data/{DATASET}/train-logits")
eval = load_from_disk(f"~/data/{DATASET}/eval-logits")
test = load_from_disk(f"~/data/{DATASET}/test-logits")

train_aug = load_from_disk(f"~/data/{DATASET}/train-logits-augmented")

In [7]:
tokenizer = BertTokenizer.from_pretrained("gchhablani/bert-base-cased-finetuned-sst2")

tokenizer_config.json:   0%|          | 0.00/320 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Tokenization, padding, and conversion to token IDs using the teacher's tokenizer.

In [8]:
train = train.map(lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length", return_tensors="pt", max_length=300), batched=True, desc="Tokenizing the train dataset")
eval = eval.map(lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length", return_tensors="pt", max_length=300), batched=True, desc="Tokenizing the eval dataset")
test = test.map(lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length", return_tensors="pt", max_length=300), batched=True, desc="Tokenizing the test dataset")

Tokenizing the train dataset:   0%|          | 0/448000 [00:00<?, ? examples/s]

Tokenizing the eval dataset:   0%|          | 0/112000 [00:00<?, ? examples/s]

Tokenizing the test dataset:   0%|          | 0/70000 [00:00<?, ? examples/s]

Preparing the data loaders for the final inference speed check. The benchmark runs on a single record (batch size 1) from the training split.

In [9]:
train_data_gpu = copy.deepcopy(train)
train_data_gpu.set_format(type="torch", columns=["input_ids", "attention_mask"], device="cuda")
gpu_data_loader = DataLoader(train_data_gpu, batch_size=1, shuffle=False)

train_data_cpu = copy.deepcopy(train)
train_data_cpu.set_format(type="torch", columns=["input_ids", "attention_mask"], device="cpu")
cpu_data_loader = DataLoader(train_data_cpu, batch_size=1, shuffle=False)

In [10]:
base.reset_seed()

## Normal training with the original dataset

Obtaining the pre-trained model.

In [11]:
model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=14)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training configuration.

In [12]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/bert-base", logging_dir=f"~/logs/{DATASET}/bert-base", batch_size=128, epochs=5)

Configuring the trainer with a patience of 3 epochs.

In [13]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=eval,
    compute_metrics=base.compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)


Starting the training; the outputs are computed on the validation split of the dataset.

In [14]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.5366 | 0.094703 | 0.977125 | 0.977152 | 0.977125 | 0.977125 |
| 2 | 0.0864 | 0.069198 | 0.982375 | 0.982369 | 0.982375 | 0.982353 |
| 3 | 0.0624 | 0.064321 | 0.984 | 0.984013 | 0.984 | 0.983999 |
| 4 | 0.052 | 0.060338 | 0.984991 | 0.984987 | 0.984991 | 0.984987 |
| 5 | 0.0463 | 0.060093 | 0.985205 | 0.9852 | 0.985205 | 0.9852 |


TrainOutput(global_step=17500, training_loss=0.15675655081612724, metrics={'train_runtime': 551.1446, 'train_samples_per_second': 4064.269, 'train_steps_per_second': 31.752, 'total_flos': 1673755776000000.0, 'train_loss': 0.15675655081612724, 'epoch': 5.0})
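
To illustrate how a patience of 3 interacts with the validation results above, here is a minimal sketch of patience-based early stopping. It assumes the monitored metric is the validation loss (the real choice lives in `base.get_training_args`, which is not shown), and `first_stop_epoch` is a hypothetical helper, not part of `transformers`:

```python
def first_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training would stop, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Validation losses reported for the five epochs above.
val_losses = [0.094703, 0.069198, 0.064321, 0.060338, 0.060093]
print(first_stop_epoch(val_losses))  # None: the loss improved in every epoch
```

Under this assumption the callback never fires in this run, because the validation loss decreases in each of the five epochs.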

Switching the model to evaluation mode.


In [15]:
model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-1): 2 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=128, out_features=128, bias=True)
              (key): Linear(in_features=128, out_features=128, bias=True)
              (value): Linear(in_features=128, out_features=128, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=128, out_features=128, bias=True)
              (LayerNorm): LayerNorm((128,), eps=1e-1


Testing the model on the test split of the dataset.

In [16]:
trainer.evaluate(test)

{'eval_loss': 0.0605352483689785,
 'eval_accuracy': 0.9853,
 'eval_precision': 0.9852876434295242,
 'eval_recall': 0.9853000000000002,
 'eval_f1': 0.9852890099009545,
 'eval_runtime': 12.7692,
 'eval_samples_per_second': 5481.957,
 'eval_steps_per_second': 42.838,
 'epoch': 5.0}
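
In the output above, recall agrees with accuracy up to floating-point rounding. That is what support-weighted averaging over classes would produce, although `base.compute_metrics` is not shown, so the averaging mode is an assumption. A small plain-Python sketch of why the two coincide (toy labels, hypothetical function names):

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights proportional to class support."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    return sum((support[c] / n) * (hits[c] / support[c]) for c in support)


# Toy example with three classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]
print(accuracy(y_true, y_pred), weighted_recall(y_true, y_pred))
```

Each class weight cancels the class's support in the recall denominator, so the sum reduces to correct predictions divided by the total count.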

Saving the model.


In [17]:
torch.save(model.state_dict(), f"{os.path.expanduser('~')}/models/{DATASET}/bert-base.pth")

## Distillation training with the original dataset

Obtaining the pre-trained student model.

In [18]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=14)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training configuration.

In [19]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/bert-distill", logging_dir=f"~/logs/{DATASET}/bert-distill", remove_unused_columns=False, batch_size=128, epochs=5, temp=5, lambda_param=.5)

In [20]:
base.reset_seed()

Configuring the distillation trainer with a patience of 3 epochs.

In [None]:
trainer = base.DistilTrainer(
    student_model=student_model,
    args=training_args,
    train_dataset=train,
    eval_dataset=eval,
    compute_metrics=base.compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)
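
`base.DistilTrainer` is defined outside this notebook, so its exact loss is not visible here. A common formulation combines the hard-label cross-entropy with a temperature-softened KL divergence against the teacher logits. The sketch below assumes that formulation, with `temp=5` and `lambda_param=0.5` taken from the configuration above; which term `lambda_param` weights and the `temp**2` scaling are assumptions:

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, temp=5.0, lambda_param=0.5):
    """lambda * CE(student, label) + (1 - lambda) * T^2 * KL(teacher || student)."""
    hard = -math.log(softmax(student_logits)[label])
    p_teacher = softmax(teacher_logits, temp)
    p_student = softmax(student_logits, temp)
    soft = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    return lambda_param * hard + (1.0 - lambda_param) * temp ** 2 * soft


# Made-up logits for one example whose true label is class 0.
student = [2.0, 0.5, -1.0]
teacher = [4.0, 1.0, -2.0]
print(distillation_loss(student, teacher, label=0))
```

When the student reproduces the teacher logits exactly, the KL term vanishes and only the weighted cross-entropy remains.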

Starting the distillation training; the outputs are computed on the validation split of the dataset.

In [22]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 2.0714 | 0.340927 | 0.975723 | 0.975699 | 0.975723 | 0.975695 |
| 2 | 0.2348 | 0.136569 | 0.982071 | 0.982045 | 0.982071 | 0.982042 |
| 3 | 0.1512 | 0.115858 | 0.98375 | 0.983764 | 0.98375 | 0.983747 |
| 4 | 0.1291 | 0.106492 | 0.984812 | 0.984808 | 0.984813 | 0.984805 |
| 5 | 0.1191 | 0.104359 | 0.985152 | 0.985143 | 0.985152 | 0.985143 |


TrainOutput(global_step=17500, training_loss=0.5411351597377232, metrics={'train_runtime': 552.1792, 'train_samples_per_second': 4056.654, 'train_steps_per_second': 31.693, 'total_flos': 1673755776000000.0, 'train_loss': 0.5411351597377232, 'epoch': 5.0})

Switching the student to evaluation mode.

In [None]:
student_model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-1): 2 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=128, out_features=128, bias=True)
              (key): Linear(in_features=128, out_features=128, bias=True)
              (value): Linear(in_features=128, out_features=128, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=128, out_features=128, bias=True)
              (LayerNorm): LayerNorm((128,), eps=1e-1


Testing the model on the test split of the dataset.

In [24]:
trainer.evaluate(test)

{'eval_loss': 0.10508575290441513,
 'eval_accuracy': 0.9849,
 'eval_precision': 0.9848861407583568,
 'eval_recall': 0.9849000000000002,
 'eval_f1': 0.9848863992604471,
 'eval_runtime': 13.1417,
 'eval_samples_per_second': 5326.556,
 'eval_steps_per_second': 41.623,
 'epoch': 5.0}

Saving the student model.

In [25]:
torch.save(student_model.state_dict(), f"{os.path.expanduser('~')}/models/{DATASET}/bert-distill.pth")

## Normal training with the reduced dataset
Reducing the training set to 10 % of its original size with a stratified split.

In [26]:
data = train.train_test_split(test_size=0.1, seed=42, stratify_by_column="labels")
train = data["test"]
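
The cell above keeps the `"test"` part of the split, which is the 10 % slice. To make the idea of stratification concrete, here is a plain-Python sketch that samples the same fraction from every class. It is an illustration only, not how `datasets` implements `train_test_split`, and `stratified_subsample` is a hypothetical helper:

```python
import random
from collections import Counter, defaultdict


def stratified_subsample(labels, fraction=0.1, seed=42):
    """Return sorted indices that keep `fraction` of the examples of each class."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)
    chosen = []
    for indices in indices_by_class.values():
        keep = round(len(indices) * fraction)
        chosen.extend(rng.sample(indices, keep))
    return sorted(chosen)


# Toy data: 14 balanced classes with 100 examples each.
labels = [i % 14 for i in range(1400)]
subset = stratified_subsample(labels)
per_class = Counter(labels[i] for i in subset)
print(len(subset), sorted(per_class.values()))
```

Sampling per class preserves the class proportions, which matters most when the subset is small relative to the number of classes.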

In [27]:
base.reset_seed()

Obtaining the pre-trained model.

In [28]:
model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=14)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training configuration.

In [29]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/bert-base-small", logging_dir=f"~/logs/{DATASET}/bert-base-small", batch_size=128, epochs=5)

Configuring the trainer with a patience of 3 epochs.

In [30]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=eval,
    compute_metrics=base.compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)


Starting the training; the outputs are computed on the validation split of the dataset.

In [31]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 2.2929 | 1.692617 | 0.662982 | 0.705507 | 0.662982 | 0.622149 |
| 2 | 1.3349 | 0.921764 | 0.874027 | 0.876658 | 0.874027 | 0.869974 |
| 3 | 0.807 | 0.610662 | 0.912009 | 0.913787 | 0.912009 | 0.911571 |
| 4 | 0.5937 | 0.493739 | 0.926616 | 0.926949 | 0.926616 | 0.926475 |
| 5 | 0.5131 | 0.460695 | 0.930152 | 0.930016 | 0.930152 | 0.929927 |


TrainOutput(global_step=1750, training_loss=1.1083301304408482, metrics={'train_runtime': 140.7913, 'train_samples_per_second': 1591.007, 'train_steps_per_second': 12.43, 'total_flos': 167375577600000.0, 'train_loss': 1.1083301304408482, 'epoch': 5.0})
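
The reported `global_step` values can be cross-checked against the dataset sizes. The sketch below assumes one optimizer step per batch of 128 on a single device with no gradient accumulation (these settings are hidden in `base.get_training_args`); the 448,000 training examples come from the tokenization progress bar earlier in the notebook:

```python
import math


def total_steps(num_examples, batch_size, epochs):
    """Number of optimizer steps when every batch triggers one step."""
    return math.ceil(num_examples / batch_size) * epochs


full_steps = total_steps(448_000, batch_size=128, epochs=5)
reduced_steps = total_steps(44_800, batch_size=128, epochs=5)  # 10 % of the training set
print(full_steps, reduced_steps)  # 17500 1750
```

Both numbers match the `global_step` values logged for the full and the reduced runs, which confirms that the reduced training set holds 44,800 examples.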

Switching the model to evaluation mode.


In [32]:
model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-1): 2 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=128, out_features=128, bias=True)
              (key): Linear(in_features=128, out_features=128, bias=True)
              (value): Linear(in_features=128, out_features=128, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=128, out_features=128, bias=True)
              (LayerNorm): LayerNorm((128,), eps=1e-1


Testing the model on the test split of the dataset.

In [33]:
trainer.evaluate(test)

{'eval_loss': 0.4596535563468933,
 'eval_accuracy': 0.9308714285714286,
 'eval_precision': 0.9306752859287907,
 'eval_recall': 0.9308714285714286,
 'eval_f1': 0.9306204363239735,
 'eval_runtime': 12.8347,
 'eval_samples_per_second': 5453.985,
 'eval_steps_per_second': 42.619,
 'epoch': 5.0}

Saving the model.


In [34]:
torch.save(model.state_dict(), f"{os.path.expanduser('~')}/models/{DATASET}/bert-base-small.pth")

In [35]:
base.reset_seed()

## Distillation training with the reduced dataset
Obtaining the pre-trained student model.

In [36]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=14)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training configuration.

In [37]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/bert-distil-small", logging_dir=f"~/logs/{DATASET}/bert-distil-small", remove_unused_columns=False, batch_size=128, epochs=5, temp=5, lambda_param=.5)

Configuring the distillation trainer with a patience of 3 epochs.

In [40]:
trainer = base.DistilTrainer(
    student_model=student_model,
    args=training_args,
    train_dataset=train,
    eval_dataset=eval,
    compute_metrics=base.compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

Starting the distillation training; the outputs are computed on the validation split of the dataset.

In [41]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 6.1145 | 5.177001 | 0.622348 | 0.6795 | 0.622348 | 0.5627 |
| 2 | 4.5553 | 3.831782 | 0.797696 | 0.825916 | 0.797696 | 0.780174 |
| 3 | 3.5365 | 3.104917 | 0.872679 | 0.882128 | 0.872679 | 0.864674 |
| 4 | 3.0038 | 2.750278 | 0.898482 | 0.902815 | 0.898482 | 0.895672 |
| 5 | 2.7661 | 2.638188 | 0.903464 | 0.906945 | 0.903464 | 0.901162 |


TrainOutput(global_step=1750, training_loss=3.995237618582589, metrics={'train_runtime': 141.1614, 'train_samples_per_second': 1586.836, 'train_steps_per_second': 12.397, 'total_flos': 167375577600000.0, 'train_loss': 3.995237618582589, 'epoch': 5.0})

Switching the student to evaluation mode.

In [42]:
student_model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-1): 2 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=128, out_features=128, bias=True)
              (key): Linear(in_features=128, out_features=128, bias=True)
              (value): Linear(in_features=128, out_features=128, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=128, out_features=128, bias=True)
              (LayerNorm): LayerNorm((128,), eps=1e-1

Testing the student on the test split of the dataset.

In [43]:
trainer.evaluate(test)

{'eval_loss': 2.6322505474090576,
 'eval_accuracy': 0.9052857142857142,
 'eval_precision': 0.9087587920273819,
 'eval_recall': 0.9052857142857142,
 'eval_f1': 0.9030386768242152,
 'eval_runtime': 12.4953,
 'eval_samples_per_second': 5602.128,
 'eval_steps_per_second': 43.777,
 'epoch': 5.0}

Saving the student model.

In [44]:
torch.save(student_model.state_dict(), f"{os.path.expanduser('~')}/models/{DATASET}/bert-distil-small.pth")

Counting the trainable parameters of the model.

In [45]:
base.count_parameters(student_model)

model size: 16.746MB.
Total Trainable Params: 4387726.


| # | Modules | Parameters |
|---|---|---|
| 0 | bert.embeddings.word_embeddings.weight | 3906816 |
| 1 | bert.embeddings.position_embeddings.weight | 65536 |
| 2 | bert.embeddings.token_type_embeddings.weight | 256 |
| 3 | bert.embeddings.LayerNorm.weight | 128 |
| 4 | bert.embeddings.LayerNorm.bias | 128 |
| 5 | bert.encoder.layer.0.attention.self.query.weight | 16384 |
| 6 | bert.encoder.layer.0.attention.self.query.bias | 128 |
| 7 | bert.encoder.layer.0.attention.self.key.weight | 16384 |
| 8 | bert.encoder.layer.0.attention.self.key.bias | 128 |
| 9 | bert.encoder.layer.0.attention.self.value.weight | 16384 |
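
The total of 4,387,726 parameters can be reproduced by hand from the architecture. The sketch below uses the sizes printed in the model summary (hidden size 128, vocabulary 30522, 512 positions, 2 layers, 14 labels). The feed-forward size of 512 and the pooler layer are assumptions, because the printed summary is cut off before those modules:

```python
hidden, vocab, max_positions, type_vocab = 128, 30522, 512, 2
intermediate = 512  # assumed feed-forward size (4 * hidden), not visible in the truncated summary
num_layers, num_labels = 2, 14


def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
encoder_layer = (
    3 * linear(hidden, hidden)      # query, key, value projections
    + linear(hidden, hidden)        # attention output projection
    + layer_norm
    + linear(hidden, intermediate)  # feed-forward up-projection
    + linear(intermediate, hidden)  # feed-forward down-projection
    + layer_norm
)
pooler = linear(hidden, hidden)          # assumed, as in the standard BERT pooler
classifier = linear(hidden, num_labels)

total = embeddings + num_layers * encoder_layer + pooler + classifier
print(total)  # 4387726
```

The embedding tables account for roughly 90 % of the parameters, so the vocabulary size dominates the size of this model.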


Measuring inference speed on the CPU: 1000 runs with a single record.

In [50]:
cpu_benchmark = base.BenchMarkRunner(student_model, cpu_data_loader, "cpu", 1000)
print(cpu_benchmark.run_benchmark())

<torch.utils.benchmark.utils.common.Measurement object at 0x7d3f7a74b8b0>
self.infer_speed_comp()
  3.71 ms
  1 measurement, 1000 runs , 4 threads


Measuring inference speed on the GPU: 1000 runs with a single record.

In [51]:
gpu_benchmark = base.BenchMarkRunner(student_model, gpu_data_loader, "cuda", 1000)
print(gpu_benchmark.run_benchmark())

<torch.utils.benchmark.utils.common.Measurement object at 0x7d3f7a7f2020>
self.infer_speed_comp()
  2.35 ms
  1 measurement, 1000 runs , 4 threads
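
`base.BenchMarkRunner` is a local wrapper whose source is not shown; the printed `Measurement` objects indicate that it builds on `torch.utils.benchmark`. The underlying idea, total wall time divided by the number of runs, can be sketched with the standard library alone. `mean_latency_ms` and the stand-in workload are hypothetical:

```python
import timeit


def mean_latency_ms(fn, runs=1000):
    """Average wall-clock time of one call to `fn`, in milliseconds."""
    total_seconds = timeit.Timer(fn).timeit(number=runs)
    return total_seconds / runs * 1000.0


def fake_inference():
    """Stand-in workload; a real benchmark would call the model on one batch."""
    return sum(i * i for i in range(1000))


print(f"{mean_latency_ms(fake_inference, runs=200):.3f} ms")
```

A plain timer like this is not sufficient for GPU measurements, because CUDA kernels run asynchronously; `torch.utils.benchmark` handles the required synchronization.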
