# Distillation training of BERT Tiny on the TREC (coarse) dataset
This notebook trains BERT Tiny on both the original and the augmented TREC (coarse) dataset. The teacher model is a BERT fine-tuned on the same dataset.

For both the original and the augmented dataset, standard training and training with knowledge distillation are run using the hyperparameters found in the hp_search notebook. Unlike the hyperparameter search, these runs use early stopping (EarlyStoppingCallback) to prevent overfitting. Results on the test split are also collected, together with the other metrics used in this work (model size and inference speed).

Distillation uses logits precomputed in the precompute_logits notebook. The configuration of each training run is based on the five best runs of the hyperparameter search for that configuration. Training is capped at 20 epochs, and early stopping uses a patience of four epochs.

## Library imports and basic setup

In [2]:
from transformers import Trainer, BertForSequenceClassification, BertTokenizer, EarlyStoppingCallback
from datasets import load_from_disk
from torch.utils.data import DataLoader
import torch
import base
import os
import copy

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/jovyan/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


Reset the random seed so that the results are reproducible.

In [3]:
base.reset_seed()
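
The helper `base.reset_seed()` comes from the project's `base` module and is not shown in this notebook. As a rough sketch of what such a helper typically does, the hypothetical `reset_seed_sketch` below reseeds Python's `random` module and, when they are installed, NumPy and PyTorch. The function name and the default seed of 42 are assumptions, not taken from the project.

```python
import os
import random


def reset_seed_sketch(seed: int = 42) -> None:
    """Hypothetical stand-in for base.reset_seed(); the real helper may differ."""
    # Only affects Python subprocesses started later; the hash seed of the
    # running interpreter is fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed, nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed, nothing to seed


# Reseeding before each draw makes the two sequences identical.
reset_seed_sketch()
first = [random.random() for _ in range(3)]
reset_seed_sketch()
second = [random.random() for _ in range(3)]
print(first == second)
```

Calling the helper before every training run, as this notebook does, means each run starts from the same random state regardless of what ran before it.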

Check whether a GPU is available.

In [5]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available and will be used:", torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("GPU is not available, using CPU.")

GPU is available and will be used: NVIDIA A100 80GB PCIe MIG 2g.20gb


Load the dataset and apply basic preprocessing.

In [None]:
DATASET = "trec"

In [6]:
train = load_from_disk(f"~/data/{DATASET}/train-logits_coarse")
eval = load_from_disk(f"~/data/{DATASET}/eval-logits_coarse")
test = load_from_disk(f"~/data/{DATASET}/test-logits_coarse")

train_aug = load_from_disk(f"~/data/{DATASET}/train-logits-augmented_coarse")

In [7]:
tokenizer = BertTokenizer.from_pretrained("carrassi-ni/bert-base-trec-question-classification")

Tokenization, padding, and conversion to token IDs using the teacher's tokenizer.

In [8]:
train = train.map(lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length", return_tensors="pt", max_length=300), batched=True, desc="Tokenizing the train dataset")
eval = eval.map(lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length", return_tensors="pt", max_length=300), batched=True, desc="Tokenizing the eval dataset")
test = test.map(lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length", return_tensors="pt", max_length=300), batched=True, desc="Tokenizing the test dataset")

train_aug = train_aug.map(lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length", return_tensors="pt", max_length=300), batched=True, desc="Tokenizing the augmented dataset")

Prepare the data loaders for the final inference speed check. The check runs on a single record from the training split.

In [9]:
train_data_gpu = copy.deepcopy(train)
train_data_gpu.set_format(type="torch", columns=["input_ids", "attention_mask"], device="cuda")
gpu_data_loader = DataLoader(train_data_gpu, batch_size=1, shuffle=False)

train_data_cpu = copy.deepcopy(train)
train_data_cpu.set_format(type="torch", columns=["input_ids", "attention_mask"], device="cpu")
cpu_data_loader = DataLoader(train_data_cpu, batch_size=1, shuffle=False)

In [10]:
base.reset_seed()

## Standard training on the original dataset

Load the pretrained model.

In [11]:
model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=6)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training configuration; the chosen parameters are based on the five best results of the hyperparameter search.

In [12]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/bert-base_coarse", logging_dir=f"~/logs/{DATASET}/bert-base_coarse", epochs=20, lr=0.00045, weight_decay=.003, warmup_steps=3)

Set up the trainer with a patience of 4 epochs.

In [13]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=eval,
    compute_metrics=base.compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 4)]
)
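
The patience logic of early stopping can be sketched in plain Python. The hypothetical helper `epochs_until_stop` below mirrors the general idea behind `EarlyStoppingCallback` with its default threshold of 0: training stops once the monitored validation metric has failed to improve for `patience` consecutive evaluations. Which metric is monitored is decided inside `base.get_training_args`, which is not shown here, so the use of F1 in the example is an assumption.

```python
from typing import List, Optional, Tuple


def epochs_until_stop(scores: List[float], patience: int = 4,
                      greater_is_better: bool = True) -> Tuple[int, int, int]:
    """Return (epochs_run, best_epoch, epochs_without_improvement), 1-indexed."""
    best: Optional[float] = None
    best_epoch = 0
    stale = 0  # consecutive evaluations without improvement
    for epoch, score in enumerate(scores, start=1):
        if best is None:
            improved = True
        elif greater_is_better:
            improved = score > best
        else:
            improved = score < best
        if improved:
            best, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best_epoch, stale  # early stop
    return len(scores), best_epoch, stale


# Validation F1 logged for the first ten epochs of the standard run
# on the original dataset (assumed here to be the monitored metric).
f1_first_ten = [0.481256, 0.589943, 0.663113, 0.744266, 0.769059,
                0.765189, 0.772577, 0.791861, 0.778773, 0.767168]
print(epochs_until_stop(f1_first_ten, patience=4))

# Toy curve that peaks early and then stalls for four epochs.
toy = [0.5, 0.6, 0.59, 0.58, 0.57, 0.56, 0.9]
print(epochs_until_stop(toy, patience=4))
```

In the toy curve the late jump to 0.9 is never seen, because the run has already been stopped. That is the trade-off a patience setting makes against the 20-epoch cap.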


Run training; the reported metrics are computed on the validation split.

In [14]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,1.5392,1.230905,0.571952,0.549015,0.474968,0.481256
2,1.0179,0.846594,0.690192,0.597085,0.589547,0.589943
3,0.6295,0.711778,0.75802,0.808883,0.66091,0.663113
4,0.4033,0.654502,0.788268,0.833215,0.720405,0.744266
5,0.2623,0.648941,0.799267,0.817606,0.748744,0.769059
6,0.1724,0.715405,0.792851,0.81846,0.741942,0.765189
7,0.1113,0.73791,0.791934,0.818045,0.751066,0.772577
8,0.0808,0.74479,0.805683,0.828014,0.771898,0.791861
9,0.0663,0.822157,0.79835,0.826576,0.755023,0.778773
10,0.0473,0.888195,0.788268,0.788297,0.758031,0.767168


TrainOutput(global_step=420, training_loss=0.36715816543215796, metrics={'train_runtime': 61.639, 'train_samples_per_second': 1415.013, 'train_steps_per_second': 11.356, 'total_flos': 39005907393600.0, 'train_loss': 0.36715816543215796, 'epoch': 12.0})

Switch the model to evaluation mode.


In [15]:
model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-1): 2 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=128, out_features=128, bias=True)
              (key): Linear(in_features=128, out_features=128, bias=True)
              (value): Linear(in_features=128, out_features=128, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=128, out_features=128, bias=True)
              (LayerNorm): LayerNorm((128,), eps=1e-1


Evaluate the model on the test split.

In [16]:
trainer.evaluate(test)

{'eval_loss': 0.5971861481666565,
 'eval_accuracy': 0.844,
 'eval_precision': 0.8336845142766452,
 'eval_recall': 0.8331648587390704,
 'eval_f1': 0.8316674032757624,
 'eval_runtime': 3.2722,
 'eval_samples_per_second': 152.803,
 'eval_steps_per_second': 1.222,
 'epoch': 12.0}
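
The metric function `base.compute_metrics` is not shown in this notebook. In single-label classification, micro-averaged and support-weighted recall both equal accuracy, and the reported recall differs from the reported accuracy, so macro averaging is a plausible reading. The sketch below works under that assumption; the helper name `macro_metrics` and the toy labels are hypothetical.

```python
from typing import Dict, Sequence


def macro_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                  num_labels: int) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall and F1 (assumed averaging)."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for label in range(num_labels):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / num_labels,
        "recall": sum(recalls) / num_labels,
        "f1": sum(f1s) / num_labels,
    }


# Toy example with three classes and six records.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
metrics = macro_metrics(toy_true, toy_pred, num_labels=3)
print({name: round(value, 4) for name, value in metrics.items()})
```

Macro averaging gives every class the same weight regardless of how many records it has, which is why it can diverge from accuracy when the classes are imbalanced.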

Save the model.


In [17]:
torch.save(model.state_dict(), f"{os.path.expanduser('~')}/models/{DATASET}/bert-base_coarse.pth")

In [18]:
base.reset_seed()

## Distillation training on the original dataset

Load the pretrained student model.

In [19]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=6)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Distillation training configuration; the chosen parameters are based on the five best results of the hyperparameter search.

In [20]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/bert-distill_coarse", logging_dir=f"~/logs/{DATASET}/bert-distill_coarse", remove_unused_columns=False, epochs=20, lr=.0004, weight_decay=.006, warmup_steps=3, temp=2.5, lambda_param=.6)
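
The `temp` and `lambda_param` arguments suggest the usual knowledge distillation loss: a hard cross-entropy term against the true label combined with a soft KL-divergence term between temperature-softened teacher and student distributions. The implementation of `base.DistilTrainer` is not shown, so the following pure-Python sketch is only an illustration. In particular, it assumes that `lambda_param` weights the soft term and that the soft term is scaled by the squared temperature; the logits in the example are made up.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax of logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: Sequence[float],
                      teacher_logits: Sequence[float],
                      label: int,
                      temp: float = 2.5,
                      lambda_param: float = 0.6) -> float:
    """Weighted sum of hard cross-entropy and soft KL(teacher || student)."""
    # Hard term: cross-entropy of the student against the true label (T = 1).
    hard = -math.log(softmax(student_logits)[label])
    # Soft term: KL divergence between softened distributions, scaled by T^2
    # so its gradient magnitude stays comparable across temperatures.
    p_teacher = softmax(teacher_logits, temp)
    p_student = softmax(student_logits, temp)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    soft = kl * temp ** 2
    # Assumption: lambda_param is the weight of the soft (teacher) term.
    return (1.0 - lambda_param) * hard + lambda_param * soft


# Made-up logits for one record with six classes, true label 0.
student = [2.0, 0.5, -1.0, 0.0, 0.3, -0.5]
teacher = [4.0, 1.0, -2.0, 0.5, 0.2, -1.0]
loss = distillation_loss(student, teacher, label=0)
print(f"combined loss: {loss:.4f}")
```

With precomputed teacher logits, as in this notebook, the teacher never has to be run during student training; its logits are simply read from the dataset for each record.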

Set up the distillation trainer with a patience of 4 epochs.

In [21]:
trainer = base.DistilTrainer(
    student_model=student_model,
    args=training_args,
    train_dataset=train,
    eval_dataset=eval,
    compute_metrics=base.compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 4)]
)

Run distillation training; the reported metrics are computed on the validation split.

In [22]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,3.8744,3.274858,0.545371,0.518519,0.452891,0.45555
2,2.8588,2.414468,0.669111,0.592933,0.569114,0.572565
3,1.9464,1.865984,0.742438,0.627257,0.638587,0.632136
4,1.2782,1.559126,0.772686,0.823563,0.673642,0.678479
5,0.8224,1.452547,0.781852,0.661701,0.671452,0.665183
6,0.5305,1.428145,0.791017,0.834769,0.714999,0.734863
7,0.3454,1.453631,0.793767,0.813232,0.735755,0.755643
8,0.2614,1.432611,0.805683,0.825315,0.753287,0.774574
9,0.1987,1.520586,0.802016,0.820757,0.760647,0.779352
10,0.1583,1.616826,0.785518,0.81085,0.747417,0.766915


TrainOutput(global_step=630, training_loss=0.7270351145002577, metrics={'train_runtime': 96.4329, 'train_samples_per_second': 904.464, 'train_steps_per_second': 7.259, 'total_flos': 58508861090400.0, 'train_loss': 0.7270351145002577, 'epoch': 18.0})

Switch the student to evaluation mode.

In [23]:
student_model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-1): 2 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=128, out_features=128, bias=True)
              (key): Linear(in_features=128, out_features=128, bias=True)
              (value): Linear(in_features=128, out_features=128, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=128, out_features=128, bias=True)
              (LayerNorm): LayerNorm((128,), eps=1e-1


Evaluate the model on the test split.

In [24]:
trainer.evaluate(test)

{'eval_loss': 1.4004614353179932,
 'eval_accuracy': 0.838,
 'eval_precision': 0.821224466394745,
 'eval_recall': 0.8388925302933874,
 'eval_f1': 0.8259890406235443,
 'eval_runtime': 3.2361,
 'eval_samples_per_second': 154.508,
 'eval_steps_per_second': 1.236,
 'epoch': 18.0}

Save the student model.

In [25]:
torch.save(student_model.state_dict(), f"{os.path.expanduser('~')}/models/{DATASET}/bert-distil_coarse.pth")

In [26]:
base.reset_seed()

## Standard training on the augmented dataset
Load the pretrained model.

In [27]:
model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=6)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training configuration; the chosen parameters are based on the five best results of the hyperparameter search.


In [28]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/bert-base-aug_coarse", logging_dir=f"~/logs/{DATASET}/bert-base-aug_coarse", epochs=20, lr=.00003, weight_decay=.005, warmup_steps=18)

Set up the trainer with a patience of 4 epochs.

In [29]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_aug,
    eval_dataset=eval,
    compute_metrics=base.compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 4)]
)


Run training; the reported metrics are computed on the validation split.

In [30]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,1.5544,1.375357,0.506874,0.568107,0.402775,0.39319
2,1.1346,0.988684,0.697525,0.628769,0.591833,0.602613
3,0.8347,0.804513,0.744271,0.646096,0.635124,0.638677
4,0.636,0.719141,0.768103,0.829947,0.67557,0.692393
5,0.4927,0.663883,0.793767,0.81835,0.734783,0.758528
6,0.398,0.65159,0.8011,0.826361,0.749804,0.773761
7,0.3373,0.64169,0.806599,0.831157,0.753869,0.778146
8,0.2891,0.648874,0.805683,0.828512,0.752877,0.776549
9,0.2554,0.657757,0.804766,0.828051,0.752326,0.77607
10,0.2312,0.67107,0.8011,0.823683,0.750083,0.77265


TrainOutput(global_step=3355, training_loss=0.5793979991566051, metrics={'train_runtime': 132.1774, 'train_samples_per_second': 5889.356, 'train_steps_per_second': 46.15, 'total_flos': 319117694781600.0, 'train_loss': 0.5793979991566051, 'epoch': 11.0})

Switch the model to evaluation mode.


In [31]:
model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-1): 2 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=128, out_features=128, bias=True)
              (key): Linear(in_features=128, out_features=128, bias=True)
              (value): Linear(in_features=128, out_features=128, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=128, out_features=128, bias=True)
              (LayerNorm): LayerNorm((128,), eps=1e-1


Evaluate the model on the test split.

In [32]:
trainer.evaluate(test)

{'eval_loss': 0.47989678382873535,
 'eval_accuracy': 0.854,
 'eval_precision': 0.8755966694475799,
 'eval_recall': 0.8254649373070079,
 'eval_f1': 0.8440515515873256,
 'eval_runtime': 5.451,
 'eval_samples_per_second': 91.726,
 'eval_steps_per_second': 0.734,
 'epoch': 11.0}

Save the model.


In [33]:
torch.save(model.state_dict(), f"{os.path.expanduser('~')}/models/{DATASET}/bert-base-aug_coarse.pth")

In [34]:
base.reset_seed()

## Distillation training on the augmented dataset

Load the pretrained student model.

In [35]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=6)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Distillation training configuration; the chosen parameters are based on the five best results of the hyperparameter search.

In [36]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/bert-distill-aug_coarse", logging_dir=f"~/logs/{DATASET}/bert-distill-aug_coarse", remove_unused_columns=False, epochs=20, lr=.00025, weight_decay=.005, temp=4, lambda_param=.7)

Set up the distillation trainer with a patience of 4 epochs.

In [37]:
trainer = base.DistilTrainer(
    student_model=student_model,
    args=training_args,
    train_dataset=train_aug,
    eval_dataset=eval,
    compute_metrics=base.compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 4)]
)

Run distillation training; the reported metrics are computed on the validation split.

In [38]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,2.3409,1.589709,0.792851,0.755488,0.688767,0.692523
2,0.5388,1.298756,0.827681,0.843854,0.782121,0.801464
3,0.294,1.343081,0.826764,0.838112,0.782522,0.798631
4,0.2271,1.255016,0.832264,0.837746,0.783864,0.802893
5,0.19,1.259071,0.840513,0.846026,0.788898,0.809829
6,0.1662,1.240512,0.84143,0.84366,0.790784,0.8099
7,0.1473,1.297121,0.840513,0.830456,0.790599,0.805768
8,0.1363,1.251859,0.842346,0.830643,0.792419,0.80655
9,0.1256,1.289922,0.84143,0.840766,0.802678,0.815739
10,0.1162,1.301587,0.842346,0.831805,0.792937,0.807246


TrainOutput(global_step=3965, training_loss=0.35348223365119846, metrics={'train_runtime': 157.083, 'train_samples_per_second': 4955.598, 'train_steps_per_second': 38.833, 'total_flos': 377139093832800.0, 'train_loss': 0.35348223365119846, 'epoch': 13.0})

Switch the student to evaluation mode.

In [39]:
student_model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-1): 2 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=128, out_features=128, bias=True)
              (key): Linear(in_features=128, out_features=128, bias=True)
              (value): Linear(in_features=128, out_features=128, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=128, out_features=128, bias=True)
              (LayerNorm): LayerNorm((128,), eps=1e-1

Evaluate the student on the test split.

In [40]:
trainer.evaluate(test)

{'eval_loss': 1.409029245376587,
 'eval_accuracy': 0.854,
 'eval_precision': 0.8545244870998044,
 'eval_recall': 0.8491272999587253,
 'eval_f1': 0.8489970789959579,
 'eval_runtime': 3.7279,
 'eval_samples_per_second': 134.125,
 'eval_steps_per_second': 1.073,
 'epoch': 13.0}

Save the student model.

In [41]:
torch.save(student_model.state_dict(), f"{os.path.expanduser('~')}/models/{DATASET}/bert-distil-aug_coarse.pth")

Count the trainable parameters in the model.

In [42]:
base.count_parameters(student_model)

model size: 16.742MB.
Total Trainable Params: 4386694.


Index,Modules,Parameters
0,bert.embeddings.word_embeddings.weight,3906816
1,bert.embeddings.position_embeddings.weight,65536
2,bert.embeddings.token_type_embeddings.weight,256
3,bert.embeddings.LayerNorm.weight,128
4,bert.embeddings.LayerNorm.bias,128
5,bert.encoder.layer.0.attention.self.query.weight,16384
6,bert.encoder.layer.0.attention.self.query.bias,128
7,bert.encoder.layer.0.attention.self.key.weight,16384
8,bert.encoder.layer.0.attention.self.key.bias,128
9,bert.encoder.layer.0.attention.self.value.weight,16384
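
The reported totals can be cross-checked by hand from the architecture. The arithmetic below is a sketch under two assumptions that the truncated output does not confirm: the feed-forward intermediate size is 4 × 128 = 512, and the reported model size also counts two int64 index buffers of length 512 (position and token type IDs) in addition to the 32-bit float parameters.

```python
# Shapes taken from the printed model and parameter table.
VOCAB, MAX_POS, TYPES = 30522, 512, 2
HIDDEN, LAYERS, LABELS = 128, 2, 6
INTERMEDIATE = 4 * HIDDEN  # assumption: not visible in the truncated output


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(n: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


embeddings = (VOCAB + MAX_POS + TYPES) * HIDDEN + layer_norm(HIDDEN)
# Query, key, value and output projections plus the attention LayerNorm.
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)
feed_forward = (linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN)
                + layer_norm(HIDDEN))
encoder = LAYERS * (attention + feed_forward)
pooler = linear(HIDDEN, HIDDEN)
classifier = linear(HIDDEN, LABELS)

total_params = embeddings + encoder + pooler + classifier
# 4 bytes per float32 parameter, plus two assumed int64 buffers of length 512.
size_bytes = total_params * 4 + 2 * MAX_POS * 8
size_mb = size_bytes / 1024 ** 2

print(total_params, round(size_mb, 3))
```

Under these assumptions the sketch arrives at the same 4386694 parameters and 16.742 MB reported above. About 89 % of the parameters sit in the word embedding matrix, so the vocabulary size dominates the footprint of a model this small.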


Measure inference speed on the CPU: 1000 runs with a single record.

In [43]:
cpu_benchmark = base.BenchMarkRunner(student_model, cpu_data_loader, "cpu", 1000)
print(cpu_benchmark.run_benchmark())

<torch.utils.benchmark.utils.common.Measurement object at 0x7089f9f9e7d0>
self.infer_speed_comp()
  3.95 ms
  1 measurement, 1000 runs , 4 threads
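
The class `base.BenchMarkRunner` is not shown here; judging by the output, it wraps `torch.utils.benchmark`. The underlying idea, averaging the wall-clock time of many repeated calls after a short warm-up, can be sketched with the standard library alone. The helper `mean_latency_ms` and the stand-in workload `fake_inference` are hypothetical and only illustrate the measurement pattern.

```python
import timeit
from typing import Callable


def mean_latency_ms(fn: Callable[[], object], runs: int = 1000,
                    warmup: int = 10) -> float:
    """Average wall-clock time of one call to fn, in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches before timing
    total_seconds = timeit.timeit(fn, number=runs)
    return total_seconds / runs * 1000.0


def fake_inference() -> int:
    """Stand-in workload; a real benchmark would call the model on one record."""
    return sum(i * i for i in range(1000))


latency = mean_latency_ms(fake_inference, runs=1000)
print(f"{latency:.3f} ms per call")
```

A plain timer like this is only adequate for CPU work. GPU kernels run asynchronously, so a GPU measurement has to synchronize the device before reading the clock, which `torch.utils.benchmark` takes care of.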


Measure inference speed on the GPU: 1000 runs with a single record.

In [44]:
gpu_benchmark = base.BenchMarkRunner(student_model, gpu_data_loader, "cuda", 1000)
print(gpu_benchmark.run_benchmark())

<torch.utils.benchmark.utils.common.Measurement object at 0x7089f9f41630>
self.infer_speed_comp()
  2.30 ms
  1 measurement, 1000 runs , 4 threads
