# Distillation training on the TREC (fine) dataset with the BERT TINY model
This notebook trains BERT TINY on both the original and the augmented TREC (fine) dataset; a BERT model fine-tuned on the same dataset serves as the teacher.

For both the original and the augmented dataset, standard training and training with knowledge distillation are run using the hyperparameters found in the hp_search notebook. Unlike the hyperparameter search, these runs use EarlyStopping to prevent overfitting. In addition, results on the test split are collected together with the other metrics used in this work (model size and inference speed).

Distillation uses the logits precomputed in the precompute_logits notebook. The configuration of each training run is derived from the five best runs of the hyperparameter search for that configuration. Training is capped at 20 epochs, and EarlyStopping uses a patience of three epochs.

## Library imports and basic setup

In [1]:
from transformers import Trainer, BertForSequenceClassification, BertTokenizer, EarlyStoppingCallback
from datasets import load_from_disk
from torch.utils.data import DataLoader
import torch
import base
import os 
import copy

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/jovyan/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


Reset the random seed so that the results are reproducible.

In [2]:
base.reset_seed()
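
The body of `base.reset_seed` is not shown in this notebook. The following is only a minimal sketch of what such a helper commonly does, assuming a hypothetical default seed of 42 and treating NumPy and PyTorch as optional dependencies:

```python
import random


def reset_seed(seed: int = 42) -> None:
    """Re-seed every random number generator that is available (seed 42 is an assumption)."""
    random.seed(seed)  # Python's built-in RNG
    try:
        import numpy as np

        np.random.seed(seed)  # NumPy's global RNG
    except ImportError:
        pass
    try:
        import torch

        torch.manual_seed(seed)  # PyTorch RNG
        torch.cuda.manual_seed_all(seed)  # all CUDA devices; silently ignored without a GPU
    except ImportError:
        pass


reset_seed(123)
first = random.random()
reset_seed(123)
second = random.random()
print(first == second)  # the same seed reproduces the same draw
```

Calling the helper before each training run, as the notebook does, keeps the runs independent of the order in which cells were executed.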

Check whether a GPU is available.

In [4]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available and will be used:", torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("GPU is not available, using CPU.")

GPU is available and will be used: NVIDIA A100 80GB PCIe MIG 2g.20gb


Load the dataset and apply basic preprocessing.

In [3]:
DATASET = "trec"

In [4]:
train = load_from_disk(f"~/data/{DATASET}/train-logits_fine")
eval = load_from_disk(f"~/data/{DATASET}/eval-logits_fine")
test = load_from_disk(f"~/data/{DATASET}/test-logits_fine")

train_aug = load_from_disk(f"~/data/{DATASET}/train-logits-augmented_fine")

In [5]:
tokenizer = BertTokenizer.from_pretrained("ndavid/autotrain-trec-fine-bert-739422530")

Tokenize, pad, and convert the text to token IDs using the teacher's tokenizer.

In [6]:
def tokenize(batch):
    # Pad or truncate every sentence to 300 tokens with the teacher's tokenizer.
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", return_tensors="pt", max_length=300)

train = train.map(tokenize, batched=True, desc="Tokenizing the train dataset")
eval = eval.map(tokenize, batched=True, desc="Tokenizing the eval dataset")
test = test.map(tokenize, batched=True, desc="Tokenizing the test dataset")

train_aug = train_aug.map(tokenize, batched=True, desc="Tokenizing the augmented dataset")
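
To make the effect of `truncation=True` with `padding="max_length"` concrete, here is a simplified, framework-free sketch. The helper `pad_or_truncate` is hypothetical, and unlike the real tokenizer it does not preserve the trailing `[SEP]` token when truncating. The pad ID 0 matches the `padding_idx=0` shown in the model printouts.

```python
def pad_or_truncate(token_ids, max_length=300, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]  # cut sequences that are too long
    mask = [1] * len(ids)  # 1 marks a real token
    n_pad = max_length - len(ids)
    # Pad on the right; padded positions get attention mask 0.
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# 101 and 102 are [CLS] and [SEP] in the BERT uncased vocabulary;
# the two IDs in between are purely illustrative.
ids, mask = pad_or_truncate([101, 2054, 2003, 102], max_length=8)
print(ids)   # [101, 2054, 2003, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Because every example is padded to 300 tokens, all batches have the same shape, at the cost of computing over padding for short questions.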

Prepare the data loaders for the final inference-speed measurement. Each measurement uses a single record from the training split.

In [7]:
train_data_gpu = copy.deepcopy(train)
train_data_gpu.set_format(type="torch", columns=["input_ids", "attention_mask"], device="cuda")
gpu_data_loader = DataLoader(train_data_gpu, batch_size=1, shuffle=False)

train_data_cpu = copy.deepcopy(train)
train_data_cpu.set_format(type="torch", columns=["input_ids","attention_mask"], device="cpu")
cpu_data_loader = DataLoader(train_data_cpu, batch_size=1, shuffle=False)

In [8]:
base.reset_seed()

## Standard training on the original dataset

Load the pretrained model.

In [9]:
model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=50)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training configuration; the chosen parameters are derived from the five best runs of the hyperparameter search.

In [10]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/bert-base_fine", logging_dir=f"~/logs/{DATASET}/bert-base_fine", lr=0.0012, weight_decay=.01, warmup_steps=4, batch_size=128, epochs=20)
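
How `lr` and `warmup_steps` interact depends on the scheduler, which is set inside `base.get_training_args` and is not shown here. Assuming the Hugging Face default (linear warmup followed by linear decay to zero), the learning rate at a given optimizer step can be sketched as follows. The total of 700 steps is an estimate: the training output below reports 525 steps at epoch 15, that is 35 steps per epoch, times the 20-epoch cap.

```python
def linear_schedule_lr(step, base_lr=0.0012, warmup_steps=4, total_steps=700):
    """Learning rate under linear warmup and linear decay (assumed scheduler)."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 2, 4, 352, 700):
    print(step, round(linear_schedule_lr(step), 6))
```

With only 4 warmup steps the peak rate is reached almost immediately, so nearly the whole run is spent in the decay phase.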

Configure the trainer with a patience of 3 epochs.

In [155]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=eval,
    compute_metrics=base.compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)
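
The patience rule can be sketched without any framework. This is a simplified model of `EarlyStoppingCallback` that ignores its improvement threshold. The monitored metric is configured inside `base.get_training_args` and is not shown, so the hypothetical helper below takes a generic list of per-epoch scores.

```python
def stopping_epoch(scores, patience=3, greater_is_better=True):
    """Return the 1-based epoch at which training stops, or None if it never stops early."""
    best = None
    bad_epochs = 0
    for epoch, score in enumerate(scores, start=1):
        if best is None:
            improved = True
        elif greater_is_better:
            improved = score > best
        else:
            improved = score < best
        if improved:
            best = score
            bad_epochs = 0  # any improvement resets the counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


# Best score at epoch 3, then three epochs without a strict improvement.
print(stopping_epoch([0.50, 0.60, 0.65, 0.64, 0.65, 0.63]))  # 6
```

Note that a tie with the best score does not count as an improvement, so a plateau also consumes patience.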


Run training; the outputs below are computed on the validation split.

In [156]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,3.0723,2.295601,0.505041,0.161854,0.165593,0.14317
2,1.8313,1.538879,0.660862,0.273265,0.284365,0.253168
3,1.1582,1.24495,0.719523,0.345682,0.353773,0.332803
4,0.8002,1.237962,0.719523,0.379431,0.396922,0.364396
5,0.5678,1.113162,0.754354,0.489907,0.464247,0.454477
6,0.3979,1.110101,0.758937,0.55865,0.506712,0.51608
7,0.2716,1.075222,0.766269,0.566544,0.548188,0.540473
8,0.2028,1.189122,0.757104,0.603306,0.578967,0.576549
9,0.1494,1.138134,0.762603,0.623862,0.585633,0.588227
10,0.1149,1.139649,0.777269,0.694329,0.649343,0.651218


TrainOutput(global_step=525, training_loss=0.5915996124630882, metrics={'train_runtime': 79.107, 'train_samples_per_second': 1102.557, 'train_steps_per_second': 8.849, 'total_flos': 49425716214000.0, 'train_loss': 0.5915996124630882, 'epoch': 15.0})

Switch the model to evaluation mode.


In [157]:
model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-1): 2 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=128, out_features=128, bias=True)
              (key): Linear(in_features=128, out_features=128, bias=True)
              (value): Linear(in_features=128, out_features=128, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=128, out_features=128, bias=True)
              (LayerNorm): LayerNorm((128,), eps=1e-1


Evaluate the model on the test split.

In [158]:
trainer.evaluate(test)

{'eval_loss': 1.0590753555297852,
 'eval_accuracy': 0.796,
 'eval_precision': 0.6639781101834673,
 'eval_recall': 0.6490349427918719,
 'eval_f1': 0.6378382138608072,
 'eval_runtime': 3.4501,
 'eval_samples_per_second': 144.924,
 'eval_steps_per_second': 1.159,
 'epoch': 15.0}
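
Accuracy (0.796) is noticeably higher than precision, recall, and F1 (about 0.64 to 0.66). One explanation consistent with 50 unevenly populated classes is macro averaging, in which every class counts equally regardless of its size. `base.compute_metrics` is not shown, so this is an assumption; the sketch below only illustrates how macro averaging can produce such a gap.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data: the frequent class 0 is always right, the rare class 1 is always missed.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print([round(x, 3) for x in macro_scores(y_true, y_pred)])  # [0.9, 0.45, 0.5, 0.474]
```

In the toy data a single missed rare class halves the macro scores while accuracy stays at 0.9.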

Save the model.


In [None]:
torch.save(model.state_dict(), f"{os.path.expanduser('~')}/models/{DATASET}/bert-base_fine.pth")

In [138]:
base.reset_seed()

## Distillation training on the original dataset

Load the pretrained student model.

In [139]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=50)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Distillation training configuration; the chosen parameters are derived from the five best runs of the hyperparameter search.

In [140]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/bert-distill_fine", logging_dir=f"~/logs/{DATASET}/bert-distill_fine", remove_unused_columns=False, lr=0.0015, weight_decay=.003, warmup_steps=4, batch_size=128, epochs=20, temp=6, lambda_param=.4)

Configure the distillation trainer with a patience of 3 epochs.

In [141]:
trainer = base.DistilTrainer(
    student_model=student_model,
    args=training_args,
    train_dataset=train,
    eval_dataset=eval,
    compute_metrics=base.compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)
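
The loss computed inside `base.DistilTrainer` is not shown in this notebook. A common formulation combines the hard-label cross-entropy with a temperature-softened KL divergence between teacher and student, scaled by the squared temperature. The sketch below is written in plain Python so that the arithmetic is visible. It assumes that `lambda_param` weights the distillation term and `1 - lambda_param` the hard-label term; the actual trainer may use the opposite convention.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, temp=6.0, lambda_param=0.4):
    """Weighted sum of hard-label cross-entropy and softened KL divergence (assumed form)."""
    # Hard-label term: cross-entropy of the student at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    # Soft term: KL(teacher || student) at temperature temp, scaled by temp squared.
    p_teacher = softmax(teacher_logits, temp)
    p_student = softmax(student_logits, temp)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    soft = kl * temp ** 2
    return (1 - lambda_param) * hard + lambda_param * soft


# Hypothetical logits for a three-class problem.
print(distillation_loss([2.0, 0.0, -1.0], [3.0, 0.5, -2.0], label=0))
```

A higher temperature flattens both distributions, so the student also learns from the relative ordering of the classes the teacher considers unlikely.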

Run distillation training; the outputs below are computed on the validation split.

In [142]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,2.1418,1.637039,0.51879,0.153312,0.173378,0.142616
2,1.3074,1.139175,0.651696,0.246957,0.261602,0.236382
3,0.8984,0.960543,0.707608,0.288119,0.320861,0.298602
4,0.6579,0.88636,0.738772,0.383868,0.38007,0.367709
5,0.4969,0.833152,0.75802,0.490537,0.4516,0.44975
6,0.3923,0.811514,0.770852,0.537126,0.479521,0.490648
7,0.2912,0.806628,0.770852,0.553484,0.524728,0.524161
8,0.233,0.808993,0.778185,0.601631,0.569304,0.56971
9,0.1863,0.803415,0.778185,0.621739,0.5578,0.573217
10,0.1624,0.818579,0.779102,0.644498,0.615526,0.611536


TrainOutput(global_step=560, training_loss=0.4668802525315966, metrics={'train_runtime': 86.4269, 'train_samples_per_second': 1009.177, 'train_steps_per_second': 8.099, 'total_flos': 52720763961600.0, 'train_loss': 0.4668802525315966, 'epoch': 16.0})

Switch the student to evaluation mode.

In [143]:
student_model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-1): 2 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=128, out_features=128, bias=True)
              (key): Linear(in_features=128, out_features=128, bias=True)
              (value): Linear(in_features=128, out_features=128, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=128, out_features=128, bias=True)
              (LayerNorm): LayerNorm((128,), eps=1e-1


Evaluate the model on the test split.

In [144]:
trainer.evaluate(test)

{'eval_loss': 0.8118124008178711,
 'eval_accuracy': 0.766,
 'eval_precision': 0.6861004212477909,
 'eval_recall': 0.6483048743655664,
 'eval_f1': 0.6356455735630214,
 'eval_runtime': 3.3471,
 'eval_samples_per_second': 149.384,
 'eval_steps_per_second': 1.195,
 'epoch': 16.0}

Save the student model.

In [44]:
torch.save(student_model.state_dict(), f"{os.path.expanduser('~')}/models/{DATASET}/bert-distil_fine.pth")

In [11]:
base.reset_seed()

## Standard training on the augmented dataset

Load the pretrained model.

In [12]:
model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=50)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training configuration; the chosen parameters are derived from the five best runs of the hyperparameter search.


In [13]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/bert-base-aug_fine", logging_dir=f"~/logs/{DATASET}/bert-base-aug_fine", lr=0.00022, warmup_steps=25,  epochs=20)

Configure the trainer with a patience of 3 epochs.

In [14]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_aug,
    eval_dataset=eval,
    compute_metrics=base.compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)


Run training; the outputs below are computed on the validation split.

In [15]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,1.5787,1.058241,0.769019,0.45243,0.456731,0.445207
2,0.3346,0.990006,0.787351,0.602185,0.577015,0.569473
3,0.1247,1.047967,0.793767,0.722047,0.692194,0.692173
4,0.0603,1.13838,0.784601,0.705846,0.66196,0.671306
5,0.0353,1.131671,0.797434,0.761771,0.716732,0.719464
6,0.0252,1.209514,0.791934,0.776325,0.707774,0.72032
7,0.0179,1.261764,0.797434,0.789216,0.716466,0.727945
8,0.0137,1.24703,0.800183,0.759759,0.70883,0.717203
9,0.0111,1.292787,0.800183,0.794633,0.721876,0.735688
10,0.0077,1.360192,0.79835,0.792659,0.723839,0.737968


TrainOutput(global_step=6760, training_loss=0.17146615989109468, metrics={'train_runtime': 363.6762, 'train_samples_per_second': 3658.145, 'train_steps_per_second': 28.597, 'total_flos': 653378274385200.0, 'train_loss': 0.17146615989109468, 'epoch': 13.0})

Switch the model to evaluation mode.


In [16]:
model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-1): 2 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=128, out_features=128, bias=True)
              (key): Linear(in_features=128, out_features=128, bias=True)
              (value): Linear(in_features=128, out_features=128, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=128, out_features=128, bias=True)
              (LayerNorm): LayerNorm((128,), eps=1e-1


Evaluate the model on the test split.

In [17]:
trainer.evaluate(test)

{'eval_loss': 1.397212266921997,
 'eval_accuracy': 0.782,
 'eval_precision': 0.6775712982424679,
 'eval_recall': 0.6855010398941199,
 'eval_f1': 0.6550720722675537,
 'eval_runtime': 2.9465,
 'eval_samples_per_second': 169.692,
 'eval_steps_per_second': 1.358,
 'epoch': 13.0}

Save the model.


In [26]:
torch.save(model.state_dict(), f"{os.path.expanduser('~')}/models/{DATASET}/bert-base-aug_fine.pth")

In [27]:
base.reset_seed()

## Distillation training on the augmented dataset

Load the pretrained student model.

In [28]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=50)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Distillation training configuration; the chosen parameters are derived from the five best runs of the hyperparameter search.

In [29]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/bert-distill-aug_fine", logging_dir=f"~/logs/{DATASET}/bert-distill-aug_fine", remove_unused_columns=False, lr=0.00047, weight_decay=.007, warmup_steps=15, epochs=20, temp=4, lambda_param=.8)

Configure the distillation trainer with a patience of 3 epochs.

In [30]:
trainer = base.DistilTrainer(
    student_model=student_model,
    args=training_args,
    train_dataset=train_aug,
    eval_dataset=eval,
    compute_metrics=base.compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

Run distillation training; the outputs below are computed on the validation split.

In [31]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.5156,0.471725,0.779102,0.549351,0.514425,0.513555
2,0.1128,0.459124,0.786434,0.618439,0.601509,0.598734
3,0.0808,0.475293,0.780935,0.719573,0.626858,0.651858
4,0.0709,0.463248,0.796517,0.736075,0.654855,0.680492
5,0.0661,0.45625,0.793767,0.748872,0.664825,0.69125
6,0.0622,0.470117,0.7956,0.748254,0.668805,0.693948
7,0.06,0.481648,0.785518,0.760266,0.672599,0.70089
8,0.0587,0.460094,0.796517,0.749415,0.693793,0.709451
9,0.0572,0.471035,0.797434,0.785497,0.69248,0.720783
10,0.0561,0.473911,0.791934,0.774716,0.700257,0.72222


TrainOutput(global_step=10560, training_loss=0.08340635299682617, metrics={'train_runtime': 363.5742, 'train_samples_per_second': 3717.315, 'train_steps_per_second': 29.045, 'total_flos': 1021170128832000.0, 'train_loss': 0.08340635299682617, 'epoch': 20.0})

Switch the student to evaluation mode.

In [32]:
student_model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-1): 2 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=128, out_features=128, bias=True)
              (key): Linear(in_features=128, out_features=128, bias=True)
              (value): Linear(in_features=128, out_features=128, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=128, out_features=128, bias=True)
              (LayerNorm): LayerNorm((128,), eps=1e-1

Evaluate the student on the test split.

In [33]:
trainer.evaluate(test)

{'eval_loss': 0.41090407967567444,
 'eval_accuracy': 0.806,
 'eval_precision': 0.6752235121669131,
 'eval_recall': 0.6893776846211949,
 'eval_f1': 0.6622485946860568,
 'eval_runtime': 3.5522,
 'eval_samples_per_second': 140.757,
 'eval_steps_per_second': 1.126,
 'epoch': 20.0}

Save the student model.

In [34]:
torch.save(student_model.state_dict(), f"{os.path.expanduser('~')}/models/{DATASET}/bert-distil-aug_fine.pth")

Get the number of trainable parameters in the model.

In [None]:
base.count_parameters(student_model)
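
The output of this cell is not included, and the implementation of `base.count_parameters` is not shown. In PyTorch such a count is usually `sum(p.numel() for p in model.parameters() if p.requires_grad)`. As a cross-check, the number can also be derived by hand from the layer shapes in the model printouts above. The printouts are truncated before the feed-forward block, so the intermediate size of 512 (four times the hidden size) is an assumption, as is the premise that all parameters are trainable.

```python
def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


def bert_classifier_params(vocab=30522, hidden=128, max_pos=512, type_vocab=2,
                           layers=2, intermediate=512, num_labels=50):
    """Count parameters of a BERT sequence classifier from its layer shapes."""
    layer_norm = 2 * hidden  # scale and shift vectors
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Query, key, value, and attention output projections, plus LayerNorm.
    attention = 4 * linear(hidden, hidden) + layer_norm
    # Two dense layers of the feed-forward block, plus LayerNorm.
    feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
    encoder = layers * (attention + feed_forward)
    pooler = linear(hidden, hidden)
    classifier = linear(hidden, num_labels)
    return embeddings + encoder + pooler + classifier


print(bert_classifier_params())  # 4392370
```

Under these assumptions the embeddings account for roughly 90% of the parameters, so the size of the student is dominated by its vocabulary rather than by its two encoder layers.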

Measure inference speed on the CPU: 1000 runs with a single record each.

In [None]:
cpu_benchmark = base.BenchMarkRunner(student_model, cpu_data_loader, "cpu", 1000)
print(cpu_benchmark.run_benchmark())

Measure inference speed on the GPU: 1000 runs with a single record each.

In [None]:
gpu_benchmark = base.BenchMarkRunner(student_model, gpu_data_loader, "cuda", 1000)
print(gpu_benchmark.run_benchmark())
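
`base.BenchMarkRunner` is defined outside this notebook, and its outputs are not included here. The following is a minimal sketch of how such a latency benchmark is often structured; it is not the actual implementation. The function name, the warmup count, and the returned fields are hypothetical. `predict` stands for any callable that runs one forward pass, for example the model wrapped in `torch.no_grad()`.

```python
import statistics
import time


def run_benchmark(predict, sample, n_runs=1000, n_warmup=10):
    """Time repeated single-record predictions and report latency in milliseconds."""
    # Warmup runs are not timed; they absorb one-off costs such as lazy initialization.
    for _ in range(n_warmup):
        predict(sample)
    timings_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict(sample)
        # On CUDA, the GPU would have to be synchronized here before reading
        # the clock, because kernels are launched asynchronously.
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(timings_ms),
        "stdev_ms": statistics.stdev(timings_ms),
        "runs": n_runs,
    }


# Stand-in workload instead of a real model forward pass.
print(run_benchmark(lambda x: sum(x), list(range(1000)), n_runs=100))
```

Reporting the spread alongside the mean helps when comparing CPU and GPU, since single-record GPU latency is often dominated by launch overhead rather than by compute.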