# Hidden-state distillation training on the SST2 dataset with the BERT TINY model
This notebook trains BERT TINY on both the original and the augmented SST2 dataset, using a BERT model fine-tuned on the same dataset as the teacher. The distillation is performed on the hidden states.

The hyperparameter configuration builds on findings from the other notebooks that use this model and dataset. No hyperparameter search was run for hidden-state distillation, mainly because of the training time.

Training uses EarlyStopping to reduce overfitting, and performance metrics are collected on both the artificial and the official test split of the dataset.

## Library imports and basic setup

In [1]:
from transformers import BertForSequenceClassification, BertTokenizer, EarlyStoppingCallback
from datasets import load_from_disk
from torch.utils.data import DataLoader
import torch
import base
import os
import copy

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/jovyan/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


Resetting the random seed so that the results are reproducible.

In [2]:
base.reset_seed()
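
`base.reset_seed` is a helper from this project and its source is not shown here. As a minimal sketch, assuming it seeds the Python and PyTorch random number generators with a fixed value, it could look like the following. The function body and the default seed of 42 are assumptions, not the project's code.

```python
import random

import torch


def reset_seed(seed: int = 42) -> None:
    """Hypothetical stand-in for base.reset_seed; the default seed is an assumption."""
    random.seed(seed)                     # Python's built-in RNG
    torch.manual_seed(seed)               # PyTorch RNG
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)  # PyTorch RNGs on every visible GPU


# Re-seeding before each draw yields identical random tensors.
reset_seed()
first = torch.rand(3)
reset_seed()
second = torch.rand(3)
print(torch.equal(first, second))  # True
```

Calling the helper before each model initialisation, as this notebook does, keeps the newly initialised classifier weights identical across runs.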

Checking GPU availability.

In [4]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available and will be used:", torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("GPU is not available, using CPU.")

GPU is available and will be used: NVIDIA A100 80GB PCIe MIG 2g.20gb


In [None]:
DATASET = "sst2"

Loading the dataset and applying basic preprocessing.

Both the official and the artificial test split are used.

In [5]:
train = load_from_disk(f"~/data/{DATASET}/train-logits")
eval = load_from_disk(f"~/data/{DATASET}/eval-logits")
test = load_from_disk(f"~/data/{DATASET}/test-logits")

train_aug = load_from_disk(f"~/data/{DATASET}/train-logits-augmented")
test_blank = load_from_disk(f"~/data/{DATASET}/test-blank-logits")

In [6]:
tokenizer = BertTokenizer.from_pretrained("gchhablani/bert-base-cased-finetuned-sst2")

Tokenization, padding, and conversion to token IDs using the teacher's tokenizer.

In [7]:
train = train.map(lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length", return_tensors="pt", max_length=300), batched=True, desc="Tokenizing the train dataset")
eval = eval.map(lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length", return_tensors="pt", max_length=300), batched=True, desc="Tokenizing the eval dataset")
test = test.map(lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length", return_tensors="pt", max_length=300), batched=True, desc="Tokenizing the test dataset")

train_aug = train_aug.map(lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length", return_tensors="pt", max_length=300), batched=True, desc="Tokenizing the augmented dataset")
test_blank = test_blank.map(lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length", return_tensors="pt", max_length=300), batched=True, desc="Tokenizing the blank test dataset")
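
To make the effect of `truncation=True` and `padding="max_length"` concrete, here is a rough pure-Python sketch of what happens to one already-tokenized sequence. The helper `pad_or_truncate` is hypothetical. The real tokenizer additionally inserts special tokens, keeps the closing `[SEP]` when truncating, and returns `token_type_ids`. The padding ID of 0 follows the `padding_idx=0` shown in the model printouts.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Hypothetical helper: cut or pad a list of token IDs to exactly max_length."""
    ids = list(input_ids[:max_length])        # truncation
    attention_mask = [1] * len(ids)           # 1 marks real tokens
    padding = max_length - len(ids)
    ids += [pad_id] * padding                 # padding up to max_length
    attention_mask += [0] * padding           # 0 marks padding positions
    return ids, attention_mask


ids, mask = pad_or_truncate([101, 1188, 102], max_length=5)
print(ids)   # [101, 1188, 102, 0, 0]
print(mask)  # [1, 1, 1, 0, 0]
```

With `max_length=300`, every example in this notebook becomes a sequence of 300 IDs, which keeps batch shapes fixed at the cost of computing over padding positions.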

Preparing the data loaders for the final inference-speed measurement. The measurement uses a single record from the training split.

In [8]:
train_data_gpu = copy.deepcopy(train)
train_data_gpu.set_format(type="torch", columns=["input_ids", "attention_mask"], device="cuda")
gpu_data_loader = DataLoader(train_data_gpu, batch_size=1, shuffle=False)

train_data_cpu = copy.deepcopy(train)
train_data_cpu.set_format(type="torch", columns=["input_ids", "attention_mask"], device="cpu")
cpu_data_loader = DataLoader(train_data_cpu, batch_size=1, shuffle=False)

In [9]:
base.reset_seed()

## Hidden-state distillation training on the original dataset

Loading the teacher model.

In [10]:
teacher_model = BertForSequenceClassification.from_pretrained("gchhablani/bert-base-cased-finetuned-sst2", num_labels=2)
teacher_model.to(device)
teacher_model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [34]:
base.reset_seed()

Loading the pretrained student model.

In [35]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Configuring the distillation training. The parameters are extended with alpha, which sets the ratio between the hidden-state distillation loss and the other loss terms.

In [36]:
training_args = base.get_training_args(
    output_dir=f"~/results/{DATASET}/bert-distill-inner",
    logging_dir=f"~/logs/{DATASET}/bert-distill-inner",
    remove_unused_columns=False,
    lr=0.000047,
    weight_decay=0.07,
    epochs=20,
    temp=6,
    lambda_param=0.2,
    alpha_param=.75,
)
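
The loss used by `base.DistilTrainerInner` is not shown in this notebook. The sketch below shows one plausible way `temp`, `lambda_param` and `alpha_param` could be combined. It assumes that `lambda_param` weights the hard-label cross-entropy against the temperature-scaled soft-label term, that `alpha_param` weights the hidden-state term against both, and that a linear projection (hypothetical here) maps the 128-dimensional student states to the 768-dimensional teacher states.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                      labels, projection, temp=6.0, lambda_param=0.2, alpha_param=0.75):
    """Assumed combination of the three loss terms; not the project's actual code."""
    # Hard-label term: cross-entropy against the gold labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened distributions,
    # rescaled by temp**2 as in standard knowledge distillation.
    kd = F.kl_div(
        F.log_softmax(student_logits / temp, dim=-1),
        F.softmax(teacher_logits / temp, dim=-1),
        reduction="batchmean",
    ) * temp ** 2
    logit_loss = lambda_param * ce + (1.0 - lambda_param) * kd
    # Hidden-state term: MSE after projecting the student to the teacher width.
    hidden_loss = F.mse_loss(projection(student_hidden), teacher_hidden)
    return alpha_param * hidden_loss + (1.0 - alpha_param) * logit_loss


# Toy tensors: batch of 4, sequence length 10, widths 128 (student) and 768 (teacher).
torch.manual_seed(0)
projection = torch.nn.Linear(128, 768)
loss = distillation_loss(
    torch.randn(4, 2), torch.randn(4, 2),
    torch.randn(4, 10, 128), torch.randn(4, 10, 768),
    torch.tensor([0, 1, 0, 1]), projection,
)
print(loss.item())
```

Under this reading, `alpha_param=.75` would place most of the weight on matching the teacher's hidden states, while `lambda_param=0.2` would favour the teacher's soft labels over the gold labels within the remaining share.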

Configuring the distillation trainer with a patience of 3 epochs. This trainer variant also works with hidden states.

In [37]:
trainer = base.DistilTrainerInner(
    student_model=student_model,
    teacher_model=teacher_model,
    args=training_args,
    train_dataset=train,
    eval_dataset=eval,
    compute_metrics=base.compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

Running the distillation training. The reported outputs are computed on the validation split.

In [38]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.4649,0.549621,0.737385,0.739053,0.738076,0.737233
2,0.3379,0.590379,0.779817,0.779864,0.779953,0.779806
3,0.2637,0.614781,0.799312,0.799292,0.799392,0.799291
4,0.2243,0.721653,0.794725,0.797189,0.793835,0.793905
5,0.1984,0.700087,0.799312,0.800451,0.799855,0.799267
6,0.1829,0.736698,0.811927,0.812037,0.812116,0.811923
7,0.1706,0.807443,0.808486,0.808467,0.808569,0.808466
8,0.1629,0.788713,0.811927,0.812839,0.812411,0.811902
9,0.1546,0.820029,0.813073,0.813396,0.813368,0.813073
10,0.1479,0.858203,0.802752,0.804827,0.803486,0.802627


TrainOutput(global_step=5052, training_loss=0.21590178709415245, metrics={'train_runtime': 2096.564, 'train_samples_per_second': 513.974, 'train_steps_per_second': 4.016, 'total_flos': 481307141448000.0, 'train_loss': 0.21590178709415245, 'epoch': 12.0})

Switching the student to evaluation mode.

In [39]:
student_model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-1): 2 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=128, out_features=128, bias=True)
              (key): Linear(in_features=128, out_features=128, bias=True)
              (value): Linear(in_features=128, out_features=128, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=128, out_features=128, bias=True)
              (LayerNorm): LayerNorm((128,), eps=1e-1


Testing the model on the artificial test split of the dataset.

In [40]:
trainer.evaluate(test)

{'eval_loss': 0.35776326060295105,
 'eval_accuracy': 0.9060876020786934,
 'eval_precision': 0.904093922577215,
 'eval_recall': 0.9065983479595644,
 'eval_f1': 0.90513043598376,
 'eval_runtime': 5.3204,
 'eval_samples_per_second': 2531.757,
 'eval_steps_per_second': 19.923,
 'epoch': 12.0}

Saving the student model.

In [41]:
torch.save(student_model.state_dict(), f"{os.path.expanduser('~')}/models/{DATASET}/bert-distill-inner.pth")

Generating predictions on the official test split and exporting them for upload to the GLUE Benchmark.

In [42]:
test_blank.set_format(type="torch", columns=["input_ids", "attention_mask"], device="cuda")
test_blank_dataloader = DataLoader(test_blank, batch_size=128, shuffle=False)
test_blank_logits = base.generate_logits(test_blank_dataloader, student_model)
base.generate_real_test_file_sst2(test_blank_logits, f"{os.path.expanduser('~')}/data/{DATASET}/tiny-bert-distill-inner-test.tsv")

Generating logits for given dataset:   0%|          | 0/15 [00:00<?, ?it/s]

Created output file named: /home/jovyan/data/sst2/tiny-bert-distill-inner-test.tsv upload it to GLUE benchmark to obtain results!
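
The export helper `base.generate_real_test_file_sst2` is project code that is not shown here. As a minimal sketch, assuming it takes per-example logits, picks the highest-scoring class, and writes a tab-separated file with `index` and `prediction` columns, the core step could look like this. The function name `write_sst2_predictions` is hypothetical.

```python
import csv
import io


def write_sst2_predictions(logits, handle):
    """Hypothetical sketch: write argmax predictions as an index/prediction TSV."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["index", "prediction"])
    for index, row in enumerate(logits):
        # Index of the largest logit is the predicted class (0 or 1 for SST2).
        prediction = max(range(len(row)), key=row.__getitem__)
        writer.writerow([index, prediction])


buffer = io.StringIO()
write_sst2_predictions([[-1.2, 0.8], [2.1, -0.3]], buffer)
print(buffer.getvalue())
```

Passing an open file handle instead of the `StringIO` buffer would write the same content to disk.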


In [43]:
base.reset_seed()

## Hidden-state distillation training on the augmented dataset

Loading the pretrained student model.

In [44]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Configuring the distillation training. The parameters are extended with alpha, which sets the ratio between the hidden-state distillation loss and the other loss terms.

In [45]:
training_args = base.get_training_args(
    output_dir=f"~/results/{DATASET}/bert-distill-inner-aug",
    logging_dir=f"~/logs/{DATASET}/bert-distill-inner-aug",
    remove_unused_columns=False,
    lr=0.00005,
    weight_decay=0.08,
    epochs=20,
    temp=7,
    lambda_param=0,
    alpha_param=.5,
)

Configuring the distillation trainer with a patience of 3 epochs. This trainer variant also works with hidden states.

In [46]:
trainer = base.DistilTrainerInner(
    student_model=student_model,
    teacher_model=teacher_model,
    args=training_args,
    train_dataset=train_aug,
    eval_dataset=eval,
    compute_metrics=base.compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)]
)

Running the distillation training. The reported outputs are computed on the validation split.

In [47]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.2108,0.463851,0.801606,0.801924,0.801897,0.801605
2,0.1283,0.523375,0.801606,0.801566,0.801476,0.801511
3,0.1068,0.581596,0.799312,0.799304,0.799139,0.799195
4,0.0929,0.614697,0.788991,0.789363,0.788583,0.788706


TrainOutput(global_step=9180, training_loss=0.13469865753240315, metrics={'train_runtime': 3714.66, 'train_samples_per_second': 1580.958, 'train_steps_per_second': 12.356, 'total_flos': 874361091744000.0, 'train_loss': 0.13469865753240315, 'epoch': 4.0})
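
The stop after epoch 4 is consistent with a patience of 3 applied to validation accuracy, although which metric `base.get_training_args` actually monitors is not shown, so this is an assumption. The sketch below is a simplified model of patience-based early stopping, not the `EarlyStoppingCallback` implementation. It uses the accuracy column from the table above.

```python
def early_stopping_epoch(scores, patience):
    """Return the 1-based epoch at which training stops, or None if it never does.

    Simplified model: a score must strictly exceed the best one so far to count
    as an improvement; `patience` consecutive non-improvements trigger the stop.
    """
    best = None
    bad_epochs = 0
    for epoch, score in enumerate(scores, start=1):
        if best is None or score > best:
            best = score
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


accuracy = [0.801606, 0.801606, 0.799312, 0.788991]
print(early_stopping_epoch(accuracy, patience=3))  # 4
```

Epoch 2 ties with epoch 1 and therefore does not count as an improvement, so three non-improving evaluations have accumulated by epoch 4.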

Switching the student to evaluation mode.

In [48]:
student_model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-1): 2 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=128, out_features=128, bias=True)
              (key): Linear(in_features=128, out_features=128, bias=True)
              (value): Linear(in_features=128, out_features=128, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=128, out_features=128, bias=True)
              (LayerNorm): LayerNorm((128,), eps=1e-1

Testing the student on the artificial test split of the dataset.

In [49]:
trainer.evaluate(test)

{'eval_loss': 0.27749595046043396,
 'eval_accuracy': 0.8929472902746844,
 'eval_precision': 0.8912168610225423,
 'eval_recall': 0.8920526992997893,
 'eval_f1': 0.8916139549882979,
 'eval_runtime': 5.1127,
 'eval_samples_per_second': 2634.639,
 'eval_steps_per_second': 20.733,
 'epoch': 4.0}

Saving the student model.

In [50]:
torch.save(student_model.state_dict(), f"{os.path.expanduser('~')}/models/{DATASET}/bert-distill-inner-aug.pth")

Generating predictions on the official test split and exporting them for upload to the GLUE Benchmark.

In [51]:
test_blank_logits = base.generate_logits(test_blank_dataloader, student_model)
base.generate_real_test_file_sst2(test_blank_logits, f"{os.path.expanduser('~')}/data/{DATASET}/tiny-bert-base-inner-aug-test.tsv")

Generating logits for given dataset:   0%|          | 0/15 [00:00<?, ?it/s]

Created output file named: /home/jovyan/data/sst2/tiny-bert-base-inner-aug-test.tsv upload it to GLUE benchmark to obtain results!


Counting the trainable parameters of the model.

In [21]:
base.count_parameters(student_model)

model size: 16.740MB.
Total Trainable Params: 4386178.


Index,Modules,Parameters
0,bert.embeddings.word_embeddings.weight,3906816
1,bert.embeddings.position_embeddings.weight,65536
2,bert.embeddings.token_type_embeddings.weight,256
3,bert.embeddings.LayerNorm.weight,128
4,bert.embeddings.LayerNorm.bias,128
5,bert.encoder.layer.0.attention.self.query.weight,16384
6,bert.encoder.layer.0.attention.self.query.bias,128
7,bert.encoder.layer.0.attention.self.key.weight,16384
8,bert.encoder.layer.0.attention.self.key.bias,128
9,bert.encoder.layer.0.attention.self.value.weight,16384
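
The reported total can be cross-checked by hand from the architecture. The sketch below assumes a feed-forward intermediate size of 512 (four times the hidden size) and a pooler layer. Neither is visible in the truncated printouts above, so both are assumptions about the checkpoint.

```python
def bert_classifier_param_count(vocab=30522, hidden=128, layers=2, max_pos=512,
                                type_vocab=2, intermediate=512, num_labels=2):
    """Count weights and biases of a BERT sequence classifier with these sizes."""
    layer_norm = 2 * hidden  # weight and bias
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, plus the attention LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers of the feed-forward block, plus its LayerNorm.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    pooler = hidden * hidden + hidden
    classifier = hidden * num_labels + num_labels
    return embeddings + layers * (attention + feed_forward) + pooler + classifier


print(bert_classifier_param_count())  # 4386178
```

The word embeddings alone account for 3906816 of these parameters, so most of the model's size comes from the vocabulary rather than from the two encoder layers.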


Measuring inference speed on the CPU: 1000 runs with a single record.

In [22]:
cpu_benchmark = base.BenchMarkRunner(student_model, cpu_data_loader, "cpu", 1000)
print(cpu_benchmark.run_benchmark())

<torch.utils.benchmark.utils.common.Measurement object at 0x74156e73b940>
self.infer_speed_comp()
  3.66 ms
  1 measurement, 1000 runs , 4 threads
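
The printed `Measurement` object suggests that `base.BenchMarkRunner` wraps `torch.utils.benchmark`; its internals are not shown here. A minimal sketch of such a timing, using a small stand-in linear layer instead of the distilled student, could look like this. The `infer` function and the toy model are assumptions made for illustration.

```python
import torch
from torch.utils import benchmark


def infer(model, batch):
    # Forward pass without gradient tracking, as is usual when timing inference.
    with torch.no_grad():
        return model(batch)


# Stand-in model and input; the notebook times the distilled student instead.
model = torch.nn.Linear(128, 2).eval()
batch = torch.randn(1, 128)

timer = benchmark.Timer(
    stmt="infer(model, batch)",
    globals={"infer": infer, "model": model, "batch": batch},
    num_threads=4,  # matches the "4 threads" reported in the output above
)
measurement = timer.timeit(100)  # the notebook uses 1000 runs
print(measurement)
```

The returned object exposes the average time per run through `measurement.mean`, in seconds.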


Measuring inference speed on the GPU: 1000 runs with a single record.

In [23]:
gpu_benchmark = base.BenchMarkRunner(student_model, gpu_data_loader, "cuda", 1000)
print(gpu_benchmark.run_benchmark())

<torch.utils.benchmark.utils.common.Measurement object at 0x7416d9d10550>
self.infer_speed_comp()
  2.30 ms
  1 measurement, 1000 runs , 4 threads
