# Comparison of precomputed logits and logits inferred during training for SST2

This notebook compares the two approaches on the SST2 dataset. Both dataset variants (augmented and default) are evaluated with both student models (BiLSTM and BERT TINY).

Training is set to 5 epochs with default hyperparameters; the quantity of interest is how long it takes.

## Library imports

In [1]:
from datasets import concatenate_datasets, load_from_disk
from transformers import BasicTokenizer, BertForSequenceClassification, BertTokenizer
import kagglehub
import torch
import base

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/jovyan/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


Loading the embeddings.

Loading the dataset and its basic preprocessing (tokenization, building the vocabulary of all tokens, building the index for the GloVe embeddings).

In [2]:
my_glove = kagglehub.dataset_download("thanakomsn/glove6b300dtxt")
print(my_glove)

/home/jovyan/.cache/kagglehub/datasets/thanakomsn/glove6b300dtxt/versions/1


In [None]:
GLOVE_FILE = f"{my_glove}/glove.6B.300d.txt"
DATASET = "sst2"

In [4]:
train_data = load_from_disk(f"~/data/{DATASET}/train-logits")
eval_data = load_from_disk(f"~/data/{DATASET}/eval-logits")
test_data = load_from_disk(f"~/data/{DATASET}/test-logits")

all_train_data = load_from_disk(f"~/data/{DATASET}/train-logits-augmented")

all_data = concatenate_datasets([load_from_disk(file) for file in [f"~/data/{DATASET}/eval-logits", f"~/data/{DATASET}/test-logits", f"~/data/{DATASET}/train-logits-augmented"]])
tokenizer = BasicTokenizer(do_lower_case=True)
teacher_tokenizer = BertTokenizer.from_pretrained("gchhablani/bert-base-cased-finetuned-sst2")

Checking GPU availability.

In [5]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available and will be used:", torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("GPU is not available, using CPU.")

GPU is available and will be used: NVIDIA A100 80GB PCIe MIG 2g.20gb


Tokenization.

In [6]:
train_data_tokens = list(map(lambda e: tokenizer.tokenize(e["sentence"]), train_data))
eval_data_tokens = list(map(lambda e: tokenizer.tokenize(e["sentence"]), eval_data))
test_data_tokens = list(map(lambda e: tokenizer.tokenize(e["sentence"]), test_data))

all_train_data_tokens = list(map(lambda e: tokenizer.tokenize(e["sentence"]), all_train_data))

all_data_tokens = list(map(lambda e: tokenizer.tokenize(e["sentence"]), all_data))

Collecting all unique tokens in the dataset.

In [7]:
vocab = base.get_vocab(all_data_tokens)

Assigning an index to each token.

In [8]:
word_index = dict(zip(vocab, range(len(vocab))))
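`base.get_vocab` is not shown in this notebook. A minimal sketch of what the two steps above might look like on toy data, assuming the vocabulary is simply the sorted set of all tokens (`get_vocab_sketch` is a hypothetical stand-in, and the real helper may order tokens differently):

```python
from typing import Dict, Iterable, List


def get_vocab_sketch(tokenized: Iterable[List[str]]) -> List[str]:
    """Hypothetical stand-in for base.get_vocab: unique tokens in a stable order."""
    unique = {token for sentence in tokenized for token in sentence}
    return sorted(unique)


sentences = [["a", "charming", "film"], ["a", "dull", "film"]]
vocab_sketch = get_vocab_sketch(sentences)
# Same construction as in the notebook: token -> position in the vocabulary.
word_index_sketch: Dict[str, int] = dict(zip(vocab_sketch, range(len(vocab_sketch))))
```

Sorting makes the index reproducible across runs, which matters when the embedding matrix is built from the same mapping.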

Loading the word-to-vector index from the GloVe embeddings.

In [9]:
embeddings_index = base.get_embeddings_indeces(GLOVE_FILE)

Found 400000 word vectors.
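The GloVe text file stores one word per line, followed by its space-separated vector components. The sketch below parses that layout into a dictionary; `read_glove_sketch` is a hypothetical stand-in for `base.get_embeddings_indeces`, whose actual implementation is not shown, and it reads from an in-memory file so that it runs without the real download.

```python
import io
from typing import Dict, Iterable, List


def read_glove_sketch(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Hypothetical stand-in for base.get_embeddings_indeces: word -> vector."""
    index: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        index[parts[0]] = [float(x) for x in parts[1:]]
    return index


# Two fake 3-dimensional entries in the GloVe text layout: "<word> <v1> <v2> ..."
fake_file = io.StringIO("film 0.1 0.2 0.3\ngood -0.5 0.0 0.5\n")
glove_sketch = read_glove_sketch(fake_file)
print(f"Found {len(glove_sketch)} word vectors.")
```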


Defining the vocabulary size and the embedding dimension.

In [10]:
print(len(vocab))
num_tokens = len(vocab) + 2
embedding_dim = 300

14621


Linking the tokens (via their indices) to the embeddings. Some tokens are not found in GloVe (316 of 14,621 in the output below), which is not a problem.

In [11]:
embedding_matrix = base.get_embedding_matrix(num_tokens, embedding_dim, word_index, embeddings_index)

Converted 14305 words (316) misses
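The reported counts add up to the vocabulary size (14305 + 316 = 14621). A sketch of how such a matrix can be assembled, assuming tokens without a GloVe vector keep an all-zero row; `build_embedding_matrix_sketch` is a hypothetical stand-in for `base.get_embedding_matrix`, and plain lists are used here where the real helper presumably returns an array:

```python
from typing import Dict, List, Tuple


def build_embedding_matrix_sketch(
    num_tokens: int,
    embedding_dim: int,
    word_index: Dict[str, int],
    embeddings_index: Dict[str, List[float]],
) -> Tuple[List[List[float]], int, int]:
    """Hypothetical stand-in for base.get_embedding_matrix using plain lists."""
    matrix = [[0.0] * embedding_dim for _ in range(num_tokens)]
    hits = misses = 0
    for word, row in word_index.items():
        vector = embeddings_index.get(word)
        if vector is None:
            misses += 1  # assumption: the row stays all-zero for unknown tokens
        else:
            matrix[row] = list(vector)
            hits += 1
    return matrix, hits, misses


matrix, hits, misses = build_embedding_matrix_sketch(
    num_tokens=4,  # two vocabulary entries plus two spare rows, mirroring len(vocab) + 2
    embedding_dim=3,
    word_index={"film": 0, "zzzunknown": 1},
    embeddings_index={"film": [0.1, 0.2, 0.3]},
)
print(f"Converted {hits} words ({misses}) misses")
```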


Mapping the tokens in each part of the dataset to their indices.

In [12]:
train_data_index = list(map(lambda x: list(map(lambda y: word_index[y], x)),train_data_tokens))
eval_data_index = list(map(lambda x: list(map(lambda y: word_index[y], x)),eval_data_tokens))
test_data_index = list(map(lambda x: list(map(lambda y: word_index[y], x)),test_data_tokens))

all_train_data_index = list(map(lambda x: list(map(lambda y: word_index[y], x)),all_train_data_tokens))

Padding all records to the same length.

In [13]:
train_padded_data = list(map(lambda x: base.padd(x,60), train_data_index))
eval_padded_data = list(map(lambda x: base.padd(x,60), eval_data_index))
test_padded_data = list(map(lambda x: base.padd(x,60), test_data_index))

all_train_padded_data = list(map(lambda x: base.padd(x,60), all_train_data_index))
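`base.padd` is not shown either. One plausible reading of the call `base.padd(x, 60)` is "truncate to 60 IDs or pad on the right up to 60 IDs". The sketch below follows that reading; the padding side and the padding ID used by the real helper are unknown, so `pad_id` is an explicit, hypothetical parameter here.

```python
from typing import List


def pad_sketch(ids: List[int], max_len: int, pad_id: int) -> List[int]:
    """Hypothetical stand-in for base.padd: truncate or right-pad to max_len."""
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))


short = pad_sketch([5, 7, 9], 5, pad_id=99)  # padded on the right
long = pad_sketch(list(range(10)), 4, pad_id=99)  # truncated to the first 4 IDs
```

Fixed-length rows are what allow the padded lists to be added to the dataset as a regular `input_ids` column and batched without a custom collator.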

Obtaining token IDs and attention masks for the BERT model as well.

In [14]:
train_teacher_data = base.prepare_dataset_teacher(train_data, teacher_tokenizer)
eval_teacher_data = base.prepare_dataset_teacher(eval_data, teacher_tokenizer)
test_teacher_data = base.prepare_dataset_teacher(test_data, teacher_tokenizer)

all_train_teacher_data = base.prepare_dataset_teacher(all_train_data, teacher_tokenizer)

Tokenizing the provided dataset:   0%|          | 0/53879 [00:00<?, ? examples/s]

Tokenizing the provided dataset:   0%|          | 0/872 [00:00<?, ? examples/s]

Tokenizing the provided dataset:   0%|          | 0/13470 [00:00<?, ? examples/s]

Tokenizing the provided dataset:   0%|          | 0/293636 [00:00<?, ? examples/s]

Adding token IDs to each part of the dataset. IDs are added for both the GloVe-based model and the BERT model.

In [15]:
train_data = train_data.add_column("input_ids", train_padded_data)
train_data = train_data.add_column("teacher_ids", train_teacher_data[0])
train_data = train_data.add_column("teacher_attention", train_teacher_data[1])

eval_data = eval_data.add_column("input_ids", eval_padded_data)
eval_data = eval_data.add_column("teacher_ids", eval_teacher_data[0])
eval_data = eval_data.add_column("teacher_attention", eval_teacher_data[1])

test_data = test_data.add_column("input_ids", test_padded_data)
test_data = test_data.add_column("teacher_ids", test_teacher_data[0])
test_data = test_data.add_column("teacher_attention", test_teacher_data[1])

all_train_data = all_train_data.add_column("input_ids", all_train_padded_data)
all_train_data = all_train_data.add_column("teacher_ids", all_train_teacher_data[0])
all_train_data = all_train_data.add_column("teacher_attention", all_train_teacher_data[1])

## BiLSTM
### Non-augmented dataset
#### Precomputed logits

Obtaining the student model with the predefined embedding layer.

In [16]:
student_model = base.BiLSTMClassifier(embedding_matrix=embedding_matrix, embedding_dim=embedding_dim, fc_dim=400, hidden_dim=300, output_dim=2)

Training definition.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bilstm-distill", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bilstm-distill")

In [18]:
base.reset_seed()

Selecting the required dataset columns.

In [19]:
train_data.set_format(type="torch", columns=["input_ids", "logits", "labels"], device="cpu")
eval_data.set_format(type="torch", columns=["input_ids", "logits", "labels"], device="cpu")

In [20]:
trainer = base.DistilTrainer(
    student_model=student_model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)
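The loss that `base.DistilTrainer` optimizes is not shown in this notebook. As a rough orientation only, the sketch below implements one common distillation objective (soft-target KL divergence combined with hard-label cross-entropy) for a single example. The function name, the `temperature` and `alpha` values, and the choice of objective itself are assumptions, and plain Python floats are used for readability where a trainer would work on batched tensors.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss_sketch(
    student_logits: List[float],
    teacher_logits: List[float],
    label: int,
    temperature: float = 2.0,  # assumed value, not taken from the notebook
    alpha: float = 0.5,  # assumed weighting, not taken from the notebook
) -> float:
    """Loss for one example: soft-target KL term plus hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on temperature-softened distributions.
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0.0)
    # Standard cross-entropy of the student against the gold label.
    cross_entropy = -math.log(softmax(student_logits)[label])
    return alpha * temperature**2 * kl + (1.0 - alpha) * cross_entropy


example_loss = distillation_loss_sketch([1.0, -1.0], [2.0, -2.0], label=0)
```

With precomputed logits, `teacher_logits` would come from the `logits` column of the dataset; with on-the-fly inference it would come from a teacher forward pass in each training step.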

In [21]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,2.6229,1.816021,0.770642,0.770595,0.770481,0.770521
2,1.8364,1.713443,0.780963,0.78094,0.780784,0.780836
3,1.6839,1.676934,0.776376,0.779433,0.77729,0.776093
4,1.6249,1.708686,0.772936,0.77661,0.771807,0.771628
5,1.5922,1.632575,0.784404,0.784402,0.784205,0.784266


TrainOutput(global_step=2105, training_loss=1.872066282150015, metrics={'train_runtime': 70.6796, 'train_samples_per_second': 3811.496, 'train_steps_per_second': 29.782, 'total_flos': 0.0, 'train_loss': 1.872066282150015, 'epoch': 5.0})
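The columns of the table above come from `base.compute_metrics`, whose implementation is not shown. A minimal sketch of how such metrics can be computed, assuming macro averaging over the classes (the averaging mode and the name `macro_metrics_sketch` are assumptions):

```python
from typing import Dict, List


def macro_metrics_sketch(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall and F1 (assumed averaging)."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels and predictions, not data from this notebook.
metrics = macro_metrics_sketch([1, 1, 0, 0], [1, 0, 0, 0])
```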

In [22]:
base.reset_seed()

#### Logits obtained by inference
Obtaining the student model with the predefined embedding layer.

Obtaining the pretrained teacher model, which infers the logits on the fly during training.

In [23]:
student_model = base.BiLSTMClassifier(embedding_matrix=embedding_matrix, embedding_dim=embedding_dim, fc_dim=400, hidden_dim=300, output_dim=2)
teacher_model = BertForSequenceClassification.from_pretrained("gchhablani/bert-base-cased-finetuned-sst2", num_labels=2)
teacher_model.to(device)
teacher_model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

Training definition.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bilstm-distill_infer", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bilstm-distill_infer")

In [25]:
base.reset_seed()

Resetting the dataset format so that all columns are returned again.

In [26]:
train_data.reset_format()
eval_data.reset_format()   

In [27]:
trainer = base.DistilTrainerInferText(
    student_model=student_model,
    teacher_model=teacher_model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [28]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,2.6654,1.821222,0.766055,0.766077,0.765808,0.765878
2,1.8441,1.729935,0.774083,0.774203,0.773775,0.773866
3,1.6863,1.672585,0.775229,0.776396,0.775785,0.775171
4,1.6279,1.722627,0.779817,0.78426,0.778606,0.778388
5,1.5947,1.64673,0.779817,0.779985,0.77949,0.779589


TrainOutput(global_step=2105, training_loss=1.8836917831892073, metrics={'train_runtime': 185.2832, 'train_samples_per_second': 1453.964, 'train_steps_per_second': 11.361, 'total_flos': 0.0, 'train_loss': 1.8836917831892073, 'epoch': 5.0})
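The two BiLSTM runs above differ mainly in where the teacher's forward passes happen. The toy sketch below counts those passes for both strategies; `CountingTeacher` and both training functions are hypothetical stand-ins, not the `base` trainers. Over `E` epochs and `N` examples, inference during training needs `E * N` teacher passes, while precomputing needs `N`.

```python
from typing import Callable, Dict, List

Logits = List[float]


class CountingTeacher:
    """Toy stand-in for the teacher that counts the forward passes it serves."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, example: str) -> Logits:
        self.calls += 1
        score = float(len(example))  # arbitrary deterministic "logits"
        return [-score, score]


def train_with_inference(data: List[str], teacher: Callable[[str], Logits], epochs: int) -> None:
    for _ in range(epochs):
        for example in data:
            _ = teacher(example)  # teacher forward pass inside the training loop


def train_with_precomputed(data: List[str], teacher: Callable[[str], Logits], epochs: int) -> None:
    # One teacher pass per example, done once before training starts.
    cache: Dict[str, Logits] = {example: teacher(example) for example in data}
    for _ in range(epochs):
        for example in data:
            _ = cache[example]  # lookup only, no teacher forward pass


data = ["good", "bad film", "fine"]
online, offline = CountingTeacher(), CountingTeacher()
train_with_inference(data, online, epochs=5)
train_with_precomputed(data, offline, epochs=5)
print(online.calls, offline.calls)
```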

### Augmented dataset
#### Precomputed logits

Obtaining the student model with the predefined embedding layer.

In [29]:
student_model = base.BiLSTMClassifier(embedding_matrix=embedding_matrix, embedding_dim=embedding_dim, fc_dim=400, hidden_dim=300, output_dim=2)

Training definition.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bilstm-distill_aug", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bilstm-distill_aug")

In [31]:
base.reset_seed()

Selecting the required dataset columns.

In [32]:
all_train_data.set_format(type="torch", columns=["input_ids", "logits", "labels"], device="cpu")
eval_data.set_format(type="torch", columns=["input_ids", "logits", "labels"], device="cpu")

In [33]:
trainer = base.DistilTrainer(
    student_model=student_model,
    args=training_args,
    train_dataset= all_train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [34]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,1.638,1.623549,0.790138,0.794219,0.791183,0.789759
2,1.2833,1.497001,0.799312,0.799481,0.799013,0.799119
3,1.1944,1.430632,0.799312,0.800512,0.798676,0.798823
4,1.1311,1.432064,0.81078,0.813015,0.811537,0.810648
5,1.0932,1.363312,0.809633,0.809713,0.809401,0.809489


TrainOutput(global_step=11475, training_loss=1.2679918981481482, metrics={'train_runtime': 157.1455, 'train_samples_per_second': 9342.808, 'train_steps_per_second': 73.022, 'total_flos': 0.0, 'train_loss': 1.2679918981481482, 'epoch': 5.0})

#### Logits obtained by inference

Obtaining the student model with the predefined embedding layer.

In [35]:
student_model = base.BiLSTMClassifier(embedding_matrix=embedding_matrix, embedding_dim=embedding_dim, fc_dim=400, hidden_dim=300, output_dim=2)

Training definition.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bilstm-distill_aug_infer", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bilstm-distill_aug_infer")

In [37]:
base.reset_seed()

Resetting the dataset format so that all columns are returned again.

In [38]:
all_train_data.reset_format()
eval_data.reset_format()   

In [39]:
trainer = base.DistilTrainerInferText(
    student_model=student_model,
    teacher_model=teacher_model,
    args=training_args,
    train_dataset=all_train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [40]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,1.638,1.62355,0.790138,0.794219,0.791183,0.789759
2,1.2834,1.498093,0.800459,0.800671,0.800139,0.800253
3,1.1947,1.430712,0.800459,0.801577,0.799844,0.799995
4,1.1314,1.43205,0.81078,0.813015,0.811537,0.810648
5,1.0934,1.363423,0.81078,0.810832,0.810569,0.810648


TrainOutput(global_step=11475, training_loss=1.268160871119281, metrics={'train_runtime': 838.5458, 'train_samples_per_second': 1750.864, 'train_steps_per_second': 13.684, 'total_flos': 0.0, 'train_loss': 1.268160871119281, 'epoch': 5.0})

## BERT TINY
### Non-augmented dataset
#### Precomputed logits

Obtaining the student model.

In [41]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training definition.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bert-distill", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bert-distill")

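Configuring the dataset columns: the GloVe-based `input_ids` are dropped and the teacher's token IDs and attention mask take their place.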

In [43]:
train_data = train_data.remove_columns(["input_ids"])
train_data = train_data.rename_column("teacher_attention", "attention_mask")
train_data = train_data.rename_column("teacher_ids", "input_ids")

eval_data = eval_data.remove_columns(["input_ids"])
eval_data = eval_data.rename_column("teacher_attention", "attention_mask")
eval_data = eval_data.rename_column("teacher_ids", "input_ids")

train_data.set_format(type="torch", columns=["input_ids", "attention_mask", "logits", "labels"], device="cpu")
eval_data.set_format(type="torch", columns=["input_ids", "attention_mask", "logits", "labels"], device="cpu")

In [44]:
base.reset_seed()

In [45]:
trainer = base.DistilTrainer(
    student_model=student_model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [46]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,3.0963,2.423593,0.708716,0.708954,0.708954,0.708716
2,2.3215,2.040582,0.748853,0.74878,0.748705,0.748734
3,1.8838,1.929957,0.768349,0.768331,0.768144,0.768201
4,1.6455,1.87081,0.780963,0.781612,0.781374,0.780949
5,1.5398,1.864941,0.78555,0.786657,0.786089,0.785503


TrainOutput(global_step=2105, training_loss=2.097377952883759, metrics={'train_runtime': 74.1801, 'train_samples_per_second': 3631.636, 'train_steps_per_second': 28.377, 'total_flos': 40108928454000.0, 'train_loss': 2.097377952883759, 'epoch': 5.0})

In [47]:
base.reset_seed()

#### Logits obtained by inference
Obtaining the student model.

In [48]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training definition.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bert-distill_infer", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bert-distill_infer")

In [50]:
trainer = base.DistilTrainerInfer(
    student_model=student_model,
    teacher_model=teacher_model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [51]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,3.1217,2.407446,0.707569,0.707609,0.707197,0.707245
2,2.3791,2.050555,0.756881,0.758588,0.756041,0.756013
3,1.9315,1.906719,0.766055,0.767655,0.765261,0.765283
4,1.6923,1.880664,0.771789,0.775295,0.770681,0.770514
5,1.5813,1.83817,0.775229,0.776356,0.774564,0.774656


TrainOutput(global_step=2105, training_loss=2.1411750702846644, metrics={'train_runtime': 189.8196, 'train_samples_per_second': 1419.216, 'train_steps_per_second': 11.089, 'total_flos': 40108928454000.0, 'train_loss': 2.1411750702846644, 'epoch': 5.0})

### Augmented dataset
#### Precomputed logits
Obtaining the student model.

In [52]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training definition.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bert-distill_aug", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bert-distill_aug")

In [54]:
base.reset_seed()

Selecting the required dataset columns.

In [55]:
all_train_data = all_train_data.remove_columns(["input_ids"])
all_train_data = all_train_data.rename_column("teacher_attention", "attention_mask")
all_train_data = all_train_data.rename_column("teacher_ids", "input_ids")

all_train_data.set_format(type="torch", columns=["input_ids", "attention_mask", "logits", "labels"], device="cpu")

In [56]:
trainer = base.DistilTrainer(
    student_model=student_model,
    args=training_args,
    train_dataset=all_train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [57]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,1.6723,1.544953,0.818807,0.819186,0.819125,0.818806
2,0.8343,1.557289,0.817661,0.817739,0.817831,0.817655
3,0.6711,1.601203,0.813073,0.813504,0.81341,0.813071
4,0.6002,1.585576,0.811927,0.811927,0.812032,0.811911
5,0.5632,1.610687,0.811927,0.811975,0.812074,0.811918


TrainOutput(global_step=11475, training_loss=0.8682164862472767, metrics={'train_runtime': 175.2297, 'train_samples_per_second': 8378.603, 'train_steps_per_second': 65.485, 'total_flos': 218590272936000.0, 'train_loss': 0.8682164862472767, 'epoch': 5.0})

In [58]:
base.reset_seed()

#### Logits obtained by inference
Obtaining the student model.

In [59]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training definition.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bert-distill_aug_infer", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bert-distill_aug_infer")

In [61]:
trainer = base.DistilTrainerInfer(
    student_model=student_model,
    teacher_model=teacher_model,
    args=training_args,
    train_dataset=all_train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [62]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,1.6822,1.533957,0.805046,0.80498,0.80498,0.80498
2,0.8295,1.532401,0.811927,0.81208,0.811653,0.811759
3,0.6699,1.564467,0.81078,0.810723,0.810695,0.810708
4,0.6035,1.596093,0.807339,0.807271,0.807317,0.80729
5,0.5666,1.624046,0.807339,0.807291,0.807232,0.807257


TrainOutput(global_step=11475, training_loss=0.8703437436172385, metrics={'train_runtime': 809.3575, 'train_samples_per_second': 1814.007, 'train_steps_per_second': 14.178, 'total_flos': 218590272936000.0, 'train_loss': 0.8703437436172385, 'epoch': 5.0})
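Since training time is the quantity of interest, the `train_runtime` values reported in the `TrainOutput` lines of this notebook can be put side by side. The sketch below only does that arithmetic; note that the runtimes for precomputed logits do not include the one-off cost of producing those logits, which is not reported here.

```python
# train_runtime values (seconds) copied from the TrainOutput lines in this notebook,
# stored as (precomputed logits, logits inferred during training).
runtimes = {
    ("BiLSTM", "original"): (70.6796, 185.2832),
    ("BiLSTM", "augmented"): (157.1455, 838.5458),
    ("BERT TINY", "original"): (74.1801, 189.8196),
    ("BERT TINY", "augmented"): (175.2297, 809.3575),
}

# Ratio of the on-the-fly runtime to the precomputed runtime per configuration.
slowdown = {key: infer / pre for key, (pre, infer) in runtimes.items()}
for (model, dataset), ratio in slowdown.items():
    print(f"{model:9s} {dataset:9s} inference / precomputed = {ratio:.2f}x")
```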