# Comparison of precomputed logits and logits inferred during training for TREC (fine)

This notebook compares the two approaches on the TREC (fine) dataset. It evaluates both dataset variants (augmented and original) with both student models (BiLSTM and BERT TINY).

Training is set to 5 epochs with default hyperparameters; the quantity of interest is the training time.
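
To make the difference between the two approaches concrete, the sketch below contrasts them with a stand-in teacher. `toy_teacher` and both loops are hypothetical illustrations, not the implementation of `base.DistilTrainer` or `base.DistilTrainerInferText`. The only point is that precomputation calls the teacher once per example, while inference during training calls it once per example in every epoch.

```python
# Counter for how often the (hypothetical) teacher is invoked.
calls = {"teacher": 0}

def toy_teacher(sentence):
    """Hypothetical stand-in for a teacher forward pass; returns fake logits."""
    calls["teacher"] += 1
    return [float(len(sentence)), 0.0]

sentences = ["what is x ?", "who wrote y ?", "where is z ?"]
EPOCHS = 5

# Approach 1: precompute the logits once and store them next to the data.
cached = [{"sentence": s, "logits": toy_teacher(s)} for s in sentences]
for _ in range(EPOCHS):
    for row in cached:
        _ = row["logits"]  # a student step would read the cached logits here
precomputed_calls = calls["teacher"]

# Approach 2: query the teacher inside the training loop.
calls["teacher"] = 0
for _ in range(EPOCHS):
    for s in sentences:
        _ = toy_teacher(s)  # a student step would use fresh logits here
inference_calls = calls["teacher"]

print(precomputed_calls, inference_calls)  # 3 15
```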

## Library imports

In [1]:
from datasets import concatenate_datasets, load_from_disk
from transformers import BasicTokenizer, BertForSequenceClassification, AutoConfig, BertTokenizer
import kagglehub
import torch
import base
import os

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/jovyan/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


Loading the embeddings.

Loading the dataset and its basic preprocessing (tokenization, building a vocabulary of all tokens, building an index for the GloVe embeddings).

In [2]:
my_glove = kagglehub.dataset_download("thanakomsn/glove6b300dtxt")
print(my_glove)

/home/jovyan/.cache/kagglehub/datasets/thanakomsn/glove6b300dtxt/versions/1


In [3]:
GLOVE_FILE = f"{my_glove}/glove.6B.300d.txt"
DATASET = "trec"

In [45]:
train_data = load_from_disk(f"~/data/{DATASET}/train-logits_fine")
eval_data = load_from_disk(f"~/data/{DATASET}/eval-logits_fine")
test_data = load_from_disk(f"~/data/{DATASET}/test-logits_fine")

all_train_data = load_from_disk(f"~/data/{DATASET}/train-logits-augmented_fine")

all_data = concatenate_datasets([load_from_disk(file) for file in [f"~/data/{DATASET}/eval-logits_fine", f"~/data/{DATASET}/test-logits_fine", f"~/data/{DATASET}/train-logits-augmented_fine"]])
tokenizer = BasicTokenizer(do_lower_case=True)
teacher_tokenizer = BertTokenizer.from_pretrained("carrassi-ni/bert-base-trec-question-classification")

Checking GPU availability.

In [46]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available and will be used:", torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("GPU is not available, using CPU.")

GPU is available and will be used: NVIDIA A100 80GB PCIe MIG 2g.20gb


Tokenization.

In [47]:
train_data_tokens = list(map(lambda e: tokenizer.tokenize(e["sentence"]), train_data))
eval_data_tokens = list(map(lambda e: tokenizer.tokenize(e["sentence"]), eval_data))
test_data_tokens = list(map(lambda e: tokenizer.tokenize(e["sentence"]), test_data))

all_train_data_tokens = list(map(lambda e: tokenizer.tokenize(e["sentence"]), all_train_data))

all_data_tokens = list(map(lambda e: tokenizer.tokenize(e["sentence"]), all_data))

Collecting all unique tokens in the dataset.

In [48]:
vocab = base.get_vocab(all_data_tokens)
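
The helper `base.get_vocab` is not shown in this notebook. As a minimal sketch of what such a step might look like, the hypothetical `get_vocab_sketch` below collects each token once and keeps first-seen order; both the name and the ordering are assumptions.

```python
def get_vocab_sketch(tokenized_sentences):
    """Return the unique tokens in first-seen order (ordering is an assumption)."""
    seen = {}
    for tokens in tokenized_sentences:
        for tok in tokens:
            seen.setdefault(tok, None)  # dicts preserve insertion order
    return list(seen)

example_tokens = [["what", "is", "nlp", "?"], ["what", "is", "bert", "?"]]
vocab_sketch = get_vocab_sketch(example_tokens)
print(vocab_sketch)  # ['what', 'is', 'nlp', '?', 'bert']
```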

Assigning an index to each token.

In [49]:
word_index = dict(zip(vocab, range(len(vocab))))

Building the index of GloVe embeddings.

In [50]:
embeddings_index = base.get_embeddings_indeces(GLOVE_FILE)

Found 400000 word vectors.
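
`base.get_embeddings_indeces` is not shown here. GloVe text files store one word per line followed by its space-separated vector components, so a reader along the following lines would produce a word-to-vector index. The function name `load_glove_index` and the use of plain Python lists are assumptions made for this sketch.

```python
import io

def load_glove_index(lines):
    """Parse GloVe-style lines ('word v1 v2 ...') into a dict of float lists."""
    index = {}
    for line in lines:
        word, *values = line.rstrip("\n").split(" ")
        index[word] = [float(v) for v in values]
    return index

# Tiny in-memory sample standing in for glove.6B.300d.txt.
sample_file = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")
glove_sketch = load_glove_index(sample_file)
print(f"Found {len(glove_sketch)} word vectors.")
```

With a real file, the same function could be called on an open file handle, for example `with open(GLOVE_FILE, encoding="utf-8") as f: load_glove_index(f)`.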


Defining the vocabulary size and the embedding dimension.

In [51]:
print(len(vocab))
num_tokens = len(vocab) + 2
embedding_dim = 300

8766


Mapping tokens (via their indices) to embeddings. A small share of tokens (215 of 8766 in the output below) has no GloVe vector, which is not a problem.

In [52]:
embedding_matrix = base.get_embedding_matrix(num_tokens, embedding_dim, word_index, embeddings_index)

Converted 8551 words (215) misses
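
As a rough illustration of this step, the sketch below fills a matrix row for every token that has a pretrained vector and leaves zeros for the rest, counting hits and misses. `build_embedding_matrix` is a hypothetical helper; the row layout used by `base.get_embedding_matrix` (for example, where the two extra rows go) is not shown in this notebook and is assumed here.

```python
def build_embedding_matrix(num_tokens, dim, word_to_row, vectors):
    """Copy known vectors into their rows; unknown tokens keep zero rows."""
    matrix = [[0.0] * dim for _ in range(num_tokens)]
    hits = misses = 0
    for word, row in word_to_row.items():
        vec = vectors.get(word)
        if vec is not None and len(vec) == dim:
            matrix[row] = list(vec)
            hits += 1
        else:
            misses += 1
    return matrix, hits, misses

toy_vectors = {"what": [0.1, 0.2], "is": [0.3, 0.4]}
toy_index = {"what": 0, "is": 1, "zzyzx": 2}
toy_matrix, hits, misses = build_embedding_matrix(
    len(toy_index) + 2, 2, toy_index, toy_vectors
)
print(f"Converted {hits} words ({misses} misses)")  # Converted 2 words (1 misses)
```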


Assigning indices to the tokens in each part of the dataset.

In [53]:
train_data_index = list(map(lambda x: list(map(lambda y: word_index[y], x)),train_data_tokens))
eval_data_index = list(map(lambda x: list(map(lambda y: word_index[y], x)),eval_data_tokens))
test_data_index = list(map(lambda x: list(map(lambda y: word_index[y], x)),test_data_tokens))

all_train_data_index = list(map(lambda x: list(map(lambda y: word_index[y], x)),all_train_data_tokens))

Padding all records to the same length.

In [54]:
train_padded_data = list(map(lambda x: base.padd(x,60), train_data_index))
eval_padded_data = list(map(lambda x: base.padd(x,60), eval_data_index))
test_padded_data = list(map(lambda x: base.padd(x,60), test_data_index))

all_train_padded_data = list(map(lambda x: base.padd(x,60), all_train_data_index))
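
`base.padd` is not shown here. A minimal sketch of fixed-length padding, assuming right-padding with a pad ID of 0 and truncation of longer sequences (both are assumptions, as is the name `pad_to_length`):

```python
def pad_to_length(ids, length, pad_id=0):
    """Truncate to `length`, then right-pad with `pad_id` up to `length`."""
    ids = list(ids)[:length]
    return ids + [pad_id] * (length - len(ids))

print(pad_to_length([5, 6, 7], 5))          # [5, 6, 7, 0, 0]
print(pad_to_length([1, 2, 3, 4, 5, 6], 4)) # [1, 2, 3, 4]
```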

Obtaining token IDs and attention masks for the BERT model as well.

In [55]:
train_teacher_data = base.prepare_dataset_teacher(train_data, teacher_tokenizer)
eval_teacher_data = base.prepare_dataset_teacher(eval_data, teacher_tokenizer)
test_teacher_data = base.prepare_dataset_teacher(test_data, teacher_tokenizer)

all_train_teacher_data = base.prepare_dataset_teacher(all_train_data, teacher_tokenizer)

Adding token IDs to each part of the dataset. IDs are added for both the GloVe-based model and the BERT model.

In [56]:
train_data = train_data.add_column("input_ids", train_padded_data)
train_data = train_data.add_column("teacher_ids", train_teacher_data[0])
train_data = train_data.add_column("teacher_attention", train_teacher_data[1])

eval_data = eval_data.add_column("input_ids", eval_padded_data)
eval_data = eval_data.add_column("teacher_ids", eval_teacher_data[0])
eval_data = eval_data.add_column("teacher_attention", eval_teacher_data[1])

test_data = test_data.add_column("input_ids", test_padded_data)
test_data = test_data.add_column("teacher_ids", test_teacher_data[0])
test_data = test_data.add_column("teacher_attention", test_teacher_data[1])

all_train_data = all_train_data.add_column("input_ids", all_train_padded_data)
all_train_data = all_train_data.add_column("teacher_ids", all_train_teacher_data[0])
all_train_data = all_train_data.add_column("teacher_attention", all_train_teacher_data[1])

## BiLSTM
### Non-augmented dataset
#### Precomputed logits

Creating the student model with the predefined embedding layer.

In [57]:
student_model = base.BiLSTMClassifier(embedding_matrix=embedding_matrix, embedding_dim=embedding_dim, fc_dim=400, hidden_dim=300, output_dim=50)

Training setup.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bilstm-distill_fine", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bilstm-distill_fine")

In [59]:
base.reset_seed()

Selecting the required dataset columns.

In [60]:
train_data.set_format(type="torch", columns=["input_ids", "logits", "labels"], device="cpu")
eval_data.set_format(type="torch", columns=["input_ids", "logits", "labels"], device="cpu")

In [61]:
trainer = base.DistilTrainer(
    student_model=student_model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [62]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,2.4548,2.383512,0.176902,0.003538,0.02,0.006012
2,2.2833,2.16466,0.176902,0.003538,0.02,0.006012
3,2.1482,2.111145,0.176902,0.003538,0.02,0.006012
4,2.1103,2.096688,0.176902,0.003538,0.02,0.006012
5,2.1208,2.091532,0.176902,0.003538,0.02,0.006012


TrainOutput(global_step=175, training_loss=2.2234778703962053, metrics={'train_runtime': 23.8578, 'train_samples_per_second': 913.957, 'train_steps_per_second': 7.335, 'total_flos': 0.0, 'train_loss': 2.2234778703962053, 'epoch': 5.0})
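
The evaluation metrics above stay identical across all epochs, which is worth a quick sanity check. The sketch below assumes that `base.compute_metrics` macro-averages precision, recall, and F1 over the 50 classes (an assumption; the function is not shown here) and computes what those averages would be for a model that predicts one and the same class for every example. `single_class_macro_metrics` is a hypothetical helper.

```python
def single_class_macro_metrics(accuracy, num_classes):
    """Macro precision/recall/F1 when every prediction is the same class.

    Only the predicted class has non-zero scores: its precision equals the
    accuracy and its recall is 1; all other classes contribute 0.
    """
    precision_c = accuracy
    recall_c = 1.0
    f1_c = 2 * precision_c * recall_c / (precision_c + recall_c)
    return (precision_c / num_classes, recall_c / num_classes, f1_c / num_classes)

p, r, f1 = single_class_macro_metrics(0.176902, 50)
print(round(p, 6), round(r, 6), round(f1, 6))  # 0.003538 0.02 0.006012
```

Under that assumption the values agree with the reported 0.003538, 0.02, and 0.006012, so the constant rows are consistent with a student that outputs a single class for every evaluation example after 5 epochs on the non-augmented data.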

In [63]:
base.reset_seed()

#### Logits obtained by inference
Creating the student model with the predefined embedding layer.

Loading the pretrained teacher model used to infer logits during training. The custom fine-tuned weights are then loaded into it.

In [None]:
student_model = base.BiLSTMClassifier(embedding_matrix=embedding_matrix, embedding_dim=embedding_dim, fc_dim=400, hidden_dim=300, output_dim=50)
config = AutoConfig.from_pretrained("ndavid/autotrain-trec-fine-bert-739422530")
config.max_length = 20 
config.num_labels = 50
teacher_model = BertForSequenceClassification.from_pretrained("ndavid/autotrain-trec-fine-bert-739422530", config=config, ignore_mismatched_sizes=True)
model_path = f"{os.path.expanduser('~')}/models/{DATASET}/teacher_fine.pth"
state_dict = torch.load(model_path, map_location=torch.device('cpu')) 
teacher_model.load_state_dict(state_dict)
teacher_model.to(device)
teacher_model.eval()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ndavid/autotrain-trec-fine-bert-739422530 and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([47, 768]) in the checkpoint and torch.Size([50, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([47]) in the checkpoint and torch.Size([50]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  state_dict = torch.load(model_path, map_location=torch.device('cpu'))


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

Training setup.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bilstm-distill_fine_infer", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bilstm-distill_fine_infer")

In [66]:
base.reset_seed()

Selecting the required dataset columns.

In [67]:
train_data.reset_format()
eval_data.reset_format()   

In [68]:
trainer = base.DistilTrainerInferText(
    student_model=student_model,
    teacher_model=teacher_model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [69]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,2.249,2.198247,0.101742,0.018733,0.026943,0.01166
2,2.1116,2.009753,0.04033,0.000807,0.02,0.001551
3,1.9792,1.948692,0.047663,0.020813,0.020829,0.003154
4,1.9463,1.936625,0.133822,0.018795,0.03057,0.015014
5,1.9484,1.932614,0.140238,0.018633,0.031295,0.015537


TrainOutput(global_step=175, training_loss=2.0468807547433037, metrics={'train_runtime': 36.9112, 'train_samples_per_second': 590.742, 'train_steps_per_second': 4.741, 'total_flos': 0.0, 'train_loss': 2.0468807547433037, 'epoch': 5.0})

### Augmented dataset
#### Precomputed logits

Creating the student model with the predefined embedding layer.

In [70]:
student_model = base.BiLSTMClassifier(embedding_matrix=embedding_matrix, embedding_dim=embedding_dim, fc_dim=400, hidden_dim=300, output_dim=50)

Training setup.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bilstm-distill_fine_aug", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bilstm-distill_fine_aug")

In [72]:
base.reset_seed()

Selecting the required dataset columns.

In [73]:
all_train_data.set_format(type="torch", columns=["input_ids", "logits", "labels"], device="cpu")
eval_data.set_format(type="torch", columns=["input_ids", "logits", "labels"], device="cpu")

In [74]:
trainer = base.DistilTrainer(
    student_model=student_model,
    args=training_args,
    train_dataset= all_train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [75]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,1.7979,1.530192,0.421632,0.068049,0.092782,0.063521
2,1.3643,1.345174,0.485793,0.116851,0.122112,0.098105
3,1.2234,1.259001,0.523373,0.134814,0.146253,0.125958
4,1.1453,1.218836,0.552704,0.154985,0.166241,0.14809
5,1.108,1.204154,0.55912,0.15569,0.174193,0.156175


TrainOutput(global_step=2640, training_loss=1.3277802207253195, metrics={'train_runtime': 58.9688, 'train_samples_per_second': 5729.806, 'train_steps_per_second': 44.769, 'total_flos': 0.0, 'train_loss': 1.3277802207253195, 'epoch': 5.0})

In [76]:
base.reset_seed()

#### Logits obtained by inference

Creating the student model with the predefined embedding layer.

In [77]:
student_model = base.BiLSTMClassifier(embedding_matrix=embedding_matrix, embedding_dim=embedding_dim, fc_dim=400, hidden_dim=300, output_dim=50)

Training setup.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bilstm-distill_fine_aug_infer", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bilstm-distill_fine_aug_infer")

Selecting the required dataset columns.

In [79]:
all_train_data.reset_format()
eval_data.reset_format()   

In [80]:
trainer = base.DistilTrainerInferText(
    student_model=student_model,
    teacher_model=teacher_model,
    args=training_args,
    train_dataset=all_train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [81]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,1.7253,1.52852,0.404216,0.065981,0.097191,0.073726
2,1.4065,1.386111,0.457379,0.1087,0.114824,0.094496
3,1.2934,1.31834,0.491292,0.177051,0.140626,0.132226
4,1.2331,1.284705,0.52154,0.17346,0.156864,0.150265
5,1.2046,1.273443,0.526123,0.169786,0.160804,0.152862


TrainOutput(global_step=2640, training_loss=1.3725971106326942, metrics={'train_runtime': 205.2309, 'train_samples_per_second': 1646.341, 'train_steps_per_second': 12.864, 'total_flos': 0.0, 'train_loss': 1.3725971106326942, 'epoch': 5.0})

## BERT TINY
### Non-augmented dataset
#### Precomputed logits

Creating the student model.

In [82]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=50)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training setup.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bert-distill_fine", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bert-distill_fine")

In [84]:
base.reset_seed()

Configuring the dataset columns.

In [85]:
train_data = train_data.remove_columns(["input_ids"])
train_data = train_data.rename_column("teacher_attention", "attention_mask")
train_data = train_data.rename_column("teacher_ids", "input_ids")

eval_data = eval_data.remove_columns(["input_ids"])
eval_data = eval_data.rename_column("teacher_attention", "attention_mask")
eval_data = eval_data.rename_column("teacher_ids", "input_ids")

train_data.set_format(type="torch", columns=["input_ids", "attention_mask", "logits", "labels"], device="cpu")
eval_data.set_format(type="torch", columns=["input_ids", "attention_mask", "logits", "labels"], device="cpu")

In [86]:
trainer = base.DistilTrainer(
    student_model=student_model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [87]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,2.4407,2.36054,0.176902,0.003538,0.02,0.006012
2,2.3374,2.29217,0.176902,0.003538,0.02,0.006012
3,2.2828,2.249319,0.176902,0.003538,0.02,0.006012
4,2.2478,2.223199,0.176902,0.003538,0.02,0.006012
5,2.2351,2.213775,0.176902,0.003538,0.02,0.006012


TrainOutput(global_step=175, training_loss=2.3087569754464288, metrics={'train_runtime': 24.9432, 'train_samples_per_second': 874.185, 'train_steps_per_second': 7.016, 'total_flos': 3295047747600.0, 'train_loss': 2.3087569754464288, 'epoch': 5.0})

In [93]:
base.reset_seed()

#### Logits obtained by inference
Creating the student model.

In [94]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=50)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training setup.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bert-distill_fine_infer", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bert-distill_fine_infer")

In [96]:
trainer = base.DistilTrainerInfer(
    student_model=student_model,
    teacher_model=teacher_model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [97]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,2.2104,2.149666,0.042163,0.014142,0.020207,0.001963
2,2.1224,2.089401,0.041247,0.020807,0.020104,0.001758
3,2.0779,2.053643,0.04033,0.000807,0.02,0.001551
4,2.0486,2.034778,0.04033,0.000807,0.02,0.001551
5,2.04,2.028477,0.04033,0.000807,0.02,0.001551


TrainOutput(global_step=175, training_loss=2.0998707362583704, metrics={'train_runtime': 38.9552, 'train_samples_per_second': 559.746, 'train_steps_per_second': 4.492, 'total_flos': 3295047747600.0, 'train_loss': 2.0998707362583704, 'epoch': 5.0})

### Augmented dataset
#### Precomputed logits
Creating the student model.

In [52]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=50)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training setup.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bert-distill_fine_aug", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bert-distill_fine_aug")

In [54]:
base.reset_seed()

Selecting the required dataset columns.

In [55]:
all_train_data = all_train_data.remove_columns(["input_ids"])
all_train_data = all_train_data.rename_column("teacher_attention", "attention_mask")
all_train_data = all_train_data.rename_column("teacher_ids", "input_ids")

all_train_data.set_format(type="torch", columns=["input_ids", "attention_mask", "logits", "labels"], device="cpu")

In [56]:
trainer = base.DistilTrainer(
    student_model=student_model,
    args=training_args,
    train_dataset=all_train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [57]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,1.9311,1.677075,0.431714,0.104269,0.105747,0.082142
2,1.3668,1.36222,0.546288,0.188218,0.188469,0.171267
3,1.0963,1.233016,0.585701,0.212116,0.22769,0.209865
4,0.9586,1.173726,0.613199,0.253448,0.260846,0.244193
5,0.8984,1.159171,0.616865,0.265157,0.268675,0.253462


TrainOutput(global_step=2640, training_loss=1.2502604282263554, metrics={'train_runtime': 62.1955, 'train_samples_per_second': 5432.548, 'train_steps_per_second': 42.447, 'total_flos': 51058506441600.0, 'train_loss': 1.2502604282263554, 'epoch': 5.0})

In [58]:
base.reset_seed()

#### Logits obtained by inference
Creating the student model.

In [59]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=50)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training setup.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bert-distill_fine_aug_infer", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bert-distill_fine_aug_infer")

In [61]:
trainer = base.DistilTrainerInfer(
    student_model=student_model,
    teacher_model=teacher_model,
    args=training_args,
    train_dataset=all_train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [62]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,1.8071,1.593866,0.44088,0.137232,0.135849,0.126099
2,1.3632,1.34198,0.562786,0.238346,0.21349,0.207363
3,1.1595,1.252002,0.584785,0.275587,0.229421,0.223675
4,1.0643,1.216867,0.607699,0.305345,0.251641,0.251453
5,1.0228,1.207303,0.614115,0.322097,0.25927,0.260311


TrainOutput(global_step=2640, training_loss=1.2833794564911813, metrics={'train_runtime': 215.1187, 'train_samples_per_second': 1570.668, 'train_steps_per_second': 12.272, 'total_flos': 51058506441600.0, 'train_loss': 1.2833794564911813, 'epoch': 5.0})
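
Since training time is the quantity of interest, the `train_runtime` values reported in the eight `TrainOutput` lines of this notebook can be put side by side. The sketch below only copies those numbers and divides them; the dictionary layout and variable names are illustrative.

```python
# (precomputed_seconds, inference_seconds), copied from the TrainOutput lines.
runtimes = {
    ("BiLSTM", "original"): (23.8578, 36.9112),
    ("BiLSTM", "augmented"): (58.9688, 205.2309),
    ("BERT TINY", "original"): (24.9432, 38.9552),
    ("BERT TINY", "augmented"): (62.1955, 215.1187),
}

# Slowdown factor of on-the-fly teacher inference relative to precomputed logits.
slowdowns = {key: infer / pre for key, (pre, infer) in runtimes.items()}

for (model, variant), ratio in slowdowns.items():
    print(f"{model:9s} {variant:9s} {ratio:.2f}x slower with inference")
```

In these runs, inferring teacher logits during training takes roughly 1.5 times as long on the original dataset and roughly 3.5 times as long on the augmented one, for both student models.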