# Comparison of precomputed logits and logits inferred during training for TREC (coarse)

This notebook compares the two approaches on the TREC (coarse) dataset. It covers both dataset variants (augmented, default) with both student models (BiLSTM and BERT TINY).

Training is set to 5 epochs with default hyperparameters; what matters here is how long it takes.

## Library imports

In [1]:
from datasets import concatenate_datasets, load_from_disk
from transformers import BasicTokenizer, BertForSequenceClassification, BertTokenizer
import kagglehub
import torch
import base

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/jovyan/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


Loading the embeddings.

Loading the dataset and its basic preprocessing (tokenization, building the vocabulary of all tokens, building the index for the GloVe embeddings).

In [2]:
my_glove = kagglehub.dataset_download("thanakomsn/glove6b300dtxt")
print(my_glove)

/home/jovyan/.cache/kagglehub/datasets/thanakomsn/glove6b300dtxt/versions/1


In [3]:
GLOVE_FILE = f"{my_glove}/glove.6B.300d.txt"
DATASET = "trec"

In [4]:
train_data = load_from_disk(f"~/data/{DATASET}/train-logits_coarse")
eval_data = load_from_disk(f"~/data/{DATASET}/eval-logits_coarse")
test_data = load_from_disk(f"~/data/{DATASET}/test-logits_coarse")

all_train_data = load_from_disk(f"~/data/{DATASET}/train-logits-augmented_coarse")

all_data = concatenate_datasets([load_from_disk(file) for file in [f"~/data/{DATASET}/eval-logits_coarse", f"~/data/{DATASET}/test-logits_coarse", f"~/data/{DATASET}/train-logits-augmented_coarse"]])
tokenizer = BasicTokenizer(do_lower_case=True)
teacher_tokenizer = BertTokenizer.from_pretrained("carrassi-ni/bert-base-trec-question-classification")

Checking GPU availability.

In [5]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available and will be used:", torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("GPU is not available, using CPU.")

GPU is available and will be used: NVIDIA A100 80GB PCIe MIG 2g.20gb


Tokenization.

In [6]:
train_data_tokens = list(map(lambda e: tokenizer.tokenize(e["sentence"]), train_data))
eval_data_tokens = list(map(lambda e: tokenizer.tokenize(e["sentence"]), eval_data))
test_data_tokens = list(map(lambda e: tokenizer.tokenize(e["sentence"]), test_data))

all_train_data_tokens = list(map(lambda e: tokenizer.tokenize(e["sentence"]), all_train_data))

all_data_tokens = list(map(lambda e: tokenizer.tokenize(e["sentence"]), all_data))

Collecting all unique tokens in the dataset.

In [7]:
vocab = base.get_vocab(all_data_tokens)

Assigning an index to each token.

In [8]:
word_index = dict(zip(vocab, range(len(vocab))))
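
The two steps above can be sketched as follows, assuming `base.get_vocab` simply collects the distinct tokens across all tokenized sentences (its implementation is not shown in this notebook). The helper `build_vocab` and the toy sentences are hypothetical.

```python
def build_vocab(tokenized_sentences):
    """Collect distinct tokens, keeping first-seen order so indices are reproducible."""
    seen = {}
    for tokens in tokenized_sentences:
        for token in tokens:
            seen.setdefault(token, None)
    return list(seen)


# Toy stand-in for all_data_tokens.
toy_tokens = [["what", "is", "trec", "?"], ["who", "is", "it", "?"]]
toy_vocab = build_vocab(toy_tokens)

# Same mapping as dict(zip(vocab, range(len(vocab)))), written with enumerate.
toy_word_index = {token: i for i, token in enumerate(toy_vocab)}
print(toy_word_index)
```

Keeping first-seen order (rather than iterating over a plain `set`) makes the token indices stable across runs, which matters if the embedding matrix is ever saved and reused.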

Building the index of GloVe embeddings.

In [9]:
embeddings_index = base.get_embeddings_indeces(GLOVE_FILE)

Found 400000 word vectors.
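
For orientation, a minimal sketch of how such an index can be built from the GloVe text format, where each line holds a word followed by its vector components separated by spaces. This is an assumption about what `base.get_embeddings_indeces` does, not its actual code; `load_glove_index` and the toy file are hypothetical.

```python
import os
import tempfile


def load_glove_index(path):
    """Map each word to its vector, parsed from 'word v1 v2 ...' lines."""
    index = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            word, *values = line.rstrip("\n").split(" ")
            index[word] = [float(v) for v in values]
    return index


# Tiny stand-in for glove.6B.300d.txt (3 dimensions instead of 300).
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_glove.txt")
    with open(toy_path, "w", encoding="utf-8") as handle:
        handle.write("what 0.1 0.2 0.3\nwho -0.5 0.0 0.25\n")
    toy_embeddings = load_glove_index(toy_path)

print(f"Found {len(toy_embeddings)} word vectors.")
```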


Defining the vocabulary size and the embedding dimension.

In [10]:
print(len(vocab))
num_tokens = len(vocab) + 2
embedding_dim = 300

8766


Creating the mapping between tokens (their indices) and embeddings. Some tokens (215 of 8766) were not found in GloVe, which is not a problem.

In [11]:
embedding_matrix = base.get_embedding_matrix(num_tokens, embedding_dim, word_index, embeddings_index)

Converted 8551 words (215) misses
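
A hedged sketch of the lookup behind this output: each vocabulary token is looked up in the GloVe index, hits are copied into the matrix row given by `word_index`, and misses are counted (8551 + 215 = 8766, the vocabulary size). Leaving missed rows as zero vectors is an assumption, as is the purpose of the two extra rows in `num_tokens` (presumably reserved for special tokens such as padding; the notebook does not say). `build_embedding_matrix` is a hypothetical stand-in for `base.get_embedding_matrix`.

```python
def build_embedding_matrix(num_tokens, embedding_dim, word_index, embeddings_index):
    """Return (matrix, hits, misses); rows without a pretrained vector stay zero."""
    matrix = [[0.0] * embedding_dim for _ in range(num_tokens)]
    hits = misses = 0
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is None:
            misses += 1  # token absent from the pretrained embeddings
        else:
            matrix[i] = list(vector)
            hits += 1
    return matrix, hits, misses


# Toy data: three vocabulary tokens, two of which have a pretrained vector.
demo_word_index = {"what": 0, "who": 1, "trec": 2}
demo_embeddings = {"what": [0.1, 0.2, 0.3], "who": [-0.5, 0.0, 0.25]}

demo_matrix, demo_hits, demo_misses = build_embedding_matrix(
    num_tokens=len(demo_word_index) + 2,  # mirrors len(vocab) + 2 above
    embedding_dim=3,
    word_index=demo_word_index,
    embeddings_index=demo_embeddings,
)
print(f"Converted {demo_hits} words ({demo_misses}) misses")
```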


Mapping tokens to their indices in each dataset split.

In [12]:
train_data_index = list(map(lambda x: list(map(lambda y: word_index[y], x)),train_data_tokens))
eval_data_index = list(map(lambda x: list(map(lambda y: word_index[y], x)),eval_data_tokens))
test_data_index = list(map(lambda x: list(map(lambda y: word_index[y], x)),test_data_tokens))

all_train_data_index = list(map(lambda x: list(map(lambda y: word_index[y], x)),all_train_data_tokens))

Padding all records to the same length.

In [13]:
train_padded_data = list(map(lambda x: base.padd(x,60), train_data_index))
eval_padded_data = list(map(lambda x: base.padd(x,60), eval_data_index))
test_padded_data = list(map(lambda x: base.padd(x,60), test_data_index))

all_train_padded_data = list(map(lambda x: base.padd(x,60), all_train_data_index))
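
A minimal sketch of fixed-length padding, assuming `base.padd(x, 60)` truncates sequences longer than 60 tokens and right-pads shorter ones. The actual padding value is not shown in this notebook; because `word_index` starts at 0, it is presumably an index outside the vocabulary range, so `pad_index` below is a hypothetical parameter and `pad_to_length` a hypothetical helper.

```python
def pad_to_length(indices, max_len, pad_index):
    """Truncate to max_len, then right-pad with pad_index up to max_len."""
    clipped = list(indices[:max_len])
    return clipped + [pad_index] * (max_len - len(clipped))


# Assumed padding index: one past the last real token of a 9-token toy vocabulary.
TOY_PAD_INDEX = 9

short_example = pad_to_length([3, 1, 2], max_len=5, pad_index=TOY_PAD_INDEX)
long_example = pad_to_length(list(range(8)), max_len=5, pad_index=TOY_PAD_INDEX)
print(short_example)
print(long_example)
```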

Obtaining token IDs and attention masks for the BERT model as well.

In [14]:
train_teacher_data = base.prepare_dataset_teacher(train_data, teacher_tokenizer)
eval_teacher_data = base.prepare_dataset_teacher(eval_data, teacher_tokenizer)
test_teacher_data = base.prepare_dataset_teacher(test_data, teacher_tokenizer)

all_train_teacher_data = base.prepare_dataset_teacher(all_train_data, teacher_tokenizer)

Adding token IDs to each dataset split. IDs are added for both the GloVe-based and the BERT model.

In [15]:
train_data = train_data.add_column("input_ids", train_padded_data)
train_data = train_data.add_column("teacher_ids", train_teacher_data[0])
train_data = train_data.add_column("teacher_attention", train_teacher_data[1])

eval_data = eval_data.add_column("input_ids", eval_padded_data)
eval_data = eval_data.add_column("teacher_ids", eval_teacher_data[0])
eval_data = eval_data.add_column("teacher_attention", eval_teacher_data[1])

test_data = test_data.add_column("input_ids", test_padded_data)
test_data = test_data.add_column("teacher_ids", test_teacher_data[0])
test_data = test_data.add_column("teacher_attention", test_teacher_data[1])

all_train_data = all_train_data.add_column("input_ids", all_train_padded_data)
all_train_data = all_train_data.add_column("teacher_ids", all_train_teacher_data[0])
all_train_data = all_train_data.add_column("teacher_attention", all_train_teacher_data[1])

## BiLSTM
### Non-augmented dataset
#### Precomputed logits

Creating the student model with the predefined embedding layer.

In [16]:
student_model = base.BiLSTMClassifier(embedding_matrix=embedding_matrix, embedding_dim=embedding_dim, fc_dim=400, hidden_dim=300, output_dim=6)

Training definition.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bilstm-distill_coarse", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bilstm-distill_coarse")

In [18]:
base.reset_seed()

Selecting the required dataset columns.

In [19]:
train_data.set_format(type="torch", columns=["input_ids", "logits", "labels"], device="cpu")
eval_data.set_format(type="torch", columns=["input_ids", "logits", "labels"], device="cpu")

In [20]:
trainer = base.DistilTrainer(
    student_model=student_model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [21]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,3.9725,3.877782,0.424381,0.290962,0.317784,0.269476
2,3.8229,3.72867,0.436297,0.317906,0.324997,0.281137
3,3.7247,3.669156,0.404216,0.291985,0.299481,0.25061
4,3.68,3.631614,0.444546,0.256072,0.331197,0.27444
5,3.6505,3.616164,0.450962,0.261664,0.336352,0.281117


TrainOutput(global_step=175, training_loss=3.7701025390625, metrics={'train_runtime': 24.3316, 'train_samples_per_second': 896.16, 'train_steps_per_second': 7.192, 'total_flos': 0.0, 'train_loss': 3.7701025390625, 'epoch': 5.0})
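
The notebook does not show how `base.DistilTrainer` combines the stored `logits` with the `labels`. As a point of reference only, the sketch below works through a common distillation objective for a single example: cross-entropy on the hard label plus a temperature-softened KL divergence to the teacher. The function names, `temperature`, and `alpha` are assumptions, not values taken from `base`.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher || student)."""
    hard = -math.log(softmax(student_logits)[label])
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft


# Six classes, as in TREC (coarse); the teacher is confident in class 2.
teacher = [0.0, 0.0, 6.0, 0.0, 0.0, 0.0]
untrained_student = [0.0] * 6
matching_student = list(teacher)

loss_untrained = distillation_loss(untrained_student, teacher, label=2)
loss_matching = distillation_loss(matching_student, teacher, label=2)
print(loss_untrained, loss_matching)
```

When the student reproduces the teacher's logits, the KL term vanishes and only the weighted cross-entropy remains, which is why the second loss is much smaller than the first.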

In [22]:
base.reset_seed()

#### Logits obtained by inference
Creating the student model with the predefined embedding layer.

Loading the pretrained teacher model, which infers the logits on the fly during training.

In [23]:
student_model = base.BiLSTMClassifier(embedding_matrix=embedding_matrix, embedding_dim=embedding_dim, fc_dim=400, hidden_dim=300, output_dim=6)
teacher_model = BertForSequenceClassification.from_pretrained("carrassi-ni/bert-base-trec-question-classification", num_labels=6)
teacher_model.to(device)
teacher_model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

Training definition.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bilstm-distill_coarse_infer", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bilstm-distill_coarse_infer")

In [25]:
base.reset_seed()

Selecting the required dataset columns.

In [26]:
train_data.reset_format()
eval_data.reset_format()   

In [27]:
trainer = base.DistilTrainerInferText(
    student_model=student_model,
    teacher_model=teacher_model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [28]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,3.9416,3.840916,0.362053,0.364801,0.266503,0.216184
2,3.7934,3.707982,0.389551,0.334907,0.287734,0.241021
3,3.7092,3.648413,0.414299,0.276402,0.307314,0.258479
4,3.6532,3.60284,0.449129,0.25405,0.335003,0.278742
5,3.6205,3.584687,0.450046,0.2556,0.335837,0.280258


TrainOutput(global_step=175, training_loss=3.7435805402483258, metrics={'train_runtime': 37.6365, 'train_samples_per_second': 579.358, 'train_steps_per_second': 4.65, 'total_flos': 0.0, 'train_loss': 3.7435805402483258, 'epoch': 5.0})
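
The two BiLSTM runs above differ in where the teacher logits come from. The toy sketch below counts teacher forward passes in both setups to show why on-the-fly inference adds work to every epoch. `CountingTeacher` is a hypothetical stand-in for the BERT teacher and returns fake logits; nothing here reflects the internals of `base.DistilTrainerInferText`.

```python
class CountingTeacher:
    """Stand-in for the teacher model; counts how many forward passes it performs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, sentence):
        self.calls += 1
        # Fake 6-class logits; a real teacher would run a transformer forward pass.
        return [float(len(sentence) % 6 == k) for k in range(6)]


sentences = ["what is trec ?", "who wrote it ?", "where is prague ?"]
epochs = 5

# Variant 1: logits are computed once up front and reused in every epoch.
precomputing_teacher = CountingTeacher()
cached_logits = [precomputing_teacher(s) for s in sentences]
for _ in range(epochs):
    for sentence, teacher_logits in zip(sentences, cached_logits):
        pass  # the student update would consume teacher_logits here

# Variant 2: the teacher is queried again for every example in every epoch.
online_teacher = CountingTeacher()
for _ in range(epochs):
    for sentence in sentences:
        teacher_logits = online_teacher(sentence)

print(precomputing_teacher.calls, online_teacher.calls)  # prints: 3 15
```

Note that in the precomputed variant the one-off cost of running the teacher is paid before training, so it is not part of the `train_runtime` values reported in this notebook.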

### Augmented dataset
#### Precomputed logits

Creating the student model with the predefined embedding layer.

In [29]:
student_model = base.BiLSTMClassifier(embedding_matrix=embedding_matrix, embedding_dim=embedding_dim, fc_dim=400, hidden_dim=300, output_dim=6)

Training definition.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bilstm-distill_coarse_aug", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bilstm-distill_coarse_aug")

In [31]:
base.reset_seed()

Selecting the required dataset columns.

In [32]:
all_train_data.set_format(type="torch", columns=["input_ids", "logits", "labels"], device="cpu")
eval_data.set_format(type="torch", columns=["input_ids", "logits", "labels"], device="cpu")

In [33]:
trainer = base.DistilTrainer(
    student_model=student_model,
    args=training_args,
    train_dataset= all_train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [34]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,3.0493,2.594009,0.567369,0.533474,0.462808,0.457354
2,2.1538,2.106654,0.660862,0.555705,0.560814,0.552871
3,1.771,1.843467,0.705775,0.595782,0.602444,0.59715
4,1.5719,1.7327,0.722273,0.60788,0.616874,0.610993
5,1.4805,1.698875,0.731439,0.61466,0.624957,0.618452


TrainOutput(global_step=1520, training_loss=2.0052886762117086, metrics={'train_runtime': 36.2383, 'train_samples_per_second': 5364.901, 'train_steps_per_second': 41.945, 'total_flos': 0.0, 'train_loss': 2.0052886762117086, 'epoch': 5.0})
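
As a sanity check, the reported throughput and runtime can be turned back into dataset sizes and step counts. The sketch below uses only numbers from the two precomputed-logit BiLSTM runs; the batch size of 128 is an assumption (it is hidden inside `base.get_training_args`), chosen because it reproduces both reported `global_step` values.

```python
import math


def samples_per_epoch(samples_per_second, runtime_seconds, epochs):
    """Recover the training-set size from Trainer throughput metrics."""
    return round(samples_per_second * runtime_seconds / epochs)


EPOCHS = 5
ASSUMED_BATCH_SIZE = 128  # assumption, not stated anywhere in the notebook

# Metrics copied from the TrainOutput lines of the two precomputed BiLSTM runs.
default_samples = samples_per_epoch(896.16, 24.3316, EPOCHS)
augmented_samples = samples_per_epoch(5364.901, 36.2383, EPOCHS)

default_steps = math.ceil(default_samples / ASSUMED_BATCH_SIZE) * EPOCHS
augmented_steps = math.ceil(augmented_samples / ASSUMED_BATCH_SIZE) * EPOCHS

print(default_samples, default_steps)
print(augmented_samples, augmented_steps)
```

Under this assumption the default split has 4361 training examples and the augmented one 38883, giving 175 and 1520 optimizer steps, which matches the `global_step` values reported by the trainer.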

#### Logits obtained by inference

Creating the student model with the predefined embedding layer.

In [35]:
student_model = base.BiLSTMClassifier(embedding_matrix=embedding_matrix, embedding_dim=embedding_dim, fc_dim=400, hidden_dim=300, output_dim=6)

Training definition.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bilstm-distill_coarse_aug_infer", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bilstm-distill_coarse_aug_infer")

In [37]:
base.reset_seed()

Selecting the required dataset columns.

In [38]:
all_train_data.reset_format()
eval_data.reset_format()   

In [39]:
trainer = base.DistilTrainerInferText(
    student_model=student_model,
    teacher_model=teacher_model,
    args=training_args,
    train_dataset=all_train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [40]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,3.0493,2.594049,0.567369,0.53331,0.462808,0.457278
2,2.1523,2.105163,0.660862,0.555705,0.560814,0.552871
3,1.7698,1.842633,0.707608,0.597499,0.604427,0.599011
4,1.5711,1.732216,0.724106,0.609296,0.618256,0.612399
5,1.48,1.698466,0.729606,0.613304,0.623575,0.617103


TrainOutput(global_step=1520, training_loss=2.004483253077457, metrics={'train_runtime': 127.6824, 'train_samples_per_second': 1522.645, 'train_steps_per_second': 11.905, 'total_flos': 0.0, 'train_loss': 2.004483253077457, 'epoch': 5.0})

## BERT TINY
### Non-augmented dataset
#### Precomputed logits

Creating the student model.

In [41]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=6)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training definition.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bert-distill_coarse", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bert-distill_coarse")

Configuring the dataset columns.

In [43]:
train_data = train_data.remove_columns(["input_ids"])
train_data = train_data.rename_column("teacher_attention", "attention_mask")
train_data = train_data.rename_column("teacher_ids", "input_ids")

eval_data = eval_data.remove_columns(["input_ids"])
eval_data = eval_data.rename_column("teacher_attention", "attention_mask")
eval_data = eval_data.rename_column("teacher_ids", "input_ids")

train_data.set_format(type="torch", columns=["input_ids", "attention_mask", "logits", "labels"], device="cpu")
eval_data.set_format(type="torch", columns=["input_ids", "attention_mask", "logits", "labels"], device="cpu")

In [52]:
base.reset_seed()

In [53]:
trainer = base.DistilTrainer(
    student_model=student_model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [54]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,3.4658,3.334376,0.517874,0.518751,0.426008,0.430613
2,3.339,3.229984,0.530706,0.514513,0.440717,0.446996
3,3.2478,3.158741,0.541705,0.516286,0.44942,0.455007
4,3.1863,3.114835,0.554537,0.527688,0.460082,0.465626
5,3.1526,3.100107,0.560037,0.531812,0.464144,0.469593


TrainOutput(global_step=175, training_loss=3.278302045549665, metrics={'train_runtime': 23.9926, 'train_samples_per_second': 908.823, 'train_steps_per_second': 7.294, 'total_flos': 3250492282800.0, 'train_loss': 3.278302045549665, 'epoch': 5.0})

In [47]:
base.reset_seed()

#### Logits obtained by inference
Creating the student model.

In [48]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=6)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training definition.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bert-distill_coarse_infer", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bert-distill_coarse_infer")

In [50]:
trainer = base.DistilTrainerInfer(
    student_model=student_model,
    teacher_model=teacher_model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [51]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,3.9409,3.822212,0.298808,0.352817,0.220403,0.147902
2,3.7993,3.672094,0.451879,0.398647,0.348152,0.324297
3,3.6789,3.554323,0.472961,0.542589,0.371383,0.348653
4,3.5928,3.491384,0.482126,0.5279,0.385017,0.368784
5,3.5507,3.472617,0.493126,0.535372,0.396374,0.387244


TrainOutput(global_step=175, training_loss=3.7125143432617187, metrics={'train_runtime': 37.1211, 'train_samples_per_second': 587.401, 'train_steps_per_second': 4.714, 'total_flos': 3250492282800.0, 'train_loss': 3.7125143432617187, 'epoch': 5.0})

### Augmented dataset
#### Precomputed logits
Creating the student model.

In [56]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=6)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training definition.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bert-distill_coarse_aug", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bert-distill_coarse_aug")

In [58]:
base.reset_seed()

Selecting the required dataset columns.

In [59]:
all_train_data = all_train_data.remove_columns(["input_ids"])
all_train_data = all_train_data.rename_column("teacher_attention", "attention_mask")
all_train_data = all_train_data.rename_column("teacher_ids", "input_ids")

all_train_data.set_format(type="torch", columns=["input_ids", "attention_mask", "logits", "labels"], device="cpu")

In [60]:
trainer = base.DistilTrainer(
    student_model=student_model,
    args=training_args,
    train_dataset=all_train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [61]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,3.2605,2.978655,0.555454,0.523995,0.449284,0.445469
2,2.3267,2.274173,0.712191,0.632872,0.60224,0.609727
3,1.7605,1.920719,0.759853,0.651027,0.645426,0.646088
4,1.4491,1.751995,0.774519,0.660902,0.659246,0.659136
5,1.3057,1.708815,0.775435,0.660972,0.660738,0.660006


TrainOutput(global_step=1520, training_loss=2.020503596255654, metrics={'train_runtime': 42.4722, 'train_samples_per_second': 4577.46, 'train_steps_per_second': 35.788, 'total_flos': 28981630688400.0, 'train_loss': 2.020503596255654, 'epoch': 5.0})

In [62]:
base.reset_seed()

#### Logits obtained by inference
Creating the student model.

In [63]:
student_model = BertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=6)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training definition.

In [None]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/comp-bert-distill_coarse_aug_infer", remove_unused_columns=False, logging_dir=f"~/logs/{DATASET}/comp-bert-distill_coarse_aug_infer")

In [65]:
trainer = base.DistilTrainerInfer(
    student_model=student_model,
    teacher_model=teacher_model,
    args=training_args,
    train_dataset=all_train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics
)

In [66]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,3.2133,2.891021,0.605866,0.616716,0.499458,0.516399
2,2.234,2.129092,0.729606,0.64402,0.620291,0.627627
3,1.655,1.817408,0.76077,0.656007,0.649572,0.651378
4,1.346,1.668602,0.771769,0.664944,0.659243,0.660382
5,1.2034,1.628448,0.774519,0.66473,0.663031,0.662857


TrainOutput(global_step=1520, training_loss=1.9303382271214535, metrics={'train_runtime': 135.967, 'train_samples_per_second': 1429.869, 'train_steps_per_second': 11.179, 'total_flos': 28981630688400.0, 'train_loss': 1.9303382271214535, 'epoch': 5.0})
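
Since training time is the quantity of interest, the `train_runtime` values reported above can be put side by side. The sketch below only restates numbers from the eight `TrainOutput` lines and computes the ratio of on-the-fly to precomputed runtime; the dictionary layout is an illustrative choice.

```python
# (precomputed runtime, on-the-fly runtime) in seconds, copied from the outputs above.
runs = {
    ("BiLSTM", "default"): (24.3316, 37.6365),
    ("BiLSTM", "augmented"): (36.2383, 127.6824),
    ("BERT TINY", "default"): (23.9926, 37.1211),
    ("BERT TINY", "augmented"): (42.4722, 135.967),
}

# Ratio > 1 means on-the-fly teacher inference was slower than precomputed logits.
slowdowns = {key: online / precomputed for key, (precomputed, online) in runs.items()}

for (model, variant), ratio in slowdowns.items():
    print(f"{model:9s} {variant:9s} on-the-fly / precomputed = {ratio:.2f}x")
```

On these single runs, on-the-fly inference takes roughly 1.5 times as long on the default dataset and a little over 3 times as long on the augmented one, for both student models.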