# Preprocessing the TREC (fine) dataset

This notebook preprocesses the TREC (fine) dataset: it creates augmented records for the dataset and precomputes the teacher logits.
First, all required libraries are loaded, including a custom collection of objects and functions (the `base` module).

In [1]:
from transformers import BertTokenizer, BertForSequenceClassification, BasicTokenizer, Trainer, EarlyStoppingCallback, AutoConfig
from datasets import load_from_disk, Dataset, concatenate_datasets, ClassLabel, Features, Value, Sequence
from torch.utils.data import DataLoader
from tqdm.notebook import tqdm
import numpy as np
import torch
import base
import os
import copy

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/jovyan/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


Check that a GPU is available and that the torch package is configured correctly.

In [3]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available and will be used:", torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("GPU is not available, using CPU.")

GPU is available and will be used: NVIDIA A100 80GB PCIe MIG 1g.10gb


Configuration of the augmentation parameters.
The dataset is traversed twenty times (`n_iter`). In each pass, every token in a record is masked with a probability of 10% (`p_mask`) or replaced by another token with the same POS tag with a probability of 30% (`p_pos`). After all tokens in a record have been processed, the record may additionally be shortened with a probability of 20% (`p_ng`).

In [4]:
augmentation_params = {"n_iter": 20, "p_mask":0.1, "p_pos": 0.3, "p_ng":0.2}
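
The internals of `base.get_augmented_dataset` are not shown in this notebook. The following is only a minimal sketch of how the per-token rules described above could be applied to a single record. It assumes that masking and POS replacement are mutually exclusive branches of one random draw and that shortening keeps a random contiguous n-gram of at most five tokens. The names `augment_record`, `pos_map`, and `max_ngram` are hypothetical and do not come from the `base` module.

```python
import random

MASK_TOKEN = "[MASK]"


def augment_record(tokens, tags, pos_map, p_mask, p_pos, p_ng, rng, max_ngram=5):
    """Return an augmented copy of `tokens` (hypothetical helper, not base's code)."""
    out = []
    for token, tag in zip(tokens, tags):
        x = rng.random()
        if x < p_mask:
            # Mask the token.
            out.append(MASK_TOKEN)
        elif x < p_mask + p_pos:
            # Replace the token with a random word that shares its POS tag.
            candidates = pos_map.get(tag) or [token]
            out.append(rng.choice(candidates))
        else:
            # Keep the token unchanged.
            out.append(token)
    # Optionally shorten the record to a random contiguous n-gram.
    if out and rng.random() < p_ng:
        n = rng.randint(1, min(max_ngram, len(out)))
        start = rng.randint(0, len(out) - n)
        out = out[start:start + n]
    return out


rng = random.Random(0)
tokens = ["who", "wrote", "the", "book", "?"]
tags = ["WP", "VBD", "DT", "NN", "."]
pos_map = {"NN": ["book", "song", "city"], "VBD": ["wrote", "built"]}
augmented = augment_record(tokens, tags, pos_map, p_mask=0.1, p_pos=0.3, p_ng=0.2, rng=rng)
print(augmented)
```

Setting all three probabilities to zero returns the record unchanged, which is a convenient sanity check for any implementation of this scheme.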

Loading the dataset and a basic tokenizer, and applying the tokenizer to the dataset.

In [5]:
tokenizer = BasicTokenizer(do_lower_case=True)
DATASET = "trec"

In [6]:
train_data = load_from_disk(f"~/data/{DATASET}/train_fine")
sentences = list(map(lambda e: e["sentence"], train_data))

A look at record lengths measured in tokens: the 25 longest records, the mean length, and the 25 shortest records.

In [7]:
token_lengths = [len(tokenizer.tokenize(sentence)) for sentence in sentences]

In [8]:
sorted_token_lengths = sorted(token_lengths, reverse=True)
avg_tokens = np.mean(token_lengths)

In [9]:
print(sorted_token_lengths[0:25])
print(avg_tokens)
print(sorted_token_lengths[-25:])

[36, 35, 34, 33, 32, 31, 31, 30, 30, 30, 30, 29, 29, 28, 28, 28, 27, 27, 27, 27, 27, 27, 27, 26, 26]
10.765650080256822
[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3]


Obtaining the POS tags of the individual tokens in the records, as needed for the augmentation.

In [10]:
pos_tag_word_map_list = base.get_pos_tag_word_map(sentences, tokenizer=tokenizer)
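
`base.get_pos_tag_word_map` is a custom helper whose code is not included here. As a rough sketch of the idea, assuming the map groups words by POS tag so that a token can later be swapped for another word with the same tag, a pure-Python version over already tagged tokens might look like this. The function name `build_pos_word_map` and the toy tagged input are illustrative only; the real helper presumably runs a POS tagger first.

```python
from collections import defaultdict


def build_pos_word_map(tagged_sentences):
    """Group all seen words by their POS tag (hypothetical helper)."""
    pos_to_words = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            pos_to_words[tag].add(word)
    # Sorted lists make the result deterministic and easy to sample from.
    return {tag: sorted(words) for tag, words in pos_to_words.items()}


tagged = [
    [("who", "WP"), ("wrote", "VBD"), ("hamlet", "NN"), ("?", ".")],
    [("who", "WP"), ("built", "VBD"), ("the", "DT"), ("bridge", "NN"), ("?", ".")],
]
pos_word_map = build_pos_word_map(tagged)
print(pos_word_map["NN"])  # ['bridge', 'hamlet']
```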

In [11]:
print(train_data.features)

{'sentence': Value(dtype='string', id=None), 'label': ClassLabel(names=['ABBR:abb', 'ABBR:exp', 'ENTY:animal', 'ENTY:body', 'ENTY:color', 'ENTY:cremat', 'ENTY:currency', 'ENTY:dismed', 'ENTY:event', 'ENTY:food', 'ENTY:instru', 'ENTY:lang', 'ENTY:letter', 'ENTY:other', 'ENTY:plant', 'ENTY:product', 'ENTY:religion', 'ENTY:sport', 'ENTY:substance', 'ENTY:symbol', 'ENTY:techmeth', 'ENTY:termeq', 'ENTY:veh', 'ENTY:word', 'DESC:def', 'DESC:desc', 'DESC:manner', 'DESC:reason', 'HUM:gr', 'HUM:ind', 'HUM:title', 'HUM:desc', 'LOC:city', 'LOC:country', 'LOC:mount', 'LOC:other', 'LOC:state', 'NUM:code', 'NUM:count', 'NUM:date', 'NUM:dist', 'NUM:money', 'NUM:ord', 'NUM:other', 'NUM:period', 'NUM:perc', 'NUM:speed', 'NUM:temp', 'NUM:volsize', 'NUM:weight'], id=None)}


Running the augmentation with the parameters described above.

In [12]:
augmented_datasets = base.get_augmented_dataset(augmentation_params, train_data, pos_tag_word_map_list, tokenizer=tokenizer, include_idx=False)

Converting the new records into Dataset objects.

In [13]:
aug_datasets_formated = []
ds_schema = Features({
    "sentence": Value("string"),
    "label": ClassLabel(names=["ABBR:abb", "ABBR:exp", "ENTY:animal", "ENTY:body", "ENTY:color", "ENTY:cremat", "ENTY:currency", "ENTY:dismed", "ENTY:event", "ENTY:food", "ENTY:instru", "ENTY:lang", "ENTY:letter", "ENTY:other", "ENTY:plant", "ENTY:product", "ENTY:religion", "ENTY:sport", "ENTY:substance", "ENTY:symbol", "ENTY:techmeth", "ENTY:termeq", "ENTY:veh", "ENTY:word", "DESC:def", "DESC:desc", "DESC:manner", "DESC:reason", "HUM:gr", "HUM:ind", "HUM:title", "HUM:desc", "LOC:city", "LOC:country", "LOC:mount", "LOC:other", "LOC:state", "NUM:code", "NUM:count", "NUM:date", "NUM:dist", "NUM:money", "NUM:ord", "NUM:other", "NUM:period", "NUM:perc", "NUM:speed", "NUM:temp", "NUM:volsize", "NUM:weight"])
})

# Each element holds the columns of one augmented pass over the training data.
for aug_records in augmented_datasets:
    dataset = Dataset.from_dict(aug_records)
    dataset = dataset.cast(ds_schema)
    aug_datasets_formated.append(dataset)

Casting the dataset:   0%|          | 0/4361 [00:00<?, ? examples/s]

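Checking the effect of the augmentation. This record was not altered; the only differences are the lowercasing and the split hyphen introduced by the basic tokenizer.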

In [14]:
print(aug_datasets_formated[0][70])
print(train_data[70])

{'sentence': 'who does the advertizing for frito - lay ?', 'label': 29}
{'sentence': 'Who does the advertizing for Frito-Lay ?', 'label': 29}


Loading the teacher model and its tokenizer.

No ready-made model could be found for this variant and version of the dataset. The work therefore starts from a model fine-tuned on an older version of the dataset with fewer classes (47 instead of 50, as the loading message below shows). The teacher's classification head is replaced and the model is fine-tuned on the new version of the dataset.

In [15]:
tokenizer = BertTokenizer.from_pretrained("ndavid/autotrain-trec-fine-bert-739422530")
config = AutoConfig.from_pretrained("ndavid/autotrain-trec-fine-bert-739422530")
config.max_length = 20 
config.num_labels = 50
model = BertForSequenceClassification.from_pretrained("ndavid/autotrain-trec-fine-bert-739422530", config=config, ignore_mismatched_sizes=True)


model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ndavid/autotrain-trec-fine-bert-739422530 and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([47, 768]) in the checkpoint and torch.Size([50, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([47]) in the checkpoint and torch.Size([50]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

Loading the validation and test data.

In [16]:
eval_data = load_from_disk(f"~/data/{DATASET}/eval_fine")
test_data = load_from_disk(f"~/data/{DATASET}/test_fine")

Definition of the training parameters for fine-tuning the teacher.

In [17]:
training_args = base.get_training_args(output_dir=f"~/results/{DATASET}/teacher", logging_dir=f"~/logs/{DATASET}/teacher", lr=.00005, epochs=5, batch_size=128)

Preprocessing all parts of the dataset for use in training.

In [18]:
train_data = train_data.map(lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length", return_tensors="pt", max_length=60), batched=True, desc="Tokenizing the provided dataset")

In [19]:
eval_data = eval_data.map(lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length", return_tensors="pt", max_length=60), batched=True, desc="Tokenizing the provided dataset")

In [20]:
test_data = test_data.map(lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length", return_tensors="pt", max_length=60), batched=True, desc="Tokenizing the provided dataset")
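
The arguments `truncation=True, padding="max_length", max_length=60` make every encoded record exactly 60 ids long. The sketch below illustrates that behaviour in plain Python, assuming right-side padding with pad id 0 (consistent with `padding_idx=0` in the model printout above). `pad_or_truncate` is a hypothetical helper and the ids are arbitrary example values; the real tokenizer additionally keeps the closing special token when it truncates.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Cut or right-pad `input_ids` to `max_length` and build the attention mask."""
    ids = list(input_ids[:max_length])
    attention_mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad


ids, mask = pad_or_truncate([101, 2040, 2515, 102], max_length=8)
print(ids)   # [101, 2040, 2515, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Since the longest training record has 36 basic tokens, a limit of 60 leaves some headroom for WordPiece splitting, although this notebook does not verify that no record is truncated.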

In [21]:
base.reset_seed()

Definition of the trainer for fine-tuning.

In [22]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    compute_metrics=base.compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
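
The `early_stopping_patience=2` setting stops training once the monitored metric has failed to improve for two consecutive evaluations. The following simplified sketch shows that rule for a list of validation losses. It is an illustration only: `early_stopping_step` is a hypothetical function, and the real callback tracks whichever metric is configured in the training arguments and also supports an improvement threshold.

```python
def early_stopping_step(eval_losses, patience):
    """Return the 1-based evaluation at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for step, loss in enumerate(eval_losses, start=1):
        if loss < best:
            # Improvement: remember the new best value and reset the counter.
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return step
    return None


# A steadily decreasing loss never triggers the stop.
print(early_stopping_step([1.14, 0.59, 0.40, 0.34, 0.32], patience=2))  # None
# Two evaluations in a row without a new best value trigger it.
print(early_stopping_step([0.50, 0.45, 0.47, 0.46, 0.44], patience=2))  # 4
```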

Running the training; results on the validation split.

In [23]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,2.2062,1.140848,0.875344,0.57643,0.553421,0.543947
2,0.8651,0.588864,0.947754,0.740086,0.737085,0.732887
3,0.4786,0.403095,0.960587,0.840398,0.801243,0.809344
4,0.3198,0.336224,0.96517,0.889665,0.856616,0.865845
5,0.2655,0.31657,0.96517,0.910953,0.867303,0.880207


TrainOutput(global_step=175, training_loss=0.8270321982247489, metrics={'train_runtime': 106.8884, 'train_samples_per_second': 203.998, 'train_steps_per_second': 1.637, 'total_flos': 672610442691600.0, 'train_loss': 0.8270321982247489, 'epoch': 5.0})

Switching the teacher to evaluation mode.

In [24]:
model.eval()

Verifying the results on the test split.

In [25]:
trainer.evaluate(test_data)

{'eval_loss': 0.45101335644721985,
 'eval_accuracy': 0.944,
 'eval_precision': 0.9311251389444667,
 'eval_recall': 0.9126465626901167,
 'eval_f1': 0.9066237863824048,
 'eval_runtime': 3.3139,
 'eval_samples_per_second': 150.879,
 'eval_steps_per_second': 1.207,
 'epoch': 5.0}

Saving the teacher.

In [26]:
torch.save(model.state_dict(), f"{os.path.expanduser('~')}/models/{DATASET}/teacher_fine.pth")

Preprocessing the training split and computing its logits with the teacher model. The computed logits are added as a new dataset column, while the columns left over from the teacher's tokenizer, which are no longer needed, are removed.

In [27]:
train_dataset = base.prepare_dataset(train_data, tokenizer)
train_dataloader = DataLoader(train_dataset, batch_size=128, shuffle=False)
train_logits = base.generate_logits(train_dataloader, model)
train_dataset = train_dataset.add_column("logits", train_logits)
train_dataset = train_dataset.remove_columns(["token_type_ids", "attention_mask", "input_ids"])
train_dataset.set_format(type="torch", columns=["logits", "labels"], device="cpu")

Generating logits for given dataset:   0%|          | 0/35 [00:00<?, ?it/s]

Computing the accuracy of the teacher's predictions on the training split.

In [28]:
print(base.check_acc(train_dataset, "Accuracy for base dataset: "))

Calculating accuracy based on the saved logits:   0%|          | 0/4361 [00:00<?, ?it/s]

Accuracy for base dataset:  0.9924329282274708
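
`base.check_acc` is not defined in this notebook. Judging by its progress message, it derives accuracy from the stored logits, which amounts to comparing the argmax of each logit vector with the label. A minimal pure-Python sketch under that assumption (`accuracy_from_logits` is a hypothetical name, and the toy logits are made up):

```python
def accuracy_from_logits(logits, labels):
    """Share of records whose highest logit sits at the labelled class."""
    correct = 0
    for row, label in zip(logits, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [0.0, 3.0, 0.5], [1.0, 0.9, 0.8]]
labels = [0, 2, 1, 2]
print(accuracy_from_logits(logits, labels))  # 0.75, three of four predictions match
```

The value reported above is consistent with 4328 correct predictions out of 4361 records.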


The same logit computation is carried out for each of the twenty augmented variants of the dataset, one variant at a time. Each variant is first tokenized and then passed to the teacher for inference to obtain the logits, which are stored with the dataset. The tokenization leftovers that are no longer needed are then removed. During this processing the augmented records are also filtered using the mechanism described in the notebook precompute_logits_10.ipynb.

The datasets processed in this way are gradually merged into a single collection.
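
The filtering mechanism itself is described in another notebook. Based on the name `remove_diff_pred_class` and its progress message, a plausible reading is that an augmented record is kept only if the teacher predicts the same class for it as for the original record. The sketch below encodes that reading under two assumptions: records are aligned by position, and the comparison uses the argmax of the stored logits. `keep_same_prediction` and the toy records are hypothetical.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=values.__getitem__)


def keep_same_prediction(base_records, aug_records):
    """Keep augmented records whose predicted class matches the original's."""
    kept = []
    for base_rec, aug_rec in zip(base_records, aug_records):
        if argmax(base_rec["logits"]) == argmax(aug_rec["logits"]):
            kept.append(aug_rec)
    return kept


base_records = [
    {"sentence": "Who wrote Hamlet ?", "labels": 1, "logits": [0.1, 2.5, -0.3]},
    {"sentence": "Where is Prague ?", "labels": 2, "logits": [0.0, 0.2, 3.1]},
]
aug_records = [
    {"sentence": "who wrote [MASK] ?", "labels": 1, "logits": [0.3, 1.9, 0.1]},
    {"sentence": "[MASK] is prague", "labels": 2, "logits": [1.2, 0.4, 0.9]},
]
filtered = keep_same_prediction(base_records, aug_records)
print([record["sentence"] for record in filtered])  # ['who wrote [MASK] ?']
```

Under this reading, the accuracy of a filtered variant should land close to the teacher's accuracy on the original training split, because only records on which the teacher keeps its original prediction survive.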

In [29]:
aug_clean_datasets = []
for dataset in tqdm(aug_datasets_formated, total=(len(aug_datasets_formated)), desc="Processing augmented datasets: "):
    aug_train_dataset = base.prepare_dataset(dataset, tokenizer)
    aug_train_dataloader = DataLoader(aug_train_dataset, batch_size=128, shuffle=False)
    aug_train_logits = base.generate_logits(aug_train_dataloader, model)
    aug_train_dataset = aug_train_dataset.add_column("logits", aug_train_logits)
    aug_train_dataset = aug_train_dataset.remove_columns(["token_type_ids", "attention_mask", "input_ids"])
    aug_train_dataset.set_format(type="torch", columns=["logits", "labels"], device="cpu")

    print(base.check_acc(aug_train_dataset, "Accuracy for augmented dataset: "))

    aug_train_dataset = base.remove_diff_pred_class(train_dataset, aug_train_dataset, pytorch_dataset=False)
    
    print(base.check_acc(aug_train_dataset, "Accuracy for filtered dataset: "))

    aug_train_dataset.reset_format()
    aug_clean_datasets.extend(aug_train_dataset)

Processing augmented datasets: 20 iterations, each over 4361 records in 35 batches.

Iteration,Accuracy for augmented dataset,Records kept after filtering,Accuracy for filtered dataset
1,0.7145150194909424,3128,0.9952046035805626
2,0.7113047466177482,3119,0.9939083039435717
3,0.7179545975693649,3145,0.9955484896661367
4,0.7101582205916074,3116,0.9929396662387676
5,0.7122219674386608,3124,0.9939180537772087
6,0.7083237789497822,3108,0.9932432432432432
7,0.7014446227929374,3073,0.9944679466319557
8,0.7138271038752579,3131,0.9936122644522517
9,0.6996101811511122,3067,0.9944571242256276
10,0.7142857142857143,3128,0.9955242966751918
11,0.7232286172896125,3169,0.9949510886715052
12,0.6966292134831461,3047,0.9963898916967509
13,0.7152029351066269,3137,0.9939432578897035
14,0.7012153175877093,3074,0.994469746258946
15,0.7113047466177482,3117,0.9951876804619827
16,0.6975464343040587,3062,0.992815153494448
17,0.7119926622334327,3117,0.9958293230670516
18,0.7012153175877093,3073,0.9941425317279532
19,0.700986012382481,3079,0.9928548229944787
20,0.7170373767484521,3144,0.9942748091603053


In [30]:
print(train_dataset.features)

{'sentence': Value(dtype='string', id=None), 'labels': ClassLabel(names=['ABBR:abb', 'ABBR:exp', 'ENTY:animal', 'ENTY:body', 'ENTY:color', 'ENTY:cremat', 'ENTY:currency', 'ENTY:dismed', 'ENTY:event', 'ENTY:food', 'ENTY:instru', 'ENTY:lang', 'ENTY:letter', 'ENTY:other', 'ENTY:plant', 'ENTY:product', 'ENTY:religion', 'ENTY:sport', 'ENTY:substance', 'ENTY:symbol', 'ENTY:techmeth', 'ENTY:termeq', 'ENTY:veh', 'ENTY:word', 'DESC:def', 'DESC:desc', 'DESC:manner', 'DESC:reason', 'HUM:gr', 'HUM:ind', 'HUM:title', 'HUM:desc', 'LOC:city', 'LOC:country', 'LOC:mount', 'LOC:other', 'LOC:state', 'NUM:code', 'NUM:count', 'NUM:date', 'NUM:dist', 'NUM:money', 'NUM:ord', 'NUM:other', 'NUM:period', 'NUM:perc', 'NUM:speed', 'NUM:temp', 'NUM:volsize', 'NUM:weight'], id=None), 'logits': Sequence(feature=Value(dtype='float32', id=None), length=-1, id=None)}


Converting the merged and processed augmented data into a Dataset object.

In [31]:
ds_schema = Features({
    "sentence": Value("string"),
    "labels": ClassLabel(names=["ABBR:abb", "ABBR:exp", "ENTY:animal", "ENTY:body", "ENTY:color", "ENTY:cremat", "ENTY:currency", "ENTY:dismed", "ENTY:event", "ENTY:food", "ENTY:instru", "ENTY:lang", "ENTY:letter", "ENTY:other", "ENTY:plant", "ENTY:product", "ENTY:religion", "ENTY:sport", "ENTY:substance", "ENTY:symbol", "ENTY:techmeth", "ENTY:termeq", "ENTY:veh", "ENTY:word", "DESC:def", "DESC:desc", "DESC:manner", "DESC:reason", "HUM:gr", "HUM:ind", "HUM:title", "HUM:desc", "LOC:city", "LOC:country", "LOC:mount", "LOC:other", "LOC:state", "NUM:code", "NUM:count", "NUM:date", "NUM:dist", "NUM:money", "NUM:ord", "NUM:other", "NUM:period", "NUM:perc", "NUM:speed", "NUM:temp", "NUM:volsize", "NUM:weight"]),
    "logits": Sequence(feature=Value(dtype="float32")),
})

aug_dataset = Dataset.from_list(aug_clean_datasets)
aug_dataset = aug_dataset.cast(ds_schema)


Casting the dataset:   0%|          | 0/62158 [00:00<?, ? examples/s]

In [32]:
aug_dataset.set_format(type="torch", columns=["logits", "labels"], device="cpu")
train_dataset.set_format(type="torch", columns=["logits", "labels"], device="cpu")

Computing accuracy on the processed datasets.

In [33]:
print(base.check_acc(train_dataset, "Accuracy for base dataset: "))
print(base.check_acc(aug_dataset, "Accuracy for augmented dataset: "))

Calculating accuracy based on the saved logits:   0%|          | 0/4361 [00:00<?, ?it/s]

Accuracy for base dataset:  0.9924329282274708


Calculating accuracy based on the saved logits:   0%|          | 0/62158 [00:00<?, ?it/s]

Accuracy for augmented dataset:  0.9943852762315389


Concatenating the original and the augmented dataset.

In [34]:
train_all_data = concatenate_datasets([train_dataset, aug_dataset])
train_all_data.set_format(type="torch", columns=["logits", "labels"], device="cpu")

Computing accuracy on this combination.

In [35]:
print(base.check_acc(train_all_data, "Accuracy for combined dataset: "))

Calculating accuracy based on the saved logits:   0%|          | 0/66519 [00:00<?, ?it/s]

Accuracy for combined dataset:  0.994257279874923
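
The combined accuracy is simply the size-weighted average of the two parts. As a quick cross-check, the reported accuracies and record counts can be turned back into counts of correct predictions; the variable names below are introduced only for this check.

```python
# Record counts and accuracies as reported by the cells above.
base_n, base_acc = 4361, 0.9924329282274708
aug_n, aug_acc = 62158, 0.9943852762315389

# Recover the number of correct predictions in each part.
base_correct = round(base_n * base_acc)
aug_correct = round(aug_n * aug_acc)

# Accuracy of the concatenated dataset.
combined_acc = (base_correct + aug_correct) / (base_n + aug_n)
print(base_correct, aug_correct, base_n + aug_n)
print(combined_acc)
```

This gives 4328 and 61809 correct predictions out of 66519 records in total, which agrees with the combined accuracy printed by the notebook.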


In [36]:
print(train_all_data.column_names)

['sentence', 'labels', 'logits']


In [37]:
train_all_data.reset_format()

Saving the processed datasets to disk.

In [38]:
train_all_data.save_to_disk(f"~/data/{DATASET}/train-logits-augmented_fine")

Saving the dataset (0/1 shards):   0%|          | 0/66519 [00:00<?, ? examples/s]

In [39]:
train_dataset.reset_format()
train_dataset.save_to_disk(f"~/data/{DATASET}/train-logits_fine")

Saving the dataset (0/1 shards):   0%|          | 0/4361 [00:00<?, ? examples/s]

Loading the remaining parts of the dataset (validation and test). The logits for each of these parts are computed in the same way as for the training split.

First, each part is tokenized with the teacher's tokenizer and wrapped in a DataLoader.

In [40]:
eval_data = load_from_disk(f"~/data/{DATASET}/eval_fine")

eval_dataset = base.prepare_dataset(eval_data, tokenizer)
eval_dataloader = DataLoader(eval_dataset, batch_size=128, shuffle=False)

In [41]:
test_data = load_from_disk(f"~/data/{DATASET}/test_fine")

test_dataset = base.prepare_dataset(test_data, tokenizer)
test_dataloader = DataLoader(test_dataset, batch_size=128, shuffle=False)

Then the logits are computed for each part.

In [42]:
eval_logits = base.generate_logits(eval_dataloader, model)
test_logits = base.generate_logits(test_dataloader, model)

Generating logits for given dataset:   0%|          | 0/9 [00:00<?, ?it/s]

Generating logits for given dataset:   0%|          | 0/4 [00:00<?, ?it/s]

The logits are added as a new column, and the tokenization leftovers that are no longer needed are removed.

In [43]:
eval_dataset.reset_format()
eval_dataset = eval_dataset.add_column("logits", eval_logits)
eval_dataset = eval_dataset.remove_columns(["token_type_ids", "input_ids", "attention_mask"])

In [44]:
test_dataset.reset_format()
test_dataset = test_dataset.add_column("logits", test_logits)
test_dataset = test_dataset.remove_columns(["token_type_ids", "input_ids", "attention_mask"])

In [45]:
eval_dataset.save_to_disk(f"~/data/{DATASET}/eval-logits_fine")
test_dataset.save_to_disk(f"~/data/{DATASET}/test-logits_fine")

Saving the dataset (0/1 shards):   0%|          | 0/1091 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/500 [00:00<?, ? examples/s]

Loading the saved data back from disk and computing the accuracy of the predictions.

In [46]:
eval_data = load_from_disk(f"~/data/{DATASET}/eval-logits_fine")
test_data = load_from_disk(f"~/data/{DATASET}/test-logits_fine")

eval_data.set_format(type="torch", columns=["logits", "labels"], device="cpu")
test_data.set_format(type="torch", columns=["logits", "labels"], device="cpu")

print(base.check_acc(eval_data, "Accuracy for base eval dataset: "))
print(base.check_acc(test_data, "Accuracy for base test dataset: "))

Calculating accuracy based on the saved logits:   0%|          | 0/1091 [00:00<?, ?it/s]

Accuracy for base eval dataset:  0.9651695692025665


Calculating accuracy based on the saved logits:   0%|          | 0/500 [00:00<?, ?it/s]

Accuracy for base test dataset:  0.944


Computing the performance metrics and the size of the teacher model.

In [47]:
train_dataset = base.prepare_dataset(train_data, tokenizer)

train_data_gpu = copy.deepcopy(train_dataset)
train_data_gpu.set_format(type="torch", columns=["input_ids", "attention_mask"], device="cuda")
gpu_data_loader = DataLoader(train_data_gpu, batch_size=1, shuffle=False)

train_data_cpu = copy.deepcopy(train_dataset)
train_data_cpu.set_format(type="torch", columns=["input_ids", "attention_mask"], device="cpu")
cpu_data_loader = DataLoader(train_data_cpu, batch_size=1, shuffle=False)

In [48]:
base.count_parameters(model)

model size: 417.796MB.
Total Trainable Params: 109520690.


Unnamed: 0,Modules,Parameters
0,bert.embeddings.word_embeddings.weight,23440896
1,bert.embeddings.position_embeddings.weight,393216
2,bert.embeddings.token_type_embeddings.weight,1536
3,bert.embeddings.LayerNorm.weight,768
4,bert.embeddings.LayerNorm.bias,768
...,...,...
196,bert.encoder.layer.11.output.LayerNorm.bias,768
197,bert.pooler.dense.weight,589824
198,bert.pooler.dense.bias,768
199,classifier.weight,38400
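
The reported size can be approximated from the parameter count alone. The check below assumes 32-bit floating point weights and that "MB" means 1024² bytes; how `base.count_parameters` actually measures the size is not shown in this notebook.

```python
n_params = 109_520_690   # total trainable parameters reported above
bytes_per_param = 4      # assumption: float32 weights

size_mb = n_params * bytes_per_param / 1024**2
print(f"{size_mb:.3f} MB")

# The new 50-class head: weight matrix (768 x 50) plus bias (50).
classifier_params = 768 * 50 + 50
print(classifier_params)
```

The parameters alone account for roughly 417.79 MB. The small remaining gap to the reported 417.796 MB would be consistent with non-parameter buffers being included in the measurement, but that is a guess.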


In [49]:
cpu_benchmark = base.BenchMarkRunner(model, cpu_data_loader, "cpu", 1000)
print(cpu_benchmark.run_benchmark())

<torch.utils.benchmark.utils.common.Measurement object at 0x7ecbd3d0beb0>
self.infer_speed_comp()
  57.33 ms
  1 measurement, 1000 runs , 6 threads


In [50]:
gpu_benchmark = base.BenchMarkRunner(model, gpu_data_loader, "cuda", 1000)
print(gpu_benchmark.run_benchmark())

<torch.utils.benchmark.utils.common.Measurement object at 0x7ecbd3d18ee0>
self.infer_speed_comp()
  7.95 ms
  1 measurement, 1000 runs , 6 threads
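
To put the two measurements side by side, the timings can be converted into a speedup factor and an approximate throughput. This assumes that each reported time is the mean latency of one forward pass on a single record (the data loaders use `batch_size=1`); the details of `base.BenchMarkRunner` are not shown here.

```python
cpu_ms, gpu_ms = 57.33, 7.95  # mean latencies reported by the two benchmarks

speedup = cpu_ms / gpu_ms
cpu_throughput = 1000 / cpu_ms  # records per second on CPU
gpu_throughput = 1000 / gpu_ms  # records per second on GPU

print(f"GPU speedup: {speedup:.2f}x")
print(f"CPU throughput: {cpu_throughput:.1f} records/s")
print(f"GPU throughput: {gpu_throughput:.1f} records/s")
```

Under these assumptions the GPU is roughly seven times faster than the CPU for single-record inference.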


In [53]:
base.get_scores(test_data)

F1 score: 0.9066237863824048
Accuracy: 0.944
Precision: 0.9311251389444667
Recall: 0.9126465626901167
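
The reported F1 score is lower than both precision and recall, which cannot happen for a single harmonic mean of the two. It is consistent with macro averaging, where each metric is computed per class and then averaged. `base.get_scores` is not shown, so the following is only a sketch of macro averaging under that assumption. It averages over all `n_classes` and scores undefined ratios as 0; library implementations may treat classes that never occur differently.

```python
def macro_scores(y_true, y_pred, n_classes):
    """Macro-averaged precision, recall and F1 (undefined ratios count as 0)."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for c in range(n_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return (sum(precisions) / n_classes,
            sum(recalls) / n_classes,
            sum(f1s) / n_classes)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 1]
precision, recall, f1 = macro_scores(y_true, y_pred, n_classes=3)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

In this toy example the macro F1 (about 0.656) also falls below both macro precision (about 0.722) and macro recall (about 0.667), mirroring the pattern in the teacher's scores.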


