
## Set-up environment

First, we install the libraries which we'll use: HuggingFace Transformers and Datasets.

In [1]:
# !pip install -q transformers datasets

In [2]:
# !pip install -q --upgrade numpy datasets transformers


## Load dataset

In [3]:
from datasets import load_dataset

# Path to your JSON file
json_file_path = 'data.json'

# Load the dataset
dataset = load_dataset('json', data_files=json_file_path, field = 'train')
validation_dataset = load_dataset('json', data_files = json_file_path, field = 'validation')
test_dataset = load_dataset('json', data_files = json_file_path, field = 'test')
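The `field` arguments above suggest that `data.json` holds one top-level key per split, each containing a list of records. The file itself is not shown, so the sketch below is only an assumed miniature of that layout (the article texts and the two label columns are made up for illustration), read back with the standard `json` module rather than with `load_dataset`.

```python
import json
import os
import tempfile

# Hypothetical miniature of data.json: one top-level key per split, each
# holding a list of records with an "Article" text and 0/1 label columns.
toy = {
    "train": [
        {"Article": "Sample report text A", "t1566phishing": 1, "t1055processinjection": 0},
        {"Article": "Sample report text B", "t1566phishing": 0, "t1055processinjection": 1},
    ],
    "validation": [
        {"Article": "Sample report text C", "t1566phishing": 1, "t1055processinjection": 1},
    ],
    "test": [
        {"Article": "Sample report text D", "t1566phishing": 0, "t1055processinjection": 0},
    ],
}

# Write the toy file to a temporary directory and read it back.
path = os.path.join(tempfile.mkdtemp(), "toy_data.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(toy, f)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

# Number of records per split, and the label columns (everything except "Article").
split_sizes = {split: len(records) for split, records in loaded.items()}
label_columns = [key for key in loaded["train"][0] if key != "Article"]
print(split_sizes)
print(label_columns)
```

With a layout like this, each `load_dataset('json', ..., field=...)` call picks out one of the three lists.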



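Each `load_dataset` call above reads one top-level field of the JSON file, so we end up with three separate `DatasetDict` objects: one for training, one for validation and one for testing. Because each of them was loaded from a single file, its only split is named `train`, which is why the validation and test data are accessed as `validation_dataset['train']` and `test_dataset['train']` below.

The dataset consists of articles (the `Article` column), each labeled with one or more techniques. Judging by the label names printed below, these are MITRE ATT&CK techniques.

Let's create a list that contains the labels, as well as 2 dictionaries that map labels to integers and back.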

In [4]:
labels = [label for label in dataset['train'].features.keys() if label not in ['Article']]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels

['t1497virtualization/sandboxevasion',
 't1082systeminformationdiscovery',
 't1059commandandscriptinginterpreter',
 't1486dataencryptedforimpact',
 't1105ingresstooltransfer',
 't1021remoteservices',
 't0814denialofservice',
 't1562impairdefenses',
 't1055processinjection',
 't1566phishing',
 't1003oscredentialdumping',
 't1027obfuscatedfilesorinformation',
 't1018remotesystemdiscovery',
 't1047windowsmanagementinstrumentation',
 't1053scheduledtask/job']

## Preprocess data

In [5]:
from transformers import AutoTokenizer
import numpy as np

model_id="google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)


def preprocess_data(examples):
  # take a batch of texts
  text = examples["Article"]
  # encode them
  encoding = tokenizer(text, padding="max_length", truncation=True, max_length=1024)
  # add labels
  labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
  # create numpy array of shape (batch_size, num_labels)
  labels_matrix = np.zeros((len(text), len(labels)))
  # fill numpy array
  for idx, label in enumerate(labels):
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()

  return encoding
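The label-matrix step in `preprocess_data` turns column-wise labels (one list per label name) into row-wise multi-hot vectors (one list per article). A minimal pure-Python sketch of that transposition, using a made-up two-article batch and a hypothetical helper `build_label_matrix`, might look like this:

```python
# Three label names taken from the list above; the batch values are invented.
labels = ["t1566phishing", "t1055processinjection", "t1486dataencryptedforimpact"]

batch = {
    "Article": ["report one", "report two"],
    "t1566phishing": [1, 0],
    "t1055processinjection": [0, 0],
    "t1486dataencryptedforimpact": [1, 1],
}


def build_label_matrix(batch, labels):
    """Return one multi-hot row per article, with columns ordered like `labels`."""
    n_rows = len(batch["Article"])
    matrix = [[0.0] * len(labels) for _ in range(n_rows)]
    for col, label in enumerate(labels):
        for row, value in enumerate(batch[label]):
            matrix[row][col] = float(value)
    return matrix


label_matrix = build_label_matrix(batch, labels)
print(label_matrix)
```

The floats matter: `BCEWithLogitsLoss` expects floating-point targets, which is why the notebook stores `0.0`/`1.0` rather than integers.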

In [6]:
encoded_dataset = dataset.map(preprocess_data, batched=True, remove_columns=dataset['train'].column_names)
encoded_validation = validation_dataset.map(preprocess_data, batched= True, remove_columns = validation_dataset['train'].column_names)
encoded_test = test_dataset.map(preprocess_data, batched= True, remove_columns = test_dataset['train'].column_names)


Map:   0%|          | 0/431 [00:00<?, ? examples/s]

In [7]:
example = encoded_dataset['train'][100]
print(example.keys())

dict_keys(['input_ids', 'attention_mask', 'labels'])


In [8]:
tokenizer.decode(example['input_ids'])

'For the last few months, the Online Top Twenty has contained an unusually large number of Trojan dialers. They reached their peak in January, with five such programs in the rankings, and Diamin.fc in first place. The situation took a surprising turn in February: Diamin.fc dropped off the bottom of the table, and only Dialer.cj, which led the rankings in December 2006, was left.Email worms, on the other hand, appear to be very active. In addition to Rays and Brontok, which have become something of a fixture in the online ratings, Mydoom.m has returned in first place. New worms such as Warezov.lk and Warezov.ls have also put in an appearance. It’s interesting that no Zhelatin variants showed up in the Online statistics, as they occupied a significant proportion of our mail traffic statistics. This may partly be due to the fact that Zhelatin epidemics were mostly cut off at mail server level, meaning that a relatively small number of infected emails actually reached end users.The combina

In [9]:
example['labels']

[0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0]

In [10]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]

['t1082systeminformationdiscovery',
 't1059commandandscriptinginterpreter',
 't1566phishing',
 't1027obfuscatedfilesorinformation']

Finally, we set the format of our data to PyTorch tensors. This will turn the training, validation and test sets into standard PyTorch [datasets](https://pytorch.org/docs/stable/data.html).

In [11]:
encoded_dataset.set_format(type='torch')
encoded_validation.set_format("torch")
encoded_test.set_format("torch")

In [12]:
!huggingface-cli login --token $secret_hf


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/ashraful/.cache/huggingface/token
Login successful


## Define model

Here we define a model that consists of a pre-trained base with a randomly initialized classification head on top. The head should be fine-tuned together with the pre-trained base on a labeled dataset.

The warning printed when the model is loaded says the same: the `classification_head` weights are newly initialized.

We set the `problem_type` to be "multi_label_classification", as this will make sure the appropriate loss function is used (namely [`BCEWithLogitsLoss`](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)). We also make sure the output layer has `len(labels)` output neurons, and we set the id2label and label2id mappings.
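To make the loss concrete, here is a small pure-Python sketch of binary cross-entropy on logits, averaged over all labels (the default `mean` reduction of `BCEWithLogitsLoss`). The function name `bce_with_logits` and the example values are illustrative, not part of the notebook.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent labels, computed from raw logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)),
    which equals -[y*log(sigmoid(x)) + (1-y)*log(1 - sigmoid(x))].
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# An undecided model (all logits 0) pays log(2) per label, whatever the targets are.
undecided = bce_with_logits([0.0, 0.0, 0.0], [1.0, 0.0, 1.0])

# Confident and correct predictions give a loss close to 0.
confident = bce_with_logits([5.0, -5.0], [1.0, 0.0])

print(undecided, confident)
```

Each label is scored independently, which is what allows an article to carry several techniques at once. The `log(2) ≈ 0.693` baseline is also a useful sanity check for the loss of a freshly initialized head.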

In [13]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_id,
                                                           problem_type="multi_label_classification",
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

Some weights of T5ForSequenceClassification were not initialized from the model checkpoint at google/flan-t5-base and are newly initialized: ['classification_head.dense.bias', 'classification_head.dense.weight', 'classification_head.out_proj.bias', 'classification_head.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Train the model!

We are going to train the model using HuggingFace's Trainer API. This requires us to define 2 things:

* `TrainingArguments`, which specify training hyperparameters. All options can be found in the [docs](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments). Below we specify, for example, that we want to evaluate and save the model after every epoch, and we set the learning rate, the batch size for training and evaluation, the number of epochs to train for, and so on.
* a `Trainer` object (docs can be found [here](https://huggingface.co/transformers/main_classes/trainer.html#id1)).

In [14]:
batch_size = 1
metric_name = "f1"

In [15]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    "flant5-finetuned-ttp-e20",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=20,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=True,
)

We are also going to compute metrics while training. For this, we need to define a `compute_metrics` function, that returns a dictionary with the desired metric values.

In [16]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction
import torch

# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    # note: ROC AUC is computed here on the thresholded predictions, not on the probabilities
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    accuracy = accuracy_score(y_true, y_pred)
    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions,
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds,
        labels=p.label_ids)
    return result
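The two headline metrics behave very differently in a multi-label setting: micro-averaged F1 counts individual label decisions, whereas `accuracy_score` on multi-label input is subset accuracy and only rewards rows where every label matches. The following pure-Python sketch reimplements both on a made-up batch (all names and numbers here are illustrative; the notebook itself uses scikit-learn).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binarize(logit_rows, threshold=0.5):
    """Apply sigmoid and threshold to every logit, as multi_label_metrics does."""
    return [[1 if sigmoid(x) >= threshold else 0 for x in row] for row in logit_rows]


def micro_f1(y_true, y_pred):
    """F1 over all (example, label) decisions pooled together."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += t == 1 and p == 1
            fp += t == 0 and p == 1
            fn += t == 1 and p == 0
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def subset_accuracy(y_true, y_pred):
    """Fraction of examples whose whole label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy batch: 3 examples, 3 labels.
logits = [[2.0, -1.0, 0.5], [-0.3, 1.2, -2.0], [0.1, 0.4, -0.7]]
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]

y_pred = binarize(logits)
f1 = micro_f1(y_true, y_pred)
acc = subset_accuracy(y_true, y_pred)
print(y_pred, f1, acc)
```

On this toy batch most individual labels are right (F1 of 0.8), yet only one of three rows matches exactly. The same effect explains why accuracy can stay far below F1 when there are 15 labels per article.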

Let's verify a batch as well as a forward pass:

In [17]:
# pip install -q numpy==1.19.5


In [18]:
# %reset

In [19]:
encoded_dataset['train']['input_ids']

tensor([[ 9825,  2900,    10,  ...,     0,     0,     0],
        [ 3054,   911,     7,  ...,     0,     0,     0],
        [   86,     3,     9,  ...,    15,  2157,     1],
        ...,
        [   37,  9738, 16837,  ...,     6,   305,     1],
        [30980, 10485,   641,  ...,   492,    34,     1],
        [   37,  6025,    13,  ..., 12734,     3,     1]])

In [20]:
#forward pass
outputs = model(input_ids=encoded_dataset['train']['input_ids'][0].unsqueeze(0), labels=encoded_dataset['train'][0]['labels'].unsqueeze(0))
outputs

Seq2SeqSequenceClassifierOutput(loss=tensor(0.6712, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[-0.1896,  0.0111, -0.3293, -0.3833, -0.2918,  0.0198,  0.7272, -0.0811,
          0.0362,  0.4742, -0.0155,  0.2663, -0.1751,  0.3498,  0.0397]],
       grad_fn=<AddmmBackward0>), past_key_values=None, decoder_hidden_states=None, decoder_attentions=None, cross_attentions=None, encoder_last_hidden_state=tensor([[[-0.0105,  0.1539, -0.0941,  ...,  0.1313,  0.0248, -0.1061],
         [ 0.0447,  0.0412, -0.1369,  ...,  0.1066,  0.1846, -0.1399],
         [ 0.1515,  0.1309, -0.0670,  ..., -0.0065,  0.1653, -0.0459],
         ...,
         [ 0.0202,  0.0084, -0.1394,  ...,  0.1753, -0.1126,  0.1521],
         [ 0.0147,  0.0132, -0.1509,  ...,  0.1707, -0.1095,  0.1525],
         [ 0.0259,  0.0059, -0.1400,  ...,  0.1583, -0.0905,  0.1747]]],
       grad_fn=<MulBackward0>), encoder_hidden_states=None, encoder_attentions=None)
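To read the logits of this forward pass as predictions, one can apply the same sigmoid-and-threshold rule that `multi_label_metrics` uses. The sketch below copies the 15 logits and the label names from the outputs above; `predicted` is just an illustrative variable name. Since the classification head has not been trained yet, the resulting labels carry no meaning.

```python
import math

# Label names in the order printed earlier in the notebook.
labels = [
    "t1497virtualization/sandboxevasion", "t1082systeminformationdiscovery",
    "t1059commandandscriptinginterpreter", "t1486dataencryptedforimpact",
    "t1105ingresstooltransfer", "t1021remoteservices", "t0814denialofservice",
    "t1562impairdefenses", "t1055processinjection", "t1566phishing",
    "t1003oscredentialdumping", "t1027obfuscatedfilesorinformation",
    "t1018remotesystemdiscovery", "t1047windowsmanagementinstrumentation",
    "t1053scheduledtask/job",
]

# Logits copied from the forward-pass output above.
logits = [-0.1896, 0.0111, -0.3293, -0.3833, -0.2918, 0.0198, 0.7272, -0.0811,
          0.0362, 0.4742, -0.0155, 0.2663, -0.1751, 0.3498, 0.0397]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# A label is predicted when its probability reaches the 0.5 threshold,
# which is equivalent to its logit being non-negative.
predicted = [label for label, x in zip(labels, logits) if sigmoid(x) >= 0.5]
print(predicted)
```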

In [21]:

import os

# Make CUDA kernel launches synchronous so that errors point at the failing line.
# Set this before CUDA is initialized; otherwise it may have no effect.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# Request device-side assertions; this likely only helps if PyTorch was built with TORCH_USE_CUDA_DSA.
os.environ['TORCH_USE_CUDA_DSA'] = '1'



In [22]:
import gc
gc.collect()

20

Let's start training!

In [23]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_validation["train"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
[codecarbon INFO @ 00:13:53] [setup] RAM Tracking...
[codecarbon INFO @ 00:13:53] [setup] GPU Tracking...
[codecarbon INFO @ 00:13:53] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 00:13:53] [setup] CPU Tracking...
[codecarbon INFO @ 00:13:54] CPU Model on constant consumption mode: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
[codecarbon INFO @ 00:13:54] >>> Tracker's metadata:
[codecarbon INFO @ 00:13:54]   Platform system: Linux-6.5.0-26-generic-x86_64-with-glibc2.35
[codecarbon INFO @ 00:13:54]   Python version: 3.10.13
[codecarbon INFO @ 00:13:54]   CodeCarbon version: 2.2.3
[codecarbon INFO @ 00:13:54]   Available RAM : 15.545 GB
[codecarbon INFO @ 00:13:54]   CPU count: 12
[codecarbon INFO @ 00:13:54]   CPU model: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
[codecarbon INFO @ 00:13:54]   GPU count: 1
[codecarbon INFO @ 00:13:54]   GPU model: 1 x NVIDIA

In [24]:
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33manikshahrukhfahim[0m ([33mcti-ttp-g1[0m). Use [1m`wandb login --relogin`[0m to force relogin



| Epoch | Training Loss | Validation Loss | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|
| 1 | 0.4286 | 0.45635 | 0.225676 | 0.558106 | 0.097448 |
| 2 | 0.4198 | 0.428999 | 0.495726 | 0.670751 | 0.12065 |
| 3 | 0.3913 | 0.400648 | 0.489668 | 0.665183 | 0.146172 |
| 4 | 0.3397 | 0.427114 | 0.504758 | 0.67243 | 0.169374 |
| 5 | 0.3372 | 0.400076 | 0.561706 | 0.70646 | 0.178654 |
| 6 | 0.3132 | 0.412035 | 0.561067 | 0.705086 | 0.171694 |
| 7 | 0.3023 | 0.428796 | 0.573057 | 0.713148 | 0.178654 |
| 8 | 0.2713 | 0.433811 | 0.586153 | 0.722984 | 0.183295 |
| 9 | 0.2707 | 0.44664 | 0.590805 | 0.727819 | 0.178654 |
| 10 | 0.2524 | 0.461575 | 0.572942 | 0.713 | 0.178654 |
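With `load_best_model_at_end=True` and `metric_for_best_model` set to `"f1"`, the checkpoint with the highest validation F1 is the one kept at the end. The sketch below applies that selection rule to the ten epochs visible in the log above. It is only an illustration: the real `Trainer` compares all 20 evaluated checkpoints, and epochs 11 to 20 are not shown here.

```python
# (epoch, validation F1) pairs copied from the training log above (epochs 1-10 only).
val_f1 = [
    (1, 0.225676), (2, 0.495726), (3, 0.489668), (4, 0.504758), (5, 0.561706),
    (6, 0.561067), (7, 0.573057), (8, 0.586153), (9, 0.590805), (10, 0.572942),
]

# Pick the epoch with the highest F1, mirroring "greater is better" selection.
best_epoch, best_f1 = max(val_f1, key=lambda pair: pair[1])
print(best_epoch, best_f1)
```

Among the rows shown, epoch 9 scores best. Note also that the training loss keeps falling while the validation loss rises after epoch 5, which is why selecting on a validation metric rather than on training loss is the safer choice.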



TrainOutput(global_step=40160, training_loss=0.26766943717857755, metrics={'train_runtime': 23420.1243, 'train_samples_per_second': 1.715, 'train_steps_per_second': 1.715, 'total_flos': 4.906006841622528e+16, 'train_loss': 0.26766943717857755, 'epoch': 20.0})
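The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (the log reports one GPU, and no accumulation setting appears in the `TrainingArguments`), so that one optimizer step corresponds to one batch of `batch_size = 1`.

```python
# Values copied from the TrainOutput above and from the training configuration.
global_step = 40160
num_train_epochs = 20
batch_size = 1
train_runtime_s = 23420.1243

# Assumption: single device, no gradient accumulation, so steps == batches.
steps_per_epoch = global_step // num_train_epochs
train_examples = steps_per_epoch * batch_size

samples_per_second = global_step * batch_size / train_runtime_s
runtime_hours = train_runtime_s / 3600

print(train_examples, round(samples_per_second, 3), round(runtime_hours, 1))
```

Under these assumptions the training split holds 2008 articles, and the computed throughput matches the reported `train_samples_per_second` of 1.715 over roughly 6.5 hours of training.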

## Evaluate

After training, we evaluate our model on the validation set.

In [25]:
trainer.evaluate()

{'eval_loss': 0.5327719449996948,
 'eval_f1': 0.6020170674941815,
 'eval_roc_auc': 0.7332355768182035,
 'eval_accuracy': 0.18793503480278423,
 'eval_runtime': 75.3818,
 'eval_samples_per_second': 5.718,
 'eval_steps_per_second': 5.718,
 'epoch': 20.0}