<a href="https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning BERT (and friends) for multi-label text classification

In this notebook, we are going to fine-tune BERT to predict one or more labels for a given piece of text. Note that this notebook illustrates how to fine-tune a bert-base-uncased model, but you can also fine-tune a RoBERTa, DeBERTa, DistilBERT, CANINE, ... checkpoint in the same way. 

All of those work in the same way: they add a linear layer on top of the base model, which is used to produce a tensor of shape (batch_size, num_labels), indicating the unnormalized scores for a number of labels for every example in the batch.
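To make these shapes concrete, here is a minimal NumPy sketch of such a classification head. The hidden size of 768 matches bert-base-uncased, but the batch size, the random `pooled_output`, and the weights `W` and `b` are illustrative placeholders rather than values from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, hidden_size, num_labels = 4, 768, 20

# Stand-in for the pooled sentence representation produced by the base model.
pooled_output = rng.normal(size=(batch_size, hidden_size))

# Randomly initialized linear classification head (placeholder weights).
W = rng.normal(scale=0.02, size=(num_labels, hidden_size))
b = np.zeros(num_labels)

# One unnormalized score (logit) per label for every example in the batch.
logits = pooled_output @ W.T + b
print(logits.shape)  # (4, 20)
```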



## Set-up environment

First, we install the libraries which we'll use: HuggingFace Transformers and Datasets.

In [1]:
#!pip install -q transformers datasets

## Load dataset

Next, let's download a multi-label text classification dataset from the [hub](https://huggingface.co/).

At the time of writing, I picked a random one as follows:   

* first, go to the "datasets" tab on huggingface.co
* next, select the "multi-label-classification" tag on the left as well as the "1k<10k" tag (to find a relatively small dataset).

Note that you can also easily load your local data (e.g. CSV, TXT, Parquet or JSON files), as explained [here](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files). This notebook takes that route: it reads two local TSV files with pandas and wraps them in a `DatasetDict`.



In [2]:
import random
import numpy as np
import torch

seed = 2022

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
  torch.cuda.manual_seed_all(seed)

In [3]:
import pandas as pd

df_args = pd.read_csv('../../../Data/arguments-training.tsv',sep = '\t')
df_lbls = pd.read_table('../../../Data/labels-training.tsv')
df = df_args.merge(df_lbls, how="left", on="Argument ID")

# print(df)

from datasets import Dataset, DatasetDict
train_dataset = Dataset.from_pandas(df, split="train")
dataset = DatasetDict({ "train": train_dataset, "validation": train_dataset, "test": train_dataset })

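The resulting `DatasetDict` contains 3 splits: one for training, one for validation and one for testing. Note that all three currently point to the same data; the actual separation into training and validation examples happens later, through cross-validation.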

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Argument ID', 'Conclusion', 'Stance', 'Premise', 'Self-direction: thought', 'Self-direction: action', 'Stimulation', 'Hedonism', 'Achievement', 'Power: dominance', 'Power: resources', 'Face', 'Security: personal', 'Security: societal', 'Tradition', 'Conformity: rules', 'Conformity: interpersonal', 'Humility', 'Benevolence: caring', 'Benevolence: dependability', 'Universalism: concern', 'Universalism: nature', 'Universalism: tolerance', 'Universalism: objectivity', '__index_level_0__'],
        num_rows: 5220
    })
    validation: Dataset({
        features: ['Argument ID', 'Conclusion', 'Stance', 'Premise', 'Self-direction: thought', 'Self-direction: action', 'Stimulation', 'Hedonism', 'Achievement', 'Power: dominance', 'Power: resources', 'Face', 'Security: personal', 'Security: societal', 'Tradition', 'Conformity: rules', 'Conformity: interpersonal', 'Humility', 'Benevolence: caring', 'Benevolence: dependability', 'Universalism:

Let's check the first example of the training split:

In [5]:
example = dataset['train'][0]
example

{'Argument ID': 'A01001',
 'Conclusion': 'Entrapment should be legalized',
 'Stance': 'in favor of',
 'Premise': "if entrapment can serve to more easily capture wanted criminals, then why shouldn't it be legal?",
 'Self-direction: thought': 0,
 'Self-direction: action': 0,
 'Stimulation': 0,
 'Hedonism': 0,
 'Achievement': 0,
 'Power: dominance': 0,
 'Power: resources': 0,
 'Face': 0,
 'Security: personal': 0,
 'Security: societal': 1,
 'Tradition': 0,
 'Conformity: rules': 0,
 'Conformity: interpersonal': 0,
 'Humility': 0,
 'Benevolence: caring': 0,
 'Benevolence: dependability': 0,
 'Universalism: concern': 0,
 'Universalism: nature': 0,
 'Universalism: tolerance': 0,
 'Universalism: objectivity': 0,
 '__index_level_0__': 0}

The dataset consists of arguments, labeled with one or more values. 

Let's create a list that contains the labels, as well as 2 dictionaries that map labels to integers and back.

In [6]:
labels = [label for label in dataset['train'].features.keys() if label not in ['Argument ID', 'Conclusion', 'Stance', 'Premise', '__index_level_0__']]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels

['Self-direction: thought',
 'Self-direction: action',
 'Stimulation',
 'Hedonism',
 'Achievement',
 'Power: dominance',
 'Power: resources',
 'Face',
 'Security: personal',
 'Security: societal',
 'Tradition',
 'Conformity: rules',
 'Conformity: interpersonal',
 'Humility',
 'Benevolence: caring',
 'Benevolence: dependability',
 'Universalism: concern',
 'Universalism: nature',
 'Universalism: tolerance',
 'Universalism: objectivity']

## Preprocess data

As models like BERT don't expect text as direct input, but rather `input_ids`, etc., we tokenize the text using the tokenizer. Here I'm using the `AutoTokenizer` API, which will automatically load the appropriate tokenizer based on the checkpoint on the hub.

What's a bit tricky is that we also need to provide labels to the model. For multi-label text classification, this is a matrix of shape (batch_size, num_labels). Also important: this should be a tensor of floats rather than integers, otherwise PyTorch's `BCEWithLogitsLoss` (which the model will use) will complain, as explained [here](https://discuss.pytorch.org/t/multi-label-binary-classification-result-type-float-cant-be-cast-to-the-desired-output-type-long/117915/3).
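To illustrate why float targets of shape (batch_size, num_labels) are the natural input, here is a small NumPy sketch of the quantity that a binary cross-entropy-with-logits loss computes. The helper name `bce_with_logits` and the toy values are assumptions made for illustration; the sketch uses mean reduction over all (example, label) pairs.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all (example, label) pairs.

    Uses the numerically stable form max(x, 0) - x * y + log(1 + exp(-|x|)).
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    per_element = (np.maximum(logits, 0) - logits * targets
                   + np.log1p(np.exp(-np.abs(logits))))
    return per_element.mean()

# Two examples, three labels; targets are float multi-hot rows.
logits = np.array([[2.0, -1.0, 0.0], [-3.0, 0.5, 1.5]])
targets = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])

loss = bce_with_logits(logits, targets)
print(loss)

# With all-zero logits every probability is 0.5, so the loss equals log(2).
print(bce_with_logits(np.zeros((1, 3)), targets[:1]))
```

Each label is scored independently, which is what allows several labels to be active for the same example.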

In [7]:
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_data(examples):
  # take a batch of texts
  premise    = examples["Premise"]
  conclusion = examples["Conclusion"]
  # encode them
  encoding = tokenizer(premise, conclusion, padding="max_length", truncation=True, max_length=128)
  # add labels
  labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
  # create numpy array of shape (batch_size, num_labels)
  labels_matrix = np.zeros((len(premise), len(labels)))
  # fill numpy array
  for idx, label in enumerate(labels):
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()
  
  return encoding

In [8]:
encoded_dataset = dataset.map(preprocess_data, batched=True, remove_columns=dataset['train'].column_names)


In [9]:
example = encoded_dataset['train'][0]
print(example.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])


In [10]:
tokenizer.decode(example['input_ids'])

"[CLS] if entrapment can serve to more easily capture wanted criminals, then why shouldn't it be legal? [SEP] entrapment should be legalized [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]"

In [11]:
example['labels']

[0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0]

In [12]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]

['Security: societal']

## N-fold cross-validation

We now prepare the folds for n-fold cross-validation. We would like to stratify the folds, but `StratifiedKFold` cannot handle multi-label targets (it raises "ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.").

A solution is to convert the multi-label problem into a multi-class one. With n labels, each target is a combination of these labels, so there are at most (2^n) - 1 distinct combinations (not counting the all-zeros row).

Let's create a new target variable that treats each combination as a separate class.
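To see what this conversion does, consider the toy example below. It is a plain-Python sketch that maps each distinct label row to a class index via a dictionary (the notebook itself uses `LabelEncoder` for this step); the label matrix `y` is made-up data.

```python
from collections import Counter

# Toy multi-hot label matrix: 6 examples, 3 labels (illustrative data).
y = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
]

# Each distinct row becomes one class, keyed by its string form.
keys = ["-".join(str(v) for v in row) for row in y]
class_of = {key: idx for idx, key in enumerate(sorted(set(keys)))}
y_multiclass = [class_of[key] for key in keys]

print(y_multiclass)   # [2, 2, 1, 1, 2, 0]
print(Counter(keys))  # how many examples share each combination

# Upper bound on the number of non-empty combinations for 20 labels.
print(2**20 - 1)      # 1048575
```

In practice only a small fraction of the possible combinations occurs in the data. Combinations that occur fewer times than the number of folds cannot be spread evenly across all folds, so the stratification is only approximate for them.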

In [13]:
from sklearn.preprocessing import LabelEncoder

def multilabel_to_string(y):
    return '-'.join(str(int(l.item())) for l in y)

def multilabel_to_multiclass(y):
    y_new = LabelEncoder().fit_transform([multilabel_to_string(l) for l in y])
    return y_new

# multilabel_to_multiclass(encoded_dataset["train"]["labels"])

Finally, we set the format of our data to PyTorch tensors. This will turn the training, validation and test sets into standard PyTorch [datasets](https://pytorch.org/docs/stable/data.html). 

In [14]:
encoded_dataset.set_format("torch")

In [15]:
from sklearn.model_selection import StratifiedKFold

n_folds = 5

# First make the kfold object
folds = StratifiedKFold(n_splits=n_folds)

# Now make our splits based on the labels.
# `np.zeros()` serves as a placeholder for X, since the split only produces indices; what matters for stratification is the labels
splits = folds.split(np.zeros(dataset["train"].num_rows), multilabel_to_multiclass(encoded_dataset["train"]["labels"]))

## Define model

Here we define a model with a pre-trained base (i.e. the weights from bert-base-uncased are loaded) and a randomly initialized classification head (a linear layer) on top. The head should be fine-tuned together with the pre-trained base on a labeled dataset.

The warning printed when loading the model says the same.

We set the `problem_type` to be "multi_label_classification", as this will make sure the appropriate loss function is used (namely [`BCEWithLogitsLoss`](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)). We also make sure the output layer has `len(labels)` output neurons, and we set the id2label and label2id mappings.

In [16]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## Train the model!

We are going to train the model using HuggingFace's Trainer API. This requires us to define 2 things: 

* `TrainingArguments`, which specify training hyperparameters. All options can be found in the [docs](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments). Below we specify, for example, that we want to evaluate and save the model after every epoch, and we set the learning rate, the batch size for training and evaluation, the number of epochs, and so on.
* a `Trainer` object (docs can be found [here](https://huggingface.co/transformers/main_classes/trainer.html#id1)).

In [17]:
batch_size = 8
metric_name = "f1"
num_train_epochs = 10

In [18]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    "bert-finetuned-sem_eval-english",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)
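As a rough sanity check on these settings, the number of optimizer steps per epoch can be estimated from the dataset size. The sketch below assumes a single device, no gradient accumulation, and 5-fold cross-validation over the 5220 training rows shown earlier; the variable names are illustrative.

```python
import math

num_rows = 5220      # size of the full training set shown above
n_folds = 5
batch_size = 8
num_train_epochs = 10

val_rows = num_rows // n_folds        # examples held out per fold
train_rows = num_rows - val_rows      # examples used for training per fold
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(val_rows, train_rows, steps_per_epoch, total_steps)  # 1044 4176 522 5220
```

Under these assumptions, `save_strategy="epoch"` writes a checkpoint at every multiple of 522 steps.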

We are also going to compute metrics while training. For this, we need to define a `compute_metrics` function, that returns a dictionary with the desired metric values.

In [19]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score, precision_score, recall_score
from transformers import EvalPrediction
import torch
    
# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    precision_micro_average = precision_score(y_true=y_true, y_pred=y_pred, average='micro')
    recall_micro_average = recall_score(y_true=y_true, y_pred=y_pred, average='micro')
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    # note: ROC AUC is computed here on the thresholded predictions, not on the probabilities
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    accuracy = accuracy_score(y_true, y_pred)
    # return as dictionary
    metrics = {'p': precision_micro_average,
               'r': recall_micro_average,
               'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, 
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds, 
        labels=p.label_ids)
    return result
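
The micro-averaged scores computed above can be reproduced by hand, which may help when interpreting them. The following NumPy sketch uses made-up `y_true` and `y_pred` matrices and pools true positives, false positives and false negatives over all (example, label) cells before computing the ratios.

```python
import numpy as np

# Toy ground truth and thresholded predictions: 3 examples, 4 labels.
y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 1, 0],
                   [1, 0, 0, 0]])

# Micro-averaging counts over all (example, label) cells at once.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn)  # 3 1 2
print(precision, recall, f1)
```

Because the counts are pooled, frequent labels weigh more heavily in micro-averaged scores than rare ones.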

Let's verify a batch as well as a forward pass:

In [20]:
encoded_dataset['train'][0]['labels'].type()

'torch.FloatTensor'

In [21]:
encoded_dataset['train']['input_ids'][0]

tensor([  101,  2065,  4372,  6494, 24073,  2064,  3710,  2000,  2062,  4089,
         5425,  2359, 12290,  1010,  2059,  2339,  5807,  1005,  1056,  2009,
         2022,  3423,  1029,   102,  4372,  6494, 24073,  2323,  2022,  3423,
         3550,   102,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0])

In [22]:
#forward pass
outputs = model(input_ids=encoded_dataset['train']['input_ids'][0].unsqueeze(0), labels=encoded_dataset['train'][0]['labels'].unsqueeze(0))
outputs

SequenceClassifierOutput(loss=tensor(0.6951, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[-0.0848, -0.4281,  0.5540,  0.1601, -0.0590,  0.0067, -0.0180, -0.4107,
         -0.4852,  0.0186, -0.2489, -0.7731,  0.0675,  0.1176,  0.2130,  0.0916,
         -0.0998,  0.2706,  0.8651, -0.2715]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

Let's start training!

In [23]:
fold = 1

metrics = {
    'p': [],
    'r': [],
    'f1': [],
    'roc_auc': [],
    'accuracy': []
}

if n_folds < 2:
    trainer = Trainer(
        model,
        args,
        train_dataset=encoded_dataset["train"],
        eval_dataset=encoded_dataset["validation"],
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )
    trainer.train()
    results = trainer.evaluate()
    for key in metrics:
        metrics[key].append(results.get('eval_' + key))
else:
    for train_idxs, val_idxs in splits:
        print(f"Fold: {fold}")
        ## Important: Don't forget to reload the original model in each fold!
        model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)
        
        fold_dataset = DatasetDict({
            "train":encoded_dataset["train"].select(train_idxs),
            "validation":encoded_dataset["train"].select(val_idxs),
            "test":encoded_dataset["validation"]
        })
        
        trainer = Trainer(
            model,
            args,
            train_dataset=fold_dataset["train"],
            eval_dataset=fold_dataset["validation"],
            tokenizer=tokenizer,
            compute_metrics=compute_metrics
        )
        
        trainer.train()
        results = trainer.evaluate()
        for key in metrics:
            metrics[key].append(results.get('eval_' + key))
        fold += 1
        # print(metrics)



Fold: 1



| Epoch | Training Loss | Validation Loss | P | R | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|---|---|
| 1 | 0.4099 | 0.357382 | 0.705655 | 0.256764 | 0.376524 | 0.61742 | 0.052682 |
| 2 | 0.3399 | 0.334192 | 0.692629 | 0.37345 | 0.485259 | 0.669762 | 0.076628 |
| 3 | 0.2995 | 0.325797 | 0.686616 | 0.410654 | 0.513933 | 0.686143 | 0.069923 |
| 4 | 0.2669 | 0.322368 | 0.697862 | 0.414036 | 0.519724 | 0.68867 | 0.072797 |
| 5 | 0.2384 | 0.325592 | 0.677826 | 0.439402 | 0.533174 | 0.698325 | 0.070881 |
| 6 | 0.2131 | 0.330223 | 0.675011 | 0.448422 | 0.538865 | 0.702113 | 0.07567 |
| 7 | 0.1926 | 0.336274 | 0.654187 | 0.458005 | 0.538793 | 0.704222 | 0.063218 |
| 8 | 0.1787 | 0.340527 | 0.659901 | 0.448985 | 0.534384 | 0.700808 | 0.068008 |
| 9 | 0.166 | 0.344301 | 0.648478 | 0.462232 | 0.53974 | 0.70547 | 0.060345 |
| 10 | 0.1564 | 0.346513 | 0.644783 | 0.465051 | 0.540364 | 0.706302 | 0.058429 |


***** Running Evaluation *****
  Num examples = 1044
  Batch size = 8
Saving model checkpoint to bert-finetuned-sem_eval-english/checkpoint-522
Configuration saved in bert-finetuned-sem_eval-english/checkpoint-522/config.json
Model weights saved in bert-finetuned-sem_eval-english/checkpoint-522/pytorch_model.bin
tokenizer config file saved in bert-finetuned-sem_eval-english/checkpoint-522/tokenizer_config.json
Special tokens file saved in bert-finetuned-sem_eval-english/checkpoint-522/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1044
  Batch size = 8
Saving model checkpoint to bert-finetuned-sem_eval-english/checkpoint-1044
Configuration saved in bert-finetuned-sem_eval-english/checkpoint-1044/config.json
Model weights saved in bert-finetuned-sem_eval-english/checkpoint-1044/pytorch_model.bin
tokenizer config file saved in bert-finetuned-sem_eval-english/checkpoint-1044/tokenizer_config.json
Special tokens file saved in bert-finetuned-sem_eval-english/checkpo

Fold: 2


loading configuration file config.json from cache at /home/petasis/.cache/huggingface/hub/models--bert-base-uncased/snapshots/0a6aa9128b6194f4f3c4db429b6cb4891cdb421b/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "Self-direction: thought",
    "1": "Self-direction: action",
    "2": "Stimulation",
    "3": "Hedonism",
    "4": "Achievement",
    "5": "Power: dominance",
    "6": "Power: resources",
    "7": "Face",
    "8": "Security: personal",
    "9": "Security: societal",
    "10": "Tradition",
    "11": "Conformity: rules",
    "12": "Conformity: interpersonal",
    "13": "Humility",
    "14": "Benevolence: caring",
    "15": "Benevolence: dependability",
    "16": "Universalism: concern",
    "17": "

| Epoch | Training Loss | Validation Loss | P | R | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|---|---|
| 1 | 0.4138 | 0.36007 | 0.723324 | 0.214746 | 0.331172 | 0.598911 | 0.04023 |
| 2 | 0.344 | 0.329879 | 0.744853 | 0.334735 | 0.461896 | 0.655556 | 0.071839 |
| 3 | 0.3068 | 0.319158 | 0.741407 | 0.393047 | 0.513741 | 0.682401 | 0.091954 |
| 4 | 0.2717 | 0.316813 | 0.707891 | 0.430053 | 0.535054 | 0.696746 | 0.090996 |
| 5 | 0.243 | 0.316959 | 0.695466 | 0.455845 | 0.55072 | 0.70736 | 0.094828 |
| 6 | 0.2162 | 0.321696 | 0.677514 | 0.455285 | 0.544601 | 0.705318 | 0.086207 |
| 7 | 0.1968 | 0.328105 | 0.668462 | 0.452201 | 0.539465 | 0.702996 | 0.087165 |
| 8 | 0.1836 | 0.330659 | 0.654905 | 0.473507 | 0.549626 | 0.71105 | 0.086207 |
| 9 | 0.1681 | 0.333355 | 0.658721 | 0.464816 | 0.545036 | 0.7076 | 0.083333 |
| 10 | 0.1616 | 0.334438 | 0.655952 | 0.463415 | 0.543125 | 0.706668 | 0.08046 |



Fold: 3



| Epoch | Training Loss | Validation Loss | P | R | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|---|---|
| 1 | 0.4138 | 0.364698 | 0.707972 | 0.249722 | 0.369212 | 0.61416 | 0.045019 |
| 2 | 0.3459 | 0.338041 | 0.720742 | 0.346325 | 0.467845 | 0.659222 | 0.074713 |
| 3 | 0.3092 | 0.325164 | 0.716319 | 0.392261 | 0.506926 | 0.679992 | 0.08908 |
| 4 | 0.2758 | 0.31713 | 0.707006 | 0.432628 | 0.536788 | 0.697688 | 0.090996 |
| 5 | 0.2467 | 0.32162 | 0.705244 | 0.434298 | 0.53756 | 0.698292 | 0.092912 |
| 6 | 0.2203 | 0.323508 | 0.687092 | 0.468263 | 0.556954 | 0.711977 | 0.092912 |
| 7 | 0.1992 | 0.327644 | 0.669981 | 0.487751 | 0.564524 | 0.718916 | 0.091954 |
| 8 | 0.1841 | 0.333248 | 0.680209 | 0.470768 | 0.556433 | 0.712391 | 0.08908 |
| 9 | 0.1735 | 0.333745 | 0.676743 | 0.467428 | 0.552939 | 0.710519 | 0.095785 |
| 10 | 0.1646 | 0.334893 | 0.677602 | 0.465757 | 0.552054 | 0.709857 | 0.09387 |



Fold: 4



| Epoch | Training Loss | Validation Loss | P | R | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|---|---|
| 1 | 0.4128 | 0.362643 | 0.713829 | 0.248124 | 0.368247 | 0.613704 | 0.041188 |
| 2 | 0.3413 | 0.334999 | 0.711965 | 0.365379 | 0.482923 | 0.667297 | 0.084291 |
| 3 | 0.3042 | 0.325017 | 0.724645 | 0.383162 | 0.501272 | 0.67642 | 0.091954 |
| 4 | 0.2698 | 0.32286 | 0.714216 | 0.409003 | 0.520141 | 0.687459 | 0.096743 |
| 5 | 0.239 | 0.324058 | 0.685903 | 0.43262 | 0.530584 | 0.69568 | 0.090996 |
| 6 | 0.2143 | 0.330634 | 0.6875 | 0.437066 | 0.534398 | 0.697845 | 0.096743 |
| 7 | 0.1938 | 0.333777 | 0.677145 | 0.442901 | 0.535528 | 0.699461 | 0.105364 |
| 8 | 0.1773 | 0.339131 | 0.679449 | 0.438177 | 0.53277 | 0.697562 | 0.106322 |
| 9 | 0.1682 | 0.343442 | 0.675364 | 0.438733 | 0.531918 | 0.697406 | 0.10249 |
| 10 | 0.1588 | 0.344241 | 0.672964 | 0.445402 | 0.536031 | 0.700162 | 0.105364 |



Fold: 5



| Epoch | Training Loss | Validation Loss | P | R | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|---|---|
| 1 | 0.4129 | 0.364809 | 0.737511 | 0.226879 | 0.347009 | 0.605087 | 0.031609 |
| 2 | 0.3422 | 0.336923 | 0.713609 | 0.336966 | 0.457772 | 0.654495 | 0.064176 |
| 3 | 0.3017 | 0.330019 | 0.68753 | 0.398994 | 0.50495 | 0.680741 | 0.074713 |
| 4 | 0.2694 | 0.324698 | 0.693897 | 0.412965 | 0.517779 | 0.687639 | 0.078544 |
| 5 | 0.2369 | 0.329897 | 0.662316 | 0.453758 | 0.538551 | 0.70295 | 0.073755 |
| 6 | 0.2136 | 0.332376 | 0.677802 | 0.430847 | 0.526819 | 0.69424 | 0.08046 |
| 7 | 0.1939 | 0.337298 | 0.665973 | 0.446773 | 0.534783 | 0.700209 | 0.076628 |
| 8 | 0.1799 | 0.343878 | 0.644973 | 0.46242 | 0.538649 | 0.704882 | 0.073755 |
| 9 | 0.1662 | 0.346158 | 0.650237 | 0.460743 | 0.53933 | 0.704737 | 0.073755 |
| 10 | 0.1587 | 0.346504 | 0.648562 | 0.453758 | 0.533947 | 0.701447 | 0.077586 |



## Results

In [24]:
import pandas as pd
import numpy as np

decimal_digits = 4

data = [[round(np.mean(metrics['p']), decimal_digits),
         round(np.mean(metrics['r']), decimal_digits),
         round(np.mean(metrics['f1']), decimal_digits),
         round(np.mean(metrics['roc_auc']), decimal_digits),
         round(np.mean(metrics['accuracy']), decimal_digits),
        ],
        [round(np.std(metrics['p']), decimal_digits),
         round(np.std(metrics['r']), decimal_digits),
         round(np.std(metrics['f1']), decimal_digits),
         round(np.std(metrics['roc_auc']), decimal_digits),
         round(np.std(metrics['accuracy']), decimal_digits),
        ]]

df = pd.DataFrame(data, index=["", "±"], columns=["P", "R", "F1", "Roc Auc", "Accuracy"])
df
# df.style.hide_index()

| | P | R | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|
| | 0.6667 | 0.463 | 0.5462 | 0.7075 | 0.0849 |
| ± | 0.018 | 0.014 | 0.0104 | 0.0062 | 0.0167 |
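
The accuracy column is much lower than the other metrics. A likely reason is that scikit-learn's `accuracy_score` computes subset accuracy for multi-label inputs: an example counts as correct only if all of its labels are predicted correctly. The NumPy sketch below, on made-up data, contrasts this with a per-label accuracy.

```python
import numpy as np

y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 0, 0]])

# Subset accuracy: a row is correct only if all of its labels match.
subset_accuracy = np.all(y_true == y_pred, axis=1).mean()

# Per-label accuracy: fraction of individual (example, label) cells that match.
per_label_accuracy = (y_true == y_pred).mean()

print(subset_accuracy)     # 1 of 3 rows matches exactly
print(per_label_accuracy)  # 10 of 12 cells match
```

With 20 labels per example, a single wrong label is enough to count the whole example as incorrect, so low subset accuracy is not unusual.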


## Evaluate

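After training, we evaluate the model on the validation set. Note that `trainer` here is the one from the last fold, so the numbers below refer to fold 5 only. Because `load_best_model_at_end=True`, they correspond to that fold's best epoch by F1 (epoch 9) rather than to the final epoch.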

In [25]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 1044
  Batch size = 8


{'eval_loss': 0.34615832567214966,
 'eval_p': 0.6502365930599369,
 'eval_r': 0.46074322436434756,
 'eval_f1': 0.5393295175797219,
 'eval_roc_auc': 0.7047372557865896,
 'eval_accuracy': 0.07375478927203065,
 'eval_runtime': 5.2357,
 'eval_samples_per_second': 199.401,
 'eval_steps_per_second': 25.021,
 'epoch': 10.0}

## Inference

Let's test the model on a new sentence:

In [26]:
text = "I'm happy I can finally train a model for multi-label classification"

encoding = tokenizer(text, return_tensors="pt")
encoding = {k: v.to(trainer.model.device) for k,v in encoding.items()}

outputs = trainer.model(**encoding)

The logits that come out of the model are of shape (batch_size, num_labels). As we are only forwarding a single sentence through the model, the `batch_size` equals 1. The logits are a tensor that contains the (unnormalized) scores for every individual label.

In [27]:
logits = outputs.logits
logits.shape

torch.Size([1, 20])

To turn them into actual predicted labels, we first apply a sigmoid function independently to every score, so that every score becomes a number between 0 and 1 that can be interpreted as a "probability" of how certain the model is that a given label applies to the input text.

Next, we use a threshold (typically, 0.5) to turn every probability into either a 1 (which means, we predict the label for the given example) or a 0 (which means, we don't predict the label for the given example).

In [28]:
# apply sigmoid + threshold
sigmoid = torch.nn.Sigmoid()
probs = sigmoid(logits.squeeze().cpu())
predictions = np.zeros(probs.shape)
predictions[np.where(probs >= 0.5)] = 1
# turn predicted id's into actual label names
predicted_labels = [id2label[idx] for idx, label in enumerate(predictions) if label == 1.0]
print(predicted_labels)

['Achievement']
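
When inspecting predictions, it can be handy to see the probabilities next to the label names. The helper below is a hypothetical convenience function (`decode_predictions` is not part of any library), written in plain NumPy and demonstrated on made-up logits and a made-up three-label mapping.

```python
import numpy as np

def decode_predictions(logits, id2label, threshold=0.5):
    """Return (label, probability) pairs whose sigmoid score meets the threshold."""
    logits = np.asarray(logits, dtype=np.float64)
    probs = 1.0 / (1.0 + np.exp(-logits))
    return [(id2label[i], float(p)) for i, p in enumerate(probs) if p >= threshold]

# Illustrative mapping and scores, not taken from the trained model.
toy_id2label = {0: "Achievement", 1: "Hedonism", 2: "Tradition"}
toy_logits = [1.2, -0.3, 0.0]

print(decode_predictions(toy_logits, toy_id2label))
print(decode_predictions(toy_logits, toy_id2label, threshold=0.8))
```

Raising the threshold trades recall for precision, so it is a natural knob to tune on the validation set.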


In [30]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element