# BERT

### Introduction
In this notebook, our objective is to train a BERT model for a challenging multilabel classification task. The task is to assign each textual argument to one or more of 20 distinct categories, each representing a fundamental human value.

The categories are as follows:
- Self-direction: thought
- Self-direction: action
- Stimulation
- Hedonism
- Achievement
- Power: dominance
- Power: resources
- Face
- Security: personal
- Security: societal
- Tradition
- Conformity: rules
- Conformity: interpersonal
- Humility
- Benevolence: caring
- Benevolence: dependability
- Universalism: concern
- Universalism: nature
- Universalism: tolerance
- Universalism: objectivity
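
Because an argument can express several values at once, each example's target is a 0/1 vector with one position per category. The sketch below illustrates that encoding on a shortened label list; the helper name `to_multi_hot` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np

# Shortened label list for illustration; the notebook uses all 20 categories.
labels = ["Self-direction: thought", "Stimulation", "Hedonism", "Achievement"]

def to_multi_hot(active, all_labels):
    """Return a 0/1 vector with a 1 for every label listed in `active`."""
    active = set(active)
    return np.array([1 if name in active else 0 for name in all_labels])

# An argument that expresses both "Stimulation" and "Achievement".
vector = to_multi_hot(["Stimulation", "Achievement"], labels)
print(vector)  # [0 1 0 1]
```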






### Flow of the notebook


The notebook will be structured into distinct sections to offer a well-organized guide through the implemented process. These sections will include:

1. Installing the required libraries
2. Importing the libraries
3. Loading the datasets
4. Defining and Fine-Tuning the Model
  - BERT Base
  - BERT Large
5. Evaluation

## Installing the Required Libraries

We install the libraries below because they may not be pre-installed in the Google Colab runtime:
- transformers
- SentencePiece
- wandb
- simpletransformers

In [None]:
!pip install transformers

In [None]:
!pip install SentencePiece

In [None]:
!pip install wandb --upgrade

In [None]:
!pip install --upgrade wandb simpletransformers

In [None]:
!pip install simpletransformers

## Importing the Libraries

In [10]:
from transformers import AutoTokenizer

import logging
import wandb

import torch

import numpy as np

from simpletransformers.classification import MultiLabelClassificationModel, ClassificationArgs

import pandas as pd
from sklearn.metrics import f1_score, classification_report

## Loading the Data
We utilize data sourced from [Zenodo](https://zenodo.org/record/7550385#.Y8wMquzMK3I), specifically from the [Human Value Detection 2023 competition](https://touche.webis.de/semeval23/touche23-web/index.html). Our focus is on the following datasets: arguments-training.tsv, arguments-validation.tsv, arguments-test.tsv, labels-training.tsv, labels-validation.tsv, and labels-test.tsv.


In [11]:
# defining the label titles in our datasets
label_cols = ['Self-direction: thought', 'Self-direction: action', 'Stimulation',
       'Hedonism', 'Achievement', 'Power: dominance', 'Power: resources',
       'Face', 'Security: personal', 'Security: societal', 'Tradition',
       'Conformity: rules', 'Conformity: interpersonal', 'Humility',
       'Benevolence: caring', 'Benevolence: dependability',
       'Universalism: concern', 'Universalism: nature',
       'Universalism: tolerance', 'Universalism: objectivity']

In [12]:
# Loading the data
train_args = pd.read_csv("arguments-training.tsv",delimiter='\t')
train_labels = pd.read_csv("labels-training.tsv",delimiter='\t')

val_labels = pd.read_csv("labels-validation.tsv",delimiter='\t')
val_args = pd.read_csv("arguments-validation.tsv",delimiter='\t')

test_labels = pd.read_csv("labels-test.tsv",delimiter='\t')
test_args = pd.read_csv("arguments-test.tsv",delimiter='\t')

Simple Transformers' multilabel models expect a DataFrame with a 'text' column holding the input string and a 'labels' column holding the list of 0/1 label values for each example. We build 'text' by concatenating each argument's conclusion, stance, and premise.

In [13]:
# Adding the 'text' column
train_args['text'] = train_args['Conclusion'] + " " + train_args['Stance'] + " " + train_args['Premise']
train_args.drop(labels=['Conclusion', 'Stance', 'Premise'], axis=1, inplace=True)

val_args['text'] = val_args['Conclusion'] + " " + val_args['Stance'] + " " + val_args['Premise']
val_args.drop(labels=['Conclusion', 'Stance', 'Premise'], axis=1, inplace=True)

test_args['text'] = test_args['Conclusion'] + " " + test_args['Stance'] + " " + test_args['Premise']
test_args.drop(labels=['Conclusion', 'Stance', 'Premise'], axis=1, inplace=True)

print(val_labels)

     Argument ID  Self-direction: thought  Self-direction: action  \
0         A01001                        0                       0   
1         A01012                        0                       0   
2         A02001                        0                       0   
3         A02002                        0                       1   
4         A02009                        0                       0   
...          ...                      ...                     ...   
1891      E08014                        1                       0   
1892      E08021                        1                       0   
1893      E08022                        0                       1   
1894      E08024                        0                       1   
1895      E08025                        0                       1   

      Stimulation  Hedonism  Achievement  Power: dominance  Power: resources  \
0               0         0            0                 0                 0   
1          

In [14]:
def prepare_labels(data):
  df = pd.DataFrame(data)

  final_list = df.copy()  # Make a copy of the DataFrame
  final_list['labels'] = df.iloc[:, 1:].apply(lambda row: tuple(row), axis=1)

  return final_list

In [None]:
# Adding the 'labels' column
train_labels_compressed = prepare_labels(train_labels)
val_labels_compressed = prepare_labels(val_labels)
test_labels_compressed = prepare_labels(test_labels)

In [None]:
# Merging the content of two files into one dataset
merged_train = pd.merge(train_args, train_labels_compressed, on='Argument ID', how='inner')
merged_val = pd.merge(val_args, val_labels_compressed, on='Argument ID', how='inner')
merged_test = pd.merge(test_args, test_labels_compressed, on='Argument ID', how='inner')
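
To make the shape of the merged data concrete, here is a minimal sketch of the same three steps (build 'text', collapse the label columns, join on 'Argument ID') on two invented rows. The toy values and the `toy_*` names are assumptions for illustration, not rows from the real dataset.

```python
import pandas as pd

# Toy stand-ins for arguments-*.tsv and labels-*.tsv (values are invented).
toy_args = pd.DataFrame({
    "Argument ID": ["X01", "X02"],
    "Conclusion": ["We should ban fast food", "We should subsidize solar power"],
    "Stance": ["in favor of", "against"],
    "Premise": ["it harms health", "it is too expensive"],
})
toy_labels = pd.DataFrame({
    "Argument ID": ["X01", "X02"],
    "Security: personal": [1, 0],
    "Power: resources": [0, 1],
})

# Build the 'text' column from conclusion, stance, and premise.
toy_args["text"] = (
    toy_args["Conclusion"] + " " + toy_args["Stance"] + " " + toy_args["Premise"]
)

# Collapse every label column (all columns after the ID) into one tuple per row.
toy_labels["labels"] = toy_labels.iloc[:, 1:].apply(lambda row: tuple(row), axis=1)

# Join texts and labels on the shared ID.
merged = pd.merge(
    toy_args[["Argument ID", "text"]], toy_labels, on="Argument ID", how="inner"
)
print(merged[["text", "labels"]])
```

An inner join keeps only the IDs present in both files, so comparing `len(merged)` with the length of the inputs is a quick check for unmatched arguments.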

We have implemented the following functions to determine the optimal threshold after training and assessing the model. This is necessary because the model's output consists of a series of probabilities, and these probabilities need to be transformed into binary values (0s and 1s) using a threshold.

In [17]:
# Finding the threshold that maximizes the macro F1 score
def get_threshold(labels, model_output):
  results = {}
  labels_compressed = labels.drop("Argument ID", axis=1)

  df_prediction = labels_compressed.copy()
  for tr in np.arange(0.1, 0.9, 0.05):
      tr = round(tr, 2)
      for i, label in enumerate(label_cols):
          prediction = np.where(model_output[:,i] >= tr, 1, 0)
          df_prediction[label] = prediction

      y_pred = df_prediction.values.tolist()
      y_test = labels_compressed.values.tolist()
      f1 = f1_score(y_test, y_pred, average = "macro", zero_division = 1)
      results[tr] = f1

  for k,v in results.items():
      print("THRESHOLD: {:.2f} ".format(k), "F1 score: {:.3f}".format(v))

  THRESHOLD = max(results, key = results.get)

  print("\nBest threshold obtained:", THRESHOLD , "having F1 score of: {:.2f}".format(max(results.values())))

  return THRESHOLD
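
The same idea can be tried on synthetic data without a trained model. The sketch below is a compact, array-based variant of the threshold search; `sweep_thresholds` and the toy probabilities are assumptions introduced for illustration and are not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.metrics import f1_score

def sweep_thresholds(y_true, probs, thresholds):
    """Return {threshold: macro-F1} for each candidate threshold."""
    scores = {}
    for t in thresholds:
        t = round(float(t), 2)  # avoid float noise such as 0.30000000000000004
        y_pred = (probs >= t).astype(int)
        scores[t] = f1_score(y_true, y_pred, average="macro", zero_division=1)
    return scores

# Synthetic data: 4 examples, 3 labels (invented numbers, for illustration only).
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
probs = np.array([
    [0.40, 0.10, 0.05],
    [0.20, 0.36, 0.10],
    [0.45, 0.31, 0.15],
    [0.10, 0.05, 0.32],
])

scores = sweep_thresholds(y_true, probs, np.arange(0.1, 0.9, 0.05))
best = max(scores, key=scores.get)  # first threshold reaching the top score
print(best, scores[best])
```

When several thresholds tie, `max` returns the first one in iteration order, which here is the lowest.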

The function below prints a report of the classification scores for each class.

In [18]:
# Printing a per-class report of our scores at the chosen threshold
def get_report(labels, model_output, THRESHOLD):
  # Drop the ID column (if present) so that only the label columns are compared
  labels_compressed = labels.drop(columns=["Argument ID"], errors="ignore")
  df_prediction = labels_compressed.copy()

  # Binarize each label's probabilities at the threshold
  for i, label in enumerate(label_cols):
    prediction = np.where(model_output[:,i] >= THRESHOLD, 1, 0)
    df_prediction[label] = prediction

  y_pred = df_prediction.values.tolist()
  y_test = labels_compressed.values.tolist()

  print(classification_report(y_test, y_pred, target_names = label_cols))

## Defining the Model


In this section, we introduce our BERT model and begin fine-tuning by exploring several hyperparameter settings. To run these experiments, we use Weights & Biases (W&B) Sweeps, which lets us systematically evaluate different model configurations.

We work with two versions of the BERT model, namely "bert-base-cased" and "bert-large-cased." However, owing to constraints with our CUDA resources, we begin by fine-tuning "bert-base-cased" using hyperparameter tuning to identify the optimal hyperparameters. Once we've determined the best hyperparameters for "bert-base-cased," we proceed to train both the "bert-base-cased" and "bert-large-cased" models. This approach allows us to make a fair and meaningful comparison of their performance on our datasets.

### Defining Hyperparameters


We vary two hyperparameters: the number of training epochs and the training batch size.

In [19]:
# hyperparameters we want to tune
sweep_config = {
    "method": "grid",  # grid, random
    "parameters": {
        "num_train_epochs": {"values": [3, 5, 10]},
        "train_batch_size": {"values": [16, 32]}
    },
    'metric': {
        'name': 'eval_loss',  # or 'LRAP', 'accuracy', etc.
        'goal': 'minimize'    # or 'maximize'
    }
}
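
With the "grid" method, every combination of the listed values is tried once. The short sketch below enumerates those combinations with the standard library; it only mirrors the "parameters" block above and does not call W&B.

```python
from itertools import product

# Mirrors the "parameters" block of sweep_config above.
grid = {
    "num_train_epochs": [3, 5, 10],
    "train_batch_size": [16, 32],
}

# A grid search tries every combination of the listed values.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

for combo in combinations:
    print(combo)

print(len(combinations))  # 6
```

Three epoch settings times two batch sizes gives six runs, which matches the `count=6` passed to `wandb.agent` later in the notebook.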

In [None]:
# create a sweep in W&B
sweep_id = wandb.sweep(sweep_config, project="Human Value Detection")

In [21]:
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

In [22]:
# Our fixed arguments for the simple transformers model
args = {
      'manual_seed': 42,
      'max_seq_length': 100,
      'overwrite_output_dir': True,
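      'gradient_accumulation_steps': 8,
      # NOTE: Simple Transformers documents this option as 'learning_rate'; an 'lr' key
      # is likely ignored, so the library's default learning rate probably stays in effect.
      "lr": 2e-4,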
      "optimizer": 'AdamW'
      }
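
Since gradients are accumulated over 8 batches before each optimizer update, the number of examples behind every update is larger than `train_batch_size`. The sketch below works through that arithmetic; the helper `effective_batch_size` and the dataset size of 5000 are hypothetical, chosen only to make the numbers concrete.

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps):
    """Number of examples contributing to each optimizer update."""
    return train_batch_size * gradient_accumulation_steps

accumulation = 8  # 'gradient_accumulation_steps' from args above

# Batch sizes explored in the sweep.
sizes = {bs: effective_batch_size(bs, accumulation) for bs in (16, 32)}
print(sizes)  # {16: 128, 32: 256}

# Hypothetical dataset size; the exact update count depends on how the
# library rounds partial batches, so treat this as an approximation.
num_examples = 5000
approx_updates_per_epoch = num_examples // sizes[32]
print(approx_updates_per_epoch)  # 19
```

This is worth keeping in mind when comparing runs: a sweep over `train_batch_size` with fixed accumulation also changes the number of optimizer updates per epoch.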

### BERT Base
This section contains the training phase for our BERT base model:

In [23]:
def train():
    # Initialize a new wandb run
    wandb.init()

    # Create the multi-label classification model
    model = MultiLabelClassificationModel(
        'bert',
        'bert-base-cased',
        use_cuda=True,
        args=args,
        sweep_config=wandb.config,
        num_labels=20
    )

    # Train the model
    model.train_model(merged_train)

    # Evaluate the model
    model.eval_model(merged_val)

    # Finish the wandb run (wandb.join() is a deprecated alias of wandb.finish())
    wandb.finish()


In [None]:
wandb.agent(sweep_id, train, count=6)

In [None]:
# Initialize the W&B API
api = wandb.Api()

# Look up the project by name (replace it with your own project name if it differs)
project = api.project('Human Value Detection')

project

In [26]:
# Get the sweep
sweep = api.sweep(f"{project.name}/{sweep_id}")

# Retrieve the best run
best_run = sweep.best_run()

# Get the best run's parameters and results
best_params = best_run.config
best_results = best_run.summary

wandb: Sorting runs by +summary_metrics.eval_loss


In [27]:
print("Best Parameters:", best_params)
print("Best Results:", best_results)

Best Parameters: {'num_train_epochs': 3, 'train_batch_size': 32}
Best Results: {'Training loss': 0.41010457277297974, '_runtime': 132.145995657, '_step': 0, '_timestamp': 1746645539.280375, '_wandb': {'runtime': 202}, 'global_step': 50, 'lr': 8.8135593220339e-06}


In [28]:
# Defining the model using the best parameters obtained
model_bert_base = MultiLabelClassificationModel(
        'bert',
        'bert-base-cased',
        use_cuda=True,
        args={
              'num_train_epochs': best_params['num_train_epochs'],
              'train_batch_size': best_params['train_batch_size'],
              'manual_seed': 42,
              'max_seq_length': 100,
              'overwrite_output_dir': True,
              'gradient_accumulation_steps': 8,
              "lr": 2e-4,
              "optimizer": 'AdamW'
            },
        num_labels=20
)

Some weights of BertForMultiLabelSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [29]:
# layers of our bert base model
model_bert_base.model

BertForMultiLabelSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768

In [None]:
# training the model
model_bert_base.train_model(merged_train, eval_df=merged_val)

### BERT Large

This section encompasses the training phase for our BERT large model.

In [None]:
# Defining our bert large model with the hyperparameters we got from bert base
model_bert_large = MultiLabelClassificationModel(
        'bert',
        'bert-large-cased',
        use_cuda=True,
        args={
              'num_train_epochs': best_params['num_train_epochs'],
              'train_batch_size': best_params['train_batch_size'],
              'manual_seed': 42,
              'max_seq_length': 100,
              'overwrite_output_dir': True,
              'gradient_accumulation_steps': 8,
              "lr": 2e-4,
              "optimizer": 'AdamW'
            },
        num_labels=20
)

In [32]:
# Layers of our bert large model
model_bert_large.model

BertForMultiLabelSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-23): 24 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): La

In [None]:
# Training our bert large model
model_bert_large.train_model(merged_train, eval_df=merged_val)

## Evaluation of the models


In this section, we assess the performance of the BERT base and BERT large models on both the validation and test sets. We gauge their performance using the F1 score as the primary metric for evaluation.
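
For reference, the threshold search above calls `f1_score` with `average="macro"`, which computes an F1 score per label and then takes the unweighted mean over labels. Written out, with $TP_c$, $FP_c$, and $FN_c$ denoting the true positives, false positives, and false negatives for label $c$ (notation introduced here for illustration):

```latex
P_c = \frac{TP_c}{TP_c + FP_c}, \qquad
R_c = \frac{TP_c}{TP_c + FN_c}, \qquad
F1_c = \frac{2 \, P_c \, R_c}{P_c + R_c}

\text{macro-}F1 = \frac{1}{C} \sum_{c=1}^{C} F1_c, \qquad C = 20
```

Because every label carries equal weight, rare values influence the macro score as much as frequent ones.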

### Evaluation of bert-base-cased
Here, we evaluate the BERT base model on both the validation and test datasets.

**Evaluation on validation set**

In [None]:
result_bert_base, model_outputs_bert_base, wrong_predictions_bert_base = model_bert_base.eval_model(merged_val)
print("Results of bert base on validation set:\n", result_bert_base)

In [35]:
bert_base_threshold = get_threshold(val_labels, model_outputs_bert_base)

THRESHOLD: 0.10  F1 score: 0.271
THRESHOLD: 0.15  F1 score: 0.245
THRESHOLD: 0.20  F1 score: 0.200
THRESHOLD: 0.25  F1 score: 0.171
THRESHOLD: 0.30  F1 score: 0.118
THRESHOLD: 0.35  F1 score: 0.089
THRESHOLD: 0.40  F1 score: 0.047
THRESHOLD: 0.45  F1 score: 0.001
THRESHOLD: 0.50  F1 score: 0.000
THRESHOLD: 0.55  F1 score: 0.000
THRESHOLD: 0.60  F1 score: 0.000
THRESHOLD: 0.65  F1 score: 0.000
THRESHOLD: 0.70  F1 score: 0.000
THRESHOLD: 0.75  F1 score: 0.000
THRESHOLD: 0.80  F1 score: 0.000
THRESHOLD: 0.85  F1 score: 0.000

Best threshold obtained: 0.1 having F1 score of: 0.27


**Evaluation on test set**


In [None]:
result_bert_base_test, model_outputs_bert_base_test, wrong_predictions_bert_base_test = model_bert_base.eval_model(merged_test)
print("Results of bert base on test set:\n", result_bert_base_test)

In [37]:
bert_base_threshold_test = get_threshold(test_labels, model_outputs_bert_base_test)

THRESHOLD: 0.10  F1 score: 0.247
THRESHOLD: 0.15  F1 score: 0.220
THRESHOLD: 0.20  F1 score: 0.183
THRESHOLD: 0.25  F1 score: 0.156
THRESHOLD: 0.30  F1 score: 0.107
THRESHOLD: 0.35  F1 score: 0.072
THRESHOLD: 0.40  F1 score: 0.049
THRESHOLD: 0.45  F1 score: 0.002
THRESHOLD: 0.50  F1 score: 0.000
THRESHOLD: 0.55  F1 score: 0.000
THRESHOLD: 0.60  F1 score: 0.000
THRESHOLD: 0.65  F1 score: 0.000
THRESHOLD: 0.70  F1 score: 0.000
THRESHOLD: 0.75  F1 score: 0.000
THRESHOLD: 0.80  F1 score: 0.000
THRESHOLD: 0.85  F1 score: 0.000

Best threshold obtained: 0.1 having F1 score of: 0.25
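
Note that the cell above re-selects the threshold using the test labels. A common alternative is to fix the threshold on the validation set and reuse it unchanged on the test set, so that the test score is not tuned on test data. The sketch below shows that pattern on invented arrays; `macro_f1_at` is a hypothetical helper, and in this notebook the fixed value would be `bert_base_threshold`.

```python
import numpy as np
from sklearn.metrics import f1_score

def macro_f1_at(y_true, probs, threshold):
    """Macro-F1 after binarizing probabilities at a fixed threshold."""
    y_pred = (np.asarray(probs) >= threshold).astype(int)
    return f1_score(y_true, y_pred, average="macro", zero_division=1)

# Invented stand-ins for the test labels and the test-set model outputs.
y_test = np.array([[1, 0], [0, 1], [1, 1]])
test_probs = np.array([[0.30, 0.05], [0.08, 0.22], [0.18, 0.12]])

# Threshold chosen earlier on the validation set, applied as-is to the test set.
val_threshold = 0.1
test_f1 = macro_f1_at(y_test, test_probs, val_threshold)
print(test_f1)
```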


### Evaluation of bert-large-cased
This section involves the evaluation of the BERT large model on both the validation and test datasets.

**Evaluation on validation set**

In [None]:
result_bert_large, model_outputs_bert_large, wrong_predictions_bert_large = model_bert_large.eval_model(merged_val)
print("Results of bert large on validation set:\n", result_bert_large)

In [39]:
bert_large_threshold = get_threshold(val_labels, model_outputs_bert_large)

THRESHOLD: 0.10  F1 score: 0.271
THRESHOLD: 0.15  F1 score: 0.201
THRESHOLD: 0.20  F1 score: 0.185
THRESHOLD: 0.25  F1 score: 0.140
THRESHOLD: 0.30  F1 score: 0.096
THRESHOLD: 0.35  F1 score: 0.074
THRESHOLD: 0.40  F1 score: 0.014
THRESHOLD: 0.45  F1 score: 0.000
THRESHOLD: 0.50  F1 score: 0.000
THRESHOLD: 0.55  F1 score: 0.000
THRESHOLD: 0.60  F1 score: 0.000
THRESHOLD: 0.65  F1 score: 0.000
THRESHOLD: 0.70  F1 score: 0.000
THRESHOLD: 0.75  F1 score: 0.000
THRESHOLD: 0.80  F1 score: 0.000
THRESHOLD: 0.85  F1 score: 0.000

Best threshold obtained: 0.1 having F1 score of: 0.27


**Evaluation on test set**

In [None]:
result_bert_large_test, model_outputs_bert_large_test, wrong_predictions_bert_large_test = model_bert_large.eval_model(merged_test)
print("Results of bert large on test set:\n", result_bert_large_test)

In [41]:
bert_large_threshold_test = get_threshold(test_labels, model_outputs_bert_large_test)

THRESHOLD: 0.10  F1 score: 0.244
THRESHOLD: 0.15  F1 score: 0.186
THRESHOLD: 0.20  F1 score: 0.155
THRESHOLD: 0.25  F1 score: 0.139
THRESHOLD: 0.30  F1 score: 0.102
THRESHOLD: 0.35  F1 score: 0.070
THRESHOLD: 0.40  F1 score: 0.017
THRESHOLD: 0.45  F1 score: 0.000
THRESHOLD: 0.50  F1 score: 0.000
THRESHOLD: 0.55  F1 score: 0.000
THRESHOLD: 0.60  F1 score: 0.000
THRESHOLD: 0.65  F1 score: 0.000
THRESHOLD: 0.70  F1 score: 0.000
THRESHOLD: 0.75  F1 score: 0.000
THRESHOLD: 0.80  F1 score: 0.000
THRESHOLD: 0.85  F1 score: 0.000

Best threshold obtained: 0.1 having F1 score of: 0.24
