# BERT

### Introduction
In this notebook, our objective is to train a BERT model for a challenging multilabel classification task: categorizing textual arguments into one or more of 20 distinct categories, each representing a fundamental human value.

The categories are as follows:
- Self-direction: thought
- Self-direction: action
- Stimulation
- Hedonism
- Achievement
- Power: dominance
- Power: resources
- Face
- Security: personal
- Security: societal
- Tradition
- Conformity: rules
- Conformity: interpersonal
- Humility
- Benevolence: caring
- Benevolence: dependability
- Universalism: concern
- Universalism: nature
- Universalism: tolerance
- Universalism: objectivity






### Flow of the notebook


This notebook is organized into the following sections:

1. Installing the required libraries
2. Importing the libraries
3. Loading the datasets
4. Defining and Fine-Tuning the Model
   - BERT Base
   - BERT Large
5. Evaluation

## Installing the Required Libraries

We have to install the libraries below because they are not pre-installed in the runtime environment provided by Google Colab:
- transformers
- SentencePiece
- wandb
- simpletransformers

In [None]:
!pip install transformers

In [None]:
!pip install SentencePiece

In [None]:
!pip install wandb --upgrade

In [None]:
!pip install --upgrade wandb simpletransformers

In [None]:
!pip install simpletransformers

## Importing the Libraries

In [None]:
from transformers import AutoTokenizer

import logging
import wandb

import torch

import numpy as np

from simpletransformers.classification import MultiLabelClassificationModel, ClassificationArgs

import pandas as pd
from sklearn.metrics import f1_score, classification_report

## Loading the Data
We utilize data sourced from [Zenodo](https://zenodo.org/record/7550385#.Y8wMquzMK3I), specifically from the [Human Value Detection 2023 competition](https://touche.webis.de/semeval23/touche23-web/index.html). Our focus is on the following datasets: arguments-training.tsv, arguments-validation.tsv, arguments-test.tsv, labels-training.tsv, labels-validation.tsv, and labels-test.tsv.


In [None]:
# defining the label titles in our datasets
label_cols = ['Self-direction: thought', 'Self-direction: action', 'Stimulation',
       'Hedonism', 'Achievement', 'Power: dominance', 'Power: resources',
       'Face', 'Security: personal', 'Security: societal', 'Tradition',
       'Conformity: rules', 'Conformity: interpersonal', 'Humility',
       'Benevolence: caring', 'Benevolence: dependability',
       'Universalism: concern', 'Universalism: nature',
       'Universalism: tolerance', 'Universalism: objectivity']

In [None]:
# Loading the data
train_args = pd.read_csv("arguments-training.tsv",delimiter='\t')
train_labels = pd.read_csv("labels-training.tsv",delimiter='\t')

val_labels = pd.read_csv("labels-validation.tsv",delimiter='\t')
val_args = pd.read_csv("arguments-validation.tsv",delimiter='\t')

test_labels = pd.read_csv("labels-test.tsv",delimiter='\t')
test_args = pd.read_csv("arguments-test.tsv",delimiter='\t')

The input data for Simple Transformers models must have a 'text' column containing the text to classify and a 'labels' column containing the list of binary labels for each text. The cells below build both columns and merge them into one dataset per split.
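As a minimal sketch of the target format, the toy example below builds a 'text' and a 'labels' column from two made-up arguments. The rows, the `X1`/`X2` IDs, and the reduced set of three label columns are illustrative assumptions, not data from the competition files.

```python
import pandas as pd

# Two made-up arguments (illustrative data only, not from the dataset).
toy_args = pd.DataFrame({
    "Argument ID": ["X1", "X2"],
    "Conclusion": ["We should ban fast food", "We should ban human cloning"],
    "Stance": ["in favor of", "against"],
    "Premise": ["fast food is unhealthy", "cloning can advance medicine"],
})

# One binary column per value; only three of the 20 are used here for brevity.
toy_labels = pd.DataFrame({
    "Argument ID": ["X1", "X2"],
    "Stimulation": [0, 1],
    "Hedonism": [1, 0],
    "Achievement": [0, 1],
})

# 'text' is the concatenation of conclusion, stance, and premise.
toy_args["text"] = (
    toy_args["Conclusion"] + " " + toy_args["Stance"] + " " + toy_args["Premise"]
)

# 'labels' packs the binary columns of each row into one tuple.
toy_labels["labels"] = toy_labels.iloc[:, 1:].apply(tuple, axis=1)

# Join on the shared ID so each row carries both 'text' and 'labels'.
toy = pd.merge(
    toy_args[["Argument ID", "text"]],
    toy_labels[["Argument ID", "labels"]],
    on="Argument ID",
    how="inner",
)
print(toy)
```

Each row of `toy` now has the two columns the model reads: a single string and a fixed-length tuple of 0s and 1s.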

In [None]:
# Adding the 'text' column
train_args['text'] = train_args['Conclusion'] + " " + train_args['Stance'] + " " + train_args['Premise']
train_args.drop(labels=['Conclusion', 'Stance', 'Premise'], axis=1, inplace=True)

val_args['text'] = val_args['Conclusion'] + " " + val_args['Stance'] + " " + val_args['Premise']
val_args.drop(labels=['Conclusion', 'Stance', 'Premise'], axis=1, inplace=True)

test_args['text'] = test_args['Conclusion'] + " " + test_args['Stance'] + " " + test_args['Premise']
test_args.drop(labels=['Conclusion', 'Stance', 'Premise'], axis=1, inplace=True)

print(val_labels)

     Argument ID  Self-direction: thought  Self-direction: action  \
0         A01001                        0                       0   
1         A01012                        0                       0   
2         A02001                        0                       0   
3         A02002                        0                       1   
4         A02009                        0                       0   
...          ...                      ...                     ...   
1891      E08014                        1                       0   
1892      E08021                        1                       0   
1893      E08022                        0                       1   
1894      E08024                        0                       1   
1895      E08025                        0                       1   

      Stimulation  Hedonism  Achievement  Power: dominance  Power: resources  \
0               0         0            0                 0                 0   
1          

In [None]:
def prepare_labels(data):
  df = pd.DataFrame(data)

  final_list = df.copy()  # Make a copy of the DataFrame
  final_list['labels'] = df.iloc[:, 1:].apply(lambda row: tuple(row), axis=1)

  return final_list

In [None]:
# Adding the 'labels' column
train_labels_compressed = prepare_labels(train_labels)
val_labels_compressed = prepare_labels(val_labels)
test_labels_compressed = prepare_labels(test_labels)

print(train_labels_compressed)

     Argument ID  Self-direction: thought  Self-direction: action  \
0         A01002                        0                       0   
1         A01005                        0                       0   
2         A01006                        0                       0   
3         A01007                        0                       0   
4         A01008                        0                       0   
...          ...                      ...                     ...   
5388      E08016                        0                       0   
5389      E08017                        0                       0   
5390      E08018                        0                       0   
5391      E08019                        0                       0   
5392      E08020                        0                       1   

      Stimulation  Hedonism  Achievement  Power: dominance  Power: resources  \
0               0         0            0                 0                 0   
1          

In [None]:
# Merging the content of two files into one dataset
merged_train = pd.merge(train_args, train_labels_compressed, on='Argument ID', how='inner')
merged_val = pd.merge(val_args, val_labels_compressed, on='Argument ID', how='inner')
merged_test = pd.merge(test_args, test_labels_compressed, on='Argument ID', how='inner')

print(merged_train)

     Argument ID                                               text  \
0         A01002  We should ban human cloning in favor of we sho...   
1         A01005  We should ban fast food in favor of fast food ...   
2         A01006  We should end the use of economic sanctions ag...   
3         A01007  We should abolish capital punishment against c...   
4         A01008  We should ban factory farming against factory ...   
...          ...                                                ...   
5388      E08016  The EU should integrate the armed forces of it...   
5389      E08017  Food whose production has been subsidized with...   
5390      E08018  Food whose production has been subsidized with...   
5391      E08019  Food whose production has been subsidized with...   
5392      E08020  The EU should integrate the armed forces of it...   

      Self-direction: thought  Self-direction: action  Stimulation  Hedonism  \
0                           0                       0            0 

We have implemented the following functions to determine the optimal threshold after training and assessing the model. This is necessary because the model's output consists of a series of probabilities, and these probabilities need to be transformed into binary values (0s and 1s) using a threshold.
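To make the thresholding step concrete, here is a small sketch with made-up probabilities for two arguments and three labels. The numbers are illustrative only; the point is that a lower threshold turns more probabilities into positive predictions.

```python
import numpy as np

# Made-up model outputs: 2 arguments x 3 labels (illustrative values).
probs = np.array([
    [0.62, 0.18, 0.05],
    [0.12, 0.40, 0.55],
])

# Binarize the same probabilities with two different thresholds.
binary_by_threshold = {}
for threshold in (0.15, 0.50):
    binary_by_threshold[threshold] = np.where(probs >= threshold, 1, 0)
    print(threshold, binary_by_threshold[threshold].tolist())
```

With a threshold of 0.15 four of the six entries become 1, while 0.50 keeps only two, which is why the choice of threshold has a large effect on recall and F1.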

In [None]:
# Finding the threshold that maximizes the macro F1 score
def get_threshold(labels, model_output):
  results = {}
  labels_compressed = labels.drop("Argument ID", axis=1)

  df_prediction = labels_compressed.copy()
  for tr in np.arange(0.1, 0.9, 0.05):
      tr = round(tr, 2)
      for i, label in enumerate(label_cols):
          prediction = np.where(model_output[:,i] >= tr, 1, 0)
          df_prediction[label] = prediction

      y_pred = df_prediction.values.tolist()
      y_test = labels_compressed.values.tolist()
      f1 = f1_score(y_test, y_pred, average = "macro", zero_division = 1)
      results[tr] = f1

  for k,v in results.items():
      print("THRESHOLD: {:.2f} ".format(k), "F1 score: {:.3f}".format(v))

  THRESHOLD = max(results, key = results.get)

  print("\nBest threshold obtained:", THRESHOLD , "having F1 score of: {:.2f}".format(max(results.values())))

  return THRESHOLD

Using the function below, we can get a report of the classification scores on various classes.

In [None]:
# Obtaining the final report of our scores
def get_report(labels, model_output, THRESHOLD):
  # Keep only the label columns so that the ID strings are not scored
  labels_compressed = labels.drop("Argument ID", axis=1, errors="ignore")
  df_prediction = labels_compressed.copy()

  # Binarize the predicted probabilities with the chosen threshold
  for i, label in enumerate(label_cols):
    prediction = np.where(model_output[:,i] >= THRESHOLD, 1, 0)
    df_prediction[label] = prediction

  y_pred = df_prediction.values.tolist()
  y_test = labels_compressed.values.tolist()

  print(classification_report(y_test,y_pred, target_names = label_cols))

## Defining the Model


In this section, we introduce our BERT model and begin fine-tuning it by exploring several hyperparameter values. To run these experiments, we use the Sweeps feature of Weights & Biases (W&B), which lets us systematically evaluate different model configurations.

We work with two versions of the BERT model, namely "bert-base-cased" and "bert-large-cased." However, owing to constraints with our CUDA resources, we begin by fine-tuning "bert-base-cased" using hyperparameter tuning to identify the optimal hyperparameters. Once we've determined the best hyperparameters for "bert-base-cased," we proceed to train both the "bert-base-cased" and "bert-large-cased" models. This approach allows us to make a fair and meaningful comparison of their performance on our datasets.

### Defining Hyperparameters


We vary two hyperparameters: the number of training epochs and the training batch size.

In [None]:
# hyperparameters we want to tune
sweep_config = {
    "method": "grid",  # grid, random
    "parameters": {
        "num_train_epochs": {"values": [3, 5, 10]},
        "train_batch_size": {"values": [16, 32]}
    },
}
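A grid sweep tries every combination of the listed values. The sketch below enumerates them with `itertools.product`, only to show what the grid contains; W&B does this expansion itself, and the variable names here are our own.

```python
from itertools import product

# Same value lists as in the sweep configuration.
grid = {
    "num_train_epochs": [3, 5, 10],
    "train_batch_size": [16, 32],
}

# Cartesian product of all value lists: one dict per configuration.
combinations = [
    dict(zip(grid.keys(), values)) for values in product(*grid.values())
]

for combo in combinations:
    print(combo)
print("Number of configurations:", len(combinations))
```

The grid holds 3 x 2 = 6 configurations, which matches the `count=6` passed to the sweep agent later in the notebook.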

In [None]:
# create a sweep in W&B
sweep_id = wandb.sweep(sweep_config, project="Human Value Detection")


Create sweep with ID: gr6rewmr
Sweep URL: https://wandb.ai/human-value-detection/Human%20Value%20Detection/sweeps/gr6rewmr


In [None]:
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

In [None]:
# Our fixed arguments for the simple transformers model
args = {
      'manual_seed': 42,
      'max_seq_length': 100,
      'overwrite_output_dir': True,
      'gradient_accumulation_steps': 8,
      "lr": 2e-4,
      "optimizer": 'AdamW'
      }

### BERT Base
This section contains the training phase for our BERT base model:

In [None]:
def train():
    # Initialize a new wandb run
    wandb.init()

    # Create a TransformerModel
    model = MultiLabelClassificationModel(
        'bert',
        'bert-base-cased',
        use_cuda=True,
        args=args,
        sweep_config=wandb.config,
        num_labels=20
    )

    # Train the model
    model.train_model(merged_train)

    # Evaluate the model
    model.eval_model(merged_val)

    # Finish the run and sync it to W&B
    wandb.finish()


In [None]:
wandb.agent(sweep_id, train, count=6)

In [None]:
# Initialize the W&B API
api = wandb.Api()

# Look up the W&B project (replace the name with your own project name)
project = api.project('Human Value Detection')

project

In [None]:
# Get the sweep
sweep = api.sweep(f"{project.name}/{sweep_id}")

# Retrieve the best run
best_run = sweep.best_run()

# Get the best run's parameters and results
best_params = best_run.config
best_results = best_run.summary
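To our understanding of the W&B public API, `best_run()` orders runs by the metric declared in the sweep configuration, and the configuration used in this notebook declares none, so it is worth checking how the "best" run was chosen. A hedged sketch of the same grid with an explicit metric follows. Ranking on `"Training loss"` is an illustrative assumption (it is the only score that appears in the logged summary), and it tends to favor longer training, so a validation metric would usually be a better choice if one is logged.

```python
# Sketch: the same grid, with an explicit metric for ranking runs.
# The variable name and the choice of metric are assumptions for illustration.
sweep_config_with_metric = {
    "method": "grid",
    # Name must match a key logged to W&B by the training run.
    "metric": {"name": "Training loss", "goal": "minimize"},
    "parameters": {
        "num_train_epochs": {"values": [3, 5, 10]},
        "train_batch_size": {"values": [16, 32]},
    },
}

print(sweep_config_with_metric["metric"])
```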



In [None]:
print("Best Parameters:", best_params)
print("Best Results:", best_results)

Best Parameters: {'num_train_epochs': 10, 'train_batch_size': 32}
Best Results: {'_runtime': 322.14150643348694, '_timestamp': 1693808617.0067475, 'global_step': 200, 'Training loss': 0.3118610978126526, 'lr': 2.0304568527918785e-06, '_step': 3, '_wandb': {'runtime': 341}}
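The `lr` value in this summary can be reproduced by hand, which is a useful sanity check on the learning-rate schedule. The sketch below assumes Simple Transformers' defaults as we understand them (a base `learning_rate` of 4e-5 and a linear schedule with warm-up over 6% of the optimizer steps) together with the 5,393 training rows shown earlier; none of these defaults is stated in the notebook.

```python
import math

# Values taken from the notebook.
n_train = 5393            # rows in merged_train
batch_size = 32           # best_params['train_batch_size']
accumulation_steps = 8    # 'gradient_accumulation_steps'
epochs = 10               # best_params['num_train_epochs']
logged_step = 200         # 'global_step' in the summary

# Assumed library defaults (not stated in the notebook).
base_lr = 4e-5
warmup_ratio = 0.06

# Each optimizer step consumes batch_size * accumulation_steps examples.
effective_batch = batch_size * accumulation_steps

# Optimizer steps over the whole run.
batches_per_epoch = math.ceil(n_train / batch_size)
total_steps = batches_per_epoch // accumulation_steps * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

# Linear decay from base_lr (end of warm-up) to 0 (end of training).
lr_at_step = base_lr * (total_steps - logged_step) / (total_steps - warmup_steps)

print("effective batch size:", effective_batch)
print("total optimizer steps:", total_steps)
print("learning rate at logged step:", lr_at_step)
```

Under these assumptions the computed value agrees with the logged `lr` of about 2.03e-06. That suggests the run used the default learning rate rather than the `2e-4` passed under the key `lr`; if `2e-4` was intended, the option name to check is `learning_rate`.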


In [None]:
# Defining the model using the best parameters obtained
model_bert_base = MultiLabelClassificationModel(
        'bert',
        'bert-base-cased',
        use_cuda=True,
        args={
              'num_train_epochs': best_params['num_train_epochs'],
              'train_batch_size': best_params['train_batch_size'],
              'manual_seed': 42,
              'max_seq_length': 100,
              'overwrite_output_dir': True,
              'gradient_accumulation_steps': 8,
              "lr": 2e-4,
              "optimizer": 'AdamW'
            },
        num_labels=20
)

Some weights of BertForMultiLabelSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# layers of our bert base model
model_bert_base.model

BertForMultiLabelSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), 

In [None]:
# training the model
model_bert_base.train_model(merged_train, eval_df=merged_val)

### BERT Large

This section encompasses the training phase for our BERT large model.

In [None]:
# Defining our bert large model with the hyperparameters we got from bert base
model_bert_large = MultiLabelClassificationModel(
        'bert',
        'bert-large-cased',
        use_cuda=True,
        args={
              'num_train_epochs': best_params['num_train_epochs'],
              'train_batch_size': best_params['train_batch_size'],
              'manual_seed': 42,
              'max_seq_length': 100,
              'overwrite_output_dir': True,
              'gradient_accumulation_steps': 8,
              "lr": 2e-4,
              "optimizer": 'AdamW'
            },
        num_labels=20
)

In [None]:
# Layers of our bert large model
model_bert_large.model

BertForMultiLabelSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-23): 24 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerN

In [None]:
# Training our bert large model
model_bert_large.train_model(merged_train, eval_df=merged_val)

## Evaluation of the models


In this section, we assess the performance of the BERT base and BERT large models on both the validation and test sets. We gauge their performance using the F1 score as the primary metric for evaluation.
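Macro-averaged F1, the variant requested through `average="macro"` in the threshold search, is the unweighted mean of the per-label F1 scores, so rare labels count as much as frequent ones. The sketch below computes it by hand with NumPy on made-up data. The function name `macro_f1` and its handling of empty labels are our own simplification and may differ from scikit-learn in edge cases.

```python
import numpy as np

def macro_f1(y_true, y_pred, zero_division=1.0):
    """Per-label F1 and their unweighted mean for binary indicator matrices."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Count true positives, false positives, false negatives per label (column).
    tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)
    fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)
    fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)

    # F1 = 2TP / (2TP + FP + FN); fall back to zero_division when undefined.
    denom = 2 * tp + fp + fn
    per_label = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), zero_division)
    return per_label, per_label.mean()

# Made-up example: 4 arguments x 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 0]]

per_label, macro = macro_f1(y_true, y_pred)
print("per-label F1:", per_label.tolist())
print("macro F1:", macro)
```

Here the third label is never predicted and scores 0, which pulls the macro average down to 0.5 even though the first label is predicted perfectly.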

### Evaluation of bert-base-cased
Here, we evaluate the BERT base model on both the validation and test datasets.

**Evaluation on validation set**

In [None]:
result_bert_base, model_outputs_bert_base, wrong_predictions_bert_base = model_bert_base.eval_model(merged_val)
print("Results of bert base on validation set:\n", result_bert_base)

In [None]:
bert_base_threshold = get_threshold(val_labels, model_outputs_bert_base)

THRESHOLD: 0.10  F1 score: 0.344
THRESHOLD: 0.15  F1 score: 0.363
THRESHOLD: 0.20  F1 score: 0.348
THRESHOLD: 0.25  F1 score: 0.327
THRESHOLD: 0.30  F1 score: 0.304
THRESHOLD: 0.35  F1 score: 0.279
THRESHOLD: 0.40  F1 score: 0.245
THRESHOLD: 0.45  F1 score: 0.209
THRESHOLD: 0.50  F1 score: 0.176
THRESHOLD: 0.55  F1 score: 0.150
THRESHOLD: 0.60  F1 score: 0.124
THRESHOLD: 0.65  F1 score: 0.099
THRESHOLD: 0.70  F1 score: 0.070
THRESHOLD: 0.75  F1 score: 0.043
THRESHOLD: 0.80  F1 score: 0.016
THRESHOLD: 0.85  F1 score: 0.003

Best threshold obtained: 0.15 having F1 score of: 0.36


In [None]:
get_report(val_labels, model_outputs_bert_base, bert_base_threshold)

                            precision    recall  f1-score   support

   Self-direction: thought       0.30      0.68      0.42       251
    Self-direction: action       0.33      0.81      0.47       496
               Stimulation       0.30      0.12      0.17       138
                  Hedonism       0.64      0.07      0.12       103
               Achievement       0.39      0.91      0.55       575
          Power: dominance       0.23      0.45      0.30       164
          Power: resources       0.23      0.86      0.37       132
                      Face       0.21      0.18      0.19       130
        Security: personal       0.43      0.99      0.59       759
        Security: societal       0.35      0.91      0.50       488
                 Tradition       0.36      0.41      0.38       172
         Conformity: rules       0.32      0.85      0.46       455
 Conformity: interpersonal       0.00      0.00      0.00        60
                  Humility       0.20      0.11

**Evaluation on test set**


In [None]:
result_bert_base_test, model_outputs_bert_base_test, wrong_predictions_bert_base_test = model_bert_base.eval_model(merged_test)
print("Results of bert base on test set:\n", result_bert_base_test)

In [None]:
bert_base_threshold_test = get_threshold(test_labels, model_outputs_bert_base_test)

THRESHOLD: 0.10  F1 score: 0.309
THRESHOLD: 0.15  F1 score: 0.340
THRESHOLD: 0.20  F1 score: 0.330
THRESHOLD: 0.25  F1 score: 0.308
THRESHOLD: 0.30  F1 score: 0.284
THRESHOLD: 0.35  F1 score: 0.261
THRESHOLD: 0.40  F1 score: 0.235
THRESHOLD: 0.45  F1 score: 0.208
THRESHOLD: 0.50  F1 score: 0.176
THRESHOLD: 0.55  F1 score: 0.155
THRESHOLD: 0.60  F1 score: 0.133
THRESHOLD: 0.65  F1 score: 0.112
THRESHOLD: 0.70  F1 score: 0.088
THRESHOLD: 0.75  F1 score: 0.057
THRESHOLD: 0.80  F1 score: 0.026
THRESHOLD: 0.85  F1 score: 0.004

Best threshold obtained: 0.15 having F1 score of: 0.34


In [None]:
get_report(test_labels, model_outputs_bert_base_test, bert_base_threshold_test)

                            precision    recall  f1-score   support

   Self-direction: thought       0.22      0.62      0.33       143
    Self-direction: action       0.34      0.87      0.49       391
               Stimulation       0.19      0.08      0.11        77
                  Hedonism       0.17      0.08      0.11        26
               Achievement       0.38      0.80      0.51       412
          Power: dominance       0.18      0.52      0.27       108
          Power: resources       0.27      0.77      0.40       105
                      Face       0.17      0.08      0.11        96
        Security: personal       0.37      1.00      0.54       537
        Security: societal       0.31      0.92      0.47       397
                 Tradition       0.29      0.64      0.40       168
         Conformity: rules       0.26      0.94      0.41       287
 Conformity: interpersonal       0.08      0.02      0.03        53
                  Humility       0.05      0.09

### Evaluation of bert-large-cased
This section involves the evaluation of the BERT large model on both the validation and test datasets.

**Evaluation on validation set**

In [None]:
result_bert_large, model_outputs_bert_large, wrong_predictions_bert_large = model_bert_large.eval_model(merged_val)
print("Results of bert large on validation set:\n", result_bert_large)

In [None]:
bert_large_threshold = get_threshold(val_labels, model_outputs_bert_large)

THRESHOLD: 0.10  F1 score: 0.398
THRESHOLD: 0.15  F1 score: 0.420
THRESHOLD: 0.20  F1 score: 0.414
THRESHOLD: 0.25  F1 score: 0.398
THRESHOLD: 0.30  F1 score: 0.379
THRESHOLD: 0.35  F1 score: 0.356
THRESHOLD: 0.40  F1 score: 0.337
THRESHOLD: 0.45  F1 score: 0.318
THRESHOLD: 0.50  F1 score: 0.296
THRESHOLD: 0.55  F1 score: 0.268
THRESHOLD: 0.60  F1 score: 0.248
THRESHOLD: 0.65  F1 score: 0.219
THRESHOLD: 0.70  F1 score: 0.192
THRESHOLD: 0.75  F1 score: 0.161
THRESHOLD: 0.80  F1 score: 0.126
THRESHOLD: 0.85  F1 score: 0.088

Best threshold obtained: 0.15 having F1 score of: 0.42


In [None]:
get_report(val_labels, model_outputs_bert_large, bert_large_threshold)

                            precision    recall  f1-score   support

   Self-direction: thought       0.32      0.75      0.45       251
    Self-direction: action       0.43      0.76      0.54       496
               Stimulation       0.31      0.20      0.24       138
                  Hedonism       0.38      0.35      0.36       103
               Achievement       0.47      0.86      0.61       575
          Power: dominance       0.23      0.48      0.31       164
          Power: resources       0.31      0.85      0.46       132
                      Face       0.19      0.15      0.17       130
        Security: personal       0.52      0.95      0.68       759
        Security: societal       0.40      0.86      0.54       488
                 Tradition       0.36      0.47      0.40       172
         Conformity: rules       0.39      0.80      0.53       455
 Conformity: interpersonal       0.20      0.08      0.12        60
                  Humility       0.20      0.23

**Evaluation on test set**

In [None]:
result_bert_large_test, model_outputs_bert_large_test, wrong_predictions_bert_large_test = model_bert_large.eval_model(merged_test)
print("Results of bert large on test set:\n", result_bert_large_test)

In [None]:
bert_large_threshold_test = get_threshold(test_labels, model_outputs_bert_large_test)

THRESHOLD: 0.10  F1 score: 0.372
THRESHOLD: 0.15  F1 score: 0.402
THRESHOLD: 0.20  F1 score: 0.410
THRESHOLD: 0.25  F1 score: 0.408
THRESHOLD: 0.30  F1 score: 0.391
THRESHOLD: 0.35  F1 score: 0.370
THRESHOLD: 0.40  F1 score: 0.351
THRESHOLD: 0.45  F1 score: 0.328
THRESHOLD: 0.50  F1 score: 0.306
THRESHOLD: 0.55  F1 score: 0.287
THRESHOLD: 0.60  F1 score: 0.265
THRESHOLD: 0.65  F1 score: 0.246
THRESHOLD: 0.70  F1 score: 0.221
THRESHOLD: 0.75  F1 score: 0.184
THRESHOLD: 0.80  F1 score: 0.143
THRESHOLD: 0.85  F1 score: 0.101

Best threshold obtained: 0.2 having F1 score of: 0.41


In [None]:
get_report(test_labels, model_outputs_bert_large_test, bert_large_threshold_test)

                            precision    recall  f1-score   support

   Self-direction: thought       0.34      0.65      0.44       143
    Self-direction: action       0.51      0.71      0.59       391
               Stimulation       0.33      0.04      0.07        77
                  Hedonism       0.31      0.19      0.24        26
               Achievement       0.52      0.67      0.58       412
          Power: dominance       0.26      0.38      0.31       108
          Power: resources       0.39      0.69      0.50       105
                      Face       0.22      0.04      0.07        96
        Security: personal       0.52      0.92      0.66       537
        Security: societal       0.38      0.87      0.53       397
                 Tradition       0.42      0.65      0.51       168
         Conformity: rules       0.34      0.82      0.48       287
 Conformity: interpersonal       0.31      0.09      0.14        53
                  Humility       0.10      0.12