# NLP course project
**Summary**: Application of text classification approaches for Human Value Detection <br>
**Members**:
- Dell'Olio Domenico
- Delvecchio Giovanni Pio
- Disabato Raffaele  


The project was developed in order to create and test various models to address the task of Human Value Detection proposed in the challenge: <br>
https://touche.webis.de/semeval23/touche23-web/index.html <br>

The challenge can be tackled as a multi-label text classification problem, so we implemented and tested several architectures in order to compare their performance. <br>
These architectures were either taken from the state of the art or obtained as a result of our experiments.

## This notebook contains the following implementations:
- GloVe baseline with two layers of Bi-GRU, followed by flatten and two dense layers with ReLU activation and a single dense layer with no activation;
- BERT baseline with two layers of Bi-LSTM (transfer learning), where the output cell states are concatenated and passed to a dense layer with ReLU activation and a single dense layer with no activation;
- fine-tuning of BERT, followed by a dense layer with ReLU activation and a dense layer with no activation.

## This notebook does **not** contain:
- extensive data analysis (it is covered in the other notebook)

In [1]:
# installation of the required libraries
!pip install transformers
!pip install datasets
!pip install torchinfo

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.0 tokenizers-0.13.2 transformers-4.26.0
Looking in indexes: https://pypi.org/simple, http

In [2]:
# Cell for the download of the datasets
!wget https://zenodo.org/record/7550385/files/arguments-training.tsv
!wget https://zenodo.org/record/7550385/files/labels-training.tsv
!wget https://zenodo.org/record/7550385/files/arguments-validation.tsv
!wget https://zenodo.org/record/7550385/files/labels-validation.tsv
!wget https://zenodo.org/record/7550385/files/arguments-test.tsv
!wget https://zenodo.org/record/7550385/files/arguments-validation-zhihu.tsv
!wget https://zenodo.org/record/7550385/files/labels-validation-zhihu.tsv

--2023-02-09 16:03:10--  https://zenodo.org/record/7550385/files/arguments-training.tsv
Resolving zenodo.org (zenodo.org)... 188.185.124.72
Connecting to zenodo.org (zenodo.org)|188.185.124.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1012498 (989K) [application/octet-stream]
Saving to: ‘arguments-training.tsv’


2023-02-09 16:03:20 (147 KB/s) - ‘arguments-training.tsv’ saved [1012498/1012498]

--2023-02-09 16:03:20--  https://zenodo.org/record/7550385/files/labels-training.tsv
Resolving zenodo.org (zenodo.org)... 188.185.124.72
Connecting to zenodo.org (zenodo.org)|188.185.124.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 253843 (248K) [application/octet-stream]
Saving to: ‘labels-training.tsv’


2023-02-09 16:03:23 (318 KB/s) - ‘labels-training.tsv’ saved [253843/253843]

--2023-02-09 16:03:23--  https://zenodo.org/record/7550385/files/arguments-validation.tsv
Resolving zenodo.org (zenodo.org)... 188.185.124.72
Connecting

In [3]:
# imports for dataset loading
import numpy as np
import random
import pandas as pd
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split

# torch imports
import torch
import torchtext
from torchtext.data import get_tokenizer
from torchtext.vocab import GloVe
from torch.utils.data import DataLoader
from torchtext.data.functional import to_map_style_dataset
from torch import nn
from torch.nn import functional as F
from torch.optim import Adam
from torchinfo import summary
from torch.optim import AdamW

#huggingface imports
from transformers import BertTokenizer, BertModel, get_linear_schedule_with_warmup

# progress bar
from tqdm import tqdm
# garbage collector
import gc

# imports for evaluation
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

In [4]:
def fix_random(seed: int) -> None:
  """Fix all the possible sources of randomness.

  Params:
    seed: the seed to use. 
  """
  np.random.seed(seed)
  random.seed(seed)
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)

  torch.backends.cudnn.benchmark = False
  torch.backends.cudnn.deterministic = True

In [5]:
# Cell needed to fix the seeds and define the available device
# for the training of the models
seed = 10
fix_random(seed)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


In [6]:
def huggingface_from_pandas(pandas_df):
  """
  Function converting a pandas dataframe to a huggingface dataset.
  It also returns an ordered list containing the target labels

  Params:
    pandas_df: the dataset that has to be converted
  Returns:
    hf_ds:     the huggingface dataset obtained from pandas_df
    label_cols: the ordered list of target labels of pandas_df
  """

  hf_ds = Dataset.from_pandas(pandas_df, preserve_index=False)
  hf_ds = hf_ds.remove_columns(["Argument ID", "Argument ID2"])
  # Aggregating labels in a single list
  hf_ds = hf_ds.map(lambda x:{"labels": [int(x[col]) for col in hf_ds.column_names if
                                      col not in ['Conclusion', 'Stance', 'Premise']]})
  label_cols = [col for col in hf_ds.column_names if col not in ['Conclusion', 'Stance', 'Premise', "labels"]]
  # here we are removing the columns related to the labels from the dataset
  hf_ds = hf_ds.remove_columns(label_cols)
  return hf_ds, label_cols

The challenge provides the dataset already split into Train, Validation and Test sets. However, the labels of the Test split are not publicly available,
so we decided to split the Training set into (Training, Validation)
(with an 80-20 proportion on unique conclusions) and to use the original Validation set as Test set.  <br>
We also decided to probe the robustness of our models on the Chinese validation
set (Zhihu), which has a different cultural background.
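Splitting on unique conclusions means that no conclusion seen during training appears in the validation set. The following is a minimal, dependency-free sketch of that idea; the function name `split_by_conclusion` and the toy rows are hypothetical, and the notebook's own implementation works on a Pandas dataframe instead.

```python
import random

def split_by_conclusion(rows, ratio=0.8, seed=10):
    """Split rows so that no conclusion appears in both parts."""
    # Unique conclusions in order of first appearance (dicts keep insertion order).
    unique_conclusions = list(dict.fromkeys(row["Conclusion"] for row in rows))
    n_train = int(len(unique_conclusions) * ratio)
    rng = random.Random(seed)
    train_conclusions = set(rng.sample(unique_conclusions, n_train))
    train_rows = [r for r in rows if r["Conclusion"] in train_conclusions]
    val_rows = [r for r in rows if r["Conclusion"] not in train_conclusions]
    return train_rows, val_rows

# Hypothetical toy data: 10 conclusions with 3 premises each.
toy_rows = [
    {"Conclusion": f"conclusion {c}", "Premise": f"premise {c}-{p}"}
    for c in range(10)
    for p in range(3)
]

train_rows, val_rows = split_by_conclusion(toy_rows)
train_set = {r["Conclusion"] for r in train_rows}
val_set = {r["Conclusion"] for r in val_rows}

print(len(train_rows), len(val_rows))  # 24 6
print(train_set & val_set)             # set(), i.e. no shared conclusions
```

Note that the 80-20 proportion applies to the conclusions, not to the rows: when conclusions have different numbers of premises, the row proportion can deviate from 80-20.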

In [56]:
def train_test_split_wrt_conclusions(train, ratio = 0.8):
  """
  Function needed to split the original train dataset into a train and a
  validation set that share no conclusions. The ratio parameter sets the
  portion of the unique conclusions that is assigned to the train split.
  
  Params:
    train: the original train set, to be split (Pandas dataframe)
    ratio: the proportion in (0, 1) of unique conclusions to be inserted in 
           the training dataframe.
  Returns:
    train_set_to_return: the rows of train whose conclusions belong to the
                         selected ratio of unique conclusions.
    val_set_to_return: the rows of train whose conclusions belong to the
                       remaining 1 - ratio of unique conclusions.
  """
  unique_conc = pd.unique(train["Conclusion"])
  num_train_con = int(len(unique_conc)*ratio)
  train_unique_conc = np.random.choice(unique_conc, num_train_con, replace = False)
  val_unique_conc = set(unique_conc) - set(train_unique_conc)
  train_set_to_return = train[train.Conclusion.isin(train_unique_conc)] 
  val_set_to_return = train[train.Conclusion.isin(val_unique_conc)]
  return train_set_to_return, val_set_to_return

In [8]:
# Dataset loading and splitting
raw_training = pd.read_csv("arguments-training.tsv", encoding='utf-8', sep='\t', header=0)
raw_training_lab = pd.read_csv("labels-training.tsv", encoding='utf-8', sep='\t', header=0)
raw_test = pd.read_csv("arguments-validation.tsv", encoding='utf-8', sep='\t', header=0)
raw_test_lab = pd.read_csv("labels-validation.tsv", encoding='utf-8', sep='\t', header=0)
raw_test_chn=pd.read_csv("arguments-validation-zhihu.tsv", encoding='utf-8', sep='\t', header=0)
raw_test_chn_lab=pd.read_csv("labels-validation-zhihu.tsv", encoding='utf-8', sep='\t', header=0)

train = raw_training.join(raw_training_lab,how='inner' ,lsuffix='2') # joining labels
test = raw_test.join(raw_test_lab, how='inner', lsuffix='2') # joining labels
test_chn = raw_test_chn.join(raw_test_chn_lab, how='inner', lsuffix='2') # joining labels
fix_random(seed)
train, val = train_test_split_wrt_conclusions(train) # splitting training

train_ds, label_list = huggingface_from_pandas(train)
val_ds, _ = huggingface_from_pandas(val)
test_ds, _ = huggingface_from_pandas(test)
test_chn_ds, _ = huggingface_from_pandas(test_chn) 

print("Single example from the training dataset: ")
print(train_ds[0])
print("Full list of target labels: ")
print(label_list)
num_classes = len(label_list)
print("Total number of target labels: ")
print(num_classes)
whole_dataset = DatasetDict()
whole_dataset["train"] = train_ds.with_format("torch")
whole_dataset["val"] = val_ds.with_format("torch")
whole_dataset["test"] = test_ds.with_format("torch")
whole_dataset["test_chn"] = test_chn_ds.with_format("torch")

  0%|          | 0/4176 [00:00<?, ?ex/s]

  0%|          | 0/1217 [00:00<?, ?ex/s]

  0%|          | 0/1896 [00:00<?, ?ex/s]

  0%|          | 0/100 [00:00<?, ?ex/s]

Single example from the training dataset: 
{'Conclusion': 'We should ban human cloning', 'Stance': 'in favor of', 'Premise': 'we should ban human cloning as it will only cause huge issues when you have a bunch of the same humans running around all acting the same.', 'labels': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
Full list of target labels: 
['Self-direction: thought', 'Self-direction: action', 'Stimulation', 'Hedonism', 'Achievement', 'Power: dominance', 'Power: resources', 'Face', 'Security: personal', 'Security: societal', 'Tradition', 'Conformity: rules', 'Conformity: interpersonal', 'Humility', 'Benevolence: caring', 'Benevolence: dependability', 'Universalism: concern', 'Universalism: nature', 'Universalism: tolerance', 'Universalism: objectivity']
Total number of target labels: 
20
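Each argument can carry several values at once, so its target is a multi-hot vector over the 20 labels rather than a single class index. As a minimal sketch (the helper names `encode` and `decode` are hypothetical and not part of the notebook), the vector of the training example printed above can be mapped back to label names:

```python
LABELS = [
    'Self-direction: thought', 'Self-direction: action', 'Stimulation',
    'Hedonism', 'Achievement', 'Power: dominance', 'Power: resources', 'Face',
    'Security: personal', 'Security: societal', 'Tradition',
    'Conformity: rules', 'Conformity: interpersonal', 'Humility',
    'Benevolence: caring', 'Benevolence: dependability',
    'Universalism: concern', 'Universalism: nature',
    'Universalism: tolerance', 'Universalism: objectivity',
]

def encode(active, labels=LABELS):
    """Turn a set of label names into a multi-hot list."""
    return [1 if name in active else 0 for name in labels]

def decode(multi_hot, labels=LABELS):
    """Turn a multi-hot list back into the names of the active labels."""
    return [name for name, flag in zip(labels, multi_hot) if flag == 1]

# The 'labels' field of the example printed above: a single 1 at index 9.
example = [0] * 9 + [1] + [0] * 10

print(decode(example))                          # ['Security: societal']
print(decode(encode({"Achievement", "Face"})))  # ['Achievement', 'Face']
```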


In [9]:
def make_predictions(model, loader):
  """
  Function needed to obtain the prediction for the target labels
  given a model and a data loader.

  Params:
    model: the model that will be used to obtain the predictions over
           the labels
    loader: the data loader needed to feed the model with the data for which
            we want to obtain label predictions
  Returns:
    Y_preds: tensor containing the predicted scores for each example.
             These scores are obtained by passing the output of the model
             through a sigmoid function.
  """
  Y_preds = []
  model.eval()
  for X, Y in loader:
    with torch.no_grad():
      preds = model(X)
    Y_preds.append(preds)
  gc.collect()
  Y_preds = torch.cat(Y_preds)
  Y_preds = Y_preds.sigmoid()
  return Y_preds.detach()

def keep_above_thresh(Y_preds, thr):
  """
  Function needed to convert the results of the models to hard labels
  using a threshold.
  
  Params:
    Y_preds: scores obtained by the model which have to be converted to hard
             labels
    thr: threshold to be applied to the scores, element of (0, 1). If a score
         is greater than thr it becomes a hard label with value 1, 
         0 otherwise
  Returns:
    Y_preds_thr: hard labels obtained by thresholding Y_preds with thr
  """
  scores = Y_preds.numpy()
  # vectorized comparison; the result keeps the dtype of the scores
  Y_preds_thr = (scores > thr).astype(scores.dtype)
  return Y_preds_thr

def compute_macro_score(M_true, M_pred, score_func):
  """
  Function needed to compute the macro aggregation of a scoring function
  over the different classes.

  Params:
    M_true: true labels needed to compute the scores
    M_pred: predicted labels needed to compute the scores
    score_func: scoring function to be computed
  Returns:
    macro: aggregation of the result of score_func computed over all the
           labels.
    scores: list of per-label score
  """
  scores = []
  for i in range(M_true.shape[1]):
      true = M_true[:, i]
      pred = M_pred[:, i]
      if score_func == accuracy_score:
        scores.append(score_func(true, pred))
      else: 
        scores.append(score_func(true, pred, zero_division=0))
  macro = np.mean(scores)
  return macro, scores
  
def support(true, pred, zero_division):
  """
  Utility function to compute the support of the class labels,
  pred and zero_division are dummy parameters needed to have conformity
  with the sklearn functions to compute scores.

  Params: 
    true: binary true labels for a single class for each example that are needed
          to compute the support for the single class
    pred: dummy parameter
    zero_division: dummy parameter
  Returns:
    sum(true): the number of examples for a single class (support)
  """
  return sum(true)

def print_report(classifier, loader, y_true, threshold, labels=label_list):
  """
  Function needed to print the classification results given a classifier,
  a dataset loader, true labels and a threshold. 
  The printed report includes macro accuracy, precision, recall and F1, as 
  well as per-class accuracy, precision, recall, F1 and support.

  Params:
    classifier: the model that has to be evaluated
    loader: data-loader needed to feed the data to the classifier to get 
            predicted labels
    y_true: true labels associated to the dataset associated to the loader
    threshold: threshold for the conversion of the scores to hard labels,
               check keep_above_thresh for further details
    labels: ordered list of target labels. Defaults to the list extracted from
            the dataset
  """

  Y_preds = make_predictions(classifier, loader)
  Y_preds_thr = keep_above_thresh(Y_preds.to('cpu'), threshold)

  f1_macro, f1 = compute_macro_score(y_true, Y_preds_thr, f1_score)
  acc_macro, acc = compute_macro_score(y_true, Y_preds_thr, accuracy_score)
  prec_macro, prec = compute_macro_score(y_true, Y_preds_thr, precision_score)
  rec_macro, rec = compute_macro_score(y_true, Y_preds_thr, recall_score)
  _, sup = compute_macro_score(y_true, Y_preds_thr, support)

  print("----- MACRO AVG. -----")
  print(f"  F1-score:\t{round(f1_macro,4)}\n\
  Precision:\t{round(prec_macro,4)}\n\
  Recall:\t{round(rec_macro,4)}\n\
  Accuracy:\t{round(acc_macro,4)}")
  print("----- PER-CLASS VALUES -----")
  print("  \t\t\t\tF1-score\tPrecision\tRecall\t\tAccuracy\tSupport")
  for i in range(len(labels)):
    print("  " + labels[i]+" "*(len(max(labels, key=len))-len(labels[i])), end="\t")
    print(f"{round(f1[i],4)}\t\t{round(prec[i],4)}\t\t{round(rec[i],4)}\t\t{round(acc[i],4)}\t\t{sup[i]}")
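The evaluation helpers above first turn sigmoid scores into hard labels with a threshold and then average a per-label score. The following is a dependency-free sketch of the same idea for F1. The toy scores and labels are made up for illustration, and sklearn is replaced by a hand-written F1 that returns 0 when a label has no true or predicted positives, mirroring `zero_division=0`.

```python
def f1_binary(true, pred):
    """F1 for one label; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, scores, thr):
    """Threshold the scores, then average the per-label F1 values."""
    hard = [[1 if s > thr else 0 for s in row] for row in scores]
    n_labels = len(y_true[0])
    per_label = [
        f1_binary([row[j] for row in y_true], [row[j] for row in hard])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels, per_label

# Hypothetical toy problem: 4 examples, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 0]]
scores = [[0.9, 0.2, 0.3], [0.1, 0.6, 0.1], [0.4, 0.3, 0.2], [0.3, 0.1, 0.1]]

macro_low, per_label_low = macro_f1(y_true, scores, thr=0.25)
macro_high, per_label_high = macro_f1(y_true, scores, thr=0.5)
print(round(macro_low, 4), round(macro_high, 4))  # 0.9333 0.4444
```

On this toy data the lower threshold recovers more positives, which is why the threshold is worth tuning on the validation set rather than fixing it at 0.5.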

## GloVe model
The first model that was developed combines GloVe 100d embeddings with two Bi-GRU layers.
It serves as an advanced baseline for experiments on multi-label classification problems like the current one.
It is still a baseline because it has a simple architecture, OOV tokens are mapped to zero vectors, the hidden states of the Bi-GRU layers are initialized
as zero vectors and, most importantly, the model does not use contextual information, but only the semantics of the individual words. <br>
Moreover, no heavy preprocessing is applied to the dataset: the arguments are only lowercased, tokenized, truncated and padded. Truncation and padding to a short maximum length are used because, with longer padded sequences, the GloVe embeddings would return too many unmasked zero vectors. <br>
About padding and truncation: the maximum allowed length is 35 tokens, which is
slightly above the sum of the mean token lengths of the premises and the
conclusions.
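As a small illustration of the padding and truncation step, the sketch below brings every token list to exactly 35 entries. The helper name `pad_or_truncate` is hypothetical. The empty-string padding token follows the notebook's collate function, where such tokens are out of vocabulary and therefore end up as zero vectors.

```python
MAX_WORDS = 35   # same value as max_words_emb in the notebook
PAD_TOKEN = ""   # padding token used by the notebook's collate function

def pad_or_truncate(tokens, max_len=MAX_WORDS, pad=PAD_TOKEN):
    """Return a list of exactly max_len tokens."""
    if len(tokens) < max_len:
        return tokens + [pad] * (max_len - len(tokens))
    return tokens[:max_len]

short_argument = "we should ban human cloning".split()  # 5 tokens
long_argument = ["tok"] * 50                             # 50 tokens

padded = pad_or_truncate(short_argument)
truncated = pad_or_truncate(long_argument)

print(len(padded), padded.count(PAD_TOKEN))  # 35 30
print(len(truncated))                         # 35
```

A larger maximum length would keep more of the long arguments but would also add more padding, and hence more zero vectors, to the short ones.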

N.B.: the last dense layer has no activation in any of the models, since the loss
function (`BCEWithLogitsLoss`) applies the sigmoid internally in a numerically stable way. Thus the output of the
layer must be passed to a sigmoid function before converting it to labels.
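To see why working on raw logits matters, the sketch below compares a naive binary cross-entropy (sigmoid first, then logarithm) with a commonly used stable formulation that operates directly on the logit. It is written in plain Python for a single logit and target, and it is an illustration of the principle, not a copy of the PyTorch implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_naive(logit, target):
    """Sigmoid followed by the textbook cross-entropy formula."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def bce_with_logits(logit, target):
    """Stable form: max(x, 0) - x * y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# For moderate logits the two formulations agree.
print(abs(bce_naive(2.0, 1.0) - bce_with_logits(2.0, 1.0)) < 1e-12)  # True

# For a confident but wrong prediction the sigmoid saturates to exactly 1.0,
# so the naive version evaluates log(0) and fails.
try:
    bce_naive(40.0, 0.0)
    naive_failed = False
except ValueError:
    naive_failed = True

print(naive_failed)                 # True
print(bce_with_logits(40.0, 0.0))   # 40.0
```

At inference time no loss is computed, so the sigmoid has to be applied explicitly to the logits before thresholding.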

In [10]:
# Pretrained GloVe setup

global_vectors = GloVe(name='6B', dim=100)

# the current choice is to give an id to each word
tokenizer = get_tokenizer("basic_english")

.vector_cache/glove.6B.zip: 862MB [02:43, 5.28MB/s]                           
100%|█████████▉| 399999/400000 [00:16<00:00, 24181.28it/s]


In [65]:
# these parameters are used both by the following function and by the 
# implementation of the GloVe model itself, thus are kept global
max_words_emb = 35
embed_len = 100

# collate function where the Premises are tokenized and embedded in batches
def vectorize_batch(batch):
  """
  Collate function to preprocess the data for the GloVe model.
  In particular it joins premises, stances and conclusions in the same string,
  tokenizes, truncates and pads them and then converts each token to a GloVe
  vector. Target labels are already one-hot encoded.

  Params:
    batch: batch of data to be preprocessed
  Returns:
    X_tensor: a tensor containing GloVe vectors of dimension 100, which has shape
              (batch_size, max_words_emb, embed_len)
    Y_tensor: a tensor containing labels, which has shape
              (batch_size, num_classes)
  """
  X = [elem["Premise"] + " " + elem["Stance"] + " " +elem["Conclusion"] for elem in batch]
  Y = [elem["labels"] for elem in batch]
  X = [tokenizer(x) for x in X]
  X = [tokens+[""] * (max_words_emb-len(tokens))  if len(tokens)<max_words_emb else tokens[:max_words_emb] for tokens in X]
  X_tensor = torch.zeros(len(batch), max_words_emb, embed_len)
  Y_tensor = torch.zeros(len(batch), Y[0].shape[0])
  for i, tokens in enumerate(X):
      X_tensor[i] = global_vectors.get_vecs_by_tokens(tokens)
      Y_tensor[i] = Y[i]
  return X_tensor, Y_tensor

In [74]:
# Simple model to perform some tests with pytorch
class EmbeddingClassifier(nn.Module):
  """
  Class implementing the GloVe model.
  Remark: max_words_emb, embed_len and num_classes are parameters
          used to create this architecture which are set outside the class.
  """
  def __init__(self):
      super(EmbeddingClassifier, self).__init__() 
      
      self.gru_layers = 2

      self.gru = nn.GRU(input_size = embed_len,
                        hidden_size = embed_len,
                        num_layers = self.gru_layers,
                        batch_first=True, 
                        bidirectional = True)
      self.flatten = nn.Flatten(start_dim=1)
      self.linear_1 = nn.Linear(max_words_emb*embed_len*2, 512)
      self.relu = nn.ReLU()
      self.linear_2 = nn.Linear(512,128)
      self.linear_3 = nn.Linear(128, num_classes)
      
              

  def forward(self, X_batch):
    """
    It is important to note that the initial hidden states of the GRU
    layers are initialized with zero tensors.

    The outcomes of the GRU layers are flattened and classified. 
    """
    # zero initial hidden state, created directly on the device of the input
    h0 = torch.zeros(2*self.gru_layers, X_batch.shape[0], embed_len, device=X_batch.device)
    out, hn = self.gru(X_batch, h0)
    out = self.flatten(out)
    out = self.linear_1(out)
    out = self.relu(out)
    out = self.linear_2(out)
    out = self.relu(out)
    out = self.linear_3(out)
    return out

# Function needed to compute the validation loss
def compute_validation_loss(model, loss_fn, val_loader):
  """
  Function computing and printing the loss on the validation set.
  Params:
    model: the model for which the loss must be computed and printed
    loss_fn: the loss function to adopt
    val_loader: dataloader for the validation set

  Returns:
    loss: the mean of the per-batch losses over the validation set
  """

  with torch.no_grad():
    losses = []
    for X, Y in val_loader:
      preds = model(X)
      loss = loss_fn(preds, Y)
      losses.append(loss.item())

    loss = torch.tensor(losses).mean()
    print("Valid Loss : {:.3f}".format(loss))
  return loss


# Training function
def TrainModel(model, loss_fn, optimizer, train_loader, val_loader, epochs, early_stopping_info, model_name):
  """
  Function for model training. If early_stopping_info is defined, the best
  model so far is saved and, if early stopping is triggered, it is reloaded
  and returned. If early_stopping_info is None, early stopping is not
  applied.

  Params:
    model: the model that has to be trained
    loss_fn: the loss function to adopt in order to perform the training
    optimizer: the optimizer to be used for training
    train_loader: the dataloader for the training dataset
    val_loader: the dataloader for the validation dataset
    epochs: the number of epochs for training
    early_stopping_info: dictionary containing the parameters for the early stopping:
                          - delta: min acceptable improvement in the validation loss
                          - patience: number of epochs to wait for improvement
    model_name: string containing the name of the model (used in order to save
                the weights)
  Returns:
    model: the trained model
  """
  patience_acc = 0
  precedent_loss = np.inf  # np.Inf was removed in NumPy 2.0
  model.train()
  for i in range(1, epochs+1):
      losses = []
      for X, Y in tqdm(train_loader):

          Y_preds = model(X)

          loss = loss_fn(Y_preds, Y)
          losses.append(loss.item())

          optimizer.zero_grad()
          loss.backward()
          optimizer.step()

      loss = compute_validation_loss(model, loss_fn, val_loader)
      print("Train Loss : {:.3f}".format(torch.tensor(losses).mean()))
      if early_stopping_info is not None:
        if precedent_loss - loss < early_stopping_info["delta"]:
            patience_acc = patience_acc + 1
        else:
          patience_acc = 0
          precedent_loss = loss  
          torch.save(model, model_name + "_best.pth")

        if patience_acc >= early_stopping_info["patience"]:
          return torch.load(model_name + "_best.pth")           
  return model


### Training information
The training was performed considering a maximum of 50 epochs with the following parameters for early stopping: patience equal to 3 epochs and delta equal to 1e-4. Batch size, learning rate and number of parameters for the layers were tuned by hand considering the results on the validation set.
In particular the following pools were considered:
- batch_size in \{16, 32, 64\}
- learning rate in \{1e-2, 1e-3, 1e-4\}
- hidden size of the GRU layers in \{100, 200\}
- neurons of the linear layers in \{512, 256, 128\}
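The early stopping rule described above (patience of 3 epochs, delta of 1e-4) can be sketched in isolation as follows. The function name `early_stopping_epoch` is hypothetical. The loss values are the rounded validation losses printed in the GloVe training log of this notebook, so the sketch only approximates the real run, which compares unrounded losses.

```python
def early_stopping_epoch(val_losses, patience=3, delta=1e-4):
    """Return (stop_epoch, best_epoch), both 1-based.

    stop_epoch is None if early stopping is never triggered.
    """
    best_loss = float("inf")
    best_epoch = None
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best_loss - loss < delta:
            # improvement smaller than delta: consume one unit of patience
            waited += 1
        else:
            # sufficient improvement: reset patience and remember this epoch
            waited = 0
            best_loss = loss
            best_epoch = epoch
        if waited >= patience:
            return epoch, best_epoch
    return None, best_epoch

# Rounded validation losses from the GloVe training log.
glove_val_losses = [0.410, 0.407, 0.397, 0.388, 0.380, 0.375, 0.368,
                    0.365, 0.364, 0.363, 0.363, 0.364, 0.364]

stop_epoch, best_epoch = early_stopping_epoch(glove_val_losses)
print(stop_epoch, best_epoch)  # 13 10
```

With these values training stops after epoch 13 and the weights saved at epoch 10 are the ones returned, which is consistent with the 13 epochs shown in the log.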

In [75]:
epochs = 50
learning_rate = 1e-4
batch_size = 32

loss_fn = nn.BCEWithLogitsLoss()
embed_classifier = EmbeddingClassifier()
optimizer = Adam(embed_classifier.parameters(), lr=learning_rate)

# Construction of the Dataloaders for train and validation
train_loader = DataLoader(whole_dataset["train"], batch_size=batch_size, collate_fn=lambda x:tuple(y.to(device) for y in vectorize_batch(x)))
val_loader  = DataLoader(whole_dataset["val"], batch_size=batch_size, collate_fn=lambda x:tuple(y.to(device) for y in vectorize_batch(x)))
test_loader  = DataLoader(whole_dataset["test"], batch_size=batch_size, collate_fn=lambda x:tuple(y.to(device) for y in vectorize_batch(x)))


embed_classifier.to(device)
summary(embed_classifier, 
                input_data=next(iter(train_loader))[0],
                device=device)


Layer (type:depth-idx)                   Output Shape              Param #
EmbeddingClassifier                      [32, 20]                  --
├─GRU: 1-1                               [32, 35, 200]             302,400
├─Flatten: 1-2                           [32, 7000]                --
├─Linear: 1-3                            [32, 512]                 3,584,512
├─ReLU: 1-4                              [32, 512]                 --
├─Linear: 1-5                            [32, 128]                 65,664
├─ReLU: 1-6                              [32, 128]                 --
├─Linear: 1-7                            [32, 20]                  2,580
Total params: 3,955,156
Trainable params: 3,955,156
Non-trainable params: 0
Total mult-adds (M): 455.58
Input size (MB): 0.45
Forward/backward pass size (MB): 1.96
Params size (MB): 15.82
Estimated Total Size (MB): 18.23

In [76]:
fix_random(seed)
embed_classifier = TrainModel(embed_classifier, loss_fn, optimizer, train_loader, val_loader, epochs, {"patience": 3, "delta": 1e-4}, "glove")

100%|██████████| 131/131 [00:02<00:00, 53.16it/s]


Valid Loss : 0.410
Train Loss : 0.449


100%|██████████| 131/131 [00:03<00:00, 36.67it/s]


Valid Loss : 0.407
Train Loss : 0.417


100%|██████████| 131/131 [00:02<00:00, 52.59it/s]


Valid Loss : 0.397
Train Loss : 0.407


100%|██████████| 131/131 [00:02<00:00, 53.84it/s]


Valid Loss : 0.388
Train Loss : 0.392


100%|██████████| 131/131 [00:02<00:00, 54.64it/s]


Valid Loss : 0.380
Train Loss : 0.380


100%|██████████| 131/131 [00:03<00:00, 36.77it/s]


Valid Loss : 0.375
Train Loss : 0.371


100%|██████████| 131/131 [00:02<00:00, 53.63it/s]


Valid Loss : 0.368
Train Loss : 0.363


100%|██████████| 131/131 [00:02<00:00, 53.87it/s]


Valid Loss : 0.365
Train Loss : 0.357


100%|██████████| 131/131 [00:02<00:00, 54.94it/s]


Valid Loss : 0.364
Train Loss : 0.351


100%|██████████| 131/131 [00:03<00:00, 37.02it/s]


Valid Loss : 0.363
Train Loss : 0.347


100%|██████████| 131/131 [00:02<00:00, 53.85it/s]


Valid Loss : 0.363
Train Loss : 0.342


100%|██████████| 131/131 [00:02<00:00, 53.87it/s]


Valid Loss : 0.364
Train Loss : 0.337


100%|██████████| 131/131 [00:02<00:00, 53.99it/s]


Valid Loss : 0.364
Train Loss : 0.333


In [78]:
print_report(embed_classifier, val_loader, whole_dataset["val"]["labels"], 0.25)

----- MACRO AVG. -----
  F1-score:	0.3213
  Precision:	0.3282
  Recall:	0.4286
  Accuracy:	0.7758
----- PER-CLASS VALUES -----
  				F1-score	Precision	Recall		Accuracy	Support
  Self-direction: thought   	0.2723		0.2214		0.3537		0.8726		82
  Self-direction: action    	0.4715		0.375		0.6348		0.6574		293
  Stimulation               	0.0		0.0		0.0		0.9647		43
  Hedonism                  	0.0		0.0		0.0		0.9737		32
  Achievement               	0.6043		0.4922		0.7824		0.6943		363
  Power: dominance          	0.287		0.3333		0.252		0.8694		127
  Power: resources          	0.3388		0.6966		0.2238		0.8012		277
  Face                      	0.0412		0.25		0.0225		0.9236		89
  Security: personal        	0.6253		0.4852		0.879		0.5637		504
  Security: societal        	0.6985		0.5984		0.8389		0.7305		453
  Tradition                 	0.3084		0.2598		0.3793		0.8784		87
  Conformity: rules         	0.4344		0.3137		0.7061		0.5785		279
  Conformity: interpersonal 	0.0		0.0		0.0		0.9515		56
  Humility       

## BERT + LSTM model (transfer learning)
The following model is proposed to enhance the GloVe model through the following changes:
- Usage of contextual, frozen BERT encodings instead of GloVe embeddings
(the changes were made in the collate function).
- Substitution of the GRU layers with LSTM layers, which are more complex. 
- Meaningful initialization of the LSTM hidden and cell states using
the pooler output of the BERT encoding of an argument, passed to two different dense layers (ideally the pooler output represents the encoding of the \[CLS\] token, which is at the beginning of every argument and contains general information about the semantics of the whole sentence).
- Classification based on the concatenation of the output cell states of the LSTM layers, rather than on the encoding of the whole sentence (concatenation of the hidden states).
This reduces the number of required neurons and processes a tensor that retains the most important semantic information about the sentence.
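To make the reduction concrete, the input size and parameter count of the first dense layer can be compared for the two models, using the sizes defined in the notebook code. The helper `linear_params` is introduced here only for illustration.

```python
def linear_params(in_features, out_features):
    """Number of weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

# GloVe model: the whole Bi-GRU output sequence is flattened.
max_words_emb, embed_len = 35, 100
glove_in = max_words_emb * embed_len * 2          # 2 directions
glove_first_dense = linear_params(glove_in, 512)

# BERT + LSTM model: only the final cell states are concatenated.
lstm_hs, lstm_layers = 128, 2
bert_in = lstm_hs * 2 * lstm_layers               # 2 directions x 2 layers
bert_first_dense = linear_params(bert_in, lstm_hs)

print(glove_in, glove_first_dense)  # 7000 3584512
print(bert_in, bert_first_dense)    # 512 65664
```

The first figure matches the 3,584,512 parameters reported for the first linear layer in the summary of the GloVe model.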

### Selection of the BERT model
For this task we used the bert-base-uncased model and tokenizer. 
We also tried different variations of BERT (ELECTRA, ALBERT, Funnel Transformer, ...), but we did not obtain remarkable improvements.

In [31]:
# import of the BERT tokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_model.to(device)
print("Bert loaded")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Bert loaded


The maximum length chosen for the BERT encodings is 70 tokens, because
it is slightly above the sum of the 90th percentiles of the lengths of premises, stances and conclusions.
It is larger than the maximum number of words for the GloVe model,
since BERT-encoded vectors are much more dense.
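One way to derive such a length budget is to sum a chosen percentile of the token counts of each field. The sketch below uses a nearest-rank percentile on hypothetical toy lengths. They are not the statistics of the real dataset, and the helper name `percentile_nearest_rank` is an assumption.

```python
def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data <= it."""
    ordered = sorted(values)
    rank = -(-p * len(ordered) // 100)  # integer ceiling of p * n / 100
    return ordered[max(rank, 1) - 1]

# Hypothetical token counts for ten arguments (illustration only).
premise_lengths = [12, 18, 20, 22, 25, 27, 30, 33, 41, 58]
stance_lengths = [1, 1, 1, 1, 3, 3, 3, 3, 3, 3]
conclusion_lengths = [4, 5, 5, 6, 6, 7, 7, 8, 9, 11]

budget = sum(
    percentile_nearest_rank(lengths, 90)
    for lengths in (premise_lengths, stance_lengths, conclusion_lengths)
)
print(budget)  # 53
```

The final maximum length would then be picked slightly above this sum, leaving room for the special tokens added by the tokenizer.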

In [35]:
max_words_bert = 70
# collate function that uses the tokenizer relative to the bert pretrained model
def bert_vectorize_batch(batch):
  """
  Collate function to preprocess the data for the BERT-based models.
  In particular it joins premises, stances and conclusions in the same string,
  separated by the [SEP] token.
  Using the appropriate tokenizer, arguments are tokenized, truncated and padded.
  Target labels are already one-hot encoded.

  Params:
    batch: batch of data to be preprocessed
  Returns:
    X_tensor: a torch tensor stacking the input_ids, the token_type_ids and the
              attention_mask of each example of the batch, which has shape:
              (3, batch_size, max_words_bert)
    Y_tensor: a tensor containing labels, which has shape
              (batch_size, num_classes)
  """
  X = [elem["Premise"] + " [SEP] " + elem["Stance"] + " [SEP] " + elem["Conclusion"] for elem in batch]
  Y = [elem["labels"] for elem in batch]
  X = bert_tokenizer(X, padding="max_length", truncation="longest_first", return_tensors = "pt", max_length = max_words_bert) 
  Y_tensor = torch.zeros(len(batch), Y[0].shape[0])
  for i, tokens in enumerate(Y):    
      Y_tensor[i] = Y[i]
  X_tensor = torch.stack([X["input_ids"], X["token_type_ids"], X["attention_mask"]])

  return X_tensor, Y_tensor

train_dataset = whole_dataset["train"]
val_dataset = whole_dataset["val"] 
test_dataset = whole_dataset["test"] 

In [80]:
# Simple model to perform some tests with pytorch
class BertLSTM(nn.Module):
  """
  Class implementing the BERT + LSTM model.
  Remark: max_words_bert and num_classes are parameters
          used to create this architecture which are set outside the class.
  """
  def __init__(self, bert_model):
    # the single parameter of this init function is the bert model 
    # that has to be used for transfer learning
    super(BertLSTM, self).__init__() 
    self.lstm_layers = 2
    self.lstm_hs = 128 # hidden size of the lstm
    bert_hidden_size = bert_model.config.hidden_size

    # freezing the parameters for the BERT model
    self.bert_model = bert_model
    for param in self.bert_model.parameters():
        param.requires_grad = False

    self.lstm = nn.LSTM(input_size=bert_hidden_size,
                        hidden_size=self.lstm_hs,
                        num_layers=self.lstm_layers ,
                        batch_first=True,
                        bidirectional=True)
    self.reducer_c0 = nn.Linear(bert_hidden_size, self.lstm_hs)
    self.reducer_h0 = nn.Linear(bert_hidden_size, self.lstm_hs)
    self.linear_1 = nn.Linear(self.lstm_hs*2*self.lstm_layers, self.lstm_hs)
    self.relu = nn.ReLU()
    self.linear_2 = nn.Linear(self.lstm_hs, num_classes) # since the dimensions
    # are already small there is no need for a third linear layer

  def forward(self, X_batch):
    # Remark: the LSTM does not receive the encoding of the [CLS] token as input,
    # since the pooled [CLS] representation is used to initialize the hidden and cell states.
    out = self.bert_model(input_ids=X_batch[0], token_type_ids = X_batch[1], attention_mask = X_batch[2])
    cell = self.reducer_c0(out.pooler_output)
    hidden = self.reducer_h0(out.pooler_output)
    out = out.last_hidden_state[:,1:,:]
    # one initial state per (layer, direction) pair: 2 layers x 2 directions = 4
    c0 = torch.stack([cell, cell, cell, cell])
    h0 = torch.stack([hidden, hidden, hidden, hidden])
    out_lstm, hc_n  = self.lstm(out, (h0, c0))
    c_n = hc_n[1].permute(1, 0, 2) # permutation in order to obtain the batch 
                                   # size dimension first
    out = torch.cat([c_n[:,0,:], c_n[:,1,:]], 1) # concatenation of the cell states
    out2 = torch.cat([c_n[:,2,:], c_n[:,3,:]], 1)
    out = torch.cat([out, out2], 1)
    out = self.linear_1(out)
    out = self.relu(out)
    out = self.linear_2(out)
    return out

### Training information
Training ran for at most 50 epochs with early stopping (patience of 3 epochs, delta of 1e-4). Batch size, learning rate and layer sizes were tuned by hand according to the results on the validation set.
In particular, the following pools were considered:
- batch_size in \{16, 32, 64\}
- learning rate in \{1e-2, 1e-3, 1e-4\}
- hidden size of the LSTM layers in \{128, 256, 512\}
- neurons of the linear layers in \{256, 128\}
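As a minimal sketch of how the early-stopping rule behaves, the snippet below replays the validation losses printed in the training log of this section. `TrainModel` is defined outside this section, so the rule here is an assumption: it is taken to match the one in `finetune_bert` (an epoch counts as an improvement only if it lowers the best validation loss by at least `delta`, and training stops once the number of consecutive non-improving epochs exceeds `patience`). The helper name `early_stopping_epoch` is hypothetical.

```python
import math

def early_stopping_epoch(val_losses, patience=3, delta=1e-4):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Hypothetical helper: assumes the same rule as finetune_bert, i.e. the
    patience counter grows whenever the best loss improves by less than delta.
    """
    best_loss = math.inf
    best_epoch = 0
    stale_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best_loss - loss < delta:
            stale_epochs += 1          # no meaningful improvement
        else:
            stale_epochs = 0           # improvement: reset and remember it
            best_loss = loss
            best_epoch = epoch
        if stale_epochs > patience:
            return epoch, best_epoch   # stop and roll back to best_epoch
    return len(val_losses), best_epoch

# validation losses copied from the BERT + LSTM training log
val_losses = [0.387, 0.368, 0.363, 0.364, 0.368, 0.384, 0.398]
stop_epoch, best_epoch = early_stopping_epoch(val_losses)
print(stop_epoch, best_epoch)
```

Under these assumptions the run stops at epoch 7 and keeps the weights from epoch 3, which is consistent with the seven epochs shown in the log.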

In [37]:
batch_size = 32
epochs = 50
learning_rate = 1e-3

loss_fn = nn.BCEWithLogitsLoss()
prebert_classifier = BertLSTM(bert_model)
optimizer = Adam(prebert_classifier.parameters(), lr=learning_rate)

bert_train_loader = DataLoader(whole_dataset["train"], batch_size=batch_size, collate_fn=lambda x:tuple(y.to(device) for y in bert_vectorize_batch(x)))
bert_val_loader  = DataLoader(whole_dataset["val"], batch_size=batch_size, collate_fn=lambda x:tuple(y.to(device) for y in bert_vectorize_batch(x)))
bert_test_loader  = DataLoader(whole_dataset["test"], batch_size=batch_size, collate_fn=lambda x:tuple(y.to(device) for y in bert_vectorize_batch(x)))

prebert_classifier.to(device)
summary(prebert_classifier, 
                input_data=next(iter(bert_train_loader))[0],
                device=device)

Layer (type:depth-idx)                                  Output Shape              Param #
BertLSTM                                                [32, 20]                  --
├─BertModel: 1-1                                        [32, 768]                 --
│    └─BertEmbeddings: 2-1                              [32, 70, 768]             --
│    │    └─Embedding: 3-1                              [32, 70, 768]             (23,440,896)
│    │    └─Embedding: 3-2                              [32, 70, 768]             (1,536)
│    │    └─Embedding: 3-3                              [1, 70, 768]              (393,216)
│    │    └─LayerNorm: 3-4                              [32, 70, 768]             (1,536)
│    │    └─Dropout: 3-5                                [32, 70, 768]             --
│    └─BertEncoder: 2-2                                 [32, 70, 768]             --
│    │    └─ModuleList: 3-6                             --                        (85,054,464)
│    └─BertPooler: 2-3 

In [38]:
fix_random(seed)
prebert_classifier = TrainModel(prebert_classifier, loss_fn, optimizer, bert_train_loader, bert_val_loader, epochs, {"patience": 3, "delta": 1e-4}, "bertencoder")

100%|██████████| 131/131 [00:23<00:00,  5.55it/s]


Valid Loss : 0.387
Train Loss : 0.398


100%|██████████| 131/131 [00:24<00:00,  5.44it/s]


Valid Loss : 0.368
Train Loss : 0.348


100%|██████████| 131/131 [00:22<00:00,  5.75it/s]


Valid Loss : 0.363
Train Loss : 0.327


100%|██████████| 131/131 [00:23<00:00,  5.61it/s]


Valid Loss : 0.364
Train Loss : 0.310


100%|██████████| 131/131 [00:23<00:00,  5.68it/s]


Valid Loss : 0.368
Train Loss : 0.295


100%|██████████| 131/131 [00:22<00:00,  5.75it/s]


Valid Loss : 0.384
Train Loss : 0.282


100%|██████████| 131/131 [00:23<00:00,  5.55it/s]


Valid Loss : 0.398
Train Loss : 0.268


In [81]:
print_report(prebert_classifier, bert_val_loader, whole_dataset["val"]["labels"], 0.25)

----- MACRO AVG. -----
  F1-score:	0.3571
  Precision:	0.3829
  Recall:	0.441
  Accuracy:	0.7911
----- PER-CLASS VALUES -----
  				F1-score	Precision	Recall		Accuracy	Support
  Self-direction: thought   	0.2727		0.2174		0.3659		0.8685		82
  Self-direction: action    	0.4597		0.4026		0.5358		0.6968		293
  Stimulation               	0.0769		0.2222		0.0465		0.9606		43
  Hedonism                  	0.1667		0.75		0.0938		0.9753		32
  Achievement               	0.6268		0.5305		0.7658		0.728		363
  Power: dominance          	0.1765		0.3488		0.1181		0.885		127
  Power: resources          	0.5492		0.5777		0.5235		0.8044		277
  Face                      	0.1905		0.18		0.2022		0.8743		89
  Security: personal        	0.6835		0.556		0.8869		0.6598		504
  Security: societal        	0.6795		0.6604		0.6998		0.7543		453
  Tradition                 	0.2635		0.275		0.2529		0.8989		87
  Conformity: rules         	0.4282		0.2796		0.914		0.4404		279
  Conformity: interpersonal 	0.0		0.0		0.0		0.9532		56
  H

## BERT fine-tuning
The following model is a version of BERT fine-tuned on the reference dataset, and it is proposed as an alternative to the previous models.
A similar model was proposed by the authors of the dataset, so it can serve as a reference point, although the datasets are not exactly equal.

The architecture fine-tunes BERT and adds two linear layers, which process the final hidden state of the [CLS] token and reduce its dimension to num_classes.
It is a common architecture that employs BERT for multi-label text classification.

In [92]:
# Simple model to perform some tests with pytorch
class FineTunedBert(nn.Module):
  """
  Class implementing the model that allows to fine-tune BERT for this task.
  Remark: max_words_bert and num_classes are parameters
          used to create this architecture which are set outside the class.
  """
  def __init__(self, bert_model):
    # the single parameter of this init function is the bert model 
    # that has to be used for fine-tuning
    super(FineTunedBert, self).__init__() 
    self.bert_model = bert_model
    for param in self.bert_model.parameters():
        param.requires_grad = True
    bert_hidden_size = bert_model.config.hidden_size
    self.linear_1 = nn.Linear(bert_hidden_size, bert_hidden_size//2)
    self.relu = nn.ReLU()
    self.linear_2 = nn.Linear(bert_hidden_size//2, num_classes)

  def forward(self, X_batch):
    out = self.bert_model(input_ids=X_batch[0], 
                          token_type_ids = X_batch[1],
                          attention_mask = X_batch[2])

    out = out.last_hidden_state[:,0,:]
    out = self.linear_1(out)
    out = self.relu(out)
    out = self.linear_2(out)
    return out

# Training function
def finetune_bert(model, loss_fn, optimizer, train_loader, val_loader, epochs, early_stopping_info, model_name, scheduler):
  """
  Training function for the fine-tuning of BERT. If early_stopping_info is set
  to None, early stopping is not performed.
  Params:
    model: the model that has to be fine-tuned
    loss_fn: the loss function to adopt in order to perform the training
    optimizer: the optimizer to be used for training
    train_loader: the dataloader for the training dataset
    val_loader: the dataloader for the validation dataset
    epochs: the number of epochs for training
    early_stopping_info: dictionary containing the parameters for the early stopping:
                          - delta: min acceptable improvement in the validation loss
                          - patience: number of epochs to wait for improvement
    model_name: string containing the name of the model (used in order to save
                the weights)
    scheduler: the learning rate scheduler to be applied (it is stepped once per batch)
  Returns:
    model: the trained model
  """
  patience_acc = 0
  precedent_loss = np.Inf
  for i in range(1, epochs+1):
      # re-enable training mode in case the validation step switched to eval mode
      model.train()
      losses = []
      for X, Y in tqdm(train_loader):
          model.zero_grad()
          Y_preds = model(X)
          loss = loss_fn(Y_preds, Y)
          losses.append(loss.item())

          loss.backward()
          torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
          optimizer.step()
          scheduler.step()

      loss = compute_validation_loss(model, loss_fn, val_loader)
      print("Train Loss : {:.3f}".format(torch.tensor(losses).mean()))
      if early_stopping_info is not None:
        if precedent_loss - loss < early_stopping_info["delta"]:
            patience_acc = patience_acc + 1
        else:
          patience_acc = 0
          precedent_loss = loss
          torch.save(model, model_name + "_best.pth")

        if patience_acc > early_stopping_info["patience"]:
          # reload the checkpoint saved above (same "_best.pth" suffix)
          return torch.load(model_name + "_best.pth")
  return model

In [105]:
bert_model_unfrozen = BertModel.from_pretrained('bert-base-uncased')
bert_model_unfrozen.to(device)
print("reloaded")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


reloaded


### Training information
The training recipe and the model were adapted from the following tutorial:<br>
https://skimai.com/fine-tuning-bert-for-sentiment-analysis/
<br>
As suggested in the tutorial, the AdamW optimizer was used, gradients were clipped to a maximum norm of 1 and hyperparameters were drawn from the following pools:
- batch_size in \{16, 32\}
- learning rate in \{5e-5, 3e-5, 2e-5\}
- number of epochs in \{2, 3, 4\}, but 5 and 6 were also tested
- the number of neurons of the linear layers is obtained by progressively halving the output size of the BERT model
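The training cell of this section pairs AdamW with `get_linear_schedule_with_warmup` (zero warmup steps), and `finetune_bert` calls `scheduler.step()` after every batch. As a rough illustration of the resulting learning-rate curve, the pure-Python sketch below reimplements the linear rule. `linear_schedule_lr` is a hypothetical helper, not part of the `transformers` library, and the 261 steps per epoch are read from the progress bar in the training output.

```python
def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` optimizer updates (hypothetical helper).

    Linear warmup from 0 to base_lr, then linear decay from base_lr to 0.
    """
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)

steps_per_epoch = 261   # batches per epoch with batch_size = 16
epochs = 3
total_steps = steps_per_epoch * epochs

# learning rate at the start, middle and end of training
for step in (0, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, 5e-5, total_steps))
```

With no warmup, the learning rate starts at its full value of 5e-5 and reaches zero at the last update, so later epochs make progressively smaller changes to the pretrained weights.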

In [113]:
batch_size = 16
epochs = 3
learning_rate = 5e-5

loss_fn = nn.BCEWithLogitsLoss()

finetune_classifier = FineTunedBert(bert_model_unfrozen)
optimizer = AdamW(finetune_classifier.parameters(), lr=learning_rate, eps=1e-8)


bert_train_loader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=lambda x:tuple(y.to(device) for y in bert_vectorize_batch(x)))
bert_val_loader  = DataLoader(val_dataset, batch_size=batch_size, collate_fn=lambda x:tuple(y.to(device) for y in bert_vectorize_batch(x)))
bert_test_loader  = DataLoader(test_dataset, batch_size=batch_size, collate_fn=lambda x:tuple(y.to(device) for y in bert_vectorize_batch(x)))

scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0,
                                            num_training_steps=len(bert_train_loader)*epochs)

finetune_classifier.to(device)
summary(finetune_classifier, input_data=next(iter(bert_train_loader))[0], device=device, dtypes = [torch.int]*3)

Layer (type:depth-idx)                                  Output Shape              Param #
FineTunedBert                                           [16, 20]                  --
├─BertModel: 1-1                                        [16, 768]                 --
│    └─BertEmbeddings: 2-1                              [16, 70, 768]             --
│    │    └─Embedding: 3-1                              [16, 70, 768]             23,440,896
│    │    └─Embedding: 3-2                              [16, 70, 768]             1,536
│    │    └─Embedding: 3-3                              [1, 70, 768]              393,216
│    │    └─LayerNorm: 3-4                              [16, 70, 768]             1,536
│    │    └─Dropout: 3-5                                [16, 70, 768]             --
│    └─BertEncoder: 2-2                                 [16, 70, 768]             --
│    │    └─ModuleList: 3-6                             --                        85,054,464
│    └─BertPooler: 2-3           

In [None]:
fix_random(seed)
finetune_classifier = finetune_bert(finetune_classifier, 
                                   loss_fn, optimizer,
                                   bert_train_loader,
                                   bert_val_loader,
                                   epochs,
                                   None, 
                                   "finebert", scheduler)

100%|██████████| 261/261 [01:04<00:00,  4.07it/s]


In [112]:
print("FINETUNED BERT:")
print_report(finetune_classifier, bert_val_loader, whole_dataset["val"]["labels"], 0.25)

FINETUNED BERT:
----- MACRO AVG. -----
  F1-score:	0.4132
  Precision:	0.4661
  Recall:	0.4302
  Accuracy:	0.8421
----- PER-CLASS VALUES -----
  				F1-score	Precision	Recall		Accuracy	Support
  Self-direction: thought   	0.3278		0.2258		0.5976		0.8348		82
  Self-direction: action    	0.5423		0.5014		0.5904		0.7601		293
  Stimulation               	0.1455		0.3333		0.093		0.9614		43
  Hedonism                  	0.1538		0.4286		0.0938		0.9729		32
  Achievement               	0.629		0.5406		0.7521		0.7354		363
  Power: dominance          	0.2513		0.3472		0.1969		0.8776		127
  Power: resources          	0.6296		0.6074		0.6534		0.825		277
  Face                      	0.1154		0.4		0.0674		0.9244		89
  Security: personal        	0.7015		0.6248		0.7996		0.7182		504
  Security: societal        	0.7069		0.7031		0.7108		0.7806		453
  Tradition                 	0.3129		0.3833		0.2644		0.917		87
  Conformity: rules         	0.5375		0.4925		0.5914		0.7666		279
  Conformity: interpersonal 	0.0656		0.

## Evaluation of the models
The models are evaluated both on our test split (the original validation split) and
on the Chinese dataset split.
The reference evaluation score is the macro F1, but we also provide macro-averaged precision, recall and accuracy.
We also tuned the threshold used to obtain hard labels. The default would be 0.5, but we lowered it to 0.25, trading a bit of precision for higher recall and thus a better F1 score.
Since the threshold is a hyperparameter of the model, tuning it is not a way to artificially increase F1-scores.
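To make the precision/recall trade-off concrete, here is a small self-contained sketch on toy data (the logits and labels below are invented for illustration and are not taken from the dataset). `macro_scores` is a hypothetical helper: `print_report` is defined elsewhere in the notebook, so the `>=` comparison and the convention of scoring 0 when a denominator is zero are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def macro_scores(logits, targets, threshold):
    """Macro-averaged (precision, recall, F1) for multi-label predictions.

    logits and targets hold one row per example and one column per class;
    a class is predicted when sigmoid(logit) >= threshold (assumed rule).
    """
    num_classes = len(targets[0])
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = fp = fn = 0
        for row_logits, row_targets in zip(logits, targets):
            pred = sigmoid(row_logits[c]) >= threshold
            true = row_targets[c] == 1
            tp += int(pred and true)
            fp += int(pred and not true)
            fn += int((not pred) and true)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    n = float(num_classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# toy batch: 4 examples, 2 classes
logits = [[2.0, -2.0], [-0.5, 0.3], [-0.8, -0.9], [-3.0, -0.7]]
targets = [[1, 0], [1, 1], [0, 1], [0, 0]]

for threshold in (0.5, 0.25):
    p, r, f1 = macro_scores(logits, targets, threshold)
    print(f"threshold={threshold}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

On this toy batch, lowering the threshold from 0.5 to 0.25 drops macro precision from 1.0 to about 0.667, raises macro recall from 0.5 to 1.0, and lifts macro F1 from about 0.667 to 0.8, which mirrors the trade-off described above.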

In [109]:
print_report(embed_classifier, test_loader, whole_dataset["test"]["labels"], 0.25)

----- MACRO AVG. -----
  F1-score:	0.3139
  Precision:	0.2954
  Recall:	0.3992
  Accuracy:	0.7764
----- PER-CLASS VALUES -----
  				F1-score	Precision	Recall		Accuracy	Support
  Self-direction: thought   	0.3063		0.3398		0.2789		0.8328		251
  Self-direction: action    	0.4683		0.3797		0.6109		0.6371		496
  Stimulation               	0.0		0.0		0.0		0.9272		138
  Hedonism                  	0.0		0.0		0.0		0.9457		103
  Achievement               	0.5427		0.4493		0.6852		0.6498		575
  Power: dominance          	0.2443		0.3265		0.1951		0.8956		164
  Power: resources          	0.3148		0.4048		0.2576		0.9219		132
  Face                      	0.0292		0.2857		0.0154		0.9299		130
  Security: personal        	0.6615		0.5238		0.8972		0.6324		759
  Security: societal        	0.5528		0.4514		0.7131		0.7031		488
  Tradition                 	0.343		0.3869		0.3081		0.8929		172
  Conformity: rules         	0.4454		0.3298		0.6857		0.5902		455
  Conformity: interpersonal 	0.0		0.0		0.0		0.9662		60
  Humil

In [110]:
print_report(prebert_classifier, bert_test_loader, whole_dataset["test"]["labels"], 0.25)

----- MACRO AVG. -----
  F1-score:	0.3739
  Precision:	0.3769
  Recall:	0.4412
  Accuracy:	0.7907
----- PER-CLASS VALUES -----
  				F1-score	Precision	Recall		Accuracy	Support
  Self-direction: thought   	0.4085		0.3686		0.4582		0.8244		251
  Self-direction: action    	0.4343		0.4352		0.4335		0.7046		496
  Stimulation               	0.198		0.3125		0.1449		0.9146		138
  Hedonism                  	0.3467		0.5532		0.2524		0.9483		103
  Achievement               	0.5982		0.4954		0.7548		0.6925		575
  Power: dominance          	0.201		0.4667		0.128		0.9119		164
  Power: resources          	0.3521		0.3481		0.3561		0.9088		132
  Face                      	0.1308		0.1308		0.1308		0.8808		130
  Security: personal        	0.7019		0.578		0.8933		0.6962		759
  Security: societal        	0.6271		0.6377		0.6168		0.8112		488
  Tradition                 	0.4057		0.3989		0.4128		0.8903		172
  Conformity: rules         	0.4513		0.2999		0.9121		0.4678		455
  Conformity: interpersonal 	0.0		0.0		0.0		0.9

In [111]:
print_report(finetune_classifier, bert_test_loader, whole_dataset["test"]["labels"], 0.25)

----- MACRO AVG. -----
  F1-score:	0.4131
  Precision:	0.4625
  Recall:	0.4345
  Accuracy:	0.8358
----- PER-CLASS VALUES -----
  				F1-score	Precision	Recall		Accuracy	Support
  Self-direction: thought   	0.4906		0.3858		0.6733		0.8149		251
  Self-direction: action    	0.5327		0.4965		0.5746		0.7363		496
  Stimulation               	0.2688		0.5208		0.1812		0.9283		138
  Hedonism                  	0.2857		0.5405		0.1942		0.9473		103
  Achievement               	0.6203		0.5317		0.7443		0.7236		575
  Power: dominance          	0.2902		0.4066		0.2256		0.9045		164
  Power: resources          	0.4431		0.3731		0.5455		0.9045		132
  Face                      	0.0795		0.2857		0.0462		0.9267		130
  Security: personal        	0.7376		0.6493		0.8538		0.7569		759
  Security: societal        	0.6181		0.5737		0.6701		0.7869		488
  Tradition                 	0.4203		0.5041		0.3605		0.9098		172
  Conformity: rules         	0.5261		0.5433		0.5099		0.7795		455
  Conformity: interpersonal 	0.0923		0.6		0

In [108]:
chn_loader = DataLoader(whole_dataset["test_chn"], batch_size=32, collate_fn=lambda x:tuple(y.to(device) for y in vectorize_batch(x)))
bert_chn_loader = DataLoader(whole_dataset["test_chn"], batch_size=32, collate_fn=lambda x:tuple(y.to(device) for y in bert_vectorize_batch(x)))

In [None]:
print_report(embed_classifier, chn_loader, whole_dataset["test_chn"]["labels"], 0.25)

----- MACRO AVG. -----
  F1-score:	0.1647
  Precision:	0.2646
  Recall:	0.1446
  Accuracy:	0.8745
----- PER-CLASS VALUES -----
  				F1-score	Precision	Recall		Accuracy	Support
  Self-direction: thought   	0.0		0.0		0.0		0.88		6
  Self-direction: action    	0.2667		0.5		0.1818		0.89		11
  Stimulation               	0.0		0.0		0.0		1.0		0
  Hedonism                  	0.0		0.0		0.0		0.98		2
  Achievement               	0.481		0.475		0.4872		0.59		39
  Power: dominance          	0.0		0.0		0.0		0.99		1
  Power: resources          	0.3529		0.4		0.3158		0.78		19
  Face                      	0.0		0.0		0.0		0.99		1
  Security: personal        	0.4762		0.4545		0.5		0.67		30
  Security: societal        	0.3462		0.4286		0.2903		0.66		31
  Tradition                 	0.0		0.0		0.0		1.0		0
  Conformity: rules         	0.125		1.0		0.0667		0.86		15
  Conformity: interpersonal 	0.0		0.0		0.0		0.99		1
  Humility                  	0.0		0.0		0.0		0.95		5
  Benevolence: caring       	0.1429		0.5		0.0833		0.

In [None]:
print_report(prebert_classifier, bert_chn_loader, whole_dataset["test_chn"]["labels"], 0.25)

----- MACRO AVG. -----
  F1-score:	0.2551
  Precision:	0.2188
  Recall:	0.3798
  Accuracy:	0.8035
----- PER-CLASS VALUES -----
  				F1-score	Precision	Recall		Accuracy	Support
  Self-direction: thought   	0.5556		0.4167		0.8333		0.92		6
  Self-direction: action    	0.2979		0.1944		0.6364		0.67		11
  Stimulation               	0.0		0.0		0.0		1.0		0
  Hedonism                  	0.0		0.0		0.0		0.98		2
  Achievement               	0.6		0.4444		0.9231		0.52		39
  Power: dominance          	0.0		0.0		0.0		0.9		1
  Power: resources          	0.375		0.2459		0.7895		0.5		19
  Face                      	0.0		0.0		0.0		0.99		1
  Security: personal        	0.495		0.3521		0.8333		0.49		30
  Security: societal        	0.4719		0.3621		0.6774		0.53		31
  Tradition                 	0.0		0.0		0.0		1.0		0
  Conformity: rules         	0.3448		0.3571		0.3333		0.81		15
  Conformity: interpersonal 	0.0		0.0		0.0		0.99		1
  Humility                  	0.0		0.0		0.0		0.95		5
  Benevolence: caring       	0.4167

In [None]:
print_report(finetune_classifier, bert_chn_loader, whole_dataset["test_chn"]["labels"], 0.25)

----- MACRO AVG. -----
  F1-score:	0.2712
  Precision:	0.2211
  Recall:	0.3944
  Accuracy:	0.836
----- PER-CLASS VALUES -----
  				F1-score	Precision	Recall		Accuracy	Support
  Self-direction: thought   	0.375		0.2308		1.0		0.8		6
  Self-direction: action    	0.36		0.2308		0.8182		0.68		11
  Stimulation               	0.0		0.0		0.0		0.98		0
  Hedonism                  	0.0		0.0		0.0		0.98		2
  Achievement               	0.6598		0.5517		0.8205		0.67		39
  Power: dominance          	0.0		0.0		0.0		0.99		1
  Power: resources          	0.4314		0.3438		0.5789		0.71		19
  Face                      	0.0		0.0		0.0		0.98		1
  Security: personal        	0.5412		0.4182		0.7667		0.61		30
  Security: societal        	0.4444		0.3902		0.5161		0.6		31
  Tradition                 	0.0		0.0		0.0		1.0		0
  Conformity: rules         	0.4324		0.3636		0.5333		0.79		15
  Conformity: interpersonal 	0.0		0.0		0.0		0.99		1
  Humility                  	0.0		0.0		0.0		0.95		5
  Benevolence: caring       	0.3158	