# NLP course project
**Summary**: Application of text classification approaches for Human Value Detection <br>
**Members**:
- Dell'Olio Domenico
- Delvecchio Giovanni Pio
- Disabato Raffaele  


The project was developed in order to create and test various models to address the task of Human Value Detection proposed in the challenge: <br>
https://touche.webis.de/semeval23/touche23-web/index.html <br>

The challenge can be tackled as a multi-label text clasification problem, thus we decided to implement and test various architectures in order to compare their performances. <br>
These architectures were either already present at the state of the art or were obtained as a result of experiments.

## This notebook contains the following implementations:
- GloVe baseline with two layers of Bi-GRU, followed by flatten and two dense layers with ReLU activation and a single dense layer with no activation;
- BERT baseline with two layers of Bi-LSTM (transfer learning), where the output cell states are concatenated and passed to a dense layer with ReLU activation and a single dense layer with no activation;
- finetuning of BERT followed by a dense layer with ReLU activation followed by a dense layer with no activation.

## This notebook does **not** contain:
- exstensive Data analysis (it is explored in the other notebook)

In [1]:
# installation of the required libraries
!pip install transformers
!pip install datasets
!pip install torchinfo

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.3/6.3 MB[0m [31m30.4 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m60.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.3/190.3 KB[0m [31m6.7 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.0 tokenizers-0.13.2 transformers-4.26.0
Looking in indexes: https://pypi.org/simple, http

In [2]:
# Cell for the download of the datasets
!wget https://zenodo.org/record/7550385/files/arguments-training.tsv
!wget https://zenodo.org/record/7550385/files/labels-training.tsv
!wget https://zenodo.org/record/7550385/files/arguments-validation.tsv
!wget https://zenodo.org/record/7550385/files/labels-validation.tsv
!wget https://zenodo.org/record/7550385/files/arguments-test.tsv
!wget https://zenodo.org/record/7550385/files/arguments-validation-zhihu.tsv
!wget https://zenodo.org/record/7550385/files/labels-validation-zhihu.tsv

--2023-02-09 16:03:10--  https://zenodo.org/record/7550385/files/arguments-training.tsv
Resolving zenodo.org (zenodo.org)... 188.185.124.72
Connecting to zenodo.org (zenodo.org)|188.185.124.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1012498 (989K) [application/octet-stream]
Saving to: ‘arguments-training.tsv’


2023-02-09 16:03:20 (147 KB/s) - ‘arguments-training.tsv’ saved [1012498/1012498]

--2023-02-09 16:03:20--  https://zenodo.org/record/7550385/files/labels-training.tsv
Resolving zenodo.org (zenodo.org)... 188.185.124.72
Connecting to zenodo.org (zenodo.org)|188.185.124.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 253843 (248K) [application/octet-stream]
Saving to: ‘labels-training.tsv’


2023-02-09 16:03:23 (318 KB/s) - ‘labels-training.tsv’ saved [253843/253843]

--2023-02-09 16:03:23--  https://zenodo.org/record/7550385/files/arguments-validation.tsv
Resolving zenodo.org (zenodo.org)... 188.185.124.72
Connecting

In [3]:
# imports for dataset loading
import numpy as np
import random
import pandas as pd
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# torch imports
import torch
import torchtext
from torchtext.data import get_tokenizer
from torchtext.vocab import GloVe
from torch.utils.data import DataLoader
from torchtext.data.functional import to_map_style_dataset
from torch import nn
from torch.nn import functional as F
from torch.optim import Adam
from torchinfo import summary
from torch.optim import AdamW

#huggingface imports
from transformers import BertTokenizer, BertModel, get_linear_schedule_with_warmup

# progress bar
from tqdm import tqdm
# garbage collector
import gc

# imports for evaluation
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

In [4]:
def fix_random(seed: int) -> None:
  """Fix all the possible sources of randomness.

  Params:
    seed: the seed to use. 
  """
  np.random.seed(seed)
  random.seed(seed)
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)

  torch.backends.cudnn.benchmark = False
  torch.backends.cudnn.deterministic = True

In [5]:
# Cell needed to fix the seeds and define the available device
# for the training of the models
seed = 10
fix_random(seed)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


In [6]:
def huggingface_from_pandas(pandas_df):
  """
  Function converting a pandas dataframe to a huggingface dataset.
  It also returns an ordered list containing the target labels

  Params:
    pandas_df: the dataset that has to be converted
  Returns:
    hf_ds:     the huggingface dataset obrained from pandas_df
    label_cols: the ordered list of target labels of pandas_df
  """

  hf_ds = Dataset.from_pandas(pandas_df, preserve_index=False)
  hf_ds = hf_ds.remove_columns(["Argument ID", "Argument ID2"])
  # Aggregating labels in a single list
  hf_ds = hf_ds.map(lambda x:{"labels": [int(x[col]) for col in hf_ds.column_names if
                                      col not in ['Conclusion', 'Stance', 'Premise']]})
  label_cols = [col for col in hf_ds.column_names if col not in ['Conclusion', 'Stance', 'Premise', "labels"]]
  # here we are removing the columns related to the labels from the dataset
  hf_ds = hf_ds.remove_columns(label_cols)
  return hf_ds, label_cols

The challenge provides the already splitted dataset in Train, Validation and Test splits. However the Test split does not have public labels available, 
so we decided to split the Training set in (Training, Validation) 
(with proportions 80-20 on unique conclusions) and to use the validation set as Test set.  <br>
We decided to probe the robustness of our model on the Chinese validation
set too, which has a different cultural background.

In [7]:
def train_test_split_wrt_conclusions(train, ratio = 0.8):
  val = []
  unique_conc = pd.unique(train["Conclusion"])
  num_train_con = int(len(unique_conc)*ratio)
  train_unique_conc = np.random.choice(unique_conc, num_train_con, replace = False)
  val_unique_conc = set(unique_conc) - set(train_unique_conc)
  train_set_to_return = train[train.Conclusion.isin(train_unique_conc)] 
  val_set_to_return = train[train.Conclusion.isin(val_unique_conc)]
  return train_set_to_return, val_set_to_return

In [8]:
# Dataset loading and splitting
raw_training = pd.read_csv("arguments-training.tsv", encoding='utf-8', sep='\t', header=0)
raw_training_lab = pd.read_csv("labels-training.tsv", encoding='utf-8', sep='\t', header=0)
raw_test = pd.read_csv("arguments-validation.tsv", encoding='utf-8', sep='\t', header=0)
raw_test_lab = pd.read_csv("labels-validation.tsv", encoding='utf-8', sep='\t', header=0)
raw_test_chn=pd.read_csv("arguments-validation-zhihu.tsv", encoding='utf-8', sep='\t', header=0)
raw_test_chn_lab=pd.read_csv("labels-validation-zhihu.tsv", encoding='utf-8', sep='\t', header=0)

train = raw_training.join(raw_training_lab,how='inner' ,lsuffix='2') # joining labels
test = raw_test.join(raw_test_lab, how='inner', lsuffix='2') # joining labels
test_chn = raw_test_chn.join(raw_test_chn_lab, how='inner', lsuffix='2') # joining labels
fix_random(seed)
train, val = train_test_split_wrt_conclusions(train) # splitting training

train_ds, label_list = huggingface_from_pandas(train)
val_ds, _ = huggingface_from_pandas(val)
test_ds, _ = huggingface_from_pandas(test)
test_chn_ds, _ = huggingface_from_pandas(test_chn) 

print("Single example from the training dataset: ")
print(train_ds[0])
print("Full list of target labels: ")
print(label_list)
num_classes = len(label_list)
print("Total number of target labels: ")
print(num_classes)
whole_dataset = DatasetDict()
whole_dataset["train"] = train_ds.with_format("torch")
whole_dataset["val"] = val_ds.with_format("torch")
whole_dataset["test"] = test_ds.with_format("torch")
whole_dataset["test_chn"] = test_chn_ds.with_format("torch")

  0%|          | 0/4176 [00:00<?, ?ex/s]

  0%|          | 0/1217 [00:00<?, ?ex/s]

  0%|          | 0/1896 [00:00<?, ?ex/s]

  0%|          | 0/100 [00:00<?, ?ex/s]

Single example from the training dataset: 
{'Conclusion': 'We should ban human cloning', 'Stance': 'in favor of', 'Premise': 'we should ban human cloning as it will only cause huge issues when you have a bunch of the same humans running around all acting the same.', 'labels': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
Full list of target labels: 
['Self-direction: thought', 'Self-direction: action', 'Stimulation', 'Hedonism', 'Achievement', 'Power: dominance', 'Power: resources', 'Face', 'Security: personal', 'Security: societal', 'Tradition', 'Conformity: rules', 'Conformity: interpersonal', 'Humility', 'Benevolence: caring', 'Benevolence: dependability', 'Universalism: concern', 'Universalism: nature', 'Universalism: tolerance', 'Universalism: objectivity']
Total number of target labels: 
20


In [9]:
def make_predictions(model, loader):
  """
  Function needed to obtain the prediction for the target labels
  given a model and a data loader.

  Params:
    model: the model that will be used to obtain the predictions over
           the labels
    loader: the data loader needed to feed the model with the data for which
            we want to obtain label predictions
  Returns:
    Y_preds: tensor containing the predicted label for each example.
             These labels are obtained as the output of the model passed to
             a sigmoid function.
  """
  Y_preds = []
  model.eval()
  for X, Y in loader:
    with torch.no_grad():
      preds = model(X)
    Y_preds.append(preds)
  gc.collect()
  Y_preds = torch.cat(Y_preds)
  Y_preds = Y_preds.sigmoid()
  return Y_preds.detach()

def keep_above_thresh(Y_preds, thr):
  """
  Function needed to convert the results of the models to hard labels
  using a threshold.
  
  Params:
    Y_preds: scores obtained by the model which have to be converted to hard
             labels
    thr: threshold to be applied to the scores, element of (0, 1), if a score
         is greater than thr it becomes a hard label with value 1, 
         0 otherwise
  Retuns:
    Y_preds_thr: hard labels obtained by thresholding Y_preds with thr
  """
  Y_preds_thr = np.copy(Y_preds.numpy())
  max_rows = Y_preds_thr.shape[0]
  max_cols = Y_preds_thr.shape[1]
  for i in range(max_rows):
    new_row = np.array([1 if Y_preds_thr[i][j] > thr else 0 for j in range(max_cols)])
    Y_preds_thr[i] = new_row
  return Y_preds_thr

def compute_macro_score(M_true, M_pred, score_func):
  """
  Function needed to compute the macro aggregation of a scored function
  over the different classes.

  Params:
    M_true: true labels needed to compute the scores
    M_pred: predicted labels needed to compute the scores
    score_func: scoring function to be computed
  Returns:
    macro: aggregation of the result of score_func computed over all the
           labels.
    scores: list of per-label score
  """
  scores = []
  for i in range(M_true.shape[1]):
      true = M_true[:, i]
      pred = M_pred[:, i]
      if score_func == accuracy_score:
        scores.append(score_func(true, pred))
      else: 
        scores.append(score_func(true, pred, zero_division=0))
  macro = np.mean(scores)
  return macro, scores
  
def support(true, pred, zero_division):
  """
  Utility function to compute the support of the class labels,
  pred and zero_division are dummy parameters needed to have conformity
  with the sklearn functions to compute scores.

  Params: 
    true: binary true labels for a single class for each example that are needed
          to compute the support for the single class
    pred: dummy parameter
    zero_division: dummy parameter
  Returns:
    sum(true): the number of example for a single class (support)
  """
  return sum(true)

def print_report(classifier, loader, y_true, threshold, labels=label_list):
  """
  Function needed to print the classification results given a classifier,
  a dataset loader, true labels and a threshold. 
  The printed report includes macro accuracy, precision, recall and F1, as 
  well as per-class accuracy, precision, recall, F1 and support.

  Params:
    classifier: the model that has to be evaluated
    loader: data-loader needed to feed the data to the classifier to get 
            predicted labels
    y_true: true labels associated to the dataset associated to the loader
    threshold: threshold for the conversion of the scores to hard labels,
               check keep_above_thresh for further details
    labels: ordered list of target labels. Defaults to the list extracted from
            the dataset
  """

  Y_preds = make_predictions(classifier, loader)
  Y_preds_thr = keep_above_thresh(Y_preds.to('cpu'), threshold)

  f1_macro, f1 = compute_macro_score(y_true, Y_preds_thr, f1_score)
  acc_macro, acc = compute_macro_score(y_true, Y_preds_thr, accuracy_score)
  prec_macro, prec = compute_macro_score(y_true, Y_preds_thr, precision_score)
  rec_macro, rec = compute_macro_score(y_true, Y_preds_thr, recall_score)
  _, sup = compute_macro_score(y_true, Y_preds_thr, support)

  print("----- MACRO AVG. -----")
  print(f"  F1-score:\t{round(f1_macro,4)}\n\
  Precision:\t{round(prec_macro,4)}\n\
  Recall:\t{round(rec_macro,4)}\n\
  Accuracy:\t{round(acc_macro,4)}")
  print("----- PER-CLASS VALUES -----")
  print("  \t\t\t\tF1-score\tPrecision\tRecall\t\tAccuracy\tSupport")
  for i in range(len(labels)):
    print("  " + labels[i]+" "*(len(max(labels, key=len))-len(labels[i])), end="\t")
    print(f"{round(f1[i],4)}\t\t{round(prec[i],4)}\t\t{round(rec[i],4)}\t\t{round(acc[i],4)}\t\t{sup[i]}")

The first model that was developed is a GloVe 100d embedding + two Bi-GRU layers
That serves as an advanced baseline to perfom experiments for multi-label classification problems like the current one. 
It is still a baseline since it has a simple architecture, OOV are treated using <br>zero-vectors, the hidden states of the Bi-GRU layers are initialized 
as zero-vectors and most importantly the model does not work with contextual information, but only with the semantics of the words. <br>
Moreover an heavy preprocessing to the dataset is not applied except for lowercasing the arguments, tokenization and the addition of truncation and padding because the GloVe embeddings would return too many unmasked zero vectors. <br>
About padding and truncation: the maximum allowed length is 35 which is 
slightly above the sum of the mean token length value for the premises and the
conclusion. 

In [10]:
# Pretrained GloVe setup

global_vectors = GloVe(name='6B', dim=100)

# the current choice is to give an id to each word
tokenizer = get_tokenizer("basic_english")

.vector_cache/glove.6B.zip: 862MB [02:43, 5.28MB/s]                           
100%|█████████▉| 399999/400000 [00:16<00:00, 24181.28it/s]


In [18]:
max_words_emb = 35
embed_len = 100

# collate function where the Premises are tokenized and embedded in batches
def vectorize_batch(batch):
    X = [elem["Premise"] + " " + elem["Stance"] + " " +elem["Conclusion"] for elem in batch]
    Y = [elem["labels"] for elem in batch]
    X = [tokenizer(x) for x in X]
    X = [tokens+[""] * (max_words_emb-len(tokens))  if len(tokens)<max_words_emb else tokens[:max_words_emb] for tokens in X]
    X_tensor = torch.zeros(len(batch), max_words_emb, embed_len)
    Y_tensor = torch.zeros(len(batch), Y[0].shape[0])
    for i, tokens in enumerate(X):
        X_tensor[i] = global_vectors.get_vecs_by_tokens(tokens)
        Y_tensor[i] = Y[i]
    return X_tensor, Y_tensor

In [19]:
# Simple model to perform some tests with pytorch
class EmbeddingClassifier(nn.Module):
    def __init__(self):
        super(EmbeddingClassifier, self).__init__() 
       
        self.gru_layers = 1

        self.gru = nn.GRU(input_size = embed_len,
                          hidden_size = embed_len,
                          num_layers = self.gru_layers,
                          batch_first=True, 
                          bidirectional = True)
        self.flatten = nn.Flatten(start_dim=1)
        self.linear_1 = nn.Linear(max_words_emb*embed_len*self.gru_layers*2, 512)
        self.relu = nn.ReLU()
        self.linear_2 = nn.Linear(512,128)
        self.linear_3 = nn.Linear(128, num_classes)
        
                

    def forward(self, X_batch):
        h0 = torch.zeros(2*self.gru_layers,X_batch.shape[0], embed_len)
        h0 = h0.to(device)
        out, hn = self.gru(X_batch, h0)
        out = self.flatten(out)
        out = self.linear_1(out)
        out = self.relu(out)
        out = self.linear_2(out)
        out = self.relu(out)
        out = self.linear_3(out)
        return out

# Function needed to compute the validation loss and the accuracy
def CalcValLoss(model, loss_fn, val_loader):
    with torch.no_grad():
      Y_shuffled, Y_preds, losses = [],[],[]
      for X, Y in val_loader:
        preds = model(X)
        loss = loss_fn(preds, Y)
        losses.append(loss.item())
        Y_shuffled.append(Y)
        Y_preds.append(preds.argmax(dim=-1))

      Y_shuffled = torch.cat(Y_shuffled)
      Y_preds = torch.cat(Y_preds)

      loss = torch.tensor(losses).mean()
      print("Valid Loss : {:.3f}".format(loss))
    return loss


# Training function
def TrainModel(model, loss_fn, optimizer, train_loader, val_loader, epochs, early_stopping_info, model_name):
    patience_acc = 0
    precedent_loss = np.Inf
    model.train()
    for i in range(1, epochs+1):
        losses = []
        for X, Y in tqdm(train_loader):

            Y_preds = model(X)

            loss = loss_fn(Y_preds, Y)
            losses.append(loss.item())

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        loss = CalcValLoss(model, loss_fn, val_loader)
        print("Train Loss : {:.3f}".format(torch.tensor(losses).mean()))
        if precedent_loss - loss < early_stopping_info["delta"]:
           patience_acc = patience_acc + 1
        else:
          patience_acc = 0
          torch.save(model, model_name + "_best.pth")

        if patience_acc > early_stopping_info["patience"]:
          return torch.load(model_name + "_best.pth")
        precedent_loss = loss
            
    return model

In [20]:
epochs = 50
learning_rate = 1e-4
batch_size = 32

loss_fn = nn.BCEWithLogitsLoss()
embed_classifier = EmbeddingClassifier()
optimizer = Adam(embed_classifier.parameters(), lr=learning_rate)

# Construction of the Dataloaders for train and validation
train_loader = DataLoader(whole_dataset["train"], batch_size=batch_size, collate_fn=lambda x:tuple(y.to(device) for y in vectorize_batch(x)))
val_loader  = DataLoader(whole_dataset["val"], batch_size=batch_size, collate_fn=lambda x:tuple(y.to(device) for y in vectorize_batch(x)))
test_loader  = DataLoader(whole_dataset["test"], batch_size=batch_size, collate_fn=lambda x:tuple(y.to(device) for y in vectorize_batch(x)))


embed_classifier.to(device)
summary(embed_classifier, 
                input_data=next(iter(train_loader))[0],
                device=device)


Layer (type:depth-idx)                   Output Shape              Param #
EmbeddingClassifier                      [32, 20]                  --
├─GRU: 1-1                               [32, 35, 200]             121,200
├─Flatten: 1-2                           [32, 7000]                --
├─Linear: 1-3                            [32, 512]                 3,584,512
├─ReLU: 1-4                              [32, 512]                 --
├─Linear: 1-5                            [32, 128]                 65,664
├─ReLU: 1-6                              [32, 128]                 --
├─Linear: 1-7                            [32, 20]                  2,580
Total params: 3,773,956
Trainable params: 3,773,956
Non-trainable params: 0
Total mult-adds (M): 252.63
Input size (MB): 0.45
Forward/backward pass size (MB): 1.96
Params size (MB): 15.10
Estimated Total Size (MB): 17.50

In [36]:
fix_random(seed)
embed_classifier = TrainModel(embed_classifier, loss_fn, optimizer, train_loader, val_loader, epochs, {"patience": 3, "delta": 1e-4}, "glove")

100%|██████████| 131/131 [00:15<00:00,  8.48it/s]


Valid Loss : 0.412
Train Loss : 0.446


100%|██████████| 131/131 [00:15<00:00,  8.65it/s]


Valid Loss : 0.409
Train Loss : 0.414


100%|██████████| 131/131 [00:15<00:00,  8.64it/s]


Valid Loss : 0.400
Train Loss : 0.404


100%|██████████| 131/131 [00:15<00:00,  8.57it/s]


Valid Loss : 0.392
Train Loss : 0.389


100%|██████████| 131/131 [00:18<00:00,  7.18it/s]


Valid Loss : 0.386
Train Loss : 0.375


100%|██████████| 131/131 [00:20<00:00,  6.43it/s]


Valid Loss : 0.382
Train Loss : 0.363


100%|██████████| 131/131 [00:20<00:00,  6.52it/s]


Valid Loss : 0.378
Train Loss : 0.354


100%|██████████| 131/131 [00:18<00:00,  7.08it/s]


Valid Loss : 0.375
Train Loss : 0.346


100%|██████████| 131/131 [00:22<00:00,  5.75it/s]


Valid Loss : 0.374
Train Loss : 0.339


100%|██████████| 131/131 [00:18<00:00,  7.10it/s]


Valid Loss : 0.374
Train Loss : 0.332


100%|██████████| 131/131 [00:18<00:00,  7.26it/s]


Valid Loss : 0.374
Train Loss : 0.325


100%|██████████| 131/131 [00:19<00:00,  6.86it/s]


Valid Loss : 0.375
Train Loss : 0.318


100%|██████████| 131/131 [00:17<00:00,  7.35it/s]


Valid Loss : 0.377
Train Loss : 0.311


100%|██████████| 131/131 [00:17<00:00,  7.45it/s]


Valid Loss : 0.378
Train Loss : 0.303


100%|██████████| 131/131 [00:18<00:00,  7.23it/s]


Valid Loss : 0.380
Train Loss : 0.296


In [None]:
print_report(embed_classifier, val_loader, whole_dataset["val"]["labels"] ,0.25)
# batchsize 32, 1e-4, no conclusion/stance 0.3648 max 25
# batchsize 64, 1e-4, no conclusion/stance 0.3478
# batchsize 32, 1e-4, conclusion/stance 0.3505 max 35


In [None]:
gc.collect()

0

In [31]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_model.to(device)
print("Bert loaded")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Bert loaded


In [35]:
max_words_bert = 70
# collate function that uses the tokenizer relative to the bert pretrained model
def bert_vectorize_batch(batch):
    X = [elem["Premise"] + " [SEP] " + elem["Stance"] + " [SEP] " + elem["Conclusion"] for elem in batch]
    Y = [elem["labels"] for elem in batch]
    X = bert_tokenizer(X, padding="max_length", truncation="longest_first", return_tensors = "pt", max_length = max_words_bert) 
    Y_tensor = torch.zeros(len(batch), Y[0].shape[0])
    for i, tokens in enumerate(Y):    
        Y_tensor[i] = Y[i]
    X_tensor = torch.stack([X["input_ids"], X["token_type_ids"], X["attention_mask"]])

    return X_tensor, Y_tensor

train_dataset = whole_dataset["train"]
val_dataset = whole_dataset["val"] 
test_dataset = whole_dataset["test"] 

In [36]:
# Simple model to perform some tests with pytorch
class BertLSTM(nn.Module):
    def __init__(self,bert_model):
        super(BertLSTM, self).__init__() 
        self.lstm_layers = 2
        self.lstm_hs = 128
        bert_hidden_size = bert_model.config.hidden_size

        self.bert_model = bert_model
        for param in self.bert_model.parameters():
            param.requires_grad = False

        self.lstm = nn.LSTM(input_size=bert_hidden_size,
                            hidden_size=self.lstm_hs,
                            num_layers=self.lstm_layers ,
                            batch_first=True,
                            bidirectional=True)
        self.reducer_c0 = nn.Linear(bert_hidden_size, self.lstm_hs)
        self.reducer_h0 = nn.Linear(bert_hidden_size, self.lstm_hs)
        self.linear_1 = nn.Linear(self.lstm_hs*2*self.lstm_layers, self.lstm_hs)
        self.relu = nn.ReLU()
        self.linear_2 = nn.Linear(self.lstm_hs, num_classes) 

    def forward(self, X_batch):
        out = self.bert_model(input_ids=X_batch[0], token_type_ids = X_batch[1], attention_mask = X_batch[2])
        cell = self.reducer_c0(out.pooler_output)
        hidden = self.reducer_h0(out.pooler_output)
        out = out.last_hidden_state[:,1:,:]
        c0 = torch.stack([cell,cell,cell,cell])
        h0 = torch.stack([hidden, hidden, hidden, hidden])
        out_lstm, hc_n  = self.lstm(out, (h0, c0))
        c_n = hc_n[1].permute(1, 0, 2)
        out = torch.cat([c_n[:,0,:], c_n[:,1,:]], 1)
        out2 = torch.cat([c_n[:,2,:], c_n[:,3,:]], 1)
        out = torch.cat([out, out2], 1)
        out = self.linear_1(out)
        out = self.relu(out)
        out = self.linear_2(out)
        return out

In [37]:
batch_size = 32
epochs = 50
learning_rate = 1e-3

loss_fn = nn.BCEWithLogitsLoss()
prebert_classifier = BertLSTM(bert_model)
optimizer = Adam(prebert_classifier.parameters(), lr=learning_rate)

bert_train_loader = DataLoader(whole_dataset["train"], batch_size=batch_size, collate_fn=lambda x:tuple(y.to(device) for y in bert_vectorize_batch(x)))
bert_val_loader  = DataLoader(whole_dataset["val"], batch_size=batch_size, collate_fn=lambda x:tuple(y.to(device) for y in bert_vectorize_batch(x)))
bert_test_loader  = DataLoader(whole_dataset["test"], batch_size=batch_size, collate_fn=lambda x:tuple(y.to(device) for y in bert_vectorize_batch(x)))

prebert_classifier.to(device)
summary(prebert_classifier, 
                input_data=next(iter(bert_train_loader))[0],
                device=device)

Layer (type:depth-idx)                                  Output Shape              Param #
BertLSTM                                                [32, 20]                  --
├─BertModel: 1-1                                        [32, 768]                 --
│    └─BertEmbeddings: 2-1                              [32, 70, 768]             --
│    │    └─Embedding: 3-1                              [32, 70, 768]             (23,440,896)
│    │    └─Embedding: 3-2                              [32, 70, 768]             (1,536)
│    │    └─Embedding: 3-3                              [1, 70, 768]              (393,216)
│    │    └─LayerNorm: 3-4                              [32, 70, 768]             (1,536)
│    │    └─Dropout: 3-5                                [32, 70, 768]             --
│    └─BertEncoder: 2-2                                 [32, 70, 768]             --
│    │    └─ModuleList: 3-6                             --                        (85,054,464)
│    └─BertPooler: 2-3 

In [None]:
fix_random(seed)
prebert_classifier = TrainModel(prebert_classifier, loss_fn, optimizer, bert_train_loader, bert_val_loader, epochs, {"patience": 3, "delta": 1e-4}, "bertencoder")

100%|██████████| 131/131 [00:23<00:00,  5.55it/s]


Valid Loss : 0.387
Train Loss : 0.398


100%|██████████| 131/131 [00:24<00:00,  5.44it/s]


Valid Loss : 0.368
Train Loss : 0.348


100%|██████████| 131/131 [00:22<00:00,  5.75it/s]


Valid Loss : 0.363
Train Loss : 0.327


 80%|████████  | 105/131 [00:18<00:04,  5.29it/s]

In [None]:
print_report(prebert_classifier, bert_val_loader, whole_dataset["val"]["labels"], 0.25)

In [None]:
prebert_classifier = None
gc.collect()

26

In [21]:
# Simple model to perform some tests with pytorch
class FineTunedBert(nn.Module):
    def __init__(self, bert_model):
        super(FineTunedBert, self).__init__() 
        self.bert_model = bert_model
        for param in self.bert_model.parameters():
            param.requires_grad = True
        bert_hidden_size = bert_model.config.hidden_size
        self.linear_1 = nn.Linear(bert_hidden_size, bert_hidden_size//2)
        self.relu = nn.ReLU()
        self.linear_2 = nn.Linear(bert_hidden_size//2, num_classes)

    def forward(self, X_batch):

        out = self.bert_model(input_ids=X_batch[0], 
                              token_type_ids = X_batch[1],
                              attention_mask = X_batch[2])

        out = out.last_hidden_state[:,0,:]
        out = self.linear_1(out)
        out = self.relu(out)
        out = self.linear_2(out)
        return out

# Training function
def finetune_bert(model, loss_fn, optimizer, train_loader, val_loader, epochs, early_stopping_info, model_name, scheduler):
    patience_acc = 0
    precedent_loss = np.Inf
    model.train()
    for i in range(1, epochs+1):
        losses = []
        for X, Y in tqdm(train_loader):
            model.zero_grad()
            Y_preds = model(X)
            loss = loss_fn(Y_preds, Y)
            losses.append(loss.item())

            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()

        loss = CalcValLoss(model, loss_fn, val_loader)
        print("Train Loss : {:.3f}".format(torch.tensor(losses).mean()))
        if precedent_loss - loss < early_stopping_info["delta"]:
           patience_acc = patience_acc + 1
        else:
          patience_acc = 0
          precedent_loss = loss
          torch.save(model, model_name + "_best.pth")

        if patience_acc > early_stopping_info["patience"]:
          return torch.load(model_name + "best.pth")


    return model

In [23]:
bert_model_unfrozen = BertModel.from_pretrained('bert-base-uncased')
bert_model_unfrozen.to(device)
print("reloaded")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


reloaded


In [24]:
batch_size = 16
epochs = 4
learning_rate = 5e-5

loss_fn = nn.BCEWithLogitsLoss()

finetune_classifier = FineTunedBert(bert_model_unfrozen)
optimizer = AdamW(finetune_classifier.parameters(), lr=learning_rate, eps=1e-8)


bert_train_loader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=lambda x:tuple(y.to(device) for y in bert_vectorize_batch(x)))
bert_val_loader  = DataLoader(val_dataset, batch_size=batch_size, collate_fn=lambda x:tuple(y.to(device) for y in bert_vectorize_batch(x)))
bert_test_loader  = DataLoader(test_dataset, batch_size=batch_size, collate_fn=lambda x:tuple(y.to(device) for y in bert_vectorize_batch(x)))

scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0,
                                            num_training_steps=len(bert_train_loader)*epochs)

finetune_classifier.to(device)
summary(finetune_classifier,input_data=next(iter(bert_train_loader))[0], device=device, dtypes = [torch.int]*3)

Layer (type:depth-idx)                                  Output Shape              Param #
FineTunedBert                                           [16, 20]                  --
├─BertModel: 1-1                                        [16, 768]                 --
│    └─BertEmbeddings: 2-1                              [16, 70, 768]             --
│    │    └─Embedding: 3-1                              [16, 70, 768]             23,440,896
│    │    └─Embedding: 3-2                              [16, 70, 768]             1,536
│    │    └─Embedding: 3-3                              [1, 70, 768]              393,216
│    │    └─LayerNorm: 3-4                              [16, 70, 768]             1,536
│    │    └─Dropout: 3-5                                [16, 70, 768]             --
│    └─BertEncoder: 2-2                                 [16, 70, 768]             --
│    │    └─ModuleList: 3-6                             --                        85,054,464
│    └─BertPooler: 2-3           

In [25]:
fix_random(seed)
finetune_classifier = finetune_bert(finetune_classifier, 
                                   loss_fn, optimizer,
                                   bert_train_loader,
                                   bert_val_loader,
                                   epochs,
                                   {"patience": 3, "delta": 1e-4}, 
                                   "finebert", scheduler)
# 5e-5, 3 epochs, 32 batch size 0.436
# 3e-5, 3 epochs, 16 batch size 0.43
# 5e-5, 3 epochs, 16 batch size 0.4887
# 5e-5, 4 epochs, 16 batch size 0.5115


100%|██████████| 261/261 [01:01<00:00,  4.28it/s]


Valid Loss : 0.373
Train Loss : 0.404


100%|██████████| 261/261 [01:02<00:00,  4.16it/s]


Valid Loss : 0.345
Train Loss : 0.323


100%|██████████| 261/261 [01:02<00:00,  4.18it/s]


Valid Loss : 0.338
Train Loss : 0.279


100%|██████████| 261/261 [01:02<00:00,  4.15it/s]


Valid Loss : 0.329
Train Loss : 0.252


In [26]:
print("FINETUNED BERT:")
print_report(finetune_classifier, bert_val_loader, whole_dataset["val"]["labels"], 0.25)

FINETUNED BERT:
----- MACRO AVG. -----
  F1-score:	0.403
  Precision:	0.4485
  Recall:	0.4291
  Accuracy:	0.841
----- PER-CLASS VALUES -----
  				F1-score	Precision	Recall		Accuracy	Support
  Self-direction: thought   	0.3554		0.2488		0.622		0.848		82
  Self-direction: action    	0.5338		0.4942		0.5802		0.756		293
  Stimulation               	0.1154		0.3333		0.0698		0.9622		43
  Hedonism                  	0.2703		1.0		0.1562		0.9778		32
  Achievement               	0.6441		0.5745		0.7328		0.7584		363
  Power: dominance          	0.2694		0.3939		0.2047		0.8841		127
  Power: resources          	0.6031		0.58		0.6282		0.8118		277
  Face                      	0.0		0.0		0.0		0.9269		89
  Security: personal        	0.7101		0.6254		0.8214		0.7223		504
  Security: societal        	0.7298		0.6716		0.7991		0.7798		453
  Tradition                 	0.2745		0.3182		0.2414		0.9088		87
  Conformity: rules         	0.5463		0.468		0.6559		0.7502		279
  Conformity: interpersonal 	0.0		0.0		0.0		0.954		5

In [None]:
chn_loader = DataLoader(whole_dataset["test_chn"], batch_size=32, collate_fn=lambda x:tuple(y.to(device) for y in vectorize_batch(x)))
bert_chn_loader = DataLoader(whole_dataset["test_chn"], batch_size=32, collate_fn=lambda x:tuple(y.to(device) for y in bert_vectorize_batch(x)))


In [None]:
print_report(embed_classifier, chn_loader, whole_dataset["test_chn"]["labels"], 0.25)

----- MACRO AVG. -----
  F1-score:	0.1647
  Precision:	0.2646
  Recall:	0.1446
  Accuracy:	0.8745
----- PER-CLASS VALUES -----
  				F1-score	Precision	Recall		Accuracy	Support
  Self-direction: thought   	0.0		0.0		0.0		0.88		6
  Self-direction: action    	0.2667		0.5		0.1818		0.89		11
  Stimulation               	0.0		0.0		0.0		1.0		0
  Hedonism                  	0.0		0.0		0.0		0.98		2
  Achievement               	0.481		0.475		0.4872		0.59		39
  Power: dominance          	0.0		0.0		0.0		0.99		1
  Power: resources          	0.3529		0.4		0.3158		0.78		19
  Face                      	0.0		0.0		0.0		0.99		1
  Security: personal        	0.4762		0.4545		0.5		0.67		30
  Security: societal        	0.3462		0.4286		0.2903		0.66		31
  Tradition                 	0.0		0.0		0.0		1.0		0
  Conformity: rules         	0.125		1.0		0.0667		0.86		15
  Conformity: interpersonal 	0.0		0.0		0.0		0.99		1
  Humility                  	0.0		0.0		0.0		0.95		5
  Benevolence: caring       	0.1429		0.5		0.0833		0.

In [None]:
print_report(prebert_classifier, bert_chn_loader, whole_dataset["test_chn"]["labels"], 0.25)

----- MACRO AVG. -----
  F1-score:	0.2551
  Precision:	0.2188
  Recall:	0.3798
  Accuracy:	0.8035
----- PER-CLASS VALUES -----
  				F1-score	Precision	Recall		Accuracy	Support
  Self-direction: thought   	0.5556		0.4167		0.8333		0.92		6
  Self-direction: action    	0.2979		0.1944		0.6364		0.67		11
  Stimulation               	0.0		0.0		0.0		1.0		0
  Hedonism                  	0.0		0.0		0.0		0.98		2
  Achievement               	0.6		0.4444		0.9231		0.52		39
  Power: dominance          	0.0		0.0		0.0		0.9		1
  Power: resources          	0.375		0.2459		0.7895		0.5		19
  Face                      	0.0		0.0		0.0		0.99		1
  Security: personal        	0.495		0.3521		0.8333		0.49		30
  Security: societal        	0.4719		0.3621		0.6774		0.53		31
  Tradition                 	0.0		0.0		0.0		1.0		0
  Conformity: rules         	0.3448		0.3571		0.3333		0.81		15
  Conformity: interpersonal 	0.0		0.0		0.0		0.99		1
  Humility                  	0.0		0.0		0.0		0.95		5
  Benevolence: caring       	0.4167

In [None]:
print_report(finetune_classifier, bert_chn_loader, whole_dataset["test_chn"]["labels"], 0.25)

----- MACRO AVG. -----
  F1-score:	0.2712
  Precision:	0.2211
  Recall:	0.3944
  Accuracy:	0.836
----- PER-CLASS VALUES -----
  				F1-score	Precision	Recall		Accuracy	Support
  Self-direction: thought   	0.375		0.2308		1.0		0.8		6
  Self-direction: action    	0.36		0.2308		0.8182		0.68		11
  Stimulation               	0.0		0.0		0.0		0.98		0
  Hedonism                  	0.0		0.0		0.0		0.98		2
  Achievement               	0.6598		0.5517		0.8205		0.67		39
  Power: dominance          	0.0		0.0		0.0		0.99		1
  Power: resources          	0.4314		0.3438		0.5789		0.71		19
  Face                      	0.0		0.0		0.0		0.98		1
  Security: personal        	0.5412		0.4182		0.7667		0.61		30
  Security: societal        	0.4444		0.3902		0.5161		0.6		31
  Tradition                 	0.0		0.0		0.0		1.0		0
  Conformity: rules         	0.4324		0.3636		0.5333		0.79		15
  Conformity: interpersonal 	0.0		0.0		0.0		0.99		1
  Humility                  	0.0		0.0		0.0		0.95		5
  Benevolence: caring       	0.3158	