# Assignment 2

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: Human Value Detection, Multi-label classification, Transformers, BERT


# Contact

For any doubts, questions, issues, or requests for help, you can always contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

### Instructions

* **Download** the specified training, validation, and test files.
* **Encode** each split file into a `pandas.DataFrame` object.
* For each split, **merge** the arguments and labels dataframes into a single dataframe.
* **Merge** level 2 annotations to level 3 categories.

In [None]:
!pip install seaborn
!pip install transformers
!pip install torcheval
!pip install gdown

# Introduction

You are tasked to address the [Human Value Detection challenge](https://aclanthology.org/2022.acl-long.306/).

## Problem definition

Arguments are paired with their conveyed human values.

Arguments are in the form of **premise** $\rightarrow$ **conclusion**.

### Example:

**Premise**: *"fast food should be banned because it is really bad for your health and is costly"*

**Conclusion**: *"We should ban fast food"*

**Stance**: *in favor of*

<center>
    <img src="images/human_values.png" alt="human values" />
</center>

# [Task 1 - 0.5 points] Corpus

Check the official page of the challenge [here](https://touche.webis.de/semeval23/touche23-web/).

The challenge offers several corpora for evaluation and testing.

You are going to work with the standard training, validation, and test splits.

#### Arguments
* arguments-training.tsv
* arguments-validation.tsv
* arguments-test.tsv

#### Human values
* labels-training.tsv
* labels-validation.tsv
* labels-test.tsv

In [None]:
!curl -L "https://zenodo.org/records/8248658/files/arguments-training.tsv?download=1" -o arguments-training.tsv
!curl -L "https://zenodo.org/records/8248658/files/arguments-validation.tsv?download=1" -o arguments-validation.tsv
!curl -L "https://zenodo.org/records/8248658/files/arguments-test.tsv?download=1" -o arguments-test.tsv
!curl -L "https://zenodo.org/records/8248658/files/labels-training.tsv?download=1" -o labels-training.tsv
!curl -L "https://zenodo.org/records/8248658/files/labels-validation.tsv?download=1" -o labels-validation.tsv
!curl -L "https://zenodo.org/records/8248658/files/labels-test.tsv?download=1" -o labels-test.tsv

### Example

#### arguments-*.tsv
```

Argument ID    A01005

Conclusion     We should ban fast food

Stance         in favor of

Premise        fast food should be banned because it is really bad for your health and is costly.
```

#### labels-*.tsv

```
Argument ID                A01005

Self-direction: thought    0
Self-direction: action     0
...
Universalism: objectivity  0
```
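
Files shaped like the two examples above can be loaded and joined on their shared `Argument ID` column. The snippet below is a minimal sketch, not the notebook's own `import_features` / `import_labels` helpers: it uses tiny in-memory strings instead of the downloaded files, and only two label columns, so the names `arguments_tsv`, `labels_tsv`, and `merged_df` are illustrative assumptions.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for arguments-*.tsv and labels-*.tsv (tab-separated).
arguments_tsv = (
    "Argument ID\tConclusion\tStance\tPremise\n"
    "A01005\tWe should ban fast food\tin favor of\t"
    "fast food should be banned because it is really bad for your health and is costly.\n"
)
labels_tsv = (
    "Argument ID\tSelf-direction: thought\tSelf-direction: action\n"
    "A01005\t0\t0\n"
)

arguments_df = pd.read_csv(io.StringIO(arguments_tsv), sep="\t")
labels_df = pd.read_csv(io.StringIO(labels_tsv), sep="\t")

# Join on the shared identifier; validate="one_to_one" raises if an ID is duplicated.
merged_df = arguments_df.merge(
    labels_df, on="Argument ID", how="inner", validate="one_to_one"
)
print(merged_df.columns.tolist())
```

With the real files, the `io.StringIO(...)` arguments would be replaced by the file paths; an inner join silently drops arguments that have no labels, so comparing row counts before and after the merge is a cheap sanity check.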

### Splits

The standard splits contain:

   * **Train**: 5393 arguments
   * **Validation**: 1896 arguments
   * **Test**: 1576 arguments

### Annotations

In this assignment, you are tasked to address a multi-label classification problem.

You are going to consider **level 3** categories:

* Openness to change
* Self-enhancement
* Conservation
* Self-transcendence

**How to do that?**

You have to merge (**logical OR**) annotations of level 2 categories belonging to the same level 3 category.

**Pay attention to shared level 2 categories** (e.g., Hedonism). $\rightarrow$ [see Table 1 in the original paper.](https://aclanthology.org/2022.acl-long.306/)

#### Example

```
Self-direction: thought:    0
Self-direction: action:     1
Stimulation:                0
Hedonism:                   1

Openness to change          1
```
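
The logical OR merge can be sketched with pandas as follows. This is an illustrative sketch rather than the notebook's `create_third_level_labels` helper. The mapping is deliberately partial: the Openness to change group follows the example above, while the Self-enhancement group is an assumed, incomplete excerpt that must be checked against Table 1 of the paper. The label values are made up.

```python
import pandas as pd

# Level 2 annotations for two toy arguments (values are made up for illustration).
level2 = pd.DataFrame(
    {
        "Self-direction: thought": [0, 0],
        "Self-direction: action": [1, 0],
        "Stimulation": [0, 0],
        "Hedonism": [1, 0],
        "Achievement": [0, 0],
    },
    index=["A1", "A2"],
)

# Partial mapping only (assumption): complete it from Table 1 of the paper.
# Hedonism is shared, so it appears in both groups.
level3_groups = {
    "Openness to change": [
        "Self-direction: thought",
        "Self-direction: action",
        "Stimulation",
        "Hedonism",
    ],
    "Self-enhancement": ["Hedonism", "Achievement"],
}

# A level 3 label is 1 if any of its level 2 columns is 1 (row-wise logical OR).
level3 = pd.DataFrame(
    {
        name: level2[columns].any(axis=1).astype(int)
        for name, columns in level3_groups.items()
    }
)
print(level3)
```

Because Hedonism is listed in both groups, argument `A1` is positive for both level 3 categories even though `Achievement` is 0.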

In [None]:
import copy
import random
import numpy as np
import pandas as pd
import torch
from torch import nn
from torch.utils.data import DataLoader
from torcheval.metrics.functional import binary_f1_score
from transformers import AutoTokenizer, AutoModel
import gdown

In [None]:
gdown.download('https://drive.usercontent.google.com/u/0/uc?id=17Nb2c918XvENe6JoP65cX95x27zXJT77&export=download', 'utils.tar.gz', quiet=False)

!tar -xf utils.tar.gz

In [None]:
from file_reader import import_features, import_labels
from dataframe_modifier import modify_stance, create_third_level_labels
from CustomDataset import CustomDataset
from network_trainer import train, evaluate_model
from plots import generate_summary, generate_precision_recall_curve, generate_confusion_matrix, \
    generate_f1_scores_table, generate_bar_plot_with_f1_scores, generate_training_history_plots, \
    show_some_misclassified_examples, generate_bar_plot, generate_correlation_heatmap


save_best_models = False
# If `load_best_models` is set to True, the notebook will automatically try to download
# the models' weights from Google Drive
load_best_models = True
models_load_link = 'https://drive.usercontent.google.com/download?id=13xq53-QPIqb-SQAURvPL0v-YvzW6iPxh&export=download&confirm=t&uuid=75139e51-946d-4242-8a66-b50a58c05e5d'
models_load_path = 'best_models.tar'
models_save_path = 'best_models.tar'

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model_id = 1
models_dict = [{'name': 'bert-base-uncased', 'head_size': 768},
               {'name': 'roberta-base', 'head_size': 768},
               {'name': 'roberta-large', 'head_size': 1024}]

initializer_seed = 111
# TODO: before running with the good seeds fix y-axis limits of plots
seeds = [339, 1234, 4321]

num_epochs = 10

In [None]:
random.seed(initializer_seed)
np.random.seed(initializer_seed)
torch.manual_seed(initializer_seed)

train_dataframe, validation_dataframe, test_dataframe = import_features()
lab_train_dataframe, lab_validation_dataframe, lab_test_dataframe = import_labels()
modify_stance(train_dataframe, validation_dataframe, test_dataframe)

third_level_train_dataframe, third_level_validation_dataframe, third_level_test_dataframe = \
    create_third_level_labels(lab_train_dataframe, lab_validation_dataframe, lab_test_dataframe)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(models_dict[model_id]['name'])

# Generate datasets and dataloaders
training_set = CustomDataset(train_dataframe, third_level_train_dataframe, tokenizer)
validation_set = CustomDataset(validation_dataframe, third_level_validation_dataframe, tokenizer)
test_set = CustomDataset(test_dataframe, third_level_test_dataframe, tokenizer)

training_loader = DataLoader(training_set, batch_size=16, shuffle=True)
validation_loader = DataLoader(validation_set, batch_size=16, shuffle=False)
test_loader = DataLoader(test_set, batch_size=16, shuffle=False)

In [None]:
generate_bar_plot(third_level_train_dataframe, third_level_validation_dataframe, third_level_test_dataframe)

In [None]:
generate_correlation_heatmap(third_level_train_dataframe, third_level_validation_dataframe, third_level_test_dataframe)

# [Task 2 - 2.0 points] Model definition

You are tasked to define several neural models for multi-label classification.

<center>
    <img src="images/model_schema.png" alt="model_schema" />
</center>

### Instructions

* **Baseline**: implement a random uniform classifier (an individual classifier per category).
* **Baseline**: implement a majority classifier (an individual classifier per category).

<br/>

* **BERT w/ C**: define a BERT-based classifier that receives an argument **conclusion** as input.
* **BERT w/ CP**: add argument **premise** as an additional input.
* **BERT w/ CPS**: add argument premise-to-conclusion **stance** as an additional input.

### Notes

**Do not mix models**. Each model has its own instructions.

You are **free** to select the BERT-based model card from huggingface.

#### Examples

```
bert-base-uncased
prajjwal1/bert-tiny
distilbert-base-uncased
roberta-base
```

### BERT w/ C

<center>
    <img src="images/bert_c.png" alt="BERT w/ C" />
</center>

### BERT w/ CP

<center>
    <img src="images/bert_cp.png" alt="BERT w/ CP" />
</center>

### BERT w/ CPS

<center>
    <img src="images/bert_cps.png" alt="BERT w/ CPS" />
</center>

### Input concatenation

<center>
    <img src="images/input_merging.png" alt="Input merging" />
</center>
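
One possible reading of the merging step is: pool each encoder output into a single vector, then concatenate the pooled vectors with the numeric stance. The sketch below uses NumPy arrays as stand-ins for encoder outputs, so the arrays and the helper name `masked_mean` are assumptions for illustration. It averages over real tokens only, which is one pooling choice among several and not necessarily the one the figure prescribes.

```python
import numpy as np


def masked_mean(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over real tokens only, ignoring padding."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1.0, None)                 # avoid division by zero
    return summed / counts


# Toy encoder outputs: batch of 2, sequence length 3, hidden size 2.
conclusion_states = np.array(
    [[[1.0, 1.0], [3.0, 3.0], [0.0, 0.0]],
     [[2.0, 2.0], [2.0, 2.0], [2.0, 2.0]]]
)
conclusion_mask = np.array([[1, 1, 0], [1, 1, 1]])
premise_states = np.ones((2, 3, 2))
premise_mask = np.array([[1, 1, 1], [1, 0, 0]])
stance = np.array([1.0, 0.0])  # already encoded as numbers

conclusion_vec = masked_mean(conclusion_states, conclusion_mask)
premise_vec = masked_mean(premise_states, premise_mask)

# Final feature vector per argument: [conclusion | premise | stance].
features = np.concatenate([conclusion_vec, premise_vec, stance[:, None]], axis=1)
print(features.shape)
```

The concatenated width is twice the hidden size plus one, which is the input size a CPS classification head needs.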

### Notes

The **stance** input has to be encoded into a numerical format.

You **should** use the same model instance to encode **premise** and **conclusion** inputs.
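
A minimal sketch of one way to encode the stance numerically is shown below. It is not the notebook's `modify_stance` helper. The value `"in favor of"` comes from the example above; `"against"` is assumed to be the only other value, and the helper name `encode_stance` is hypothetical.

```python
import pandas as pd

# "in favor of" appears in the corpus example; "against" is an assumed counterpart.
STANCE_TO_ID = {"in favor of": 1.0, "against": 0.0}


def encode_stance(stance: pd.Series) -> pd.Series:
    """Map stance strings to floats, failing loudly on unseen values."""
    encoded = stance.str.strip().str.lower().map(STANCE_TO_ID)
    if encoded.isna().any():
        unknown = sorted(stance[encoded.isna()].unique())
        raise ValueError(f"Unexpected stance values: {unknown}")
    return encoded


stances = pd.Series(["in favor of", "against", "In favor of"])
encoded_stances = encode_stance(stances)
print(encoded_stances.tolist())
```

Raising on unknown values, rather than silently mapping them to `NaN`, makes a mismatch between the assumed and the actual stance vocabulary visible immediately.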

In [None]:
def random_uniform_classifier(test_labels):
    test_labels = test_labels.values
    n_instances = test_labels.shape[0]
    n_classes = test_labels.shape[1]

    predictions = np.random.randint(2, size=(n_instances, n_classes))
    predictions = torch.Tensor(predictions)
    return predictions


def majority_classifier(training_labels, test_labels):
    training_labels = training_labels.values
    test_labels = test_labels.values
    n_instances = test_labels.shape[0]
    n_classes = test_labels.shape[1]

    predictions = np.zeros((n_instances, n_classes))
    for i in range(n_classes):
        if np.sum(training_labels[:, i]) >= len(training_labels) / 2:
            predictions[:, i] = 1
    predictions = torch.Tensor(predictions)
    return predictions


def one_baseline(test_labels):
    test_labels = test_labels.values
    n_instances = test_labels.shape[0]
    n_classes = test_labels.shape[1]

    predictions = np.ones((n_instances, n_classes))
    predictions = torch.Tensor(predictions)
    return predictions


# Baselines
# 1. Random
predictions_random = random_uniform_classifier(third_level_test_dataframe)
val_predictions_random = random_uniform_classifier(third_level_validation_dataframe)

# 2. Majority
predictions_majority = majority_classifier(third_level_train_dataframe, third_level_test_dataframe)
val_predictions_majority = majority_classifier(third_level_train_dataframe, third_level_validation_dataframe)

# 3. 1-baseline
predictions_one = one_baseline(third_level_test_dataframe)
val_predictions_one = one_baseline(third_level_validation_dataframe)

In [None]:
class ClassifierC(nn.Module):
    def __init__(self, name, head_size):
        super(ClassifierC, self).__init__()
        self.name_ = 'C'
        self.embedder = AutoModel.from_pretrained(name)
        for param in self.embedder.parameters():
            param.requires_grad = False
        self.linear = nn.Linear(head_size, 4)
        self.tanh = nn.Tanh()

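    def forward(self, x):
        # x[0] is the tokenized conclusion (a dict with input_ids and attention_mask).
        x = x[0]
        attention_mask = x['attention_mask'].unsqueeze(-1)
        x = self.embedder(**x).last_hidden_state
        # Zero out padding tokens, then average over the (padded) sequence length.
        x = x * attention_mask
        x = x.mean(dim=1)
        x = self.linear(x)
        # Map to (0, 1): (tanh(x) + 1) / 2 is equivalent to sigmoid(2x).
        x = (self.tanh(x) + 1) / 2
        return x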


class ClassifierCP(nn.Module):
    def __init__(self, name, head_size):
        super(ClassifierCP, self).__init__()
        self.name_ = 'CP'
        self.embedder = AutoModel.from_pretrained(name)
        for param in self.embedder.parameters():
            param.requires_grad = False
        self.linear = nn.Linear(head_size * 2, 4)
        self.tanh = nn.Tanh()

    def forward(self, x):
        y = x[0]
        z = x[1]
        attention_mask_y = y['attention_mask'].unsqueeze(-1)
        attention_mask_z = z['attention_mask'].unsqueeze(-1)
        y = self.embedder(**y).last_hidden_state
        z = self.embedder(**z).last_hidden_state
        y = y * attention_mask_y
        z = z * attention_mask_z
        y = y.mean(dim=1)
        z = z.mean(dim=1)
        x = torch.cat([y, z], 1)
        x = self.linear(x)
        x = (self.tanh(x) + 1) / 2
        return x


class ClassifierCPS(nn.Module):
    def __init__(self, name, head_size):
        super(ClassifierCPS, self).__init__()
        self.name_ = 'CPS'
        self.embedder = AutoModel.from_pretrained(name)
        for param in self.embedder.parameters():
            param.requires_grad = False
        self.linear = nn.Linear(head_size * 2 + 1, 4)
        self.tanh = nn.Tanh()

    def forward(self, x):
        y = x[0]
        z = x[1]
        w = x[2]
        attention_mask_y = y['attention_mask'].unsqueeze(-1)
        attention_mask_z = z['attention_mask'].unsqueeze(-1)
        y = self.embedder(**y).last_hidden_state
        z = self.embedder(**z).last_hidden_state
        y = y * attention_mask_y
        z = z * attention_mask_z
        y = y.mean(dim=1)
        z = z.mean(dim=1)
        x = torch.cat([y, z, w.reshape((y.shape[0], 1))], 1)
        x = self.linear(x)
        x = (self.tanh(x) + 1) / 2
        return x

# [Task 3 - 0.5 points] Metrics

Before training the models, you are tasked to define the evaluation metrics for comparison.

### Instructions

* Evaluate your models using per-category binary F1-score.
* Compute the average binary F1-score over all categories (macro F1-score).

### Example

You start with individual predictions ($\rightarrow$ samples).

```
Openness to change:   0 0 1 0 1 1 0 ...
Self-enhancement:     1 0 0 0 1 0 1 ...
Conservation:         0 0 0 1 1 0 1 ...
Self-transcendence:   1 1 0 1 0 1 0 ...
```

You compute per-category binary F1-score.

```
Openness to change F1:   0.35
Self-enhancement F1:     0.55
Conservation F1:         0.80
Self-transcendence F1:   0.21
```

You then average per-category scores.
```
Average F1: ~0.48
```
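
The computation above can be reproduced from first principles. The following sketch uses NumPy rather than `torcheval`, with made-up predictions and targets. Returning 0 when a category has no positives in either predictions or targets is an assumed convention.

```python
import numpy as np


def binary_f1(predictions: np.ndarray, targets: np.ndarray) -> float:
    """F1 = 2TP / (2TP + FP + FN) for one category of 0/1 labels."""
    tp = int(np.sum((predictions == 1) & (targets == 1)))
    fp = int(np.sum((predictions == 1) & (targets == 0)))
    fn = int(np.sum((predictions == 0) & (targets == 1)))
    denominator = 2 * tp + fp + fn
    # Assumed convention: F1 is 0 when there are no positives at all.
    return 2 * tp / denominator if denominator > 0 else 0.0


def macro_f1(predictions: np.ndarray, targets: np.ndarray):
    """Return (macro F1, list of per-category F1); arrays are (samples, categories)."""
    per_category = [
        binary_f1(predictions[:, i], targets[:, i])
        for i in range(predictions.shape[1])
    ]
    return float(np.mean(per_category)), per_category


# Toy data: 4 samples, 2 categories (values are made up).
toy_predictions = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
toy_targets = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

macro, per_category = macro_f1(toy_predictions, toy_targets)
print(per_category, macro)

# Averaging the four per-category scores from the example above.
example_average = float(np.mean([0.35, 0.55, 0.80, 0.21]))
print(round(example_average, 2))
```

In the toy data the first category has one true positive, one false positive, and one false negative, while the second is predicted perfectly.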

In [None]:
def calculate_f1_score(predictions, targets, verbose=False):
    cols = predictions.shape[1]
    single_class_scores = torch.zeros(cols)
    for i in range(cols):
        single_class_scores[i] = binary_f1_score(predictions[:, i], targets[:, i])
        if verbose:
            print('F1 score for column %d: %.3f' % (i, single_class_scores[i]))
    return torch.mean(single_class_scores), single_class_scores

# [Task 4 - 1.0 points] Training and Evaluation

You are now tasked to train and evaluate **all** defined models.

### Instructions

* Train **all** models on the train set.
* Evaluate **all** models on the validation set.
* Pick **at least** three seeds for robust estimation.
* Compute metrics on the validation set.
* Report **per-category** and **macro** F1-score for comparison.
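
When several seeds are used, one common way to summarize a model is the mean and standard deviation of its validation macro F1 across seeds. The sketch below shows only the aggregation step. The scores in `macro_f1_by_seed` are hypothetical placeholders, not results from this assignment, and should be replaced with the values from your own runs.

```python
import numpy as np

# Hypothetical placeholder scores keyed by seed; NOT real results.
macro_f1_by_seed = {339: 0.70, 1234: 0.72, 4321: 0.68}

scores = np.array(list(macro_f1_by_seed.values()))
mean_f1 = float(scores.mean())
# ddof=1 gives the sample standard deviation, appropriate for a handful of runs.
std_f1 = float(scores.std(ddof=1))

print(f"macro F1 over {len(scores)} seeds: {mean_f1:.3f} +/- {std_f1:.3f}")
```

Reporting the spread alongside the mean helps tell a real difference between models from seed-to-seed noise.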

In [None]:
# Initialize training data structures
histories = {}
best_models = {}
models = {'C': ClassifierC,
          'CP': ClassifierCP,
          'CPS': ClassifierCPS
          }

# Train models
if not load_best_models:
    for seed in seeds:
        for model_type, model_class in models.items():
            random.seed(seed)
            np.random.seed(seed)
            torch.manual_seed(seed)
            print('Set seed to: ', seed)

            model = model_class(**models_dict[model_id])
            model_trained, history = train(model, training_loader, validation_loader, num_epochs)
            if model_type not in histories:
                histories[model_type] = history
                best_models[model_type] = model_trained
            else:
                if history['best_val_macro_f1'] > histories[model_type]['best_val_macro_f1']:
                    histories[model_type] = history
                    best_models[model_type] = model_trained

    if save_best_models:
        torch.save({
            'modelC_state_dict': best_models['C'].state_dict(),
            'modelCP_state_dict': best_models['CP'].state_dict(),
            'modelCPS_state_dict': best_models['CPS'].state_dict(),
            'historyC': histories['C'],
            'historyCP': histories['CP'],
            'historyCPS': histories['CPS']
        }, models_save_path)
        print('Models saved successfully to: ', models_save_path, '\n')
else:
    gdown.download(models_load_link, models_load_path, quiet=False)

    checkpoint = torch.load(models_load_path, map_location=device)
    best_models['C'] = ClassifierC(**models_dict[model_id])
    best_models['C'].load_state_dict(checkpoint['modelC_state_dict'])
    best_models['CP'] = ClassifierCP(**models_dict[model_id])
    best_models['CP'].load_state_dict(checkpoint['modelCP_state_dict'])
    best_models['CPS'] = ClassifierCPS(**models_dict[model_id])
    best_models['CPS'].load_state_dict(checkpoint['modelCPS_state_dict'])
    histories['C'] = checkpoint['historyC']
    histories['CP'] = checkpoint['historyCP']
    histories['CPS'] = checkpoint['historyCPS']
    print('Models loaded successfully from: ', models_load_path, '\n')

## Validation Macro F1 scores

In [None]:
val_macro_f1_scores = []
# Baselines over validation set
val_macro_f1_scores.append(
    ['random', calculate_f1_score(val_predictions_random, torch.Tensor(third_level_validation_dataframe.values))[0].item()])
val_macro_f1_scores.append(
    ['majority', calculate_f1_score(val_predictions_majority, torch.Tensor(third_level_validation_dataframe.values))[0].item()])
val_macro_f1_scores.append(
    ['one', calculate_f1_score(val_predictions_one, torch.Tensor(third_level_validation_dataframe.values))[0].item()])

for model_type, history in histories.items():
    val_macro_f1_scores.append([model_type, history['best_val_macro_f1']])

val_macro_f1_scores = pd.DataFrame(val_macro_f1_scores, columns=['Model', 'Macro F1 score'])
val_macro_f1_scores

## Evaluation of best models on test set

In [None]:
outputs_dict = {'random': predictions_random, 'majority': predictions_majority, 'one': predictions_one}
labels = third_level_test_dataframe.values
crisp_predictions_dict = copy.deepcopy(outputs_dict)

summaries = []
for model_type, model in best_models.items():
    _, _, _, outputs, labels_, crisp_predictions = evaluate_model(model, test_loader, device, verbose=False)
    assert np.array_equal(labels, labels_)
    outputs_dict[model_type] = outputs
    crisp_predictions_dict[model_type] = crisp_predictions

    summaries.append(generate_summary(crisp_predictions, labels, verbose=False))

## Test Macro F1 scores - Classifier C

In [None]:
summaries[0]

## Test Macro F1 scores - Classifier CP

In [None]:
summaries[1]

## Test Macro F1 scores - Classifier CPS

In [None]:
summaries[2]

## Yet another F1 score table

In [None]:
f1_score_table = generate_f1_scores_table(outputs_dict, labels, crisp_predictions_dict, verbose=False)
f1_score_table

# [Task 5 - 1.0 points] Error Analysis

You are tasked to discuss your results.

### Instructions

* **Compare** classification performance of BERT-based models with respect to baselines.
* Discuss **difference in prediction** between the best performing BERT-based model and its variants.

### Notes

You can check the [original paper](https://aclanthology.org/2022.acl-long.306/) for suggestions on how to perform comparisons (e.g., plots and tables).

In [None]:
generate_training_history_plots(histories)

In [None]:
generate_precision_recall_curve(outputs_dict, labels, crisp_predictions_dict)

In [None]:
generate_confusion_matrix(outputs_dict, labels, crisp_predictions_dict)

In [None]:
generate_bar_plot_with_f1_scores(outputs_dict, labels, crisp_predictions_dict)

In [None]:
misclassified_dict = show_some_misclassified_examples(test_dataframe, labels, crisp_predictions_dict, verbose=False)

## Misclassified examples - Classifier C - Label: Openness to change

In [None]:
misclassified_dict['C'][0]

## Misclassified examples - Classifier C - Label: Self-transcendence

In [None]:
misclassified_dict['C'][1]

## Misclassified examples - Classifier C - Label: Self-enhancement

In [None]:
misclassified_dict['C'][2]

## Misclassified examples - Classifier C - Label: Conservation

In [None]:
misclassified_dict['C'][3]

## Misclassified examples - Classifier CP - Label: Openness to change

In [None]:
misclassified_dict['CP'][0]

## Misclassified examples - Classifier CP - Label: Self-transcendence

In [None]:
misclassified_dict['CP'][1]

## Misclassified examples - Classifier CP - Label: Self-enhancement

In [None]:
misclassified_dict['CP'][2]

## Misclassified examples - Classifier CP - Label: Conservation

In [None]:
misclassified_dict['CP'][3]

## Misclassified examples - Classifier CPS - Label: Openness to change

In [None]:
misclassified_dict['CPS'][0]

## Misclassified examples - Classifier CPS - Label: Self-transcendence

In [None]:
misclassified_dict['CPS'][1]

## Misclassified examples - Classifier CPS - Label: Self-enhancement

In [None]:
misclassified_dict['CPS'][2]

## Misclassified examples - Classifier CPS - Label: Conservation

In [None]:
misclassified_dict['CPS'][3]

# [Task 6 - 1.0 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.

# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc.
* You can upload **model weights** in a cloud repository and report the link in the report.

# FAQ

Please check these frequently asked questions before contacting us.

### Model card

You are **free** to choose the BERT-based model card you like from huggingface.

### Model architecture

You **should not** change the architecture of a model (i.e., its layers).

However, you are **free** to play with their hyper-parameters.

### Model Training

You are **free** to choose training hyper-parameters for BERT-based models (e.g., number of epochs).

### Neural Libraries

You are **free** to use any library of your choice to address the assignment (e.g., Keras, Tensorflow, PyTorch, JAX).

### Error Analysis

Some topics for discussion include:
   * Model performance on most/least frequent classes.
   * Precision/Recall curves.
   * Confusion matrices.
   * Specific misclassified samples.

# The End