# Assignment 2

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: Human Value Detection, Multi-label classification, Transformers, BERT


# Contact

For any doubts, questions, or issues, you can contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# Introduction

You are tasked to address the [Human Value Detection challenge](https://aclanthology.org/2022.acl-long.306/).

## Problem definition

Arguments are paired with their conveyed human values.

Arguments are in the form of **premise** $\rightarrow$ **conclusion**.

### Example:

**Premise**: *``fast food should be banned because it is really bad for your health and is costly''*

**Conclusion**: *``We should ban fast food''*

**Stance**: *in favor of*

<center>
    <img src="images/human_values.png" alt="human values" />
</center>

# [Task 1 - 0.5 points] Corpus

Check the official page of the challenge [here](https://touche.webis.de/semeval23/touche23-web/).

The challenge offers several corpora for evaluation and testing.

You are going to work with the standard training, validation, and test splits.

#### Arguments
* arguments-training.tsv
* arguments-validation.tsv
* arguments-test.tsv

#### Human values
* labels-training.tsv
* labels-validation.tsv
* labels-test.tsv

In [48]:
import pandas as pd

train_dataframe = pd.read_csv('arguments-training.tsv', sep='\t')
validation_dataframe = pd.read_csv('arguments-validation.tsv', sep='\t')
test_dataframe = pd.read_csv('arguments-test.tsv', sep='\t')

lab_train_dataframe = pd.read_csv('labels-training.tsv', sep='\t')
lab_validation_dataframe = pd.read_csv('labels-validation.tsv', sep='\t')
lab_test_dataframe = pd.read_csv('labels-test.tsv', sep='\t')

In [49]:
# Encode the stance as an integer: 0 = against, 1 = in favor of.
# Assigning the mapped column back avoids calling replace(..., inplace=True) on a column selection.
stance_encoding = {'against': 0, 'in favor of': 1}

train_dataframe['Stance'] = train_dataframe['Stance'].map(stance_encoding)
validation_dataframe['Stance'] = validation_dataframe['Stance'].map(stance_encoding)
test_dataframe['Stance'] = test_dataframe['Stance'].map(stance_encoding)

In [50]:
print(train_dataframe.columns)

print(train_dataframe.shape)
print(lab_train_dataframe.shape)

print(validation_dataframe.shape)
print(lab_validation_dataframe.shape)

print(test_dataframe.shape)
print(lab_test_dataframe.shape)

Index(['Argument ID', 'Conclusion', 'Stance', 'Premise'], dtype='object')
(5393, 4)
(5393, 21)
(1896, 4)
(1896, 21)
(1576, 4)
(1576, 21)


### Example

#### arguments-*.tsv
```

Argument ID    A01005

Conclusion     We should ban fast food

Stance         in favor of

Premise        fast food should be banned because it is really bad for your health and is costly.
```

#### labels-*.tsv

```
Argument ID                A01005

Self-direction: thought    0
Self-direction: action     0
...
Universalism: objectivity  0
```

### Splits

The standard splits contain

   * **Train**: 5393 arguments
   * **Validation**: 1896 arguments
   * **Test**: 1576 arguments

### Annotations

In this assignment, you are tasked to address a multi-label classification problem.

You are going to consider **level 3** categories:

* Openness to change
* Self-enhancement
* Conservation
* Self-transcendence

**How to do that?**

You have to merge (**logical OR**) annotations of level 2 categories belonging to the same level 3 category.

**Pay attention to shared level 2 categories** (e.g., Hedonism). $\rightarrow$ [see Table 1 in the original paper.](https://aclanthology.org/2022.acl-long.306/)

#### Example

```
Self-direction: thought:    0
Self-direction: action:     1
Stimulation:                0
Hedonism:                   1

Openness to change          1
```
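
The merge in the example above can be sketched on a toy dataframe. This is a minimal illustration, assuming the level 2 column names shown in the `labels-*.tsv` example; the two rows are made up for demonstration.

```python
import pandas as pd

# Toy level 2 annotations (hypothetical rows); the first row mirrors the example above.
level2 = pd.DataFrame({
    "Self-direction: thought": [0, 0],
    "Self-direction: action": [1, 0],
    "Stimulation": [0, 0],
    "Hedonism": [1, 0],
})

# Level 2 columns grouped under "Openness to change". Hedonism is shared with
# Self-enhancement, so it would appear in that group as well.
openness_columns = [
    "Self-direction: thought",
    "Self-direction: action",
    "Stimulation",
    "Hedonism",
]

# Logical OR across the group: 1 if at least one level 2 value is annotated.
level3 = pd.DataFrame({
    "Openness to change": level2[openness_columns].any(axis=1).astype(int)
})
print(level3["Openness to change"].tolist())
```

The same pattern would apply to the other three level 3 categories, each with its own list of level 2 columns.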

### Instructions

* **Download** the specified training, validation, and test files.
* **Encode** split files into a pandas.DataFrame object.
* For each split, **merge** the arguments and labels dataframes into a single dataframe.
* **Merge** level 2 annotations to level 3 categories.

In [51]:
import numpy as np

oc_cols = ['thought', 'action', 'stimulation', 'hedonism']
st_cols = ['humility', 'caring', 'dependability', 'concern', 'nature', 'tolerance', 'objectivity']
se_cols = ['hedonism', 'achievement', 'dominance', 'resources', 'face']
cn_cols = ['humility', 'interpersonal', 'rules', 'tradition', 'societal', 'personal', 'face']

funct = lambda y,z: list(filter(lambda x : any([i.lower() in x.lower() for i in z]), y.columns))

third_level_cols = {'OC': oc_cols, 'ST': st_cols, 'SE': se_cols, 'CN': cn_cols}
third_level_train_dataframe = pd.DataFrame()
third_level_validation_dataframe = pd.DataFrame()
third_level_test_dataframe = pd.DataFrame()

third_level_train_dataframe['Argument ID'] = lab_train_dataframe['Argument ID']
third_level_validation_dataframe['Argument ID'] = lab_validation_dataframe['Argument ID']
third_level_test_dataframe['Argument ID'] = lab_test_dataframe['Argument ID']

for k,v in third_level_cols.items():
    train_dataframe_reduce = lab_train_dataframe[funct(lab_train_dataframe, v)].apply(np.sum, axis=1) > 0
    third_level_train_dataframe[k] = train_dataframe_reduce.astype(np.int32)
    
    validation_dataframe_reduce = lab_validation_dataframe[funct(lab_validation_dataframe, v)].apply(np.sum, axis=1) > 0
    third_level_validation_dataframe[k] = validation_dataframe_reduce.astype(np.int32)
    
    test_dataframe_reduce = lab_test_dataframe[funct(lab_test_dataframe, v)].apply(np.sum, axis=1) > 0
    third_level_test_dataframe[k] = test_dataframe_reduce.astype(np.int32)

training_set = pd.merge(train_dataframe, third_level_train_dataframe)
validation_set = pd.merge(validation_dataframe, third_level_validation_dataframe)
test_set = pd.merge(test_dataframe, third_level_test_dataframe)

training_set.drop(columns=['Argument ID'], inplace=True)
validation_set.drop(columns=['Argument ID'], inplace=True)
test_set.drop(columns=['Argument ID'], inplace=True)

In [52]:
from datasets import Dataset

train_data = Dataset.from_pandas(training_set)
val_data = Dataset.from_pandas(validation_set)
test_data = Dataset.from_pandas(test_set)

# [Task 2 - 2.0 points] Model definition

You are tasked to define several neural models for multi-label classification.

<center>
    <img src="images/model_schema.png" alt="model_schema" />
</center>

### Instructions

* **Baseline**: implement a random uniform classifier (an individual classifier per category).
* **Baseline**: implement a majority classifier (an individual classifier per category).

<br/>

* **BERT w/ C**: define a BERT-based classifier that receives an argument **conclusion** as input.
* **BERT w/ CP**: add argument **premise** as an additional input.
* **BERT w/ CPS**: add argument premise-to-conclusion **stance** as an additional input.

In [53]:
import random as rnd

#baseline: random uniform classifier 
def random_uniform_classifier(n_instances, cols_name):
    random_dataframe = pd.DataFrame(0, index=list(range(n_instances)), columns=cols_name)
    for i in range(n_instances):
        rnd_col = rnd.choice(cols_name)
        random_dataframe.iloc[i, list(cols_name).index(rnd_col)] = 1
    return random_dataframe

#baseline: majority classifier 
def majority_classifier(targets):
    n_instances = targets.shape[0]
    max_idx = np.argmax(targets.apply(np.sum, axis=0))
    majority_dataframe = pd.DataFrame(0, index=list(range(n_instances)), columns=targets.columns)
    for i in range(n_instances):
        majority_dataframe.iloc[i, max_idx] = 1
    return majority_dataframe

In [93]:
random_uniform_classifier(third_level_test_dataframe.shape[0], third_level_test_dataframe.drop(columns=['Argument ID']).columns)

Unnamed: 0,OC,ST,SE,CN
0,1,0,0,0
1,1,0,0,0
2,0,0,1,0
3,0,1,0,0
4,0,1,0,0
...,...,...,...,...
1571,0,0,0,1
1572,0,0,1,0
1573,0,1,0,0
1574,0,1,0,0


In [55]:
majority_classifier(third_level_test_dataframe.drop(columns=['Argument ID']))

Unnamed: 0,OC,ST,SE,CN
0,0,1,0,0
1,0,1,0,0
2,0,1,0,0
3,0,1,0,0
4,0,1,0,0
...,...,...,...,...
1571,0,1,0,0
1572,0,1,0,0
1573,0,1,0,0
1574,0,1,0,0
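
The two baselines above assign exactly one category per row. If the instruction "an individual classifier per category" is read as one independent binary decision per category, a variant could look like the sketch below. The function names, the toy targets, and the tie-breaking rule (ties predict 1) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def random_uniform_per_category(n_instances, columns, seed=0):
    """Draw 0 or 1 with equal probability, independently for every category."""
    rng = np.random.default_rng(seed)
    values = rng.integers(0, 2, size=(n_instances, len(columns)))
    return pd.DataFrame(values, columns=list(columns))

def majority_per_category(train_targets, n_instances):
    """Predict, for every category, its most frequent value in the given targets."""
    # Assumption: a category that is positive in exactly half of the rows predicts 1.
    majority = (train_targets.mean(axis=0) >= 0.5).astype(int)
    values = np.tile(majority.to_numpy(), (n_instances, 1))
    return pd.DataFrame(values, columns=train_targets.columns)

# Toy targets (hypothetical values) standing in for the OC/ST/SE/CN columns.
toy_targets = pd.DataFrame({
    "OC": [0, 0, 1],
    "ST": [1, 1, 0],
    "SE": [0, 1, 0],
    "CN": [1, 1, 1],
})
random_predictions = random_uniform_per_category(5, toy_targets.columns)
majority_predictions = majority_per_category(toy_targets, 5)
print(majority_predictions.iloc[0].tolist())
```

With this reading, the majority baseline can predict several categories at once, and the random baseline can predict any subset of them.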


### Notes

**Do not mix models**. Each model has its own instructions.

You are **free** to select the BERT-based model card from huggingface.

#### Examples

```
bert-base-uncased
prajjwal1/bert-tiny
distilbert-base-uncased
roberta-base
```

In [56]:
class BertModel:
    def __init__(self):
        pass
    
    def concatenate_inputs(self):
        pass
    
    def tokenize(self):
        pass
    
    def train(self):
        pass
    
    def validate(self):
        pass
    

### BERT w/ C

<center>
    <img src="images/bert_c.png" alt="BERT w/ C" />
</center>

In [86]:
from transformers import AutoTokenizer, AutoModelForMaskedLM, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(third_level_train_dataframe.columns[1:]), problem_type="multi_label_classification")

# BERT w/ C: only the conclusion is tokenized. Mapping the tokenizer over 'Premise'
# afterwards would overwrite the input_ids produced for the conclusion.
tokenize_conclusion = lambda x : tokenizer(x['Conclusion'], truncation=True)

train_data = train_data.map(tokenize_conclusion, batched=True)
val_data = val_data.map(tokenize_conclusion, batched=True)
test_data = test_data.map(tokenize_conclusion, batched=True)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [87]:
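def concatenate_inputs(input_list, stance=None):
    # Load the encoder once, so that every input is encoded by the same model instance.
    embedder = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
    embeddings = []
    for inp in input_list:
        embeddings.append(embedder(inp))
    # The stance, when given, is appended as an extra (already numerical) input.
    if stance is not None:
        embeddings.append(stance)
    return embeddings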

### BERT w/ CP

<center>
    <img src="images/bert_cp.png" alt="BERT w/ CP" />
</center>

### BERT w/ CPS

<center>
    <img src="images/bert_cps.png" alt="BERT w/ CPS" />
</center>

### Input concatenation

<center>
    <img src="images/input_merging.png" alt="Input merging" />
</center>

### Notes

The **stance** input has to be encoded into a numerical format.

You **should** use the same model instance to encode **premise** and **conclusion** inputs.
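
One way to read the two notes above is sketched below in PyTorch. It is not the reference solution: `ToyEncoder` is a hypothetical stand-in for a BERT encoder (so that the example runs without downloading weights), and `ConcatClassifier`, its argument names, and the choice to append the stance as a single 0/1 feature are assumptions.

```python
import torch
from torch import nn

class ToyEncoder(nn.Module):
    """Hypothetical stand-in for a BERT encoder: mean-pooled token embeddings."""
    def __init__(self, vocab_size=100, hidden_size=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids):
        # (batch, tokens) -> (batch, hidden_size)
        return self.embedding(input_ids).mean(dim=1)

class ConcatClassifier(nn.Module):
    """Encode conclusion and premise with one shared encoder, then append the stance."""
    def __init__(self, encoder, hidden_size, num_labels=4):
        super().__init__()
        self.encoder = encoder  # the same instance encodes both inputs
        self.classifier = nn.Linear(2 * hidden_size + 1, num_labels)

    def forward(self, conclusion_ids, premise_ids, stance):
        conclusion_vec = self.encoder(conclusion_ids)
        premise_vec = self.encoder(premise_ids)
        # stance is a 0/1 tensor of shape (batch,), appended as one extra feature
        stance_feature = stance.float().unsqueeze(-1)
        features = torch.cat([conclusion_vec, premise_vec, stance_feature], dim=-1)
        return self.classifier(features)  # raw logits, one per level 3 category

model_sketch = ConcatClassifier(ToyEncoder(hidden_size=16), hidden_size=16)
logits = model_sketch(
    torch.randint(0, 100, (2, 5)),  # conclusion token ids
    torch.randint(0, 100, (2, 8)),  # premise token ids
    torch.tensor([0, 1]),           # stance: against / in favor of
)
print(logits.shape)
```

In an actual model, the toy encoder would be replaced by the chosen BERT-based model card, with its pooled output playing the role of the mean-pooled vector.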

# [Task 3 - 0.5 points] Metrics

Before training the models, you are tasked to define the evaluation metrics for comparison.

### Instructions

* Evaluate your models using per-category binary F1-score.
* Compute the average binary F1-score over all categories (macro F1-score).

### Example

You start with individual predictions ($\rightarrow$ samples).

```
Openness to change:   0 0 1 0 1 1 0 ...
Self-enhancement:     1 0 0 0 1 0 1 ...
Conservation:         0 0 0 1 1 0 1 ...
Self-transcendence:   1 1 0 1 0 1 0 ...
```

You compute per-category binary F1-score.

```
Openness to change F1:   0.35
Self-enhancement F1:     0.55
Conservation F1:         0.80
Self-transcendence F1:   0.21
```

You then average per-category scores.
```
Average F1: ~0.48
```
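
The procedure above can be sketched with scikit-learn. The arrays below are toy values invented for illustration (rows are samples, columns follow the `OC`, `ST`, `SE`, `CN` order used in the code cells); they are not results from any model.

```python
import numpy as np
from sklearn.metrics import f1_score

categories = ["OC", "ST", "SE", "CN"]

# Toy ground truth and predictions (hypothetical values).
y_true = np.array([[1, 0, 1, 1],
                   [0, 1, 0, 1],
                   [1, 1, 0, 0],
                   [0, 0, 1, 1]])
y_pred = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 1],
                   [0, 1, 1, 0],
                   [0, 0, 1, 1]])

# average=None returns one binary F1-score per column (category).
per_category_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)
# The macro F1-score is the unweighted mean of the per-category scores.
macro_f1 = per_category_f1.mean()

for name, score in zip(categories, per_category_f1):
    print(f"{name} F1: {score:.2f}")
print(f"Average F1: {macro_f1:.2f}")
```

Passing `average='macro'` to `f1_score` gives the same average in a single call.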

In [88]:
from sklearn.metrics import accuracy_score, f1_score

def macro_f1_score(predictions, targets):
    # Binary F1 per category (column), then the unweighted mean over categories.
    cols = predictions.columns
    per_col_scores = np.zeros(len(cols))
    for i,c in enumerate(cols):
        per_col_scores[i] = f1_score(targets[c], predictions[c])
    return per_col_scores.mean()

def compute_metrics(output_info):
    # Assumes labels arrive as one (n_samples, n_categories) array.
    logits, labels = output_info
    # Multi-label setting: apply a sigmoid to every logit and threshold it,
    # instead of taking an argmax over categories. The 0.5 threshold is an assumption.
    predictions = (1 / (1 + np.exp(-logits)) > 0.5).astype(int)
    labels = np.asarray(labels).astype(int)
    
    f1 = f1_score(labels, predictions, average='macro', zero_division=0)
    acc = accuracy_score(labels, predictions)  # exact-match (subset) accuracy
    return {'f1': f1, 'accuracy': acc}

# [Task 4 - 1.0 points] Training and Evaluation

You are now tasked to train and evaluate **all** defined models.

### Instructions

* Train **all** models on the train set.
* Evaluate **all** models on the validation set.
* Pick **at least** three seeds for robust estimation.
* Compute metrics on the validation set.
* Report **per-category** and **macro** F1-score for comparison.
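
The multi-seed protocol above could be organised as in the sketch below. `train_and_evaluate` is a hypothetical placeholder: it returns random numbers in place of real validation F1-scores so that the aggregation logic can run on its own, and the seed values are arbitrary.

```python
import random
import numpy as np

CATEGORIES = ["OC", "ST", "SE", "CN"]

def train_and_evaluate(seed):
    """Hypothetical placeholder: train one model with `seed`, return validation F1 per category."""
    random.seed(seed)
    np.random.seed(seed)
    # A real implementation would also seed the deep learning library and train here.
    rng = np.random.default_rng(seed)
    # Placeholder values only, NOT real results.
    return {name: float(rng.uniform(0.4, 0.8)) for name in CATEGORIES}

seeds = [13, 42, 1337]  # arbitrary example seeds
runs = [train_and_evaluate(seed) for seed in seeds]

# Aggregate mean and standard deviation over seeds, per category and for macro F1.
summary = {}
for name in CATEGORIES:
    scores = np.array([run[name] for run in runs])
    summary[name] = (scores.mean(), scores.std())
macro_per_seed = np.array([np.mean(list(run.values())) for run in runs])
summary["macro"] = (macro_per_seed.mean(), macro_per_seed.std())

for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Reporting the standard deviation next to the mean makes it easier to judge whether a difference between two models exceeds seed-to-seed variation.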

In [89]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [90]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="test_dir",                 # where to save model
    learning_rate=2e-5,                   
    per_device_train_batch_size=8,         # accelerate defines distributed training
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",           # when to report evaluation metrics/losses
    save_strategy="epoch",                 # when to save checkpoint
    load_best_model_at_end=True,
    report_to='none',                       # disabling wandb (default)
    label_names=['OC','ST','SE','CN']
)

In [91]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [92]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


TypeError: forward() got an unexpected keyword argument 'OC'
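
The `TypeError` above is consistent with the four category columns being forwarded to the model as separate keyword arguments, while `BertForSequenceClassification.forward` accepts a single `labels` argument. One possible remedy, sketched below under that assumption, is to pack the four columns into one float vector named `labels` before building the `Dataset`; with `problem_type="multi_label_classification"` the loss expects float targets. The helper name `pack_labels` and the toy row are illustrative only.

```python
import pandas as pd

label_columns = ["OC", "ST", "SE", "CN"]

def pack_labels(frame, label_columns):
    """Replace the per-category columns with one float vector column named 'labels'."""
    packed = frame.drop(columns=label_columns).copy()
    packed["labels"] = frame[label_columns].astype("float32").values.tolist()
    return packed

# Toy frame (hypothetical row) mirroring the layout of the merged splits.
toy_split = pd.DataFrame({
    "Conclusion": ["We should ban fast food"],
    "Stance": [1],
    "Premise": ["fast food should be banned because it is really bad for your health and is costly."],
    "OC": [0], "ST": [1], "SE": [0], "CN": [1],
})
packed_split = pack_labels(toy_split, label_columns)
print(packed_split["labels"].iloc[0])
```

With the labels packed this way, `label_names` in `TrainingArguments` would presumably be left at its default or set to `['labels']`.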

# [Task 5 - 1.0 points] Error Analysis

You are tasked to discuss your results.

### Instructions

* **Compare** classification performance of BERT-based models with respect to baselines.
* Discuss **difference in prediction** between the best performing BERT-based model and its variants.

### Notes

You can check the [original paper](https://aclanthology.org/2022.acl-long.306/) for suggestions on how to perform comparisons (e.g., plots, tables, etc...).

# [Task 6 - 1.0 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.

# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...
* You can upload **model weights** in a cloud repository and report the link in the report.

# FAQ

Please check these frequently asked questions before contacting us.

### Model card

You are **free** to choose the BERT-based model card you like from huggingface.

### Model architecture

You **should not** change the architecture of a model (i.e., its layers).

However, you are **free** to play with their hyper-parameters.

### Model Training

You are **free** to choose training hyper-parameters for BERT-based models (e.g., number of epochs, etc...).

### Neural Libraries

You are **free** to use any library of your choice to address the assignment (e.g., Keras, Tensorflow, PyTorch, JAX, etc...)

### Error Analysis

Some topics for discussion include:
   * Model performance on most/least frequent classes.
   * Precision/Recall curves.
   * Confusion matrices.
   * Specific misclassified samples.
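
Two of the topics above, confusion matrices and misclassified samples, can be explored with a few lines of scikit-learn and NumPy. The sketch below uses toy arrays invented for illustration (columns follow the `OC`, `ST`, `SE`, `CN` order); it is only one possible starting point.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

categories = ["OC", "ST", "SE", "CN"]

# Toy ground truth and predictions (hypothetical values).
y_true = np.array([[1, 0, 1, 1],
                   [0, 1, 0, 1],
                   [1, 1, 0, 0],
                   [0, 0, 1, 1]])
y_pred = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 1],
                   [0, 1, 1, 0],
                   [0, 0, 1, 1]])

# One 2x2 matrix per category, laid out as [[tn, fp], [fn, tp]].
matrices = multilabel_confusion_matrix(y_true, y_pred)
for name, matrix in zip(categories, matrices):
    tn, fp, fn, tp = matrix.ravel()
    print(f"{name}: tn={tn} fp={fp} fn={fn} tp={tp}")

# Indices of samples with at least one wrongly predicted category.
misclassified = np.where((y_true != y_pred).any(axis=1))[0]
print(misclassified.tolist())
```

The indices in `misclassified` can then be used to look up the corresponding premises and conclusions for a qualitative inspection.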

# The End