# Assignment 2

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: Human Value Detection, Multi-label classification, Transformers, BERT


# Contact

For any doubts, questions, or issues, you can contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# Introduction

You are tasked to address the [Human Value Detection challenge](https://aclanthology.org/2022.acl-long.306/).

## Problem definition

Arguments are paired with their conveyed human values.

Arguments are in the form of **premise** $\rightarrow$ **conclusion**.

### Example:

**Premise**: *"fast food should be banned because it is really bad for your health and is costly"*

**Conclusion**: *"We should ban fast food"*

**Stance**: *in favor of*

<center>
    <img src="https://github.com/LorenzoScaioli/NLP_multi-label-text-classification-with-transformers/blob/main/images/human_values.png?raw=1" alt="human values" />
</center>

# [Task 1 - 0.5 points] Corpus

Check the official page of the challenge [here](https://touche.webis.de/semeval23/touche23-web/).

The challenge offers several corpora for evaluation and testing.

You are going to work with the standard training, validation, and test splits.

#### Arguments
* arguments-training.tsv
* arguments-validation.tsv
* arguments-test.tsv

#### Human values
* labels-training.tsv
* labels-validation.tsv
* labels-test.tsv

### Example

#### arguments-*.tsv
```

Argument ID    A01005

Conclusion     We should ban fast food

Stance         in favor of

Premise        fast food should be banned because it is really bad for your health and is costly.
```

#### labels-*.tsv

```
Argument ID                A01005

Self-direction: thought    0
Self-direction: action     0
...
Universalism: objectivity  0
```

### Splits

The standard splits contain

   * **Train**: 5393 arguments
   * **Validation**: 1896 arguments
   * **Test**: 1576 arguments

### Annotations

In this assignment, you are tasked to address a multi-label classification problem.

You are going to consider **level 3** categories:

* Openness to change
* Self-enhancement
* Conservation
* Self-transcendence

**How to do that?**

You have to merge (**logical OR**) annotations of level 2 categories belonging to the same level 3 category.

**Pay attention to shared level 2 categories** (e.g., Hedonism). $\rightarrow$ [see Table 1 in the original paper.](https://aclanthology.org/2022.acl-long.306/)

#### Example

```
Self-direction: thought:    0
Self-direction: action:     1
Stimulation:                0
Hedonism:                   1

Openness to change          1
```
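
The merge rule can be sketched without any library, with a plain dictionary standing in for one DataFrame row. This is a minimal illustration under stated assumptions: the mapping covers only two of the four level 3 categories, the level 2 names `Achievement`, `Power: dominance`, `Power: resources`, and `Face` are assumed to match the column names in the label files, and `LEVEL_3_TO_LEVEL_2` and `merge_to_level_3` are hypothetical names. Note that `Hedonism` is listed under both categories because it is shared.

```python
# Illustrative mapping (assumption): only two of the four level 3 categories.
# Hedonism is shared, so it is listed under both.
LEVEL_3_TO_LEVEL_2 = {
    "Openness to change": [
        "Self-direction: thought",
        "Self-direction: action",
        "Stimulation",
        "Hedonism",
    ],
    "Self-enhancement": [
        "Hedonism",
        "Achievement",
        "Power: dominance",
        "Power: resources",
        "Face",
    ],
}


def merge_to_level_3(row: dict) -> dict:
    """Logical OR of the level 2 flags that belong to each level 3 category."""
    return {
        level_3: int(any(row[name] for name in level_2_names))
        for level_3, level_2_names in LEVEL_3_TO_LEVEL_2.items()
    }


# Toy annotation row (made-up values): only Hedonism is active.
row = {
    "Self-direction: thought": 0,
    "Self-direction: action": 0,
    "Stimulation": 0,
    "Hedonism": 1,
    "Achievement": 0,
    "Power: dominance": 0,
    "Power: resources": 0,
    "Face": 0,
}
merged = merge_to_level_3(row)
print(merged)
```

Because the shared category is active, both level 3 labels are switched on for this row; a row with all level 2 flags at 0 yields 0 for both.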

### Instructions

* **Download** the specified training, validation, and test files.
* **Encode** split files into a pandas.DataFrame object.
* For each split, **merge** the arguments and labels dataframes into a single dataframe.
* **Merge** level 2 annotations to level 3 categories.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# system packages
from pathlib import Path
import shutil
import urllib.request
import tarfile
import sys

# data and numerical management packages
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.dummy import DummyClassifier

# useful during debugging (progress bars)
from tqdm import tqdm

# random seed
import random

In [3]:
!pip install torch==1.13.0+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
!pip install transformers==4.30.0
!pip install datasets==2.13.2
!pip install accelerate -U
!pip install evaluate
!pip install tensordict

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu116
Collecting torch==1.13.0+cu116
  Downloading https://download.pytorch.org/whl/cu116/torch-1.13.0%2Bcu116-cp310-cp310-linux_x86_64.whl (1983.0 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 2.1.0+cu121
    Uninstalling torch-2.1.0+cu121:
      Successfully uninstalled torch-2.1.0+cu121
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.1.0+cu121 requires torch==2.1.0, but you have torch 1.13.0+cu116 which is incompatible.
torchdata 0.7.0 requires torch==2.1.0, but you have torch 1.13.0+cu116 which is incompatible.
torchtext 0.16.0 requires torch==2.1.0, but you have torch 1.13.0+cu116 

In [4]:
import torch
torch.cuda.is_available()

True

In [5]:
def fix_random(seed: int) -> np.random.Generator:
    random.seed(seed)
    np.random.seed(seed)
    generator = np.random.default_rng(seed)

    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    return generator

seed = np.random.randint(2**25, 2**26)
print(f"Seed: {seed}")
rand_gen = fix_random(seed=seed)

Seed: 49179310


#### Download

In [6]:
class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)

def download_url(download_path: Path, url: str):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=download_path, reporthook=t.update_to)


def download_dataset(download_path: Path, url: str):
    print("Downloading dataset...")
    download_url(url=url, download_path=download_path)
    print("Download complete!")

def clean_download(url, name, folder):
    print(f"Current work directory: {Path.cwd()}")
    dataset_folder = Path.cwd().joinpath("Datasets").joinpath(folder)

    if not dataset_folder.exists():
        dataset_folder.mkdir(parents=True)

    dataset_path = dataset_folder.joinpath(name)

    if not dataset_path.exists():
        download_dataset(dataset_path, url)

In [7]:
arguments_split = ["arguments-training.tsv", "arguments-validation.tsv", "arguments-test.tsv"]
labels_split = ["labels-training.tsv", "labels-validation.tsv", "labels-test.tsv"]

for argument in arguments_split:
    url = "https://zenodo.org/records/8248658/files/" + argument + "?download=1"
    clean_download(url, argument, "Arguments")

for label in labels_split:
    url = "https://zenodo.org/records/8248658/files/" + label + "?download=1"
    clean_download(url, label, "Labels")


Current work directory: /content
Downloading dataset...


arguments-training.tsv?download=1: 1.02MB [00:01, 604kB/s]                           


Download complete!
Current work directory: /content
Downloading dataset...


arguments-validation.tsv?download=1: 369kB [00:01, 188kB/s]                           


Download complete!
Current work directory: /content
Downloading dataset...


arguments-test.tsv?download=1: 295kB [00:01, 197kB/s]                           


Download complete!
Current work directory: /content
Downloading dataset...


labels-training.tsv?download=1: 254kB [00:01, 182kB/s]                           


Download complete!
Current work directory: /content
Downloading dataset...


labels-validation.tsv?download=1: 90.1kB [00:01, 65.8kB/s]                            


Download complete!
Current work directory: /content
Downloading dataset...


labels-test.tsv?download=1: 81.9kB [00:04, 16.6kB/s]                            

Download complete!





#### Encode

In [8]:
# Read the arguments split files into DataFrames
arguments_train_df = pd.read_table('Datasets/Arguments/arguments-training.tsv', sep='\t')
arguments_val_df = pd.read_table('Datasets/Arguments/arguments-validation.tsv', sep='\t')
arguments_test_df = pd.read_table('Datasets/Arguments/arguments-test.tsv', sep='\t')

# Read the labels split files into DataFrames
labels_train_df = pd.read_table('Datasets/Labels/labels-training.tsv', sep='\t')
labels_val_df = pd.read_table('Datasets/Labels/labels-validation.tsv', sep='\t')
labels_test_df = pd.read_table('Datasets/Labels/labels-test.tsv', sep='\t')


#### Merge arguments and labels

In [9]:
train_merge_df = pd.merge(arguments_train_df, labels_train_df, on='Argument ID')
val_merge_df = pd.merge(arguments_val_df, labels_val_df, on='Argument ID')
test_merge_df = pd.merge(arguments_test_df, labels_test_df, on='Argument ID')


#### Merge level 2 annotations to level 3 categories

* Openness to change
* Self-enhancement
* Conservation
* Self-transcendence

In [10]:
level_2_categories = train_merge_df.columns[4:]

In [11]:
print(type(level_2_categories))

<class 'pandas.core.indexes.base.Index'>


In [12]:
level_2_to_Openness_to_change = level_2_categories[:4]
level_2_to_Self_enhancement = level_2_categories[3:8]
level_2_to_Conservation = level_2_categories[7:14]
level_2_to_Self_transcendence = level_2_categories[13:]

In [13]:
def merge_categories(df):
    # logical OR over the level 2 columns of each level 3 category;
    # the row-wise any() does not depend on the index being 0..n-1
    df['Openness to change'] = df[level_2_to_Openness_to_change].any(axis=1).astype(int)
    df['Self-enhancement'] = df[level_2_to_Self_enhancement].any(axis=1).astype(int)
    df['Conservation'] = df[level_2_to_Conservation].any(axis=1).astype(int)
    df['Self-transcendence'] = df[level_2_to_Self_transcendence].any(axis=1).astype(int)
    # keep only the merged level 3 labels
    return df.drop(level_2_categories, axis=1)

In [14]:
final_train_df = merge_categories(train_merge_df)
final_val_df = merge_categories(val_merge_df)
final_test_df = merge_categories(test_merge_df)

In [15]:
final_train_df.head()

Unnamed: 0,Argument ID,Conclusion,Stance,Premise,Openness to change,Self-enhancement,Conservation,Self-transcendence
0,A01002,We should ban human cloning,in favor of,we should ban human cloning as it will only ca...,0,0,1,0
1,A01005,We should ban fast food,in favor of,fast food should be banned because it is reall...,0,0,1,0
2,A01006,We should end the use of economic sanctions,against,sometimes economic sanctions are the only thin...,0,1,1,0
3,A01007,We should abolish capital punishment,against,capital punishment is sometimes the only optio...,0,0,1,1
4,A01008,We should ban factory farming,against,factory farming allows for the production of c...,0,0,1,1


In [16]:
# print the dimension of the three splits (train, val, test)
print("Train shape:", final_train_df.shape)
print("Val shape:", final_val_df.shape)
print("Test shape:", final_test_df.shape)

# print the length (in characters) of the longest premise and the longest conclusion in the three splits (train, val, test)
print()
print("Train max premise length:", final_train_df['Premise'].str.len().max())
print("Train max conclusion length:", final_train_df['Conclusion'].str.len().max())
print("Val max premise length:", final_val_df['Premise'].str.len().max())
print("Val max conclusion length:", final_val_df['Conclusion'].str.len().max())
print("Test max premise length:", final_test_df['Premise'].str.len().max())
print("Test max conclusion length:", final_test_df['Conclusion'].str.len().max())

# print the length (in characters) of the longest premise + conclusion concatenation in the three splits (train, val, test)
print()
print("Train max union length:", (final_train_df['Premise'] + final_train_df['Conclusion']).str.len().max())
print("Val max union length:", (final_val_df['Premise'] + final_val_df['Conclusion']).str.len().max())
print("Test max union length:", (final_test_df['Premise'] + final_test_df['Conclusion']).str.len().max())

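# count the premise + conclusion concatenations longer than 508 characters in the three splits (train, val, test)
# note: the threshold in the code is 508 although the printed label says 512, and these are character counts, not token counts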
print()
print("Train # unions longer than 512:", len((final_train_df['Premise'] + final_train_df['Conclusion']).loc[(final_train_df['Premise'] + final_train_df['Conclusion']).str.len() > 508]))
print("Val # unions longer than 512:", len((final_val_df['Premise'] + final_val_df['Conclusion']).loc[(final_val_df['Premise'] + final_val_df['Conclusion']).str.len() > 508]))
print("Test # unions longer than 512:", len((final_test_df['Premise'] + final_test_df['Conclusion']).loc[(final_test_df['Premise'] + final_test_df['Conclusion']).str.len() > 508]))

Train shape: (5393, 8)
Val shape: (1896, 8)
Test shape: (1576, 8)

Train max premise length: 792
Train max conclusion length: 190
Val max premise length: 825
Val max conclusion length: 184
Test max premise length: 822
Test max conclusion length: 157

Train max union length: 844
Val max union length: 857
Test max union length: 857

Train # unions longer than 512: 77
Val # unions longer than 512: 25
Test # unions longer than 512: 23


# [Task 2 - 2.0 points] Model definition

You are tasked to define several neural models for multi-label classification.

<center>
    <img src="https://github.com/LorenzoScaioli/NLP_multi-label-text-classification-with-transformers/blob/main/images/model_schema.png?raw=1" alt="model_schema" />
</center>

### Instructions

* **Baseline**: implement a random uniform classifier (an individual classifier per category).
* **Baseline**: implement a majority classifier (an individual classifier per category).

<br/>

* **BERT w/ C**: define a BERT-based classifier that receives an argument **conclusion** as input.
* **BERT w/ CP**: add argument **premise** as an additional input.
* **BERT w/ CPS**: add argument premise-to-conclusion **stance** as an additional input.
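
The two baselines above can be sketched in plain Python, one independent classifier per category. This is only an illustration of the idea, not the required implementation: `fit_majority`, `predict_majority`, and `predict_uniform` are hypothetical helper names, and the training labels are made-up toy values.

```python
import random
from collections import Counter


def fit_majority(train_labels: list[int]) -> int:
    """Return the most frequent label seen in one training column."""
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(majority_label: int, n_samples: int) -> list[int]:
    """Predict the same (majority) label for every sample."""
    return [majority_label] * n_samples


def predict_uniform(n_samples: int, rng: random.Random) -> list[int]:
    """Draw 0 or 1 with equal probability, ignoring the inputs entirely."""
    return [rng.randint(0, 1) for _ in range(n_samples)]


# Toy training labels for two categories (made-up values).
train_labels = {
    "Conservation": [1, 1, 0, 1],
    "Self-enhancement": [0, 0, 1, 0],
}

rng = random.Random(0)  # fixed seed so the random baseline is reproducible
majority_preds = {}
uniform_preds = {}
for category, column in train_labels.items():
    majority_preds[category] = predict_majority(fit_majority(column), n_samples=3)
    uniform_preds[category] = predict_uniform(n_samples=3, rng=rng)

print(majority_preds)
```

Both baselines ignore the text, so they give a floor against which the BERT-based models can be compared.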

### Notes

**Do not mix models**. Each model has its own instructions.

You are **free** to select the BERT-based model card from huggingface.

#### Examples

```
bert-base-uncased
prajjwal1/bert-tiny
distilbert-base-uncased
roberta-base
```

### BERT w/ C

<center>
    <img src="https://github.com/LorenzoScaioli/NLP_multi-label-text-classification-with-transformers/blob/main/images/bert_c.png?raw=1" alt="BERT w/ C" />
</center>

### BERT w/ CP

<center>
    <img src="https://github.com/LorenzoScaioli/NLP_multi-label-text-classification-with-transformers/blob/main/images/bert_cp.png?raw=1" alt="BERT w/ CP" />
</center>

### BERT w/ CPS

<center>
    <img src="https://github.com/LorenzoScaioli/NLP_multi-label-text-classification-with-transformers/blob/main/images/bert_cps.png?raw=1" alt="BERT w/ CPS" />
</center>

### Input concatenation

<center>
    <img src="https://github.com/LorenzoScaioli/NLP_multi-label-text-classification-with-transformers/blob/main/images/input_merging.png?raw=1" alt="Input merging" />
</center>

### Notes

The **stance** input has to be encoded into a numerical format.

You **should** use the same model instance to encode **premise** and **conclusion** inputs.
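
For the stance input, one possible numerical encoding is an integer code per distinct stance string. The sketch below assumes the two stance values that appear in the corpus examples (`in favor of`, `against`); `fit_stance_mapping` and `encode_stances` are hypothetical names. The mapping is built on the training split only and then reused, so every split shares the same codes.

```python
def fit_stance_mapping(train_stances: list[str]) -> dict[str, int]:
    """Assign integer codes to the sorted unique stance strings."""
    return {stance: code for code, stance in enumerate(sorted(set(train_stances)))}


def encode_stances(stances: list[str], mapping: dict[str, int]) -> list[int]:
    # A KeyError here signals a stance value never seen in the training split.
    return [mapping[stance] for stance in stances]


# Toy splits (made-up rows using the stance strings from the corpus).
train_stances = ["in favor of", "against", "against", "in favor of"]
val_stances = ["against", "in favor of"]

stance_to_id = fit_stance_mapping(train_stances)
encoded_val = encode_stances(val_stances, stance_to_id)
print(stance_to_id, encoded_val)
```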

#### Text encoding

Transformers rely on sub-word level tokenization: BERT uses WordPiece, while other models use tools such as the [SentencePiece tokenizer](https://github.com/google/sentencepiece).

In particular, the `transformers` library offers the `AutoTokenizer` class to quickly retrieve our chosen transformer's ad-hoc tokenizer.

The `model_card` variable defines the *path* where to look for our pre-trained model.

You can browse [huggingface's model hub](https://huggingface.co/models) to pick the model card according to your preference.
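
To give an intuition for sub-word tokenization, here is a simplified sketch of the greedy longest-match-first strategy used by WordPiece-style tokenizers such as BERT's. It is an illustration only: `wordpiece_tokenize` is a hypothetical helper, the vocabulary is a made-up toy set, and a real tokenizer also handles normalization, punctuation, and special tokens.

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk_token: str = "[UNK]") -> list[str]:
    """Split one lowercase word into sub-word pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current_piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                current_piece = candidate
                break
            end -= 1
        if current_piece is None:
            return [unk_token]  # no piece matches: the whole word becomes unknown
        pieces.append(current_piece)
        start = end
    return pieces


# Toy vocabulary (assumption): a real BERT vocabulary has roughly 30k entries.
toy_vocab = {"clon", "##ing", "ban", "fast", "food", "##s"}
tokens = [wordpiece_tokenize(w, toy_vocab) for w in "ban cloning".split()]
print(tokens)
```

A word that is in the vocabulary stays whole, a word that is not is split into a head piece plus `##` continuation pieces, and a word with no matching piece falls back to the unknown token.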

In [17]:
from transformers import AutoTokenizer

model_card = 'prajjwal1/bert-tiny'

tokenizer = AutoTokenizer.from_pretrained(model_card)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/285 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

In [18]:
labels = ['Openness to change', 'Self-enhancement', 'Conservation', 'Self-transcendence']
num_labels = len(labels)
id2label = {i:label for i, label in enumerate(labels)}
label2id = {label:i for i, label in enumerate(labels)}
labels

['Openness to change',
 'Self-enhancement',
 'Conservation',
 'Self-transcendence']

Encoding Stance in numerical format

In [19]:
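le = LabelEncoder()
# fit once on the training split so that all splits share the same mapping
le.fit(final_train_df['Stance'])

def encode_stance(df):
    return le.transform(df['Stance'])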

final_train_df['Stance'] = encode_stance(final_train_df)
final_val_df['Stance'] = encode_stance(final_val_df)
final_test_df['Stance'] = encode_stance(final_test_df)

In [20]:
final_train_df

Unnamed: 0,Argument ID,Conclusion,Stance,Premise,Openness to change,Self-enhancement,Conservation,Self-transcendence
0,A01002,We should ban human cloning,1,we should ban human cloning as it will only ca...,0,0,1,0
1,A01005,We should ban fast food,1,fast food should be banned because it is reall...,0,0,1,0
2,A01006,We should end the use of economic sanctions,0,sometimes economic sanctions are the only thin...,0,1,1,0
3,A01007,We should abolish capital punishment,0,capital punishment is sometimes the only optio...,0,0,1,1
4,A01008,We should ban factory farming,0,factory farming allows for the production of c...,0,0,1,1
...,...,...,...,...,...,...,...,...
5388,E08016,The EU should integrate the armed forces of it...,1,"On the one hand, we have Russia killing countl...",0,1,1,1
5389,E08017,Food whose production has been subsidized with...,1,The subsidies were originally intended to ensu...,0,0,1,1
5390,E08018,Food whose production has been subsidized with...,1,These products come mainly from large enterpri...,0,0,0,1
5391,E08019,Food whose production has been subsidized with...,1,Subsidies often make farmers in recipient coun...,0,0,1,1


In [21]:
from datasets import Dataset

train_dataset = Dataset.from_pandas(final_train_df)
val_dataset = Dataset.from_pandas(final_val_df)
test_dataset = Dataset.from_pandas(final_test_df)

In [22]:
train_dataset.features.keys()

dict_keys(['Argument ID', 'Conclusion', 'Stance', 'Premise', 'Openness to change', 'Self-enhancement', 'Conservation', 'Self-transcendence', '__index_level_0__'])

In [23]:
def preprocess_data_conclusion(dataset):
  # take a batch of texts
  text = dataset["Conclusion"]
  # encode them
  encoding = tokenizer(text, padding="max_length", truncation=True, max_length=512)
  # add labels
  labels_batch = {k: dataset[k] for k in dataset.keys() if k in labels}
  # create numpy array of shape (batch_size, num_labels)
  labels_matrix = np.zeros((len(text), len(labels)))
  # fill numpy array
  for idx, label in enumerate(labels):
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()

  return encoding



def preprocess_data_conclusion_premise(dataset):
  # take a batch of texts
  text1 = dataset["Conclusion"]
  text2 = dataset["Premise"]
  # encode them
  encoding = tokenizer(text1, text2, padding="max_length", truncation=True, max_length=512)
  # add labels
  labels_batch = {k: dataset[k] for k in dataset.keys() if k in labels}
  # create numpy array of shape (batch_size, num_labels)
  labels_matrix = np.zeros((len(text1), len(labels)))
  # fill numpy array
  for idx, label in enumerate(labels):
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()

  return encoding



def preprocess_data_conclusion_premise_stance(dataset):
  # take a batch of texts
  text1 = dataset["Conclusion"]
  text2 = dataset["Premise"]
  text3 = list(map(str, dataset["Stance"]))
  text = []
  for i, t in enumerate(text1):
    text.append(t + '[SEP]' + text3[i])
  # encode them
  encoding = tokenizer(text, text2, padding="max_length", truncation=True, max_length=512)
  # add labels
  labels_batch = {k: dataset[k] for k in dataset.keys() if k in labels}
  # create numpy array of shape (batch_size, num_labels)
  labels_matrix = np.zeros((len(text1), len(labels)))
  # fill numpy array
  for idx, label in enumerate(labels):
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()

  return encoding

In [24]:
encoded_c_train_dataset = train_dataset.map(preprocess_data_conclusion, batched=True, remove_columns=train_dataset.column_names)
encoded_c_val_dataset = val_dataset.map(preprocess_data_conclusion, batched=True, remove_columns=val_dataset.column_names)
encoded_c_test_dataset = test_dataset.map(preprocess_data_conclusion, batched=True, remove_columns=test_dataset.column_names)

Map:   0%|          | 0/5393 [00:00<?, ? examples/s]

Map:   0%|          | 0/1896 [00:00<?, ? examples/s]

Map:   0%|          | 0/1576 [00:00<?, ? examples/s]

In [25]:
encoded_cp_train_dataset = train_dataset.map(preprocess_data_conclusion_premise, batched=True, remove_columns=train_dataset.column_names)
encoded_cp_val_dataset = val_dataset.map(preprocess_data_conclusion_premise, batched=True, remove_columns=val_dataset.column_names)
encoded_cp_test_dataset = test_dataset.map(preprocess_data_conclusion_premise, batched=True, remove_columns=test_dataset.column_names)

Map:   0%|          | 0/5393 [00:00<?, ? examples/s]

Map:   0%|          | 0/1896 [00:00<?, ? examples/s]

Map:   0%|          | 0/1576 [00:00<?, ? examples/s]

In [26]:
encoded_cps_train_dataset = train_dataset.map(preprocess_data_conclusion_premise_stance, batched=True, remove_columns=train_dataset.column_names)
encoded_cps_val_dataset = val_dataset.map(preprocess_data_conclusion_premise_stance, batched=True, remove_columns=val_dataset.column_names)
encoded_cps_test_dataset = test_dataset.map(preprocess_data_conclusion_premise_stance, batched=True, remove_columns=test_dataset.column_names)

Map:   0%|          | 0/5393 [00:00<?, ? examples/s]

Map:   0%|          | 0/1896 [00:00<?, ? examples/s]

Map:   0%|          | 0/1576 [00:00<?, ? examples/s]

### Test to check if everything works

In [27]:
example = encoded_cps_train_dataset[0]
print(example.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])


In [28]:
print(type(example['input_ids']))

<class 'list'>


In [29]:
tokenizer.decode(example['input_ids'])


'[CLS] we should ban human cloning [SEP] 1 [SEP] we should ban human cloning as it will only cause huge issues when you have a bunch of the same humans running around all acting the same. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [P

In [30]:
example['labels']


[0.0, 0.0, 1.0, 0.0]

In [31]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]

['Conservation']

### Others

Finally, we set the format of our data to PyTorch tensors. This will turn the training, validation and test sets into standard PyTorch datasets.

In [32]:
encoded_c_train_dataset.set_format("torch")
encoded_c_val_dataset.set_format("torch")
encoded_c_test_dataset.set_format("torch")

encoded_cp_train_dataset.set_format("torch")
encoded_cp_val_dataset.set_format("torch")
encoded_cp_test_dataset.set_format("torch")

encoded_cps_train_dataset.set_format("torch")
encoded_cps_val_dataset.set_format("torch")
encoded_cps_test_dataset.set_format("torch")

In [33]:
from tensordict import TensorDict

### Implementing a random uniform classifier for each category

In [34]:
otc_uniform_classifier = DummyClassifier(strategy="uniform")
se_uniform_classifier = DummyClassifier(strategy="uniform")
cons_uniform_classifier = DummyClassifier(strategy="uniform")
st_uniform_classifier = DummyClassifier(strategy="uniform")

### Implementing a majority classifier for each category

In [35]:
otc_majority_classifier = DummyClassifier(strategy="prior")
se_majority_classifier = DummyClassifier(strategy="prior")
cons_majority_classifier = DummyClassifier(strategy="prior")
st_majority_classifier = DummyClassifier(strategy="prior")

### Defining a BERT-based classifier that receives an argument **conclusion** as input.

We first need to format input data to be fed as mini-batches in a training/evaluation procedure.<br>
https://discuss.huggingface.co/t/whats-the-input-of-bert/14932
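
Conceptually, a padding collator brings every sequence in a mini-batch to the same length and marks the padded positions in the attention mask. The sketch below is a plain-Python illustration of that idea, not the library's implementation: `pad_batch` is a hypothetical helper, and the token ids are placeholders (apart from 0 being used as the padding id, which is an assumption about the chosen vocabulary).

```python
def pad_batch(features: list[dict], pad_token_id: int = 0) -> dict:
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding that attention should ignore.
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
    return batch


# Two toy examples of different lengths (placeholder token ids).
features = [
    {"input_ids": [101, 2057, 102]},
    {"input_ids": [101, 2057, 2323, 7221, 102]},
]
batch = pad_batch(features)
print(batch)
```

In this notebook the inputs are already padded to `max_length=512` at tokenization time, so all sequences reaching the collator have the same length.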

In [36]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [37]:
from transformers import AutoModelForSequenceClassification

def create_models():
    # build three independent classifiers (C, CP, CPS) from the same model card
    return [
        AutoModelForSequenceClassification.from_pretrained(
            model_card,
            problem_type="multi_label_classification",
            num_labels=num_labels,
            id2label=id2label,
            label2id=label2id,
        )
        for _ in range(3)
    ]

# [Task 3 - 0.5 points] Metrics

Before training the models, you are tasked to define the evaluation metrics for comparison.

### Instructions

* Evaluate your models using per-category binary F1-score.
* Compute the average binary F1-score over all categories (macro F1-score).

### Example

You start with individual predictions ($\rightarrow$ samples).

```
Openness to change:   0 0 1 0 1 1 0 ...
Self-enhancement:     1 0 0 0 1 0 1 ...
Conservation:         0 0 0 1 1 0 1 ...
Self-transcendence:   1 1 0 1 0 1 0 ...
```

You compute per-category binary F1-score.

```
Openness to change F1:   0.35
Self-enhancement F1:     0.55
Conservation F1:         0.80
Self-transcendence F1:   0.21
```

You then average per-category scores.
```
Average F1: ~0.48
```
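
The computation above can be sketched in plain Python. This is a minimal illustration with made-up labels and predictions for two categories; `binary_f1` and `macro_f1` are hypothetical helper names, and returning 0 when a category has no positives at all is a convention assumed here.

```python
def binary_f1(y_true: list[int], y_pred: list[int]) -> float:
    """F1 of the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    # Convention (assumption): F1 is 0 when there are no positives at all.
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true_by_cat: dict, y_pred_by_cat: dict) -> tuple[dict, float]:
    """Per-category binary F1 plus their unweighted average."""
    per_category = {
        cat: binary_f1(y_true_by_cat[cat], y_pred_by_cat[cat])
        for cat in y_true_by_cat
    }
    return per_category, sum(per_category.values()) / len(per_category)


# Toy ground truth and predictions (made-up values, one list per category).
y_true = {"Conservation": [1, 1, 0, 1], "Self-transcendence": [1, 0, 1, 0]}
y_pred = {"Conservation": [1, 0, 0, 1], "Self-transcendence": [0, 0, 1, 1]}

per_category, average = macro_f1(y_true, y_pred)
print(per_category, average)
```

Each category is scored on its own column of predictions first, and only the resulting per-category scores are averaged.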

In [38]:
batch_size = 8
metric_name = "f1"

In [39]:
from sklearn.metrics import f1_score, accuracy_score, multilabel_confusion_matrix, classification_report
from transformers import EvalPrediction, TrainerCallback
import torch

threshold = 0.5
cr_dict = None

def compute_metrics(p: EvalPrediction):
    global cr_dict
    predictions = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    true_labels=p.label_ids

    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    y_true = true_labels

    accuracy = []

    if cr_dict is None:
        cr_dict = classification_report(y_true, y_pred, target_names=labels, output_dict=True, zero_division=0)
        for key,value in cr_dict.items():
            cr_dict[key] = {k: [v] for k,v in value.items()}
            cr_dict[key]['accuracy'] = []
            if key in labels:
                accuracy.append(accuracy_score(y_true[:, label2id[key]], y_pred[:, label2id[key]]))
                cr_dict[key]['accuracy'].append(accuracy[-1])
        cr_dict['macro avg']['accuracy'].append(np.mean(accuracy))
    else:
        cr = classification_report(y_true, y_pred, target_names=labels, output_dict=True, zero_division=0)
        for key,value in cr.items():
            for k in value.keys():
                cr_dict[key][k].append(cr[key][k])
            if key in labels:
                accuracy.append(accuracy_score(y_true[:, label2id[key]], y_pred[:, label2id[key]]))
                cr_dict[key]['accuracy'].append(accuracy[-1])
        cr_dict['macro avg']['accuracy'].append(np.mean(accuracy))

    macro_precision = cr_dict['macro avg']['precision'][-1]
    macro_recall = cr_dict['macro avg']['recall'][-1]
    macro_f1 = cr_dict['macro avg']['f1-score'][-1]
    macro_accuracy = cr_dict['macro avg']['accuracy'][-1]

    # return the metrics as a dictionary
    return {'f1': macro_f1, 'precision': macro_precision, 'recall': macro_recall, 'accuracy': macro_accuracy}

# [Task 4 - 1.0 points] Training and Evaluation

You are now tasked to train and evaluate **all** defined models.

### Instructions

* Train **all** models on the train set.
* Evaluate **all** models on the validation set.
* Pick **at least** three seeds for robust estimation.
* Compute metrics on the validation set.
* Report **per-category** and **macro** F1-score for comparison.
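
One way to report results over several seeds is the mean and standard deviation of each score across runs. The sketch below only shows the aggregation step: `summarize_runs` is a hypothetical helper, and the numbers are placeholders chosen for illustration, not results of any model.

```python
from statistics import mean, stdev


def summarize_runs(f1_by_seed: dict[int, dict[str, float]]) -> dict[str, tuple[float, float]]:
    """Return (mean, sample standard deviation) of every score across seeds."""
    runs = list(f1_by_seed.values())
    score_names = runs[0].keys()
    return {
        name: (mean([run[name] for run in runs]), stdev([run[name] for run in runs]))
        for name in score_names
    }


# Placeholder scores, NOT real results: one dictionary per seed.
f1_by_seed = {
    0: {"Conservation": 0.80, "macro": 0.50},
    1: {"Conservation": 0.70, "macro": 0.40},
    2: {"Conservation": 0.75, "macro": 0.45},
}
summary = summarize_runs(f1_by_seed)
for name, (avg, std) in summary.items():
    print(f"{name}: {avg:.3f} +/- {std:.3f}")
```

`stdev` needs at least two runs, which the requirement of at least three seeds satisfies.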

In [40]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="test_dir",                 # where to save model
    learning_rate=0.01,
    per_device_train_batch_size=batch_size,         # accelerate defines distributed training
    per_device_eval_batch_size=batch_size,
    num_train_epochs=50,
    weight_decay=0.01,
    logging_strategy="epoch",
    evaluation_strategy="epoch",           # when to report evaluation metrics/losses
    save_strategy="epoch",                 # when to save checkpoint
    load_best_model_at_end=True,
    report_to='none'                       # disabling wandb (default)
)

In [41]:
from torch.optim.lr_scheduler import ReduceLROnPlateau

def create_trainers(models):
    # (train, validation) datasets for the c, cp and cps models, in the same order as `models`.
    datasets = [
        (encoded_c_train_dataset, encoded_c_val_dataset),
        (encoded_cp_train_dataset, encoded_cp_val_dataset),
        (encoded_cps_train_dataset, encoded_cps_val_dataset),
    ]

    trainers = []
    for model, (train_dataset, val_dataset) in zip(models, datasets):
        # Each model gets its own optimizer and a scheduler that halves the
        # learning rate when the monitored value stops decreasing.
        optimizer = torch.optim.ASGD(model.parameters(), lr=0.01)
        reduce_lr = ReduceLROnPlateau(optimizer, 'min', factor=0.5, patience=5)

        trainers.append(Trainer(
            model=model,
            args=args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            tokenizer=tokenizer,
            data_collator=data_collator,
            compute_metrics=compute_metrics,
            optimizers=(optimizer, reduce_lr)
        ))
    return trainers

In [42]:
def download_model(model_to_download, model_seed=0):
    # Must match the folder used by get_model, which loads the downloaded files.
    model_folder = Path.cwd().joinpath(f"drive/MyDrive/models/{model_to_download.name}")
    base_url = f"https://github.com/LorenzoScaioli/NLP-Models/raw/main/models_Assigment_2/{model_to_download.name}"

    file_names = [
        f"my_history_{model_to_download.name}_{model_seed}.npy",
        f"my_metrics_val_{model_to_download.name}_{model_seed}.npy",
        f"my_metrics_test_{model_to_download.name}_{model_seed}.npy",
        f"my_cm_val_{model_to_download.name}_{model_seed}.npy",
        f"my_cm_test_{model_to_download.name}_{model_seed}.npy",
        # The model is saved with torch.save, so it keeps the .pth extension locally too.
        f"{model_to_download.name}_{model_seed}.pth",
    ]

    for file_name in file_names:
        file_path = model_folder.joinpath(file_name)

        if not file_path.exists():
            download_dataset(file_path, f"{base_url}/{file_name}")

In [43]:
from urllib.error import HTTPError

def get_model(trainer, model, name, model_seed, train_model):
    global cr_dict
    model_folder = Path.cwd().joinpath("drive/MyDrive/models", name)
    model_folder.mkdir(parents=True, exist_ok=True)

    model_path = model_folder.joinpath(f"{name}_{model_seed}.pth")
    history_path = model_folder.joinpath(f'my_history_{name}_{model_seed}.npy')
    metrics_val_path = model_folder.joinpath(f'my_metrics_val_{name}_{model_seed}.npy')

    if train_model:
        trainer.train()
        torch.save(model, model_path)

        # log_history is a list of dicts; np.save pickles it as an object array.
        history = trainer.state.log_history
        np.save(history_path, history)
        metrics_dict_val = cr_dict
        np.save(metrics_val_path, metrics_dict_val)
        cr_dict = None
    else:
        try:
            download_model(model, model_seed)
            model = torch.load(model_path)

            # The history is a 1-D object array (one dict per log entry): use tolist().
            history = np.load(history_path, allow_pickle=True).tolist()
            # The metrics dict is stored as a 0-d object array: use item().
            metrics_dict_val = np.load(metrics_val_path, allow_pickle=True).item()
        except HTTPError:
            print()
            print("Error: Model not found! Train it first!")
            # Same number of values as the successful return below.
            return None, None, None

    return model, history, metrics_dict_val

In [44]:
model_names = ['c_model', 'cp_model', 'cps_model']
# models = [c_model, cp_model, cps_model]
# trainers = [c_trainer, cp_trainer, cps_trainer]
seeds = [2115992153, 3236146088, 749713082]
trained_models = []
trained_history = []
trained_metrics_dict_val = []
train_model = True

for seed in seeds:
    fix_random(seed=seed)
    models = create_models()
    trainers = create_trainers(models)
    for i in range(3):
        model = models[i]
        trainer = trainers[i]
        name = model_names[i]
        print(f"Model: {name} - Seed: {seed}")
        globals()[f"{name}_{seed}"], globals()[f"history_{name}_{seed}"], globals()[f"metrics_dict_val_{name}_{seed}"] = get_model(trainer, model, name, seed, train_model)
        #globals()[f"cm_val_{name}_{seed}"], globals()[f"cm_test_{name}_{seed}"]
        print()
        trained_models.append(globals()[f"{name}_{seed}"])
        trained_history.append(globals()[f"history_{name}_{seed}"])

        trained_metrics_dict_val.append(globals()[f"metrics_dict_val_{name}_{seed}"])
        #trained_cm_val.append(globals()[f"cm_val_{name}_{seed}"])
        #trained_cm_test.append(globals()[f"cm_test_{name}_{seed}"])
    print(" ")

Some weights of the model checkpoint at prajjwal1/bert-tiny were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initia

Model: c_model - Seed: 2115992153


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,F1,Precision,Recall,Accuracy
1,0.6081,0.60774,0.540231,0.508215,0.591243,0.675501
2,0.5903,0.611627,0.517018,0.539905,0.555085,0.687368
3,0.5803,0.610654,0.573901,0.520361,0.642373,0.68605
4,0.5721,0.616305,0.624479,0.598414,0.68222,0.670095
5,0.5673,0.623458,0.558317,0.523284,0.610734,0.686445
6,0.5626,0.627988,0.558393,0.657264,0.584161,0.685918
7,0.56,0.620172,0.624594,0.618545,0.661604,0.680512
8,0.5566,0.614164,0.616266,0.605525,0.653256,0.663898
9,0.5542,0.62074,0.590628,0.612043,0.617803,0.668908
10,0.5528,0.61806,0.642948,0.615704,0.696879,0.682094



Model: cp_model - Seed: 2115992153


Epoch,Training Loss,Validation Loss,F1,Precision,Recall,Accuracy
1,0.6066,0.594663,0.543865,0.547839,0.581073,0.694884
2,0.5826,0.582835,0.642337,0.672612,0.690474,0.710311
3,0.567,0.584605,0.595443,0.670691,0.625288,0.702532
4,0.5547,0.582833,0.629442,0.704522,0.662451,0.711234
5,0.5452,0.596443,0.584806,0.736805,0.611079,0.706751
6,0.5389,0.573943,0.672936,0.713055,0.689787,0.717827
7,0.533,0.577244,0.678781,0.708592,0.727579,0.717695
8,0.5249,0.571487,0.683071,0.725001,0.706382,0.722969
9,0.5206,0.574933,0.664531,0.739814,0.666268,0.722838
10,0.5165,0.564996,0.712937,0.700732,0.747165,0.723761



Model: cps_model - Seed: 2115992153


Epoch,Training Loss,Validation Loss,F1,Precision,Recall,Accuracy
1,0.6068,0.596894,0.53695,0.538187,0.575706,0.690269
2,0.5844,0.594686,0.544176,0.670748,0.578572,0.693038
3,0.5677,0.583518,0.620284,0.68438,0.650854,0.706224
4,0.5563,0.580689,0.663422,0.703304,0.678489,0.711893
5,0.5493,0.575365,0.689918,0.695574,0.716441,0.712553
6,0.5411,0.568977,0.706045,0.688675,0.745119,0.717827
7,0.5337,0.571112,0.702462,0.700195,0.730803,0.719541
8,0.53,0.573581,0.683833,0.716733,0.699762,0.722046
9,0.5227,0.57414,0.692645,0.716169,0.707877,0.722178
10,0.5191,0.566524,0.705563,0.712235,0.730413,0.726793



 



Model: c_model - Seed: 3236146088


Epoch,Training Loss,Validation Loss,F1,Precision,Recall,Accuracy
1,0.6096,0.605648,0.534279,0.496856,0.588701,0.666799
2,0.5926,0.605345,0.536857,0.532437,0.577119,0.688159
3,0.5807,0.610193,0.564388,0.508797,0.635311,0.674974
4,0.5722,0.615752,0.579237,0.510673,0.669492,0.676424
5,0.5661,0.623424,0.543698,0.49839,0.603955,0.666403
6,0.5626,0.625125,0.528349,0.531531,0.567797,0.686577
7,0.5597,0.62256,0.617404,0.629222,0.692752,0.677215
8,0.5579,0.617203,0.59268,0.580387,0.63073,0.648734
9,0.5529,0.621847,0.591251,0.59976,0.622949,0.658887
10,0.5531,0.62281,0.632358,0.590674,0.706257,0.658096



Model: cp_model - Seed: 3236146088


Epoch,Training Loss,Validation Loss,F1,Precision,Recall,Accuracy
1,0.6084,0.600855,0.542415,0.534789,0.583051,0.690005
2,0.5865,0.591051,0.593064,0.642529,0.640366,0.699895
3,0.5695,0.592921,0.578131,0.648523,0.605534,0.692115
4,0.5573,0.587055,0.621094,0.672712,0.662752,0.706487
5,0.5482,0.600927,0.583661,0.722495,0.602336,0.702927
6,0.5419,0.583026,0.654216,0.704692,0.667336,0.711102
7,0.535,0.581277,0.668231,0.705634,0.724959,0.716904
8,0.5274,0.57598,0.673931,0.71682,0.706171,0.720464
9,0.5234,0.573289,0.68166,0.721017,0.698236,0.723365
10,0.5196,0.566643,0.719504,0.700893,0.763439,0.725211



Model: cps_model - Seed: 3236146088


Epoch,Training Loss,Validation Loss,F1,Precision,Recall,Accuracy
1,0.6068,0.59656,0.534198,0.540962,0.572034,0.690665
2,0.5818,0.595036,0.537381,0.666468,0.569063,0.691851
3,0.5651,0.588781,0.588598,0.667443,0.628708,0.703191
4,0.554,0.582507,0.63168,0.695601,0.650925,0.707147
5,0.5461,0.57698,0.670848,0.699966,0.687734,0.709784
6,0.5389,0.569615,0.700263,0.69272,0.736314,0.717563
7,0.5309,0.572534,0.695789,0.697088,0.723101,0.715585
8,0.5275,0.573115,0.671422,0.722995,0.681714,0.719805
9,0.5196,0.577644,0.675362,0.725351,0.682518,0.720332
10,0.5163,0.57303,0.688223,0.720927,0.70218,0.722574



 



Model: c_model - Seed: 749713082


Epoch,Training Loss,Validation Loss,F1,Precision,Recall,Accuracy
1,0.6096,0.607993,0.537191,0.503966,0.588983,0.672468
2,0.593,0.61438,0.517288,0.541858,0.555085,0.687896
3,0.5811,0.61281,0.553841,0.500097,0.622599,0.666271
4,0.5738,0.616921,0.567675,0.500901,0.655367,0.662579
5,0.568,0.627926,0.533446,0.496961,0.587288,0.667062
6,0.5632,0.634201,0.511077,0.542853,0.549435,0.687104
7,0.5609,0.62019,0.554518,0.566301,0.613882,0.667062
8,0.5567,0.615358,0.563613,0.581844,0.605348,0.660601
9,0.5557,0.61842,0.5966,0.62953,0.620013,0.676424
10,0.5547,0.620629,0.588328,0.587844,0.663827,0.68394



Model: cp_model - Seed: 749713082


Epoch,Training Loss,Validation Loss,F1,Precision,Recall,Accuracy
1,0.6073,0.600355,0.541,0.538638,0.580226,0.691192
2,0.5859,0.588176,0.613714,0.642774,0.659868,0.69818
3,0.5687,0.588768,0.593728,0.650383,0.622938,0.695807
4,0.5565,0.586731,0.621071,0.675989,0.660047,0.706619
5,0.548,0.600439,0.587546,0.736421,0.616208,0.705432
6,0.5422,0.584519,0.64029,0.702696,0.657359,0.710047
7,0.5358,0.581628,0.663436,0.70698,0.703673,0.717695
8,0.5273,0.579341,0.665042,0.725115,0.689595,0.72086
9,0.523,0.576356,0.659522,0.721562,0.670605,0.718354
10,0.519,0.566444,0.706733,0.702928,0.740749,0.724156



Model: cps_model - Seed: 749713082


Epoch,Training Loss,Validation Loss,F1,Precision,Recall,Accuracy
1,0.6077,0.594125,0.532384,0.550117,0.568362,0.692906
2,0.5837,0.591801,0.547071,0.693319,0.576428,0.695543
3,0.5654,0.580753,0.623124,0.674616,0.661058,0.706487
4,0.5546,0.578986,0.652117,0.702099,0.672233,0.713476
5,0.5468,0.571684,0.692732,0.697822,0.717779,0.714135
6,0.5389,0.567183,0.717153,0.690438,0.759635,0.720728
7,0.5303,0.567045,0.711923,0.69737,0.747113,0.721123
8,0.528,0.56893,0.680367,0.731203,0.691416,0.725079
9,0.5198,0.573677,0.690376,0.719645,0.702217,0.722969
10,0.5169,0.564937,0.707745,0.714761,0.729463,0.727057



 


In [45]:
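import matplotlib.pyplot as plt

# Assumption: `trainers` still holds the trainers created for the last seed; index 1 is the cp model.
eval_loss = pd.DataFrame(trainers[1].state.log_history).groupby('epoch')['eval_loss'].max()
plt.plot(eval_loss, '-')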

## Training check

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

def graph_plots(history, model_name):
    # `history` is trainer.state.log_history: a list of dicts, one per logging event.
    # Assumption: compute_metrics returns an 'accuracy' entry, which Trainer logs as
    # 'eval_accuracy'. Trainer does not log a training accuracy, so only the
    # validation curve is shown.
    logs = pd.DataFrame(history).groupby('epoch').max()

    plt.figure(figsize=(12, 6))

    plt.subplot(1, 2, 1)
    plt.plot(logs['eval_accuracy'])
    plt.xlabel("Epochs")
    plt.ylabel("Accuracy")
    plt.legend(['val_accuracy'])
    plt.title('Accuracy - ' + model_name)

    plt.subplot(1, 2, 2)
    plt.plot(logs['loss'])
    plt.plot(logs['eval_loss'])
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.legend(['loss', 'val_loss'])
    plt.title('Loss - ' + model_name)

    plt.tight_layout()
    plt.show()

In [None]:
val_accuracy = dict()
# trained_history is ordered as (c, cp, cps) for each seed, so indices 0, 4 and 8
# select one run per model, each from a different seed.
for idx in range(0, len(trained_history), 4):
    name = model_names[idx % len(model_names)]
    history = trained_history[idx]
    graph_plots(history, name)

    # Validation accuracy at the last evaluated epoch.
    val_accuracy[name] = [log['eval_accuracy'] for log in history if 'eval_accuracy' in log][-1]

for name in model_names:
    print("{}: Val Accuracy: {:.4f}".format(name, val_accuracy[name]))

# [Task 5 - 1.0 points] Error Analysis

You are tasked to discuss your results.

### Instructions

* **Compare** the classification performance of the BERT-based models against the baselines.
* Discuss the **differences in prediction** between the best-performing BERT-based model and its variants.

### Notes

You can check the [original paper](https://aclanthology.org/2022.acl-long.306/) for suggestions on how to perform comparisons (e.g., plots, tables, etc...).
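
One possible way to set up the comparison against baselines is sketched below. It assumes two simple baselines (a per-label majority classifier and a uniform random classifier) and a macro-averaged F1 score; the helper names `majority_baseline`, `uniform_baseline`, and `macro_f1`, as well as the toy label matrices, are hypothetical and only serve as an illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

def majority_baseline(y_train, n_samples):
    """Predict, for every sample, the most frequent value of each label in the training set."""
    majority = (y_train.mean(axis=0) >= 0.5).astype(int)
    return np.tile(majority, (n_samples, 1))

def uniform_baseline(n_samples, n_labels, seed=0):
    """Predict 0 or 1 with equal probability, independently for each label."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=(n_samples, n_labels))

def macro_f1(y_true, y_pred):
    """Binary F1 per label, averaged over labels."""
    return f1_score(y_true, y_pred, average="macro", zero_division=0)

# Toy multi-label data (rows: arguments, columns: value categories), for illustration only.
y_train = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [1, 0, 1]])
y_val = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [1, 1, 1]])

majority_pred = majority_baseline(y_train, len(y_val))
uniform_pred = uniform_baseline(len(y_val), y_val.shape[1])

print(f"Majority baseline macro F1: {macro_f1(y_val, majority_pred):.4f}")
print(f"Uniform baseline macro F1:  {macro_f1(y_val, uniform_pred):.4f}")
```

The same `macro_f1` call can then be applied to the thresholded predictions of each BERT-based model, so that all systems are scored in the same way.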

# [Task 6 - 1.0 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.
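
As an illustration of the table recommendation, the sketch below averages a metric over the three seeds. The values are the epoch-10 validation F1 scores printed in the training logs of this notebook; the helper name `summarize_over_seeds` is hypothetical, and reporting mean and standard deviation is only one reasonable choice.

```python
import pandas as pd

# Final-epoch (epoch 10) validation F1 per model, one value per seed, copied from the training logs.
final_f1 = {
    "c_model": [0.642948, 0.632358, 0.588328],
    "cp_model": [0.712937, 0.719504, 0.706733],
    "cps_model": [0.705563, 0.688223, 0.707745],
}

def summarize_over_seeds(scores):
    """Return the mean and (sample) standard deviation over seeds for each model."""
    frame = pd.DataFrame(scores)  # rows: seeds, columns: models
    return pd.DataFrame({"mean": frame.mean(), "std": frame.std()}).round(4)

summary = summarize_over_seeds(final_f1)
print(summary.to_string())
```

The resulting frame has one row per model and can be copied into the report table.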

# Submission

* **Submit** your report in PDF format.
* **Submit** your Python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...
* You can upload **model weights** in a cloud repository and report the link in the report.

# FAQ

Please check these frequently asked questions before contacting us.

### Model card

You are **free** to choose the BERT-base model card you like from huggingface.

### Model architecture

You **should not** change the architecture of a model (i.e., its layers).

However, you are **free** to play with its hyper-parameters.

### Model Training

You are **free** to choose training hyper-parameters for BERT-based models (e.g., number of epochs, etc...).

### Neural Libraries

You are **free** to use any library of your choice to address the assignment (e.g., Keras, TensorFlow, PyTorch, JAX, etc...).

### Error Analysis

Some topics for discussion include:
   * Model performance on most/least frequent classes.
   * Precision/Recall curves.
   * Confusion matrices.
   * Specific misclassified samples.
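
Several of these topics can be explored with scikit-learn, as in the minimal sketch below. The label names and the toy `y_true` / `y_pred` matrices are assumptions made for illustration, and the helpers `per_class_report` and `misclassified_indices` are hypothetical names; in practice the matrices would come from the test labels and the thresholded model outputs.

```python
import numpy as np
from sklearn.metrics import classification_report, multilabel_confusion_matrix

# Assumed label names and toy predictions, for illustration only.
label_names = ["Openness to change", "Self-enhancement", "Conservation", "Self-transcendence"]
y_true = np.array([[1, 0, 1, 1],
                   [0, 1, 1, 0],
                   [1, 0, 0, 1],
                   [0, 0, 1, 1]])
y_pred = np.array([[1, 0, 1, 0],
                   [0, 0, 1, 0],
                   [1, 1, 0, 1],
                   [0, 0, 1, 1]])

def per_class_report(y_true, y_pred, label_names):
    """Per-class precision/recall/F1 plus one 2x2 confusion matrix per class."""
    report = classification_report(y_true, y_pred, target_names=label_names,
                                   output_dict=True, zero_division=0)
    # Shape (n_classes, 2, 2); each matrix is [[TN, FP], [FN, TP]].
    matrices = multilabel_confusion_matrix(y_true, y_pred)
    return report, matrices

def misclassified_indices(y_true, y_pred, class_index):
    """Indices of the samples whose prediction for one class is wrong."""
    return np.flatnonzero(y_true[:, class_index] != y_pred[:, class_index])

report, matrices = per_class_report(y_true, y_pred, label_names)
for name, matrix in zip(label_names, matrices):
    print(f"{name}: F1={report[name]['f1-score']:.2f}, support={report[name]['support']}")
    print(matrix)
print("Misclassified for", label_names[1], "->", misclassified_indices(y_true, y_pred, 1))
```

The indices returned by `misclassified_indices` can be used to look up the corresponding premises and conclusions when discussing specific errors.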

# The End