In this second assignment, you are challenged to employ Hugging Face transformers for the same classification task as in the first assignment.

You should explore Hugging Face models to find a pre-trained model that is suitable and promising for fine-tuning on the ADU type classification data. It makes sense to pick one that has been pre-trained on Portuguese (either on its own or as part of a multilingual corpus), possibly with data from a similar genre.

As a bonus, you can also employ a domain adaptation approach by leveraging the full text of the opinion articles made available.

You should compare the performance of your model(s) with the ones developed for the first assignment. For the final delivery, prepare a short presentation (max 10 slides) documenting your approach.



## Loading the dataset 

In [4]:
import pandas as pd

dataset = pd.read_excel("OpArticles_ADUs.xlsx")
dataset.head()

Unnamed: 0,article_id,annotator,node,ranges,tokens,label
0,5d04a31b896a7fea069ef06f,A,0,"[[2516, 2556]]",O facto não é apenas fruto da ignorância,Value
1,5d04a31b896a7fea069ef06f,A,1,"[[2568, 2806]]",havia no seu humor mais jornalismo (mais inves...,Value
2,5d04a31b896a7fea069ef06f,A,3,"[[3169, 3190]]",É tudo cómico na FIFA,Value
3,5d04a31b896a7fea069ef06f,A,4,"[[3198, 3285]]",o que todos nós permitimos que esta organizaçã...,Value
4,5d04a31b896a7fea069ef06f,A,6,"[[4257, 4296]]",não nos fazem rir à custa dos poderosos,Value


## Data cleaning

Some text spans were annotated more than once. In these cases, there are two possibilities:

1. The text span is kept if all annotations assign it the same class.
2. The text span is eliminated if different annotators assign it different labels.



In [19]:
import numpy as np

# Group the annotations that refer to the same text span of the same article
grouped_df = dataset.groupby(by=["article_id", "ranges"])
dataset_dict = {"tokens": [], "label": [], "article_id": []}

for _, group in grouped_df:
    # Distinct labels assigned to this span by its annotators
    labels = group["label"].unique()
    if len(labels) > 1:
        # Annotators disagree: discard the span
        continue
    dataset_dict["article_id"].append(group["article_id"].values[0])
    dataset_dict["tokens"].append(group["tokens"].values[0])
    dataset_dict["label"].append(labels[0])

dataset = pd.DataFrame(dataset_dict, columns=["tokens", "label", "article_id"])
dataset

Unnamed: 0,tokens,label,article_id
0,presumo que essas partilhas tenham gerado um e...,Value,5cdd971b896a7fea062d6e3d
1,essas partilhas tenham gerado um efeito bola d...,Value,5cdd971b896a7fea062d6e3d
2,esta questão ter [justificadamente] despertado...,Value,5cdd971b896a7fea062d6e3d
3,a ocasião propicia um debate amplo na sociedad...,Value,5cdd971b896a7fea062d6e3d
4,a tomada urgente de medidas por parte da tutel...,Value,5cdd971b896a7fea062d6e3d
...,...,...,...
10248,Um presidente de câmara pode pertencer à admin...,Value,5d04c671896a7fea06a11275
10249,eticamente é reprovável,Value(-),5d04c671896a7fea06a11275
10250,"eticamente é reprovável e, o bom senso, aconse...",Value,5d04c671896a7fea06a11275
10251,"o bom senso, aconselha a não o fazer",Value,5d04c671896a7fea06a11275
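To make the keep/drop rule concrete, here is a minimal sketch on a toy frame. The rows and the names `toy`, `n_labels` and `cleaned` are made up for illustration, and the vectorised `transform("nunique")` filter is an alternative formulation of the rule, not the loop used in the cell above.

```python
import pandas as pd

# Hypothetical toy annotations: the span "[[6, 9]]" of article a1 has conflicting labels
toy = pd.DataFrame({
    "article_id": ["a1", "a1", "a1", "a1", "a2"],
    "ranges": ["[[0, 5]]", "[[0, 5]]", "[[6, 9]]", "[[6, 9]]", "[[0, 4]]"],
    "tokens": ["span one", "span one", "span two", "span two", "span three"],
    "label": ["Value", "Value", "Fact", "Value", "Policy"],
})

# Number of distinct labels per (article_id, ranges) group, broadcast to every row
n_labels = toy.groupby(["article_id", "ranges"])["label"].transform("nunique")

cleaned = (
    toy[n_labels == 1]                                  # keep spans with full agreement
    .drop_duplicates(subset=["article_id", "ranges"])   # one row per span
    [["tokens", "label", "article_id"]]
    .reset_index(drop=True)
)
print(cleaned)
```

Unlike the `groupby` loop, this variant keeps the rows in their original order rather than sorting them by group key.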


In [21]:
dataset["label"].value_counts()

Value       5003
Fact        2235
Value(-)    1768
Value(+)     849
Policy       398
Name: label, dtype: int64

The dataset is now ready for splitting. Without any augmentation, it contains roughly 10,000 samples. As in assignment 1, it is imbalanced: "Value" alone accounts for nearly half of the examples (5,003), while "Policy" has only 398.
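The degree of imbalance can be quantified directly from the counts printed above. The following is a small sketch in plain Python; the names `label_counts`, `shares` and `imbalance_ratio` are illustrative only.

```python
# Label counts copied from the value_counts() output above
label_counts = {
    "Value": 5003,
    "Fact": 2235,
    "Value(-)": 1768,
    "Value(+)": 849,
    "Policy": 398,
}

total = sum(label_counts.values())

# Relative frequency of each class
shares = {label: count / total for label, count in label_counts.items()}
for label, share in shares.items():
    print(f"{label:<9} {share:6.1%}")

# Ratio between the most and the least frequent class
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```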

To make the dataset easier to split and use with the Hugging Face tooling, we convert it into a Hugging Face `Dataset`.

In [None]:
!pip install datasets --quiet
from datasets import Dataset

dataset_hf = Dataset.from_pandas(dataset)