In this second assignment, you are challenged to employ Hugging Face transformers for the same classification task as in the first assignment.

You should explore Hugging Face models to find a pre-trained model that is suitable and promising for fine-tuning with data for the ADU type classification task. It makes sense to pick one that has been pre-trained on Portuguese (either exclusively or as part of a multilingual corpus), ideally with data from a similar genre.

As a bonus, you can also employ a domain adaptation approach by leveraging the full text of the opinion articles made available.

You should compare the performance of your model(s) with the ones developed for the first assignment. For the final delivery, prepare a short presentation (max 10 slides) documenting your approach.



## Loading the dataset 

In [1]:
%ls

OpArticles_ADUs.xlsx  results/  sample_data/  test_trainer/


In [2]:
import pandas as pd

dataset = pd.read_excel("OpArticles_ADUs.xlsx")
dataset.head()

Unnamed: 0,article_id,annotator,node,ranges,tokens,label
0,5d04a31b896a7fea069ef06f,A,0,"[[2516, 2556]]",O facto não é apenas fruto da ignorância,Value
1,5d04a31b896a7fea069ef06f,A,1,"[[2568, 2806]]",havia no seu humor mais jornalismo (mais inves...,Value
2,5d04a31b896a7fea069ef06f,A,3,"[[3169, 3190]]",É tudo cómico na FIFA,Value
3,5d04a31b896a7fea069ef06f,A,4,"[[3198, 3285]]",o que todos nós permitimos que esta organizaçã...,Value
4,5d04a31b896a7fea069ef06f,A,6,"[[4257, 4296]]",não nos fazem rir à custa dos poderosos,Value


## Data cleaning

Some text spans were annotated more than once. Such spans are handled in one of two ways:


1.   The text span is kept, if all annotations consider that the example belongs to the same class; 
2.   The text span is eliminated, if different annotators assign different labels to the example. 
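
Before applying the rule to the full dataset, it can be checked on a toy example. The sketch below is dependency-free and uses made-up annotations (the ids, ranges and texts are hypothetical); it only illustrates the keep/eliminate logic, not the notebook's actual implementation.

```python
from collections import defaultdict

# Toy annotations (hypothetical): (article_id, ranges, tokens, label)
annotations = [
    ("a1", "[[0, 5]]", "span one", "Value"),
    ("a1", "[[0, 5]]", "span one", "Fact"),      # conflicting label -> span is eliminated
    ("a1", "[[9, 20]]", "span two", "Value"),
    ("a2", "[[3, 8]]", "span three", "Policy"),
    ("a2", "[[3, 8]]", "span three", "Policy"),  # same label twice -> span is kept once
]

# Collect the set of labels seen for each (article_id, ranges) span
labels_per_span = defaultdict(set)
tokens_per_span = {}
for article_id, ranges, tokens, label in annotations:
    labels_per_span[(article_id, ranges)].add(label)
    tokens_per_span[(article_id, ranges)] = tokens

# Keep only the spans on which all annotators agree
cleaned = [
    {"tokens": tokens_per_span[key], "label": next(iter(labels)), "article_id": key[0]}
    for key, labels in labels_per_span.items()
    if len(labels) == 1
]
print(cleaned)
```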



In [3]:
import numpy as np

grouped_df = dataset.groupby(by=['article_id', 'ranges'])
dataset_dict = {"tokens": [], "label": [], "article_id": []}

for _, group in grouped_df:
    # Distinct labels assigned to this span by the different annotators
    unique_labels = group["label"].unique()
    if len(unique_labels) > 1:
        # Annotators disagree: discard the span
        continue
    dataset_dict["article_id"].append(group["article_id"].values[0])
    dataset_dict["tokens"].append(group["tokens"].values[0])
    dataset_dict["label"].append(unique_labels[0])
    
dataset = pd.DataFrame(dataset_dict, columns = ["tokens", "label", "article_id"])
dataset

Unnamed: 0,tokens,label,article_id
0,presumo que essas partilhas tenham gerado um e...,Value,5cdd971b896a7fea062d6e3d
1,essas partilhas tenham gerado um efeito bola d...,Value,5cdd971b896a7fea062d6e3d
2,esta questão ter [justificadamente] despertado...,Value,5cdd971b896a7fea062d6e3d
3,a ocasião propicia um debate amplo na sociedad...,Value,5cdd971b896a7fea062d6e3d
4,a tomada urgente de medidas por parte da tutel...,Value,5cdd971b896a7fea062d6e3d
...,...,...,...
10248,Um presidente de câmara pode pertencer à admin...,Value,5d04c671896a7fea06a11275
10249,eticamente é reprovável,Value(-),5d04c671896a7fea06a11275
10250,"eticamente é reprovável e, o bom senso, aconse...",Value,5d04c671896a7fea06a11275
10251,"o bom senso, aconselha a não o fazer",Value,5d04c671896a7fea06a11275


In [4]:
dataset["label"].value_counts()

Value       5003
Fact        2235
Value(-)    1768
Value(+)     849
Policy       398
Name: label, dtype: int64

The dataset is now ready for splitting. Without any augmentation, it contains roughly 10,000 samples. As in assignment 1, it is imbalanced: about half of the examples are labelled "Value".
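
To quantify the imbalance, the counts printed above can be turned into class shares and inverse-frequency weights. This is only a sketch: the "balanced" weighting formula is a common heuristic and is an assumption here, since the notebook itself does not use class weights.

```python
# Label counts copied from the value_counts() output above
counts = {"Value": 5003, "Fact": 2235, "Value(-)": 1768, "Value(+)": 849, "Policy": 398}

total = sum(counts.values())
n_classes = len(counts)

# Share of each class in the cleaned dataset
shares = {label: n / total for label, n in counts.items()}

# "Balanced" heuristic (an assumption, not used in this notebook):
# weight = total / (n_classes * count), so rare classes get larger weights
class_weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label in counts:
    print(f"{label:9s} share={shares[label]:.3f} weight={class_weights[label]:.2f}")
```

Such weights could, for instance, be passed to a weighted cross-entropy loss if the minority classes turn out to be poorly predicted.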

In order to easily use and split the dataset, we need to convert it into a Hugging Face dataset.

In [5]:
!pip install datasets --quiet
from datasets import Dataset

labels = ['Value','Fact','Value(+)','Value(-)','Policy']
numeric_labels = []

for label in dataset["label"]:
    new_label = labels.index(label)
    numeric_labels.append(new_label)

dataset["label"] = numeric_labels

dataset_hf = Dataset.from_pandas(dataset)
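
The index-based encoding above can also be written as a pair of explicit mappings, which makes it easier to turn predicted ids back into label names later. The names `label2id` and `id2label` are chosen for illustration; the sketch below runs without any library.

```python
labels = ['Value', 'Fact', 'Value(+)', 'Value(-)', 'Policy']

# Label name -> integer id, following the order of the `labels` list
label2id = {label: idx for idx, label in enumerate(labels)}
# Integer id -> label name (the inverse mapping)
id2label = {idx: label for label, idx in label2id.items()}

# Round trip on a few example labels
encoded = [label2id[name] for name in ["Value", "Policy", "Value(-)"]]
decoded = [id2label[idx] for idx in encoded]
print(encoded)
print(decoded)
```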

## Splitting the dataset

We can now split the dataset into training, validation and test sets.

In [13]:
from datasets import DatasetDict

# 90% train, 10% test+validation
train_test = dataset_hf.train_test_split(test_size=0.1)

# Split the 10% test+validation set in half test, half validation
valid_test = train_test['test'].train_test_split(test_size=0.5)

# Gather the three splits into a single DatasetDict
train_valid_test_dataset = DatasetDict({
    'train': train_test['train'],
    'validation': valid_test['train'],
    'test': valid_test['test']
})
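
The split above is random and unseeded, so the class proportions in the small validation and test sets can drift from run to run. As a sketch of an alternative, the hypothetical helper `stratified_split` below produces reproducible index lists that preserve per-class proportions; it is not part of the `datasets` library, and the toy labels are made up.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.9, 0.05, 0.05), seed=42):
    """Return (train, validation, test) index lists, preserving per-class proportions."""
    rng = random.Random(seed)

    # Group example indices by their label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, valid, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = int(round(fractions[0] * len(indices)))
        n_valid = int(round(fractions[1] * len(indices)))
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]  # whatever remains goes to the test set
    return train, valid, test

# Toy labels (hypothetical): 80 examples of class 0 and 20 of class 1
toy_labels = [0] * 80 + [1] * 20
train_idx, valid_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The resulting index lists could then be used to pick rows from the Hugging Face dataset, for example with `Dataset.select`.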

BERTimbau Base is a pretrained BERT model for Brazilian Portuguese that achieves state-of-the-art performance on three downstream NLP tasks: Named Entity Recognition, Sentence Textual Similarity and Recognizing Textual Entailment. BERTimbau is available in two sizes, Base and Large; the Base model is used here as the baseline.

In [7]:
# Baseline model

model_name = "neuralmind/bert-base-portuguese-cased"

## Loading the model and tokenizer

In [None]:
#!pip install transformers --quiet

from transformers import AutoTokenizer  # Or BertTokenizer
from transformers import AutoModelForSequenceClassification

# Pretrained encoder plus a new 5-way classification head (one output per ADU label)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)
# model_max_length sets the limit applied when truncation=True is requested at call time;
# padding and truncation themselves are call-time options, not loading options
tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=False, model_max_length=512)


## Applying the tokenizer to the text spans

In [9]:
def preprocess_function(sample):
    return tokenizer(sample["tokens"], truncation=True, padding=True)


tokenized_dataset = train_valid_test_dataset.map(preprocess_function, batched=True)
train_valid_test_dataset

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



DatasetDict({
    train: Dataset({
        features: ['tokens', 'label', 'article_id'],
        num_rows: 9227
    })
    validation: Dataset({
        features: ['tokens', 'label', 'article_id'],
        num_rows: 513
    })
    test: Dataset({
        features: ['tokens', 'label', 'article_id'],
        num_rows: 513
    })
})
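
With `padding=True`, the tokenizer pads every text in a call to the longest one in that call. A `DataCollatorWithPadding` (used in the fine-tuning section) applies the same idea per training batch, so short batches stay short. The dependency-free sketch below illustrates the principle with a hypothetical `pad_batch` helper and made-up token ids; it is not the library's implementation.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence of token ids in `batch` to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids: each batch is only padded up to its own longest member
short_batch = pad_batch([[101, 7, 102], [101, 7, 8, 102]])
long_batch = pad_batch([[101, 7, 8, 9, 10, 11, 102], [101, 102]])

print(short_batch["input_ids"])
print(long_batch["attention_mask"])
```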

In [10]:
import torch

inputs = tokenizer(train_valid_test_dataset['test'][0]['tokens'], padding=True, truncation=True, return_tensors="pt")

outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)  # normalise over the 5 classes
print(predictions)


tensor([[0.1555, 0.1946, 0.1978, 0.2135, 0.2387]], grad_fn=<SoftmaxBackward0>)
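
At this point only the encoder is pretrained; the classification head is freshly initialised, so probabilities close to uniform (about 0.2 for each of the 5 classes) are expected. The sketch below shows, in plain Python, what the softmax does and how the printed probabilities map back to a label. The example logits are made up; the probabilities are copied from the output above.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

labels = ['Value', 'Fact', 'Value(+)', 'Value(-)', 'Policy']

# Hypothetical logits, only to show the normalisation
example_probs = softmax([0.1, -0.3, 2.0, 0.0, 0.5])
print(sum(example_probs))

# Probabilities copied from the tensor printed above
probs = [0.1555, 0.1946, 0.1978, 0.2135, 0.2387]
predicted = labels[probs.index(max(probs))]
print(predicted)
```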


  


## Fine-tuning

The next step is to fine-tune the model using our training data. 

In [11]:

from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
from datasets import load_metric
import numpy as np

metric = load_metric("accuracy")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch", # run validation at the end of each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# training_args = TrainingArguments("test_trainer", num_train_epochs=1)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)



In [12]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens, article_id. If tokens, article_id are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 9227
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 577


Epoch,Training Loss,Validation Loss,Accuracy
1,1.0711,0.915767,0.625731


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens, article_id. If tokens, article_id are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 513
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-577
Configuration saved in ./results/checkpoint-577/config.json
Model weights saved in ./results/checkpoint-577/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-577/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-577/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./results/checkpoint-577 (score: 0.9157670736312866).


TrainOutput(global_step=577, training_loss=1.054740350415843, metrics={'train_runtime': 13306.2252, 'train_samples_per_second': 0.693, 'train_steps_per_second': 0.043, 'total_flos': 732351320754516.0, 'train_loss': 1.054740350415843, 'epoch': 1.0})
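
The figures in the training log can be cross-checked with a few lines of arithmetic. All numbers below are copied from the output above.

```python
import math

num_examples = 9227          # "Num examples" in the training log
batch_size = 16              # per-device train batch size
train_runtime_s = 13306.2252 # "train_runtime" in TrainOutput

# One optimisation step per batch; the last batch is smaller than the others
steps_per_epoch = math.ceil(num_examples / batch_size)

# Throughput and wall-clock time of the single training epoch
samples_per_second = num_examples / train_runtime_s
runtime_hours = train_runtime_s / 3600

print(steps_per_epoch)
print(round(samples_per_second, 3))
print(round(runtime_hours, 2))
```

A single epoch therefore took about 3.7 hours at roughly 0.7 samples per second.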

In [18]:
!zip -r /results-bert-base-portuguese-cased.zip results-bert-base-pt-cased

  adding: results-bert-base-pt-cased/ (stored 0%)
  adding: results-bert-base-pt-cased/special_tokens_map.json (deflated 40%)
  adding: results-bert-base-pt-cased/config.json (deflated 56%)
  adding: results-bert-base-pt-cased/pytorch_model.bin (deflated 7%)
  adding: results-bert-base-pt-cased/training_args.bin (deflated 48%)
  adding: results-bert-base-pt-cased/tokenizer_config.json (deflated 38%)
  adding: results-bert-base-pt-cased/vocab.txt (deflated 52%)
  adding: results-bert-base-pt-cased/tokenizer.json (deflated 72%)


### Saving the model

In [None]:
trainer.save_model("results-bert-base-pt-cased")

## Loading and using the model

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer2 = AutoTokenizer.from_pretrained("./results-bert-base-pt-cased")
model2 = AutoModelForSequenceClassification.from_pretrained("./results-bert-base-pt-cased", num_labels=5)

In [28]:
import torch

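from transformers import TextClassificationPipeline

# High-level alternative (not used below; the loop calls the model directly)
pipe = TextClassificationPipeline(model=model2, tokenizer=tokenizer2) #, return_all_scores=True)

model2.eval()  # make sure dropout is disabled for inference

y_pred = []
for p in tokenized_dataset['test']['tokens']:
    ti = tokenizer2(p, truncation=True, return_tensors="pt")
    with torch.no_grad():  # gradients are not needed at inference time
        out = model2(**ti)
    pred = torch.argmax(out.logits, dim=-1)
    y_pred.append(pred.item())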

In [30]:
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

y_test = tokenized_dataset['test']['label']

#print(confusion_matrix(y_test, y_pred))
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred, average='macro'))
print('Recall: ', recall_score(y_test, y_pred, average='macro'))
print('F1: ', f1_score(y_test, y_pred, average='macro'))

Accuracy:  0.6588693957115009
Precision:  0.6226469037298042
Recall:  0.5901132817799484
F1:  0.603815603241097
