# Binary classification model for detecting xenophobia in migration tweets

### Useful libraries

##### Hugging Face
Hugging Face is a data science community and platform that offers a wide range of tools for building and evaluating deep learning models. Its main features include [[Omer Mahmood @ Towards Data Science](https://towardsdatascience.com/whats-hugging-face-122f4e7eb11a#:~:text=Hugging%20Face%20is%20a%20community,(OS)%20code%20and%20technologies.)]:
* Tools that let users build, train and deploy models based on open-source code.
* A place where a broad community of data scientists, deep learning engineers and researchers can meet to share ideas, get support, contribute to projects and even share their models, whether trained or untrained.

<figure>
    <img src="./assets/images/hug.png"
         alt="Hugging Face logo">
    <figcaption>Hugging Face logo</figcaption>
</figure>

No model trained for the task of interest was found, so 12 models were trained and tested in order to select the best one. These 12 models were chosen because they had been trained on a task similar to xenophobia detection (here, hate speech detection) and/or had been trained to understand Spanish.

The model selected from this study was [RoBERTuito-base-uncased](https://arxiv.org/abs/2111.09453).

##### PyTorch
PyTorch is an open-source machine learning library *that accelerates the path from research prototyping to production deployment [[PyTorch](https://pytorch.org/)].*

<figure>
    <img src="./assets/images/pytorch.png"
         alt="PyTorch logo">
    <figcaption>PyTorch logo</figcaption>
</figure>

Some of the models available on Hugging Face are implemented on top of this library, so many PyTorch tools integrate easily with Hugging Face. This makes it possible to add to or modify the structure of a neural network, as well as its behavior, and to implement custom functions.

##### scikit-learn
scikit-learn is an open-source library that provides machine learning tools such as statistical and mathematical models, as well as the evaluation metrics commonly used with machine learning algorithms.
<figure>
    <img src="./assets/images/scikit.png"
         alt="scikit-learn logo"
         style="max-width: 20%; height: auto">
    <figcaption>scikit-learn logo</figcaption>
</figure>

This library makes it easy to implement the evaluation metrics for the model of interest.

In [1]:
#Imports

#HuggingFace library
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, Trainer, TrainingArguments
from datasets import Dataset, Value, ClassLabel, Features

#PyTorch Neural Networks
import torch
import torch.nn as nn

#data reading
import pandas as pd

#math
import numpy as np

#scikit-learn metrics
from sklearn.metrics import (
    confusion_matrix, accuracy_score, recall_score, precision_score, f1_score, classification_report
)

### Loading the pretrained model

In [2]:
#set model name
model_name = "pysentimiento/robertuito-base-uncased"
#download model from huggingface
model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2)

#Load tokenizer
#A tokenizer is an object that converts a sequence of characters into a sequence of numbers.
#It acts as a kind of filter that prepares the text so the model can process it.
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of the model checkpoint at pysentimiento/robertuito-base-uncased were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at pysentimiento/robertuito-base-uncased and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.out_
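
To make the tokenizer comment in the cell above concrete, here is a toy sketch of what "text to numbers" means. The vocabulary, the whitespace splitting and the `toy_encode` / `toy_decode` helpers are hypothetical illustrations; the real RoBERTuito tokenizer relies on a learned subword vocabulary that is not reproduced here.

```python
# Toy illustration only: a hypothetical 6-entry vocabulary and whitespace splitting.
TOY_VOCAB = {"<s>": 0, "</s>": 1, "<unk>": 2, "los": 3, "migrantes": 4, "llegan": 5}


def toy_encode(text):
    """Map a string to a list of integer ids, wrapped in start/end markers."""
    tokens = text.lower().split()
    # Words missing from the vocabulary fall back to the unknown-token id.
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["<unk>"]) for tok in tokens]
    return [TOY_VOCAB["<s>"]] + ids + [TOY_VOCAB["</s>"]]


def toy_decode(ids):
    """Map ids back to their token strings."""
    inverse = {i: tok for tok, i in TOY_VOCAB.items()}
    return " ".join(inverse[i] for i in ids)


encoded = toy_encode("Los migrantes llegan hoy")
print(encoded)              # [0, 3, 4, 5, 2, 1]
print(toy_decode(encoded))  # <s> los migrantes llegan <unk> </s>
```

The model never sees the characters themselves, only the ids, which is why the same tokenizer must be used for training and for inference.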

### Configuring the model

In [3]:
#Config model: assign names to each label
id2label = {0: 'ok', 1: 'hateful'}
model.config.id2label = id2label
model.config.label2id = {v: k for k, v in id2label.items()}

#Add special tokens to the tokenizer
tokenizer.add_tokens(['@usuario', 'url', 'hashtag', 'emoji'])
model.resize_token_embeddings(len(tokenizer))

Embedding(30002, 768)
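
The `Embedding(30002, 768)` output shows the embedding table after it was resized to hold one row per tokenizer entry. A minimal pure-Python sketch of the idea follows; `toy_add_tokens` and `toy_resize` are hypothetical helpers with made-up sizes, not the transformers implementation.

```python
import random


def toy_add_tokens(vocab, new_tokens):
    """Add tokens that are not yet in the vocabulary; return how many were added."""
    added = 0
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
            added += 1
    return added


def toy_resize(embeddings, new_size, dim, seed=0):
    """Keep the existing rows and append freshly initialised rows up to new_size."""
    rng = random.Random(seed)
    extra = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
             for _ in range(new_size - len(embeddings))]
    return embeddings + extra


# Hypothetical starting point: 3 known tokens, each with a 2-dimensional vector.
vocab = {"<s>": 0, "</s>": 1, "url": 2}
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

added = toy_add_tokens(vocab, ["@usuario", "url", "hashtag", "emoji"])
embeddings = toy_resize(embeddings, len(vocab), dim=2)
print(added, len(vocab), len(embeddings))  # 3 6 6
```

The new rows start untrained, so their vectors only become meaningful after fine-tuning.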

### Reading the processed data

In [4]:
data_train = pd.read_csv('xeno_train_unbalanced_retiquetado.csv')
data_train = data_train[:80]
data_test = pd.read_csv('xeno_test_unbalanced_retiquetado.csv')
data_valid = pd.read_csv('xeno_valid_unbalanced_retiquetado.csv')
data_train.dropna(subset=['text'],inplace=True)
data_train.sample(n=5, random_state=2022)

| Unnamed: 0 | id | text | date | label |
|---|---|---|---|---|
| 34 | 1186249127304349952 | Cómo #Venezolano le pido a los gobiernos de Br... | 2021-09-07 | 0.0 |
| 39 | 1340821499914160128 | @elcomercio_peru Que @FSagasti cierre las fron... | 2021-08-08 | 0.0 |
| 43 | 1154893184641769984 | No todos los migrantes son buenas personas 👇🏼 | 2021-10-14 | 1.0 |
| 47 | 1154971839921189888 | @GaelDiputada @sebastianpinera @nmonckeberg Hi... | 2021-09-29 | 1.0 |
| 5 | 1284129763662210048 | @ChileEnLibertad Chilenos marginales contra mi... | 2021-07-15 | 0.0 |


In [5]:
data_train[:3000]

| Unnamed: 0 | id | text | date | label |
|---|---|---|---|---|
| 0 | 1181977505433029888 | @MafeCarrascal Si,cómo no. Aquí no hay desplaz... | 2021-10-15 | 1.0 |
| 1 | 1136799389878560000 | @ASINEG @gdehoyoswalther @lopezobrador_ @Refor... | 2021-08-30 | 1.0 |
| 2 | 1266506225019160064 | @jguaido atención Sr Guaido, nadie le echa la ... | 2021-09-11 | 0.0 |
| 3 | 1214630095614350080 | He entrado a varios restaurantes donde han con... | 2021-07-27 | 1.0 |
| 4 | 1306026546004720128 | @allan_parada Pienso lo mismo. A mi casa tampo... | 2021-06-22 | 1.0 |
| ... | ... | ... | ... | ... |
| 75 | 1159894568831249920 | @m__casti El tema no va por ahí, Chile es un p... | 2021-09-04 | 0.0 |
| 76 | 1186292795537009920 | @Psicovivir Y ud de verdad cree que el verdade... | 2021-10-12 | 1.0 |
| 77 | 1345418348818010112 | @CHVNoticias La demanda por viviendas en las c... | 2021-08-29 | 1.0 |
| 78 | 1306237767412839936 | @javiernavia Claro si...los 4 años del pro era... | 2021-08-28 | 0.0 |


In [6]:
def tokenize(batch):
    """Tokenize text in current mini batch. This is a util function for get_dataset_from_dataframes function

    Args:
        batch (dict-like): Batch of examples that contains a 'text' column

    Returns:
        [BatchEncoding]: Tokenizer output (input_ids, attention_mask) for the batch
    """
    return tokenizer(batch['text'], padding=False, truncation=True)

def format_dataset(dataset):
    """Map text-label for specific dataset from pandas. This is a util function for get_dataset_from_dataframes function

    Args:
        dataset (datasets.arrow_dataset.Dataset): Dataset from pandas DataFrame

    Returns:
        [datasets.arrow_dataset.Dataset]: Mapped text-label dataset
    """
    def get_labels(examples):
        return {'labels': examples['label']}

    dataset = dataset.map(get_labels)
    return dataset

#Features to map into dataset
features = Features({
    'text': Value('string'),
    'label': ClassLabel(num_classes=2, names=['ok', 'hateful'])
    })

train_dataset = Dataset.from_pandas(data_train, features=features)
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=8)
train_dataset = format_dataset(train_dataset)

test_dataset = Dataset.from_pandas(data_test, features=features)
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=8)
test_dataset = format_dataset(test_dataset)

valid_dataset = Dataset.from_pandas(data_valid, features=features)
valid_dataset = valid_dataset.map(tokenize, batched=True, batch_size=8)
valid_dataset = format_dataset(valid_dataset)

#what happened?
#train_dataset[8290]
#tokenizer.decode(np.random.randint(0, 30002, size=20).tolist())



Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



In [7]:
#to be able to use batched training, we need to use a data collator
data_collator = DataCollatorWithPadding(tokenizer, padding='longest')
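
Because the tokenization step used `padding=False`, every example keeps its own length, and the collator pads each mini batch only up to its longest sequence. The sketch below illustrates that idea in pure Python; `pad_to_longest` is a hypothetical helper, the pad id `99` is made up, and right-side padding is an assumption (the real collator follows the tokenizer's own padding settings).

```python
def pad_to_longest(batch, pad_id):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two hypothetical tokenized tweets of different lengths.
batch = [[0, 17, 1], [0, 17, 42, 8, 1]]
padded = pad_to_longest(batch, pad_id=99)
print(padded["input_ids"])       # [[0, 17, 1, 99, 99], [0, 17, 42, 8, 1]]
print(padded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding per batch rather than to a global maximum keeps short tweets cheap to process.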

### Evaluation metrics

In [8]:
def compute_metrics(p):
    """Compute Accuracy, Precision, Recall and F1 metrics

    Args:
        p ([List]): List with calculated logits by model and real label per sample

    Returns:
        [dict]: dict with calculated metrics
    """
    pred, labels = p
    #Get class with most probability
    pred = np.argmax(pred, axis=-1)
    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    #Binary recall for class 0 (not xenophobic) and class 1 (xenophobic)
    recall_cls0 = recall_score(y_true=labels, y_pred=pred, pos_label=0, average='binary')
    recall_cls1 = recall_score(y_true=labels, y_pred=pred, pos_label=1, average='binary')
    #Binary precision for class 0 and class 1
    precision_cls0 = precision_score(y_true=labels, y_pred=pred, pos_label=0, average='binary')
    precision_cls1 = precision_score(y_true=labels, y_pred=pred, pos_label=1, average='binary')
    #Binary F1 for class 0 and class 1
    f1_cls0 = f1_score(y_true=labels, y_pred=pred, pos_label=0, average='binary')
    f1_cls1 = f1_score(y_true=labels, y_pred=pred, pos_label=1, average='binary')

    #F1 scores: macro (balanced data), micro and weighted (unbalanced data)
    f1_micro = f1_score(y_true=labels, y_pred=pred, average='micro')
    f1_macro = f1_score(y_true=labels, y_pred=pred, average='macro')
    f1_weight = f1_score(y_true=labels, y_pred=pred, average='weighted')

    return {'accuracy': accuracy, 'precision_cls1': precision_cls1, 'precision_cls0': precision_cls0,
            'recall_cls1': recall_cls1, 'recall_cls0': recall_cls0, 'f1_cls1': f1_cls1, 'f1_cls0': f1_cls0,
            'f1_micro': f1_micro, 'f1_macro': f1_macro, 'f1_weighted': f1_weight}
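
To show how the per-class and averaged F1 scores in `compute_metrics` relate to each other, here is a small worked example in pure Python. The labels and predictions are invented for illustration, and `per_class_scores` is a hypothetical helper rather than part of scikit-learn.

```python
def per_class_scores(y_true, y_pred, cls):
    """Precision, recall and F1 treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented data: 6 non-xenophobic (0) and 4 xenophobic (1) tweets.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

f1_by_class = {c: per_class_scores(y_true, y_pred, c)[2] for c in (0, 1)}
support = {c: y_true.count(c) for c in (0, 1)}

# Macro: plain mean over classes. Weighted: mean weighted by class support.
f1_macro = sum(f1_by_class.values()) / 2
f1_weighted = sum(f1_by_class[c] * support[c] for c in (0, 1)) / len(y_true)
# Micro: for single-label classification this equals accuracy.
f1_micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(round(f1_by_class[0], 3), round(f1_by_class[1], 3))             # 0.857 0.667
print(round(f1_macro, 3), round(f1_weighted, 3), round(f1_micro, 3))  # 0.762 0.781 0.8
```

In this example the minority class is the weak one, so the macro average is the lowest of the three because it gives both classes equal weight regardless of how many tweets they contain.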

In [12]:
training_args = TrainingArguments(
        output_dir = './',
        num_train_epochs = 3,
        per_device_train_batch_size = 8,
        evaluation_strategy='epoch',
        save_strategy='epoch',
        do_eval=True,
        logging_dir='./logs',
        load_best_model_at_end=True,
        bf16 = False,
        half_precision_backend = 'amp',
        #no metric_for_best_model is set, so checkpoints are compared by eval loss (lower is better)
        greater_is_better = False
    )

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
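
With the 80 training rows kept in the data-reading cell and the settings above, the number of optimizer updates follows from simple arithmetic. The sketch below assumes that all 80 rows survive `dropna`, a single device and no gradient accumulation; `total_steps` is a hypothetical helper, not a Trainer API.

```python
import math


def total_steps(num_examples, batch_size, epochs, grad_accum=1, n_devices=1):
    """Estimate how many optimizer updates a training run performs."""
    batches_per_epoch = math.ceil(num_examples / (batch_size * n_devices))
    # With gradient accumulation, several batches contribute to one update.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


print(total_steps(num_examples=80, batch_size=8, epochs=3))  # 30
```

Because `evaluation_strategy` and `save_strategy` are both `'epoch'`, evaluation and checkpointing would happen once every 10 of those updates.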


In [13]:
trainer_args = {
        'model': model,
        'args': training_args,
        'train_dataset': train_dataset,
        'eval_dataset': valid_dataset,
        'data_collator': data_collator,
        'tokenizer': tokenizer,
        #'compute_metrics': compute_metrics
    }

In [14]:
trainer = Trainer(**trainer_args)
trainer.train()

The following columns in the training set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 80
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 30



IndexError: index out of range in self
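
The `IndexError: index out of range in self` comes from an embedding lookup. One plausible cause, given the earlier tokenization warning that no truncation was applied, is a tweet whose token sequence is longer than the model's position-embedding table. The toy simulation below reproduces that failure mode in pure Python and shows how truncation avoids it. The table size of 130 and the offset of 2 are assumptions modelled on RoBERTa-style checkpoints and should be checked against `model.config.max_position_embeddings`.

```python
MAX_POSITIONS = 130   # assumed size of the position-embedding table
POSITION_OFFSET = 2   # assumed: RoBERTa-style position ids start after the padding index

# One toy row per position; a real table would hold learned vectors.
position_table = [[0.0] for _ in range(MAX_POSITIONS)]


def lookup_positions(token_ids):
    """Fetch one position row per token, as an embedding layer would."""
    return [position_table[POSITION_OFFSET + i] for i in range(len(token_ids))]


def truncate(token_ids, max_length):
    """Keep at most max_length tokens."""
    return token_ids[:max_length]


long_tweet = list(range(200))                 # 200 hypothetical token ids
max_length = MAX_POSITIONS - POSITION_OFFSET  # 128 usable positions

try:
    lookup_positions(long_tweet)
    failed = False
except IndexError as err:
    failed = True
    print("untruncated input fails:", err)

rows = lookup_positions(truncate(long_tweet, max_length))
print(len(rows))  # 128
```

In the notebook, the corresponding change would be to pass an explicit limit in `tokenize`, for example `tokenizer(batch['text'], padding=False, truncation=True, max_length=128)`. This is an untested suggestion based on the warning, not a confirmed fix.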