## Libraries

In [1]:
import numpy as np
import torch
import matplotlib.pyplot as plt
import os

## Download the data and pysentimiento

In [2]:
!gdown https://drive.google.com/uc?id=1GSrykEbhF9kJMfhj3Bkm0BiVkjoznR9i

Downloading...
From: https://drive.google.com/uc?id=1GSrykEbhF9kJMfhj3Bkm0BiVkjoznR9i
To: /content/dataMEXA3.zip
  0% 0.00/300k [00:00<?, ?B/s]100% 300k/300k [00:00<00:00, 92.8MB/s]


In [3]:
!unzip dataMEXA3.zip

Archive:  dataMEXA3.zip
   creating: dataMEXA3/
  inflating: dataMEXA3/mex20_test_full.txt  
  inflating: dataMEXA3/mex20_train.txt  
  inflating: dataMEXA3/mex20_train_labels.txt  
  inflating: dataMEXA3/mex20_val.txt  
  inflating: dataMEXA3/mex20_val_labels.txt  


In [4]:
!pip install pysentimiento

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pysentimiento
  Downloading pysentimiento-0.4.2-py3-none-any.whl (30 kB)
Collecting emoji<2.0.0,>=1.6.1
  Downloading emoji-1.7.0.tar.gz (175 kB)
Collecting transformers==4.13
  Downloading transformers-4.13.0-py3-none-any.whl (3.3 MB)
Collecting datasets<2.0.0,>=1.13.3
  Downloading datasets-1.18.4-py3-none-any.whl (312 kB)
Collecting sklearn<0.1,>=0.0
  Downloading sklearn-0.0.post1.tar.gz (3.6 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.many

## Dataset

The mexA3 class is created with three parameters:

*   The directory that contains all the data
*   The split to use (train or val)
*   The tokenizer

The preprocess_tweet function preprocesses the text of each tweet before it is tokenized (this step is specific to RoBERTuito).
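
The file names the class relies on follow the pattern visible in the unzip listing above. As a minimal sketch of that convention (the helper name mexa3_paths is hypothetical and not part of the notebook), the paths for each split can be resolved like this:

```python
import os

def mexa3_paths(data_dir, split):
    """Return (text_path, labels_path) for a split.

    Hypothetical helper: it only mirrors the naming seen in the archive,
    where the test split is stored as mex20_test_full.txt and has no labels.
    """
    if split not in ("train", "val", "test"):
        raise ValueError(f"unknown split: {split!r}")
    suffix = "_full" if split == "test" else ""
    text_path = os.path.join(data_dir, f"mex20_{split}{suffix}.txt")
    # The archive ships label files only for train and val.
    labels_path = None
    if split != "test":
        labels_path = os.path.join(data_dir, f"mex20_{split}_labels.txt")
    return text_path, labels_path

for split in ("train", "val", "test"):
    print(split, mexa3_paths("dataMEXA3", split))
```

Since the test split has no label file in the archive, the class would need use_labels = False for that split.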

In [21]:
# CREATE DATASET CLASS---------------------------------------------------------------------------------------------

import os
import torch
from torch.utils.data import Dataset
from pysentimiento.preprocessing import preprocess_tweet

class mexA3(Dataset):
    
    def __init__(self, Dir, split, tokenizer, use_labels = True, use_artifi = False):
        self.use_labels = use_labels
        self.use_artifi = use_artifi
        
        # The test split is stored as mex20_test_full.txt; train and val have no suffix
        if split != 'test':
            text_file   = os.path.join(Dir, 'mex20_' + split + '.txt')
        else:
            text_file   = os.path.join(Dir, 'mex20_' + split + '_full.txt')
        # One tweet per line; drop the trailing newline and close the file when done
        with open(text_file, encoding = 'utf-8') as f:
            self.text = [line.rstrip('\n') for line in f]
            
        # Labels are read only on request (the archive has no label file for the test split)
        if use_labels:
            labels_file = os.path.join(Dir, 'mex20_' + split + '_labels.txt')
            with open(labels_file) as f:
                self.labels = [int(line) for line in f]
        
        # RoBERTuito-specific preprocessing, then tokenization:
        # truncate to at most 128 tokens and pad to the longest sequence
        preprocessed   = [preprocess_tweet(tweet) for tweet in self.text]
        self.encodings = tokenizer(preprocessed, max_length = 128, truncation = True, padding = True)
        
    def __len__(self):
        return len(self.text)
    
    def __getitem__(self, idx):
        # Build one example as a dict of tensors, one entry per tokenizer output
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        #item['text']   = self.text[idx]
        if self.use_labels:
            if self.use_artifi:
                item['labels'] = [torch.tensor(self.labels[idx]), 1000]
            else:
                item['labels'] = torch.tensor(self.labels[idx])
        return item

Next, the pretrained RoBERTuito tokenizer is loaded using the AutoTokenizer class.

In [7]:
# GET TOKENIZER, VOCAB AND DIR OF DATASET--------------------------------------------------------------------------

from transformers import AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained('pysentimiento/robertuito-base-cased')
Dir       = "dataMEXA3"
vocab     = tokenizer.get_vocab()

id2w = {}
for w in vocab:
    id2w[vocab[w]] = w

Downloading:   0%|          | 0.00/319 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/150 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/809k [00:00<?, ?B/s]
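
The id2w dictionary built in the cell above inverts the token-to-id mapping returned by get_vocab(), so that ids can be mapped back to tokens. A small sketch of the same idea, using a made-up toy vocabulary instead of the real RoBERTuito one:

```python
# Toy vocabulary standing in for tokenizer.get_vocab(); tokens and ids are made up.
toy_vocab = {"<s>": 0, "</s>": 1, "hola": 2, "mundo": 3}

# Invert token -> id into id -> token (same result as the explicit loop over vocab).
toy_id2w = {idx: token for token, idx in toy_vocab.items()}

# Map a sequence of ids back to tokens.
ids = [0, 2, 3, 1]
tokens = [toy_id2w[i] for i in ids]
print(tokens)  # ['<s>', 'hola', 'mundo', '</s>']
```

The inversion is only well defined because every token has a unique id.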

Finally, the train and val dataset instances are created.

In [27]:
train_dataset = mexA3(Dir, 'train', tokenizer)
val_dataset   = mexA3(Dir, 'val'  , tokenizer, use_artifi = False)

## Model

HuggingFace provides the AutoModel class, which loads a pretrained HuggingFace model without having to specify its architecture directly.

Similarly, there are task-specific variants that load a pretrained model without specifying the architecture, such as AutoModelForSequenceClassification for classification.

In [14]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "pysentimiento/robertuito-base-cased",
    num_labels=2
)


Downloading:   0%|          | 0.00/677 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/415M [00:00<?, ?B/s]

Some weights of the model checkpoint at pysentimiento/robertuito-base-cased were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at pysentimiento/robertuito-base-cased and are newly initialized: ['classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.b
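
With num_labels=2, the classification head returns two raw scores (logits) per tweet, one for each class. As an illustrative sketch in plain Python, with made-up logit values, this is how such a pair of logits can be turned into probabilities and a predicted class:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one tweet, ordered as (class 0, class 1).
logits = [1.5, -1.0]
probs = softmax(logits)

# The predicted class is the index of the largest probability (same as the largest logit).
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

Softmax does not change the ordering of the scores, so taking the argmax over the logits directly gives the same prediction.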

In [15]:
def trainable_parameters_relation(model):
  total_params = 0
  train_params = 0
  for name, param in model.named_parameters():
    curr = np.array(param.shape).prod()
    total_params += curr
    if param.requires_grad:
      #print(name)
      train_params += curr
  
  return 100*train_params/total_params

In [16]:
print("\nTrainable parameters:", trainable_parameters_relation(model), "%")


Trainable parameters: 100.0 %
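
The result is 100 % because no parameter has been frozen. In PyTorch a parameter is frozen by setting its requires_grad attribute to False. The sketch below repeats the same computation in plain Python on a hypothetical list of parameters (the names, shapes and flags are made up and are not the real RoBERTuito ones) to show what the percentage looks like when only a small classification head stays trainable:

```python
import math

# Hypothetical (name, shape, requires_grad) entries; not the real model parameters.
params = [
    ("encoder.layer.weight", (768, 768), False),  # frozen
    ("encoder.layer.bias", (768,), False),        # frozen
    ("classifier.weight", (2, 768), True),        # trainable
    ("classifier.bias", (2,), True),              # trainable
]

def trainable_percentage(params):
    """Percentage of scalar parameters whose requires_grad flag is True."""
    total = sum(math.prod(shape) for _, shape, _ in params)
    trainable = sum(math.prod(shape) for _, shape, grad in params if grad)
    return 100 * trainable / total

pct = trainable_percentage(params)
print(f"{pct:.2f} %")
```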


## Training

HuggingFace provides a class called Trainer, which makes it easy to train its models in the most standard cases (such as this one, classification).

Before creating a Trainer instance, the parameters used for training must be defined. To do so, create an instance of the TrainingArguments class.

**Some advantages**: it makes sure the model and the data are on the same device (GPU or CPU) and, if more than one GPU is available, it uses all of them to train in parallel.

In [31]:
from transformers import TrainingArguments, EvalPrediction, Trainer

training_args = TrainingArguments(
    learning_rate               = 1e-4,
    #weight_decay                 = 0.01,
    num_train_epochs            = 5,
    per_device_train_batch_size = 32,
    per_device_eval_batch_size  = 32,
    logging_steps               = 100,
    output_dir                  = "./training_output",
    overwrite_output_dir        = True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns       = False,
)

def compute_accuracy(p: EvalPrediction):
  preds = np.argmax(p.predictions, axis=1)
  return {"acc": (preds == p.label_ids).mean()}

trainer = Trainer(
    model           = model,
    args            = training_args,
    train_dataset   = train_dataset,
    eval_dataset    = val_dataset,
    compute_metrics = compute_accuracy,
)
# Override the private GPU count so that training uses a single GPU
trainer.args._n_gpu = 1

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
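
The compute_accuracy function takes the argmax over the two logits of each example and compares it with the true label. A sketch of the same metric in plain Python, using made-up toy logits and labels rather than the model's actual outputs:

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of examples whose argmax prediction matches the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

# Toy values for illustration only.
toy_logits = [[2.0, -1.0], [-0.5, 0.7], [1.2, 0.3], [0.1, 0.9]]
toy_labels = [0, 1, 1, 1]

print(accuracy(toy_logits, toy_labels))  # 0.75 (the third example is misclassified)
```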


In [19]:
trainer.train()

***** Running training *****
  Num examples = 5278
  Num Epochs = 5
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 825


Step,Training Loss


KeyboardInterrupt: ignored
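
The 825 optimization steps reported in the log follow from the numbers printed next to them. A short worked calculation, assuming a single device and one gradient accumulation step as shown in the log:

```python
import math

num_examples = 5278   # "Num examples" in the log
batch_size = 32       # per_device_train_batch_size
num_epochs = 5        # num_train_epochs

# The last batch of each epoch is smaller, so the batch count is rounded up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 165 825
```

With logging_steps = 100, the training loss would be logged every 100 of those steps.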

In [32]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 587
  Batch size = 32


EvalPrediction(predictions=array([[ 0.21539104,  0.06757316],
       [ 1.5632176 , -1.1584018 ],
       [-0.6915312 ,  0.47481772],
       ...,
       [ 2.3253932 , -1.8887868 ],
       [ 1.3634019 , -1.0877801 ],
       [ 1.4555099 , -1.3061126 ]], dtype=float32), label_ids=array([1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0,
       1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 

{'eval_loss': 0.4406445622444153,
 'eval_acc': 1.0,
 'eval_runtime': 3.7464,
 'eval_samples_per_second': 156.683,
 'eval_steps_per_second': 5.072}

In [34]:
x = np.array([1,2,3])

In [35]:
x

array([1, 2, 3])