## Installing pysentimiento and AdapterHub

The two code cells in this section must be **run in the order shown**; otherwise errors will occur. This is due to an incompatibility between pysentimiento and AdapterHub: pysentimiento pins `transformers==4.13`, which requires `tokenizers<0.11`, whereas adapter-transformers requires `tokenizers>=0.11.1`, as the installation logs show.

The **pysentimiento** package is needed to preprocess the text when using the pretrained model RoBERTuito. This is an isolated case; with other models the tokenizer can be used directly.

In [None]:
!pip install pysentimiento

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pysentimiento
  Downloading pysentimiento-0.4.2-py3-none-any.whl (30 kB)
Collecting transformers==4.13
  Downloading transformers-4.13.0-py3-none-any.whl (3.3 MB)
Collecting emoji<2.0.0,>=1.6.1
  Downloading emoji-1.7.0.tar.gz (175 kB)
Collecting sklearn<0.1,>=0.0
  Downloading sklearn-0.0.tar.gz (1.1 kB)
Collecting datasets<2.0.0,>=1.13.3
  Downloading datasets-1.18.4-py3-none-any.whl (312 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64

The next cell downloads and installs **AdapterHub**. Because of the incompatibility with pysentimiento, pip may print some dependency-conflict errors when the installation finishes.

In [None]:
!pip install adapter-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting adapter-transformers
  Downloading adapter_transformers-3.1.0-py3-none-any.whl (4.8 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, adapter-transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.10.3
    Uninstalling tokenizers-0.10.3:
      Successfully uninstalled tokenizers-0.10.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
transformers 4.13.0 requires tokenizers<0.11,>=0.10.1, but you have tokenizers 0.12.1 which is incompatible.
Successf

Once the previous cell finishes running, **the runtime must be restarted**. To do so, open the *Runtime* menu and select *Restart runtime*. After the restart, **continue running from the Libraries section**.

## Libraries
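After the restart, it can be useful to confirm which versions actually ended up installed. The following is a minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the helper names `installed_versions` and `in_range` are hypothetical and not part of pysentimiento or AdapterHub. It also replays the conflict reported by pip in the log above, with a simplified comparison that assumes purely numeric, dot-separated version strings.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def in_range(version, low, high):
    """Check low <= version < high (simplification: numeric dotted versions only)."""
    def as_tuple(text):
        return tuple(int(part) for part in text.split("."))
    return as_tuple(low) <= as_tuple(version) < as_tuple(high)


# transformers 4.13.0 requires tokenizers>=0.10.1,<0.11 (taken from the pip log)
ok_before = in_range("0.10.3", "0.10.1", "0.11")  # tokenizers installed with pysentimiento
ok_after = in_range("0.12.1", "0.10.1", "0.11")   # tokenizers installed with adapter-transformers

print("tokenizers 0.10.3 satisfies the requirement:", ok_before)
print("tokenizers 0.12.1 satisfies the requirement:", ok_after)
print(installed_versions(["pysentimiento", "adapter-transformers", "tokenizers"]))
```

The second check fails, which is exactly the conflict pip reports once adapter-transformers replaces the tokenizers version.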

In [None]:
import numpy as np
import torch
import matplotlib.pyplot as plt
import os

## Downloading the data

In [None]:
!gdown https://drive.google.com/uc?id=1GSrykEbhF9kJMfhj3Bkm0BiVkjoznR9i

Downloading...
From: https://drive.google.com/uc?id=1GSrykEbhF9kJMfhj3Bkm0BiVkjoznR9i
To: /content/dataMEXA3.zip
  0% 0.00/300k [00:00<?, ?B/s]100% 300k/300k [00:00<00:00, 128MB/s]


In [None]:
!unzip dataMEXA3.zip

Archive:  dataMEXA3.zip
   creating: dataMEXA3/
  inflating: dataMEXA3/mex20_test_full.txt  
  inflating: dataMEXA3/mex20_train.txt  
  inflating: dataMEXA3/mex20_train_labels.txt  
  inflating: dataMEXA3/mex20_val.txt  
  inflating: dataMEXA3/mex20_val_labels.txt  


## Dataset

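The `mexA3` class is created with three main parameters:

*   The directory that contains all the data
*   The split to use (`train`, `val`, or `test`)
*   The tokenizer

An optional fourth parameter, `use_labels` (default `True`), controls whether the labels file is read. It should be set to `False` for the `test` split, which ships without a labels file.

The function `preprocess_tweet` preprocesses the tweet text before it is tokenized (this step is specific to RoBERTuito).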

In [None]:
# CREATE DATASET CLASS---------------------------------------------------------------------------------------------

import os
import torch
from torch.utils.data import Dataset
from pysentimiento.preprocessing import preprocess_tweet

class mexA3(Dataset):
    
    def __init__(self, Dir, split, tokenizer, use_labels = True):
        self.use_labels = use_labels
        
        # The test split is stored as 'mex20_test_full.txt'; train and val have no suffix
        if split != 'test':
            text_file   = os.path.join(Dir, 'mex20_' + split + '.txt')
        else:
            text_file   = os.path.join(Dir, 'mex20_' + split + '_full.txt')
        with open(text_file) as f:
            self.text  = [line for line in f]
            
        # Labels (one integer per line) are read only when requested
        if use_labels:
            labels_file = os.path.join(Dir, 'mex20_' + split + '_labels.txt')
            with open(labels_file) as f:
                self.labels = [int(line) for line in f]
        
        # Preprocess every tweet, then tokenize the whole split in a single call
        preprocessed   = [preprocess_tweet(tweet) for tweet in self.text]
        self.encodings = tokenizer(preprocessed, max_length = 128, truncation = True, padding = True)
        
    def __len__(self):
        return len(self.text)
    
    def __getitem__(self, idx):
        # Convert the encodings of example idx (input ids, attention mask, ...) to tensors
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        #item['text']   = self.text[idx]
        if self.use_labels:
            item['labels'] = torch.tensor(self.labels[idx])
        return item
        

Next, the pretrained RoBERTuito tokenizer is loaded using the `AutoTokenizer` class.

In [None]:
# GET TOKENIZER, VOCAB AND DIR OF DATASET--------------------------------------------------------------------------

from transformers import AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained('pysentimiento/robertuito-base-cased')
Dir       = "dataMEXA3"
vocab     = tokenizer.get_vocab()

id2w = {}
for w in vocab:
    id2w[vocab[w]] = w

Downloading tokenizer_config.json:   0%|          | 0.00/319 [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/809k [00:00<?, ?B/s]

Downloading special_tokens_map.json:   0%|          | 0.00/150 [00:00<?, ?B/s]

Finally, the `train` and `val` dataset instances are created.

In [None]:
train_dataset = mexA3(Dir, 'train', tokenizer)
val_dataset   = mexA3(Dir, 'val'  , tokenizer)
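
The `max_length = 128`, `truncation = True` and `padding = True` arguments passed to the tokenizer in the dataset class can be pictured with a plain-Python sketch. This is a simplified illustration, not the tokenizer's implementation: the helper `pad_and_truncate` and the token ids are hypothetical, the pad id of 1 is assumed from the `padding_idx=1` shown in the model printout further down, and real truncation also keeps the closing special token, which this sketch does not.

```python
def pad_and_truncate(sequences, max_length=128, pad_id=1):
    """Cut each sequence to max_length, then pad all of them to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up token id sequences of different lengths
batch = pad_and_truncate([[0, 11, 12, 13, 14, 2], [0, 21, 2]], max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the dataset class tokenizes a whole split in one call, `padding = True` pads every example to the longest sequence of that split (at most 128 tokens), not to the longest sequence of each training batch.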

## Model

Just as HuggingFace provides the `AutoModel` class, AdapterHub provides the analogous `AutoAdapterModel` class, which loads a pretrained HuggingFace model without having to specify the architecture. The difference is that the loaded model already includes all the methods needed to work with adapters.

In [None]:
from transformers import AutoAdapterModel

model = AutoAdapterModel.from_pretrained("pysentimiento/robertuito-base-cased")


Downloading config.json:   0%|          | 0.00/677 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/415M [00:00<?, ?B/s]

Some weights of the model checkpoint at pysentimiento/robertuito-base-cased were not used when initializing RobertaAdapterModel: ['lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaAdapterModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaAdapterModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaAdapterModel were not initialized from the model checkpoint at pysentimiento/robertuito-base-cased and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able

In [None]:
model

RobertaAdapterModel(
  (shared_parameters): ModuleDict()
  (roberta): RobertaModel(
    (shared_parameters): ModuleDict()
    (invertible_adapters): ModuleDict()
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(30000, 768, padding_idx=1)
      (position_embeddings): Embedding(130, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (key): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (value): Linear(
           

## Adapter

Models supported by AdapterHub have the `add_adapter` method, which adds an adapter to the model. It has two main parameters:

*   **adapter_name: [string]**, usually the name of the task the adapter solves
*   **config: [str, dict, AdapterConfig]**, the architecture of the adapter

AdapterHub supports several adapter architectures from recent papers. There are some predefined configurations such as PfeifferConfig (the default), HoulsbyConfig, ParallelConfig, etc. There are also around 400 pretrained adapters available for use.

Each **adapter** must have a **string label** so that it can be referred to later, since a single base model can hold more than one adapter. If the adapter will be used for classification, its corresponding **classification head must also be added with the same label**.
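
To give an intuition of what an adapter adds to the base model, the following is a minimal sketch of a generic bottleneck adapter: a down-projection, a non-linearity, an up-projection, and a residual connection. It is written in plain Python with tiny hypothetical matrices and without biases; it is not AdapterHub's implementation, and the predefined configurations differ in details such as where the module is placed inside each layer.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, value) for value in vector]


def bottleneck_adapter(hidden, w_down, w_up):
    """Return hidden + W_up @ relu(W_down @ hidden)."""
    bottleneck = relu(matvec(w_down, hidden))  # project to a smaller dimension
    update = matvec(w_up, bottleneck)          # project back to the hidden size
    return [h + u for h, u in zip(hidden, update)]


# Hypothetical hidden state of size 4 and a bottleneck of size 2
hidden = [1.0, -2.0, 0.5, 3.0]
w_down = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0]]
w_up = [[0.5, 0.0],
        [0.0, 0.5],
        [0.0, 0.0],
        [0.0, 0.0]]

adapted = bottleneck_adapter(hidden, w_down, w_up)
# With an all-zero up-projection the adapter leaves the hidden state unchanged
unchanged = bottleneck_adapter(hidden, w_down, [[0.0, 0.0] for _ in range(4)])
print(adapted, unchanged)
```

Only `w_down` and `w_up` would be trained in such a module, which is why an adapter needs far fewer trainable parameters than the base model.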

In [None]:
# name of the task
task_name = "mexA3"

# Add a new adapter
from transformers import ParallelConfig

model.add_adapter(
    adapter_name = task_name, 
    config       = ParallelConfig()
)

# Add a matching classification head
model.add_classification_head(
    head_name  = task_name,
    num_labels = 2,
    id2label   = { 0: "Neutro", 1: "Agresivo"}
)

In [None]:
model

RobertaAdapterModel(
  (shared_parameters): ModuleDict()
  (roberta): RobertaModel(
    (shared_parameters): ModuleDict()
    (invertible_adapters): ModuleDict()
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(30000, 768, padding_idx=1)
      (position_embeddings): Embedding(130, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (key): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (value): Linear(
           

At this point, an adapter and a classification head have been added. However, as the following cells show, all parameters are still trainable.

In [None]:
def trainable_parameters_relation(model):
  total_params = 0
  train_params = 0
  for name, param in model.named_parameters():
    curr = np.array(param.shape).prod()
    total_params += curr
    if param.requires_grad:
      #print(name)
      train_params += curr
  
  return 100*train_params/total_params

In [None]:
print("\nTrainable parameters:", trainable_parameters_relation(model), "%")


Trainable parameters: 100.0 %


The following cell applies two operations:


*   The first activates the adapter we are going to use, which forces the input to pass through the layers associated with the adapter and the classification head.
*   The second line is REQUIRED to train the adapter. One of its functions is to freeze the weights of the base model.



In [None]:
# Activate the adapter
model.set_active_adapters(task_name)
model.train_adapter(task_name)
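
The freezing step can be pictured with a toy example. The sketch below uses a plain dictionary with hypothetical parameter names and sizes instead of a real model, and the helper `freeze_all_but` is an assumption made for illustration; in the notebook, the effect is measured through each parameter's `requires_grad` flag.

```python
# Hypothetical parameter names and sizes (number of weights), for illustration only
params = {
    "base.encoder.weight": {"size": 900, "requires_grad": True},
    "adapters.mexA3.weight": {"size": 80, "requires_grad": True},
    "heads.mexA3.weight": {"size": 20, "requires_grad": True},
}


def trainable_percentage(params):
    """Percentage of weights that would still receive gradient updates."""
    total = sum(p["size"] for p in params.values())
    trainable = sum(p["size"] for p in params.values() if p["requires_grad"])
    return 100 * trainable / total


def freeze_all_but(params, task_name):
    """Keep only the parameters whose name mentions the task trainable."""
    for name, p in params.items():
        p["requires_grad"] = task_name in name


before = trainable_percentage(params)
freeze_all_but(params, "mexA3")
after = trainable_percentage(params)
print(before, after)
```

In this toy setting the trainable share drops from 100% to 10%, because only the adapter and its head remain trainable.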

The percentage of trainable parameters is checked again.

In [None]:
print("\nTrainable parameters:", trainable_parameters_relation(model), "%")


Trainable parameters: 6.597212378335209 %


## Training

HuggingFace provides a class called `Trainer`, which makes it simple to train its models in the most standard cases (such as classification, as in this case).

Before creating a `Trainer` instance, the parameters that will be used for training must be defined. To do that, create an instance of the `TrainingArguments` class.

**Some advantages**: `Trainer` makes sure the model and the data are on the same device (GPU or CPU) and, if more than one GPU is available, it uses all of them to train in parallel.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    learning_rate               = 1e-4,
    #weight_decay                 = 0.01,
    num_train_epochs            = 5,
    per_device_train_batch_size = 32,
    per_device_eval_batch_size  = 32,
    logging_steps               = 100,
    output_dir                  = "./training_output",
    overwrite_output_dir        = True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns       = False,
)

Analogously, AdapterHub has the `AdapterTrainer` class, which is designed to train adapters only. It is used in the same way as `Trainer`.

One of its advantages is that, if the weights of the base model are not frozen, attempting to train with `AdapterTrainer` raises an error.

In [None]:
from transformers import AdapterTrainer, EvalPrediction

def compute_accuracy(p: EvalPrediction):
  preds = np.argmax(p.predictions, axis=1)
  return {"acc": (preds == p.label_ids).mean()}

trainer = AdapterTrainer(
    model           = model,
    args            = training_args,
    train_dataset   = train_dataset,
    eval_dataset    = val_dataset,
    compute_metrics = compute_accuracy,
)
trainer.args._n_gpu = 1

In [None]:
trainer.train()

***** Running training *****
  Num examples = 5278
  Num Epochs = 5
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 825


| Step | Training Loss |
|------|---------------|
| 100  | 0.3608        |
| 200  | 0.3002        |
| 300  | 0.2455        |
| 400  | 0.1834        |
| 500  | 0.1639        |
| 600  | 0.113         |
| 700  | 0.0728        |
| 800  | 0.0579        |


Saving model checkpoint to ./training_output/checkpoint-500
Configuration saved in ./training_output/checkpoint-500/mexA3/adapter_config.json
Module weights saved in ./training_output/checkpoint-500/mexA3/pytorch_adapter.bin
Configuration saved in ./training_output/checkpoint-500/mexA3/head_config.json
Module weights saved in ./training_output/checkpoint-500/mexA3/pytorch_model_head.bin
Configuration saved in ./training_output/checkpoint-500/mexA3/head_config.json
Module weights saved in ./training_output/checkpoint-500/mexA3/pytorch_model_head.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=825, training_loss=0.18288082180601178, metrics={'train_runtime': 365.227, 'train_samples_per_second': 72.256, 'train_steps_per_second': 2.259, 'total_flos': 1758574969427640.0, 'train_loss': 0.18288082180601178, 'epoch': 5.0})
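
The 825 optimization steps reported in the log can be reproduced from the number of examples, the batch size, and the number of epochs. The following is a short check, assuming a single device and no gradient accumulation (as the log indicates); the checkpoint interval of 500 steps is an assumed default, since `save_steps` is not set in `TrainingArguments` above.

```python
import math

num_examples = 5278   # "Num examples" in the training log
batch_size = 32       # per_device_train_batch_size
num_epochs = 5        # num_train_epochs
logging_steps = 100   # logging_steps
save_steps = 500      # assumption: default checkpoint interval

# The last batch of each epoch is smaller, so the division is rounded up
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

logged_at = list(range(logging_steps, total_steps + 1, logging_steps))
saved_at = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch, total_steps)
print(logged_at)
print(saved_at)
```

This gives 165 steps per epoch and 825 steps in total, with losses logged at steps 100 through 800 and a single checkpoint at step 500, matching the output above.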

In [None]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 587
  Batch size = 32


{'eval_loss': 0.37353992462158203,
 'eval_acc': 0.9114139693356048,
 'eval_runtime': 4.2694,
 'eval_samples_per_second': 137.491,
 'eval_steps_per_second': 4.45,
 'epoch': 5.0}
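
The `eval_acc` value is the fraction of validation examples whose highest-scoring class matches the label, as defined in `compute_accuracy`. The following is a plain-Python sketch of the same computation with made-up logits; the helper names `argmax` and `accuracy` are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list (first one in case of ties)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose argmax equals the corresponding label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four examples and two classes (0: Neutro, 1: Agresivo)
toy_logits = [[2.0, -1.0], [0.1, 0.3], [1.2, 0.7], [-0.5, 0.4]]
toy_labels = [0, 1, 1, 1]
toy_acc = accuracy(toy_logits, toy_labels)

# The reported eval_acc is consistent with 535 correct predictions out of 587
reported = 535 / 587
print(toy_acc, reported)
```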

In [None]:
model.save_adapter("weights", "mexA3")

## Other

In [2]:
data = [1, 2]

other = list(data)

In [3]:
other[0] = 0
data

[1, 2]

In [4]:
inputs = ['nodejs', 'reactjs', 'vuejs']

for i in inputs:
  inputs.append(i.upper())

KeyboardInterrupt: ignored

In [7]:
"hola to TURING".capitalize()

'Hola to turing'

In [8]:
z = set('abc')

In [9]:
z

{'a', 'b', 'c'}

In [10]:
l = [1, 2, 3]
m = map(lambda x: 2**x, l)

In [12]:
list(m)

[2, 4, 8]

In [1]:
def f(x):
  x = 2return x

SyntaxError: ignored

In [2]:
i = 1
l = [2, 3]
while i in l:
  print(1) 

In [7]:
l1 = [1, 2, 3, 4]
l2 = [5, 6, 7]


In [10]:
l1.extend(l2)

In [11]:
l1

[1, 2, 3, 4, 5, 6, 7]

In [14]:
def func1():
  x = 50
  return x 

In [15]:
func1()
print(x)

NameError: ignored

In [18]:
l = [1, 2, 3, 5]

l.append(3, 4)

TypeError: ignored

In [21]:
"caca".join(["c", "a", "b"])

'ccacaacacab'

In [23]:
'The {0} side {1} {2}'.format('bright', 'of', 'life')

'The bright side of life'

In [24]:
import re

result = re.findall('Welcome to Turing', 'Welcome', 1)
print(result)

[]


In [25]:
t = '%(a)s %(b)s %(c)s'
print(t%dict(a='Welcome', b='to', c='Turing'))

Welcome to Turing


In [26]:
dict(a='Welcome', b='to', c='Turing')

{'a': 'Welcome', 'b': 'to', 'c': 'Turing'}

In [28]:
l.pop()

5

In [29]:
l

[1, 2, 3]