## Libraries

In [1]:
import numpy as np
import torch
import matplotlib.pyplot as plt
import os

## Download the data and pysentimiento

This step fetches the mexA3 data, a labeled dataset for the aggressiveness-detection task on Spanish-language tweets. In this notebook the data is accessed through a mounted Google Drive.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Install the required packages.

In [3]:
# Run this in a cell before importing transformers
!pip install --upgrade "transformers[torch]"
!pip install torch torchvision torchaudio



## Dataset

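The dataset class (defined below as `polar`) takes three required parameters:

*   The directory that contains all the data
*   The split to use (train or val)
*   The tokenizer

An optional fourth parameter, `use_labels`, controls whether the labels are loaded.

The `preprocess_tweet` function preprocesses the tweet text before it is tokenized. It is specific to RoBERTuito and is not called in the class below.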
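
To give a rough idea of what tweet preprocessing of this kind involves, the sketch below normalizes user mentions and URLs with regular expressions. It is a hypothetical, simplified stand-in written for illustration: the function name `simple_preprocess_tweet` and the placeholder tokens `@usuario` and `url` are assumptions made here, and the real `preprocess_tweet` from pysentimiento may behave differently.

```python
import re

# Hypothetical placeholder tokens, chosen for this illustration only
USER_TOKEN = "@usuario"
URL_TOKEN = "url"

def simple_preprocess_tweet(text: str) -> str:
    """Simplified stand-in for a tweet preprocessor (not pysentimiento's code)."""
    # Replace links with a single placeholder token
    text = re.sub(r"https?://\S+", URL_TOKEN, text)
    # Replace user mentions such as @juan_perez with a generic token
    text = re.sub(r"@\w+", USER_TOKEN, text)
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(simple_preprocess_tweet("@juan_perez  mira esto https://example.com/abc"))
```

Normalizing mentions and links this way keeps the tokenizer from spending vocabulary on handles and URLs that carry little signal for classification.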

In [4]:
# CREATE DATASET CLASS---------------------------------------------------------------------------------------------

import os
import pandas as pd
import torch
from torch.utils.data import Dataset

class polar(Dataset):

  def __init__(self, Dir, split, tokenizer, use_labels = True):
    self.use_labels = use_labels

    # Each split is stored as <Dir>/<split>.csv
    csv_file   = os.path.join(Dir, split + '.csv')

    self.df = pd.read_csv(csv_file)

    if use_labels:
      # Plain list so that positional indexing in __getitem__ is unambiguous
      self.labels    = self.df['polarization'].tolist()

    self.texts = self.df['text'].tolist()

    # Tokenize the whole split once; padding = True pads to the longest sequence
    self.encodings = tokenizer(
        self.texts,
        max_length = 128,
        truncation = True,
        padding = True
      )

  def __len__(self):
    return len(self.texts)

  def __getitem__(self, idx):
    item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    # Keep the raw text alongside the tensors
    item['text']   = self.texts[idx]
    if self.use_labels:
      item['labels'] = torch.tensor(self.labels[idx])
    return item
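
The tokenizer call in this class uses `max_length = 128`, `truncation = True` and `padding = True`. The following pure-Python sketch illustrates what those options mean for a batch of token-id lists. The helper `pad_and_truncate` and the `pad_id` value are hypothetical names for this illustration, not part of the `transformers` API, and the real tokenizer additionally handles special tokens.

```python
from typing import Dict, List

def pad_and_truncate(batch: List[List[int]], max_length: int, pad_id: int) -> Dict[str, List[List[int]]]:
    # Truncation: drop any ids beyond max_length
    truncated = [ids[:max_length] for ids in batch]
    # Padding: extend every sequence to the longest one in the batch
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

encoded = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6]], max_length=4, pad_id=1)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Because the class tokenizes the whole split in one call, every item returned by `__getitem__` ends up with the same sequence length.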

Next, the pretrained tokenizer is loaded with the `AutoTokenizer` class. Note that the cell below loads the `PlanTL-GOB-ES/roberta-base-bne` checkpoint, not RoBERTuito.

In [6]:
# GET TOKENIZER, VOCAB AND DIR OF DATASET--------------------------------------------------------------------------
from transformers import AutoTokenizer

# Use a dedicated Hugging Face cache directory
cache_dir = "/tmp/model_cache"
os.environ['TRANSFORMERS_CACHE'] = cache_dir

# Force a fresh download instead of reusing cached files
tokenizer = AutoTokenizer.from_pretrained(
    'PlanTL-GOB-ES/roberta-base-bne',
    cache_dir=cache_dir,
    force_download=True,
    local_files_only=False  # Make sure the files are downloaded if they are not available locally
)

Dir = "/content/proyecto"
vocab = tokenizer.get_vocab()
print(f"Tokenizer loaded. Vocabulary: {len(vocab)} tokens")

TypeError: expected str, bytes or os.PathLike object, not NoneType

Finally, the train and val dataset instances are created.

In [None]:
# The dataset class defined above is named `polar`
train_dataset = polar(Dir, 'train', tokenizer)
val_dataset   = polar(Dir, 'val'  , tokenizer)
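
The `polar` class reads `<Dir>/<split>.csv` and expects a `text` column and, when `use_labels` is true, a `polarization` column. The sketch below writes and reads back a tiny file with that layout using only the standard library. The two example rows and the temporary directory are made up for illustration.

```python
import csv
import os
import tempfile

# Made-up rows that follow the column layout the dataset class reads
rows = [
    {"text": "primer ejemplo", "polarization": 0},
    {"text": "segundo ejemplo", "polarization": 1},
]

data_dir = tempfile.mkdtemp()                    # stands in for Dir
csv_path = os.path.join(data_dir, "train.csv")   # <Dir>/<split>.csv

with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "polarization"])
    writer.writeheader()
    writer.writerows(rows)

with open(csv_path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

texts = [r["text"] for r in loaded]
labels = [int(r["polarization"]) for r in loaded]
print(texts, labels)
```

A `val.csv` file with the same columns would be needed in the same directory for the validation split.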

## Model

HuggingFace provides the `AutoModel` class, which loads a pretrained HuggingFace model without naming its architecture explicitly.

Similarly, there are task-specific variants that load a pretrained model without naming the architecture, such as `AutoModelForSequenceClassification` for classification.

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('PlanTL-GOB-ES/roberta-base-bne', num_labels = 2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at PlanTL-GOB-ES/roberta-base-bne and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model.to(device)

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50262, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
             

In [None]:
def trainable_parameters_relation(model):
  total_params = 0
  train_params = 0
  for name, param in model.named_parameters():
    curr = np.array(param.shape).prod() # [768, 768] -> 768*768
    total_params += curr
    if param.requires_grad:
      #print(name)
      train_params += curr

  return 100*train_params/total_params

In [None]:
print("\nTrainable parameters:", trainable_parameters_relation(model), "%")


Trainable parameters: 100.0 %
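
As a worked example of the computation inside `trainable_parameters_relation`, the sketch below applies the same shape-product logic to a few layer shapes taken from the model printout above. The `requires_grad` flags are hypothetical: they show how the percentage would change if the embedding matrices were frozen, which this notebook does not do, and the result refers only to this small subset of parameters.

```python
from math import prod

# (name, shape, requires_grad); shapes copied from the printout, flags are hypothetical
params = [
    ("word_embeddings.weight", (50262, 768), False),
    ("position_embeddings.weight", (514, 768), False),
    ("attention.self.query.weight", (768, 768), True),
    ("attention.self.query.bias", (768,), True),
]

def trainable_percentage(params):
    total = 0
    trainable = 0
    for _name, shape, requires_grad in params:
        count = prod(shape)  # e.g. (768, 768) -> 768 * 768
        total += count
        if requires_grad:
            trainable += count
    return 100 * trainable / total

print(f"{trainable_percentage(params):.2f} % of this subset is trainable")
```

With every flag set to `True`, as in the notebook's model, the function returns 100.0, which matches the output above.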


## Training

HuggingFace provides a class called `Trainer` that makes it easy to train its models in the most standard cases (such as classification, as here).

Before creating a `Trainer` instance, the training parameters must be defined by creating an instance of the `TrainingArguments` class.

**Some advantages**: it makes sure the model and the data are on the same device (GPU or CPU) and, if more than one GPU is available, it uses all of them to train in parallel.

In [None]:
from transformers import TrainingArguments, EvalPrediction, Trainer

training_args = TrainingArguments(
    learning_rate               = 1e-4,
    #weight_decay                 = 0.01,
    num_train_epochs            = 5,
    per_device_train_batch_size = 32,
    per_device_eval_batch_size  = 32,
    logging_steps               = 100,
    output_dir                  = "./training_output",
    overwrite_output_dir        = True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns       = False,
)

def compute_accuracy(p: EvalPrediction):
  preds = np.argmax(p.predictions, axis=1)
  return {"acc": (preds == p.label_ids).mean()}

trainer = Trainer(
    model           = model,
    args            = training_args,
    train_dataset   = train_dataset,
    eval_dataset    = val_dataset,
    compute_metrics = compute_accuracy,
)
trainer.args._n_gpu = 1
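
`compute_accuracy` takes the arg-max over the logits of each example and compares it with the gold label. Below is a small worked example of that computation in plain Python. The logits and labels are made-up numbers, not model outputs, and `accuracy_from_logits` is a name introduced only for this illustration.

```python
def accuracy_from_logits(logits, labels):
    # The index of the largest logit in each row is the predicted class
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.2], [1.5, 0.2]]
labels = [0, 1, 0, 0]
print(accuracy_from_logits(logits, labels))  # 3 of 4 predictions match -> 0.75
```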

In [None]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 100  | 0.4283        |
| 200  | 0.3087        |
| 300  | 0.2008        |
| 400  | 0.1072        |
| 500  | 0.073         |
| 600  | 0.0285        |
| 700  | 0.0194        |
| 800  | 0.0152        |


TrainOutput(global_step=825, training_loss=0.14323089307907855, metrics={'train_runtime': 375.6867, 'train_samples_per_second': 70.245, 'train_steps_per_second': 2.196, 'total_flos': 1247660291186400.0, 'train_loss': 0.14323089307907855, 'epoch': 5.0})
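
The logged `global_step=825` follows from the number of epochs, the batch size and the size of the training split, since one optimizer step is taken per batch when gradient accumulation is left at its default. The sketch below checks this arithmetic. The training-set size is an estimate inferred here from `train_samples_per_second × train_runtime / epochs`, not a number stated in the notebook, and the calculation assumes a single device.

```python
import math

def total_training_steps(n_examples: int, batch_size: int, epochs: int) -> int:
    # One optimizer step per batch; the last batch of an epoch may be smaller
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs

# Estimated training-set size from the logged throughput (an inference, not a stated value)
n_train_estimate = round(70.245 * 375.6867 / 5)

print(n_train_estimate, total_training_steps(n_train_estimate, batch_size=32, epochs=5))
```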

In [None]:
trainer.evaluate()

{'eval_loss': 0.6119934916496277,
 'eval_acc': 0.8858603066439523,
 'eval_runtime': 2.2527,
 'eval_samples_per_second': 260.579,
 'eval_steps_per_second': 8.434,
 'epoch': 5.0}
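
The evaluation metrics can be cross-checked with a little arithmetic. The sketch below estimates the size of the validation split from `eval_samples_per_second × eval_runtime` (an inference, not a number stated in the notebook) and derives how many correct predictions the reported accuracy implies.

```python
# Values copied from the evaluation output
eval_acc = 0.8858603066439523
eval_samples_per_second = 260.579
eval_runtime = 2.2527

# Estimated size of the validation split (inferred from the throughput)
n_val_estimate = round(eval_samples_per_second * eval_runtime)

# Number of correct predictions implied by the accuracy
n_correct = round(eval_acc * n_val_estimate)

print(n_val_estimate, n_correct, n_correct / n_val_estimate)
```

The gap between the last logged training loss (about 0.015) and the evaluation loss (about 0.61) is worth keeping in mind when reading these numbers, since it may indicate overfitting.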