## RoBERTuito for Text Classification

This notebook shows how to use [RoBERTuito](https://huggingface.co/pysentimiento/robertuito-base-uncased) for text classification tasks.

First, let's install the required packages.

In [None]:
!pip install pysentimiento transformers datasets accelerate evaluate


Let's load a dataset -- in this case, a Spanish sentiment analysis dataset from CardiffNLP.

In [None]:
from datasets import load_dataset

ds = load_dataset("cardiffnlp/tweet_sentiment_multilingual", "spanish")

ds


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1839
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 324
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 870
    })
})

In [None]:
ds["train"].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'neutral', 'positive'], id=None)}

## Load models

For this task, we use `robertuito-base-uncased` (there are two other versions: `robertuito-base-cased` and `robertuito-base-deacc`).

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "pysentimiento/robertuito-base-uncased"

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.model_max_length = 128


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at pysentimiento/robertuito-base-uncased and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



## Preprocessing

Before tokenizing the text, we have to apply the `preprocess_tweet` function to our data.


In [None]:
from pysentimiento.preprocessing import preprocess_tweet
preprocessed_ds = ds.map(lambda ex: {"text": preprocess_tweet(ex["text"], lang="es")})
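
To give a rough idea of what this kind of normalization involves, here is a simplified sketch. It is not the `pysentimiento` implementation: the function name `simple_preprocess` and the placeholder tokens `@usuario` and `url` are assumptions made for illustration, and the real `preprocess_tweet` covers more cases.

```python
import re

USER_TOKEN = "@usuario"  # assumed placeholder for user mentions
URL_TOKEN = "url"        # assumed placeholder for links

def simple_preprocess(text: str) -> str:
    """Simplified tweet normalization (illustrative only)."""
    # Replace links with a single placeholder token
    text = re.sub(r"https?://\S+", URL_TOKEN, text)
    # Replace user mentions with a single placeholder token
    text = re.sub(r"@\w+", USER_TOKEN, text)
    # Shorten runs of four or more repeated characters to three
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)
    return text.strip()

print(simple_preprocess("@juan miraaaaaa esto https://example.com"))
```

Collapsing handles and links into shared tokens keeps the vocabulary small and stops the model from memorizing individual accounts or URLs.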


## Tokenization

In [None]:
tokenized_ds = preprocessed_ds.map(
    lambda batch: tokenizer(batch["text"], padding=False, truncation=True),
    batched=True, batch_size=32
)
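
Because `padding=False` is used here, each example keeps its own length and padding is deferred to batch time (in this notebook, `DataCollatorWithPadding` handles that during training). The following is a minimal pure-Python sketch of that idea, not the collator's actual code; the helper `pad_batch`, the token ids, and the pad id `1` are assumptions for illustration.

```python
from typing import Dict, List

def pad_batch(sequences: List[List[int]], pad_id: int = 1) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[0, 52, 871, 2], [0, 9, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed global length avoids spending compute on padding tokens when most tweets are short.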


## Training

In [None]:
!pip install ipdb



In [None]:
import numpy as np
import evaluate

f1_metric = evaluate.load("f1")
recall_metric = evaluate.load("recall")

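def compute_metrics(eval_pred):
    # eval_pred holds the model logits and the reference labels
    logits, labels = eval_pred
    # Predicted class = index of the highest logit
    preds = np.argmax(logits, axis=-1)

    # Collect macro-averaged F1 and recall in a single dict
    results = {}
    results.update(f1_metric.compute(predictions=preds, references=labels, average="macro"))
    results.update(recall_metric.compute(predictions=preds, references=labels, average="macro"))
    return results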
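
To make `average="macro"` concrete: the score is computed per class, and the per-class values are then averaged with equal weight, regardless of how frequent each class is. Below is a small pure-Python sketch with made-up labels; the helper `macro_f1` is hypothetical and not part of `evaluate`.

```python
from typing import Sequence

def macro_f1(references: Sequence[int], predictions: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(references) | set(predictions))
    scores = []
    for c in classes:
        pairs = list(zip(references, predictions))
        tp = sum(1 for r, p in pairs if r == c and p == c)
        fp = sum(1 for r, p in pairs if r != c and p == c)
        fn = sum(1 for r, p in pairs if r == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); taken as 0 when the denominator is 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels for illustration only
refs = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(refs, preds), 4))
```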


In [None]:
!pip install accelerate -U



In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

training_args = TrainingArguments(
    per_device_train_batch_size=32,
    output_dir="test_trainer",
    do_eval=True,
    # Note: newer `transformers` releases rename this argument to `eval_strategy`
    evaluation_strategy="epoch",
    num_train_epochs=5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    compute_metrics=compute_metrics,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()



{'eval_loss': 0.6764400601387024, 'eval_f1': 0.7016023129632692, 'eval_recall': 0.7098765432098766, 'eval_runtime': 0.7742, 'eval_samples_per_second': 418.492, 'eval_steps_per_second': 52.957, 'epoch': 1.0}
{'eval_loss': 0.7022017240524292, 'eval_f1': 0.7109246517872125, 'eval_recall': 0.7129629629629629, 'eval_runtime': 0.7285, 'eval_samples_per_second': 444.738, 'eval_steps_per_second': 56.279, 'epoch': 2.0}
{'eval_loss': 0.8494974970817566, 'eval_f1': 0.7286901794129621, 'eval_recall': 0.7314814814814815, 'eval_runtime': 0.724, 'eval_samples_per_second': 447.542, 'eval_steps_per_second': 56.633, 'epoch': 3.0}
{'eval_loss': 0.9995766878128052, 'eval_f1': 0.6969559035735508, 'eval_recall': 0.7006172839506174, 'eval_runtime': 0.728, 'eval_samples_per_second': 445.037, 'eval_steps_per_second': 56.316, 'epoch': 4.0}
{'eval_loss': 1.0335071086883545, 'eval_f1': 0.7118672734235393, 'eval_recall': 0.7129629629629629, 'eval_runtime': 0.733, 'eval_samples_per_second': 442.024, 'eval_steps_per

TrainOutput(global_step=290, training_loss=0.33819240701609643, metrics={'train_runtime': 67.2168, 'train_samples_per_second': 136.796, 'train_steps_per_second': 4.314, 'train_loss': 0.33819240701609643, 'epoch': 5.0})
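
In the log above, `eval_loss` rises from about 0.68 to 1.03 over the five epochs while `eval_f1` peaks at epoch 3, which suggests that the final checkpoint is not necessarily the best one on the validation set. A small sketch that picks the best epoch from those logged values (the `history` list is copied by hand from that output, rounded to four decimals):

```python
# Validation metrics copied from the training log (rounded)
history = [
    {"epoch": 1.0, "eval_loss": 0.6764, "eval_f1": 0.7016},
    {"epoch": 2.0, "eval_loss": 0.7022, "eval_f1": 0.7109},
    {"epoch": 3.0, "eval_loss": 0.8495, "eval_f1": 0.7287},
    {"epoch": 4.0, "eval_loss": 0.9996, "eval_f1": 0.6970},
    {"epoch": 5.0, "eval_loss": 1.0335, "eval_f1": 0.7119},
]

# Select the epoch with the highest macro F1 on the validation split
best = max(history, key=lambda row: row["eval_f1"])
print(f"best epoch by macro F1: {best['epoch']:.0f} (eval_f1={best['eval_f1']:.4f})")
```

`TrainingArguments` has a `load_best_model_at_end` option (used together with `metric_for_best_model`) that can automate this kind of selection; it is not used in this notebook.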

In [None]:
trainer.evaluate(tokenized_ds["test"])


{'eval_loss': 1.0287749767303467,
 'eval_f1': 0.7205647843993468,
 'eval_recall': 0.7229885057471264,
 'eval_runtime': 2.1135,
 'eval_samples_per_second': 411.647,
 'eval_steps_per_second': 51.574,
 'epoch': 5.0}

**Sentiment Classification**

In [None]:
from datasets import load_dataset

# Load the dataset (the label names printed below are those of the CardiffNLP Spanish sentiment data)
dataset = load_dataset("cardiffnlp/tweet_sentiment_multilingual", "spanish")

# Get the label names from the dataset features
label_names = dataset['train'].features['label'].names

# Build the label_map
label_map = {i: label for i, label in enumerate(label_names)}

print(label_map)

{0: 'negative', 1: 'neutral', 2: 'positive'}
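
One way to use a mapping like `label_map` is to turn model outputs into readable labels. A short sketch with made-up logits follows; the `logits` values and the helper `to_label` are hypothetical.

```python
label_map = {0: "negative", 1: "neutral", 2: "positive"}

def to_label(logits_row):
    """Return the label name of the highest-scoring class."""
    best_id = max(range(len(logits_row)), key=lambda i: logits_row[i])
    return label_map[best_id]

# Made-up logits for two examples, one row per example
logits = [[2.1, 0.3, -1.0], [-0.5, 0.2, 1.7]]
print([to_label(row) for row in logits])
```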


In [None]:
import pandas as pd

# Load the CSV file
df = pd.read_csv("dataset.csv")

# Display the DataFrame
df.head()


Unnamed: 0,texto,clasificacion
0,Tenemos una reunión importante mañana.,0
1,¿Cuándo es la próxima reunión?,1
2,¿Puedes confirmar tu asistencia al evento?,2
3,El evento empezará a las 9:00 AM.,3
4,Es crucial completar este informe antes del vi...,0


In [None]:
# Install the required libraries
!pip install transformers
!pip install torch

# Import the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from transformers import AutoTokenizer, AutoModel
import torch

# Upload the CSV file
# (here it was uploaded directly to the runtime)

# from google.colab import files
# uploaded = files.upload()

# Read the CSV file
df = pd.read_csv("dataset.csv")

# Inspect the contents of the DataFrame
print(df.head())

# Split the data into features (X) and labels (y)
X = df['texto']
y = df['clasificacion']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Load the transformers tokenizer and model
# Note: `distilbert-base-uncased` is an English checkpoint, while the texts here are Spanish
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModel.from_pretrained('distilbert-base-uncased')

# Function to get DistilBERT embeddings (mean over all token positions)
def get_embeddings(text_list):
    inputs = tokenizer(text_list, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).numpy()

# Get embeddings for the training and test sets
X_train_embeddings = get_embeddings(X_train.tolist())
X_test_embeddings = get_embeddings(X_test.tolist())

# Train a scikit-learn model on the transformer representations
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_embeddings, y_train)

# Predict on the test set
y_pred = clf.predict(X_test_embeddings)

# Evaluate the model
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')


                                               texto  clasificacion
0             Tenemos una reunión importante mañana.              0
1                     ¿Cuándo es la próxima reunión?              1
2         ¿Puedes confirmar tu asistencia al evento?              2
3                  El evento empezará a las 9:00 AM.              3
4  Es crucial completar este informe antes del vi...              0



              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.00      0.00      0.00         2
           2       0.25      1.00      0.40         1

    accuracy                           0.25         4
   macro avg       0.08      0.33      0.13         4
weighted avg       0.06      0.25      0.10         4

Accuracy: 0.25
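
Note that `get_embeddings` averages over every token position, including the padding positions added by `padding=True`. A common variant weights the average by the attention mask so that padding is ignored. The sketch below shows the arithmetic on plain Python lists; `masked_mean` is a hypothetical helper, and with tensors the same idea would be applied to `outputs.last_hidden_state` and `inputs['attention_mask']`.

```python
from typing import List

def masked_mean(token_vectors: List[List[float]], mask: List[int]) -> List[float]:
    """Average token vectors, counting only positions where mask == 1."""
    dim = len(token_vectors[0])
    n_real = sum(mask)
    if n_real == 0:
        raise ValueError("mask selects no tokens")
    pooled = [0.0] * dim
    for vec, keep in zip(token_vectors, mask):
        if keep:
            for j in range(dim):
                pooled[j] += vec[j]
    return [value / n_real for value in pooled]

# Two real tokens followed by one padding position (toy values)
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
print(masked_mean(tokens, mask))
```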




In [None]:
# Function to predict the class of a new text
def predict_class(text):
    embedding = get_embeddings([text])
    prediction = clf.predict(embedding)
    return prediction[0]

# Example prediction on a new text
nuevo_texto = "¿Cuál es la agenda de la reunión?"
prediccion = predict_class(nuevo_texto)
print(f'The text: "{nuevo_texto}" is classified as: {prediccion}')

The text: "¿Cuál es la agenda de la reunión?" is classified as: 1


In [None]:
# Example prediction on a new text
nuevo_texto = "Es esencial revisar todos los detalles del proyecto"
prediccion = predict_class(nuevo_texto)
print(f'The text: "{nuevo_texto}" is classified as: {prediccion}')

The text: "Es esencial revisar todos los detalles del proyecto" is classified as: 0


In [None]:
# Example prediction on a new text
nuevo_texto = "Te confirmo la sesión para la siguiente semana"
prediccion = predict_class(nuevo_texto)
print(f'The text: "{nuevo_texto}" is classified as: {prediccion}')

The text: "Te confirmo la sesión para la siguiente semana" is classified as: 0


**Export the model for use with TensorFlow.js**

In [None]:
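# Save the model
# `model` here is the PyTorch DistilBERT encoder loaded above, which has no Keras-style `.save()`;
# `save_pretrained` writes its weights and config to a directory.
# Converting the result to a TensorFlow.js format is a separate step not shown here.
model.save_pretrained('path_to_saved_model')
tokenizer.save_pretrained('path_to_saved_model')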
