## RoBERTuito for Text Classification

This notebook shows how to use [RoBERTuito](https://huggingface.co/pysentimiento/robertuito-base-uncased) for text classification tasks.

First, let's install the required packages:

In [2]:
!pip install pysentimiento transformers datasets accelerate evaluate

Defaulting to user installation because normal site-packages is not writeable
Collecting pysentimiento
  Downloading pysentimiento-0.7.3-py3-none-any.whl (39 kB)
Collecting transformers
  Downloading transformers-4.41.1-py3-none-any.whl (9.1 MB)
Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting accelerate
  Downloading accelerate-0.30.1-py3-none-any.whl (302 kB)
Collecting evaluate
  Downloading evaluate-0.4.2-py3-none-any.whl (84 kB)


Let's load a dataset: the Spanish subset of CardiffNLP's `tweet_sentiment_multilingual` sentiment analysis dataset.

In [3]:
from datasets import load_dataset

ds = load_dataset("cardiffnlp/tweet_sentiment_multilingual", "spanish")

ds


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1839
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 324
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 870
    })
})

In [6]:
ds["train"].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'neutral', 'positive'], id=None)}
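
The `ClassLabel` feature above fixes the order of the class names, so integer predictions can be mapped back to readable labels. A minimal sketch, assuming the three names printed above; `label_names`, `id2label`, and `label2id` are illustrative variable names, not part of the original notebook:

```python
# Class names in the order reported by ds["train"].features["label"]
label_names = ["negative", "neutral", "positive"]

# Integer id -> name, and the inverse mapping
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label)
print(label2id["positive"])
```

Mappings like these can be handy when turning the `argmax` of the model's logits back into a label string.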

## Load models

For this task, we use `robertuito-base-uncased`. Two other versions are available: `robertuito-base-cased` and `robertuito-base-deacc`.

In [39]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "pysentimiento/robertuito-base-uncased"

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.model_max_length = 128

## Preprocessing

Before tokenizing our data, we have to apply the `preprocess_tweet` function to it.


In [5]:
from pysentimiento.preprocessing import preprocess_tweet
preprocessed_ds = ds.map(lambda ex: {"text": preprocess_tweet(ex["text"], lang="es")})

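`preprocess_tweet` normalizes tweets before they reach the tokenizer. The sketch below is only a rough approximation of that idea, not pysentimiento's implementation: the function name `approx_preprocess`, the regular expressions, and the replacement tokens (`@usuario`, `url`, at most three repeated characters) are assumptions made for illustration, and the real function covers more cases (hashtags, emojis, laughter).

```python
import re

# Illustrative patterns (assumptions, not the library's actual rules)
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{3,}")  # any character repeated 4+ times

def approx_preprocess(text, user_token="@usuario", url_token="url"):
    """Hypothetical, simplified stand-in for tweet normalization."""
    text = URL_RE.sub(url_token, text)    # replace links with a fixed token
    text = USER_RE.sub(user_token, text)  # replace user handles with a fixed token
    # shorten long character runs to three repetitions
    text = REPEAT_RE.sub(lambda m: m.group(1) * 3, text)
    return text

print(approx_preprocess("@juan mira https://t.co/abc holaaaaaa"))
```

For real data, the library's own `preprocess_tweet` should be used, as in the cell above.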

## Tokenization

In [7]:
tokenized_ds = preprocessed_ds.map(
    lambda batch: tokenizer(batch["text"], padding=False, truncation=True),
    batched=True, batch_size=32
)

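The tokenization call uses `padding=False`, so every example keeps its own length; padding is deferred to batch time, which is what the `DataCollatorWithPadding` in the training cell takes care of. A plain-Python sketch of that idea follows. `pad_batch` is a hypothetical helper, the token ids are made up, and the pad id of 1 is an assumption based on the `padding_idx=1` shown in the model printout later in this notebook. Real tokenizers also keep the closing special token when truncating, which this sketch ignores.

```python
def pad_batch(batch, pad_id=1, max_length=128):
    """Truncate each sequence to max_length, then pad to the longest one in the batch."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[0, 10, 11, 12, 2], [0, 13, 2]]  # made-up token ids
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to a global maximum keeps batches of short tweets small.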

## Training

In [12]:
!pip install ipdb scikit-learn

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
  Downloading scikit_learn-1.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.3 MB)
Collecting joblib>=1.2.0
  Downloading joblib-1.4.2-py3-none-any.whl (301 kB)
Collecting threadpoolctl>=3.1.0
  Downloading threadpoolctl-3.5.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.4.2 scikit-learn-1.5.0 threadpoolctl-3.5.0

In [13]:
import numpy as np
import evaluate

# Load the metric implementations once, outside compute_metrics
f1_metric = evaluate.load("f1")
recall_metric = evaluate.load("recall")

def compute_metrics(eval_pred):
    """Return macro-averaged F1 and recall for one evaluation pass."""
    logits, labels = eval_pred
    # Predicted class = index of the highest logit
    preds = np.argmax(logits, axis=-1)

    results = {}
    results.update(f1_metric.compute(predictions=preds, references=labels, average="macro"))
    results.update(recall_metric.compute(predictions=preds, references=labels, average="macro"))
    return results

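`compute_metrics` asks for `average="macro"`, meaning each class's score is computed separately and the per-class scores are then averaged with equal weight. The following plain-Python sketch shows that computation on toy data; `macro_scores` is a hypothetical helper, and it averages over a fixed `num_classes`, which is a simplification of how the metric libraries choose the label set.

```python
def macro_scores(preds, labels, num_classes):
    """Macro-averaged F1 and recall computed from per-class counts."""
    f1s, recalls = [], []
    for c in range(num_classes):
        pairs = list(zip(preds, labels))
        tp = sum(p == c and y == c for p, y in pairs)  # true positives
        fp = sum(p == c and y != c for p, y in pairs)  # false positives
        fn = sum(p != c and y == c for p, y in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
        recalls.append(recall)
    return {"f1": sum(f1s) / num_classes, "recall": sum(recalls) / num_classes}

# Toy predictions and references (made up for illustration)
scores = macro_scores(preds=[0, 1, 2, 2], labels=[0, 1, 1, 2], num_classes=3)
print(scores)
```

Because every class counts equally, a rare class that is predicted poorly pulls the macro score down as much as a frequent one would.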

In [14]:
!pip install accelerate -U



In [15]:
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

training_args = TrainingArguments(
    per_device_train_batch_size=32,
    output_dir="test_trainer",
    do_eval=True,
    evaluation_strategy="epoch",  # newer transformers releases name this argument eval_strategy
    num_train_epochs=5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    compute_metrics=compute_metrics,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()



{'eval_loss': 0.6776155233383179, 'eval_f1': 0.6929067147702548, 'eval_recall': 0.7006172839506174, 'eval_runtime': 89.5024, 'eval_samples_per_second': 3.62, 'eval_steps_per_second': 0.458, 'epoch': 1.0}
{'eval_loss': 0.7138878107070923, 'eval_f1': 0.6899220798095035, 'eval_recall': 0.6944444444444445, 'eval_runtime': 58.3122, 'eval_samples_per_second': 5.556, 'eval_steps_per_second': 0.703, 'epoch': 2.0}
{'eval_loss': 0.835770308971405, 'eval_f1': 0.7044238701527704, 'eval_recall': 0.70679012345679, 'eval_runtime': 55.4362, 'eval_samples_per_second': 5.845, 'eval_steps_per_second': 0.74, 'epoch': 3.0}
{'eval_loss': 0.9408085346221924, 'eval_f1': 0.6948477223539852, 'eval_recall': 0.6944444444444443, 'eval_runtime': 54.8063, 'eval_samples_per_second': 5.912, 'eval_steps_per_second': 0.748, 'epoch': 4.0}
{'eval_loss': 1.0213658809661865, 'eval_f1': 0.700978022950104, 'eval_recall': 0.7006172839506174, 'eval_runtime': 54.645, 'eval_samples_per_second': 5.929, 'eval_steps_per_second': 0.7

TrainOutput(global_step=290, training_loss=0.3424160398285964, metrics={'train_runtime': 2812.5582, 'train_samples_per_second': 3.269, 'train_steps_per_second': 0.103, 'train_loss': 0.3424160398285964, 'epoch': 5.0})
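
In the log above, `eval_loss` rises at every epoch (from about 0.68 to about 1.02) while `eval_f1` stays near 0.70, which suggests the model begins to overfit after the first epoch. The sketch below picks the best epoch from those logged values, rounded to four decimals; `eval_log` is an illustrative variable, not something the `Trainer` returns in this shape.

```python
# Per-epoch metrics copied (rounded) from the training log above
eval_log = [
    {"epoch": 1, "eval_loss": 0.6776, "eval_f1": 0.6929},
    {"epoch": 2, "eval_loss": 0.7139, "eval_f1": 0.6899},
    {"epoch": 3, "eval_loss": 0.8358, "eval_f1": 0.7044},
    {"epoch": 4, "eval_loss": 0.9408, "eval_f1": 0.6948},
    {"epoch": 5, "eval_loss": 1.0214, "eval_f1": 0.7010},
]

best_by_f1 = max(eval_log, key=lambda row: row["eval_f1"])
best_by_loss = min(eval_log, key=lambda row: row["eval_loss"])

print("best F1 at epoch", best_by_f1["epoch"])
print("lowest loss at epoch", best_by_loss["epoch"])
```

One option worth considering is to let the `Trainer` keep the best checkpoint instead of the last one, for example through the `load_best_model_at_end` and `metric_for_best_model` training arguments.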

In [16]:
trainer.evaluate(tokenized_ds["test"])


{'eval_loss': 0.9929825067520142,
 'eval_f1': 0.7152447854858499,
 'eval_recall': 0.7172413793103448,
 'eval_runtime': 136.3666,
 'eval_samples_per_second': 6.38,
 'eval_steps_per_second': 0.799,
 'epoch': 5.0}

**Sentiment Classification**

In [19]:
import pandas as pd

# Load the CSV file
df = pd.read_csv("dataset.csv")

# Display the DataFrame
df.head()


                                               texto  clasificacion
0             Tenemos una reunión importante mañana.              0
1                     ¿Cuándo es la próxima reunión?              1
2         ¿Puedes confirmar tu asistencia al evento?              2
3                  El evento empezará a las 9:00 AM.              3
4  Es crucial completar este informe antes del vi...              0


In [20]:
# Install the required libraries
!pip install transformers
!pip install torch

# Import the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from transformers import AutoTokenizer, AutoModel
import torch

# Upload the CSV file
# (I uploaded it directly)

# from google.colab import files
# uploaded = files.upload()

# Read the CSV file
df = pd.read_csv("dataset.csv")

# Inspect the DataFrame contents
print(df.head())

# Split the data into features (X) and labels (y)
X = df['texto']
y = df['clasificacion']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Load the tokenizer and encoder used to compute embeddings.
# Separate names keep the fine-tuned RoBERTuito `model` and `tokenizer` from being overwritten.
emb_tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
emb_model = AutoModel.from_pretrained('distilbert-base-uncased')

# Function to get mean-pooled DistilBERT embeddings
def get_embeddings(text_list):
    inputs = emb_tokenizer(text_list, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = emb_model(**inputs)
    # Average the token vectors of each text into a single vector
    return outputs.last_hidden_state.mean(dim=1).numpy()

# Get embeddings for the training and test sets
X_train_embeddings = get_embeddings(X_train.tolist())
X_test_embeddings = get_embeddings(X_test.tolist())

# Train a scikit-learn model on the transformer representations
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_embeddings, y_train)

# Predict on the test set
y_pred = clf.predict(X_test_embeddings)

# Evaluate the model
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
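
`get_embeddings` averages `last_hidden_state` over every position, so in a padded batch the vectors at padding positions also enter the mean. A common variant is to average only the positions where the attention mask is 1. The plain-Python sketch below contrasts the two on made-up numbers; `plain_mean` and `masked_mean` are hypothetical helpers, not part of the notebook.

```python
def plain_mean(token_vectors):
    """Average over all positions, padding included."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def masked_mean(token_vectors, attention_mask):
    """Average only the positions whose attention mask is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    n = max(len(kept), 1)  # guard against an all-zero mask
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / n for i in range(dim)]

# Two real tokens followed by two padding positions (values are made up)
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]

print(plain_mean(tokens))
print(masked_mean(tokens, mask))
```

With tensors, the same idea amounts to multiplying the hidden states by the mask, summing over the sequence axis, and dividing by the number of unmasked tokens.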


                                               texto  clasificacion
0             Tenemos una reunión importante mañana.              0
1                     ¿Cuándo es la próxima reunión?              1
2         ¿Puedes confirmar tu asistencia al evento?              2
3                  El evento empezará a las 9:00 AM.              3
4  Es crucial completar este informe antes del vi...              0



              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.00      0.00      0.00         2
           2       0.25      1.00      0.40         1

    accuracy                           0.25         4
   macro avg       0.08      0.33      0.13         4
weighted avg       0.06      0.25      0.10         4

Accuracy: 0.25
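
The report covers only 4 test examples, and only classes 0, 1, and 2 appear in it, so these numbers say little about real performance. As a reading aid, the sketch below recomputes the `macro avg` and `weighted avg` rows from the per-class rows printed above; `rows`, `macro`, and `weighted` are illustrative names.

```python
# (precision, recall, f1-score, support) per class, copied from the report above
rows = {
    0: (0.00, 0.00, 0.00, 1),
    1: (0.00, 0.00, 0.00, 2),
    2: (0.25, 1.00, 0.40, 1),
}
total_support = sum(r[3] for r in rows.values())

def macro(col):
    """Unweighted mean of one metric column across classes."""
    return sum(r[col] for r in rows.values()) / len(rows)

def weighted(col):
    """Mean of one metric column weighted by each class's support."""
    return sum(r[col] * r[3] for r in rows.values()) / total_support

for name, col in [("precision", 0), ("recall", 1), ("f1-score", 2)]:
    print(f"{name:>9}  macro={macro(col):.2f}  weighted={weighted(col):.2f}")
```

The macro average treats the three classes equally, while the weighted average gives class 1 twice the influence because it has two of the four test examples.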




In [21]:
# Function to predict the class of a new text
def predict_class(text):
    embedding = get_embeddings([text])
    prediction = clf.predict(embedding)
    return prediction[0]

# Example prediction on a new text
nuevo_texto = "¿Cuál es la agenda de la reunión?"
prediccion = predict_class(nuevo_texto)
print(f'The text "{nuevo_texto}" is classified as: {prediccion}')

The text "¿Cuál es la agenda de la reunión?" is classified as: 1


In [22]:
# Example prediction on a new text
nuevo_texto = "Es esencial revisar todos los detalles del proyecto"
prediccion = predict_class(nuevo_texto)
print(f'The text "{nuevo_texto}" is classified as: {prediccion}')

The text "Es esencial revisar todos los detalles del proyecto" is classified as: 0


In [23]:
# Example prediction on a new text
nuevo_texto = "Te confirmo la sesión para la siguiente semana"
prediccion = predict_class(nuevo_texto)
print(f'The text "{nuevo_texto}" is classified as: {prediccion}')

The text "Te confirmo la sesión para la siguiente semana" is classified as: 0


**Exporting the model for use with TensorFlow.js**

In [41]:
# Save the model and the tokenizer to a directory
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer
import tensorflow as tf

# NOTE: `model` is still the PyTorch RobertaForSequenceClassification loaded earlier,
# not a TensorFlow model, so tf.saved_model.save rejects it with the ValueError in the output.

# Forward-pass function
@tf.function(input_signature=[tf.TensorSpec(shape=[None, None], dtype=tf.int32, name="input_ids"),
                              tf.TensorSpec(shape=[None, None], dtype=tf.int32, name="attention_mask")])
def serving(input_ids, attention_mask):
    return model([input_ids, attention_mask])

# Export the model in SavedModel format
model_save_path = "saved_model"
tf.saved_model.save(model, model_save_path, signatures={"serving_default": serving})


ValueError: Expected an object of type `Trackable`, such as `tf.Module` or a subclass of the `Trackable` class, for export. Got RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(30000, 768, padding_idx=1)
      (position_embeddings): Embedding(130, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (classifier): RobertaClassificationHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (out_proj): Linear(in_features=768, out_features=3, bias=True)
  )
) with type <class 'transformers.models.roberta.modeling_roberta.RobertaForSequenceClassification'>.

In [27]:
!pip install tensorflowjs

Defaulting to user installation because normal site-packages is not writeable
Collecting tensorflowjs
  Downloading tensorflowjs-4.19.0-py3-none-any.whl (89 kB)
Collecting flax>=0.7.2
  Downloading flax-0.8.4-py3-none-any.whl (698 kB)
Collecting importlib_resources>=5.9.0
  Downloading importlib_resources-6.4.0-py3-none-any.whl (38 kB)
Collecting jax>=0.4.13
  Downloading jax-0.4.28-py3-none-any.whl (1.9 MB)
Collecting jaxlib>=0.4.13
  Downloading jaxlib-0.4.28-cp311-cp311-

In [37]:
!tensorflowjs_converter --input_format=tf_saved_model --output_format=tfjs_graph_model path_to_saved_model model-js



2024-05-25 21:24:45.879418: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/home/hectormtz/.local/bin/tensorflowjs_converter", line 8, in <module>
    sys.exit(pip_main())
             ^^^^^^^^^^
  File "/home/hectormtz/.local/lib/python3.11/site-packages/tensorflowjs/converters/converter.py", line 959, in pip_main
    main([' '.join(sys.argv[1:])])
  File "/home/hectormtz/.local/lib/python3.11/site-packages/tensorflowjs/converters/converter.py", line 963, in main
    convert(argv[0].split(' '))
  File "/home/hectormtz/.local/lib/python3.11/site-packages/tensorflowjs/converters/converter.py", line 949, in convert
    _dispatch_converter(input_format, output_format, args, quantization_dtype_map,
  File "/home/he