[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/PLN-disca-iimas/DigitalHumanitiesSchool/blob/main/demo-transformers_classification.ipynb)

# Text classification with 🤗 Transformers

> How to fine-tune a DistilBERT model to classify tweets from the TASS corpus.

## Setup

If you are running this notebook in Google Colab, run the following cell to install the libraries we need:

In [1]:
!pip install datasets
!pip install evaluate
!pip install transformers==4.28.0

Collecting datasets
  Downloading datasets-2.13.1-py3-none-any.whl (486 kB)
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.14-py310-none-any.whl (134 kB)

To share your model with the community, first create an account on the [Hugging Face Hub](https://huggingface.co/join). Then run the following cell and, when prompted, paste an access token generated at https://huggingface.co/settings/tokens. The CLI asks for a token rather than a user name and password, and the token needs write permission to push a model:

In [None]:
# This only works in Google Colab! In a regular notebook, run this command in a terminal instead
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the '

If you do not have [Git LFS](https://git-lfs.github.com) installed, you can install it by running the cell below. Replace the e-mail address and user name in the `git config` lines with your own:

In [None]:
!apt install git-lfs
!git config --global user.email "helena.adorno@gmail.com"
!git config --global user.name "helenpy"

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.9.2-1).
0 upgraded, 0 newly installed, 0 to remove and 16 not upgraded.


In [None]:
# Only needed when running in Colab
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


## Load and explore the data

We will use 🤗 Datasets to load and process our dataset.

In [None]:
from datasets import Dataset, DatasetDict, Features, ClassLabel, Value
import pandas as pd

# Read the files that hold the training and validation data
data = pd.read_csv('/content/gdrive/MyDrive/Meia2023/Modulo2-ClasificacionTextos/corpusTASS-2020/train.tsv', sep='\t')
data_dev = pd.read_csv('/content/gdrive/MyDrive/Meia2023/Modulo2-ClasificacionTextos/corpusTASS-2020/dev.tsv', sep='\t')

# Drop the columns we do not use
data = data.drop(columns=['id', 'pais'])
data_dev = data_dev.drop(columns=['id', 'pais'])

# Define the structure of the Hugging Face dataset.
# The order of `names` fixes the integer ids (P -> 0, N -> 1, NEU -> 2),
# so no separate label-mapping dictionary is needed.
features = Features({
    'texto': Value('string'),
    'etiqueta': ClassLabel(num_classes=3, names=['P', 'N', 'NEU'])
})

# Convert the DataFrames into Datasets
dataset_train = Dataset.from_pandas(data, features=features)
dataset_dev = Dataset.from_pandas(data_dev, features=features)

# Rename the columns to the conventional "text" and "label"
dataset_train = dataset_train.rename_column("texto", "text")
dataset_train = dataset_train.rename_column("etiqueta", "label")
dataset_dev = dataset_dev.rename_column("texto", "text")
dataset_dev = dataset_dev.rename_column("etiqueta", "label")

# Build the dataset dictionary (the dev split is stored under the "test" key)
dataset = DatasetDict({'train': dataset_train, 'test': dataset_dev})
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 4802
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2443
    })
})
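The integer ids stored in the `label` column follow the order of `names` in the `ClassLabel` feature. The following pure-Python sketch mimics that encoding. The dictionaries `str2int` and `int2str` and the sample labels are written here for illustration only; they are not the 🤗 Datasets objects.

```python
# Order copied from the ClassLabel definition above
names = ['P', 'N', 'NEU']

# Illustrative lookup tables (hypothetical helpers, not library objects)
str2int = {name: idx for idx, name in enumerate(names)}
int2str = dict(enumerate(names))

raw_labels = ['NEU', 'P', 'N', 'P']  # made-up sample labels
encoded = [str2int[label] for label in raw_labels]
decoded = [int2str[idx] for idx in encoded]

print(encoded)
print(decoded)
```

Changing the order of `names` changes every id, so the same order has to be used whenever predictions are mapped back to label strings.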

In [None]:
import random
import pandas as pd
from datasets import ClassLabel
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    "Taken from https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb"

    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(dataset['train'])

Unnamed: 0,text,label
0,Mejor me duermo buenas noches,NEU
1,Aquí todo tranquilo y allá quien sabe,NEU
2,.@UGMadrid la recomendación en nuestra. ¡Nada mejor que una buena partida de paintball!,P
3,@Crisnosesae @l30n4_ @Beatrice_Val en serio? No sabía eso de ella se me ha caído su imagen a tope,N
4,@Javiercd18 Jajaja es que te pones a buscar defectos en una buena iniciativa de redes Yo también te quiero y me haces falta.,P
5,"El porno es motivación, no lo pueden prohibir",P
6,@ajbodoc ¡Armar un buen equipo es espectacular! Tendremos esta propuesta en cuenta en próximas actualizaciones,P
7,"Mi último tweet del año. 2017 va a ser un año muy perro en muchos ámbitos. Pero puedo con él, va a traer muchas cosas buenas",P
8,Lo unico que van a hacer es que me vuelva mamona con todos como antes,N
9,@NoilyMV yo soy totalmente puntual,NEU


In [None]:
dataset.set_format("pandas")
df = dataset['train'][:]
df.head()

Unnamed: 0,text,label
0,@morbosaborealis jajajaja... eso es verdad... ...,1
1,@Adriansoler espero y deseo que el interior te...,2
2,"comprendo que te molen mis tattoos, pero no te...",2
3,"Mi última partida jugada, con Sona support. La...",0
4,Tranquilos que con el.dinero de Camacho seguro...,0


In [None]:
dataset.reset_format()

In [None]:
show_random_elements(dataset['train'], num_examples=3)

Unnamed: 0,text,label
0,En serio tengo muchas ganas de verte Te extraño demasiadoo..!,N
1,"Quieras o no, vivir rodeada de mar, te influye. Paso un par de semanas sin ir a la playa y ya me hace falta.",N
2,Recibir una sorpresa inesperada y pasar una noche increible no tiene precio,P


## Tokenize the tweets

In [None]:
from transformers import AutoTokenizer

# Alternative checkpoint: "vg055/roberta-base-bne-finetuned-Tass2020"
model_checkpoint = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


### Tokenizing a sample sentence

In [None]:
tokenizer.vocab_size

119547

In [None]:
text = "¡hola, estamos muy felices practicando la tokenizacion!"
tokenized_text = tokenizer.encode(text)

for token in tokenized_text:
    print(token, tokenizer.decode([token]))

101 [CLS]
199 ¡
110516 hol
10113 ##a
117 ,
11504 esta
13386 ##mos
13436 muy
13077 fel
39801 ##ices
56309 prac
13640 ##tica
10605 ##ndo
10109 la
18436 tok
18687 ##eni
104679 ##zaci
10263 ##on
106 !
102 [SEP]
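The `##` prefix in the output marks a piece that continues the previous one. The sketch below is a simplified version of the greedy longest-match-first splitting that WordPiece-style tokenizers apply to each word. The function `wordpiece` and the tiny `toy_vocab` are hypothetical and stand in for the real tokenizer and its 119,547-entry vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, made up for this example
toy_vocab = {"hol", "##a", "fel", "##ices", "muy"}
for w in ["hola", "felices", "muy", "xyz"]:
    print(w, wordpiece(w, toy_vocab))
```

The real tokenizer also normalizes the text, splits on punctuation, and adds the `[CLS]` and `[SEP]` special tokens, none of which this sketch covers.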


In [None]:
encoded_text = tokenizer(text, return_tensors="pt")
encoded_text

{'input_ids': tensor([[   101,    199, 110516,  10113,    117,  11504,  13386,  13436,  13077,
          39801,  56309,  13640,  10605,  10109,  18436,  18687, 104679,  10263,
            106,    102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [None]:
def tokenize_reviews(examples):
    return tokenizer(examples["text"], truncation=True)

In [None]:
columns = dataset['train'].column_names
columns.remove("label")
encoded_dataset = dataset.map(tokenize_reviews, batched=True, remove_columns=columns)
encoded_dataset


DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 4802
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 2443
    })
})

In [None]:
encoded_dataset['train'][0]

{'label': 1,
 'input_ids': [101, 137, 24984, 19804, 107956, 33269, 10201, 10320, 10320, 10320, 119, 119, 119, 36584, 10196, 79381, 119, 119, 119, 36579, 10192, 13605, 11381, 10854, 13819, 10133, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
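Because `tokenize_reviews` truncates but does not pad, the encoded examples have different lengths, and they have to be padded to a common length when they are grouped into a batch. The sketch below shows the idea in plain Python. The helper `pad_batch` is hypothetical, and the padding id `0` is an assumption about this tokenizer's `[PAD]` token.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in the batch and build the masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two short sequences built from token ids seen in the sample sentence
batch = [[101, 13436, 102], [101, 199, 110516, 10113, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to a global maximum keeps batches of short tweets small.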

## Load the pretrained model

In [None]:
from transformers import AutoModelForSequenceClassification

num_labels = 3
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)


Some weights of the model checkpoint at distilbert-base-multilingual-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.we

### From input IDs to logits

In [None]:
outputs = model(**encoded_text)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-0.1350,  0.1222,  0.1086]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
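The model returns one raw score (logit) per class. To read the logits as probabilities, a softmax can be applied. The minimal sketch below uses only the standard library and the logit values printed above; the helper `softmax` is written for illustration.

```python
import math

logits = [-0.1350, 0.1222, 0.1086]  # values from the output above

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs], predicted_class)
```

The three probabilities are close to one another, which is consistent with the warning above that the classification head is newly initialized and has not been trained yet.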

## Define the performance metrics

In [None]:
import evaluate

metric = evaluate.load("accuracy")
metric

EvaluationModule(name: "accuracy", module_type: "metric", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (`list` of `float`): Sample weights Defaults to None.

Returns:
    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`.. A higher score means higher accuracy.

Examples:

    Example 1-A simple example
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
        >>> print(results)
    

In [None]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
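To make explicit what `compute_metrics` computes, here is a rough pure-Python equivalent: take the index of the largest logit in each row, then count the fraction of matches with the reference labels. The helper names and the toy values are made up for this example.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy logits for four examples and three classes (made-up values)
toy_logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.1, 0.9, 0.4], [0.3, 0.2, 1.1]]
toy_labels = [0, 1, 2, 2]
score = accuracy(toy_logits, toy_labels)
print(score)
```

Accuracy treats all classes equally, so on an imbalanced corpus it can hide poor performance on a minority class.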

In [None]:
import transformers
transformers.__version__

'4.28.0'

## Fine-tune the pretrained model

In [None]:
from transformers import TrainingArguments

model_name = model_checkpoint.split("/")[-1]

batch_size = 12
num_train_epochs = 3
train_dataset = encoded_dataset["train"]
# Logging interval, in optimizer steps
logging_steps = len(train_dataset) // (2 * batch_size * num_train_epochs)

training_args = TrainingArguments(
    output_dir="results-meia_2",
    num_train_epochs=num_train_epochs,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_steps=logging_steps,
    push_to_hub=True,
    # `hub_model_id` replaces the deprecated `push_to_hub_model_id`
    hub_model_id=f"{model_name}-finetuned-tass",
)
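The step counts implied by these settings can be worked out by hand. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept. The number of training examples comes from the dataset summary above.

```python
import math

num_examples = 4802      # rows in the training split
batch_size = 12
num_train_epochs = 3

# One optimizer step per batch; the last, smaller batch still counts
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Same formula as in the cell above
logging_steps = num_examples // (2 * batch_size * num_train_epochs)

print(steps_per_epoch, total_steps, logging_steps)
```

With these numbers the training loss is logged several times per epoch, while evaluation and checkpointing happen once per epoch.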



In [None]:
from transformers import Trainer

test_dataset = encoded_dataset["test"]

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer
)

Cloning https://huggingface.co/helenpy/distilbert-base-multilingual-cased-finetuned-tass into local empty directory.


In [None]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.9193,0.946644,0.534998
2,0.7795,0.935073,0.562423


TrainOutput(global_step=802, training_loss=0.8945368772135709, metrics={'train_runtime': 112.1186, 'train_samples_per_second': 85.659, 'train_steps_per_second': 7.153, 'total_flos': 112305897048132.0, 'train_loss': 0.8945368772135709, 'epoch': 2.0})

## OPTIONAL: Push to the Hugging Face Hub

If you logged in to your Hugging Face account and set `push_to_hub` and the Hub model id in the `TrainingArguments`, you can upload your model to the Hugging Face Hub as follows:

In [None]:
trainer.push_to_hub()

Several commits (2) will be pushed upstream.
The progress bars may be unreliable.

To https://huggingface.co/helenpy/distilbert-base-uncased-finetuned-tass
   63f8492..f759890  main -> main

To https://huggingface.co/helenpy/distilbert-base-uncased-finetuned-tass
   f759890..2dc7f05  main -> main


'https://huggingface.co/helenpy/distilbert-base-uncased-finetuned-tass/commit/f75989096eb8fa1984e6d4c28a92e6127a81a908'

## Download the model from the Hub

In [None]:
from transformers import pipeline

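# Note: this repository name comes from a run based on distilbert-base-uncased.
# A model trained from the multilingual checkpoint used above is pushed to
# "<user>/distilbert-base-multilingual-cased-finetuned-tass" instead.
model_checkpoint = "helenpy/distilbert-base-uncased-finetuned-tass"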
pipe = pipeline("sentiment-analysis", model=model_checkpoint)


In [None]:
pipe("estoy muy contento de estudiar en este curso")

[{'label': 'LABEL_1', 'score': 0.6821224093437195}]
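The pipeline reports a generic name such as `LABEL_1` because no human-readable label names were stored in the model configuration. A minimal sketch for translating it back is shown below. It assumes the model on the Hub was trained with the same `ClassLabel` order used in this notebook (`['P', 'N', 'NEU']`), which cannot be confirmed from the output alone. The helper `readable_label` is hypothetical.

```python
# Assumed label order, copied from the ClassLabel definition in this notebook
names = ['P', 'N', 'NEU']

def readable_label(prediction, names):
    """Replace a generic 'LABEL_<i>' with the i-th class name."""
    index = int(prediction['label'].rsplit('_', 1)[-1])
    return {'label': names[index], 'score': prediction['score']}

result = [{'label': 'LABEL_1', 'score': 0.6821224093437195}]  # output shown above
readable = [readable_label(p, names) for p in result]
print(readable)
```

A cleaner long-term option is to store the class names in the model configuration before training, so the pipeline returns them directly.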