#**Fine-tuning with Transformers (Sentiment Analysis)**

Based on https://huggingface.co/transformers/v3.2.0/custom_datasets.html

In [None]:
!pip install transformers



In [None]:
!pip install datasets



- Dataset: IMDB
- Task: Sentiment Analysis

Loading the IMDB dataset

In [None]:
from datasets import load_dataset
imdb = load_dataset("imdb")




In [None]:
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

View an example:

In [None]:
imdb["train"][1]

{'text': '"I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn\'t matter what one\'s political views are because this film can hardly be taken seriously on any level. As for the claim that frontal male nudity is an automatic NC-17, that isn\'t true. I\'ve seen R-rated films with male nudity. Granted, they only offer some fleeting views, but where are the R-rated films with gaping vulvas and flapping labia? Nowhere, because they don\'t exist. The same goes for those crappy cable shows: schlongs swinging in the breeze but not a clitoris in sight. And those pretentious indie movies like The Brown Bunny, in which we\'re treated to the site of Vincent Gallo\'s throbbing johnson, but not a trace of pink visible on Chloe Sevigny. Before crying (or implying) "double-standard" in matters of nudity, the mentally obtuse should take into account one unavoidably obvious anatomical difference between men and women: there are no genitals on display when actresses appears nude, 

Preprocessing



The next step is to tokenize the text into a format the model can read. It is important to load the same tokenizer the model was trained with, so that words are split into tokens the way the model expects. Load the DistilBERT tokenizer with AutoTokenizer, since we will later train a classifier from a pretrained DistilBERT model:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


Now that you have instantiated a tokenizer, create a function that tokenizes the text. It should also truncate longer sequences so that they do not exceed the model's maximum input length:



In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
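
To make the effect of `truncation=True` concrete, here is a minimal sketch in plain Python. It is not the tokenizer's actual implementation. It assumes a 512-token limit (the maximum input length of `distilbert-base-uncased`) and uses the BERT uncased ids 101 and 102 for `[CLS]` and `[SEP]`. The helper name `truncate_ids` and the token ids of the two reviews are hypothetical.

```python
# Illustrative sketch only: mimics what truncation=True achieves, not how the
# tokenizer implements it internally.
CLS_ID, SEP_ID = 101, 102   # [CLS] and [SEP] ids in the BERT uncased vocabulary
MAX_LENGTH = 512            # assumed maximum input length for distilbert-base-uncased


def truncate_ids(token_ids, max_length=MAX_LENGTH):
    """Keep at most max_length ids, reserving two slots for [CLS] and [SEP]."""
    body = token_ids[: max_length - 2]
    return [CLS_ID] + body + [SEP_ID]


# Hypothetical reviews that tokenize to 700 and 10 word-piece ids (values are made up).
long_review = list(range(1000, 1700))
short_review = list(range(1000, 1010))

truncated_long = truncate_ids(long_review)
truncated_short = truncate_ids(short_review)
print(len(truncated_long), len(truncated_short))  # 512 12
```

Only the long review loses tokens; the short one passes through unchanged apart from the two special tokens.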

Use the 🤗 Datasets `map` function to apply the preprocessing function to the entire dataset. You can also set `batched=True` to apply the preprocessing function to several elements of the dataset at once, which speeds up preprocessing:

In [None]:
tokenized_imdb = imdb.map(preprocess_function, batched=True)
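
With `batched=True`, `map` calls the function on a dictionary of columns holding many rows at once rather than on one row at a time. The sketch below imitates that calling convention in plain Python. `toy_map` and `toy_preprocess` are hypothetical stand-ins (the latter splits on whitespace instead of running a real tokenizer), and the batch size of 2 is chosen only for readability.

```python
def toy_preprocess(examples):
    """Stand-in for preprocess_function: receives a dict of lists (one batch)."""
    # A whitespace split replaces the real tokenizer for illustration.
    return {"input_ids": [text.split() for text in examples["text"]]}


def toy_map(rows, function, batch_size=2):
    """Apply `function` to consecutive batches and collect the new column."""
    calls = 0
    new_column = []
    for start in range(0, len(rows["text"]), batch_size):
        batch = {name: values[start:start + batch_size] for name, values in rows.items()}
        new_column.extend(function(batch)["input_ids"])
        calls += 1
    return {**rows, "input_ids": new_column}, calls


rows = {"text": ["great film", "dull plot", "loved it", "too long", "fine"],
        "label": [1, 0, 1, 0, 1]}
mapped, n_calls = toy_map(rows, toy_preprocess)
print(n_calls)                 # 3 calls for 5 rows with batch_size=2
print(mapped["input_ids"][0])  # ['great', 'film']
```

This is why `preprocess_function` can pass `examples["text"]` straight to the tokenizer: in batched mode that value is already a list of strings.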






Finally, pad the text so that it has a uniform length. Although you can pad in the tokenizer function by setting `padding=True`, it is more efficient to pad only to the length of the longest element in each batch. This is known as dynamic padding. You can do this with the `DataCollatorWithPadding` class:

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
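
The saving from dynamic padding can be illustrated with a small plain-Python sketch. It does not call `DataCollatorWithPadding`. The helper `pad_batch`, the token ids, and the batch size of 2 are hypothetical. The pad id 0 is the `[PAD]` id in the BERT uncased vocabulary.

```python
PAD_ID = 0  # [PAD] id in the BERT uncased vocabulary


def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


# Hypothetical token-id sequences of lengths 3, 5, 2 and 8 (ids are made up).
sequences = [
    [101, 7, 102],
    [101, 7, 8, 9, 102],
    [101, 102],
    [101, 5, 6, 7, 8, 9, 10, 102],
]

# Dynamic padding: split into batches of 2 and pad each batch separately.
batches = [sequences[i:i + 2] for i in range(0, len(sequences), 2)]
dynamic_total = sum(len(seq) for batch in batches for seq in pad_batch(batch))

# Static padding: pad everything to the longest sequence in the whole dataset.
static_total = sum(len(seq) for seq in pad_batch(sequences))

print(dynamic_total, static_total)  # 26 32
```

The model processes 26 token positions instead of 32 in this toy case. On real reviews with widely varying lengths the gap is typically larger.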

**Fine-tuning with Trainer**

Now load your model with the AutoModelForSequenceClassification class, along with the expected number of labels:

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.

At this point, only three steps remain:

1. Define your training hyperparameters in TrainingArguments.
2. Pass the training arguments to a Trainer along with the model, the dataset, the tokenizer, and the data collator.
3. Call `Trainer.train()` to fine-tune your model.

Defining metrics


In [None]:
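import numpy as np
from datasets import load_metric

# Load the metric once instead of on every evaluation call.
# Note: `load_metric` is deprecated in recent `datasets` releases in favor of
# the separate `evaluate` library (`evaluate.load("f1")`).
metric = load_metric("f1")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    # Pick the class with the highest logit for each example.
    predictions = np.argmax(logits, axis=-1)
    # Macro-averaged F1: unweighted mean of the per-class F1 scores.
    return metric.compute(predictions=predictions, references=labels, average="macro")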
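
As a worked illustration of what the metric function computes, the following sketch takes the argmax over logits and then the macro-averaged F1 by hand, in plain Python. It is not the library's implementation. The logits and labels are made up, and the helper names are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def f1_for_class(predictions, references, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == cls and r == cls for p, r in pairs)
    fp = sum(p == cls and r != cls for p, r in pairs)
    fn = sum(p != cls and r == cls for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(predictions, references):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(references) | set(predictions))
    return sum(f1_for_class(predictions, references, c) for c in classes) / len(classes)


# Made-up logits for six reviews: column 0 = negative, column 1 = positive.
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4], [0.0, 2.0], [0.9, 0.2]]
labels = [0, 0, 1, 0, 1, 1]

predictions = [argmax(row) for row in logits]
score = macro_f1(predictions, labels)
print(predictions)      # [0, 1, 1, 0, 1, 0]
print(round(score, 4))  # 0.6667
```

Each class here has two true positives, one false positive, and one false negative, so both per-class F1 scores are 2/3, and so is their mean.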

In [None]:
!pip install accelerate -U



In [None]:
from transformers import TrainingArguments, Trainer

In [None]:


training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,  # kept low to save time
    weight_decay=0.01,
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


In [None]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.3118
1000,0.2447
1500,0.2317
2000,0.1647
2500,0.1441
3000,0.1514


TrainOutput(global_step=3126, training_loss=0.20539091492187345, metrics={'train_runtime': 2467.6616, 'train_samples_per_second': 20.262, 'train_steps_per_second': 1.267, 'total_flos': 6561288258498624.0, 'train_loss': 0.20539091492187345, 'epoch': 2.0})
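
The `global_step` and throughput figures in this output can be cross-checked from the dataset size and the training arguments. The sketch below assumes a single device and no gradient accumulation (the notebook states neither), so the effective batch size equals `per_device_train_batch_size`.

```python
import math

train_rows = 25000          # rows in the IMDB train split
batch_size = 16             # per_device_train_batch_size
epochs = 2                  # num_train_epochs
train_runtime = 2467.6616   # seconds, from the TrainOutput above

# Assumption: one device, no gradient accumulation; the last partial batch
# of each epoch still counts as one optimizer step.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = train_rows * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)  # 1563 3126
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

Under these assumptions the computed step count matches the reported `global_step=3126`, and the two rates agree with `train_samples_per_second` and `train_steps_per_second` to three decimals.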

In [None]:
metrics = trainer.evaluate()
print(metrics)

{'eval_loss': 0.23661759495735168, 'eval_f1': 0.9301509860126845, 'eval_runtime': 442.6363, 'eval_samples_per_second': 56.48, 'eval_steps_per_second': 3.531, 'epoch': 2.0}

