<a href="https://colab.research.google.com/github/Spocklight/Data_Mining_Bases/blob/main/Extra_ejercicio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

❗<font color="blue"> This notebook can take many minutes to run </font> ❗

In [None]:
# Transformers installation
! pip install transformers datasets

# Extra Exercise: Data Mining
## *Alejandro Sierra Fernández*
In this notebook we fine-tune a pre-trained model from the Transformers library. Starting from a pre-trained model reduces computational and environmental costs, and it can also save a great deal of work, since it avoids collecting documents by hand or through scraping.

We start by downloading a dataset of reviews of various businesses and establishments taken from Yelp:

[Yelp Reviews](https://huggingface.co/datasets/yelp_review_full) dataset

In [3]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")
dataset["train"][10]

{'label': 0,
 'text': "Owning a driving range inside the city limits is like a license to print money.  I don't think I ask much out of a driving range.  Decent mats, clean balls and accessible hours.  Hell you need even less people now with the advent of the machine that doles out the balls.  This place has none of them.  It is april and there are no grass tees yet.  BTW they opened for the season this week although it has been golfing weather for a month.  The mats look like the carpet at my 107 year old aunt Irene's house.  Worn and thread bare.  Let's talk about the hours.  This place is equipped with lights yet they only sell buckets of balls until 730.  It is still light out.  Finally lets you have the pit to hit into.  When I arrived I wasn't sure if this was a driving range or an excavation site for a mastodon or a strip mining operation.  There is no grass on the range. Just mud.  Makes it a good tool to figure out how far you actually are hitting the ball.  Oh, they are cash 

We know that the dataset contains a label variable with 5 different classes, one for each number of stars a customer can give in a review. This time, instead of exploring the dataset, adding variables and cleaning the data, we go straight to tokenization and modeling.

In [17]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)
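
To make the two tokenizer options concrete, here is a minimal pure-Python sketch of what `padding="max_length"` and `truncation=True` do to a sequence of token ids. The helper `pad_or_truncate`, the toy ids and the length of 8 are illustrative assumptions, not part of the Transformers API. The real tokenizer pads to the model's maximum length and takes care of special tokens, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Hypothetical helper: return (ids, attention_mask), both max_length long."""
    ids = token_ids[:max_length]        # truncation: drop tokens past max_length
    mask = [1] * len(ids)               # 1 marks a real token
    n_pad = max_length - len(ids)
    ids = ids + [pad_id] * n_pad        # padding: fill the rest with the pad id
    mask = mask + [0] * n_pad           # 0 marks padding the model should ignore
    return ids, mask


# A short sequence gets padded, a long one gets cut (toy ids, not real tokens)
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1, 13)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up with the same length, the examples can be stacked into fixed-size batches, at the cost of spending computation on padding for short reviews.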


We create smaller datasets, with 1,000 examples each:

In [18]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
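
The idea behind `shuffle(seed=42).select(range(1000))` can be sketched in plain Python: shuffle the row indices with a fixed seed, then keep the first 1,000. The function `shuffle_and_select` is a hypothetical stand-in; the `datasets` library uses its own random generator, so the rows it picks will differ from the ones chosen here.

```python
import random


def shuffle_and_select(n_rows, n_keep, seed):
    """Hypothetical helper: indices of n_keep rows drawn after a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded, so the order is reproducible
    return indices[:n_keep]                # keep the first n_keep shuffled rows


# 650,000 is the size of the train split reported when the dataset was generated
first = shuffle_and_select(n_rows=650_000, n_keep=1000, seed=42)
second = shuffle_and_select(n_rows=650_000, n_keep=1000, seed=42)

print(len(first), first == second)
```

Fixing the seed means that rerunning the notebook trains and evaluates on the same subset, which keeps results comparable between runs.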

Transformers provides a Trainer class optimized for training Transformers models. We start by downloading the model and telling it the number of classes to predict, which, as mentioned above, is five:

In [19]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b
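
These warnings are expected in this setting: the pre-training heads stored in the checkpoint are discarded and a new classification layer is placed on top of the encoder. As a rough sketch, assuming that layer is a single linear map from BERT-base's 768-dimensional pooled output to the 5 classes, the number of parameters that start from random values can be counted by hand:

```python
hidden_size = 768   # assumption: width of the pooled output in BERT-base
num_labels = 5      # number of classes passed to from_pretrained

weight_params = hidden_size * num_labels   # one weight per (hidden unit, class) pair
bias_params = num_labels                   # one bias per class
new_params = weight_params + bias_params

print(new_params)
```

This is a tiny fraction of the model, which is why the pre-trained weights carry most of the work and only a short fine-tuning run is needed.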

Trainer does not automatically evaluate model performance during training. To enable this, we pass it a function that computes and reports the metrics. We define it as follows:

In [20]:
import numpy as np
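from datasets import load_metric

# Newer releases of `datasets` deprecate load_metric in favor of
# evaluate.load("accuracy") from the separate `evaluate` library.
metric = load_metric("accuracy")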

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
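
As a small worked example of what `compute_metrics` calculates, the sketch below takes the arg-max of each row of logits and compares it with the true label. The logits and labels are made-up numbers, and the pure-Python `argmax` helper stands in for `np.argmax(logits, axis=-1)`.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


# Toy logits for 4 reviews and 5 classes (made-up numbers)
logits = [
    [2.0, 0.1, -1.0, 0.3, 0.0],   # highest score at index 0
    [0.2, 0.1, 0.4, 1.5, 0.9],    # highest score at index 3
    [0.0, 0.3, 0.2, 0.1, 2.2],    # highest score at index 4
    [0.5, 1.1, 0.9, 0.2, 0.1],    # highest score at index 1
]
labels = [0, 3, 2, 1]             # true classes; the third prediction is wrong

predictions = [argmax(row) for row in logits]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(predictions, accuracy)
```

Three of the four predictions match their labels, so the accuracy is 0.75.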

Next we define the TrainingArguments object, which holds all the hyperparameters we can modify to tune our model. Here we keep the default values, setting only the output directory and requesting an evaluation at the end of each epoch, although the rest can be changed in search of better performance:

(Link to [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments))

In [21]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")
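
To show what keeping the defaults means in practice, the sketch below writes out a few of the defaults as a plain dictionary that could be passed as `TrainingArguments(**training_kwargs)`. The dictionary names and the values in `tuned_kwargs` are purely illustrative assumptions, not recommended settings.

```python
# Values used in the cell above, plus a few library defaults written out explicitly
training_kwargs = {
    "output_dir": "test_trainer",        # set in the cell above
    "evaluation_strategy": "epoch",      # set in the cell above
    "num_train_epochs": 3,               # default
    "per_device_train_batch_size": 8,    # default
    "per_device_eval_batch_size": 8,     # default
    "learning_rate": 5e-5,               # default
}

# Hypothetical variation to try when looking for better performance
tuned_kwargs = {**training_kwargs, "learning_rate": 2e-5, "num_train_epochs": 5}

print(tuned_kwargs)
```

Keeping the overrides in a dictionary makes it easy to compare runs that differ in only one or two hyperparameters.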

We now have everything needed to define our Trainer object: the model, the datasets, the function that measures performance, and the training arguments.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 375


Epoch,Training Loss,Validation Loss
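
The figure of 375 optimization steps in the log above follows directly from the other values it reports. The calculation below is a simplified sketch for a single device and ignores the corner cases that Trainer handles internally.

```python
import math

num_examples = 1000      # "Num examples" in the log
batch_size = 8           # "Total train batch size" in the log
grad_accum_steps = 1     # "Gradient Accumulation steps" in the log
num_epochs = 3           # "Num Epochs" in the log

# One optimization step consumes batch_size * grad_accum_steps examples
steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum_steps))
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

With 125 steps per epoch and 3 epochs, the total is 375, matching "Total optimization steps" in the log.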


Finally, we obtain the fine-tuned model, with its weights adjusted to our dataset. With this model we could already start making predictions.
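
As a sketch of how a prediction could be turned into a star rating, the code below converts one row of logits into probabilities and picks the most likely class. The logits are made-up numbers; in the notebook they could come from something like `trainer.predict(small_eval_dataset).predictions`. The mapping from label `k` to `k + 1` stars is an assumption based on the five classes described earlier.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one review, one score per class (labels 0 to 4)
logits = [-1.2, -0.3, 0.4, 2.1, 1.0]

probs = softmax(logits)
predicted_label = max(range(len(probs)), key=probs.__getitem__)
predicted_stars = predicted_label + 1            # assumption: label k means k + 1 stars

print(predicted_label, predicted_stars, round(probs[predicted_label], 3))
```

The probability attached to the chosen class gives a rough sense of how confident the model is in that rating.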