# Exercises UD03_03

# Sentiment analysis with DistilBERT

In the practice [Sentiment analysis with DistilBERT](https://colab.research.google.com/github/martinezpenya/MIA-IABD-2425/blob/main/UD03/notebooks/3.-models_llenguatge_ES.ipynb) we trained DistilBERT to classify the sentiment of texts, using a dataset of tweets.

In this exercise we want to use DistilBERT to classify movie reviews, and the earlier training does not carry over to that task. For this reason, we will use transfer learning to adapt the model to the new task, using the IMDB movie review dataset, available on Hugging Face as `imdb`.

## Objectives

* Reproduce the process from the practice [Sentiment analysis with DistilBERT](https://colab.research.google.com/github/martinezpenya/MIA-IABD-2425/blob/main/UD03/notebooks/3.-models_llenguatge_ES.ipynb) to train DistilBERT to classify the sentiment of movie reviews.
* Train the model (DistilBERT or another transformer) on the IMDB movie review `dataset`.
* Analyze the model's predictions and compare them with those of the model trained in the practice [Sentiment analysis with DistilBERT](https://colab.research.google.com/github/martinezpenya/MIA-IABD-2425/blob/main/UD03/notebooks/3.-models_llenguatge_ES.ipynb).
* Use a transformers `pipeline` to make [`zero-shot`](https://huggingface.co/transformers/main_classes/pipelines.html#transformers.ZeroShotClassificationPipeline) predictions on movie reviews, and compare the results with those of the model trained on the IMDB `dataset`.
* Write a cell with your comments and/or final conclusions comparing the two approaches used in the exercise.
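
The zero-shot objective could be approached roughly as follows. This is a minimal sketch rather than the reference solution: the checkpoint `facebook/bart-large-mnli`, the candidate labels, and the helper names `top_label` and `zero_shot_sentiment` are assumptions chosen for illustration.

```python
def top_label(result):
    """Return the highest-scoring label from one zero-shot result dict."""
    # Each result carries parallel 'labels' and 'scores' lists; taking the
    # maximum explicitly avoids relying on any particular ordering.
    scores = result["scores"]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return result["labels"][best]


def zero_shot_sentiment(texts, candidate_labels=("positive", "negative"),
                        model_name="facebook/bart-large-mnli"):
    """Classify texts without any fine-tuning (assumed checkpoint)."""
    # Imported lazily so top_label can be used without transformers installed.
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification", model=model_name)
    results = classifier(list(texts), candidate_labels=list(candidate_labels))
    # Normalise to a list in case a single dict is returned.
    if isinstance(results, dict):
        results = [results]
    return [top_label(r) for r in results]
```

Calling `zero_shot_sentiment(["A wonderful film."])` downloads the assumed checkpoint on first use, so it is best run in the same GPU runtime as the rest of the notebook.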

In [1]:
# Install the libraries we are going to use

import os
os.environ["WANDB_DISABLED"] = "true"

%pip install -U transformers datasets evaluate accelerate scikit-learn


In [2]:
import datasets

# Load the dataset
dataset = datasets.load_dataset('stanfordnlp/imdb')

# Show a sample record

dataset['train'][0]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

### Tokenization

In [4]:
# Import the tokenizer class
from transformers import AutoTokenizer

# Load the DistilBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased')

# Show a tokenization example
tokenizer.tokenize('FC Barcelona is fucked this year')

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

['fc', 'barcelona', 'is', 'fucked', 'this', 'year']

In [5]:
# Define a function to preprocess the text.
# Texts are truncated so they never exceed DistilBERT's maximum input size.

def tokenize(examples):
    return tokenizer(examples["text"], padding='max_length', truncation=True)
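
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a toy model of what happens to a single sequence of token ids. It is a sketch under stated assumptions, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, it ignores the special `[CLS]` and `[SEP]` tokens that the real tokenizer preserves, and it assumes a maximum length of 512 and a padding id of 0.

```python
def pad_or_truncate(token_ids, max_length=512, pad_id=0):
    """Toy illustration of fixed-length padding combined with truncation."""
    # Truncation: anything beyond max_length is simply dropped.
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        # Padding: short sequences are filled up to max_length with pad_id.
        "input_ids": kept + [pad_id] * n_pad,
        # The mask marks real tokens with 1 and padding with 0.
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


print(pad_or_truncate([5, 6, 7], max_length=5))
# {'input_ids': [5, 6, 7, 0, 0], 'attention_mask': [1, 1, 1, 0, 0]}
```

Padding every review to the full length is simple but costs memory and time; padding each batch to its longest member (for example with `DataCollatorWithPadding`) is a common alternative.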

In [6]:
dades_tokenitzades = dataset.map(tokenize, batched=True)

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

### Evaluation

In [7]:
import evaluate

accuracy = evaluate.load('accuracy')

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

In [8]:
# Define a function to compute the model's accuracy

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
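
The metric function above takes the argmax of each row of logits and compares it with the reference label. The same computation can be sketched in plain Python; the helper names and the toy logits below are made up purely for illustration.

```python
def argmax_rows(logits):
    """Index of the largest value in each row (the predicted class id)."""
    return [max(range(len(row)), key=lambda i: row[i]) for row in logits]


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose argmax matches the reference label."""
    predictions = argmax_rows(logits)
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy values: four examples, two classes.
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # {'accuracy': 0.75}
```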

### Labels

In [9]:
# IMDB is a binary task: label 0 is a negative review, label 1 a positive one
id_a_etiqueta = {
    0: "NEGATIVE",
    1: "POSITIVE"
}

etiqueta_a_id = {
    "NEGATIVE": 0,
    "POSITIVE": 1
}
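
The IMDB dataset has only two classes (0 for negative, 1 for positive), so the label maps handed to the model should have exactly two entries. One way to keep both dictionaries consistent is to derive them from a single list. This is a sketch; `LABEL_NAMES` and `build_label_maps` are hypothetical names, and the display strings are an assumption (the dataset itself calls the classes `neg` and `pos`).

```python
# Assumed display names, indexed by IMDB label id (0 = negative, 1 = positive).
LABEL_NAMES = ["NEGATIVE", "POSITIVE"]


def build_label_maps(names):
    """Build id -> name and name -> id dictionaries from one ordered list."""
    id2label = dict(enumerate(names))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


id2label, label2id = build_label_maps(LABEL_NAMES)
print(id2label)  # {0: 'NEGATIVE', 1: 'POSITIVE'}
print(label2id)  # {'NEGATIVE': 0, 'POSITIVE': 1}
```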

### Fine-tuning the model

In [10]:
BATCH_SIZE = 16
NUM_EPOCHS = 1

In [11]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert/distilbert-base-uncased',
    num_labels=len(etiqueta_a_id),
    id2label=id_a_etiqueta,
    label2id=etiqueta_a_id
)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="test_trainer",
    eval_strategy="epoch",
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=NUM_EPOCHS,
    report_to="none",  # disable logging integrations such as Weights & Biases
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
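
To get a feel for how long training will run with these settings, the number of optimizer steps can be estimated from the size of the training split (25,000 examples) and the batch size. This is a back-of-the-envelope sketch that assumes a single device and no gradient accumulation; `training_steps` is a hypothetical helper.

```python
import math


def training_steps(num_examples, batch_size, num_epochs):
    """Estimate optimizer steps, assuming one device and no accumulation."""
    # The last batch of each epoch may be smaller, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


print(training_steps(25_000, 16, 1))  # 1563
```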


In [18]:
print(set(dades_tokenitzades['unsupervised']['label']))



{-1}


In [20]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dades_tokenitzades['train'],
    eval_dataset=dades_tokenitzades['test'],  # IMDB has no 'validation' split (using it raised an error), so we evaluate on 'test'
    compute_metrics=compute_metrics,
)

trainer.train()

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
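
A device-side assert during training is often caused by a label id that falls outside the range the classification head expects, although the traceback alone does not confirm that this is the cause here. The `unsupervised` split printed earlier contains only the label -1, which would be invalid if that split ever reached the model. A quick check along these lines can rule the problem in or out before training; `out_of_range_labels` is a hypothetical helper.

```python
from collections import Counter


def out_of_range_labels(labels, num_labels):
    """Count label values that fall outside the valid range [0, num_labels)."""
    return Counter(y for y in labels if not 0 <= y < num_labels)


print(out_of_range_labels([0, 1, 1, 0], num_labels=2))  # Counter()
print(out_of_range_labels([-1, -1, 0], num_labels=2))   # Counter({-1: 2})
```

Once such an assert has fired, later GPU calls in the same session typically keep failing, so restarting the runtime before retrying is usually necessary. Running a few batches on CPU tends to give a clearer error message.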


In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
print(classifier("Suddenly, I'm not half the man I used to be, There's a shadow hanging over me, Oh, yesterday came suddenly"))
print(classifier("Don't stop me now. I'm havin' such a good time, I'm havin' a ball. If you wanna have a good time, just give me a call"))
print(classifier("Remember those who win the game. Lose the love they sought to gain. In debentures of quality. And dubious integrity. Their small town eyes will gape at you. In dull surprise when payment due. Exceeds accounts received. At seventeen"))
print(classifier("You got your bitches with the silicone injections. Crystal meth and yeast infections. Bleached blond hair, collagen lip injections. Who are you to criticize my intentions?. Got your subtle, manipulative devices. Just like you, I got my vices. I got a thought that would be nice. I'd like to crush your head, tight in my vice. Pain"))
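
For the final comparison between the fine-tuned model and the zero-shot pipeline, it helps to score both on the same reviews and to see how often they agree. The sketch below assumes the predictions have already been converted to the same label strings as the references; `comparison_report` is a hypothetical helper, and the toy lists are placeholders, not real results.

```python
def comparison_report(gold, finetuned_preds, zero_shot_preds):
    """Accuracy of each approach plus how often the two approaches agree."""
    n = len(gold)

    def fraction_equal(a, b):
        return sum(int(x == y) for x, y in zip(a, b)) / n

    return {
        "finetuned_accuracy": fraction_equal(finetuned_preds, gold),
        "zero_shot_accuracy": fraction_equal(zero_shot_preds, gold),
        "agreement": fraction_equal(finetuned_preds, zero_shot_preds),
    }


# Placeholder values for illustration only.
toy_gold = ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]
toy_finetuned = ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]
toy_zero_shot = ["POSITIVE", "POSITIVE", "POSITIVE", "POSITIVE"]
print(comparison_report(toy_gold, toy_finetuned, toy_zero_shot))
# {'finetuned_accuracy': 0.75, 'zero_shot_accuracy': 0.5, 'agreement': 0.75}
```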