# Introduction to Sentiment Analysis with PyTorch and Transformers

**Course**: Artificial Intelligence Models

**Instructor**: Ramon Mateo Navarro

This notebook introduces the Hugging Face Transformers library, used here with its PyTorch backend. We will cover the basics of setting up a tokenizer and a transformer model, and then try the model out to see how it performs.

We will use the Reddit Data dataset, which you will find in the course folder.

**Code taken from**: [Transforming Sentiments: Training a Transformer Text Classifier on Reddit Data](https://medium.com/@vishwajeetv2003/transforming-sentiments-training-a-transformer-text-classifier-on-reddit-data-8d295bcdeab5)

## Required installations

In [1]:
!pip install datasets
!pip install transformers[torch]
!pip install accelerate -U


## Imports

In [2]:
import pandas as pd
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification


## Cleaning the dataset

In [3]:
# Load the dataset
df = pd.read_csv("Reddit_Data.csv")

# Map labels to numerical values: -1 -> 0, 0 -> 1, 1 -> 2
label_data_mapping = {-1: 0, 0: 1, 1: 2}
df['category'] = df['category'].map(label_data_mapping)

# Rename columns
df.columns = ['sentence', 'label']

# Drop duplicates and NaN values
df.drop_duplicates(inplace=True)
df.dropna(inplace=True)

# Save preprocessed data to a new CSV file
df.to_csv("data.csv", index=None)

# Load data using the Hugging Face Datasets library
data = load_dataset("csv", data_files="data.csv")

# Split the dataset into train and test sets
split = data['train'].train_test_split(seed=42, test_size=0.3)

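
The remapping step matters because a sequence-classification head expects class indices running from 0 to `num_labels - 1`, while the raw file stores sentiment as -1, 0, and 1. As a minimal sketch on made-up rows (`raw_rows` is hypothetical data, and the names `negative`, `neutral`, `positive` are an assumed reading of the raw values that the notebook does not state), the mapping and its inverse look like this:

```python
# Toy illustration of the label remapping; raw_rows is made-up data.
label_data_mapping = {-1: 0, 0: 1, 1: 2}

# Assumed meaning of the raw values (not stated in the notebook).
sentiment_names = {-1: "negative", 0: "neutral", 1: "positive"}

raw_rows = [
    ("this movie was awful", -1),
    ("the meeting is at noon", 0),
    ("what a great day", 1),
]

# Class indices as the model will see them.
remapped = [(text, label_data_mapping[raw]) for text, raw in raw_rows]

# Inverse mapping, useful for turning a predicted index back into a raw value.
inverse_mapping = {index: raw for raw, index in label_data_mapping.items()}

for text, index in remapped:
    raw = inverse_mapping[index]
    print(f"{index} <- {raw:>2} ({sentiment_names[raw]}): {text}")
```

Keeping the inverse mapping around makes it easy to report predictions in the dataset's original encoding later on.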

## Tokenizer

This code defines a function, `tokenizer_fn`, that tokenizes a batch of sentences, truncating sequences that exceed the model's maximum length and padding the rest so that every sequence in the batch has the same length. It uses the pretrained **DistilBERT tokenizer ("distilbert-base-uncased")**, loaded with `AutoTokenizer.from_pretrained`. Finally, the function is applied to the whole dataset in batches by calling `map` with `batched=True`.
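
To make truncation and padding concrete, here is a toy sketch in plain Python. It is not the DistilBERT tokenizer: it splits on whitespace instead of using WordPiece, and `toy_tokenize`, `PAD_ID`, and `max_length=6` are hypothetical names and values chosen for illustration. It only shows how a batch ends up with equal-length `input_ids` and a matching `attention_mask`.

```python
# Toy sketch of truncation and padding; NOT the DistilBERT WordPiece tokenizer.
PAD_ID = 0  # hypothetical id reserved for padding


def toy_tokenize(sentences, max_length=6):
    """Whitespace-split each sentence, map words to ids, truncate, then pad."""
    vocab = {}
    encoded = []
    for sentence in sentences:
        ids = []
        for word in sentence.lower().split():
            # Assign ids on first sight, starting at 1 (0 is the padding id).
            ids.append(vocab.setdefault(word, len(vocab) + 1))
        encoded.append(ids[:max_length])  # truncation

    # Pad every sequence to the longest one in this batch.
    longest = max(len(ids) for ids in encoded)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in encoded]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in encoded]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = toy_tokenize(["I am happy", "I do not know how I should feel today"])
print(batch["input_ids"])       # [[1, 2, 3, 0, 0, 0], [1, 4, 5, 6, 7, 1]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
```

The second sentence has nine words and is cut to six ids, while the first is padded with three zeros; the attention mask records which positions are padding.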


In [4]:
def tokenizer_fn(batch):
    # Truncate long sentences and pad each batch to its longest sequence
    return tokenizer(batch['sentence'], truncation=True, padding=True)

# Load tokenizer and tokenize the dataset
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_dataset = split.map(tokenizer_fn, batched=True)



Map: 25759 examples (train split)
Map: 11040 examples (test split)

In [5]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)




Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Training

Now it is time to train the model. The Transformers library provides `TrainingArguments`, which collects the training settings: where to write the generated model, the number of epochs, the batch sizes, and the evaluation strategy.
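
As a rough guide to how long training will run, the number of optimizer steps follows from the dataset size and the batch size. The sketch below uses the figures from this notebook (25,759 training and 11,040 test examples from the `Map` output, and the batch sizes and epoch count passed to `TrainingArguments`), and assumes a single device with no gradient accumulation.

```python
import math

# Values taken from this notebook: dataset sizes from the Map output,
# batch sizes and epochs from TrainingArguments.
train_examples = 25759
test_examples = 11040
train_batch_size = 16
eval_batch_size = 64
num_epochs = 3

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / train_batch_size)
total_steps = steps_per_epoch * num_epochs
eval_batches = math.ceil(test_examples / eval_batch_size)

print(steps_per_epoch, total_steps, eval_batches)  # 1610 4830 173
```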

In [6]:
from transformers import Trainer, TrainingArguments

# Define the training arguments
training_args = TrainingArguments(
    output_dir="training_dir",
    num_train_epochs=3,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_eval_batch_size=64,
    per_device_train_batch_size=16,
)

# We can define custom metrics
def compute_metric(logits_and_labels):
    logits, labels = logits_and_labels
    predictions = np.argmax(logits, axis=-1)
    accuracy = np.mean(predictions == labels)
    return {"accuracy": accuracy}

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metric
)

# Train the model
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.266         | 0.222365        | 0.930525 |


KeyboardInterrupt: 
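
The accuracy column comes from `compute_metric`, which takes the highest-scoring class for each example and compares it with the true label. Here is a small worked example on made-up logits (the `logits` and `labels` arrays are hypothetical, not taken from the model):

```python
import numpy as np


def compute_metric(logits_and_labels):
    logits, labels = logits_and_labels
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": np.mean(predictions == labels)}


# Made-up logits for four sentences and three classes.
logits = np.array([
    [2.0, 0.1, -1.0],   # argmax 0
    [0.2, 1.5, 0.3],    # argmax 1
    [-0.5, 0.0, 3.0],   # argmax 2
    [1.2, 0.9, 0.1],    # argmax 0, but the true label is 1
])
labels = np.array([0, 1, 2, 1])

result = compute_metric((logits, labels))
print(result["accuracy"])  # 0.75
```

Three of the four predictions match, so the accuracy is 0.75.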

In [7]:
from transformers import pipeline

# Save the fine-tuned model and tokenizer to ./model so the pipeline can find them
trainer.save_model("model")

# Load the trained model for inference (device=0 selects the first GPU)
sentiment_classifier = pipeline("text-classification", device=0, model="model")

# Test the model with sample sentences
print(sentiment_classifier("I don't know if I should be happy or sad for my friend's birthday"))
print(sentiment_classifier("I am happy"))
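
Because the model was loaded without an `id2label` mapping, a text-classification pipeline reports classes under the default names `LABEL_0`, `LABEL_1`, and `LABEL_2`. A small helper can translate these back into readable sentiments. In this sketch, `decode_prediction`, `INDEX_TO_SENTIMENT`, and the sample score are hypothetical, and the sentiment names are an assumed reading of the original -1/0/1 labels.

```python
# Assumed meaning of the class indices, following the -1/0/1 -> 0/1/2 remapping.
INDEX_TO_SENTIMENT = {0: "negative", 1: "neutral", 2: "positive"}


def decode_prediction(prediction):
    """Turn {'label': 'LABEL_2', 'score': ...} into a readable sentiment."""
    index = int(prediction["label"].rsplit("_", 1)[1])
    return {"sentiment": INDEX_TO_SENTIMENT[index], "score": prediction["score"]}


# Hand-written example in the shape the pipeline returns; the score is made up.
sample_output = [{"label": "LABEL_2", "score": 0.97}]
decoded = [decode_prediction(p) for p in sample_output]
print(decoded)
```

With a real classifier, the same helper could be applied to the list returned by `sentiment_classifier(...)`.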