# HuggingFace

We recommend running this notebook on Google Colab, since we will use the GPU that Colab provides.

To enable it: Runtime > Change Runtime Type > T4 GPU.

First, we install some packages.

In [None]:
!pip install transformers datasets evaluate accelerate -U -q
!pip install transformers[sentencepiece] sentencepiece -q

Once they are installed, restart the kernel so the packages become available.

In [None]:
# Local machine only
%env CUDA_DEVICE_ORDER=PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES=1
%env NCCL_P2P_DISABLE=1
%env NCCL_IB_DISABLE=1

In [None]:
from pprint import pprint

import evaluate
import matplotlib.pyplot as plt
import numpy as np
from datasets import DatasetDict, load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments, pipeline

## Pipeline

HuggingFace provides an API called *pipeline*.
It is the library's highest-level entry point and wraps the lower-level components (tokenizers, models, and post-processing).
Given the task we want to perform, the *pipeline* internally calls all the functions and models it needs and returns the result directly.

*pipeline* supports text, audio, and image processing tasks.
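
To make the idea concrete, the sketch below mimics the three stages a text-classification pipeline goes through: preprocessing, a forward pass, and post-processing. It is only an illustration under loud assumptions: `toy_preprocess`, `toy_forward`, the word lists, and the label order are all hypothetical stand-ins and are not the library's code.

```python
import numpy as np

LABELS = ["NEGATIVE", "POSITIVE"]  # hypothetical label order for this toy example


def toy_preprocess(text):
    # Stand-in for tokenization: lowercase, split on whitespace, strip punctuation.
    return [word.strip("!.,?") for word in text.lower().split()]


def toy_forward(tokens):
    # Stand-in for the model: returns one raw score (logit) per label.
    # The word lists are invented for this illustration.
    positive = sum(tok in {"love", "great"} for tok in tokens)
    negative = sum(tok in {"hate", "awful"} for tok in tokens)
    return np.array([float(negative), float(positive)])


def toy_postprocess(logits):
    # Softmax turns logits into probabilities; report the most likely label.
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    best = int(np.argmax(probs))
    return {"label": LABELS[best], "score": round(float(probs[best]), 4)}


def toy_pipeline(text):
    # Chain the three stages, as a real pipeline does internally.
    return toy_postprocess(toy_forward(toy_preprocess(text)))


print(toy_pipeline("I love animals!"))
```

A real pipeline swaps each stand-in for a pretrained tokenizer and model, but the output has the same shape: a label and a score.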

### Text summarization

In [None]:
# We can simply name the task we want, and a default model is loaded
# device selects CPU or GPU (cuda); if you get an error, remove this argument
summarizer = pipeline(task="summarization", device="cuda")

In [None]:
# https://www.theguardian.com/world/article/2024/jul/11/intruder-climbs-up-to-dome-of-florence-cathedral-overnight-for-selfie
text = """
A teenager has been reported to police after allegedly sneaking around Florence’s Cathedral of Santa Maria del Fiore overnight and climbing up to its Cupola del Brunelleschi to take a selfie.
Wearing a black hoodie, jeans and trainers, a person filmed himself walking up an inside stairwell of the world heritage site before reaching the dome level, stepping on to a small platform outside and taking a picture of himself.
Coverage of the stunt was posted on @dedelate, an Instagram account with more than 227,000 followers. It is believed the person had hidden in the cathedral before it closed. An accomplice filmed the alleged exploit from outside the cathedral, apparently capturing the protagonist fleeing the building.
The teenager has not been officially identified but reports in the Italian press said he was a 17-year-old from Lombardy who was known for taking on similar selfie-driven challenges, including one at Milan Cathedral in May, for which he was reportedly charged.
Luca Bagnoli, the president of the Opera di Santa Maria del Fiore, said a complaint had been made to police.
“We have learned about the unauthorised access to Brunelleschi’s dome,” he told La Nazione newspaper. “The cathedral of Florence is a sacred and monumental place, a world heritage site. But unfortunately for some it is also a playground, and this is saddening. The relevant authorities will take care of the rest.”
The suspect has previously allegedly recorded stunts on the roof of Milan’s San Siro stadium and the Ariston theatre in Sanremo, where the annual song festival is held.
Fabiola Minoletti, a vice-president of a Milan residents’ committee, told Il Giorno newspaper: “Dede continues his exploits. Why can’t we contain these people, with such arrogant attitudes? He not only risks his life each time, but [the stunts] could generate dangerous emulations.”
One of the suspect’s followers has now challenged him to climb to the top of St Peter’s Basilica at the Vatican.
"""
summarizer(text)

### Sentiment analysis

In [None]:
classifier = pipeline(task="sentiment-analysis", device="cuda")

In [None]:
text = "I love animals!"

preds = classifier(text)
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
print(preds)

The pipeline also accepts a list of texts directly.

In [None]:
text = ["I love animals!", "I hate you"]

preds = classifier(text)
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
print(preds)

These examples are very obvious, so let's look at real data.

We will use the `datasets` library.

In [None]:
# Data from https://huggingface.co/datasets/carblacac/twitter-sentiment-analysis
# Take the first 100 examples of the train split
twitter_sent = load_dataset("carblacac/twitter-sentiment-analysis", split="train[:100]")

In [None]:
twitter_sent

In [None]:
# Each example has a text and a sentiment label: positive (0) or negative (1)
twitter_sent[:5]

In [None]:
# We can compare the predictions with the gold annotations in the corpus
preds = classifier(twitter_sent["text"])
labels = [(twitter_sent["feeling"][i], pred["label"]) for i, pred in enumerate(preds)]
labels[:10]

### Exercise

Create a NumPy array with all the labels in `preds`.

Using masks, set the "POSITIVE" values to 0 and the "NEGATIVE" values to 1.

Compare the result with `twitter_sent["feeling"]`. How many values did the classifier get right?

#### Solution

In [None]:
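# Collect the predicted labels in a NumPy array of strings
preds = np.array([pred["label"] for pred in preds])
# Boolean masks select the matching positions; the array keeps its string dtype,
# so the assigned values are stored as "0" and "1"
preds[preds == "POSITIVE"] = 0
preds[preds == "NEGATIVE"] = 1

In [None]:
# Convert to integers and count the matches against the gold labels
sum(preds.astype(int) == twitter_sent["feeling"]), len(preds)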
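
As a possible alternative, the mask itself can be converted to integers, which avoids the intermediate string values. The sketch below uses small hypothetical arrays (`pred_labels` and `gold` are made up for the illustration) instead of the real pipeline output, and follows the same convention of POSITIVE as 0 and NEGATIVE as 1.

```python
import numpy as np

# Hypothetical predictions and gold labels, standing in for the pipeline output
pred_labels = np.array(["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"])
gold = np.array([0, 1, 0, 0])

# The boolean mask is True where the prediction is "NEGATIVE";
# casting it to int gives 1 for NEGATIVE and 0 for POSITIVE
pred_ids = (pred_labels == "NEGATIVE").astype(int)

# Count the positions where prediction and gold label agree
correct = int((pred_ids == gold).sum())
print(correct, len(gold))
```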

### Named Entity Recognition (NER)

In [None]:
pipe = pipeline("ner", model="projecte-aina/roberta-base-ca-v2-cased-ner", device="cuda")

In the output labels, the B- and I- prefixes and the O tag stand for Beginning, Inside, and Outside (of an entity).
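
The BIO scheme can be illustrated with a small helper that merges tagged tokens into entity spans. This is a minimal sketch: `group_bio` is a hypothetical function, and the tokens and tag names (`PER`, `ORG`) are invented for the example and may not match the label set of the model used here.

```python
def group_bio(tokens, tags):
    """Merge BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # close the open entity and skip this token
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Pol", "works", "at", "the", "Universitat", "de", "Barcelona"]
tags = ["B-PER", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG"]
entities = group_bio(tokens, tags)
print(entities)
```

The `ner` pipeline can do a similar grouping itself through its `aggregation_strategy` argument, so in practice a helper like this is rarely needed.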

In [None]:
example = "Em dic Pol, soc de Banyoles i treballo a la Universitat de Barcelona."

ner_results = pipe(example)
ner_results

### Machine translation

In [None]:
# We can specify which model to use:
translator = pipeline(task="translation", model="t5-small", device="cuda")

T5 is a multi-task model. It was one of the first (if not the first) to use "prompts": a short description of the task to perform, placed before the input text.
For example, it can translate from English into French, German, or Romanian.
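
The prompt is just a text prefix, so it can be built with ordinary string formatting. The helper below is a hypothetical convenience function (not part of the library) that produces the `translate X to Y: ...` format used in this notebook's translation example.

```python
def t5_translation_prompt(text, source="English", target="French"):
    # Prepend the task description that T5 expects before the input text
    return f"translate {source} to {target}: {text}"


# One prompt per target language mentioned above
prompts = [
    t5_translation_prompt("How are you?", target=language)
    for language in ("French", "German", "Romanian")
]
for prompt in prompts:
    print(prompt)
```

Each resulting string can be passed to the translation pipeline in place of a hand-written prompt.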

In [None]:
text = "translate English to French: Hugging Face is a community-based open-source platform for machine learning."
translator(text)

### Exercise

1. Browse https://huggingface.co/Helsinki-NLP and find a machine translation model for any two languages you like.

2. Pass the model name to the pipeline.

3. Give the pipeline more than one sentence to translate (as a list). In which Python data structure are the translations returned?

## Fine-tuning

**Example** code for an IberLEF 2024 task: https://github.com/clic-ub/DETESTS-Dis/blob/main/beto_baselines.ipynb

We will use the data from https://huggingface.co/datasets/stanfordnlp/imdb

In [None]:
# Select the splits we are interested in
imdb = load_dataset("stanfordnlp/imdb", split={"train": "train", "test": "test"})

In [None]:
# Take a look at the data. How is it organized?
# Which values does label take? Are the classes balanced?
imdb

In [None]:
imdb["train"][0]

There is no "validation" split, so we create one by holding out 30% of the test split.

In [None]:
test_valid = imdb["test"].train_test_split(test_size=0.3, seed=42)
ds = DatasetDict({"train": imdb["train"], "test": test_valid["train"], "valid": test_valid["test"]})
ds
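
The effect of the split can be sketched with plain NumPy: shuffle the indices once, then cut them into two disjoint parts. This is only an illustration of the idea, assuming the IMDB test split has 25,000 examples; the library's own shuffling and rounding may differ in detail.

```python
import numpy as np

n_examples = 25_000  # assumed size of the IMDB test split
valid_fraction = 0.3  # same proportion as test_size=0.3

# Shuffle all indices with a fixed seed so the split is reproducible
rng = np.random.default_rng(42)
indices = rng.permutation(n_examples)

# The first 30% become validation data, the rest stays as test data
n_valid = round(valid_fraction * n_examples)
valid_idx, test_idx = indices[:n_valid], indices[n_valid:]

print(len(test_idx), len(valid_idx))
```

Because both parts come from one permutation, no example ends up in both the test and validation sets.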

We will fine-tune an English-language model to classify reviews as positive or negative.

Transformer models use a _tokenizer_ that splits the text into pieces (tokens) before processing.

In [None]:
model_id = "distilbert/distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

In [None]:
tokenizer(ds["train"][0]["text"])

In [None]:
# To tokenize all the texts at once, we define a function
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length")


# And apply it with map (similar to apply in Pandas)
tokenized_imdb = ds.map(preprocess_function, batched=True)
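
What `truncation=True` and `padding="max_length"` do to each sequence can be shown with a simplified pure-Python sketch. `pad_or_truncate` is a hypothetical helper, and `pad_id=0` is an arbitrary choice for the illustration. Real tokenizers also handle special tokens and use a model-specific padding id.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    # Truncation: keep at most max_length ids
    ids = list(token_ids[:max_length])
    # The attention mask marks real tokens with 1 and padding with 0
    attention_mask = [1] * len(ids)
    # Padding: fill the remaining positions so every sequence has the same length
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


short = pad_or_truncate([5, 6, 7], max_length=5)
long = pad_or_truncate(list(range(10, 20)), max_length=5)
print(short)
print(long)
```

After this step every example has the same length, which lets the examples be stacked into fixed-size batches.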

In [None]:
accuracy = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)
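
What `compute_metrics` does can be checked by hand on a tiny example. The logits and labels below are made up for the illustration. The sketch takes the argmax over the class dimension and then the fraction of matches, which is how accuracy is defined.

```python
import numpy as np

# Hypothetical model outputs: one row per example, one column per class
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.8]])
labels = np.array([0, 1, 0, 0])

# The predicted class is the index of the largest logit in each row
predictions = np.argmax(logits, axis=-1)

# Accuracy is the fraction of predictions that match the labels
accuracy_value = float((predictions == labels).mean())
print(predictions, accuracy_value)
```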

In [None]:
training_args = TrainingArguments(
    output_dir="imdb",
    eval_strategy="epoch",
    logging_strategy="epoch",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["valid"],
    compute_metrics=compute_metrics,
)

trainer.train()

In [None]:
predictions = trainer.predict(tokenized_imdb["test"])
results = np.argmax(predictions[0], axis=1)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(ds["test"]["label"], results))