<a href="https://colab.research.google.com/github/ManelSoengas/NLP_Curs/blob/main/Utilitzant_Transfomers_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Using Transformers**

---

A pipeline is a ready-to-use tool for one specific task, built on top of a Transformers model.

When we call a pipeline in transformers, it bundles three stages that we would otherwise run separately: preprocessing, model inference, and postprocessing.

First, preprocessing turns the raw text into model inputs:

1. A **tokenizer** splits the text into tokens and maps each token to an integer ID.

2. It adds the special tokens the model expects ([CLS], [SEP], etc.).

3. It pads or truncates the sequences where needed.

    *   **Padding** appends filler tokens ([PAD], typically ID 0) to the shorter sentences so that every sequence in the batch is as long as the longest one.
    *   **Truncation** cuts off the excess tokens of a sentence that exceeds the model's maximum input length (for example, 512 tokens for BERT).

4. It builds the attention_mask, which tells the model which positions hold real tokens (1) and which are padding to ignore (0).
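
The padding, truncation, and attention-mask steps above can be sketched in plain Python. This is an illustrative sketch, not the tokenizer's actual implementation: the helper name `pad_and_truncate`, the default pad ID of 0, and the sample token IDs are assumptions made for the example.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Pad or truncate lists of token IDs to a common length.

    Hypothetical helper for illustration; real tokenizers also keep the
    closing special token when truncating, which this sketch does not.
    """
    # Truncation: drop every token beyond the maximum length.
    truncated = [ids[:max_length] for ids in batch]
    # Padding target: the longest sequence left in the batch.
    target = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token IDs: 101 and 102 stand in for [CLS] and [SEP].
batch = [[101, 7, 8, 9, 10, 102], [101, 7, 102]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=5)
print(input_ids)
print(attention_mask)
```

The first sequence is cut to five IDs, and the second is padded with zeros up to that length, with matching zeros in its mask.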

Once we have the input tensors, they are passed to the model:

1. The model runs inference (prediction).

2. It produces output vectors (logits, embeddings, predictions...).


*   The logits are the raw numeric values the model outputs before softmax is applied, that is, before they are converted into probabilities. They act as scores for each option: a higher logit means the model favors that class more. They cannot be read as probabilities until a transformation such as softmax is applied.
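
For reference, the softmax function mentioned above maps a vector of logits $z = (z_1, \dots, z_K)$ over $K$ classes to probabilities. The symbols $z$ and $K$ are notation introduced here for illustration:

$$
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
$$

Every output is positive and the outputs sum to 1, which is why they can be read as probabilities.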




Finally, the **pipeline**:

1. Interprets the model output.

2. Converts the **logits** or vectors into human-readable values (labels, probabilities, text...).

In [1]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

[{'label': 'POSITIVE', 'score': 0.9598046541213989},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [2]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [3]:
raw_inputs = [
    "I am very interested in understanding how a transformer works.",
    "I really hate not understanding things!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="tf")
print(inputs)

# shape=(2, 14)
# There are 2 sentences (batch size = 2)
# Each sentence is encoded as 14 token IDs: the longest sentence has 14 tokens and the shorter one is padded with 0s up to that length


{'input_ids': <tf.Tensor: shape=(2, 14), dtype=int32, numpy=
array([[  101,  1045,  2572,  2200,  4699,  1999,  4824,  2129,  1037,
        10938,  2121,  2573,  1012,   102],
       [  101,  1045,  2428,  5223,  2025,  4824,  2477,   999,   102,
            0,     0,     0,     0,     0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 14), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]], dtype=int32)>}


In [5]:
from transformers import TFAutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModel.from_pretrained(checkpoint)

outputs = model(inputs)
print(outputs.last_hidden_state.shape)

# The hidden state of each token has size 768
# Each token (not necessarily a whole word) in each sentence is represented internally by a 768-dimensional vector.
# These dimensions encode complex semantic information that the model learned during pretraining and fine-tuning.

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


(2, 14, 768)


In [6]:
from transformers import TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(inputs)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [7]:
print(outputs.logits.shape)

# First dimension: number of sentences (batch size = 2)
# Second dimension: number of classes (2 here: NEGATIVE or POSITIVE sentiment)

(2, 2)


In [8]:
print(outputs.logits)

tf.Tensor(
[[-3.9192057  4.191283 ]
 [ 4.1876225 -3.361947 ]], shape=(2, 2), dtype=float32)


Our model predicted [-3.9192057, 4.191283] for the first sentence and [4.1876225, -3.361947] for the second. These are not probabilities but logits: the raw, unnormalized scores output by the last layer of the model. To turn them into probabilities, they must go through a softmax layer.

In [9]:
import tensorflow as tf

predictions = tf.math.softmax(outputs.logits, axis=-1)
print(predictions)

tf.Tensor(
[[3.002818e-04 9.996997e-01]
 [9.994740e-01 5.260597e-04]], shape=(2, 2), dtype=float32)


Now we can see that the model predicted [0.0003, 0.9997] for the first sentence and [0.9995, 0.0005] for the second. These are recognizable probability scores.
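
As a sanity check, the same probabilities can be reproduced without TensorFlow. The sketch below applies a plain-Python softmax to the logits printed earlier; the `softmax` helper is written here for illustration and is not part of the notebook.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing and does not
    # change the result.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output above.
logits = [[-3.9192057, 4.191283], [4.1876225, -3.361947]]
probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([round(p, 4) for p in row])
```

Rounded to four decimals, the values should agree with the tensor printed by `tf.math.softmax`.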

In [10]:
model.config.id2label


{0: 'NEGATIVE', 1: 'POSITIVE'}

Index 0 is NEGATIVE and index 1 is POSITIVE, so the predictions read:

1. First sentence: NEGATIVE: 0.0003, POSITIVE: 0.9997
2. Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005
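
The last step, turning probabilities into label-and-score dictionaries like the ones a pipeline returns, can be sketched as follows. This is a simplified illustration rather than the library's actual postprocessing code: the `to_predictions` helper is hypothetical, and the probabilities are rounded copies of the softmax output above.

```python
# Label mapping as reported by model.config.id2label.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Rounded softmax output for the two example sentences.
probabilities = [[0.0003, 0.9997], [0.9995, 0.0005]]


def to_predictions(probabilities, id2label):
    """Pick the most likely class per sentence and attach its label."""
    results = []
    for row in probabilities:
        # Index of the highest probability (argmax).
        best = max(range(len(row)), key=row.__getitem__)
        results.append({"label": id2label[best], "score": row[best]})
    return results


print(to_predictions(probabilities, id2label))
```

The result has the same shape as the classifier output in the first cell: one label and one score per input sentence.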