<a href="https://colab.research.google.com/github/JoseASotoP/Introduction_DL_Master/blob/main/HugginFace_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 1 - Pipeline

In [1]:
!pip install transformers -q


[Transformers library](https://github.com/huggingface/transformers)

In [2]:
from transformers import pipeline

In [3]:
# classification task (sentiment analysis)
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


In [5]:
from google.colab import output
output.enable_custom_widget_manager()

Support for third party widgets will remain active for the duration of the session. To disable support:

In [6]:
from google.colab import output
output.disable_custom_widget_manager()

In [7]:
res = classifier("Me encantan las clases de Nechu, explica genial")
print(res)

[{'label': 'POSITIVE', 'score': 0.8972380757331848}]


In [8]:
res = classifier("El profesor es bueno todo será sencillo")
print(res)

[{'label': 'NEGATIVE', 'score': 0.6551797986030579}]


#### Model selection

In [9]:
# explicitly select the same model the pipeline uses by default
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")


Device set to use cuda:0


In [10]:
res = classifier("El profesor es bueno todo será sencillo")
print(res)

[{'label': 'NEGATIVE', 'score': 0.6551797986030579}]


A wide variety of pipelines is available: [list of pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.pipeline)

### 2 - Model and Tokenizer

We ran the model with a single line, but several stages happen under the hood. The following code walks through the most important ones.

In [11]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [12]:
# Load the stages explicitly, this time with a Spanish sentiment model.
# The pipeline's default English model is kept commented out for reference:
# model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model_name = "pysentimiento/robertuito-sentiment-analysis"

In [13]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=tokenizer
)

res = classifier("El profesor es bueno todo será sencillo")
print(res)


Device set to use cuda:0


[{'label': 'POS', 'score': 0.9181905388832092}]


### 3 - What is the tokenizer for?

Encoding (before the model)

In [16]:
secuencia = "El profesor es bueno todo será sencillo"
res = tokenizer(secuencia)
print(res)

{'input_ids': [0, 459, 5934, 442, 1220, 658, 1504, 9764, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


Step by step

In [17]:
tokens = tokenizer.tokenize(secuencia)
print(tokens)

['▁el', '▁profesor', '▁es', '▁bueno', '▁todo', '▁será', '▁sencillo']


In [18]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

[459, 5934, 442, 1220, 658, 1504, 9764]


Decoding (after the model)

In [19]:
tokenizer.decode(res['input_ids'])

'<s> el profesor es bueno todo será sencillo</s>'
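
The encode and decode steps above can be mimicked with a plain dictionary. This is only a minimal sketch: the toy vocabulary `VOCAB` holds just the ids printed above, the `<pad>` entry name is an assumption, and the helper functions `encode` and `decode` are hypothetical stand-ins, not the tokenizer's actual implementation.

```python
# Toy vocabulary built from the ids printed above (assumption: "<pad>" is the
# name of id 1); the real tokenizer's vocabulary is far larger.
VOCAB = {
    "<s>": 0, "<pad>": 1, "</s>": 2,
    "▁el": 459, "▁profesor": 5934, "▁es": 442, "▁bueno": 1220,
    "▁todo": 658, "▁será": 1504, "▁sencillo": 9764,
}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}


def encode(tokens):
    """Map tokens to ids and wrap them in the start/end special tokens."""
    return [VOCAB["<s>"]] + [VOCAB[t] for t in tokens] + [VOCAB["</s>"]]


def decode(ids):
    """Map ids back to tokens and turn the '▁' word marker into a space."""
    return "".join(ID_TO_TOKEN[i] for i in ids).replace("▁", " ")


tokens = ["▁el", "▁profesor", "▁es", "▁bueno", "▁todo", "▁será", "▁sencillo"]
ids = encode(tokens)
text = decode(ids)
print(ids)
print(text)
```

The two special tokens explain why `tokenizer(secuencia)` returns two more ids than `convert_tokens_to_ids(tokens)`.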

### 4 - Saving the model and tokenizer locally

In [20]:
model_path = "./modelo"
tokenizer.save_pretrained(model_path)
model.save_pretrained(model_path)

In [21]:
tokenizer_local = AutoTokenizer.from_pretrained(model_path)
model_local = AutoModelForSequenceClassification.from_pretrained(model_path)

### 5 - PyTorch

Also compatible with TensorFlow

In [22]:
import torch
import torch.nn.functional as F

In [23]:
sentences = [
    "He estado deseando un curso así toda mi vida",
    "Me encanta HuggingFace"
]

Tokenizer

In [24]:
batch = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
print(batch)

{'input_ids': tensor([[    0,   723,  1524, 12667,   471,  4095,   816,  1001,   507,   837,
             2],
        [    0,   474,  2479,   925, 24020, 23912,     2,     1,     1,     1,
             1]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}
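
What `padding=True` does to the batch above can be sketched in plain Python. The helper `pad_batch` is hypothetical, and `PAD_ID = 1` is taken from the printed output rather than read from the tokenizer. Truncation here is a plain slice, a simplification: the real tokenizer also keeps the closing special token when it truncates.

```python
PAD_ID = 1  # pad id seen in the printed batch; assumed, not read from the tokenizer


def pad_batch(sequences, max_length=512):
    """Truncate to max_length, then right-pad every sequence to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [PAD_ID] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded ids of the two sentences, copied from the output above
seqs = [
    [0, 723, 1524, 12667, 471, 4095, 816, 1001, 507, 837, 2],
    [0, 474, 2479, 925, 24020, 23912, 2],
]
batch_sketch = pad_batch(seqs)
print(batch_sketch["input_ids"][1])
print(batch_sketch["attention_mask"][1])
```

The shorter sentence is padded with four pad ids, and its attention mask has four trailing zeros so that those positions are ignored.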


Model

In [26]:
with torch.no_grad():
    # Move batch to the same device as the model
    batch = {k: v.to(model.device) for k, v in batch.items()}
    outputs = model(**batch)
    predictions = F.softmax(outputs.logits, dim=1)
    labels = torch.argmax(predictions, dim=1)

In [27]:
# Logits
print(outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[-0.6811, -0.0606,  0.9667],
        [-1.8555, -0.3444,  2.4575]], device='cuda:0'), hidden_states=None, attentions=None)


In [28]:
# Class probabilities, in the order NEG, NEU, POS
print(predictions)

tensor([[0.1241, 0.2309, 0.6450],
        [0.0125, 0.0565, 0.9310]], device='cuda:0')
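
To see how the logits become these probabilities, here is a small pure-Python softmax applied to the logits printed above. It is a sketch for illustration, not the `F.softmax` implementation; the function name `softmax` and the hard-coded, rounded logits are assumptions of the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied (rounded) from the printed SequenceClassifierOutput
logits = [
    [-0.6811, -0.0606, 0.9667],
    [-1.8555, -0.3444, 2.4575],
]
probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 4) for p in row])
```

Because the logits are copied at four decimals, the result should agree with the printed tensor to roughly three decimal places, and each row sums to 1.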


In [29]:
# predicted label indices
print(labels)

tensor([2, 2], device='cuda:0')
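
The label indices can be turned back into the same kind of dictionary the pipeline returns. In the sketch below, the mapping `ID2LABEL` is an assumption inferred from the Neg/Neu/Pos ordering noted earlier and from the `'POS'` label the pipeline printed. In the notebook the authoritative mapping is `model.config.id2label`. The helper `top_label` is hypothetical.

```python
# Assumed mapping, inferred from the class order noted earlier;
# in the notebook, read it from model.config.id2label instead.
ID2LABEL = {0: "NEG", 1: "NEU", 2: "POS"}

# Probabilities copied from the printed tensor
probs = [
    [0.1241, 0.2309, 0.6450],
    [0.0125, 0.0565, 0.9310],
]


def top_label(row):
    """Return the most probable class as a pipeline-style dict."""
    idx = max(range(len(row)), key=row.__getitem__)  # argmax over the row
    return {"label": ID2LABEL[idx], "score": row[idx]}


results = [top_label(row) for row in probs]
print(results)
```

This is the post-processing step that `pipeline` performs after the model's forward pass: softmax, argmax, then an index-to-label lookup.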


In [30]:
model

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(30002, 768, padding_idx=1)
      (position_embeddings): Embedding(130, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
         