<a href="https://colab.research.google.com/github/MateoProjects/Transformers/blob/main/Transformers_Intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers

Hugging Face tutorial

## Pipelines

Thanks to **pipelines**, I can use pre-trained models without having to build or train them myself.


In [None]:
!pip install transformers

In [3]:
from transformers import pipeline

In [4]:
# Get the classifier using a pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


label: NEGATIVE, with score: 0.9991


Example of sequence classification, using a model to determine whether two sequences are paraphrases of each other. The process is as follows:
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and is loaded with the weights stored in the checkpoint.
2. Build a sequence from the two sentences, with the correct model-specific separators, token type IDs, and attention masks (all created automatically by the tokenizer).
3. Pass this sequence through the model so that it is classified into one of the two available classes: 0 (not a paraphrase) and 1 (is a paraphrase).
4. Compute the softmax of the result to get probabilities over the classes.
5. Print the results.
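
Step 4 can be sketched in plain Python to show what the softmax does to the two logits. This is only an illustration: the `softmax` helper and the `example_logits` values are hypothetical, not taken from the model, and in the notebook the same job is done by `torch.softmax`.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the classes ["not paraphrase", "is paraphrase"].
example_logits = [-1.0, 1.5]
probs = softmax(example_logits)

for name, p in zip(["not paraphrase", "is paraphrase"], probs):
    print(f"{name}: {p:.3f}")
```

The larger logit always gets the larger probability, so the predicted class is the same before and after the softmax; the softmax only makes the scores readable as percentages.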


In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

# The tokenizer automatically adds the model-specific special tokens (for BERT, [CLS] and [SEP])
# to the sequence, and also computes the token type IDs and the attention mask.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits

paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")


not paraphrase: 10%
is paraphrase: 90%
not paraphrase: 94%
is paraphrase: 6%


## Preprocessing



In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

The code below tokenizes the sentence. It produces the input IDs, the token type IDs, and the attention mask.

In [7]:
encoded_input = tokenizer("Hello, I'm a single sentence!")
print(encoded_input)

{'input_ids': [101, 8667, 117, 146, 112, 182, 170, 1423, 5650, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


I can decode `encoded_input` back into text as follows:

In [8]:
tokenizer.decode(encoded_input["input_ids"])

"[CLS] Hello, I'm a single sentence! [SEP]"

### Example

Tokenize several sentences at once.

In [9]:
batch_sentences = ["Hello I'm a single sentence", "And another sentence", "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102], [101, 1262, 1330, 5650, 102], [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}


If I send several sentences to the tokenizer at once in order to build a batch to feed a model, I need to do the following:

* **Padding**: pad each sentence to the length of the longest one in the batch.
* **Truncation**: truncate each sentence to the maximum length the model can accept (if applicable).
* **Returning tensors**: have the tokenizer return tensors instead of Python lists.

I can do this with the following options:

In [10]:
batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(batch)

{'input_ids': tensor([[ 101, 8667,  146,  112,  182,  170, 1423, 5650,  102],
        [ 101, 1262, 1330, 5650,  102,    0,    0,    0,    0],
        [ 101, 1262, 1103, 1304, 1304, 1314, 1141,  102,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 0]])}
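
To make the relationship between padding and the attention mask explicit, here is a minimal pure-Python sketch of padding to the longest sequence. The `pad_batch` helper is hypothetical and is not the library's implementation; it assumes a padding ID of 0, which is what the output above shows for this tokenizer.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every list of token IDs to the longest one and build the attention masks."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token IDs copied from the unpadded batch printed earlier.
unpadded = [
    [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
    [101, 1262, 1330, 5650, 102],
    [101, 1262, 1103, 1304, 1304, 1314, 1141, 102],
]
padded = pad_batch(unpadded)
for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```

Every row ends up with the same length, which is what allows the batch to be stacked into a single tensor.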


I can tokenize two separate lists in a single step; the sentences are paired position by position.

In [11]:
batch_sentences = ["Hello I'm a single sentence", "And another sentence", "And the very very last one"]
batch_of_second_sentences = [
    "I'm a sentence that goes with the first sentence",
    "And I should be encoded with the second sentence",
    "And I go with the very last one",
]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
print(encoded_inputs)

{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102], [101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102], [101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
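
The layout of a sentence pair in the output above can be reproduced with a short sketch. The `build_pair` helper is hypothetical; it assumes the BERT layout visible in the printed IDs, where 101 is `[CLS]` and 102 is `[SEP]`, and it works on IDs that have already been produced by the tokenizer.

```python
CLS_ID, SEP_ID = 101, 102  # special token IDs as they appear in the outputs above

def build_pair(first_ids, second_ids):
    """Join two ID lists as [CLS] first [SEP] second [SEP] with segment markers."""
    input_ids = [CLS_ID] + first_ids + [SEP_ID] + second_ids + [SEP_ID]
    # Segment 0 covers [CLS], the first sentence and its [SEP];
    # segment 1 covers the second sentence and the final [SEP].
    token_type_ids = [0] * (len(first_ids) + 2) + [1] * (len(second_ids) + 1)
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

# IDs of "And another sentence" and "And I should be encoded with the second sentence",
# copied from the output above without their special tokens.
first = [1262, 1330, 5650]
second = [1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650]
encoded_pair = build_pair(first, second)
print(encoded_pair)
```

The `token_type_ids` are what lets the model tell which tokens belong to the first sentence and which to the second.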


In [12]:
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")

Things to keep in mind:

* `padding` can be a boolean or a string:

  1. **True or 'longest'** to pad to the longest sequence in the batch (doing no padding if you only provide a single sequence).
  2. **'max_length'** to pad to a length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None). If you only provide a single sequence, padding will still be applied to it.
  3. **False or 'do_not_pad'** to not pad the sequences. As we have seen before, this is the default behavior.

* `truncation` can be a boolean or a string:

  1. **True or 'longest_first'** truncate to a maximum length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None). This will truncate token by token, removing a token from the longest sequence in the pair until the proper length is reached.
  2. **'only_second'** truncate to a maximum length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None). This will only truncate the second sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
  3. **'only_first'** truncate to a maximum length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None). This will only truncate the first sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
  4. **False or 'do_not_truncate'** to not truncate the sequences. As we have seen before, this is the default behavior.

* `max_length`:

  1. max_length controls the length of the padding/truncation. It can be an **integer or None**, **in which case it will default to the maximum length the model can accept**. If the model has no specific maximum input length, truncation/padding to max_length is deactivated.
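
The 'longest_first' strategy described above can be illustrated with a small pure-Python sketch. The `truncate_longest_first` helper is hypothetical and only mimics the described behavior; it assumes three special tokens per pair (`[CLS]` and two `[SEP]`, as in the BERT outputs above) and, on a tie, drops a token from the second sequence, which is an assumption of this sketch.

```python
def truncate_longest_first(first_ids, second_ids, max_length, num_special_tokens=3):
    """Drop tokens one at a time from the longer sequence until the pair fits."""
    first, second = list(first_ids), list(second_ids)
    # Room left for real tokens once the special tokens are accounted for.
    budget = max(max_length - num_special_tokens, 0)
    while len(first) + len(second) > budget:
        if len(first) > len(second):
            first.pop()   # remove the last token of the first sequence
        else:
            second.pop()  # on a tie, this sketch shortens the second sequence
    return first, second

# IDs of the first sentence pair from the earlier output, without special tokens.
first = [8667, 146, 112, 182, 170, 1423, 5650]
second = [146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650]

first_kept, second_kept = truncate_longest_first(first, second, max_length=12)
print(len(first_kept), len(second_kept))
```

Because tokens are removed from whichever sequence is currently longer, both sentences end up with a similar share of the available length instead of one being cut entirely.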

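### Pre-tokenized inputs

When my input has already been split into words, I don't need to join it back into a single string. I can pass the list of words with `is_split_into_words=True`. The tokenizer still splits each word into subword tokens where needed (below, "I'm" becomes three tokens) and converts them to IDs.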

In [13]:
encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
print(encoded_input)

{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
