<a href="https://colab.research.google.com/github/LCaravaggio/NLP/blob/main/09_Transformers/HuggingFace_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## HuggingFace overview

An overview of tokenizers, models, datasets, and pipelines.

In [1]:
!pip install transformers==4.29.2 datasets==2.12.0 bertviz==1.4.0 watermark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.29.2
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting datasets==2.12.0
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting bertviz==1.4.0
  Downloading bertviz-1.4.0-py3-none-any.whl (157 kB)
Collecting watermark
  Downloading watermark-2.4.2-py2.py3-none-any.whl (7.5 kB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers==4.29.2)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)

In [2]:
import json
import numpy as np
import pandas as pd
import torch

from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification, AutoModelForMaskedLM,
    AutoModelForTokenClassification, AutoModel, pipeline
)
from datasets import load_dataset, DatasetDict
from bertviz import head_view, model_view
from bertviz.neuron_view import show
from scipy.spatial.distance import cosine


In [3]:
%load_ext watermark

In [4]:
%watermark -vp transformers,datasets,pandas,numpy

Python implementation: CPython
Python version       : 3.10.11
IPython version      : 7.34.0

transformers: 4.29.2
datasets    : 2.12.0
pandas      : 1.5.3
numpy       : 1.22.4



## Tokenizers

Pretrained models are developed together with **tokenizers**: they take raw strings as input and produce dictionaries holding the model inputs as output.

In [5]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
print(tokenizer)

DistilBertTokenizerFast(name_or_path='distilbert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)


In [6]:
input_str = "These pretzels are making me thirsty!"
tokenized_input = tokenizer(input_str)

print("> Tokenizer input:")
print(input_str)
print("-"*70)
print("> Tokenizer output:")
print(tokenized_input)
print("-"*70)
print("> Tokenizer output (input IDs):")
print(tokenized_input["input_ids"])


> Tokenizer input:
These pretzels are making me thirsty!
----------------------------------------------------------------------
> Tokenizer output:
{'input_ids': [101, 1636, 3073, 5745, 5999, 1132, 1543, 1143, 26190, 1183, 106, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
----------------------------------------------------------------------
> Tokenizer output (input IDs):
[101, 1636, 3073, 5745, 5999, 1132, 1543, 1143, 26190, 1183, 106, 102]


Let's see what happens under the hood, step by step.

In [7]:
def tokenize_step_by_step(input_str):
    cls = [tokenizer.cls_token_id]
    sep = [tokenizer.sep_token_id]
    input_tokens = tokenizer.tokenize(input_str)
    input_ids = tokenizer.convert_tokens_to_ids(input_tokens)
    input_ids_special_tokens = cls + input_ids + sep
    decoded_str = tokenizer.decode(input_ids_special_tokens)
    print("input:                  ", input_str)
    print("tokenize:               ", input_tokens)
    print("convert_tokens_to_ids:  ", input_ids)
    print("add special tokens:     ", input_ids_special_tokens)
    print("-"*70)
    print("decode (IDs to strings):", decoded_str)

tokenize_step_by_step(input_str)

input:                   These pretzels are making me thirsty!
tokenize:                ['These', 'pre', '##tz', '##els', 'are', 'making', 'me', 'thirst', '##y', '!']
convert_tokens_to_ids:   [1636, 3073, 5745, 5999, 1132, 1543, 1143, 26190, 1183, 106]
add special tokens:      [101, 1636, 3073, 5745, 5999, 1132, 1543, 1143, 26190, 1183, 106, 102]
----------------------------------------------------------------------
decode (IDs to strings): [CLS] These pretzels are making me thirsty! [SEP]
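
The split of `pretzels` into `pre`, `##tz`, `##els` comes from WordPiece's greedy longest-match-first lookup. The following is a minimal sketch of that lookup for a single word, not the library's implementation. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made for illustration; the real vocabulary has 28,996 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matches: the whole word maps to [UNK]
        pieces.append(current)
        start = end
    return pieces


# Toy vocabulary (hypothetical, only large enough for this example).
toy_vocab = {"These", "are", "pre", "##tz", "##els", "thirst", "##y"}

pieces_pretzels = wordpiece_tokenize("pretzels", toy_vocab)
pieces_thirsty = wordpiece_tokenize("thirsty", toy_vocab)
pieces_unknown = wordpiece_tokenize("soup", toy_vocab)
print(pieces_pretzels)
print(pieces_thirsty)
print(pieces_unknown)
```

With this toy vocabulary the two known words split the same way as in the tokenizer output above, while a word with no matching pieces falls back to the unknown token.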


In [8]:
# Fast tokenizers offer more functionality (nowadays almost all of them are fast by default)
if tokenizer.is_fast:
    inputs = tokenizer._tokenizer.encode(input_str)
    print("Input:".ljust(20), input_str)
    print("IDs:".ljust(20), inputs.ids)
    print("Tokens:".ljust(20), inputs.tokens)
    print("Special tokens mask:".ljust(20), inputs.special_tokens_mask)
    print()
    char_idx = 8
    token_idx = inputs.char_to_token(char_idx)
    print(f"Character no. {char_idx + 1} is '{input_str[char_idx]}'")
    print(f"It is in token no. {token_idx}, '{inputs.tokens[token_idx]}'")

# See also: token_to_word(), char_to_word()

Input:               These pretzels are making me thirsty!
IDs:                 [101, 1636, 3073, 5745, 5999, 1132, 1543, 1143, 26190, 1183, 106, 102]
Tokens:              ['[CLS]', 'These', 'pre', '##tz', '##els', 'are', 'making', 'me', 'thirst', '##y', '!', '[SEP]']
Special tokens mask: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

Character no. 9 is 'e'
It is in token no. 2, 'pre'


To train models we will want to:

* get PyTorch tensors instead of lists
* pass multiple sequences as input (for faster inference)
    * this implies truncating to max_length and padding (on the right, because BERT-style absolute position embeddings are indexed from the start of the sequence, 0 to max_length - 1)
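
The truncation and right-padding steps listed above can be sketched in plain Python. This is an illustrative simplification, not what the tokenizer does internally: the helper name `pad_batch` is hypothetical, and this naive truncation simply cuts the list, whereas the real tokenizer keeps the closing `[SEP]` token.

```python
def pad_batch(batch_ids, pad_id=0, max_length=None):
    """Truncate to max_length, then right-pad to the longest sequence in the batch."""
    if max_length is not None:
        # Naive truncation: drops everything after max_length (including [SEP]).
        batch_ids = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two already-tokenized sequences of different lengths (illustrative IDs).
batch = [[101, 1302, 1167, 102], [101, 146, 1821, 4055, 2008, 106, 102]]
padded = pad_batch(batch, pad_id=0, max_length=6)
print(padded["input_ids"])
print(padded["attention_mask"])
```

The attention mask is what lets the model ignore the padded positions, which is why the tokenizer returns it alongside `input_ids`.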

In [9]:
model_inputs = tokenizer(input_str, return_tensors="pt")
print(model_inputs)

{'input_ids': tensor([[  101,  1636,  3073,  5745,  5999,  1132,  1543,  1143, 26190,  1183,
           106,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [10]:
input_strings = [
    "These pretzels are making me thirsty!",
    "I am speechless! I am without speech.",
    "No more soup for you!",
    "I'm a wealthy industrialist and philanthropist and a bicyclist."
]

model_inputs = tokenizer(input_strings, return_tensors="pt", padding=True, truncation=True)

print(f"Pad token: {tokenizer.pad_token}")
print(f"Pad token id: {tokenizer.pad_token_id}")
print("-"*70)
print("Batch encode:")
print(model_inputs)
print("-"*70)
print("Batch decode:")
print(*tokenizer.batch_decode(model_inputs.input_ids, skip_special_tokens=False), sep="\n")

Pad token: [PAD]
Pad token id: 0
----------------------------------------------------------------------
Batch encode:
{'input_ids': tensor([[  101,  1636,  3073,  5745,  5999,  1132,  1543,  1143, 26190,  1183,
           106,   102,     0,     0,     0,     0,     0],
        [  101,   146,  1821,  4055,  2008,   106,   146,  1821,  1443,  4055,
           119,   102,     0,     0,     0,     0,     0],
        [  101,  1302,  1167, 13128,  1111,  1128,   106,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0],
        [  101,   146,   112,   182,   170,  6822, 24916,  1105, 16581,  1105,
           170, 16516,  3457,  1665,  7276,   119,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
--------------------------------------------------

## Models

Models have a "body" and a "head". The "head" is the set of additional weights that depend on the specific task we are solving. HF automatically returns the right architecture for our task.

We only need to pick one of:

```
AutoModel # (hidden states only, no head)
AutoModelForCausalLM
AutoModelForMaskedLM
AutoModelForSequenceClassification
AutoModelForTokenClassification
# etc
```

Of course, each model is suited only to certain kinds of tasks (e.g. we cannot use GPT-2 for masked LM!).

[Here](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSequenceClassification.from_pretrained) are all the models available for sequence classification. We will load "distilled" BERT (DistilBERT) to classify sequences into two classes, 0 and 1.

In [11]:
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-cased', num_labels=2)
# num_labels=2 --> the head will have 2 outputs (0 and 1)

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bia

The warning tells us that the weights of the classification head have not been trained yet. In other words, we need to fine-tune the model (next class!).

Let's see how to run inference.

In [12]:
input_str

'These pretzels are making me thirsty!'

In [13]:
model_inputs = tokenizer(input_str, return_tensors="pt")
model.eval() # eval mode: disables stochastic components such as dropout
model_outputs = model(**model_inputs)

print(model_inputs)
print("-"*70)
print(model_outputs)
print("-"*70)
print(f"Probabilities: {torch.softmax(model_outputs.logits, dim=1)}")

# we get 2 values because HF uses cross-entropy loss over K classes (here K=2)
# we could return a single value passed through a sigmoid instead, but this is the standard setup

{'input_ids': tensor([[  101,  1636,  3073,  5745,  5999,  1132,  1543,  1143, 26190,  1183,
           106,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
----------------------------------------------------------------------
SequenceClassifierOutput(loss=None, logits=tensor([[-0.0600,  0.0131]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
----------------------------------------------------------------------
Probabilities: tensor([[0.4817, 0.5183]], grad_fn=<SoftmaxBackward0>)


If we had labels, we could also compute the loss.

In [14]:
tags = ["NEG", "POS"]
model_inputs['labels'] = torch.tensor([1])
with torch.no_grad(): # disables gradient computation (saves memory)
    model.eval()
    model_outputs = model(**model_inputs)

print(model_outputs)
print("-"*70)
print(f"Probabilities: {torch.softmax(model_outputs.logits, dim=1)}")
print(f"Prediction: {tags[model_outputs.logits.argmax()]}")
print(f"Loss: {model_outputs.loss:.4f}")

SequenceClassifierOutput(loss=tensor(0.6573), logits=tensor([[-0.0600,  0.0131]]), hidden_states=None, attentions=None)
----------------------------------------------------------------------
Probabilities: tensor([[0.4817, 0.5183]])
Prediction: POS
Loss: 0.6573
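
The probabilities and the loss printed above can be reproduced by hand from the two logits. The sketch below uses only the standard library and the rounded logits shown in the output, so it agrees with the model's values only up to rounding. The helper names `softmax` and `cross_entropy` are defined here for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])


logits = [-0.0600, 0.0131]  # rounded values taken from the output above
probs = softmax(logits)
loss = cross_entropy(logits, label=1)
print([round(p, 4) for p in probs])
print(round(loss, 3))
```

Because the head is still untrained, both classes get a probability close to 0.5 and the loss sits near ln(2) ≈ 0.693, the value for a uniform guess over two classes.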


In [15]:
# and we can use plain PyTorch (e.g. for a custom training loop)
# example of a parameter update: forward --> loss --> backward
model_outputs = model(**model_inputs)
label = torch.tensor([1])
loss = torch.nn.functional.cross_entropy(model_outputs.logits, label)
print("Loss:")
print(loss.item())
loss.backward() # computes gradients
# optimizer.step() # to update params
print("NN params:")
list(model.named_parameters())[0]

Loss:
0.6572797298431396
NN params:


('distilbert.embeddings.word_embeddings.weight',
 Parameter containing:
 tensor([[-2.5130e-02, -3.3044e-02, -2.4396e-03,  ..., -1.0848e-02,
          -4.6824e-02, -9.4855e-03],
         [-4.8244e-03, -2.1486e-02, -8.7145e-03,  ..., -2.6029e-02,
          -3.7862e-02, -2.4103e-02],
         [-1.6531e-02, -1.7862e-02,  1.0596e-03,  ..., -1.6371e-02,
          -3.5670e-02, -3.1419e-02],
         ...,
         [-9.6466e-03,  1.4814e-02, -2.9182e-02,  ..., -3.7873e-02,
          -4.6263e-02, -1.6803e-02],
         [-1.3170e-02,  6.5378e-05, -3.7222e-02,  ..., -4.3558e-02,
          -1.1252e-02, -2.2152e-02],
         [ 1.1905e-02, -2.3293e-02, -2.2506e-02,  ..., -2.7136e-02,
          -4.3556e-02,  1.0529e-04]], requires_grad=True))

## Datasets

HF also hosts [datasets](https://huggingface.co/datasets).

There are [many utilities](https://huggingface.co/docs/datasets/process) for modifying the structure and content of a dataset (e.g. splitting data, reordering rows, renaming columns, dropping columns, transforming examples, concatenating datasets, etc.)

Here we will load the [IMDb reviews](https://huggingface.co/datasets/imdb) dataset and truncate the documents.

In [16]:
imdb_dataset = load_dataset("imdb")

In [17]:
# rows and columns
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [18]:
imdb_dataset["train"][3]

{'text': "This film was probably inspired by Godard's Masculin, féminin and I urge you to see that film instead.<br /><br />The film has two strong elements and those are, (1) the realistic acting (2) the impressive, undeservedly good, photo. Apart from that, what strikes me most is the endless stream of silliness. Lena Nyman has to be most annoying actress in the world. She acts so stupid and with all the nudity in this film,...it's unattractive. Comparing to Godard's film, intellectuality has been replaced with stupidity. Without going too far on this subject, I would say that follows from the difference in ideals between the French and the Swedish society.<br /><br />A movie of its time, and place. 2/10.",
 'label': 0}

In [19]:
# the unsupervised split has no sentiment labels
imdb_dataset["unsupervised"][:3]

{'text': ['This is just a precious little diamond. The play, the script are excellent. I cant compare this movie with anything else, maybe except the movie "Leon" wonderfully played by Jean Reno and Natalie Portman. But... What can I say about this one? This is the best movie Anne Parillaud has ever played in (See please "Frankie Starlight", she\'s speaking English there) to see what I mean. The story of young punk girl Nikita, taken into the depraved world of the secret government forces has been exceptionally over used by Americans. Never mind the "Point of no return" and especially the "La femme Nikita" TV series. They cannot compare the original believe me! Trash these videos. Buy this one, do not rent it, BUY it. BTW beware of the subtitles of the LA company which "translate" the US release. What a disgrace! If you cant understand French, get a dubbed version. But you\'ll regret later :)',
  'When I say this is my favourite film of all time, that comment is not to be taken lightly

In [20]:
def truncate(example):
    """Keep the first 50 words.
    Return a dict so that map() rewrites the dataset's columns.
    """
    return {
        'text': " ".join(example['text'].split()[:50]),
        'label': example['label']
    }

In [21]:
# Take random examples and truncate them:
small_dataset = DatasetDict(
    train=imdb_dataset['train'].shuffle(seed=33).select(range(256)).map(truncate),
    val=imdb_dataset['train'].shuffle(seed=33).select(range(256, 256+128)).map(truncate),
)

In [22]:
small_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 256
    })
    val: Dataset({
        features: ['text', 'label'],
        num_rows: 128
    })
})

In [23]:
small_dataset['train'][0]

{'text': "I'm surprised that anyone involved with the production of this series would actually admit responsibility. The script is so unfunny it must have been written by someone who failed the entrance exam for the Canadian Comedy Writers' Union (and that's saying something!). Get out your binoculars if you want, but",
 'label': 0}

In [24]:
# We can process the examples in batches (e.g. to build training batches)
# (HF offers other utilities for this as well)
small_tokenized_dataset = small_dataset.map(
    lambda example: tokenizer(example['text'], padding=True, truncation=True),
    batched=True,
    batch_size=16
)

small_tokenized_dataset = small_tokenized_dataset.remove_columns(["text"])
small_tokenized_dataset = small_tokenized_dataset.rename_column("label", "labels")
small_tokenized_dataset.set_format("torch")

In [25]:
small_tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 256
    })
    val: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 128
    })
})

In [26]:
small_tokenized_dataset['train'][0]

{'labels': tensor(0),
 'input_ids': tensor([  101,   146,   112,   182,  3753,  1115,  2256,  2017,  1114,  1103,
          1707,  1104,  1142,  1326,  1156,  2140,  5890,  4812,   119,  1109,
          5444,  1110,  1177,  8362, 14703, 15863,  1122,  1538,  1138,  1151,
          1637,  1118,  1800,  1150,  2604,  1103,  3448, 12211,  1111,  1103,
          2122,  8909, 10269,   112,  1913,   113,  1105,  1115,   112,   188,
          2157,  1380,   106,   114,   119,  3949,  1149,  1240,  9055, 13335,
          5552,  1116,  1191,  1128,  1328,   117,  1133,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [27]:
# each sequence is padded to the longest sequence in its batch of 16:
for i in range(0, 20):
    print(f"{i})", len(small_tokenized_dataset['train'][i]["input_ids"]))

0) 100
1) 100
2) 100
3) 100
4) 100
5) 100
6) 100
7) 100
8) 100
9) 100
10) 100
11) 100
12) 100
13) 100
14) 100
15) 100
16) 88
17) 88
18) 88
19) 88
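
The jump from 100 to 88 at row 16 follows from `batch_size=16` combined with `padding=True`: every row is padded to the longest sequence in its own batch. The sketch below illustrates that rule with made-up lengths; `padded_lengths` and `toy_lengths` are hypothetical names, not part of the HF API.

```python
def padded_lengths(lengths, batch_size):
    """Length of each row after padding to the longest row in its batch."""
    out = []
    for start in range(0, len(lengths), batch_size):
        chunk = lengths[start:start + batch_size]
        out.extend([max(chunk)] * len(chunk))  # the whole batch shares one length
    return out


# Hypothetical token counts before padding, processed in batches of 4.
toy_lengths = [5, 9, 7, 4, 6, 3]
result = padded_lengths(toy_lengths, batch_size=4)
print(result)
```

Rows in the same batch always end up with the same length, while different batches can differ, which matches the 100/88 pattern above.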


## Pipelines

For standard NLP tasks there are models that are already pretrained **and fine-tuned**. HF exposes them through the [pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines) interface.

For example, for sentiment classification:

In [28]:
sentiment_analysis = pipeline("sentiment-analysis", model="siebert/sentiment-roberta-large-english")

Xformers is not installed correctly. If you want to use memorry_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [29]:
print(*input_strings, sep="\n")
print("-"*70)
outputs = sentiment_analysis(input_strings)
print(*outputs, sep="\n")

These pretzels are making me thirsty!
I am speechless! I am without speech.
No more soup for you!
I'm a wealthy industrialist and philanthropist and a bicyclist.
----------------------------------------------------------------------
{'label': 'NEGATIVE', 'score': 0.998259961605072}
{'label': 'NEGATIVE', 'score': 0.9994136095046997}
{'label': 'NEGATIVE', 'score': 0.9969514608383179}
{'label': 'POSITIVE', 'score': 0.9964011907577515}


Or for [NER](https://huggingface.co/dslim/bert-base-NER):

In [30]:
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
ner = pipeline("ner", model=model, tokenizer=tokenizer)

In [31]:
print(ner.model.config.id2label)

{0: 'O', 1: 'B-MISC', 2: 'I-MISC', 3: 'B-PER', 4: 'I-PER', 5: 'B-ORG', 6: 'I-ORG', 7: 'B-LOC', 8: 'I-LOC'}


In [32]:
ner_string = "Did George Washington go to Washington? Will the real Slim Shady please stand up?"

In [33]:
outputs = ner(ner_string)
for entity in outputs:  # reuse the result instead of running the pipeline twice
    print(entity)

{'entity': 'B-PER', 'score': 0.99944705, 'index': 2, 'word': 'George', 'start': 4, 'end': 10}
{'entity': 'I-PER', 'score': 0.99628437, 'index': 3, 'word': 'Washington', 'start': 11, 'end': 21}
{'entity': 'B-LOC', 'score': 0.99970907, 'index': 6, 'word': 'Washington', 'start': 28, 'end': 38}
{'entity': 'B-PER', 'score': 0.9991824, 'index': 11, 'word': 'Slim', 'start': 54, 'end': 58}
{'entity': 'I-PER', 'score': 0.9974842, 'index': 12, 'word': 'S', 'start': 59, 'end': 60}
{'entity': 'I-PER', 'score': 0.9952407, 'index': 13, 'word': '##hady', 'start': 60, 'end': 64}
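
The pipeline returns one prediction per token, so `Shady` arrives as `S` + `##hady` and each name is spread over several rows. The sketch below merges B-/I- tagged tokens into whole entities. It is a simplified illustration (the helper `group_entities` is hypothetical and ignores scores and gaps between tokens); the pipeline also accepts an `aggregation_strategy` argument that performs this kind of grouping for you.

```python
def group_entities(token_preds):
    """Merge B-/I- tagged tokens (and '##' word pieces) into entity spans."""
    entities = []
    for pred in token_preds:
        prefix, _, label = pred["entity"].partition("-")
        word = pred["word"]
        # An I- tag continues the previous entity only if the label matches.
        continues = bool(entities) and prefix == "I" and entities[-1]["label"] == label
        if continues:
            last = entities[-1]
            if word.startswith("##"):
                last["text"] += word[2:]   # glue word pieces back together
            else:
                last["text"] += " " + word
            last["end"] = pred["end"]
        else:
            entities.append({"label": label, "text": word,
                             "start": pred["start"], "end": pred["end"]})
    return entities


# Token-level predictions copied from the output above (scores omitted).
token_preds = [
    {"entity": "B-PER", "word": "George", "start": 4, "end": 10},
    {"entity": "I-PER", "word": "Washington", "start": 11, "end": 21},
    {"entity": "B-LOC", "word": "Washington", "start": 28, "end": 38},
    {"entity": "B-PER", "word": "Slim", "start": 54, "end": 58},
    {"entity": "I-PER", "word": "S", "start": 59, "end": 60},
    {"entity": "I-PER", "word": "##hady", "start": 60, "end": 64},
]

grouped = group_entities(token_preds)
for entity in grouped:
    print(entity)
```

The character offsets of each merged entity can be used to slice the original string and recover the surface form exactly as written.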


## What are you looking at, BERT?

As we saw, BERT was pretrained with masked language modeling (_aka_ [fill-mask](https://huggingface.co/tasks/fill-mask) in HF).

Let's see how well it does at that task.


In [34]:
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
bert = AutoModelForMaskedLM.from_pretrained(
    "bert-base-cased", output_attentions=True, output_hidden_states=True)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [35]:
print(bert_tokenizer.mask_token)

[MASK]


In [36]:
input_mlm = [
    "These [MASK] are making me thirsty!",
    "These pretzels are making me [MASK]!",
    "I am [MASK]! I am without speech.",
    "[MASK] more soup for you! NEXT!",
    "I'm a [MASK] industrialist and philanthropist and a bicyclist."
]    

In [37]:
def predict_mask(input_str):
    """Take the long route instead of using a pipeline."""
    inputs = bert_tokenizer(input_str, return_tensors="pt")
    # (row, column) positions of the [MASK] token in the batch
    mask_index = torch.where(inputs['input_ids'] == bert_tokenizer.mask_token_id)
    # .eval() switches dropout layers to evaluation mode
    bert.eval()
    with torch.no_grad():  # no gradients needed for inference
        outputs = bert(**inputs)
    top_5_predictions = torch.softmax(outputs.logits[mask_index], dim=1).topk(5)
    for i in range(5):
        token = bert_tokenizer.decode(top_5_predictions.indices[0, i])
        prob = top_5_predictions.values[0, i]
        print(f" {i+1}) {token:<20} {prob:.3f}")

In [38]:
for x in input_mlm:
    print(x)
    predict_mask(x)  # prints its results and returns nothing
    print("-"*70)

These [MASK] are making me thirsty!
 1) things               0.227
 2) people               0.098
 3) guys                 0.034
 4) men                  0.033
 5) thoughts             0.025
----------------------------------------------------------------------
These pretzels are making me [MASK]!
 1) sick                 0.579
 2) crazy                0.102
 3) mad                  0.050
 4) nervous              0.022
 5) cry                  0.017
----------------------------------------------------------------------
I am [MASK]! I am without speech.
 1) not                  0.042
 2) nothing              0.033
 3) alone                0.031
 4) crying               0.027
 5) deaf                 0.026
----------------------------------------------------------------------
[MASK] more soup for you! NEXT!
 1) No                   0.287
 2) Get                  0.109
 3) Have                 0.075
 4) Make                 0.056
 5) Some                 0.051
----------------------------

We can also analyze the hidden states and the attention scores!

[BertViz](https://github.com/jessevig/bertviz) is very handy for this, but we can also do it by hand.

In [39]:
print(bert)

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_a

In [40]:
# we can inspect all the model weights with:
state_dict = bert.state_dict()
list(state_dict.keys())[:5]

# or with named_parameters()

['bert.embeddings.position_ids',
 'bert.embeddings.word_embeddings.weight',
 'bert.embeddings.position_embeddings.weight',
 'bert.embeddings.token_type_embeddings.weight',
 'bert.embeddings.LayerNorm.weight']

In [41]:
print("Token embeddings tensor shape:")
print(state_dict["bert.embeddings.word_embeddings.weight"].shape)
print("Position embeddings tensor shape:")
print(state_dict["bert.embeddings.position_embeddings.weight"].shape)

Token embeddings tensor shape:
torch.Size([28996, 768])
Position embeddings tensor shape:
torch.Size([512, 768])


In [42]:
input_str = '"I voted for Obama because he was most aligned with my values", Mary said.'

In [43]:
model_inputs = bert_tokenizer(input_str, return_tensors="pt")
bert.eval()
with torch.no_grad():
    model_output = bert(**model_inputs)

In [44]:
print(f"# hidden states = {len(model_output.hidden_states)}")
# initial embeddings + 12 transf. blocks

# hidden states = 13


In [45]:
print("Size of each hidden state:")
print(model_output.hidden_states[0].shape) # (bsz, tokens, dim)

Size of each hidden state:
torch.Size([1, 20, 768])


In [46]:
print("Size of each attention tensor:") 
print(model_output.attentions[0].shape) # (bsz, head, query_word, key_word)

Size of each attention tensor:
torch.Size([1, 12, 20, 20])


Let's see how to extract contextual word embeddings (CWE) without using the [HF feature extractor](https://huggingface.co/tasks/feature-extraction).

In [47]:
print(type(model_output.hidden_states))
print(model_output.hidden_states[0].shape)

<class 'tuple'>
torch.Size([1, 20, 768])


In [50]:
def get_cwes(model_output):
    """Contextual embeddings as the sum of the last 4 layers."""
    # stack the 13 hidden states into a single tensor
    embeddings = torch.stack(model_output.hidden_states, dim=0)
    #print(embeddings.shape)
    # drop the batch dimension:
    embeddings = torch.squeeze(embeddings, dim=1)
    #print(embeddings.shape)
    # sum the last 4 layers
    embeddings = embeddings[-4:].sum(dim=0)
    #print(embeddings.shape)
    return embeddings

def extract_bert_cwe(input_str, target_word):
    """Extract BERT CWE of a specific token in input_str
    """
    model_inputs = bert_tokenizer(input_str, return_tensors="pt")
    target_position = model_inputs.tokens().index(target_word)
    bert.eval()
    with torch.no_grad():
        model_output = bert(**model_inputs)
    embedding = get_cwes(model_output)[target_position]
    return embedding

In [51]:
input_strings = [
    '"I voted for Obama because he was most aligned with my values", Mary said.',
    'Find the values of x and y in x+y=8',
    'I believe in the values of liberal democracy.',
]

In [52]:
target_embeddings = []
for input_ in input_strings:
    emb_ = extract_bert_cwe(input_, "values")
    target_embeddings.append(emb_)

In [53]:
cos_ = torch.cosine_similarity(target_embeddings[0], target_embeddings[1], dim=0).item()
print('Cosine sim. between "values" in')
print(f"  {input_strings[0]}")
print(f"  {input_strings[1]}")
print(f"{cos_:.4f}")

Cosine sim. between "values" in
  "I voted for Obama because he was most aligned with my values", Mary said.
  Find the values of x and y in x+y=8
0.7382


In [54]:
cos_ = torch.cosine_similarity(target_embeddings[0], target_embeddings[2], dim=0).item()
print('Cosine sim. between "values" in')
print(f"  {input_strings[0]}")
print(f"  {input_strings[2]}")
print(f"{cos_:.4f}")

Cosine sim. between "values" in
  "I voted for Obama because he was most aligned with my values", Mary said.
  I believe in the values of liberal democracy.
0.9002
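
Cosine similarity measures the angle between two embedding vectors and ignores their length, which is why it suits comparisons of contextual embeddings like the two above. A minimal standard-library sketch of the formula follows; the three short vectors are made up for illustration and are not BERT embeddings.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors (hypothetical values).
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, twice as long
c = [0.0, 0.0, 3.0]   # orthogonal to a

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(sim_ab, sim_ac)
```

Vectors pointing the same way score close to 1 and orthogonal vectors score 0. Read this way, the 0.9002 for the two "moral values" sentences against 0.7382 for the algebra sentence suggests that the embedding of "values" shifts with the sense in which the word is used.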


According to [What Does BERT Look At? (Clark et al., 2019)](https://arxiv.org/abs/1906.04341), coreference tends to be captured by specific attention heads; the paper highlights head 5-4 (layer 5, head 4).


In [None]:
# attention from one token (left) to another (right)
tokens = bert_tokenizer.convert_ids_to_tokens(model_inputs.input_ids[0])
head_view(model_output.attentions, tokens)

In [None]:
# BertViz show() is great, but it does not work well with every model
# see https://colab.research.google.com/drive/1hXIQ77A4TYS4y3UthWF-Ci7V7vVUoxmQ?usp=sharing#scrollTo=-QnRteSLP0Hm

## References

General:

* [HuggingFace Docs](https://huggingface.co/docs/transformers/index)
* [HuggingFace Course](https://huggingface.co/course/)
* [HuggingFace Book](https://transformersbook.com/) (Tunstall et al., 2022)

Specific:

* HuggingFace tutorial from [Stanford CS224n](http://web.stanford.edu/class/cs224n/)
* [Train your own tokenizer](https://huggingface.co/docs/tokenizers/quicktour)
* [Load your own dataset](https://huggingface.co/docs/datasets/loading)
* [Streaming large datasets](https://huggingface.co/course/chapter5/4?fw=pt)
* [HF pipeline overview](https://huggingface.co/course/chapter2/2?fw=pt)
 