<a href="https://colab.research.google.com/github/LCaravaggio/NLP/blob/main/09_Transformers/SequenceClf_FeatureExtraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Transfer Learning

We will use BERT as a feature extractor to solve a classification problem. The BERT weights are never updated here; only the classifier on top is trained.

Once we have a vector representation of the input sequence, we train a classifier on it and use that classifier to predict on new data.

In [None]:
!pip install transformers datasets watermark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting watermark
  Downloading watermark-2.4.2-py2.py3-none-any.whl (7.5 kB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)

In [None]:
import numpy as np
import pandas as pd
import torch
import datasets
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer, AutoModel
from IPython.display import display, HTML
from sklearn.linear_model import LogisticRegression

In [None]:
%reload_ext watermark

In [None]:
%watermark -vp torch,transformers,datasets,sklearn

Python implementation: CPython
Python version       : 3.10.11
IPython version      : 7.34.0

torch       : 2.0.1+cu118
transformers: 4.29.2
datasets    : 2.12.0
sklearn     : 1.2.2



In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


## Dataset

We will solve one of the GLUE tasks:

[CoLA](https://nyu-mll.github.io/CoLA/) (Corpus of Linguistic Acceptability). The goal is to determine whether a sentence is grammatically acceptable (1) or not (0).

In [None]:
full_dataset = load_dataset("glue", "cola")


In [None]:
full_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

In [None]:
full_dataset["train"].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['unacceptable', 'acceptable'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [None]:
def show_random_elements(dataset, num_examples=10):
    # draw distinct random row indices
    picks = []
    for _ in range(num_examples):
        # randint's upper bound is exclusive, so len(dataset) covers every row
        pick = np.random.randint(0, len(dataset))
        while pick in picks:
            pick = np.random.randint(0, len(dataset))
        picks.append(pick)
    df = pd.DataFrame(dataset[picks])
    # replace integer class ids with their label names
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(full_dataset["train"], num_examples=10)

| | sentence | label | idx |
|---|---|---|---|
| 0 | The witch turned him from a prince. | unacceptable | 2294 |
| 1 | The more contented for us to pretend to be became possible, the more angry we grew at the doctors. | unacceptable | 1756 |
| 2 | Paula hit the fence with the stick. | acceptable | 2339 |
| 3 | None of these men wants to be president. | acceptable | 4178 |
| 4 | The more she looked at pictures, the angrier Mary got. | acceptable | 161 |
| 5 | Who did Gilgamesh believe to have kissed Aphrodite? | acceptable | 7873 |
| 6 | Lucy seems to have been mugged. | acceptable | 6213 |
| 7 | John promised Mary to cut the grass. | acceptable | 7654 |
| 8 | Sandra hates reading about herself in the tabloids. | acceptable | 6292 |
| 9 | Agamemnon seems to have left. | acceptable | 7771 |


In [None]:
print("Class distribution:")
for k in full_dataset.keys():
    print(k)
    print(pd.Series(full_dataset[k]["label"]).value_counts())
    print("-"*70)

Class distribution:
train
1    6023
0    2528
dtype: int64
----------------------------------------------------------------------
validation
1    721
0    322
dtype: int64
----------------------------------------------------------------------
test
-1    1063
dtype: int64
----------------------------------------------------------------------
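
The counts above show that roughly 70% of the labeled sentences are acceptable. As a rough illustration of why plain accuracy is a weak yardstick here, the sketch below computes the accuracy of a trivial classifier that always predicts the majority class. The helper `majority_baseline_accuracy` is a hypothetical name introduced for this example; the counts are copied from the output above.

```python
# Class counts copied from the output above (label 1 = acceptable).
counts = {
    "train": {1: 6023, 0: 2528},
    "validation": {1: 721, 0: 322},
}

def majority_baseline_accuracy(class_counts):
    """Accuracy of a classifier that always predicts the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

baseline = {split: majority_baseline_accuracy(c) for split, c in counts.items()}
for split, acc in baseline.items():
    print(f"{split}: {acc:.3f}")
```

Any model evaluated with accuracy would have to beat these numbers to be useful, which is one reason a class-balance-aware metric is preferable for this task.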


In [None]:
# test has no labels --> it is what gets submitted to the benchmark!
full_dataset["test"][:3]

{'sentence': ['Bill whistled past the house.',
  'The car honked its way down the road.',
  'Bill pushed Harry off the sofa.'],
 'label': [-1, -1, -1],
 'idx': [0, 1, 2]}

In [None]:
print("Sentence length:")
for k in full_dataset.keys():
    print(k)
    # sentence lengths in characters (not tokens)
    largos = pd.Series(full_dataset[k]["sentence"]).str.len()
    print(np.quantile(largos, q=np.arange(0, 1.1, .1)).astype(int))
    print("-"*70)

Sentence length:
train
[  6  21  26  30  33  37  41  46  52  65 231]
----------------------------------------------------------------------
validation
[  9  20  25  29  33  36  42  47  56  69 157]
----------------------------------------------------------------------
test
[  7  20  25  29  33  36  41  46  53  66 152]
----------------------------------------------------------------------


## Tokenization and feature extraction

We load a model without a head because we only need BERT to extract features from the text (the checkpoint used here is DistilBERT, a smaller distilled variant).

In [None]:
model_checkpoint = "distilbert-base-cased"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)


In [None]:
print("max length:", tokenizer.model_max_length)
print("Vocab size:", tokenizer.vocab_size)

max length: 512
Vocab size: 28996


In [None]:
# pad every sequence to the longest one in the batch and truncate at the model's max length
def tokenize_fn(examples):
    return tokenizer(examples["sentence"], truncation=True, padding=True, return_tensors="pt")

In [None]:
# batch_size=None processes each split as a single batch, so each split is padded to its own longest sequence
tokenized_dataset = full_dataset.map(tokenize_fn, batched=True, batch_size=None)


In [None]:
# map ignores tensor formatting while writing a cache file 
# --> convert to tensors and move them to the selected device (GPU if available)
tokenized_dataset.set_format(
    "torch", columns=["input_ids", "attention_mask", "label"], device=device)

In [None]:
tokenized_dataset["train"][:3]

{'label': tensor([1, 1, 1], device='cuda:0'),
 'input_ids': tensor([[  101,  3458,  2053,  1281,   112,   189,  4417,  1142,  3622,   117,
           1519,  2041,  1103,  1397,  1141,  1195, 17794,   119,   102,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0],
         [  101,  1448,  1167, 23563,  1704,  2734,  1105,   146,   112,   182,
           2368,  1146,   119,   102,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0],
         [  101,  1448,  1167, 23563,  1704,  2734,  1137,   146,   112,   182,
           2368,  1146,   119,   102,     0,     0,     0,     0,     0,     0,
              0,     

In [None]:
# each split was padded to the length of its own longest sequence:
for split, ds in tokenized_dataset.items():
    ejemplos = ds[:3]["input_ids"] 
    print(split)
    print([len(x) for x in ejemplos])

train
[47, 47, 47]
validation
[36, 36, 36]
test
[38, 38, 38]
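
The equal lengths within each split come from padding. As a minimal sketch of what padding and the attention mask look like, assuming right-padding with id 0 as in the tensors shown earlier, the hypothetical helper `pad_batch` below pads plain Python lists. It is an illustration only, not the tokenizer's actual implementation, and the token ids are made up.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to the length of the longest one in the batch.

    Returns the padded ids plus a mask with 1 for real tokens and 0 for padding.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids (101 and 102 stand in for [CLS] and [SEP]).
batch = pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The mask is what lets the model ignore the padded positions during attention.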


In [None]:
# del full_dataset

In [None]:
# AutoModel does not add any layer (head) on top of the model (body)
model = AutoModel.from_pretrained(model_checkpoint)
_ = model.to(device)


Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# run the forward pass in batches
batch_size = 10

In [None]:
# extract the CLS embedding on a test batch
batch_prueba = {
    "attention_mask": tokenized_dataset["train"][:batch_size]["attention_mask"],
    "input_ids": tokenized_dataset["train"][:batch_size]["input_ids"]
}
with torch.inference_mode(): # like no_grad() but better: https://pytorch.org/docs/stable/generated/torch.inference_mode.html
    output_prueba = model(**batch_prueba)
cls_token_output = output_prueba.last_hidden_state[:, 0]

print(output_prueba.last_hidden_state.shape)
print(cls_token_output.shape)

torch.Size([10, 47, 768])
torch.Size([10, 768])
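
The shapes above go from `(batch, seq_len, hidden)` to `(batch, hidden)` because only the vector at position 0 (the CLS token) is kept. The sketch below reproduces that slicing on a small made-up NumPy array and, as an alternative this notebook does not use, shows masked mean pooling over the real tokens. All values and sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for last_hidden_state: (batch, seq_len, hidden) with made-up values.
batch_size, seq_len, hidden = 2, 4, 3
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden))
# 1 = real token, 0 = padding; the second sequence has two padded positions.
attention_mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])

# CLS pooling: keep only the vector at position 0 of every sequence.
cls_embeddings = last_hidden_state[:, 0]

# Masked mean pooling: average over real tokens only.
mask = attention_mask[:, :, None]                # (batch, seq_len, 1)
summed = (last_hidden_state * mask).sum(axis=1)  # (batch, hidden)
mean_embeddings = summed / mask.sum(axis=1)      # divide by real-token counts

print(cls_embeddings.shape, mean_embeddings.shape)
```

Both strategies yield one fixed-size vector per sentence, so either could feed the downstream classifier; which one works better is an empirical question.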


In [None]:
def get_embeddings(examples):
    """Use the CLS embedding to represent each sequence
    """
    inputs = {key: tensor for key, tensor in examples.items() 
                                    if key in ['input_ids', 'attention_mask']}
    with torch.inference_mode():
        output = model(**inputs).last_hidden_state[:, 0]
    return {"features": output}

In [None]:
model.eval()
featurized_dataset = tokenized_dataset.map(
    get_embeddings, batched=True, batch_size=batch_size)


In [None]:
# features as np.ndarray on CPU
featurized_dataset.set_format("np", columns=["features"])

In [None]:
featurized_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask', 'features'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask', 'features'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask', 'features'],
        num_rows: 1063
    })
})

In [None]:
featurized_dataset["train"]["features"], featurized_dataset["train"]["features"].shape

(array([[ 0.37860557,  0.02673244, -0.11231051, ..., -0.07937334,
          0.07191737, -0.00513747],
        [ 0.48164803,  0.10374558,  0.17797749, ..., -0.14352037,
          0.22598322,  0.09944223],
        [ 0.49786124,  0.08733833,  0.1604664 , ..., -0.13326415,
          0.24070376,  0.13152146],
        ...,
        [ 0.45239744, -0.03763199,  0.00905416, ..., -0.14944486,
          0.18974587,  0.03487243],
        [ 0.41455457,  0.11551192,  0.02540738, ..., -0.28557944,
          0.15668266, -0.18898442],
        [ 0.32122087,  0.21777813, -0.11808535, ..., -0.0501947 ,
          0.06539746, -0.0524657 ]], dtype=float32),
 (8551, 768))

In [None]:
# use numpy arrays to train/evaluate the model
X_train = np.array(featurized_dataset["train"]["features"])
y_train = np.array(featurized_dataset["train"]["label"])

X_val = np.array(featurized_dataset["validation"]["features"])
y_val = np.array(featurized_dataset["validation"]["label"])

X_test = np.array(featurized_dataset["test"]["features"])
y_test = np.array(featurized_dataset["test"]["label"])

## Model

Trained on the BERT embeddings extracted above.

We will also do _error analysis_: inspect the validation examples on which the model is most confidently wrong.

In [None]:
mod = LogisticRegression(max_iter=1000)
mod.fit(X_train, y_train)

In [None]:
metric = load_metric('glue', "cola") # matthews corr coefficient


In [None]:
scores_train = mod.predict_proba(X_train)[:, 1]
pred_train = scores_train.round() # threshold at 0.5, same as argmax over the two classes (not ideal)
metric.compute(predictions=pred_train, references=y_train)

{'matthews_correlation': 0.3900163712436295}
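
The metric reported here is the Matthews correlation coefficient (MCC), computed from the four cells of the binary confusion matrix. As a minimal sketch, the hypothetical helper `mcc` below implements the standard formula in plain Python on made-up labels; returning 0 when the denominator vanishes is an assumed convention, not something taken from this notebook.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when a confusion-matrix row or column is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0

y_true = [1, 1, 1, 0, 0, 1]            # made-up labels
always_acceptable = [1] * len(y_true)  # constant predictor
some_model = [1, 1, 0, 0, 1, 1]        # made-up predictions

print(mcc(y_true, always_acceptable))  # 0.0
print(mcc(y_true, some_model))         # 0.25
```

A constant predictor scores 0 regardless of class balance, while a perfect one scores 1, which makes the metric easier to interpret than accuracy on an imbalanced dataset.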

In [None]:
scores_val = mod.predict_proba(X_val)[:, 1]
pred_val = scores_val.round()
metric.compute(predictions=pred_val, references=y_val)

{'matthews_correlation': 0.2741631371163882}
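
Rounding the scores amounts to a fixed 0.5 threshold. One possible alternative, sketched here with made-up labels and scores, is to scan a grid of candidate thresholds and keep the one with the highest MCC. The helpers `mcc` and `best_threshold` are hypothetical names for this example, and the approach is an illustration rather than something the notebook does.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 if undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_threshold(y_true, scores, candidates):
    """Return (threshold, mcc) with the highest MCC; ties keep the first one."""
    best_thr, best_value = None, -2.0  # MCC is always >= -1
    for thr in candidates:
        preds = [1 if s >= thr else 0 for s in scores]
        value = mcc(y_true, preds)
        if value > best_value:
            best_thr, best_value = thr, value
    return best_thr, best_value

# Made-up labels and positive-class scores.
y_true = [1, 1, 1, 1, 0, 0]
scores = [0.92, 0.81, 0.77, 0.71, 0.66, 0.31]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

default_mcc = mcc(y_true, [1 if s >= 0.5 else 0 for s in scores])
thr, tuned_mcc = best_threshold(y_true, scores, candidates)
print(f"0.5 threshold: {default_mcc:.3f} | best threshold {thr:.2f}: {tuned_mcc:.3f}")
```

A threshold tuned this way should be chosen on one split and evaluated on another; otherwise the reported score is optimistic.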

In [None]:
df_val = pd.DataFrame({"y": y_val, "score": scores_val, "idx": featurized_dataset["validation"]["idx"]})

In [None]:
# worst false positives (y=0 --> unacceptable)
top_fp = df_val.query("y == 0").sort_values("score", ascending=False).head(5)
top_fp

| | y | score | idx |
|---|---|---|---|
| 674 | 0 | 0.969696 | 674 |
| 78 | 0 | 0.953569 | 78 |
| 202 | 0 | 0.946012 | 202 |
| 218 | 0 | 0.932591 | 218 |
| 659 | 0 | 0.931256 | 659 |


In [None]:
featurized_dataset["validation"].select(top_fp["idx"])["sentence"]

["Gould's performance of Bach on the piano doesn't please me anywhere as much as Ross's on the harpsichord.",
 'Drowning cats, which is against the law, are hard to rescue.',
 'My heart is pounding me.',
 'John offers many advice.',
 'Millie will send the President an obscene telegram, and Paul, the Secretary a rude letter.']

In [None]:
# worst false negatives (y=1 --> acceptable)
top_fn = df_val.query("y == 1").sort_values("score", ascending=True).head(5)
top_fn

| | y | score | idx |
|---|---|---|---|
| 995 | 1 | 0.154073 | 995 |
| 398 | 1 | 0.154372 | 398 |
| 692 | 1 | 0.185511 | 692 |
| 332 | 1 | 0.194302 | 332 |
| 407 | 1 | 0.201128 | 407 |


In [None]:
featurized_dataset["validation"].select(top_fn["idx"])["sentence"]

['John counted on Bill to get there on time.',
 'The man who Mary loves and Sally hates computed my tax.',
 "This is the senator to whose mother's friend's sister's I sent the letter.",
 'With no job would John be happy.',
 'She asked was Alison coming to the party.']

## References

* [rasbt's notebooks](https://github.com/rasbt/deeplearning-models#transformers)
* [HuggingFace notebooks](https://huggingface.co/docs/transformers/notebooks)