<a href="https://colab.research.google.com/github/LCaravaggio/NLP/blob/main/09_Transformers/SequenceClf_FeatureExtraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Transfer Learning

We will use BERT as a feature extractor to solve a classification problem.

Once we have a vector representation of each input sequence, we train a classifier on those vectors and use it to predict on new data.

In [1]:
!pip install transformers datasets watermark

Collecting transformers
  Downloading transformers-4.34.1-py3-none-any.whl (7.7 MB)
Collecting datasets
  Downloading datasets-2.14.6-py3-none-any.whl (493 kB)
Collecting watermark
  Downloading watermark-2.4.3-py2.py3-none-any.whl (7.6 kB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)

In [2]:
import numpy as np
import pandas as pd
import torch
import datasets
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer, AutoModel
from IPython.display import display, HTML
from sklearn.linear_model import LogisticRegression

In [3]:
%reload_ext watermark

In [4]:
%watermark -vp torch,transformers,datasets,sklearn

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 7.34.0

torch       : 2.1.0+cu118
transformers: 4.34.1
datasets    : 2.14.6
sklearn     : 1.2.2



In [5]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


## Dataset

We will solve one of the GLUE tasks:

[CoLA](https://nyu-mll.github.io/CoLA/) (Corpus of Linguistic Acceptability). The goal is to determine whether a sentence is grammatically acceptable (1) or not (0).

In [6]:
full_dataset = load_dataset("glue", "cola")

In [7]:
full_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

In [8]:
full_dataset["train"].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['unacceptable', 'acceptable'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [9]:
def show_random_elements(dataset, num_examples=10):
    # Draw num_examples distinct row indices uniformly at random
    picks = np.random.choice(len(dataset), size=num_examples, replace=False).tolist()
    df = pd.DataFrame(dataset[picks])
    # Replace integer class ids with their human-readable names
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(full_dataset["train"], num_examples=10)

| | sentence | label | idx |
|---|---|---|---|
| 0 | Which poem did you go to hear a recital of last night? | acceptable | 7877 |
| 1 | Doug cleared at the table of dishes. | unacceptable | 2631 |
| 2 | There are likely to be no student absent. | unacceptable | 4358 |
| 3 | The money that you gave me disappeared last night. | acceptable | 4200 |
| 4 | I wonder you ate how much. | unacceptable | 244 |
| 5 | When in came Aunt Norris, Fanny stopped talking. | unacceptable | 6762 |
| 6 | Bob proved this set is recursive. | acceptable | 1236 |
| 7 | I read every his book. | unacceptable | 1001 |
| 8 | The boys swim. | unacceptable | 4149 |
| 9 | It is beans that I don't like. | acceptable | 1503 |


In [10]:
print("Class distribution:")
for k in full_dataset.keys():
    print(k)
    print(pd.Series(full_dataset[k]["label"]).value_counts())
    print("-"*70)


Class distribution:
train
1    6023
0    2528
dtype: int64
----------------------------------------------------------------------
validation
1    721
0    322
dtype: int64
----------------------------------------------------------------------
test
-1    1063
dtype: int64
----------------------------------------------------------------------
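
The counts above show a split of roughly 70/30 between acceptable and unacceptable sentences, so accuracy alone can be misleading on CoLA. As a minimal sketch, using only the counts printed above, this is what a classifier that always predicts `acceptable` would score (the helper name `majority_baseline_accuracy` is hypothetical):

```python
# Class counts copied from the output above: label -> number of examples.
counts = {
    "train": {1: 6023, 0: 2528},
    "validation": {1: 721, 0: 322},
}

def majority_baseline_accuracy(label_counts):
    """Accuracy obtained by always predicting the most frequent label."""
    total = sum(label_counts.values())
    return max(label_counts.values()) / total

baseline = {split: majority_baseline_accuracy(c) for split, c in counts.items()}
for split, acc in baseline.items():
    print(f"{split}: {acc:.3f}")
```

A constant predictor like this reaches about 0.70 accuracy on train and 0.69 on validation, yet its Matthews correlation is 0. That is a good reason to evaluate this task with MCC rather than accuracy.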


In [11]:
# the test split has no labels (all are -1) --> why?
full_dataset["test"][:3]

{'sentence': ['Bill whistled past the house.',
  'The car honked its way down the road.',
  'Bill pushed Harry off the sofa.'],
 'label': [-1, -1, -1],
 'idx': [0, 1, 2]}

In [12]:
print("Sentence length:")
for k in full_dataset.keys():
    print(k)
    largos = pd.Series(full_dataset[k]["sentence"]).str.len()
    print(np.quantile(largos, q=np.arange(0, 1.1, .1)).astype(int))
    print("-"*70)

Sentence length:
train
[  6  21  26  30  33  37  41  46  52  65 231]
----------------------------------------------------------------------
validation
[  9  20  25  29  33  36  42  47  56  69 157]
----------------------------------------------------------------------
test
[  7  20  25  29  33  36  41  46  53  66 152]
----------------------------------------------------------------------


## Tokenization and feature extraction

We load a model without a head, because we only need the BERT body (here a DistilBERT checkpoint) to extract features from the text.

In [13]:
model_checkpoint = "distilbert-base-cased"

In [15]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [16]:
print("max length:", tokenizer.model_max_length)
print("Vocab size:", tokenizer.vocab_size)

max length: 512
Vocab size: 28996


In [17]:
# When applied, this pads every example to the longest sequence in the batch
# and truncates anything longer than the model's maximum length
def tokenize_fn(examples):
    return tokenizer(examples["sentence"], truncation=True, padding=True, return_tensors="pt")
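
To make the effect of `truncation=True, padding=True` concrete, here is a simplified pure-Python sketch of the idea. It is not the tokenizer's actual implementation: the function name `pad_and_truncate` and the toy ids are hypothetical, and a real tokenizer also keeps the final `[SEP]` token when it truncates, which this sketch does not.

```python
def pad_and_truncate(batch_ids, max_length=512, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy batch of two already-tokenized sentences (hypothetical ids).
toy_batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
out = pad_and_truncate(toy_batch)
print(out["input_ids"])
print(out["attention_mask"])
```

The shorter sequence gets two padding ids and two zeros in its attention mask, so both rows end up with the length of the longest sequence in the batch.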

In [25]:
# We apply it with one batch per split (train, val, test), i.e. train is one big batch
# So every example ends up with length = max length of its split
# We do this because we only run inference, we do not train
tokenized_dataset = full_dataset.map(tokenize_fn, batched=True, batch_size=None)

In [26]:
# map ignores tensor formatting while writing a cache file
# --> we convert to tensors and move them to the device (GPU if available)
tokenized_dataset.set_format(
    "torch", columns=["input_ids", "attention_mask", "label"], device=device)

In [27]:
tokenized_dataset["train"][0]

{'label': tensor(1, device='cuda:0'),
 'input_ids': tensor([  101,  3458,  2053,  1281,   112,   189,  4417,  1142,  3622,   117,
          1519,  2041,  1103,  1397,  1141,  1195, 17794,   119,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0], device='cuda:0'),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        device='cuda:0')}

In [28]:
tokenized_dataset["train"][:3]

{'label': tensor([1, 1, 1], device='cuda:0'),
 'input_ids': tensor([[  101,  3458,  2053,  1281,   112,   189,  4417,  1142,  3622,   117,
           1519,  2041,  1103,  1397,  1141,  1195, 17794,   119,   102,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0],
         [  101,  1448,  1167, 23563,  1704,  2734,  1105,   146,   112,   182,
           2368,  1146,   119,   102,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0],
         [  101,  1448,  1167, 23563,  1704,  2734,  1137,   146,   112,   182,
           2368,  1146,   119,   102,     0,     0,     0,     0,     0,     0,
              0,     

In [29]:
# every example is already padded to the maximum length of its split (train/val/test):
for split, ds in tokenized_dataset.items():
    ejemplos = ds[:3]["input_ids"]
    print(split)
    print([len(x) for x in ejemplos])

train
[47, 47, 47]
validation
[36, 36, 36]
test
[38, 38, 38]


In [30]:
# del full_dataset

In [31]:
# plain AutoModel adds no extra layer (head) on top of the model (body)
model = AutoModel.from_pretrained(model_checkpoint)
_ = model.to(device)

In [32]:
# we run the forward pass in batches
batch_size = 10

In [33]:
# We represent each input with its CLS embedding
# --> extract the CLS embedding for a trial batch
batch_prueba = {
    "attention_mask": tokenized_dataset["train"][:batch_size]["attention_mask"],
    "input_ids": tokenized_dataset["train"][:batch_size]["input_ids"]
}
with torch.inference_mode(): # like no_grad() but with less overhead https://pytorch.org/docs/stable/generated/torch.inference_mode.html
    output_prueba = model(**batch_prueba)
cls_token_output = output_prueba.last_hidden_state[:, 0]

print(output_prueba.last_hidden_state.shape)
print(cls_token_output.shape)

torch.Size([10, 47, 768])
torch.Size([10, 768])
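
The slice `last_hidden_state[:, 0]` keeps only the vector at the first position (the CLS token) of each sequence. A small NumPy sketch of that indexing is shown below, alongside masked mean pooling, a common alternative that this notebook does not use. The array sizes and values are toy assumptions, not the real `(10, 47, 768)` tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden = 2, 4, 3           # toy sizes for illustration only
last_hidden_state = rng.normal(size=(batch, seq_len, hidden))
attention_mask = np.array([[1, 1, 1, 0],   # last position is padding
                           [1, 1, 0, 0]])  # last two positions are padding

# CLS pooling: take the vector at position 0 of every sequence.
cls_embedding = last_hidden_state[:, 0]

# Masked mean pooling: average only over real (non-padding) tokens.
mask = attention_mask[:, :, None]          # shape (batch, seq_len, 1)
mean_embedding = (last_hidden_state * mask).sum(axis=1) / mask.sum(axis=1)

print(cls_embedding.shape, mean_embedding.shape)
```

Both strategies produce one vector of size `hidden` per sequence. Which one works better as input to the downstream classifier is an empirical question.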


In [34]:
def get_embeddings(examples):
    """Use the CLS embedding to represent each sequence
    """
    inputs = {key: tensor for key, tensor in examples.items()
                                    if key in ['input_ids', 'attention_mask']}
    with torch.inference_mode():
        output = model(**inputs).last_hidden_state[:, 0]
    return {"features": output}

In [35]:
model.eval()
featurized_dataset = tokenized_dataset.map(
    get_embeddings, batched=True, batch_size=batch_size)

In [36]:
# features as np.ndarray on CPU
featurized_dataset.set_format("np", columns=["features"])

In [37]:
featurized_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask', 'features'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask', 'features'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask', 'features'],
        num_rows: 1063
    })
})

In [38]:
featurized_dataset["train"]["features"], featurized_dataset["train"]["features"].shape

(array([[ 0.37860557,  0.02673244, -0.11231051, ..., -0.07937334,
          0.07191737, -0.00513747],
        [ 0.48164803,  0.10374558,  0.17797749, ..., -0.14352037,
          0.22598322,  0.09944223],
        [ 0.49786124,  0.08733833,  0.1604664 , ..., -0.13326415,
          0.24070376,  0.13152146],
        ...,
        [ 0.45239744, -0.03763199,  0.00905416, ..., -0.14944486,
          0.18974587,  0.03487243],
        [ 0.41455457,  0.11551192,  0.02540738, ..., -0.28557944,
          0.15668266, -0.18898442],
        [ 0.32122087,  0.21777813, -0.11808535, ..., -0.0501947 ,
          0.06539746, -0.0524657 ]], dtype=float32),
 (8551, 768))

In [39]:
# we use numpy arrays to train/evaluate the model
X_train = np.array(featurized_dataset["train"]["features"])
y_train = np.array(featurized_dataset["train"]["label"])

X_val = np.array(featurized_dataset["validation"]["features"])
y_val = np.array(featurized_dataset["validation"]["label"])

X_test = np.array(featurized_dataset["test"]["features"])
y_test = np.array(featurized_dataset["test"]["label"])

# Review -- why would we use three sets?

## Model

We train a logistic regression on the BERT embeddings extracted above.

We will then do _error analysis_ (inspecting the validation examples on which the model is most confidently wrong).

In [40]:
mod = LogisticRegression(max_iter=1000)
mod.fit(X_train, y_train)
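
For a binary problem, a fitted logistic regression scores an example by passing a linear function of its features through a sigmoid. Below is a minimal sketch with made-up numbers: the values of `w`, `b` and `x` are hypothetical, and in this notebook they would correspond to `mod.coef_[0]`, `mod.intercept_[0]` and one 768-dimensional row of `X_train`.

```python
import math

def predict_proba_positive(x, w, b):
    """P(y=1 | x) = sigmoid(w . x + b) for a binary logistic regression."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

w = [0.5, -1.0, 2.0]   # hypothetical coefficients
b = -0.25              # hypothetical intercept
x = [1.0, 0.5, 0.25]   # hypothetical feature vector

p = predict_proba_positive(x, w, b)
print(round(p, 3))
```

This probability is what the second column of `predict_proba` holds for each example, and what the later cells store as `scores_train` and `scores_val`.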

In [41]:
metric = load_metric('glue', "cola") # matthews corr coefficient

In [42]:
scores_train = mod.predict_proba(X_train)[:, 1]
pred_train = scores_train.round() # classify at a 0.5 threshold, i.e. argmax over the two classes (not ideal)
metric.compute(predictions=pred_train, references=y_train)

{'matthews_correlation': 0.3900163712436295}

In [43]:
scores_val = mod.predict_proba(X_val)[:, 1]
pred_val = scores_val.round()
metric.compute(predictions=pred_val, references=y_val)

{'matthews_correlation': 0.2741631371163882}
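
Rounding the scores fixes the decision threshold at 0.5, which is not necessarily the value that maximizes the Matthews correlation on imbalanced data. The sketch below computes MCC from the binary confusion matrix and sweeps a few candidate thresholds. The helper names `mcc_binary` and `best_threshold`, and the toy labels and scores, are hypothetical and do not come from the notebook.

```python
import math

def mcc_binary(y_true, y_pred):
    """MCC from the binary confusion matrix; 0.0 when the denominator is zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_threshold(y_true, scores, thresholds):
    """Return the (threshold, mcc) pair with the highest MCC on the given data."""
    results = []
    for thr in thresholds:
        preds = [int(s >= thr) for s in scores]
        results.append((thr, mcc_binary(y_true, preds)))
    return max(results, key=lambda pair: pair[1])

# Hypothetical toy labels and scores.
y_toy = [0, 0, 1, 1, 1, 0]
scores_toy = [0.55, 0.40, 0.90, 0.70, 0.65, 0.60]

thr, mcc = best_threshold(y_toy, scores_toy, [0.3, 0.5, 0.62, 0.8])
print(thr, mcc)
```

If a threshold is tuned this way on the validation split, the resulting validation score is optimistic, so the final number should be reported on data that was not used for the tuning.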

In [44]:
df_val = pd.DataFrame({"y": y_val, "score": scores_val, "idx": featurized_dataset["validation"]["idx"]})

In [45]:
# most egregious false positives (y=0 --> unacceptable)
top_fp = df_val.query("y == 0").sort_values("score", ascending=False).head(5)
top_fp

| | y | score | idx |
|---|---|---|---|
| 674 | 0 | 0.969696 | 674 |
| 78 | 0 | 0.953569 | 78 |
| 202 | 0 | 0.946012 | 202 |
| 218 | 0 | 0.932591 | 218 |
| 659 | 0 | 0.931256 | 659 |


In [46]:
featurized_dataset["validation"].select(top_fp["idx"])["sentence"]

["Gould's performance of Bach on the piano doesn't please me anywhere as much as Ross's on the harpsichord.",
 'Drowning cats, which is against the law, are hard to rescue.',
 'My heart is pounding me.',
 'John offers many advice.',
 'Millie will send the President an obscene telegram, and Paul, the Secretary a rude letter.']

In [47]:
# most egregious false negatives (y=1 --> acceptable)
top_fn = df_val.query("y == 1").sort_values("score", ascending=True).head(5)
top_fn

| | y | score | idx |
|---|---|---|---|
| 995 | 1 | 0.154073 | 995 |
| 398 | 1 | 0.154372 | 398 |
| 692 | 1 | 0.185511 | 692 |
| 332 | 1 | 0.194302 | 332 |
| 407 | 1 | 0.201128 | 407 |


In [48]:
featurized_dataset["validation"].select(top_fn["idx"])["sentence"]

['John counted on Bill to get there on time.',
 'The man who Mary loves and Sally hates computed my tax.',
 "This is the senator to whose mother's friend's sister's I sent the letter.",
 'With no job would John be happy.',
 'She asked was Alison coming to the party.']

## References

* [rasbt's notebooks](https://github.com/rasbt/deeplearning-models#transformers)
* [HuggingFace notebooks](https://huggingface.co/docs/transformers/notebooks)