<a href="https://colab.research.google.com/github/LCaravaggio/NLP/blob/main/09_Transformers/SequenceClf_FineTuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Transfer Learning

We will fine-tune a pre-trained BERT-family model (`distilbert-base-cased`) for sequence classification.

We will update only the weights of the last layers and keep the rest of the network frozen.

In [None]:
!pip install transformers==4.28.0 datasets watermark



In [None]:
import numpy as np
import pandas as pd
import torch
import datasets
from datasets import load_dataset, load_metric
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
)
from IPython.display import display, HTML
from sklearn.linear_model import LogisticRegression

In [None]:
%reload_ext watermark

In [None]:
%watermark -vp torch,transformers,datasets,sklearn

Python implementation: CPython
Python version       : 3.10.11
IPython version      : 7.34.0

torch       : 2.0.1+cu118
transformers: 4.28.0
datasets    : 2.12.0
sklearn     : 1.2.2



In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


## Dataset

We will tackle one of the GLUE tasks:

[CoLA](https://nyu-mll.github.io/CoLA/) (Corpus of Linguistic Acceptability). The goal is to determine whether a sentence is grammatically acceptable (1) or not (0).

In [None]:
full_dataset = load_dataset("glue", "cola")




In [None]:
full_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

In [None]:
def show_random_elements(dataset, num_examples=10):
    """Copied from the HF notebook :)
    """
    assert num_examples <= len(dataset), "Cannot pick more elements than the dataset contains."
    picks = []
    for _ in range(num_examples):
        # np.random.randint excludes the upper bound, so use len(dataset)
        pick = np.random.randint(0, len(dataset))
        while pick in picks:
            pick = np.random.randint(0, len(dataset))
        picks.append(pick)
    df = pd.DataFrame(dataset[picks])
    # replace integer class ids with their human-readable names
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(full_dataset["train"], num_examples=6)

|   | sentence | label | idx |
|---|---|---|---|
| 0 | The clown amused the children. | acceptable | 2990 |
| 1 | Once Janet left, Fred became much crazier. | acceptable | 227 |
| 2 | He talked to a girl about swimming. | acceptable | 3705 |
| 3 | I thought of the moon | acceptable | 8256 |
| 4 | Ray found the outcome frustrating. | acceptable | 5062 |
| 5 | The bear sniffs | acceptable | 8055 |


In [None]:
print("Class distribution:")
# the test split has no public labels, so every test label is -1
for k in full_dataset.keys():
    print(k)
    print(pd.Series(full_dataset[k]["label"]).value_counts())
    print("-"*70)

Class distribution:
train
1    6023
0    2528
dtype: int64
----------------------------------------------------------------------
validation
1    721
0    322
dtype: int64
----------------------------------------------------------------------
test
-1    1063
dtype: int64
----------------------------------------------------------------------
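
The classes are imbalanced, which matters when choosing a metric. As a rough illustration, the sketch below computes the accuracy of a trivial classifier that always predicts the most frequent label, using the counts printed above. `majority_baseline_accuracy` is a hypothetical helper written for this example, not part of any library.

```python
# Label counts copied from the output above (label 1 = acceptable).
counts = {
    "train": {1: 6023, 0: 2528},
    "validation": {1: 721, 0: 322},
}


def majority_baseline_accuracy(label_counts):
    """Accuracy of always predicting the most frequent label."""
    return max(label_counts.values()) / sum(label_counts.values())


for split, label_counts in counts.items():
    print(split, round(majority_baseline_accuracy(label_counts), 3))
```

A model has to beat roughly 0.69 accuracy on validation before it is doing anything useful, which is one reason to prefer a metric that accounts for class imbalance.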


In [None]:
print("Sentence length:")
for k in full_dataset.keys():
    print(k)
    largos = pd.Series(full_dataset[k]["sentence"]).str.len()
    print(np.quantile(largos, q=np.arange(0, 1.1, .1)).astype(int))
    print("-"*70)

Sentence length:
train
[  6  21  26  30  33  37  41  46  52  65 231]
----------------------------------------------------------------------
validation
[  9  20  25  29  33  36  42  47  56  69 157]
----------------------------------------------------------------------
test
[  7  20  25  29  33  36  41  46  53  66 152]
----------------------------------------------------------------------


## Tokenization and model

In [None]:
model_checkpoint = "distilbert-base-cased"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

In [None]:
def tokenize_fn(examples):
    """No padding here --> it is applied later, on each training batch
    """
    return tokenizer(examples["sentence"], truncation=True)

In [None]:
tokenize_fn(full_dataset['train'][:3])

{'input_ids': [[101, 3458, 2053, 1281, 112, 189, 4417, 1142, 3622, 117, 1519, 2041, 1103, 1397, 1141, 1195, 17794, 119, 102], [101, 1448, 1167, 23563, 1704, 2734, 1105, 146, 112, 182, 2368, 1146, 119, 102], [101, 1448, 1167, 23563, 1704, 2734, 1137, 146, 112, 182, 2368, 1146, 119, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
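
Because `tokenize_fn` leaves each sequence at its natural length (19, 14 and 14 tokens in the output above), every batch has to be padded to a common length before it reaches the model. The following is a minimal pure-Python sketch of that per-batch padding. `pad_batch` is a hypothetical helper, the token ids are made up, and the pad id of 0 is an assumption based on `padding_idx=0` in the model printout. In the notebook itself this step is delegated to the `Trainer`.

```python
def pad_batch(input_ids, pad_token_id=0):
    """Right-pad every sequence to the longest one in the batch.

    Returns the padded ids and the matching attention mask
    (1 = real token, 0 = padding).
    """
    max_len = max(len(ids) for ids in input_ids)
    padded, mask = [], []
    for ids in input_ids:
        n_pad = max_len - len(ids)
        padded.append(list(ids) + [pad_token_id] * n_pad)
        mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": padded, "attention_mask": mask}


# Toy batch with made-up token ids of lengths 5 and 3.
batch = pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps short batches short, which saves computation when most sentences are far shorter than the longest one.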

In [None]:
tokenized_dataset = full_dataset.map(tokenize_fn, batched=True, batch_size=32)



In [None]:
# map ignores tensor formatting while writing a cache file
tokenized_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

In [None]:
# del full_dataset

In [None]:
# model with a randomly initialized classification head
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)
model.to(device)

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifi

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

## Fine-tuning

We need to define a metric to evaluate the model on the validation set during training.

Because the best model is not necessarily the one from the final epoch, at the end of training we load the best saved checkpoint according to our validation metric.

We do not run a hyperparameter search (learning rate, L2 regularization, etc.). See [the HF notebook](https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb) for that.
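
The metric used in this notebook is the Matthews correlation coefficient (MCC). To make it concrete, here is a minimal pure-Python sketch for binary labels, computed from the confusion matrix. `mcc_binary` is a hypothetical helper for illustration only. The notebook itself relies on the library implementation.

```python
import math


def mcc_binary(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


y_true = [1, 1, 1, 0, 0, 1]
print(mcc_binary(y_true, y_true))                  # perfect predictions
print(mcc_binary(y_true, [1] * len(y_true)))       # always predict 1
print(mcc_binary(y_true, [1 - t for t in y_true])) # every prediction wrong
```

MCC ranges from -1 to 1, and a classifier that always predicts the majority class scores 0, so unlike accuracy it does not reward exploiting the class imbalance.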

In [None]:
# freeze all layers
for param in model.parameters():
    param.requires_grad = False

In [None]:
# unfreeze the last layers
for param in model.pre_classifier.parameters():
    param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True
# and the last transformer block:
for param in model.distilbert.transformer.layer[-1].parameters():
    param.requires_grad = True

# the more common approach is to fine-tune all layers (freeze none)
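
A quick way to check that freezing worked is to count how many parameters still require gradients. The sketch below does this on a small toy network rather than on DistilBERT, so it runs without downloading anything. The toy layers and the `count_parameters` helper are assumptions made for this illustration.

```python
import torch
from torch import nn

# Toy stand-in for "encoder body + classification head"; not the DistilBERT model.
toy = nn.Sequential(
    nn.Linear(8, 8),  # pretend encoder layer -> stays frozen
    nn.ReLU(),
    nn.Linear(8, 2),  # pretend classification head -> trainable
)

# Freeze everything, then unfreeze only the last layer.
for param in toy.parameters():
    param.requires_grad = False
for param in toy[-1].parameters():
    param.requires_grad = True


def count_parameters(module):
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    total = sum(p.numel() for p in module.parameters())
    return trainable, total


trainable, total = count_parameters(toy)
print(f"trainable: {trainable} / {total}")
```

The same helper should work on the notebook's model, for example `count_parameters(model)`, to see what fraction of the network is actually being updated.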

In [None]:
metric_name = "matthews_correlation"
metric = load_metric(metric_name)



In [None]:
model_name = model_checkpoint.split("/")[-1]

In [None]:
args = TrainingArguments(
    f"{model_name}-finetuned-cola",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=False,
    seed=33,
)

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    #print(predictions.mean())
    return metric.compute(predictions=predictions, references=labels)

In [None]:
# we pass the tokenizer so that padding is applied on each batch
# the alternative is to use a custom data_collator
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | 0.5907 | 0.563367 | 0.261176 |
| 2 | 0.5277 | 0.532736 | 0.316955 |
| 3 | 0.5078 | 0.539981 | 0.332989 |
| 4 | 0.4771 | 0.509664 | 0.344765 |
| 5 | 0.4825 | 0.557605 | 0.328545 |
| 6 | 0.4702 | 0.525885 | 0.345437 |
| 7 | 0.4496 | 0.534345 | 0.35531 |
| 8 | 0.4448 | 0.53796 | 0.356128 |
| 9 | 0.4415 | 0.537954 | 0.364505 |
| 10 | 0.4319 | 0.538513 | 0.362097 |


TrainOutput(global_step=5350, training_loss=0.47920118955808266, metrics={'train_runtime': 141.0807, 'train_samples_per_second': 606.107, 'train_steps_per_second': 37.922, 'total_flos': 465498976814988.0, 'train_loss': 0.47920118955808266, 'epoch': 10.0})
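
The `global_step=5350` reported above can be checked by hand from the dataset size and the training arguments. This is a small worked calculation, assuming a single device, no gradient accumulation, and that the last (smaller) batch of each epoch is kept.

```python
import math

num_train_examples = 8551  # rows in the train split
batch_size = 16            # per_device_train_batch_size
num_epochs = 10            # num_train_epochs

# Each epoch needs enough batches to cover every example once.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```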

In [None]:
# run evaluate() on the validation data to check that the
# best-performing model was kept
trainer.evaluate()

{'eval_loss': 0.5379540324211121,
 'eval_matthews_correlation': 0.3645047284614709,
 'eval_runtime': 0.7927,
 'eval_samples_per_second': 1315.764,
 'eval_steps_per_second': 83.26,
 'epoch': 10.0}
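
The check can also be done by hand: the validation score returned by `evaluate()` should match the best epoch in the training log, not necessarily the last one. A minimal sketch, with the per-epoch values copied from the log above:

```python
# (epoch, validation Matthews correlation) copied from the training log.
history = [
    (1, 0.261176), (2, 0.316955), (3, 0.332989), (4, 0.344765),
    (5, 0.328545), (6, 0.345437), (7, 0.355310), (8, 0.356128),
    (9, 0.364505), (10, 0.362097),
]

# Pick the epoch with the highest validation score.
best_epoch, best_score = max(history, key=lambda item: item[1])
print(best_epoch, best_score)
```

The best score belongs to epoch 9 and agrees with `eval_matthews_correlation` above, which suggests the epoch-9 checkpoint was the one restored rather than the final epoch-10 weights.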

In [None]:
# performance on the training set:
trainer.evaluate(tokenized_dataset["train"])

{'eval_loss': 0.41475990414619446,
 'eval_matthews_correlation': 0.5088327055331026,
 'eval_runtime': 8.8485,
 'eval_samples_per_second': 966.382,
 'eval_steps_per_second': 60.462,
 'epoch': 10.0}

### Error analysis

Examples with the highest loss

In [None]:
data_collator = trainer.data_collator

def loss_per_example(examples):
    """Add the probability, prediction, and loss of each example to a batch.
    """
    examples = {k: v for k, v in examples.items() if k in ['label', 'input_ids', 'attention_mask']}
    batch = data_collator(examples)
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    labels = batch["labels"].to(device)
    with torch.inference_mode():
        output = model(input_ids, attention_mask)
        # probability of the positive class (label 1)
        batch["proba"] = torch.softmax(output.logits, dim=1)[:, 1]
        batch["predicted_label"] = torch.argmax(output.logits, dim=1)
    # reduction="none" --> one loss value per example
    loss = torch.nn.functional.cross_entropy(output.logits, labels, reduction="none")
    batch["loss"] = loss
#    # older versions of datasets required lists of NumPy arrays
#    for k, v in batch.items():
#        batch[k] = v.cpu().numpy()
    return batch

In [None]:
model.eval()
errors_dataset = tokenized_dataset['validation'].map(
    loss_per_example, batched=True, batch_size=16)


In [None]:
errors_dataset.set_format('pandas')
errors_df = errors_dataset[:][['label', 'proba', 'predicted_label', 'loss']]
# The trainer removes any str-typed feature in place
# --> recover the column
#errors_df['sentence'] = full_dataset['validation']['sentence']

In [None]:
pd.set_option("display.max_colwidth", None)

In [None]:
# false positives
errors_df.query("label == 0").sort_values("loss", ascending=False).head()

|   | label | proba | predicted_label | loss |
|---|---|---|---|---|
| 631 | 0 | 0.98743 | 1 | 4.376444 |
| 1040 | 0 | 0.98586 | 1 | 4.25876 |
| 433 | 0 | 0.980958 | 1 | 3.961104 |
| 480 | 0 | 0.977518 | 1 | 3.795057 |
| 648 | 0 | 0.975879 | 1 | 3.724689 |


In [None]:
# false negatives
errors_df.query("label == 1").sort_values("loss", ascending=False).head()

|   | label | proba | predicted_label | loss |
|---|---|---|---|---|
| 1001 | 1 | 0.111266 | 0 | 2.195834 |
| 544 | 1 | 0.11972 | 0 | 2.122598 |
| 407 | 1 | 0.123313 | 0 | 2.093026 |
| 652 | 1 | 0.124679 | 0 | 2.082011 |
| 995 | 1 | 0.129437 | 0 | 2.044559 |
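
The `loss` column follows directly from `proba`: with two classes, the per-example cross-entropy is the negative log of the probability the model assigned to the true label. The sketch below recomputes it for the first row of each table above. `binary_cross_entropy` is a hypothetical helper, and small differences from the reported values are expected because `proba` is shown rounded.

```python
import math


def binary_cross_entropy(label, proba_positive):
    """Per-example loss: -log of the probability assigned to the true label."""
    p_true = proba_positive if label == 1 else 1.0 - proba_positive
    return -math.log(p_true)


# (label, proba, reported loss) copied from the first row of each table.
rows = [(0, 0.98743, 4.376444), (1, 0.111266, 2.195834)]
for label, proba, reported in rows:
    print(round(binary_cross_entropy(label, proba), 3), reported)
```

This also explains why the false positives have larger losses than the false negatives: the model is more confident when it wrongly predicts "acceptable" (about 0.98) than when it wrongly predicts "unacceptable" (about 0.88).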


## References

* [rasbt's notebooks](https://github.com/rasbt/deeplearning-models#transformers)
* [HuggingFace notebooks](https://huggingface.co/docs/transformers/notebooks)
* [Lewis Tunstall's blog](https://lewtun.github.io/blog/til/nlp/huggingface/transformers/2021/01/01/til-data-collator.html)