<a href="https://colab.research.google.com/github/LCaravaggio/NLP/blob/main/notebooks/05-CausalLMFinetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We are going to fine-tune a causal LM, [**GPT-2**](https://huggingface.co/docs/transformers/model_doc/gpt2):

* It is a transformer-based (causal) LM
* Training data: _WebText_ (scraped from outbound Reddit links with at least 3 upvotes)
* Tokenizer: subword tokenization with BPE (Byte Pair Encoding)
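
To make the BPE idea concrete, here is a minimal sketch of how merges are learned: count adjacent symbol pairs, merge the most frequent one, and repeat. The toy corpus and the helper names (`most_frequent_pair`, `merge_pair`) are made up for illustration. GPT-2's real tokenizer works on bytes and ships a fixed, pre-learned merge table, so this is not its implementation.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word (as a tuple of characters) -> frequency
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 6}

merges = []
for _ in range(2):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)
print(corpus)
```

After two merges the frequent word "low" has become a single symbol, while rarer words are still split into smaller pieces.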

In practice, though, we will use a _distilled_ version: **DistilGPT2**.

_Knowledge distillation_ trains a smaller version of a larger model that it tries to imitate, with the goal of speeding up inference and fine-tuning on specific tasks while sacrificing little performance (see https://arxiv.org/pdf/1910.01108v4.pdf and https://arxiv.org/pdf/2006.05525.pdf).
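
As a rough illustration of the "imitate the larger model" idea, the sketch below computes a soft-target loss: the KL divergence between the teacher's and the student's temperature-softened next-token distributions. The logits, the temperature value and the function names are assumptions for illustration. Real distillation recipes typically combine this term with other losses, so treat it as one ingredient rather than the full procedure.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) computed on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over a 3-token vocabulary
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.0, 0.0, 5.0]

print(distillation_loss(teacher, close_student))  # small: the student agrees with the teacher
print(distillation_loss(teacher, far_student))    # large: the student disagrees
```

The loss is zero only when the student reproduces the teacher's distribution exactly, which is what pushes the small model to mimic the large one.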

-----------------------

Task: understand all the code and answer wherever it says **QUESTION** (**PREGUNTA**)

## Environment setup

In [1]:
!pip install -qU datasets transformers watermark
# !pip install -qU datasets==2.19.0 transformers==4.40.1 watermark # wandb
#!pip install datasets==2.19.0 transformers==4.40.1 accelerate==0.29.3 watermark # wandb


In [2]:
%load_ext watermark

In [3]:
%watermark -vp transformers,datasets,pandas,numpy

Python implementation: CPython
Python version       : 3.11.12
IPython version      : 7.34.0

transformers: 4.51.3
datasets    : 3.5.0
pandas      : 2.2.2
numpy       : 2.0.2



## Data

We load [Yelp reviews](https://huggingface.co/datasets/yelp_review_full). We will use only a subset of the examples to work faster.

To load your own dataset, see https://huggingface.co/docs/datasets/loading.

In [4]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")


In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [6]:
print(*dataset["train"].features.items(), sep="\n")

('label', ClassLabel(names=['1 star', '2 star', '3 stars', '4 stars', '5 stars'], id=None))
('text', Value(dtype='string', id=None))


In [7]:
# 10k train, 2k validation, 5k test
from datasets import DatasetDict

small_dataset = DatasetDict(
    train=dataset["train"].shuffle(seed=33).select(range(0, 10_000)),
    val=dataset["train"].shuffle(seed=33).select(range(10_000, 12_000)),
    test=dataset["test"].shuffle(seed=33).select(range(5_000)),
)

**QUESTION**: Why might we need three sets?

In [8]:
print(small_dataset["train"][0]["text"])

Love this place. Stayed in February for 4 days and after spending 4 days at Red Rock, I decided I had to write a review to compliment the service and amenities of this hotel\n\nPluses: \nClean, nice front desk staff, great kitchenette, spacious, quiet, no smell of smoke because the casinos are down the walkway at the MGM, nice little bar for late night drinks (better if it was open later), good breakfast at the sandwich shop downstairs, convenient casino/restaurants/shopping at the MGM but none of the noise and seediness because the Signature is set apart. Also, reasonably priced even booking through the hotel.\n\nMinuses:\nWould be nice to have a little store with kitchen/breakfast essentials in the hotel to make good use of the kitchenette.\n\nI just stayed at the Red Rock where the front desk service was abysmal. When you only have 4 days at a hotel, it makes a big difference when the staff smile and make you feel welcome and try to address concerns efficiently and effectively.


In [9]:
import re

def clean_text(example):
    """Fix escaped characters as described in the Yelp dataset docs."""
    texto = re.sub(r'\\n', '\n', example["text"])  # literal "\n" -> real newline
    texto = re.sub(r'\\"', '"', texto)  # escaped \" -> plain double quote
    example["text"] = texto
    return example

In [10]:
small_dataset = small_dataset.map(clean_text)


In [11]:
print(small_dataset["train"][0]["text"])

Love this place. Stayed in February for 4 days and after spending 4 days at Red Rock, I decided I had to write a review to compliment the service and amenities of this hotel

Pluses: 
Clean, nice front desk staff, great kitchenette, spacious, quiet, no smell of smoke because the casinos are down the walkway at the MGM, nice little bar for late night drinks (better if it was open later), good breakfast at the sandwich shop downstairs, convenient casino/restaurants/shopping at the MGM but none of the noise and seediness because the Signature is set apart. Also, reasonably priced even booking through the hotel.

Minuses:
Would be nice to have a little store with kitchen/breakfast essentials in the hotel to make good use of the kitchenette.

I just stayed at the Red Rock where the front desk service was abysmal. When you only have 4 days at a hotel, it makes a big difference when the staff smile and make you feel welcome and try to address concerns efficiently and effectively.


## Tokenization and model

The model accepts a max_length of up to 1024 tokens, but that can consume a lot of memory, so we will work with a max_length of 128 tokens.

Specifically, we will split each document into chunks of 128 tokens. Some chunks will have fewer than 128 tokens, either because the document is shorter than that or because they are the leftover piece at the end of a longer document.

To process the data in batches we need _padding_: filling with a special token up to max_length or up to the longest sequence in the batch.

An alternative is to truncate documents longer than 128 tokens, but if there are many long documents this can mean throwing away a lot of information.
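
The difference between chunking, padding and truncation can be sketched with plain Python lists. The 219-token "document" is a stand-in (a list of integers, not real token ids), `chunk` and `pad` are hypothetical helpers, and 50256 is used as the padding id on the assumption that GPT-2's end-of-text token doubles as the pad token.

```python
def chunk(token_ids, max_length=128):
    """Split a token list into consecutive pieces of at most max_length tokens."""
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]

def pad(chunks, pad_id):
    """Right-pad every chunk to the length of the longest chunk in the batch."""
    longest = max(len(c) for c in chunks)
    return [c + [pad_id] * (longest - len(c)) for c in chunks]

doc = list(range(219))                # stand-in for a 219-token document
chunks = chunk(doc)
print([len(c) for c in chunks])       # [128, 91]: nothing is lost

padded = pad(chunks, pad_id=50256)
print([len(c) for c in padded])       # [128, 128]: rectangular, ready to batch

truncated = doc[:128]
print(len(doc) - len(truncated))      # 91 tokens thrown away by truncation
```

Chunking keeps every token at the cost of more (and sometimes short) samples; truncation keeps one sample per document but discards the tail.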

We will load the tokenizer and the weights of a pre-trained model: this is called a **checkpoint**. In this case the architecture is GPT-2 (in its distilled variant), while the checkpoint (the specific weights) is called `distilgpt2`.

We will load the tokenizer and the model with `AutoClass`es, which make it quick to load checkpoints of any architecture.

In [12]:
model_checkpoint = "distilgpt2"

In [13]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# https://huggingface.co/docs/transformers/main_classes/tokenizer#tokenizer


In [14]:
tokenizer.model_max_length  # there are only model_max_length position embeddings

1024

In [15]:
# context_length = tokenizer.model_max_length
context_length = 128

In [16]:
# let's see how tokenization works on 3 examples
ejemplos = small_dataset["train"][:3]
ejemplos

{'label': [4, 2, 1],
 'text': ['Love this place. Stayed in February for 4 days and after spending 4 days at Red Rock, I decided I had to write a review to compliment the service and amenities of this hotel\n\nPluses: \nClean, nice front desk staff, great kitchenette, spacious, quiet, no smell of smoke because the casinos are down the walkway at the MGM, nice little bar for late night drinks (better if it was open later), good breakfast at the sandwich shop downstairs, convenient casino/restaurants/shopping at the MGM but none of the noise and seediness because the Signature is set apart. Also, reasonably priced even booking through the hotel.\n\nMinuses:\nWould be nice to have a little store with kitchen/breakfast essentials in the hotel to make good use of the kitchenette.\n\nI just stayed at the Red Rock where the front desk service was abysmal. When you only have 4 days at a hotel, it makes a big difference when the staff smile and make you feel welcome and try to address concerns e

In [17]:
outputs_ = tokenizer(
    ejemplos["text"],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,  # tokenize each doc and split it into chunks
    return_length=True,  # return the length of each chunk
)

In [18]:
# the output contains input_ids and attention_mask (plus length and
# overflow_to_sample_mapping); for now we will only use input_ids
outputs_

{'input_ids': [[18565, 428, 1295, 13, 16160, 276, 287, 3945, 329, 604, 1528, 290, 706, 4581, 604, 1528, 379, 2297, 4631, 11, 314, 3066, 314, 550, 284, 3551, 257, 2423, 284, 19370, 262, 2139, 290, 35468, 286, 428, 7541, 198, 198, 3646, 2664, 25, 220, 198, 32657, 11, 3621, 2166, 6915, 3085, 11, 1049, 9592, 5857, 11, 40894, 11, 5897, 11, 645, 8508, 286, 7523, 780, 262, 39855, 389, 866, 262, 2513, 1014, 379, 262, 49182, 11, 3621, 1310, 2318, 329, 2739, 1755, 11758, 357, 27903, 611, 340, 373, 1280, 1568, 828, 922, 12607, 379, 262, 20433, 6128, 34624, 11, 11282, 21507, 14, 2118, 2899, 1187, 14, 1477, 33307, 379, 262, 49182, 475, 4844, 286, 262, 7838, 290, 9403, 1272, 780, 262, 34894, 318, 900, 5475, 13, 4418, 11, 13025], [19744, 772, 25452, 832, 262, 7541, 13, 198, 198, 9452, 2664, 25, 198, 17353, 307, 3621, 284, 423, 257, 1310, 3650, 351, 9592, 14, 9032, 7217, 41954, 287, 262, 7541, 284, 787, 922, 779, 286, 262, 9592, 5857, 13, 198, 198, 40, 655, 9658, 379, 262, 2297, 4631, 810, 262, 2166, 

In [19]:
print(f"Number of chunks: {len(outputs_['input_ids'])}")
print(f"Tokens in each chunk: {outputs_['length']}")
print(f"Chunk-to-doc mapping: {outputs_['overflow_to_sample_mapping']}")

Number of chunks: 6
Tokens in each chunk: [128, 91, 46, 128, 128, 25]
Chunk-to-doc mapping: [0, 0, 1, 2, 2, 2]
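
A short sketch of how to read this output: grouping the chunk lengths by the document index they map to recovers each original document's token count. The two lists are copied from the output above; the variable names are only illustrative.

```python
from collections import defaultdict

lengths = [128, 91, 46, 128, 128, 25]   # tokens in each chunk (from the output above)
mapping = [0, 0, 1, 2, 2, 2]            # index of the source document for each chunk

tokens_per_doc = defaultdict(int)
chunks_per_doc = defaultdict(int)
for length, doc_idx in zip(lengths, mapping):
    tokens_per_doc[doc_idx] += length
    chunks_per_doc[doc_idx] += 1

print(dict(tokens_per_doc))   # {0: 219, 1: 46, 2: 281}
print(dict(chunks_per_doc))   # {0: 2, 1: 1, 2: 3}
```

So the first review has 219 tokens split into two chunks, the second fits in a single chunk, and the third needs three.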


In [20]:
# tokenize() returns the split into subwords
tokens_ = tokenizer.tokenize(ejemplos["text"][0])
print(tokens_)

['Love', 'Ġthis', 'Ġplace', '.', 'ĠStay', 'ed', 'Ġin', 'ĠFebruary', 'Ġfor', 'Ġ4', 'Ġdays', 'Ġand', 'Ġafter', 'Ġspending', 'Ġ4', 'Ġdays', 'Ġat', 'ĠRed', 'ĠRock', ',', 'ĠI', 'Ġdecided', 'ĠI', 'Ġhad', 'Ġto', 'Ġwrite', 'Ġa', 'Ġreview', 'Ġto', 'Ġcompliment', 'Ġthe', 'Ġservice', 'Ġand', 'Ġamenities', 'Ġof', 'Ġthis', 'Ġhotel', 'Ċ', 'Ċ', 'Pl', 'uses', ':', 'Ġ', 'Ċ', 'Clean', ',', 'Ġnice', 'Ġfront', 'Ġdesk', 'Ġstaff', ',', 'Ġgreat', 'Ġkitchen', 'ette', ',', 'Ġspacious', ',', 'Ġquiet', ',', 'Ġno', 'Ġsmell', 'Ġof', 'Ġsmoke', 'Ġbecause', 'Ġthe', 'Ġcasinos', 'Ġare', 'Ġdown', 'Ġthe', 'Ġwalk', 'way', 'Ġat', 'Ġthe', 'ĠMGM', ',', 'Ġnice', 'Ġlittle', 'Ġbar', 'Ġfor', 'Ġlate', 'Ġnight', 'Ġdrinks', 'Ġ(', 'better', 'Ġif', 'Ġit', 'Ġwas', 'Ġopen', 'Ġlater', '),', 'Ġgood', 'Ġbreakfast', 'Ġat', 'Ġthe', 'Ġsandwich', 'Ġshop', 'Ġdownstairs', ',', 'Ġconvenient', 'Ġcasino', '/', 'rest', 'aur', 'ants', '/', 'sh', 'opping', 'Ġat', 'Ġthe', 'ĠMGM', 'Ġbut', 'Ġnone', 'Ġof', 'Ġthe', 'Ġnoise', 'Ġand', 'Ġseed', 'iness', 'Ġbe

In [21]:
# the GPT-2 tokenizer treats spaces as part of the words, so a word is
# encoded differently in the middle of a sequence than at the beginning
# https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2Tokenizer

print(tokenizer.tokenize("Love this place"))
print(tokenizer("Love this place")['input_ids'])
print(tokenizer.tokenize(" Love this place"))
print(tokenizer(" Love this place")['input_ids'])

['Love', 'Ġthis', 'Ġplace']
[18565, 428, 1295]
['ĠLove', 'Ġthis', 'Ġplace']
[5896, 428, 1295]


In [22]:
def tokenize_fn(example):
    """Tokenize the `text` field of a batch of dataset examples.
    Returns only input_ids.
    """
    outputs = tokenizer(
        example["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    return {"input_ids": outputs["input_ids"]}

In [23]:
# Apply the tokenization in batches, using 4 processes to speed things up;
# the remaining columns are dropped
tokenized_dataset = small_dataset.map(
    tokenize_fn, batched=True, num_proc=4,
    remove_columns=small_dataset["train"].column_names)

# NOTE: to keep other columns we would have to output one value per generated
# chunk (this tokenization produces more samples than the initial number of
# examples)


In [24]:
small_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 10000
    })
    val: Dataset({
        features: ['label', 'text'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 5000
    })
})

In [25]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 18555
    })
    val: Dataset({
        features: ['input_ids'],
        num_rows: 3799
    })
    test: Dataset({
        features: ['input_ids'],
        num_rows: 9280
    })
})

**QUESTION**: what does each _row_ of tokenized_dataset represent?

In [26]:
# Load the model
# We use the EOS token as the PAD token to avoid warnings (GPT-2 has no PAD token)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_checkpoint, pad_token_id=tokenizer.eos_token_id)


In [27]:
model_size = sum(t.numel() for t in model.parameters())
print(f"Model size: {model_size/1000**2:.1f}M parameters")
# numel: number of elements in a tensor

# for scale: GPT-3 has 175B parameters; GPT-4's size has not been officially disclosed

Model size: 81.9M parameters


In [28]:
print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
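
The reported 81.9M parameters can be reproduced by hand from the layer shapes printed above. This is a back-of-the-envelope sketch: it assumes every `Conv1D` and `LayerNorm` carries a bias, and that `lm_head` shares its weight matrix with `wte` (weight tying), so the output projection is not counted twice.

```python
vocab, n_pos, d, n_layers = 50257, 1024, 768, 6

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * d  # scale and shift vectors

block = (
    layer_norm            # ln_1
    + linear(d, 3 * d)    # attn.c_attn (queries, keys and values in one matrix)
    + linear(d, d)        # attn.c_proj
    + layer_norm          # ln_2
    + linear(d, 4 * d)    # mlp.c_fc
    + linear(4 * d, d)    # mlp.c_proj
)

total = vocab * d + n_pos * d + n_layers * block + layer_norm  # wte + wpe + blocks + ln_f
print(f"{total:,} parameters ({total / 1000**2:.1f}M)")
print(f"token embeddings share: {vocab * d / total:.0%}")
```

Close to half of the parameters sit in the token embedding matrix, which is typical for small models with a large vocabulary.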


## Training

A "collator" is a function that assembles batches of data.

We will use a collator that builds padded batches of examples. `DataCollatorForLanguageModeling` is designed specifically for language models.

In particular, it takes care of:

* building the model's targets (`labels`) _on the fly_ during training, so we do not have to store a duplicate of the input_ids in the dataset. The labels are a copy of the input_ids with padding positions set to -100 so the loss ignores them; the one-token shift between inputs and targets is applied inside the model.
* adding padding where needed

We set `mlm=False` to use **Causal Language Modeling** instead of Masked Language Modeling.
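
The collator's job in causal mode can be sketched with plain lists. This is a simplified stand-in, not the library's code: `collate` and `shifted_pairs` are hypothetical helpers, the token ids are invented, and the sketch masks only the padding it adds itself. To my understanding the real collator masks every position equal to the pad id, and the shift between inputs and targets happens inside the model's loss.

```python
def collate(examples, pad_id, ignore_index=-100):
    """Pad to the longest example; build attention_mask and labels on the fly."""
    longest = max(len(e["input_ids"]) for e in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for e in examples:
        ids = e["input_ids"]
        n_pad = longest - len(ids)
        batch["input_ids"].append(ids + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # labels are a copy of the inputs, with padding marked as "ignore"
        batch["labels"].append(ids + [ignore_index] * n_pad)
    return batch

def shifted_pairs(input_ids, labels, ignore_index=-100):
    """(input token, target token) pairs the loss scores: position t predicts label t+1."""
    return [(x, y) for x, y in zip(input_ids[:-1], labels[1:]) if y != ignore_index]

batch = collate([{"input_ids": [10, 11, 12, 13]}, {"input_ids": [20, 21]}], pad_id=0)
print(batch["input_ids"])       # [[10, 11, 12, 13], [20, 21, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
print(batch["labels"])          # [[10, 11, 12, 13], [20, 21, -100, -100]]
print(shifted_pairs(batch["input_ids"][1], batch["labels"][1]))  # [(20, 21)]
```

The short example contributes a single prediction to the loss (20 predicts 21); its padded positions are skipped.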

We can log metrics during training with TensorBoard, wandb, etc.

In [29]:
# padding is done with the EOS token
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

In [30]:
# let's look at an example with a batch of 3 chunks
out = data_collator([tokenized_dataset["train"][i] for i in range(3)])
for key in out:
    print(f"{key} shape: {out[key].shape}")

input_ids shape: torch.Size([3, 128])
attention_mask shape: torch.Size([3, 128])
labels shape: torch.Size([3, 128])


In [31]:
# there is padding:
out["input_ids"][1]

tensor([19744,   772, 25452,   832,   262,  7541,    13,   198,   198,  9452,
         2664,    25,   198, 17353,   307,  3621,   284,   423,   257,  1310,
         3650,   351,  9592,    14,  9032,  7217, 41954,   287,   262,  7541,
          284,   787,   922,   779,   286,   262,  9592,  5857,    13,   198,
          198,    40,   655,  9658,   379,   262,  2297,  4631,   810,   262,
         2166,  6915,  2139,   373,   450,   893,  7617,    13,  1649,   345,
          691,   423,   604,  1528,   379,   257,  7541,    11,   340,  1838,
          257,  1263,  3580,   618,   262,  3085,  8212,   290,   787,   345,
         1254,  7062,   290,  1949,   284,  2209,  4786, 18306,   290,  6840,
           13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256])

In [32]:
# attention mask, so that attention ignores the pad tokens:
out["attention_mask"][1]

tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])

In [33]:
# use only the model name for the new name (not the user/organization)
pretrained_model_name = model_checkpoint.split("/")[-1]
finetuned_model_name = f"{pretrained_model_name}-finetuned-yelp"
print(finetuned_model_name)

distilgpt2-finetuned-yelp


If we are going to use wandb, copy the API key from https://wandb.ai/authorize

In [34]:
#!wandb login

In [35]:
#os.environ["WANDB_PROJECT"] = project_name

In [39]:
# define the training parameters
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    finetuned_model_name,
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=5e-4,
    weight_decay=0.1,  # regularization: decays the weights toward zero at each optimizer update
    warmup_ratio=0.1,  # LR warmup over the first 10% of steps helps avoid early loss divergence
    lr_scheduler_type="cosine",
    do_eval=True,  # evaluate on the validation set
    gradient_accumulation_steps=1,  # accumulate gradients for N steps --> one update every N*32 samples
    # useful when large batches do not fit in memory and we have many samples
    eval_strategy="steps",  # evaluate on the validation set every eval_steps
    eval_steps=50,
    save_strategy="steps",
    load_best_model_at_end=True,  # keep the best model according to eval loss
    save_total_limit=2,  # save max 2 models including best one
    save_steps=50,  # checkpoint model every N steps
    logging_dir='./logs',  # logging
    logging_strategy="steps",
    logging_steps=1,
    fp16=True,  # float16 mixed precision in training (only on CUDA)
    push_to_hub=False,
    # report_to="wandb",  # enable logging to W&B
    report_to="none",
    save_safetensors=False,  # workaround for a bug
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"], #.select(range(0, 128)),
    eval_dataset=tokenized_dataset["val"], #.select(range(0, 128)),
)
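
To see what `warmup_ratio=0.1` and `lr_scheduler_type="cosine"` do to the learning rate, here is a small sketch of a linear-warmup plus cosine-decay schedule. It follows my understanding of the usual formula; the exact rounding of the warmup step count inside the library is an assumption, and `lr_at` is a hypothetical helper. The sample count (18555 training chunks) and batch size (32) come from this notebook.

```python
import math

def lr_at(step, total_steps, base_lr=5e-4, warmup_ratio=0.1):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# one epoch over 18555 chunks with batches of 32 and no gradient accumulation
total_steps = math.ceil(18555 / 32)
print(total_steps)  # 580

for step in (0, 29, 58, 300, 580):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

The rate climbs from 0 to the configured 5e-4 during roughly the first tenth of training and then decays smoothly back to 0 by the last step.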

**QUESTION**: what is the learning_rate parameter?

In [40]:
#!rm -rf ./logs # for wandb/tensorboard

In [41]:
#%reload_ext tensorboard
#%tensorboard --logdir logs

# for wandb/tensorboard

In [42]:
# Train!
train_output = trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


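| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 4.0437        | 3.932843        |
| 100  | 4.0261        | 3.932244        |
| 150  | 3.9204        | 3.89332         |
| 200  | 3.783         | 3.856563        |
| 250  | 3.8665        | 3.813824        |
| 300  | 4.074         | 3.782219        |
| 350  | 4.0116        | 3.756212        |
| 400  | 3.685         | 3.729402        |
| 450  | 3.734         | 3.713219        |
| 500  | 3.7231        | 3.701094        |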


In [43]:
# to save model:
trainer.save_model()

## Evaluation

In [44]:
train_output

TrainOutput(global_step=580, training_loss=3.853935789239818, metrics={'train_runtime': 374.7762, 'train_samples_per_second': 49.51, 'train_steps_per_second': 1.548, 'total_flos': 606045150904320.0, 'train_loss': 3.853935789239818, 'epoch': 1.0})

In [45]:
# we recompute the loss on train because train_output.training_loss
# is computed with a different criterion than trainer.evaluate()
train_results = trainer.evaluate(tokenized_dataset["train"])
val_results = trainer.evaluate()
test_results = trainer.evaluate(tokenized_dataset["test"])

In [46]:
train_results

{'eval_loss': 3.4332311153411865,
 'eval_runtime': 60.3162,
 'eval_samples_per_second': 307.629,
 'eval_steps_per_second': 9.616,
 'epoch': 1.0}

In [47]:
val_results

{'eval_loss': 3.6965413093566895,
 'eval_runtime': 12.3324,
 'eval_samples_per_second': 308.051,
 'eval_steps_per_second': 9.649,
 'epoch': 1.0}

In [49]:
import numpy as np

print("Perplexity:")
print(f"Train: {np.exp(train_results['eval_loss']):.2f}")
print(f"Validation: {np.exp(val_results['eval_loss']):.2f}")
print(f"Test: {np.exp(test_results['eval_loss']):.2f}")

Perplexity:
Train: 30.98
Validation: 40.31
Test: 39.93
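
Perplexity is the exponential of the average negative log-likelihood per token, which is why `np.exp` is applied to the evaluation loss. The sketch below shows the definition on made-up token probabilities (`perplexity` is an illustrative helper) and then reapplies it to the losses reported above.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that always spreads its probability evenly over 4 options is
# "as confused as" a uniform choice among 4 tokens: perplexity of about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# More confident predictions on the observed tokens give lower perplexity
print(perplexity([0.9, 0.8, 0.95, 0.7]))

# Same computation on the eval losses reported above
print(f"Train: {math.exp(3.4332311153411865):.2f}")
print(f"Validation: {math.exp(3.6965413093566895):.2f}")
```

Read this way, a validation perplexity of about 40 means the model is on average as uncertain as if it were choosing uniformly among roughly 40 candidate tokens.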


In [50]:
# compare with the GPT-2 checkpoint that was not fine-tuned
# a bit hacky: we instantiate a Trainer but do not train with it,
# only to replicate the previous evaluation exactly; an ad hoc
# function would be cleaner
model_original = AutoModelForCausalLM.from_pretrained(
    model_checkpoint, pad_token_id=tokenizer.eos_token_id)
trainer_aux = Trainer(
    model=model_original,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"], #.select(range(0, 128)),
    eval_dataset=tokenized_dataset["test"], #.select(range(0, 128)),
)

In [51]:
test_results_original = trainer_aux.evaluate(tokenized_dataset["test"])

In [52]:
print("Perplexity (no fine-tuning):")
print(f"Test: {np.exp(test_results_original['eval_loss']):.2f}")

Perplexity (no fine-tuning):
Test: 64.79


**QUESTION**: why does the fine-tuned version have lower perplexity than the version without fine-tuning?

### Text generation

In [53]:
import torch

device = f"cuda:{torch.cuda.current_device()}" if torch.cuda.is_available() else "cpu"

In [70]:
def generate(
    prompt=None, max_length=100, greedy=True, model=model, tokenizer=tokenizer, device=device
):
    """Generate text with sampling (greedy=False) or greedy search (greedy=True).

    prompt=None stands for beginning of sequence.

    NOTE although GPT-2 seems to be able to generate from the BOS token, the
    documentation is unclear. Besides, we fine-tuned without a BOS token,
    so we will only call this function with a prompt.

    See:
    https://github.com/huggingface/transformers/issues/3311#issuecomment-601264426
    https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/generate_unconditional_samples.py#L60
    """
    do_sample = not greedy
    # model.eval() sets dropout (and batch normalization, if any) layers to evaluation mode before inference
    model.eval()
    with torch.inference_mode():
        if prompt:
            input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
            outputs = model.generate(input_ids, do_sample=do_sample, max_length=max_length, pad_token_id=tokenizer.eos_token_id)
        else:
            outputs = model.generate(do_sample=do_sample, max_length=max_length, pad_token_id=tokenizer.eos_token_id)
    # pad_token_id=tokenizer.eos_token_id to suppress warning
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

In [71]:
res_ = generate('I loved "El Topo" because')
print(res_[0])

I loved "El Topo" because it was a great place to go for a late night meal.  I had the chicken and the chicken.  The chicken was good, but the chicken was not good.  The chicken was not good.  The chicken was not good.  The chicken was not good.  The chicken was not good.  The chicken was not good.  The chicken was not good.  The chicken was not good.  The chicken was not good. 


**QUESTION**: with the same prompt, will we always get the same generation?

In [72]:
torch.manual_seed(33)
res_ = generate('I loved "El Topo" because', greedy=False)
print(res_[0])

I loved "El Topo" because I've been searching for good tacos. Well on the one that opened I found the place on the road near Scottsdale with some good, tasty fare. 

After eating here for a couple years (for some reason) my tacos came to $12 for the entire Mexican food. I was disappointed with the food since it was a little small when I was the first to order on the road. The quality of tacos was not bad. For a


In [73]:
torch.manual_seed(0)
res_ = generate('I loved "El Topo" because', greedy=False)
print(res_[0])

I loved "El Topo" because I liked what it tasted like.  My husband and I stayed at the Luxor for over two years from 2008 to 2008.  It was always nice to have a nice view, and the outdoor patio was a relaxing spot for a relaxing day.

We went for lunch on two nights one night and it was all pretty cool...we shared our own menu and got a decent bottle of wine for our second visit.  Everyone was quite nice and was
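
The two decoding modes used so far differ only in how the next token is picked from the model's distribution. The toy sketch below uses a made-up three-word distribution and Python's `random` module instead of the model and `torch`; `pick_greedy` and `pick_sample` are hypothetical helpers.

```python
import random

# Hypothetical next-token distribution after some prompt
next_token_probs = {"good": 0.5, "great": 0.3, "bad": 0.2}

def pick_greedy(probs):
    """Greedy search: always take the most probable token."""
    return max(probs, key=probs.get)

def pick_sample(probs, rng):
    """Sampling: draw a token at random, proportionally to its probability."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

print([pick_greedy(next_token_probs) for _ in range(3)])  # ['good', 'good', 'good']

rng_a, rng_b = random.Random(33), random.Random(33)
draws_a = [pick_sample(next_token_probs, rng_a) for _ in range(5)]
draws_b = [pick_sample(next_token_probs, rng_b) for _ in range(5)]
print(draws_a)
print(draws_a == draws_b)  # True: same seed, same sequence of draws
```

Greedy picks have no randomness at all, while sampled picks vary with the state of the random number generator, which is the reason for the `torch.manual_seed` calls in the cells above.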


In [74]:
torch.manual_seed(33)
res_ = generate('I loved "El Topo" because', greedy=False, model=model_original)
print(res_[0])

I loved "El Topo" because he knows better than even the stars of his film.



"It's always funny and you're always on my side. In my mind, it's always cool. A lot of these people who talk about it are like, "Oh, I'm sure you haven't already seen El Topo but I'll try to see if I do see he."

"Maybe I'm just playing the great show. When people think I'm


**QUESTION**: why are the format and content of the text generated by the model without fine-tuning so different from those of the fine-tuned model?

In [75]:
torch.manual_seed(23)
res_ = generate('I hated the cake from "El Topo" because', greedy=False)
print(res_[0])

I hated the cake from "El Topo" because I thought the cookies and chocolate on the cake are the best. So it was about the size of a few little things and left me craving for what the real deal was. I tried to try the apple muffin (and the best of them too). The chocolate was cooked way in a great way with a creamy crust. No "secret" or "magic". That took my life to order. The waitress asked me what the difference between the


In [76]:
generate('It was the worst day ever because', greedy=False)

['It was the worst day ever because your first experience with a store did not have very much to offer.  You were going to spend a lot of money looking for something to carry but when they asked with a couple of the customers what they did and how they were doing the deal they were looking for they were just talking at the register.  We were in the store and our husband grabbed a different thing and walked in.  In the middle of the store we had a customer come by to tell']

## References

* [Causal LM from scratch](https://huggingface.co/course/chapter7/6?#training-a-causal-language-model-from-scratch)

* [LM finetuning](https://github.com/huggingface/notebooks/blob/6ca682955173cc9d36ffa431ddda505a048cbe80/examples/language_modeling.ipynb)

* [Customized training](https://huggingface.co/course/chapter3/4#a-full-training)

* [Text generation](https://github.com/huggingface/blog/blob/main/notebooks/02_how_to_generate.ipynb)

* [Scripts for training and fine-tuning models](https://github.com/huggingface/transformers/tree/main/examples/pytorch)

* [About GPT-2](https://huggingface.co/gpt2)

* [Autoclasses](https://huggingface.co/docs/transformers/autoclass_tutorial)

* [Hugging Face + wandb](https://docs.wandb.ai/guides/integrations/huggingface) (I could not get it to work well in Colab 😞)

* [Howard & Gugger (2020) - Deep learning for coders with fastai and PyTorch](https://dl.ebooksworld.ir/books/Deep.Learning.for.Coders.with.fastai.and.PyTorch.Howard.Gugger.OReilly.9781492045526.EBooksWorld.ir.pdf): general fine-tuning and DL topics