# Transfer learning: Fine-tuning of DistilGPT2 on Recipes dataset

<img src="https://jalammar.github.io/images/xlnet/transformer-decoder-intro.png" width=900 height=450>



### Kaggle troubleshooting
 
If you plan to run the notebook on Kaggle and the `load_dataset` function fails, run the commands below to check whether the `datasets` and `transformers` packages are installed, since the environment sometimes does not recognize them.

In [5]:
!pip freeze | grep "transformers"

sentence-transformers==4.1.0
transformers==4.53.3


In [7]:
!pip freeze | grep "datasets"

datasets==4.4.1
tensorflow-datasets==4.9.9
vega-datasets==0.9.0
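
The same check can be done from Python instead of the shell. The sketch below is only an illustration: the helper name `installed_version` is hypothetical, and it relies on the standard-library `importlib.metadata` module rather than anything specific to Kaggle.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("datasets", "transformers"):
    version = installed_version(name)
    if version is None:
        # In a notebook cell, `!pip install <name>` would install the package.
        print(f"{name} is NOT installed")
    else:
        print(f"{name}=={version}")
```

If a package is reported as missing, installing it with `pip` and restarting the kernel is usually enough.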


# Load the dataset and split it into train and test

In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments

dataset = load_dataset("corbt/all-recipes", split="train[:10000]") # Out of more than 2M rows

Generating train split: 100%|██████████| 2147248/2147248 [00:02<00:00, 857533.05 examples/s] 


In [3]:
dataset = dataset.train_test_split(test_size=0.2, seed=42)
dataset

DatasetDict({
    train: Dataset({
        features: ['input'],
        num_rows: 8000
    })
    test: Dataset({
        features: ['input'],
        num_rows: 2000
    })
})

Let's check an example:

In [4]:
print(dataset['train']['input'][0])

Chicken Pizza

Ingredients:
- 1 (8 oz.) pkg. refrigerated crescent dinner rolls
- 2 whole chicken breasts, split, skinned and boned
- 1/4 c. vegetable oil
- 1 large onion, sliced into thin rings
- 1 large green bell pepper, sliced into thin rings
- 1/2 lb. fresh mushrooms, sliced
- 1/2 c. pitted ripe olives, sliced
- 1 (10 1/2 oz.) can pizza sauce with cheese
- 1 tsp. garlic salt
- 1 tsp. dried oregano
- 1/4 c. grated Parmesan cheese
- 2 c. (8 oz.) shredded Mozzarella cheese

Directions:
- Separate rolls into 8 triangles.
- Press triangles into lightly oiled 12-inch pizza pan, covering it completely.
- Cut up chicken into bite size pieces.


Load the model and the corresponding tokenizer.

**NOTE**: A pretrained model must be used with the tokenizer it was trained with. If you plan to train a model from scratch (as in `exercise 5`), you can use any tokenizer (even a pretrained one), but make sure to resize the model's token embedding matrix so that it matches the tokenizer's vocabulary size.

In [None]:
# Load the pretrained tokenizer and the corresponding model
model_name = "distilbert/distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

Let's tokenize the dataset using its `map` method.

In [None]:
def tokenize_batch(examples):
    return tokenizer(examples['input'], padding=True, truncation=True, max_length=128)

# TODO: tokenize the datasets.Dataset properly and return PyTorch tensors for Trainer

Map: 100%|██████████| 8000/8000 [00:00<00:00, 13350.13 examples/s]
Map: 100%|██████████| 2000/2000 [00:00<00:00, 13411.84 examples/s]
Map: 100%|██████████| 8000/8000 [00:00<00:00, 16918.02 examples/s]
Map: 100%|██████████| 2000/2000 [00:00<00:00, 17960.84 examples/s]


# Solution

In [None]:
def tokenize_batch(examples):
    return tokenizer(examples['input'], padding=True, truncation=True, max_length=128)

# GPT-2 tokenizers define no padding token, so we reuse the EOS token for padding.
# First check the attribute `special_tokens_map` to verify that an EOS token is defined.
if 'eos_token' in tokenizer.special_tokens_map:
    print(True)
else:
    print('The special token `EOS` is not present!')
    print(f"Length of the vocabulary: {len(tokenizer.get_vocab())}")
    print('Adding the special token `EOS` to the vocabulary')
    tokenizer.add_special_tokens({'eos_token': 'EOS'})
    # A new token needs a matching row in the model's embedding matrix
    model.resize_token_embeddings(len(tokenizer))
    print(f"Length of the new vocabulary: {len(tokenizer.get_vocab())}")
    print(f"Added vocabulary: {tokenizer.get_added_vocab()}")

# Pad with the EOS token
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the datasets.Dataset properly and return PyTorch tensors for Trainer
tokenized_ds = dataset.map(tokenize_batch, batched=True, remove_columns=dataset['train'].column_names)
# Provide labels for Trainer (causal LM uses input_ids as labels).
# Not necessary when passing mlm=False to the data collator, but kept for clarity.
tokenized_ds = tokenized_ds.map(lambda x: {"labels": x["input_ids"]}, batched=True)
tokenized_ds.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
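
To make the effect of `padding=True, truncation=True, max_length=128` more concrete, here is a minimal pure-Python sketch of truncating and right-padding a batch of token ids. The function `pad_and_truncate` and the toy ids are hypothetical and only mimic the idea; the real tokenizer handles many more details.

```python
def pad_and_truncate(sequences, pad_id, max_length):
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    target_len = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy batch; 50256 is the EOS id reported for this tokenizer in the generation log
batch = pad_and_truncate([[5, 6, 7, 8, 9], [3, 4], [1]], pad_id=50256, max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The first sequence is cut to four tokens, and the two shorter ones are filled with the padding id up to the same length.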


Let's add the [Data Collator](https://huggingface.co/docs/transformers/main_classes/data_collator) for dynamic padding. 

Import the `DataCollatorForLanguageModeling` and instantiate it.

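**NOTE**: GPT-2 defines no padding token, and in <u>Causal Language Modeling</u> we don't need a dedicated one. We will pad all the sequences with the end-of-sequence token (`EOS`); with `mlm=False`, the collator excludes the padded positions from the loss.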

In [None]:
# TODO: Import

# Check if the tokenizer has the eos token

# Set the padding token equal to the eos token

# Instantiate the data collator

# Solution

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
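
As a rough illustration of what the collator does with `mlm=False`, the sketch below builds causal-LM labels by copying the input ids and replacing padding positions with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The helper `make_causal_lm_labels` is hypothetical and works on plain lists; it is not the library's implementation.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def make_causal_lm_labels(input_ids, pad_id):
    """Copy the input ids as labels and mask every padding position."""
    return [
        [IGNORE_INDEX if token == pad_id else token for token in sequence]
        for sequence in input_ids
    ]


# Toy batch where the first sequence ends with two padding tokens
labels = make_causal_lm_labels(
    [[11, 12, 50256, 50256], [21, 22, 23, 24]], pad_id=50256
)
print(labels)
```

Because the padding token is the same as `EOS` here, any genuine `EOS` token in a sequence would be masked in the same way.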


Let's see how the model behaves before the fine-tuning

In [None]:
import torch 
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')

model.to(DEVICE)

prompt = "Title: Pancakes\nIngredients:"
inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
outputs = model.generate(**inputs, max_length=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Title: Pancakes
Ingredients:
1/2 cup of water
1/2 cup of water
1/2 cup of water
1/2 cup of water
1/2 cup of water
1/2 cup of water
1/2 cup of water
1/2


It seems we have to hydrate a lot. 

<img src="https://i.pinimg.com/originals/8e/fb/0c/8efb0cf59450a32e466848c5560d910d.png" width=150 height=150>


Let's see if it improves after the fine-tuning!

# Fine-tuning

Run the fine-tuning on the given dataset for 10 epochs. For reference, training for 10 epochs with a batch size of 64 recipes took about 30 minutes on Kaggle.

In [None]:
# Define the training arguments
args = TrainingArguments(
    # TODO: 
    # Train for 10 epochs,
    # use the epoch as evaluation strategy,
    # Train with a batch size of 64. NOTE: If you plan to train locally, make sure you have enough VRAM! If you don't, consider a smaller batch size combined with the parameter `gradient_accumulation_steps`
    learning_rate=5e-5,
    weight_decay=0.01,
    logging_steps=100,
    save_steps=200,
    report_to="none",
)
# Instantiate the Trainer by passing the model, the datasets, the collator, and the arguments we have above
trainer = Trainer(model=model, 
                  train_dataset=tokenized_ds['train'],
                  eval_dataset=tokenized_ds['test'],
                  data_collator=data_collator,
                  args=args)
# Train (easy right?)
trainer.train()

## Solution

In [None]:
# Define the training arguments
args = TrainingArguments(
    output_dir="./fine_tuning/reGiPT2",
    num_train_epochs=10,
    eval_strategy="epoch",
    per_device_train_batch_size=64,
    gradient_accumulation_steps=1,
    learning_rate=5e-5,
    weight_decay=0.01,
    logging_steps=100,
    save_steps=200,
    report_to="none",
)
# Instantiate the Trainer by passing the model, the datasets, the collator, and the arguments we have above
trainer = Trainer(model=model, 
                  train_dataset=tokenized_ds['train'],
                  eval_dataset=tokenized_ds['test'],
                  data_collator=data_collator,
                  args=args)
# Train (easy right?)
trainer.train()
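
The batch-size hint mentions `gradient_accumulation_steps` as a fallback when memory is limited. The arithmetic below is a sketch of how the effective batch size and the number of optimizer updates relate; the helper `training_steps` is hypothetical, and the exact rounding used by `Trainer` may differ between versions.

```python
import math


def training_steps(num_examples, per_device_batch_size,
                   grad_accum_steps=1, num_devices=1, epochs=1):
    """Return (effective batch size, approximate number of optimizer updates)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return effective_batch, updates_per_epoch * epochs


# 8000 training recipes, batch size 64, 10 epochs (the setup used here)
print(training_steps(8000, 64, epochs=10))
# Same effective batch size with a quarter of the memory per forward pass
print(training_steps(8000, 16, grad_accum_steps=4, epochs=10))
```

Both configurations give an effective batch of 64 examples per update, so a smaller per-device batch with accumulation trades speed for memory rather than changing the optimization schedule.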

# Evaluation
Let's measure the perplexity and study the `generate` method. Its parameters control, among other things, the decoding algorithm (e.g., greedy decoding or beam search) and the temperature.

**NOTE**: The temperature is a strictly positive float that controls how deterministic or random the model's output is. The logits are divided by the temperature before the softmax: a value < 1 sharpens the distribution, concentrating the probability on the most likely tokens, while a value > 1 flattens it, so that lower-probability tokens are sampled more often.
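
The effect of the temperature can be checked on a toy example. The sketch below uses plain Python and made-up logits; `softmax_with_temperature` is a hypothetical helper, not a function from `transformers`.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

As the temperature grows, the probability of the top token shrinks and the distribution moves toward uniform.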

### Beam search visualization reminder
<img src=https://pytorch.org/assets/images/fast-beam-search-decoding-in-pytorch-with-torchaudio-and-flashlight-text-1.jpeg>
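
To complement the picture, here is a minimal beam search over a hand-written toy model. Everything in it is an assumption made for illustration: the `TOY_MODEL` table, the token names, and the scoring by summed log-probabilities without length normalization. It is not how `generate` is implemented.

```python
import math

# Hypothetical toy "language model": next-token probabilities given the last token
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"x": 0.9, "y": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
}


def beam_search(num_beams, max_steps=5):
    """Return the best (tokens, log-probability) found with `num_beams` beams."""
    beams = [(["<s>"], 0.0)]
    finished = []
    for _ in range(max_steps):
        # Expand every live hypothesis with every possible next token
        candidates = []
        for tokens, score in beams:
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the `num_beams` highest-scoring expansions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            if tokens[-1] == "</s>":
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])


for k in (1, 2):
    tokens, score = beam_search(num_beams=k)
    print(f"num_beams={k}: {' '.join(tokens)} (p={math.exp(score):.2f})")
```

With a single beam the search is greedy and commits to `a`, the locally best first token; with two beams it keeps `b` alive and ends up with a sequence of higher overall probability.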

In [16]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 4.72
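
Perplexity is the exponential of the average negative log-likelihood per token, which is why the cell above exponentiates `eval_loss`. The sketch below shows the relation on made-up numbers; the helper `perplexity` is hypothetical and assumes the loss is a mean over tokens expressed in nats.

```python
import math


def perplexity(token_nlls):
    """Exponential of the mean negative log-likelihood (in nats) over tokens."""
    mean_nll = sum(token_nlls) / len(token_nlls)
    return math.exp(mean_nll)


# A model that spreads probability uniformly over 4 tokens has perplexity 4
uniform_nlls = [-math.log(1 / 4)] * 10
print(f"Uniform over 4 tokens: {perplexity(uniform_nlls):.2f}")

# Going the other way: the loss that corresponds to a given perplexity
print(f"Loss for perplexity 4.72: {math.log(4.72):.3f}")
```

A perplexity of 4.72 can therefore be read as the model being, on average, about as uncertain as if it were choosing uniformly among roughly five tokens at each step.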


In [None]:
model.to(DEVICE)

prompt = "Title: Pancakes\n\nIngredients:"
print(prompt)
print(30*'-')
inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
# TODO: generate the output

# Solution

You can check the available generation approaches at this [link](https://huggingface.co/docs/transformers/v4.57.3/generation_strategies)

**NOTE**: The temperature parameter is taken into account only if we sample during the decoding (i.e., we have to set `do_sample` to `True`). You can check the [source code](https://github.com/huggingface/transformers/blob/d08b98b965176ea9cf8c8e8b24995c955b7e2ec9/src/transformers/generation/logits_process.py#L244).

In [None]:
model.to(DEVICE)

prompt = "Title: Pancakes\n\nIngredients:"
print("PROMPT:\n",prompt)
print(30*'-')
inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)

# Greedy decoding: the temperature has no effect because no sampling takes place
outputs = model.generate(**inputs, max_length=256)
print("Greedy decoding:\n")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(30*'-')

# Beam search with num_beams = 5: the temperature has no effect here either, because `do_sample` is False
outputs = model.generate(**inputs, max_length=256, num_beams=5, temperature=1.0)
print("Beam search:\n")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(30*'-')

# output beam search multinomial sampling num_beams = 5, T = 1.0
outputs = model.generate(**inputs, max_length=256, do_sample=True, num_beams=5, temperature=1.0)
print("Beam search with multinomial sampling and temperature = 1:\n")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(30*'-')

# output beam search multinomial sampling num_beams = 5, T = 2.5
outputs = model.generate(**inputs, max_length=256, do_sample=True, num_beams=5, temperature=2.5)
print("Beam search with multinomial sampling and temperature = 2.5:\n")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(30*'-')

This notebook is based on the [causal language modeling tutorial on Hugging Face](https://huggingface.co/docs/transformers/tasks/language_modeling).

Useful resources:
- [Illustrated GPT2](https://jalammar.github.io/illustrated-gpt2/)
- [Generation strategies](https://huggingface.co/docs/transformers/generation_strategies)
- [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) chapters 7 and 8