<a href="https://colab.research.google.com/github/Priscilla97/llm-rag-foundations/blob/main/02_fine_tuning/3_A_full_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A full training

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl

## Prepare model and dataset

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## Prepare for training
Before writing the training loop, we need to define a few objects. The first are the **dataloaders** we will use to iterate over batches.

Before defining them, we need to apply a bit of **postprocessing** to our **tokenized_datasets**, to take care of some things that the Trainer did for us automatically.

Specifically, we need to:

- Remove the columns corresponding to values the model does not expect (like the sentence1 and sentence2 columns).

- Rename the column label to labels (because the model expects the argument to be named labels).

- Set the format of the datasets so they return PyTorch tensors instead of lists.

Our tokenized_datasets has one method for each of those steps:

In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

The last line of that cell lets us check that the result only has columns that our model will accept:

["attention_mask", "input_ids", "labels", "token_type_ids"]

Now that this is done, we can easily define our **dataloaders**:

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

To quickly check there is no mistake in the data processing, we can inspect a **batch** like this:

In [None]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'attention_mask': torch.Size([8, 65]),
 'input_ids': torch.Size([8, 65]),
 'labels': torch.Size([8]),
 'token_type_ids': torch.Size([8, 65])}
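The second dimension (65 here) is not fixed: the collator pads each batch only up to its own longest sequence, so the number will usually differ from batch to batch. The sketch below imitates that behaviour in plain Python. The helper `pad_batch`, the toy token IDs, and the default `pad_token_id=0` are assumptions made for illustration, not the library's implementation.

```python
def pad_batch(sequences, pad_token_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy sequences of different lengths (hypothetical token IDs)
toy_batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps the tensors as small as possible, which is why the shapes above depend on which examples were drawn.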

We instantiate the **model** exactly as we did in the previous section:

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

To make sure that everything will go smoothly during training, we pass our batch to this model:

In [None]:
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(0.5441, grad_fn=<NllLossBackward>) torch.Size([8, 2])

All 🤗 Transformers models will return the **loss** when labels are provided, and we also get the **logits** (two for each input in our batch, so a tensor of size 8 x 2).
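To make the link between logits and loss concrete, here is a minimal sketch that computes a mean cross-entropy loss by hand. It assumes the model uses standard cross-entropy for this two-label classification task (which the `NllLossBackward` in the output above suggests). The function `cross_entropy` and the toy values are illustrative only.

```python
import math


def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch of logit rows and integer labels."""
    total = 0.0
    for row, label in zip(logits, labels):
        # log-sum-exp, subtracting the max for numerical stability
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[label]
    return total / len(labels)


# Toy batch of two examples with two logits each (hypothetical values)
toy_logits = [[0.0, 0.0], [2.0, -1.0]]
toy_labels = [1, 0]
toy_loss = cross_entropy(toy_logits, toy_labels)
print(toy_loss)
```

The first example has equal logits, so it contributes log(2) to the loss. The second is confidently correct and contributes very little.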

We're almost ready to write our training loop! We're just missing two things:

- an **optimizer** (e.g. AdamW)
- a **learning rate scheduler**.

The optimizer we will use is **AdamW**, which is the same as Adam, but with a twist for weight decay regularization.
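The "twist" is that the weight decay is decoupled from the gradient: the parameter is shrunk directly rather than having the decay term pass through Adam's adaptive scaling. Below is a single-parameter sketch of one update step, written in plain Python for illustration. The function `adamw_step` and its default hyperparameters are assumptions, not the PyTorch implementation.

```python
import math


def adamw_step(param, grad, m, v, step, lr=5e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (illustrative sketch)."""
    # Decoupled weight decay: shrink the parameter directly
    param = param * (1.0 - lr * weight_decay)
    # Standard Adam moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the early steps
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# With a zero gradient, only the weight decay moves the parameter
new_param, new_m, new_v = adamw_step(param=1.0, grad=0.0, m=0.0, v=0.0, step=1)
print(new_param)
```

With a zero gradient the Adam part of the update vanishes, so the only change comes from the decay factor `1 - lr * weight_decay`.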

**Modern Optimization Tips**: For even better performance, you can try:

- AdamW with weight decay: AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

- 8-bit Adam: Use bitsandbytes for memory-efficient optimization

- Different learning rates: Lower learning rates (1e-5 to 3e-5) often work better for large models

🚀 Optimization Resources: Learn more about optimizers and training strategies in the 🤗 Transformers optimization guide: https://huggingface.co/docs/transformers/main/en/performance#optimizer

In [None]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

The **learning rate scheduler** used by default is just a linear decay from the maximum value (5e-5) to 0.

To properly define it, we need to know the number of **training steps** we will take, which is the number of epochs we want to run (num_epochs) multiplied by the number of **training batches** (the length of our training dataloader).

The Trainer uses three epochs by default, so we will follow that:

In [None]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

1377
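The step count and the shape of the schedule can be checked by hand. The sketch below assumes the MRPC training split contains 3,668 examples (a figure not stated in this notebook) and re-implements a linear warmup-then-decay rule in plain Python. The function name `linear_lr` is made up for this example.

```python
import math

num_train_examples = 3668  # assumed size of the MRPC training split
batch_size = 8
num_epochs = 3
peak_lr = 5e-5

# The last batch of each epoch is partial, hence the ceiling
steps_per_epoch = math.ceil(num_train_examples / batch_size)
num_training_steps = num_epochs * steps_per_epoch


def linear_lr(step, num_warmup_steps=0):
    """Learning rate after `step` scheduler steps: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return peak_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return peak_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


print(steps_per_epoch, num_training_steps)
print(linear_lr(0), linear_lr(steps_per_epoch), linear_lr(num_training_steps))
```

With no warmup, the learning rate starts at its peak, has fallen to two thirds of it after the first epoch, and reaches 0 on the final step.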

## The training loop
One last thing: we will want to use the **GPU** if we have access to one (on a CPU, training might take several hours instead of a couple of minutes).

To do this, we **define a device** we will put our model and our batches on:

In [None]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

We are now ready to train! To get some sense of when training will be finished, we add a progress bar over our number of training steps, using the tqdm library:

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Modern Training Optimizations: To make your training loop even more efficient, consider:

- **Gradient Clipping**: Add *torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)* before optimizer.step()

- **Mixed Precision**: Use *torch.cuda.amp.autocast()* and *GradScaler* for faster training

- **Gradient Accumulation**: Accumulate gradients over multiple batches to simulate larger batch sizes

- **Checkpointing**: Save model checkpoints periodically to resume training if interrupted
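As a rough illustration of where two of these pieces fit in a loop, here is a sketch combining gradient accumulation and gradient clipping. It uses a tiny linear layer and random data as stand-ins (both hypothetical) so it runs without downloading anything. The names `toy_model`, `toy_batches`, and `accumulation_steps` are assumptions made for this example.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Tiny stand-in model and data (hypothetical; not the BERT model from this notebook)
toy_model = nn.Linear(4, 2)
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=5e-5)
loss_fn = nn.CrossEntropyLoss()
toy_batches = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(4)]

accumulation_steps = 2  # two batches of 8 behave like one batch of 16
max_norm = 1.0
optimizer_steps = 0

toy_model.train()
for step, (inputs, labels) in enumerate(toy_batches, start=1):
    loss = loss_fn(toy_model(inputs), labels)
    # Scale the loss so the accumulated gradient is an average, not a sum
    (loss / accumulation_steps).backward()

    if step % accumulation_steps == 0:
        # Clip the global gradient norm just before the optimizer update
        torch.nn.utils.clip_grad_norm_(toy_model.parameters(), max_norm=max_norm)
        toy_optimizer.step()
        toy_optimizer.zero_grad()
        optimizer_steps += 1

print(optimizer_steps)
```

In the real loop, a learning rate scheduler would also step only when the optimizer does, so the total number of training steps would need to be divided by `accumulation_steps`.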

🔧 Implementation Guide: For detailed examples of these optimizations, see the 🤗 Transformers efficient training guide (https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one) and the range of optimizers (https://huggingface.co/docs/transformers/main/en/optimizers).

## The evaluation loop
We've already seen the metric.compute() method, but metrics can actually accumulate batches for us as we go over the prediction loop with the method add_batch().

Once we have accumulated all the batches, we can get the final result with metric.compute().

Here's how to implement all of this in an evaluation loop:

In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.8431372549019608, 'f1': 0.8907849829351535}

Again, your results will be slightly different because of the randomness in the model head initialization and the data shuffling, but they should be in the same ballpark.
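To see what the accumulate-then-compute pattern amounts to, here is a plain-Python sketch of a metric object with the same two methods. The class `RunningBinaryMetric` is hypothetical and only mimics the interface. It is not how the evaluate library is implemented.

```python
class RunningBinaryMetric:
    """Collects predictions batch by batch, then computes accuracy and F1 at the end."""

    def __init__(self):
        self.predictions = []
        self.references = []

    def add_batch(self, predictions, references):
        # Store the batch; nothing is computed yet
        self.predictions.extend(predictions)
        self.references.extend(references)

    def compute(self):
        pairs = list(zip(self.predictions, self.references))
        correct = sum(p == r for p, r in pairs)
        tp = sum(p == 1 and r == 1 for p, r in pairs)
        fp = sum(p == 1 and r == 0 for p, r in pairs)
        fn = sum(p == 0 and r == 1 for p, r in pairs)
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        return {"accuracy": correct / len(pairs), "f1": f1}


toy_metric = RunningBinaryMetric()
toy_metric.add_batch(predictions=[1, 0, 1], references=[1, 0, 0])
toy_metric.add_batch(predictions=[0, 1], references=[1, 1])
toy_result = toy_metric.compute()
print(toy_result)
```

Computing F1 once over all accumulated examples matters: averaging per-batch F1 scores would generally give a different number.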

## Supercharge your training loop with 🤗 Accelerate

Using the 🤗 Accelerate library, with just a few adjustments we can enable distributed training on multiple GPUs or TPUs.

Starting from the creation of the training and validation dataloaders, here is what our manual training loop looks like:

In [None]:
from transformers import AutoModelForSequenceClassification, get_scheduler
from torch.optim import AdamW

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Here are the changes to make:

1) The **first** line to add is the import line.

2) The **second** line to add instantiates an **Accelerator** object that will look at the environment and initialize the proper distributed setup.

3) 🤗 Accelerate handles the device placement for you, so you can remove the lines that put the model on the device (or, if you prefer, change them to use accelerator.device instead of device).

4) The main bulk of the work is done in the line that sends the dataloaders, the model, and the optimizer to **accelerator.prepare()**.

5) The remaining changes are removing the line that puts the batch on the device (again, if you want to keep this you can just change it to use accelerator.device) and **replacing loss.backward() with accelerator.backward(loss)**.

In [None]:
# 1) import
from accelerate import Accelerator
from transformers import AutoModelForSequenceClassification, get_scheduler
from torch.optim import AdamW

# 2) accelerator
accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

# 3) remove device and model.to(device)

# 4) accelerator.prepare:
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        # 5) accelerator.backward
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Putting this in a train.py script will make that script runnable on any kind of distributed setup. To try it out in your distributed setup, run the command:

  accelerate config

which will prompt you to answer a few questions and dump your answers in a configuration file used by this command:

  accelerate launch train.py

which will launch the distributed training.

If you want to try this in a Notebook (for instance, to test it with TPUs on Colab), just paste the code in a training_function() and run a last cell with:

In [None]:
from accelerate import notebook_launcher

notebook_launcher(training_function)