## Import Libraries

In [1]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

^C


In [None]:
!pip install tensorboard

In [2]:
!pip install transformers


In [3]:
!pip install datasets



In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:28:36_Pacific_Standard_Time_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0


In [3]:
import torch
import os
import pandas as pd
import transformers as tr
from datasets import load_dataset



In [4]:
if torch.cuda.is_available():
    print("GPU is available")
else:
    print("GPU is not available")

GPU is available


In [5]:
import tempfile

tmpdir = tempfile.TemporaryDirectory()
local_training_root = tmpdir.name

## 1- Data Preparation
The first step of the fine-tuning process is to identify a specific task and a supporting dataset. In this notebook, the task is classifying movie reviews: given a review as plain text, we want to determine whether it is positive or negative.

The [IMDB dataset](https://huggingface.co/datasets/imdb) can serve as the supporting dataset for this task. It conveniently provides training and testing splits with binary sentiment labels, as well as a split of unlabeled data.

In [6]:
imdb_ds = load_dataset("imdb")

## 2- Select pre-trained model
The next step of the fine-tuning process is to select a pre-trained model. We will use the [T5](https://huggingface.co/docs/transformers/model_doc/t5) [[paper]](https://arxiv.org/pdf/1910.10683.pdf) family of models for our fine-tuning purposes. The T5 models are text-to-text transformers that have been trained on a multi-task mixture of unsupervised and supervised tasks. They are well suited for tasks such as summarization, translation, text classification, question answering, and more.
The `t5-small` version of the T5 models has 60 million parameters. This slimmed-down version will be sufficient for our purposes.

In [7]:
model_checkpoint = "t5-small"

In [8]:
current_directory = os.getcwd()

cache_dir = os.path.join(current_directory, 'cache')

Hugging Face provides the [Auto*](https://huggingface.co/docs/transformers/model_doc/auto) suite of objects to conveniently instantiate the various components associated with a pre-trained model. Here, we use [AutoTokenizer](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer) to load the tokenizer that is associated with the `t5-small` model.

In [10]:
# load the tokenizer that was used for the t5-small model
tokenizer = tr.AutoTokenizer.from_pretrained(
    model_checkpoint, cache_dir=cache_dir
) # Use a pre-cached model

As mentioned above, the IMDB dataset is a binary sentiment dataset. Its labels are encoded as integers (-1 = unknown; 0 = negative; 1 = positive). To use this dataset with a text-to-text model like T5, each label needs to be represented as a string, so we simply translate each label id to its corresponding string value.
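
As a minimal sketch of this translation, the snippet below applies the label lookup to a small hand-written batch. The batch contents are illustrative assumptions, not rows from the IMDB dataset; only the lookup table mirrors the one used in this notebook.

```python
# Lookup from integer label ids to the strings T5 will be trained to generate.
label_lookup = {0: "negative", 1: "positive", -1: "unknown"}

# Hypothetical batch in the column-oriented layout that a batched map receives.
example_batch = {
    "text": [
        "A wonderful film.",
        "Two hours I will never get back.",
        "An unlabeled review.",
    ],
    "label": [1, 0, -1],
}

# Translate each label id to its string form.
target_labels = [label_lookup[label_id] for label_id in example_batch["label"]]
print(target_labels)  # ['positive', 'negative', 'unknown']
```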


In [11]:
def to_tokens(tokenizer: tr.models.t5.tokenization_t5_fast.T5TokenizerFast, label_map: dict) -> callable:
    """
    Given a `tokenizer` and a `label_map`, return the closure `apply()`.
    When mapped over a batched dataset, `apply()` tokenizes the text and the string
    labels and returns input ids, an attention mask, and label ids.
    """

    def apply(x) -> tr.tokenization_utils_base.BatchEncoding:
        """Create the batch encoding `token_res` from a batch `x` with "text" and "label" columns."""
        target_labels = [label_map[y] for y in x["label"]]
        token_res = tokenizer(
            x["text"],
            text_target=target_labels,
            return_tensors="pt",
            truncation=True,
            padding=True,
        )
        return token_res

    return apply

imdb_label_lookup = {0: "negative", 1: "positive", -1: "unknown"}


In [12]:
imdb_to_tokens = to_tokens(tokenizer, imdb_label_lookup)
tokenized_dataset = imdb_ds.map(
    imdb_to_tokens, batched=True, remove_columns=["text", "label"]
)

Map: 100%|██████████| 25000/25000 [00:17<00:00, 1412.02 examples/s]


## 3- Setup Training

The model training process is highly configurable. The [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) class exposes the configurable aspects of the process so that they can be customized. Here, we set up a training run that performs a single epoch with a batch size of 16 and uses `adamw_torch` as the optimizer.
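
To make the effect of these settings concrete, here is a rough sketch of the step-count arithmetic. It assumes 25,000 training examples (the size of the IMDB training split), a single device, and no gradient accumulation; under those assumptions the result matches the 1563 steps shown in the training progress output later in this notebook.

```python
import math

# Assumptions: 25,000 training examples, one device, no gradient accumulation.
num_train_examples = 25_000
per_device_train_batch_size = 16
num_train_epochs = 1

# The last batch of an epoch may be smaller, so round up.
steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(total_steps)  # 1563
```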


In [14]:
checkpoint_name = "test-trainer"
local_checkpoint_path = os.path.join(local_training_root, checkpoint_name)
training_args = tr.TrainingArguments(
    local_checkpoint_path,
    num_train_epochs=1,  # default number of epochs to train is 3
    per_device_train_batch_size=16,
    optim="adamw_torch",
    report_to=["tensorboard"],
)

The pre-trained `t5-small` model can be loaded using the [AutoModelForSeq2SeqLM](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSeq2SeqLM) class.


In [15]:
#load the pretrained model
model = tr.AutoModelForSeq2SeqLM.from_pretrained(
    model_checkpoint, cache_dir=cache_dir
) # Use a pre-cached model

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [18]:
# used to assist the trainer in batching the data
data_collator = tr.DataCollatorWithPadding(tokenizer=tokenizer)
trainer = tr.Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,  # pads the examples in each batch to a common length so they can be stacked into tensors
)
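
Conceptually, a padding collator takes examples of different lengths and pads them to the longest one in the batch. The pure-Python sketch below illustrates the idea only; it is not the library's implementation. The helper `pad_batch` is hypothetical, and the pad id of 0 is an assumption for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length token id lists to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids of different lengths.
batch = pad_batch([[5, 6, 7, 1], [8, 1]])
print(batch["input_ids"])       # [[5, 6, 7, 1], [8, 1, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```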

## 4- Train: Foundation model to Fine-tuned version of that model
Before starting the training process, let's turn on TensorBoard. This will allow us to monitor the training process as checkpoint logs are created.


In [19]:
tensorboard_display_dir = f"{local_checkpoint_path}/runs"

In [31]:
%load_ext tensorboard
%tensorboard --logdir '{tensorboard_display_dir}'

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6007 (pid 19732), started 0:21:15 ago. (Use '!kill 19732' to kill it.)

Start the fine-tuning process.

In [23]:
trainer.train()

# save model to the local checkpoint
trainer.save_model()
trainer.save_state()

 32%|███▏      | 500/1563 [03:01<06:03,  2.93it/s]

{'loss': 0.6184, 'grad_norm': 1.3956140279769897, 'learning_rate': 3.400511836212412e-05, 'epoch': 0.32}


 64%|██████▍   | 1000/1563 [05:57<03:16,  2.86it/s]

{'loss': 0.1405, 'grad_norm': 1.9732414484024048, 'learning_rate': 1.801023672424824e-05, 'epoch': 0.64}


 96%|█████████▌| 1500/1563 [08:56<00:22,  2.78it/s]

{'loss': 0.1304, 'grad_norm': 1.949489712715149, 'learning_rate': 2.015355086372361e-06, 'epoch': 0.96}


100%|██████████| 1563/1563 [09:25<00:00,  2.76it/s]


{'train_runtime': 565.2882, 'train_samples_per_second': 44.225, 'train_steps_per_second': 2.765, 'train_loss': 0.28974149231718505, 'epoch': 1.0}
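
The learning rates logged above are consistent with a linear decay to zero over the 1563 training steps. The sketch below reproduces them, assuming an initial learning rate of 5e-5 and no warmup; neither value is set explicitly in this notebook, so both are assumptions about the defaults in effect. The helper `linear_lr` is hypothetical.

```python
def linear_lr(step, total_steps=1563, initial_lr=5e-5):
    """Learning rate under linear decay to zero with no warmup (assumed schedule)."""
    return initial_lr * max(0.0, (total_steps - step) / total_steps)


# Compare with the 'learning_rate' values logged at steps 500, 1000, and 1500.
for step in (500, 1000, 1500):
    print(step, linear_lr(step))
```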


In [24]:
# save fine-tuned model
final_model_path = os.path.join(current_directory, "llm04_fine_tuning", checkpoint_name)
trainer.save_model(output_dir=final_model_path)

## 5- Predict

In [25]:
fine_tuned_model = tr.AutoModelForSeq2SeqLM.from_pretrained(final_model_path)

In [28]:
reviews = ["""
'Despicable Me' is a cute and funny movie, but the plot is predictable and the characters are not very well-developed. Overall, it's a good movie for kids, but adults might find it a bit boring.""",
""" 'The Batman' is a dark and gritty take on the Caped Crusader, starring Robert Pattinson as Bruce Wayne. The film is a well-made crime thriller with strong performances and visuals, but it may be too slow-paced and violent for some viewers.
""",
"""
The Phantom Menace is a visually stunning film with some great action sequences, but the plot is slow-paced and the dialogue is often wooden. It is a mixed bag that will appeal to some fans of the Star Wars franchise, but may disappoint others.
""",
"""
I'm not sure if The Matrix and the two sequels were meant to have a tigh consistency but I don't think they quite fit together. They seem to have a reasonably solid arc but the features from the first aren't in the second and third as much, instead the second and third focus more on CGI battles and more visuals. I like them but for different reasons, so if I'm supposed to rate the trilogy I'm not sure what to say.
""",
]
inputs = tokenizer(reviews, return_tensors="pt", truncation=True, padding=True)
pred = fine_tuned_model.generate(
    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
)



In [29]:
pdf = pd.DataFrame(
    zip(reviews, tokenizer.batch_decode(pred, skip_special_tokens=True)),
    columns=["review", "classification"],
)
display(pdf)

| | review | classification |
|---|---|---|
| 0 | \n'Despicable Me' is a cute and funny movie, b... | negative |
| 1 | 'The Batman' is a dark and gritty take on the... | positive |
| 2 | \nThe Phantom Menace is a visually stunning fi... | positive |
| 3 | \nI'm not sure if The Matrix and the two seque... | negative |
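
If integer labels are needed downstream, the generated strings can be mapped back to label ids. This is a minimal sketch: the helper `to_label_id` is hypothetical, and falling back to -1 for unrecognized text is an assumption, since a text-to-text model is not guaranteed to emit one of the expected strings.

```python
label_lookup = {0: "negative", 1: "positive", -1: "unknown"}

# Invert the lookup so generated strings map back to label ids.
string_to_label = {name: label_id for label_id, name in label_lookup.items()}


def to_label_id(generated_text, default=-1):
    """Return the label id for a generated string, or `default` if unrecognized."""
    return string_to_label.get(generated_text.strip().lower(), default)


# Classifications copied from the prediction output above.
decoded = ["negative", "positive", "positive", "negative"]
label_ids = [to_label_id(text) for text in decoded]
print(label_ids)  # [0, 1, 1, 0]
```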
