# Fine-tuning LLMs (emotion dataset + google/flan-t5-base LLM)

In this section, we demonstrate how to fine-tune LLMs. Note that you will need to use a GPU for this section. You can do so by clicking "Runtime -> Change runtime type" and selecting a GPU.

Let's install the necessary libraries:

In [1]:
! pip install transformers[torch] comet-ml comet-llm datasets evaluate sentencepiece --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-gpu 2.9.1 requires flatbuffers<2,>=1.12, but you have flatbuffers 23.3.3 which is incompatible.
tensorflow-gpu 2.9.1 requires keras<2.10.0,>=2.9.0rc0, but you have keras 2.11.0 which is incompatible.
tensorflow-gpu 2.9.1 requires tensorboard<2.10,>=2.9, but you have tensorboard 2.11.2 which is incompatible.
tensorflow-gpu 2.9.1 requires tensorflow-estimator<2.10.0,>=2.9.0rc0, but you have tensorflow-estimator 2.11.0 which is incompatible.

In [2]:
# comet_ml is imported first, as Comet recommends for automatic logging
import comet_ml
import comet_llm

import json
import os
import pickle

import numpy as np
import pandas as pd
import torch

import evaluate
import transformers
from datasets import Dataset, DatasetDict, Features, Value, load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# fix the random seed for reproducibility
transformers.set_seed(35)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


### Dataset Preparation

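The code below loads the training and validation data from JSONL files, splits the validation data in half to create separate validation and test sets, and converts everything into a `DatasetDict`. It also samples 50 examples per split to keep the run short; set `sample = False` to use the full data, or change the sample size to run different experiments. More samples typically lead to a better-performing model.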

In [3]:
# loads the data from the jsonl files
emotion_dataset_train = pd.read_json(path_or_buf="https://raw.githubusercontent.com/comet-ml/comet-llmops/main/data/merged_training_sample_prepared_train.jsonl", lines=True)
emotion_dataset_val_temp = pd.read_json(path_or_buf="https://raw.githubusercontent.com/comet-ml/comet-llmops/main/data/merged_training_sample_prepared_valid.jsonl", lines=True)

# the first half of emotion_dataset_val_temp becomes the validation set
half = len(emotion_dataset_val_temp) // 2
emotion_dataset_val = emotion_dataset_val_temp.iloc[:half]

# the second half of emotion_dataset_val_temp becomes the test set
emotion_dataset_test = emotion_dataset_val_temp.iloc[half:]

# set to False to use the full dataset instead of 50-example samples
sample = True

if sample:
    final_ds = DatasetDict({
        "train": Dataset.from_pandas(emotion_dataset_train.sample(50)),
        "validation": Dataset.from_pandas(emotion_dataset_val.sample(50)),
        "test": Dataset.from_pandas(emotion_dataset_test.sample(50))
    })
else:
    final_ds = DatasetDict({
        "train": Dataset.from_pandas(emotion_dataset_train),
        "validation": Dataset.from_pandas(emotion_dataset_val),
        "test": Dataset.from_pandas(emotion_dataset_test)
    })
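Without an explicit `random_state`, the rows that `DataFrame.sample` picks depend on the global NumPy random state at the time the cell runs. The sketch below shows one way to make the split and the sampling repeatable regardless of what ran before. The toy DataFrame and the helper name `split_and_sample` are illustrative assumptions, and the seed simply reuses the value passed to `transformers.set_seed`.

```python
import pandas as pd


def split_and_sample(df, n, seed=35):
    """Split df into two halves and draw up to n rows from each, reproducibly."""
    half = len(df) // 2
    first, second = df.iloc[:half], df.iloc[half:]
    # never request more rows than a half actually contains
    first_sample = first.sample(min(n, len(first)), random_state=seed)
    second_sample = second.sample(min(n, len(second)), random_state=seed)
    return first_sample, second_sample


# toy data in the same prompt/completion layout as the emotion dataset
toy = pd.DataFrame({
    "prompt": [f"text {i}\n\n###\n\n" for i in range(10)],
    "completion": ["joy\n"] * 10,
})

val_a, test_a = split_and_sample(toy, n=3)
val_b, test_b = split_and_sample(toy, n=3)

# the same seed selects the same rows on every call
print(val_a.index.tolist() == val_b.index.tolist())
```

Because the two halves are taken with `iloc` before sampling, the validation and test samples cannot share rows.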

In [4]:
emotion_dataset_val

Unnamed: 0,prompt,completion
0,i feel it has only been agitated by the presen...,fear\n
1,i thought as i can often feel the rather unple...,sadness\n
2,i can t hear her with all the other kids and m...,fear\n
3,i am sure i will feel this longing again when ...,love\n
4,i had been having sexual feelings and romantic...,love\n
5,im better but i feel like im not resolved\n\n#...,joy\n
6,i spend a lot of my time here picking out how ...,sadness\n
7,i stood up on the scales only to feel stunned\...,surprise\n
8,i feel so positive all the time\n\n###\n\n,joy\n
9,i feel really lucky to be in this position to ...,joy\n


In [5]:
emotion_dataset_train

Unnamed: 0,prompt,completion
0,i also volunteered that if we were to marry th...,joy\n
1,i always feel a bit awkward doing this kind of...,sadness\n
2,i feel like this could be a long term romantic...,love\n
3,i couldnt help feeling a little dismayed as th...,sadness\n
4,i never feel your tender kiss again span style...,love\n
...,...,...
475,i sort of stood there feeling a bit dazed by w...,surprise\n
476,i know in my heart even when i m feeling bitte...,anger\n
477,i feel liked one touch on the right spot will ...,love\n
478,im feeling quite angry today\n\n###\n\n,anger\n


In [6]:
emotion_dataset_test

Unnamed: 0,prompt,completion
60,i feel very very disturbed right now i dont kn...,sadness\n
61,i feel make them the most dangerous and their ...,anger\n
62,i can feel sympathetic joy for my boyfriend an...,love\n
63,i found these emails from scott dale and just ...,fear\n
64,i won t lie and say there isn t a part of me t...,anger\n
65,i could feel that nothing awful was going to h...,sadness\n
66,i feel insulted and manipulated and though i h...,anger\n
67,i the only one to feel awkward like this i won...,sadness\n
68,i feel out of place where at any moment someon...,sadness\n
69,i am feeling really rebellious as i have the h...,anger\n


In [7]:
emotion_dataset_val_temp

Unnamed: 0,prompt,completion
0,i feel it has only been agitated by the presen...,fear\n
1,i thought as i can often feel the rather unple...,sadness\n
2,i can t hear her with all the other kids and m...,fear\n
3,i am sure i will feel this longing again when ...,love\n
4,i had been having sexual feelings and romantic...,love\n
...,...,...
115,i feel bashful discussing it i m a closet game...,fear\n
116,i was feeling agitated and giddy all at the sa...,fear\n
117,i feel restless though and know if i close my ...,fear\n
118,i report my feelings on the ex a movie about w...,fear\n


### Tokenize Dataset

The code below loads the Hugging Face tokenizer that belongs to the model checkpoint and uses it to tokenize the datasets. The model expects token IDs rather than raw text, so this step is required.

In [8]:
# model checkpoint
model_checkpoint = "google/flan-t5-base"

# We'll create a tokenizer from model checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=False, cache_dir="E:\\VSCODE\\ollama_models")

# Sequences in a batch must have the same length, so a pad token is required.
# T5 already defines one; this line pads with the EOS token (id 1) instead.
tokenizer.pad_token = tokenizer.eos_token

# prefix
prefix_instruction = "Classify the provided piece of text into one of the following emotion labels.\n\nEmotion labels: ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']"

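# Tokenization function: concatenates the instruction, the text, and the target label
def tokenize_function(example):
    # str.strip treats its argument as a set of characters, so this removes any
    # leading/trailing "\n" and "#" characters, including the "\n\n###\n\n" separator
    merged = prefix_instruction + "\n\n" + "Text: " + example["prompt"].strip("\n\n###\n\n") + "\n\n" + "Emotion output:" + example["completion"].strip(" ").strip("\n")
    print(merged)  # debug output
    batch = tokenizer(merged, padding='max_length', truncation=True)
    print(batch)  # debug output
    # labels are a copy of the full input sequence (prompt plus answer), padding included
    batch["labels"] = batch["input_ids"].copy()
    print(batch["labels"])  # debug output
    return batch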

# Apply it on our dataset, and remove the text columns
tokenized_datasets = final_ds.map(tokenize_function, remove_columns=["prompt", "completion"])
print(tokenized_datasets)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Map:   0%|          | 0/50 [00:00<?, ? examples/s]

Classify the provided piece of text into one of the following emotion labels.

Emotion labels: ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']

Text: i dont know why most of my life ive been hurt i dont know why it continues to happen but i really am tired of it im tired of normal people having stomache problems when all i feel is my heart sinking and aching not to sound emo its actually true

Emotion output:sadness
{'input_ids': [4501, 4921, 8, 937, 1466, 13, 1499, 139, 80, 13, 8, 826, 13868, 11241, 5, 262, 7259, 11241, 10, 784, 31, 9, 9369, 31, 6, 3, 31, 89, 2741, 31, 6, 3, 31, 1927, 63, 31, 6, 3, 31, 5850, 15, 31, 6, 3, 31, 7, 9, 26, 655, 31, 6, 3, 31, 3042, 102, 7854, 31, 908, 5027, 10, 3, 23, 2483, 214, 572, 167, 13, 82, 280, 3, 757, 118, 4781, 3, 23, 2483, 214, 572, 34, 3256, 12, 1837, 68, 3, 23, 310, 183, 7718, 13, 34, 256, 7718, 13, 1389, 151, 578, 9883, 15, 982, 116, 66, 3, 23, 473, 19, 82, 842, 5067, 53, 11, 3, 12076, 59, 12, 1345, 3, 15, 51, 32, 165, 700, 1176, 262, 

Classify the provided piece of text into one of the following emotion labels.

Emotion labels: ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']

Text: i ran around town trying to find different things to use i couldnt help but feel a little amazed that this

Emotion output:surprise
{'input_ids': [4501, 4921, 8, 937, 1466, 13, 1499, 139, 80, 13, 8, 826, 13868, 11241, 5, 262, 7259, 11241, 10, 784, 31, 9, 9369, 31, 6, 3, 31, 89, 2741, 31, 6, 3, 31, 1927, 63, 31, 6, 3, 31, 5850, 15, 31, 6, 3, 31, 7, 9, 26, 655, 31, 6, 3, 31, 3042, 102, 7854, 31, 908, 5027, 10, 3, 23, 4037, 300, 1511, 1119, 12, 253, 315, 378, 12, 169, 3, 23, 2654, 17, 199, 68, 473, 3, 9, 385, 16579, 24, 48, 262, 7259, 3911, 10, 3042, 102, 7854, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

Classify the provided piece of text into one of the following emotion labels.

Emotion labels: ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']

Text: i was frustrated with my performance flabbergasted by my inability to create youth group topia and emotionally exhausted i cared deeply for the students and their families and wanted them to feel loved and know jesus

Emotion output:love
{'input_ids': [4501, 4921, 8, 937, 1466, 13, 1499, 139, 80, 13, 8, 826, 13868, 11241, 5, 262, 7259, 11241, 10, 784, 31, 9, 9369, 31, 6, 3, 31, 89, 2741, 31, 6, 3, 31, 1927, 63, 31, 6, 3, 31, 5850, 15, 31, 6, 3, 31, 7, 9, 26, 655, 31, 6, 3, 31, 3042, 102, 7854, 31, 908, 5027, 10, 3, 23, 47, 17144, 28, 82, 821, 5731, 115, 2235, 9, 6265, 57, 82, 16, 2020, 12, 482, 4192, 563, 420, 23, 9, 11, 17957, 21436, 3, 23, 124, 26, 7447, 21, 8, 481, 11, 70, 1791, 11, 1114, 135, 12, 473, 1858, 11, 214, 528, 7, 302, 262, 7259, 3911, 10, 5850, 15, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
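The `tokenize_function` above places the answer inside the input text and copies the input IDs into `labels`. A common alternative for encoder-decoder models such as T5 is to keep the two apart: the prompt goes to the encoder and only the emotion label serves as the target. The sketch below covers just the string handling, with hypothetical helper names (`remove_suffix`, `build_pair`); it is one possible layout, not the notebook's method. It removes the `\n\n###\n\n` separator as an exact suffix rather than relying on `str.strip`.

```python
PREFIX = (
    "Classify the provided piece of text into one of the following emotion labels.\n\n"
    "Emotion labels: ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']"
)
SEPARATOR = "\n\n###\n\n"


def remove_suffix(text, suffix):
    """Drop suffix only if text really ends with it (also works before Python 3.9)."""
    if suffix and text.endswith(suffix):
        return text[: -len(suffix)]
    return text


def build_pair(example):
    """Return (source, target) strings for one row of the dataset."""
    text = remove_suffix(example["prompt"], SEPARATOR)
    source = PREFIX + "\n\n" + "Text: " + text + "\n\n" + "Emotion output:"
    target = example["completion"].strip()
    return source, target


source, target = build_pair(
    {"prompt": "im feeling quite angry today\n\n###\n\n", "completion": "anger\n"}
)
print(source)
print(target)
```

With a Hugging Face tokenizer, such a pair could then be encoded with `tokenizer(source, text_target=target, truncation=True)`, which returns the target IDs under `labels`; check that your installed `transformers` version supports `text_target`.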

### Fine-tuning the Model

Once the datasets have been tokenized, it's time to fine-tune the model. We use the Hugging Face `Seq2SeqTrainer` to simplify the training code. The code below also initializes a Comet project so that the experiment results are tracked in Comet. Setting the `COMET_LOG_ASSETS` environment variable to `True` additionally stores the training artifacts in Comet.
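One detail worth checking before training is how padded positions enter the loss. With `padding='max_length'`, most of each sequence is padding, and the loss only skips positions whose label is `-100` (the default `ignore_index` of PyTorch's cross-entropy loss, which Hugging Face models rely on). The sketch below shows a hypothetical helper, `mask_padding`, that builds labels from the input IDs and the attention mask. The token IDs in the example are illustrative, and the helper is not part of the original notebook.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]


# toy example: three real tokens, the EOS token (id 1), then two padding slots
input_ids = [4501, 4921, 8, 1, 1, 1]
attention_mask = [1, 1, 1, 1, 0, 0]

labels = mask_padding(input_ids, attention_mask)
print(labels)  # [4501, 4921, 8, 1, -100, -100]
```

Using the attention mask instead of the pad token ID keeps the real EOS token in the labels even when the tokenizer pads with EOS.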

In [9]:
# initialize comet_ml
comet_ml.init(project_name="emotion-classification")

# load the pretrained encoder-decoder (seq2seq) model from the checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, cache_dir="E:\\VSCODE\\ollama_models").to(device)

# set this to log HF results and assets to Comet
os.environ["COMET_LOG_ASSETS"] = "True"

# HF Trainer
model_name = model_checkpoint.split("/")[-1]
training_args = Seq2SeqTrainingArguments(
    num_train_epochs=1,
    output_dir="./results",
    overwrite_output_dir=True,
    logging_steps=1,
    evaluation_strategy = "epoch",
    learning_rate=1e-4,
    weight_decay=0.01,
    save_total_limit=5,
    save_steps=7,
    auto_find_batch_size=True
)

# instantiate HF Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
)

# run trainer
trainer.train()

COMET INFO: Couldn't find a Git repository in 'e:\\VSCODE\\udacity\\llmadvanced' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.
COMET INFO: Experiment is live on comet.com https://www.comet.com/akshaykumarcp/emotion-classification/605d5cc72df8436b92b3eeaa66879be7



  0%|          | 0/7 [00:00<?, ?it/s]

  0%|          | 0/13 [00:01<?, ?it/s]

  0%|          | 0/25 [00:00<?, ?it/s]

{'loss': 2.9041, 'grad_norm': 51.546199798583984, 'learning_rate': 9.6e-05, 'epoch': 0.04}
{'loss': 0.4382, 'grad_norm': 4.410694122314453, 'learning_rate': 9.200000000000001e-05, 'epoch': 0.08}
{'loss': 0.3117, 'grad_norm': 1.7118523120880127, 'learning_rate': 8.800000000000001e-05, 'epoch': 0.12}
{'loss': 0.2049, 'grad_norm': 1.4652217626571655, 'learning_rate': 8.4e-05, 'epoch': 0.16}
{'loss': 0.1623, 'grad_norm': 1.449965000152588, 'learning_rate': 8e-05, 'epoch': 0.2}
{'loss': 0.1223, 'grad_norm': 1.2776374816894531, 'learning_rate': 7.6e-05, 'epoch': 0.24}
{'loss': 0.1005, 'grad_norm': 1.649294137954712, 'learning_rate': 7.2e-05, 'epoch': 0.28}
{'loss': 0.0598, 'grad_norm': 0.6387704014778137, 'learning_rate': 6.800000000000001e-05, 'epoch': 0.32}
{'loss': 0.0589, 'grad_norm': 0.7431287169456482, 'learning_rate': 6.400000000000001e-05, 'epoch': 0.36}
{'loss': 0.0389, 'grad_norm': 0.6015037894248962, 'learning_rate': 6e-05, 'epoch': 0.4}
{'loss': 0.03, 'grad_norm': 0.4914029240608

  0%|          | 0/7 [00:00<?, ?it/s]

{'eval_loss': 0.0005629315855912864, 'eval_runtime': 59.4724, 'eval_samples_per_second': 0.841, 'eval_steps_per_second': 0.118, 'epoch': 1.0}
{'train_runtime': 583.7356, 'train_samples_per_second': 0.086, 'train_steps_per_second': 0.043, 'train_loss': 0.18468489817343653, 'epoch': 1.0}


COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.com/akshaykumarcp/emotion-classification/605d5cc72df8436b92b3eeaa66879be7
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     epoch [27]                     : (0.04, 1.0)
COMET INFO:     eval/loss                      : 0.0005629315855912864
COMET INFO:     eval/runtime                   : 59.4724
COMET INFO:     eval/samples_per_second        : 0.841
COMET INFO:     eval/steps_per_second          : 0.118

TrainOutput(global_step=25, training_loss=0.18468489817343653, metrics={'train_runtime': 583.7356, 'train_samples_per_second': 0.086, 'train_steps_per_second': 0.043, 'train_loss': 0.18468489817343653, 'epoch': 1.0})
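The logged numbers can be cross-checked with a little arithmetic. The sketch below assumes the 50-example training sample, a linear learning-rate decay to zero without warmup (which is what the logged values suggest), and a final batch size of 2. The batch size is inferred from `global_step=25` rather than stated anywhere in the output, and the function names are illustrative.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of optimizer steps in one epoch (last partial batch included)."""
    return math.ceil(num_samples / batch_size)


def linear_lr(step, total_steps, base_lr=1e-4):
    """Learning rate after `step` optimizer steps under linear decay to zero."""
    return base_lr * (1 - step / total_steps)


# steps per epoch for a few candidate batch sizes
for batch_size in (8, 4, 2):
    print(batch_size, steps_per_epoch(50, batch_size))

# learning rates after the first three steps of a 25-step run
total = steps_per_epoch(50, 2)
print([round(linear_lr(step, total), 7) for step in (1, 2, 3)])
```

The step counts for batch sizes 8, 4, and 2 are 7, 13, and 25, which match the totals of the three progress bars printed before the loss logs. That pattern is consistent with `auto_find_batch_size=True` retrying with a smaller batch until one fits in memory, although the output does not say so explicitly. The computed learning rates line up with the first logged values (9.6e-05, 9.2e-05, 8.8e-05).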

The code below saves the fine-tuned model locally:

In [10]:
# save the model
trainer.save_model("./results")
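After saving, a simple sanity check is to compare generated labels with the reference completions on the test split. The notebook does not include this step, so the helper below is a hypothetical sketch. It assumes the model outputs have already been decoded into strings and only handles the comparison, normalizing whitespace and case because the reference completions end with a newline.

```python
LABELS = ["anger", "fear", "joy", "love", "sadness", "surprise"]


def normalize(text):
    """Strip surrounding whitespace and lowercase a label string."""
    return text.strip().lower()


def accuracy(predictions, references):
    """Fraction of predictions that match their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# made-up predictions, for illustration only
preds = ["joy", "Sadness ", "fear", "banana"]
refs = ["joy\n", "sadness\n", "love\n", "anger\n"]

print(accuracy(preds, refs))  # 0.5
# predictions outside the label set are worth inspecting separately
print([p for p in preds if normalize(p) not in LABELS])  # ['banana']
```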

---

### Register Model

The code below logs a saved checkpoint to an existing Comet experiment and registers it as a model in Comet.

In [11]:
# attach to the existing experiment
from comet_ml import ExistingExperiment

# replace the placeholder with your own Comet API key
COMET_API_KEY = "COMET_API_KEY"

# previous_experiment is the key of the training experiment (the last segment of its Comet URL)
experiment = ExistingExperiment(api_key=COMET_API_KEY, previous_experiment="097ab78e6e154f24b8090a1a7dd6abb8")
# log a saved checkpoint directory as a model, then register it
experiment.log_model("Emotion-T5-Base", "results/checkpoint-7")
experiment.register_model("Emotion-T5-Base")

COMET ERROR: The given API key COMET_API_KEY is invalid on www.comet.com, please check it against the dashboard. Your experiment will not be logged
Please double-check the directory path and the recursive parameter
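The checkpoint path passed to `log_model` is hard-coded, while the checkpoint directories that actually exist depend on `save_steps` and the total number of training steps (the Trainer names them `checkpoint-<step>`). The sketch below uses a hypothetical helper, `latest_checkpoint`, to look up the highest-numbered checkpoint under an output directory. It is demonstrated on a temporary directory rather than on the real `./results` folder.

```python
import os
import re
import tempfile


def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-N directory, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# demonstration on a throwaway directory with three fake checkpoints
with tempfile.TemporaryDirectory() as tmp:
    for step in (7, 14, 21):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    found = os.path.basename(latest_checkpoint(tmp))
    print(found)  # checkpoint-21

# an output directory without checkpoints yields None
with tempfile.TemporaryDirectory() as empty:
    nothing = latest_checkpoint(empty)
```

The returned path could be passed to `experiment.log_model` in place of the literal `"results/checkpoint-7"`. `transformers.trainer_utils.get_last_checkpoint` offers similar behavior if you prefer a library function.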


---

### Deploy Model

The code below downloads a specific version of the registered model into the environment you are deploying from.

In [12]:
from comet_ml import API

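api = API(api_key=COMET_API_KEY)
# replace the placeholder with your own Comet workspace name
COMET_WORKSPACE = "COMET_WORKSPACE"

# name of the registered model (it must match the name shown in the Comet model registry)
model_name = "emotion-flan-t5-base"

# get the Model object
model = api.get_model(workspace=COMET_WORKSPACE, model_name=model_name)

# download version 1.0.0 of the registered model into ./deploy
model.download("1.0.0", "./deploy", expand=True)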

InvalidAPIKey: ('COMET_API_KEY', 'https://www.comet.com')
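The `InvalidAPIKey` error above comes from the literal placeholder string being sent as the key. One way to avoid hard-coding credentials is to read them from environment variables. The sketch below assumes they are stored under `COMET_API_KEY` and `COMET_WORKSPACE`; the helper name `require_setting` is hypothetical, and the demonstration uses a plain dictionary with dummy values instead of the real environment.

```python
import os


def require_setting(name, env=None):
    """Return a non-empty setting, rejecting missing values and unfilled placeholders."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value or value == name:
        raise RuntimeError(f"Set {name} to a real value before running this cell.")
    return value


# dummy values for illustration; real values would come from os.environ
fake_env = {
    "COMET_API_KEY": "dummy-key-for-illustration",
    "COMET_WORKSPACE": "my-workspace",
}

api_key = require_setting("COMET_API_KEY", fake_env)
workspace = require_setting("COMET_WORKSPACE", fake_env)
print(workspace)
```

Failing early with a clear message is easier to debug than an authentication error raised from inside the Comet client.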