# Fine-tuning example

A step-by-step guide to fine-tuning a FLAN-T5 small language model to identify NetApp products from a description, using the Transformers library and running quickly on a regular computer’s CPU.

FLAN-T5 is an enhanced version of the T5 (Text-to-Text Transfer Transformer) language model, developed by Google. It has been fine-tuned on a mixture of unsupervised and supervised tasks, making it a powerful encoder-decoder language model. FLAN-T5 can handle various language tasks, such as translation, coherence checking, sentence similarity assessment, and document summarization. You can use it for both personal and professional purposes. Imagine having an intelligent companion proficient in language understanding, assisting with various text-related tasks you encounter in your work. This model belongs to a class called small language models (SLMs). An SLM is a compact AI model that uses a smaller neural network, fewer parameters, and less training data. Compared with large language models (LLMs), SLMs can achieve impressive performance on focused tasks while requiring fewer resources and less computing power.

## Imports

- Hugging Face: This platform helps you access FLAN-T5 and makes it easy to download and use for fine-tuning.
- Transformers: This tool simplifies loading the pre-trained FLAN-T5 model and gives you helpful functions for fine-tuning it.
- Datasets: This is a collection of datasets ready to use, which is important for finding the right data to fine-tune FLAN-T5.
- SentencePiece: This tool helps with tokenization, especially for big and multilingual text data.
- Tokenizers: This library helps convert text into a format that works well for your specific task.
- Evaluate: You can use this library to check how well your fine-tuned model is performing by measuring various metrics.
- Rouge score: This is a specific metric used to see how good the text produced by FLAN-T5 is.
- NLTK: This tool is handy for getting your data ready before fine-tuning, like breaking it into smaller pieces.
- IProgress: A Python library that provides a text progress bar for long-running operations.

In [None]:
%pip install huggingface-hub==0.23.4
%pip install transformers[torch]==4.41.2
%pip install datasets==2.20.0
%pip install sentencepiece==0.2.0
%pip install tokenizers==0.19.1
%pip install evaluate==0.4.2
%pip install rouge-score==0.1.2
%pip install nltk==3.8.1

In [None]:
import nltk
import evaluate
import numpy as np
from datasets import Dataset
from transformers import T5Tokenizer, DataCollatorForSeq2Seq
from transformers import T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer
import tqdm as notebook_tqdm

To reduce log noise, the logging level is set to show only errors. For more information, please see: https://huggingface.co/transformers/v3.1.0/main_classes/logging.html.

In [None]:
import transformers
transformers.logging.set_verbosity_error()

## Setting up the model

Multiple sizes of FLAN-T5 are available on Hugging Face, from small to extra-large models, and the bigger the model, the more parameters it has.

- google/flan-t5-small: 80 million parameters, 300 MB memory required
- google/flan-t5-base: 250 million parameters, 990 MB memory required
- google/flan-t5-large: 780 million parameters, 1 GB memory required
- google/flan-t5-xl: 3 billion parameters, 12 GB memory required
- google/flan-t5-xxl: 11 billion parameters, 80 GB memory required

In [None]:
MODEL_NAME = "google/flan-t5-base"

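The next cell loads the tokenizer and the model with from_pretrained, which downloads them from Hugging Face on first use and caches them in the model/tokenizer and model/pretrained directories. Both the tokenizer and the model come from the same checkpoint, MODEL_NAME.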

With DataCollatorForSeq2Seq, a data collator is created that batches and pads the examples for the question-answering task: https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorForSeq2Seq


In [None]:
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME, cache_dir="model/tokenizer")
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME, cache_dir="model/pretrained")
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
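
To make the collator's role concrete, here is a minimal pure-Python sketch of the dynamic padding it performs on a batch. It is a simplified illustration, not the library's implementation: the function name pad_batch and the token ids are made up, and it assumes a pad token id of 0 (the T5 pad id) and a label padding value of -100, which the loss ignores.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad every example to the longest sequence in this batch (right padding)."""
    max_inputs = max(len(f["input_ids"]) for f in features)
    max_labels = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_inputs - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        n_label_pad = max_labels - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return batch


# Two examples of different lengths (made-up token ids)
features = [
    {"input_ids": [5, 6, 7, 1], "labels": [9, 1]},
    {"input_ids": [5, 1], "labels": [8, 9, 1]},
]
batch = pad_batch(features)
```

Padding per batch rather than to one global length keeps batches of short descriptions small, which matters when training on a CPU.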

## Inferencing before fine-tuning

Run an inference example to see how the base model performs before fine-tuning.

In [None]:
user_input = "Kubernetes management"
inputs = tokenizer(user_input, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Input: {user_input}")
print(f"Answer: {output}")

Notice that the language model is outputting slop (https://www.cnet.com/tech/services-and-software/the-new-ai-buzzword-is-slop-and-its-messing-with-you-what-to-watch-out-for and https://www.nytimes.com/2024/06/11/style/ai-search-slop.html).

It is a nonsensical response.  

## Preparing the data set

Next, load the CSV file with some sample data and create a dataset from it. The CSV has two columns: Input and Output.

In [None]:
import pandas as pd
path  = "netapp_products.csv" 
df = pd.read_csv(path, encoding="latin-1")
df = df.dropna()
dataset = Dataset.from_pandas(df)

Inspect the first 5 entries in the CSV:

In [None]:
df.head()

Split the dataset into two parts:
- 80% for the training set
- 20% for the test set

In [None]:
dataset = dataset.train_test_split(test_size=0.2)
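
For intuition, the split can be sketched in plain Python as shuffle, then slice. This is a minimal illustration rather than the datasets implementation; the helper name split_rows, the sample rows, and the fixed seed are assumptions added here so that the result is repeatable.

```python
import random


def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle the rows with a fixed seed, then slice off the test portion."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}


# Ten placeholder rows standing in for the CSV entries
rows = [f"row-{i}" for i in range(10)]
parts = split_rows(rows)
```

Because the split is random, the rows that land in the test set change from run to run unless a seed is fixed; train_test_split also accepts a seed argument for this purpose.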

In [None]:
print(f"Columns of the dataset: {list(dataset['train'].features)}")

# save datasets to disk for later easy loading
dataset["train"].save_to_disk("data/train")
dataset["test"].save_to_disk("data/eval")

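Inputs longer than MAX_INPUT_LENGTH and outputs longer than MAX_OUTPUT_LENGTH will be truncated during tokenization; shorter sequences are padded when the batches are built. Note that the lengths below are measured in characters, while the tokenizer limits count tokens, so these values are only a rough proxy.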

In [None]:
input_lengths = [len(x) for x in dataset["train"]["Input"]]
# Use the 85th percentile of the input lengths instead of the maximum,
# so that a few very long inputs do not inflate the limit
MAX_INPUT_LENGTH = int(np.percentile(input_lengths, 85))
print(f"Max input length: {MAX_INPUT_LENGTH}")

target_lengths = [len(x) for x in dataset["train"]["Output"]]
# Use the 90th percentile of the output lengths for the same reason
MAX_OUTPUT_LENGTH = int(np.percentile(target_lengths, 90))
print(f"Max output length: {MAX_OUTPUT_LENGTH}")
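
To see why a percentile is used instead of the maximum, consider a toy list of lengths with one outlier. The sketch below re-implements a linear-interpolation percentile in plain Python, which is the interpolation np.percentile uses by default. The function name percentile and the sample lengths are made up for illustration.

```python
def percentile(values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Ten typical lengths and one very long outlier (made-up numbers)
lengths = [18, 20, 21, 22, 24, 25, 27, 28, 30, 32, 140]
cutoff = int(percentile(lengths, 90))
longest = max(lengths)
```

Here the 90th percentile is 32 while the maximum is 140, so using the percentile truncates the single outlier instead of sizing every sequence for it.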

### Data formatting and tokenization

At inference time, the model is called with a prompt in this format:

“Match the NetApp product with the description: <USER_INPUT>”

where <USER_INPUT> is the description the user provides.

In [None]:
# Define the prefix
prefix = "Match the NetApp product with the description: "
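
Because the same prefix has to be applied both to the training data and to every prompt at inference time, it can help to wrap it in a small helper. This is only a sketch: the names PREFIX and build_prompt are hypothetical additions and are not used elsewhere in this notebook.

```python
PREFIX = "Match the NetApp product with the description: "


def build_prompt(user_input, prefix=PREFIX):
    """Prepend the task prefix to the user's description."""
    return prefix + user_input.strip()


prompt = build_prompt("Kubernetes management")
```

Calling the same helper in both places avoids the common mistake of training with the prefix but prompting without it.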

To make sure the model sees this format during training, we add the task prefix to each input sentence in the training data. This is done in a function called preprocess_function.

Along with formatting, this function also tokenizes the inputs and outputs using the tokenizer loaded earlier, and stores the tokenized outputs as the labels.

In [None]:
# Define the preprocessing function
def preprocess_function(examples):
    # Add prefix to the sentences
    inputs = [prefix + doc for doc in examples["Input"]]
    
    # Tokenize the input text
    model_inputs = tokenizer(inputs, 
                             max_length=MAX_INPUT_LENGTH, 
                             truncation=True)
    
    # Tokenize the output text and set labels
    labels = tokenizer(examples["Output"], 
                       max_length=MAX_OUTPUT_LENGTH, 
                       truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    
    return model_inputs


After we’ve defined the function, we apply it to the entire dataset using the dataset’s map method. With batched=True, the function receives many rows at a time, which helps us process all the data quickly and efficiently.

In [None]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)

## Evaluation Strategy

Before diving into the training process, it helps to decide which metrics will be used to evaluate the overall performance of the fine-tuning.

Good evaluation metrics are important in any deep learning and machine learning project to evaluate the performance of models, not only during training but also later in production.

Two of the most common metrics for evaluating a text generation model are BLEU and ROUGE. In this case, they are used to evaluate the quality of an answer by comparing it to a reference answer.

The focus of this tutorial is ROUGE, but this Wikipedia article provides more information about the BLEU score: https://en.wikipedia.org/wiki/BLEU



### What is ROUGE score?
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. Some key points about ROUGE for question answering:

- ROUGE-1 and ROUGE-2: Measure unigram and bigram overlap between the candidate and reference answers. They focus on recall of key words and short phrases.
- ROUGE-L: Measures the longest common subsequence between the candidate and reference answers, so it also rewards words that appear in the same order.
- ROUGE-SU4: Measures overlap of skip-bigrams (word pairs with up to four words between them) together with unigrams.
- Higher ROUGE scores generally indicate better performance for question answering. As a rough rule of thumb, scores close to or above 0.70 are considered strong, although this depends on the task.
- When using this metric, preprocessing such as stemming and removing stopwords can raise the scores by matching different forms of the same word.
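
The two scores used most often can be sketched in a few lines of plain Python. This is a simplified illustration of the F-measure form of ROUGE-1 and ROUGE-L, assuming lowercase whitespace tokenization; it is not the rouge-score package, which applies its own tokenization and optional stemming. The function names and the example sentences are made up.

```python
from collections import Counter


def f_measure(matches, n_candidate, n_reference):
    """Harmonic mean of precision and recall for a given match count."""
    if matches == 0:
        return 0.0
    precision = matches / n_candidate
    recall = matches / n_reference
    return 2 * precision * recall / (precision + recall)


def rouge_1(candidate, reference):
    """Unigram overlap, counting each word at most as often as it appears in both."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return f_measure(overlap, len(cand), len(ref))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(candidate, reference):
    """Longest-common-subsequence overlap, which is sensitive to word order."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return f_measure(lcs_length(cand, ref), len(cand), len(ref))


reference = "the cat sat on the mat"
candidate = "the mat the cat sat on"
score_1 = rouge_1(candidate, reference)
score_l = rouge_l(candidate, reference)
```

The candidate contains exactly the same words as the reference, so ROUGE-1 is 1.0, but only four of the six words appear in the same order, so ROUGE-L drops to about 0.67.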

In [None]:
nltk.download("punkt", quiet=True)
metric = evaluate.load("rouge")

The following helper function, compute_metrics, decodes the predictions and labels back into text and computes the ROUGE scores.

In [None]:
def compute_metrics(eval_preds):
    preds, labels = eval_preds

    # Replace -100 in labels with pad token id
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    
    # Decode preds and labels back into text
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Split decoded preds and labels into sentences
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    # Compute metrics
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    return result

## Training

To run the fine-tuning, we need to set some hyperparameters. The main ones are:

- Learning rate: controls how quickly the model learns from the data. Typical values are 1e-5 to 5e-5; for this use case, a higher value of 3e-4 is used
- Batch size: the number of samples processed before the model’s weights are updated. Larger batches can speed up the process, but they need more memory and can lead to poorer performance. We use 4 for this use case
- Per device train batch size: this is similar to batch size, but it is specified per device (for example, per GPU); it is the argument that is set in the training arguments
- Weight decay: a regularization term whose goal is to prevent the model from overfitting. 0.01 is an acceptable value for weight decay
- Save total limit: the maximum number of checkpoints to keep during the training. Keeping more checkpoints gives more points to roll back to but uses more disk space. This argument is not set in the training arguments of this example, so no limit is applied
- Number of epochs: the total number of passes through the training dataset. The more epochs, the longer the training time, but more epochs could also improve the model performance. Typically, a value from 3 to 10 is chosen, and 6 is used for this use case.
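
Batch size and number of epochs together determine how many weight updates the training performs. The sketch below shows the arithmetic, assuming a single device and no gradient accumulation. The function name training_steps and the figure of 400 training rows are made up; the batch size of 4 and the 6 epochs match the training cell in this notebook.

```python
import math


def training_steps(num_examples, batch_size, num_epochs):
    """Weight updates per epoch and in total (the last partial batch counts)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs


# 400 training rows is a hypothetical dataset size
per_epoch, total = training_steps(num_examples=400, batch_size=4, num_epochs=6)
```

With these numbers there are 100 updates per epoch and 600 in total; doubling the batch size would halve both, which is one reason larger batches finish sooner.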

For more information about the arguments, please see: https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments.

The above parameters are defined below and used to set up the model training arguments. The overall training artifacts are saved in the folder model/finetuned:

In [None]:
# Define global parameters
L_RATE = 3e-4
BATCH_SIZE = 4
WEIGHT_DECAY = 0.01
NUM_EPOCHS = 6
FINETUNEDMODEL = "model/finetuned"

# Set up training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir=FINETUNEDMODEL,
    eval_strategy="epoch",
    learning_rate=L_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    weight_decay=WEIGHT_DECAY,
    num_train_epochs=NUM_EPOCHS,
    predict_with_generate=True,
    push_to_hub=False,
    # Generated answers are outputs, so cap them with the output length limit
    generation_max_length=MAX_OUTPUT_LENGTH
)

Next, the trainer is set up to trigger the training process of the model.

In [None]:
trainer = Seq2SeqTrainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_dataset["train"],
   eval_dataset=tokenized_dataset["test"],
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics
)

Train and save the model.  This will take about 8 minutes.

In [None]:
trainer.train()
trainer.save_model(FINETUNEDMODEL)

## Inferencing after fine-tuning

Load the fine-tuned model and run a prediction on the same input as before.

In [None]:
finetuned_model = T5ForConditionalGeneration.from_pretrained(FINETUNEDMODEL)
tokenizer = T5Tokenizer.from_pretrained(FINETUNEDMODEL)

user_input = "Kubernetes management"
# Add the same task prefix that was used during training
inputs = tokenizer(prefix + user_input, return_tensors="pt")
outputs = finetuned_model.generate(**inputs, max_new_tokens=MAX_OUTPUT_LENGTH)
output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Input: {user_input}")
print(f"Answer: {output}")

Notice that the fine-tuned model should now be able to match the description with the correct NetApp product.