# Fine-tune FLAN-T5 for text classification



## INTRO
If you already know T5, FLAN-T5 is a stronger version of the same architecture. For the same number of parameters, these models have been fine-tuned on more than 1,000 additional tasks, which also cover more languages.

## Quick intro: FLAN-T5, just a better T5

FLAN-T5, released with the [Scaling Instruction-Finetuned Language Models](https://arxiv.org/pdf/2210.11416.pdf) paper, is an enhanced version of T5 that has been finetuned on a mixture of tasks. The paper explores instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. It finds that, overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.


* Paper: https://arxiv.org/abs/2210.11416
* Official repo: https://github.com/google-research/t5x
--- 


## 1. Setup Development Environment
Our first step is to install the Hugging Face Libraries, including transformers and datasets. Running the following cell will install all the required packages. 

In [1]:
## system commands
# sudo apt-get install git-lfs

!pip install pytesseract transformers==4.28.1 datasets evaluate rouge-score nltk tensorboard py7zr matplotlib --quiet
!pip install scikit-learn --quiet
!pip install torch --index-url https://download.pytorch.org/whl/cu121 --quiet


This example uses the [Hugging Face Hub](https://huggingface.co/models) as a remote model versioning service. To push our model to the Hub, you need a [Hugging Face](https://huggingface.co/join) account.
If you already have an account, you can skip the registration.
Once you have an account, we use the `login` util from the `huggingface_hub` package to log in and store our token (access key) on disk.

In [2]:
import os
from huggingface_hub import login
# Read the access token from the HF_TOKEN environment variable rather than hardcoding it
login(token=os.environ["HF_TOKEN"])

### Import dependencies

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import os
import re
import glob
from datasets import load_dataset
import datasets

## 2. Loading the dataset


In [14]:
dataset_id = "hebashakeel/dataset_wellness"

To load the dataset, we use the `load_dataset()` method from the 🤗 Datasets library.

In [24]:
from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset(dataset_id)

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

Train dataset size: 990
Test dataset size: 213


In [25]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Text', 'Aspect'],
        num_rows: 990
    })
    validation: Dataset({
        features: ['Text', 'Aspect'],
        num_rows: 212
    })
    test: Dataset({
        features: ['Text', 'Aspect'],
        num_rows: 213
    })
})

Let's check out an example from the dataset.

In [26]:
dataset['test'][8]

{'Text': "I feel so frustrated and angry at my body that I can't do more. Every time I push just too far I end up in hospital and I don't want that. So many other people need that medical attention and I'm sick of being a burden!",
 'Aspect': 5}

To train our model, we need to convert our inputs (text) to token IDs. This is done by a 🤗 Transformers tokenizer. If you are not sure what this means, check out [chapter 6](https://huggingface.co/course/chapter6/1?fw=tf) of the Hugging Face Course.

In [27]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id="google/flan-t5-base"

# Load tokenizer of FLAN-t5-base
tokenizer = AutoTokenizer.from_pretrained(model_id)

Before we can start training, we need to preprocess our data. FLAN-T5 is a text2text-generation model: it takes text as input and generates text as output. To batch our data efficiently, we need to know how long our inputs and outputs will be.
- Because the model generates text, we need to convert the labels from integers to strings.

In [28]:
dataset['validation']

Dataset({
    features: ['Text', 'Aspect'],
    num_rows: 212
})

In [29]:
import pandas as pd
from datasets import Dataset
import random

dataset['train'] = dataset['train'].shuffle(seed=42)

train_df = pd.DataFrame(dataset['train'])
test_df = pd.DataFrame(dataset['test'])
val_df = pd.DataFrame(dataset['validation'])
dataset.clear()
train_df['Aspect'] = train_df['Aspect'].astype(str)
test_df['Aspect'] = test_df['Aspect'].astype(str)
val_df['Aspect'] = val_df['Aspect'].astype(str)

dataset['train'] = Dataset.from_pandas(train_df)
dataset['test'] = Dataset.from_pandas(test_df)
dataset['validation'] = Dataset.from_pandas(val_df)

In [30]:
# Check data types of columns
print(dataset['test'].features)

{'Text': Value(dtype='string', id=None), 'Aspect': Value(dtype='string', id=None)}
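The pandas round trip above changes the `Aspect` column from integers to strings. As a minimal sketch of the same idea on plain Python records (the two rows and the `label_to_str` helper are made up for illustration and are not part of the notebook), the conversion amounts to:

```python
# Toy records standing in for dataset rows; the real columns are 'Text' and 'Aspect'.
records = [
    {"Text": "I can't sleep at night.", "Aspect": 5},
    {"Text": "Work has been overwhelming.", "Aspect": 3},
]


def label_to_str(example):
    """Return a copy of the row with the integer label rendered as text."""
    return {**example, "Aspect": str(example["Aspect"])}


converted = [label_to_str(row) for row in records]
print(converted[0]["Aspect"], type(converted[0]["Aspect"]).__name__)
```

A function of this shape could probably be passed to `Dataset.map` instead of going through pandas, though that variant is not tested here.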


In [31]:
from datasets import concatenate_datasets

tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"], dataset["validation"]]).map(lambda x: tokenizer(x["Text"], truncation=True), batched=True, remove_columns=['Text', 'Aspect'])
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Max source length: {max_source_length}")

tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"], dataset["validation"]]).map(lambda x: tokenizer(x["Aspect"], truncation=True), batched=True, remove_columns=['Text', 'Aspect'])
max_target_length = max([len(x) for x in tokenized_targets["input_ids"]])
print(f"Max target length: {max_target_length}")

Max source length: 138
Max target length: 3
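
To make the length computation concrete, here is a toy version that swaps the real tokenizer for a whitespace split plus an end-of-sequence marker. The texts, labels, and `toy_tokenize` are illustrative assumptions; real sub-word counts from the FLAN-T5 tokenizer will differ.

```python
# Illustrative inputs and string labels, not taken from the dataset.
texts = [
    "I feel tired",
    "Every time I push too far I end up in hospital",
    "So frustrated",
]
labels = ["4", "5", "3"]


def toy_tokenize(text):
    # Whitespace split plus one end-of-sequence marker, loosely mimicking
    # how the T5 tokenizer appends </s> to every sequence.
    return text.split() + ["</s>"]


source_lengths = [len(toy_tokenize(t)) for t in texts]
target_lengths = [len(toy_tokenize(l)) for l in labels]

# The longest example sets the padding budget, as in the cell above.
max_source_length = max(source_lengths)
max_target_length = max(target_lengths)
print(max_source_length, max_target_length)
```

Taking the maximum guarantees that no example is truncated, at the cost of padding every shorter example up to that length.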


In [33]:
def preprocess_function(sample, padding="max_length"):
    # no task prefix is added here; the raw text is passed to the model as-is
    inputs = [item for item in sample["Text"]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["Aspect"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=['Text', 'Aspect'])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']
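
The label handling in `preprocess_function` can be illustrated without a tokenizer. The sketch below pads a label sequence to a fixed length and then replaces pad ids with `-100`, the value that PyTorch's cross-entropy loss ignores by default. The helper `pad_and_mask` and the token id `305` are hypothetical; `0` and `1` are the pad and end-of-sequence ids used by T5 tokenizers.

```python
PAD_TOKEN_ID = 0     # pad id used by T5 tokenizers
IGNORE_INDEX = -100  # label value skipped by the loss


def pad_and_mask(label_ids, max_length):
    """Truncate/pad labels to max_length, then mask the padding positions."""
    padded = label_ids[:max_length] + [PAD_TOKEN_ID] * max(0, max_length - len(label_ids))
    return [tok if tok != PAD_TOKEN_ID else IGNORE_INDEX for tok in padded]


# A two-token label (one content token plus </s>) padded to length 3.
print(pad_and_mask([305, 1], 3))
```

Without this masking, the loss would also reward the model for predicting padding tokens, which says nothing about classification quality.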


## 3. Fine-tune and evaluate FLAN-T5

After we have processed our dataset, we can start training our model. First, we need to load our [FLAN-T5](https://huggingface.co/models?search=flan-t5) from the Hugging Face Hub. This example uses an instance with an NVIDIA V100, so we fine-tune the `base` version of the model.


In [34]:
from transformers import AutoModelForSeq2SeqLM

# huggingface hub model id
model_id="google/flan-t5-base"

# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)




We want to evaluate our model during training. The `Trainer` supports this when given a `compute_metrics` function.

We are going to use the `evaluate` library to compute the macro-averaged F1 score.

In [35]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Metric
metric = evaluate.load("f1")

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



In [36]:
# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # newline-join sentences (a leftover from ROUGE-based summarization; harmless for short labels)
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

In [37]:
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, average='macro')
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result
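
For readers unfamiliar with macro averaging, here is a small pure-Python sketch of the quantity that `compute_metrics` reports (before it is scaled to a percentage). It is not the `evaluate` implementation; `macro_f1` and the toy labels are assumptions made for illustration.

```python
def macro_f1(references, predictions):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(references) | set(predictions))
    scores = []
    for label in labels:
        pairs = list(zip(references, predictions))
        tp = sum(r == label and p == label for r, p in pairs)
        fp = sum(r != label and p == label for r, p in pairs)
        fn = sum(r == label and p != label for r, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


refs = ["0", "1", "1", "2"]
preds = ["0", "1", "2", "2"]
print(round(macro_f1(refs, preds), 4))  # prints 0.7778
```

Because every class counts equally, a rare class that the model handles poorly pulls the macro score down as much as a frequent one would.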

Before we can start training, we need to create a `DataCollator` that takes care of padding our inputs and labels. We will use the `DataCollatorForSeq2Seq` from the 🤗 Transformers library.

In [38]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
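
As a rough illustration of what `pad_to_multiple_of=8` means, the sketch below pads a batch in plain Python rather than with the collator itself. `pad_batch` is a hypothetical helper and the token ids are arbitrary.

```python
def pad_batch(sequences, pad_id=0, multiple_of=8):
    """Right-pad every sequence to the batch maximum, rounded up to a multiple."""
    longest = max(len(seq) for seq in sequences)
    target = -(-longest // multiple_of) * multiple_of  # ceiling to the next multiple
    return [seq + [pad_id] * (target - len(seq)) for seq in sequences]


batch = pad_batch([[5, 6, 7], [1, 2, 3, 4, 5, 6, 7, 8, 9]])
print([len(seq) for seq in batch])  # prints [16, 16]
```

Since the preprocessing step in this notebook already pads inputs to `max_source_length`, the collator presumably only rounds those lengths up to the next multiple of 8; sequence lengths that are multiples of 8 tend to map better onto GPU tensor cores.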

The last step is to define the hyperparameters (`TrainingArguments`) we want to use for our training. We are leveraging the [Hugging Face Hub](https://huggingface.co/models) integration of the `Trainer` to automatically push our checkpoints, logs and metrics during training into a repository.

In [39]:
from huggingface_hub import HfFolder
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Hugging Face repository id
repository_id = f"{model_id.split('/')[1]}-wellness-text-classification"

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=repository_id,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    fp16=False, # Overflows with fp16
    learning_rate=3e-4,
    num_train_epochs=10,
    # logging & evaluation strategies
    logging_dir=f"{repository_id}/logs",
    logging_strategy="epoch",
    # logging_steps=1000,
    evaluation_strategy="no",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=False,
    # metric_for_best_model="overall_f1",
    # push to hub parameters
    report_to="tensorboard",
    push_to_hub=True,
    hub_strategy="every_save",
    hub_model_id=repository_id,
    hub_token=HfFolder.get_token(),
)

In [40]:
# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    compute_metrics=compute_metrics,
)


We can start our training by using the `train` method of the `Trainer`.

In [41]:
# Start training
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 62   | 0.9412 |
| 124  | 0.4664 |
| 186  | 0.2925 |
| 248  | 0.1966 |
| 310  | 0.1142 |
| 372  | 0.0804 |
| 434  | 0.0568 |
| 496  | 0.0398 |
| 558  | 0.0246 |
| 620  | 0.0214 |




TrainOutput(global_step=620, training_loss=0.22339189610173626, metrics={'train_runtime': 798.8262, 'train_samples_per_second': 12.393, 'train_steps_per_second': 0.776, 'total_flos': 1906621253222400.0, 'train_loss': 0.22339189610173626, 'epoch': 10.0})
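
The step counts in the log can be sanity-checked with a little arithmetic. The sketch below assumes the usual relationship between dataset size, per-device batch size, and number of devices; the helper `steps_per_epoch` is hypothetical, and the device count is an inference from the logs rather than something the notebook states.

```python
import math

train_samples = 990   # size of the training split
per_device_batch = 8  # per_device_train_batch_size
epochs = 10           # num_train_epochs


def steps_per_epoch(num_samples, batch_size, num_devices=1, grad_accum=1):
    """Optimizer steps needed to see every sample once."""
    return math.ceil(num_samples / (batch_size * num_devices * grad_accum))


for devices in (1, 2):
    per_epoch = steps_per_epoch(train_samples, per_device_batch, devices)
    print(devices, per_epoch, per_epoch * epochs)
# prints:
# 1 124 1240
# 2 62 620
```

The logged 62 steps per epoch and 620 total steps match an effective batch size of 16, for example two devices with a batch of 8 each.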

Nice, we have trained our model. 🎉 Let's evaluate the final model on the validation set.


In [42]:
trainer.evaluate()



{'eval_loss': 1.2828923463821411,
 'eval_f1': 65.7304,
 'eval_gen_len': 2.0754716981132075,
 'eval_runtime': 8.1994,
 'eval_samples_per_second': 25.855,
 'eval_steps_per_second': 1.707,
 'epoch': 10.0}

In [49]:
# Save our tokenizer and create model card
tokenizer.save_pretrained(repository_id)
trainer.create_model_card()
# Push the results to the hub
trainer.push_to_hub()

To https://huggingface.co/hebashakeel/flan-t5-base-wellness-text-classification
   80ceafc..0b5d9ae  main -> main



'https://huggingface.co/hebashakeel/flan-t5-base-wellness-text-classification/commit/0b5d9aef71ed3d27f2fe62051e33e0772b0c564b'

## 4. Run Inference and Classification Report

In [46]:
from tqdm.auto import tqdm

predictions_list = []
labels_list = []
model.eval()  # make sure dropout is disabled during generation
# Iterate over the test rows directly instead of re-reading whole columns on every step
for sample in tqdm(dataset['test']):
  # Tokenize one text and move the tensors to the same device as the model
  inputs = tokenizer(sample['Text'], truncation=True, max_length=512, return_tensors='pt').to(model.device)
  outputs = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=150, num_beams=4, early_stopping=True)
  prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
  predictions_list.append(prediction)
  labels_list.append(sample['Aspect'])


In [47]:
str_labels_list = []
for i in range(len(labels_list)):
    str_labels_list.append(str(labels_list[i]))
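
Because the model generates free text, nothing forces a prediction to be one of the known class labels. A small guard like the one below could map anything unexpected to a fallback class before scoring. `VALID_LABELS`, `normalize_prediction`, and the `"unknown"` fallback are assumptions for illustration; the label set is taken from the six classes that appear in this notebook's classification report.

```python
VALID_LABELS = {"0", "1", "2", "3", "4", "5"}  # the six classes in the report


def normalize_prediction(text, fallback="unknown"):
    """Strip whitespace and replace out-of-vocabulary outputs with a fallback."""
    cleaned = text.strip()
    return cleaned if cleaned in VALID_LABELS else fallback


print([normalize_prediction(p) for p in ["4", " 2 ", "anger"]])
# prints ['4', '2', 'unknown']
```

Counting how often the fallback is used is also a cheap way to spot a model that has not yet learned the output format.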

In [48]:
from sklearn.metrics import classification_report

report = classification_report(str_labels_list, predictions_list, zero_division=0)
print(report)

              precision    recall  f1-score   support

           0       0.70      0.37      0.48        19
           1       0.69      0.87      0.77        23
           2       0.42      0.48      0.45        27
           3       0.75      0.70      0.73        47
           4       0.73      0.84      0.78        64
           5       0.44      0.33      0.38        33

    accuracy                           0.65       213
   macro avg       0.62      0.60      0.60       213
weighted avg       0.64      0.65      0.64       213
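
The two average rows can be reproduced from the per-class rows. The sketch below recomputes them from the rounded F1 scores and supports printed in the report, so it only checks the arithmetic to two decimals; the dictionary layout is an illustrative choice.

```python
# label: (f1-score, support), copied from the report above
per_class = {
    "0": (0.48, 19),
    "1": (0.77, 23),
    "2": (0.45, 27),
    "3": (0.73, 47),
    "4": (0.78, 64),
    "5": (0.38, 33),
}

total = sum(support for _, support in per_class.values())
# Macro average: every class counts equally.
macro = sum(f1 for f1, _ in per_class.values()) / len(per_class)
# Weighted average: every class counts in proportion to its support.
weighted = sum(f1 * support for f1, support in per_class.values()) / total

print(f"macro={macro:.2f} weighted={weighted:.2f} support={total}")
# prints macro=0.60 weighted=0.64 support=213
```

The weighted figure is higher than the macro one because the largest classes (3 and 4) are also the ones the model classifies best.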

