<a href="https://colab.research.google.com/github/NicolaCortinovis/MLOPS_Project/blob/main/research/finetuning_transoformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FINETUNE THE FLAN-T5-SMALL MODEL
--------------------------------------------------------------------------------
Hugging Face page of the model: [google/flan-t5-small](https://huggingface.co/google/flan-t5-small)
--------------------------------------------------------------------------------

In this Colab notebook we fine-tune Flan-T5-small to perform the question and answer generation (QAG) task.

## PREPARE THE DATASET

In [1]:
! pip install transformers datasets
! pip install transformers[torch]
! pip install --upgrade tensorflow


In [2]:
from datasets import load_dataset
dataset = load_dataset("lmqg/qag_squad")

  from .autonotebook import tqdm as notebook_tqdm


In [3]:
dataset["train"]

Dataset({
    features: ['answers', 'questions', 'paragraph', 'questions_answers'],
    num_rows: 16462
})

In [4]:
dataset['validation']

Dataset({
    features: ['answers', 'questions', 'paragraph', 'questions_answers'],
    num_rows: 2067
})

In [5]:
dataset['test']

Dataset({
    features: ['answers', 'questions', 'paragraph', 'questions_answers'],
    num_rows: 2429
})

In [6]:
dataset['train'][0]

{'answers': ['4 Minutes',
  'Elvis Presley',
  'thirteenth',
  'Sticky & Sweet Tour',
  '$280 million,'],
 'questions': ["Which single was released as the album's lead single?",
  'Madonna surpassed which artist with the most top-ten hits?',
  "4 minutes became Madonna's which number one single in the UK?",
  'What is the name of the first tour with Live Nation?',
  'How much did Stick and Sweet Tour grossed?'],
 'paragraph': '"4 Minutes" was released as the album\'s lead single and peaked at number three on the Billboard Hot 100. It was Madonna\'s 37th top-ten hit on the chart—it pushed Madonna past Elvis Presley as the artist with the most top-ten hits. In the UK she retained her record for the most number-one singles for a female artist; "4 Minutes" becoming her thirteenth. At the 23rd Japan Gold Disc Awards, Madonna received her fifth Artist of the Year trophy from Recording Industry Association of Japan, the most for any artist. To further promote the album, Madonna embarked on th

Now we need a tokenizer to process the text, together with a padding and truncation strategy to handle variable sequence lengths.
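To make the padding and truncation strategy concrete, here is a minimal pure-Python sketch on toy token ids. The helper `pad_or_truncate` and the pad id `0` are assumptions for illustration only; the real tokenizer performs the equivalent work internally (and also handles special tokens, which this sketch ignores).

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), each with exactly max_length entries."""
    ids = list(token_ids[:max_length])   # truncation: drop tokens past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    missing = max_length - len(ids)
    ids += [pad_id] * missing            # padding: fill up to max_length
    mask += [0] * missing                # 0 marks a padding position
    return ids, mask


# A sequence shorter than max_length is padded, a longer one is cut.
short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

After this step every sequence in a batch has the same length, and the attention mask tells the model which positions to ignore.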

In [7]:
from transformers import AutoTokenizer

In [8]:
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

tokenizer_config.json: 100%|██████████| 2.54k/2.54k [00:00<00:00, 780kB/s]
spiece.model: 100%|██████████| 792k/792k [00:00<00:00, 2.91MB/s]
tokenizer.json: 100%|██████████| 2.42M/2.42M [00:00<00:00, 2.62MB/s]
special_tokens_map.json: 100%|██████████| 2.20k/2.20k [00:00<00:00, 3.26MB/s]


See this [guide](https://www.philschmid.de/fine-tune-flan-t5#2-load-and-prepare-samsum-dataset) for more information about this preprocessing step.

In [9]:
from datasets import concatenate_datasets

In [10]:
all_splits = concatenate_datasets([dataset["train"], dataset["validation"], dataset["test"]])
columns = ["answers", "questions", "paragraph", "questions_answers"]

# Longest tokenized paragraph (truncation=True caps lengths at the tokenizer's maximum)
tokenized_inputs = all_splits.map(lambda x: tokenizer(x["paragraph"], truncation=True), batched=True, remove_columns=columns)
max_source_length = max(len(x) for x in tokenized_inputs["input_ids"])
print(f"Max source length: {max_source_length}")

# Longest tokenized target, measured the same way
tokenized_targets = all_splits.map(lambda x: tokenizer(x["questions_answers"], truncation=True), batched=True, remove_columns=columns)
max_target_length = max(len(x) for x in tokenized_targets["input_ids"])
print(f"Max target length: {max_target_length}")

Map:   0%|          | 0/20958 [00:00<?, ? examples/s]

Map: 100%|██████████| 20958/20958 [00:03<00:00, 6921.76 examples/s]


Max source length: 512


Map: 100%|██████████| 20958/20958 [00:01<00:00, 11123.64 examples/s]


Max target length: 512
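Both maxima come out at 512, which is consistent with truncation capping every sequence at the tokenizer's limit. If shorter sequences are wanted to save memory, one option is to choose a length that covers most examples rather than the longest one. The sketch below shows the idea on made-up lengths; `percentile_length` and `toy_lengths` are hypothetical names, and in the notebook the lengths would have to come from tokenizing without truncation.

```python
def percentile_length(lengths, pct):
    """Smallest length covering at least pct percent of sequences (nearest-rank)."""
    ordered = sorted(lengths)
    # Integer ceiling of pct * n / 100, clamped to at least 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Made-up token counts for ten paragraphs; one of them is an outlier.
toy_lengths = [120, 95, 210, 180, 640, 150, 130, 175, 160, 140]
p90 = percentile_length(toy_lengths, 90)
p100 = percentile_length(toy_lengths, 100)
print(p90, p100)
```

Choosing the 90th percentile here would truncate only the single outlier while cutting the padded length by roughly two thirds.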


In [11]:
def preprocess_function(sample,padding="max_length"):
    # add prefix to the input for t5
    inputs = ["Generate question and answer: " + item for item in sample["paragraph"]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["questions_answers"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
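The reason for replacing pad tokens with `-100` is that PyTorch's cross-entropy loss skips targets equal to `-100` by default, so padded positions do not contribute to the loss. The following pure-Python sketch imitates that behaviour on toy numbers; `mask_padding`, `masked_mean_nll`, the pad id `0`, and the probabilities are all illustrative assumptions, not part of the notebook.

```python
import math

IGNORE_INDEX = -100


def mask_padding(label_ids, pad_id):
    """Replace every pad token in the labels with IGNORE_INDEX."""
    return [tok if tok != pad_id else IGNORE_INDEX for tok in label_ids]


def masked_mean_nll(token_log_probs, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    kept = [-lp for lp, label in zip(token_log_probs, labels) if label != ignore_index]
    return sum(kept) / len(kept)


# Three real target tokens followed by two pad tokens (pad id 0 assumed).
labels = mask_padding([42, 17, 1, 0, 0], pad_id=0)
# Log-probability the model assigned to the correct token at each position.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.01), math.log(0.01)]
loss = masked_mean_nll(log_probs, labels)
print(labels, round(loss, 4))
```

The two poorly predicted padding positions are excluded, so the loss reflects only the three real tokens.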




In [12]:
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["paragraph", "questions_answers", "answers","questions"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

Map: 100%|██████████| 16462/16462 [00:12<00:00, 1371.59 examples/s]
Map: 100%|██████████| 2067/2067 [00:01<00:00, 1427.24 examples/s]
Map: 100%|██████████| 2429/2429 [00:01<00:00, 1479.75 examples/s]

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']





## FINETUNING AND EVALUATION



Let's load the pretrained model.

In [13]:
from transformers import AutoModelForSeq2SeqLM

In [14]:
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

config.json: 100%|██████████| 1.40k/1.40k [00:00<00:00, 459kB/s]
model.safetensors: 100%|██████████| 308M/308M [01:37<00:00, 3.17MB/s] 
generation_config.json: 100%|██████████| 147/147 [00:00<00:00, 39.2kB/s]


In [15]:
!pip install evaluate

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Installing collected packages: responses, evaluate
Successfully installed evaluate-0.4.1 responses-0.18.0


In [16]:
!pip install bert_score


Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Installing collected packages: bert_score
Successfully installed bert_score-0.3.13


Now we need an evaluation metric. I chose BERTScore (the `bert_score` package) for this example. Try to find a better one!
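As one possible point of comparison when looking for other metrics, a cheap lexical baseline is token-overlap F1, similar in spirit to the F1 used for extractive question answering. This is only a sketch: `token_f1` is a hypothetical helper, it splits on whitespace and lowercases, and it is not claimed to be better than BERTScore for this task.

```python
from collections import Counter


def token_f1(prediction, reference):
    """F1 over whitespace tokens shared by prediction and reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match, one empty as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


exact = token_f1("who wrote hamlet", "Who wrote Hamlet")
partial = token_f1("who wrote the play", "who wrote hamlet")
disjoint = token_f1("madonna", "elvis presley")
print(exact, round(partial, 4), disjoint)
```

A lexical score like this penalizes valid paraphrases, which is one reason an embedding-based metric can be attractive for generated questions.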

In [17]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Metric
metric = evaluate.load("bertscore")

# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # Put each sentence on its own line (a convention carried over from ROUGE-Lsum; BERTScore does not require it)
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

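    # BERTScore needs a language (or model_type) and returns per-example lists,
    # so average precision, recall and F1 before scaling them to percentages.
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, lang="en")
    result = {k: round(float(np.mean(result[k])) * 100, 4) for k in ("precision", "recall", "f1")}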
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result


2024-01-27 23:03:23.569767: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-01-27 23:03:23.611359: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[nltk_data] Downloading package punkt to /home/nicola/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Downloading builder script: 100%|██████████| 7.95k/7.95k [00:00<00:00, 22.5MB/s]

In [18]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)

Here we set the training hyperparameters.

In [19]:
from transformers import TrainingArguments

In [52]:
training_args = TrainingArguments(output_dir = "test_trainer", evaluation_strategy= "epoch",per_device_train_batch_size=4, per_device_eval_batch_size=4, num_train_epochs=10, weight_decay=0.01, save_total_limit=1, load_best_model_at_end=True, save_strategy="epoch")
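To get a feel for what these hyperparameters imply, the number of optimizer steps can be estimated from the training set size (16462 rows), the per-device batch size (4) and the number of epochs (10). The sketch below is an approximation that assumes a single device and no gradient accumulation; `total_optimizer_steps` is a hypothetical helper, not a `transformers` function.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size, num_epochs,
                          num_devices=1, gradient_accumulation_steps=1):
    """Approximate (steps per epoch, total steps) for a training run."""
    effective_batch = per_device_batch_size * num_devices * gradient_accumulation_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch, steps_per_epoch * num_epochs


# Values taken from the dataset summary and the TrainingArguments call.
steps_per_epoch, total_steps = total_optimizer_steps(16462, 4, 10)
print(steps_per_epoch, total_steps)
```

With evaluation and checkpointing set to run every epoch, this corresponds to ten evaluation passes over the validation split.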

Finally, we create the Trainer object.

In [53]:
from transformers import Trainer

In [54]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_dataset["train"],
    eval_dataset = tokenized_dataset["validation"],
    compute_metrics = compute_metrics,

)
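Note that `compute_metrics` decodes token ids, whereas a plain `Trainer` hands it model logits at evaluation time, and the `data_collator` defined earlier is not passed in. One possible alternative, sketched here as an assumption and not as what this notebook ran, is to use `Seq2SeqTrainer` with `predict_with_generate=True`. The dictionary below only collects the keyword arguments; the keyword `evaluation_strategy` follows the notebook's own call and may need adjusting for the installed `transformers` version.

```python
# Keyword arguments for transformers.Seq2SeqTrainingArguments (a sketch, not the notebook's run).
seq2seq_args = dict(
    output_dir="test_trainer",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    weight_decay=0.01,
    save_total_limit=1,
    load_best_model_at_end=True,
    predict_with_generate=True,  # evaluate on generated token ids instead of logits
)

# Usage inside the notebook (requires transformers and the objects defined above):
# from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
# training_args = Seq2SeqTrainingArguments(**seq2seq_args)
# trainer = Seq2SeqTrainer(
#     model=model,
#     args=training_args,
#     train_dataset=tokenized_dataset["train"],
#     eval_dataset=tokenized_dataset["validation"],
#     data_collator=data_collator,
#     compute_metrics=compute_metrics,
# )
print(sorted(seq2seq_args))
```

The values other than `predict_with_generate` are copied from the original `TrainingArguments` call so that the two setups stay comparable.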

In [55]:
import torch

In [35]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [56]:
trainer.train()

Epoch,Training Loss,Validation Loss


OutOfMemoryError: CUDA out of memory. Tried to allocate 9.31 GiB. GPU 0 has a total capacty of 8.00 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 10.93 GiB is allocated by PyTorch, and 2.52 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
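The log does not say which tensor triggered the failed 9.31 GiB allocation, but a back-of-the-envelope estimate shows how quickly sequence-to-sequence tensors grow at a length of 512. The sketch below assumes float32 values and a vocabulary size of 32128 (an assumption about the model's configuration); `tensor_gib` is a hypothetical helper.

```python
def tensor_gib(*shape, bytes_per_value=4):
    """Size in GiB of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_value / 2**30


VOCAB_SIZE = 32128  # assumption: vocabulary size of the model

# Output logits for one batch of 4 sequences padded to 512 target tokens.
one_batch = tensor_gib(4, 512, VOCAB_SIZE)
# Logits for the whole validation split (2067 rows) if they were all kept at once.
full_validation = tensor_gib(2067, 512, VOCAB_SIZE)
print(round(one_batch, 3), round(full_validation, 1))
```

Under these assumptions, keeping logits for the full validation split would far exceed an 8 GiB GPU, so shorter maximum lengths, smaller batches, evaluating on generated ids, or setting `eval_accumulation_steps` are plausible things to try.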