<a href="https://colab.research.google.com/github/NicolaCortinovis/MLOPS_Project/blob/main/research/finetuning_transoformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FINETUNE THE FLAN-T5-SMALL MODEL
--------------------------------------------------------------------------------
Hugging Face page of the model --> [link](https://huggingface.co/google/flan-t5-small)
--------------------------------------------------------------------------------

In this Colab notebook we fine-tune Flan-T5-small to perform the question and answer generation (QAG) task.

## PREPARE THE DATASET

In [None]:
! pip install transformers datasets
! pip install transformers[torch]
! pip install --upgrade tensorflow

In [None]:
from datasets import load_dataset
dataset = load_dataset("lmqg/qag_squad")

In [3]:
dataset["train"]

Dataset({
    features: ['answers', 'questions', 'paragraph', 'questions_answers'],
    num_rows: 16462
})

In [4]:
dataset['validation']

Dataset({
    features: ['answers', 'questions', 'paragraph', 'questions_answers'],
    num_rows: 2067
})

In [5]:
dataset['test']

Dataset({
    features: ['answers', 'questions', 'paragraph', 'questions_answers'],
    num_rows: 2429
})

In [6]:
dataset['train'][0]

{'answers': ['4 Minutes',
  'Elvis Presley',
  'thirteenth',
  'Sticky & Sweet Tour',
  '$280 million,'],
 'questions': ["Which single was released as the album's lead single?",
  'Madonna surpassed which artist with the most top-ten hits?',
  "4 minutes became Madonna's which number one single in the UK?",
  'What is the name of the first tour with Live Nation?',
  'How much did Stick and Sweet Tour grossed?'],
 'paragraph': '"4 Minutes" was released as the album\'s lead single and peaked at number three on the Billboard Hot 100. It was Madonna\'s 37th top-ten hit on the chart—it pushed Madonna past Elvis Presley as the artist with the most top-ten hits. In the UK she retained her record for the most number-one singles for a female artist; "4 Minutes" becoming her thirteenth. At the 23rd Japan Gold Disc Awards, Madonna received her fifth Artist of the Year trophy from Recording Industry Association of Japan, the most for any artist. To further promote the album, Madonna embarked on th

Now we need a tokenizer to process the text, with a padding and truncation strategy to handle variable sequence lengths.
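To make the padding and truncation strategy concrete, here is a minimal pure-Python sketch of what happens to a list of token ids. The helper `pad_or_truncate` and the token ids are hypothetical; a pad id of 0 is assumed (the value T5 tokenizers use), and special tokens such as the end-of-sequence marker are ignored for simplicity.

```python
from typing import List, Tuple

PAD_ID = 0  # assumed pad id for illustration


def pad_or_truncate(ids: List[int], max_length: int, pad_id: int = PAD_ID) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = ids[:max_length]            # truncation: drop tokens beyond max_length
    n_pad = max_length - len(kept)     # padding: number of slots still to fill
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([5, 6, 7, 8, 9, 10, 11], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model skip the padded positions, so short and long paragraphs can share one fixed-size batch.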

In [7]:
from transformers import AutoTokenizer

In [8]:
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

See this [link](https://www.philschmid.de/fine-tune-flan-t5#2-load-and-prepare-samsum-dataset) for more information about this preprocessing.

In [9]:
from datasets import concatenate_datasets

In [10]:
tokenized_inputs = concatenate_datasets([dataset["train"],dataset["validation"],dataset["test"]]).map(lambda x: tokenizer(x["paragraph"], truncation=True), batched=True, remove_columns=['answers', 'questions', 'paragraph', 'questions_answers'])
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Max source length: {max_source_length}")

tokenized_targets = concatenate_datasets([dataset["train"],dataset["validation"], dataset["test"]]).map(lambda x: tokenizer(x["questions_answers"], truncation=True), batched=True, remove_columns=['answers', 'questions', 'paragraph', 'questions_answers'])
max_target_length = max([len(x) for x in tokenized_targets["input_ids"]])
print(f"Max target length: {max_target_length}")

Max source length: 512


Max target length: 512
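Both maxima come out at exactly 512, which likely reflects the cap applied by `truncation=True` rather than the true longest sequence. The sketch below illustrates the effect with made-up token counts (`raw_lengths` is hypothetical, not taken from the dataset) and shows a nearest-rank percentile as one possible way to choose a shorter limit.

```python
import math
from typing import List


def nearest_rank_percentile(values: List[int], pct: float) -> int:
    """Smallest value such that at least pct percent of the values are <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


MODEL_MAX = 512  # cap assumed from the printed output above
raw_lengths = [120, 180, 240, 300, 310, 350, 400, 450, 700, 900]  # hypothetical token counts

truncated_lengths = [min(n, MODEL_MAX) for n in raw_lengths]
observed_max = max(truncated_lengths)  # what a max over truncated sequences reports
true_max = max(raw_lengths)            # what the data actually contains
p80 = nearest_rank_percentile(raw_lengths, 80)
print(observed_max, true_max, p80)
```

Measuring lengths without truncation and then picking a percentile is one way to trade a few clipped examples for less padding.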


In [11]:
def preprocess_function(sample,padding="max_length"):
    # add prefix to the input for t5
    inputs = ["Generate question and answer: " + item for item in sample["paragraph"]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["questions_answers"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
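The reason for the `-100` substitution is that the cross-entropy loss used during training skips positions whose label equals that value. A minimal sketch of the idea in pure Python follows; `mask_padding`, `mean_nll`, and the probabilities are hypothetical stand-ins, not the library's implementation.

```python
import math
from typing import List

IGNORE_INDEX = -100  # label value that the loss skips
PAD_ID = 0           # assumed pad id for illustration


def mask_padding(label_ids: List[int], pad_id: int = PAD_ID) -> List[int]:
    """Replace padding ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [tok if tok != pad_id else IGNORE_INDEX for tok in label_ids]


def mean_nll(token_probs: List[float], labels: List[int]) -> float:
    """Average negative log-likelihood over positions whose label is not ignored.

    token_probs[i] is the probability the model assigned to the correct token at position i.
    """
    kept = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)


labels = mask_padding([42, 17, 1, 0, 0])
probs = [0.5, 0.25, 0.5, 0.01, 0.01]  # the last two values sit on padding
loss = mean_nll(probs, labels)
print(labels, round(loss, 4))
```

Without the masking, the two low-probability padding positions would dominate the average and push the model toward predicting padding.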




In [12]:
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["paragraph", "questions_answers", "answers","questions"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']


## FINETUNING AND EVALUATION



Let's import the pretrained model

In [13]:
from transformers import AutoModelForSeq2SeqLM

In [14]:
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

In [15]:
!pip install evaluate



In [16]:
!pip install bert_score



Now we need an evaluation metric. I chose BERTScore (the `bert_score` package) for this example. Try to find a better one!

In [17]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Metric
metric = evaluate.load("bertscore")

# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # Put each sentence on its own line (a ROUGE-Lsum convention; BERTScore does not require it)
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

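    # BERTScore needs a language (or model_type) and returns one score per example,
    # so average precision, recall and F1 over the batch; `use_stemmer` is a ROUGE-only argument.
    scores = metric.compute(predictions=decoded_preds, references=decoded_labels, lang="en")
    result = {k: round(float(np.mean(scores[k])) * 100, 4) for k in ("precision", "recall", "f1")}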
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [18]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
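As a rough illustration of what a seq2seq collator with `pad_to_multiple_of=8` does, the sketch below pads a small batch to the next multiple of 8, using the pad id for inputs and `-100` for labels. The function `collate` and the token ids are hypothetical; the real collator also returns tensors and can prepare decoder inputs.

```python
from typing import Dict, List


def round_up(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple."""
    return ((n + multiple - 1) // multiple) * multiple


def collate(batch: List[Dict[str, List[int]]], pad_id: int = 0,
            label_pad_id: int = -100, multiple: int = 8) -> Dict[str, List[List[int]]]:
    """Pad inputs and labels to the batch's longest length, rounded up to `multiple`."""
    in_len = round_up(max(len(ex["input_ids"]) for ex in batch), multiple)
    lab_len = round_up(max(len(ex["labels"]) for ex in batch), multiple)
    out: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in batch:
        n_in, n_lab = len(ex["input_ids"]), len(ex["labels"])
        out["input_ids"].append(ex["input_ids"] + [pad_id] * (in_len - n_in))
        out["attention_mask"].append([1] * n_in + [0] * (in_len - n_in))
        out["labels"].append(ex["labels"] + [label_pad_id] * (lab_len - n_lab))
    return out


batch = [
    {"input_ids": [5, 6, 7], "labels": [9, 1]},
    {"input_ids": list(range(10, 20)), "labels": [9, 8, 7, 1]},
]
padded = collate(batch)
print(len(padded["input_ids"][0]), len(padded["labels"][0]))
```

Because `preprocess_function` already pads every example to `max_length` by default, the collator has little left to pad in this notebook; it matters more when preprocessing is run with `padding=False`.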

We can set the hyperparameters here.

In [19]:
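from transformers import Seq2SeqTrainingArguments

In [24]:
# Seq2SeqTrainingArguments adds generation options on top of TrainingArguments
training_args = Seq2SeqTrainingArguments(
    output_dir="test_trainer",
    evaluation_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    predict_with_generate=True,  # evaluate on generated token ids, which compute_metrics decodes
)

Finally, we create the trainer object.

In [21]:
from transformers import Seq2SeqTrainer

In [25]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,  # use the collator defined above
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    compute_metrics=compute_metrics,
)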
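For a sense of how long training will run with these settings, the number of optimizer steps can be estimated from the training set size (16462 rows) and the per-device batch size (4). The sketch below assumes a single device, no gradient accumulation, and the library default of 3 epochs, since the arguments above do not set `num_train_epochs`; `training_steps` is a hypothetical helper.

```python
import math
from typing import Tuple


def training_steps(num_examples: int, batch_size: int, num_epochs: int) -> Tuple[int, int]:
    """Return (steps per epoch, total steps) for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch may be partial
    return steps_per_epoch, steps_per_epoch * num_epochs


steps_per_epoch, total_steps = training_steps(num_examples=16462, batch_size=4, num_epochs=3)
print(steps_per_epoch, total_steps)
```

With evaluation scheduled once per epoch, validation metrics would first appear only after the full first epoch of steps.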

In [26]:
trainer.train()

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 