[__Source__](https://www.philschmid.de/fine-tune-flan-t5)

Model trained on Vastai

![title](assets/vastai_flant5.png)

In [1]:
from datasets import load_dataset
from huggingface_hub import login
from dotenv import load_dotenv
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM
from random import randrange
import os

In [2]:
load_dotenv()
hf_token = os.getenv("HF_API_TOKEN")
login(hf_token)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to C:\Users\arind\.cache\huggingface\token
Login successful


### Load Data for pre-processing

The dataset used in this exercise already has train, test, and validation splits. <br> Each sample has a __dialogue__ followed by a __summary__. For pre-processing (padding, tokenization, etc.) we need to figure out the maximum dialogue length and the maximum summary length.

In [4]:
#!pip install py7zr evaluate nltk absl-py rouge_score tensorboardX

In [3]:
model_id = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

In [4]:
# Acquire the training data from Hugging Face
data_id= "samsum"
dataset = load_dataset(data_id, trust_remote_code=True)
dataset.keys()

dict_keys(['train', 'test', 'validation'])

Since the data already has three partitions, we can check the size of each partition and look at a random sample.

In [5]:
print(f"Length of train dataset: {len(dataset['train'])}")
print(f"Length of val dataset: {len(dataset['validation'])}")
print(f"Length of test dataset: {len(dataset['test'])}")

Length of train dataset: 14732
Length of val dataset: 818
Length of test dataset: 819


In [6]:
sample = dataset['train'][randrange(len(dataset["train"]))]
print(f"dialogue: \n{sample['dialogue']}\n---------------")
print(f"summary: \n{sample['summary']}\n---------------")

dialogue: 
Noe: hey girl, is everything good with you?
Laila: hii! Yes I am great! What about you? 
Noe: good good! So how is your new job? Apartment? Life! Tell me EVERYTHING! 👀
Laila: oh I freaking love it here in Amsterdam! It is less stressful than Paris, but you still have a lot of career opportunities with all the big brands being based here...
Noe: that sounds great, are you satisfied with your new job?
Laila: humm … I am still discovering all its aspects, and getting to know my boss better (hope she doesn’t turn out a bitch like the last one) 😂… but so far so good with the colleagues.
Laila: and we have people from all over the world!
Noe: hahah you cracked me up! She was a real bitch though! Thank God I left at the same time as you, otherwise I would have gone crazy.
Laila: tell me me about it!  😷
Noe: and how is yoru roommate? do you get along?
Laila: yes, perfectly! I am so lucky this time thank God. She is German, I also get to practice it with her 😊
Laila

We concatenate the train and test sets to find the maximum source (dialogue) length and the maximum target (summary) length. These two values are used for padding.

In [7]:
from datasets import concatenate_datasets

tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["dialogue"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Max source length: {max_source_length}")

tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["summary"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
max_target_length = max([len(x) for x in tokenized_targets["input_ids"]])
print(f"Max target length: {max_target_length}")

Max source length: 512


Max target length: 95
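
Because `truncation=True` is passed in the cell above, the reported maximum can never exceed the tokenizer's truncation limit. A source length of exactly 512 therefore probably means that at least one dialogue was cut off at that limit; this is an inference from the output, not something the notebook states. The sketch below illustrates the effect with whitespace splitting as a stand-in for the real subword tokenizer. The helper name `max_tokenized_length` and the toy texts are made up for illustration.

```python
def max_tokenized_length(texts, limit):
    """Longest token count after truncating each text to `limit` tokens."""
    # Whitespace splitting stands in for the real subword tokenizer.
    return max(len(text.split()[:limit]) for text in texts)

# Toy "dialogues" with 3, 8 and 2 tokens
dialogues = ["a b c", "a b c d e f g h", "a b"]

capped = max_tokenized_length(dialogues, limit=5)      # longest text hits the cap
uncapped = max_tokenized_length(dialogues, limit=512)  # nothing is truncated
print(capped, uncapped)  # 5 8
```

When the maximum equals the cap, the true longest sequence is unknown; it is only known to be at least that long.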


### Pre-Processing

Now that we have the maximum lengths for the dialogues and summaries, we take the following steps.

- Add a prefix to the dialogue.
- Pad the dialogues and the summaries to the max length with the eos token (which is used as the pad token here).
- Tokenize to create the __input_ids__ and __labels__ fields in the dataset.
- Replace the pad (eos) token IDs in the labels with -100, so that they are ignored in the loss calculation.
- Remove the other fields that aren't needed anymore.

This is illustrated in the image below.

![title](assets/Paddingflan.png)
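
The padding and masking of the labels can be sketched on toy token IDs without loading a tokenizer. The IDs, `PAD_ID = 1`, and the helper name `pad_and_mask` are made up for illustration; in this notebook the real pad ID is whatever `tokenizer.pad_token_id` resolves to once the pad token has been set to the eos token.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default
PAD_ID = 1           # hypothetical pad ID, for illustration only

def pad_and_mask(label_ids, max_length, pad_id=PAD_ID):
    """Truncate or pad label IDs to max_length, then replace every pad ID with -100."""
    truncated = label_ids[:max_length]
    padded = truncated + [pad_id] * (max_length - len(truncated))
    return [tok if tok != pad_id else IGNORE_INDEX for tok in padded]

masked = pad_and_mask([37, 1782, 19, 1095], max_length=6)
print(masked)  # [37, 1782, 19, 1095, -100, -100]

# When pad and eos share one ID, a genuine trailing eos (1 here) is masked as well.
masked_with_eos = pad_and_mask([37, 1782, 1], max_length=5)
print(masked_with_eos)  # [37, 1782, -100, -100, -100]
```

The second call shows a side effect of reusing eos as the pad token: the end-of-sequence position no longer contributes to the loss either.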

In [8]:
def preprocess_function(sample,padding="max_length"):
    # add prefix to the input for t5
    inputs = ["summarize: " + item for item in sample["dialogue"]]
 
    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)
 
    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, padding=padding, truncation=True)
 
    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]
  
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
 
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["dialogue", "summary", "id"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']


### Load Model

In [11]:
from transformers import AutoModelForSeq2SeqLM
 
# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

In [14]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")
 
# Metric
metric = evaluate.load("rouge")
 
# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]
 
    # rougeLSum expects newline after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]
 
    return preds, labels
 
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
 
    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
 
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
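
To give a feel for what the Rouge1 number measures, here is a deliberately simplified unigram-overlap F1. It is a sketch, not the implementation behind `evaluate.load("rouge")`, which applies its own tokenization and, with `use_stemmer=True`, stemming. The function name `rouge1_f1` and the two example sentences are invented for illustration.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# 4 shared tokens, 5 predicted, 7 in the reference -> F1 = 8 / 12
score = rouge1_f1("laila loves her new job", "laila likes her new job in amsterdam")
print(round(score * 100, 4))  # 66.6667
```

As in `compute_metrics`, the score is multiplied by 100 before rounding, so it is on the same 0 to 100 scale as the reported metrics.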


In [15]:
from transformers import DataCollatorForSeq2Seq
 
# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
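
`pad_to_multiple_of=8` asks the collator to pad each batch to a length that is divisible by 8. The arithmetic behind that rounding can be sketched as follows; `padded_length` is a hypothetical helper written for this illustration and is not part of `transformers`.

```python
def padded_length(longest_in_batch, multiple=8):
    """Round a sequence length up to the next multiple of `multiple`."""
    return -(-longest_in_batch // multiple) * multiple  # ceiling division

lengths = {n: padded_length(n) for n in (95, 96, 97, 512)}
print(lengths)  # {95: 96, 96: 96, 97: 104, 512: 512}
```

For example, a length of 95 would be rounded up to 96, while 512 is already a multiple of 8 and stays unchanged.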

In [17]:
from huggingface_hub import HfFolder
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
 
# Hugging Face repository id
repository_id = f"{model_id.split('/')[1]}-{data_id}"
 
# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=repository_id,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    fp16=False, # Overflows with fp16
    learning_rate=5e-5,
    num_train_epochs=5,
    # logging & evaluation strategies
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=500,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=3,
    load_best_model_at_end=True,
    # metric_for_best_model="overall_f1",
    # push to hub parameters
    report_to="tensorboard",
    push_to_hub=False,
    hub_strategy="every_save",
    hub_model_id=repository_id,
    hub_token=HfFolder.get_token(),
)
 
# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)

In [18]:
trainer.train()

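| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|-------|---------------|-----------------|--------|--------|--------|-----------|---------|
| 1 | 1.4697 | 1.396656 | 46.0153 | 22.2416 | 38.0741 | 42.3836 | 20.0 |
| 2 | 1.3504 | 1.388259 | 45.9788 | 22.0954 | 38.2199 | 42.4275 | 20.0 |
| 3 | 1.2901 | 1.383503 | 46.3456 | 22.2048 | 38.3737 | 42.5287 | 20.0 |
| 4 | 1.2331 | 1.388069 | 46.5584 | 22.8334 | 38.8828 | 43.0295 | 20.0 |
| 5 | 1.2102 | 1.389546 | 46.4492 | 22.512 | 38.6132 | 42.8449 | 20.0 |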


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].


TrainOutput(global_step=9210, training_loss=1.3138273092098007, metrics={'train_runtime': 5975.3743, 'train_samples_per_second': 12.327, 'train_steps_per_second': 1.541, 'total_flos': 5.043922658131968e+16, 'train_loss': 1.3138273092098007, 'epoch': 5.0})
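
The figures in `TrainOutput` can be reproduced from the dataset size and the training arguments. The check below assumes a single device and no gradient accumulation, which matches the defaults but is not stated explicitly in the notebook.

```python
import math

train_samples = 14732   # size of the train split reported earlier
batch_size = 8          # per_device_train_batch_size
epochs = 5              # num_train_epochs
runtime_s = 5975.3743   # train_runtime from the output above

steps_per_epoch = math.ceil(train_samples / batch_size)  # last batch is partial
total_steps = steps_per_epoch * epochs
samples_per_second = round(train_samples * epochs / runtime_s, 3)
steps_per_second = round(total_steps / runtime_s, 3)

print(steps_per_epoch, total_steps)          # 1842 9210
print(samples_per_second, steps_per_second)  # 12.327 1.541
```

The results agree with `global_step=9210`, `train_samples_per_second` of 12.327, and `train_steps_per_second` of 1.541; the runtime itself corresponds to roughly 100 minutes.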

In [19]:
trainer.evaluate()



{'eval_loss': 1.383502721786499,
 'eval_rouge1': 46.3456,
 'eval_rouge2': 22.2048,
 'eval_rougeL': 38.3737,
 'eval_rougeLsum': 42.5287,
 'eval_gen_len': 20.0,
 'eval_runtime': 44.391,
 'eval_samples_per_second': 18.45,
 'eval_steps_per_second': 2.32,
 'epoch': 5.0}
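
The values returned by `trainer.evaluate()` match the epoch 3 row of the training log rather than the epoch 5 row. This is consistent with `load_best_model_at_end=True`: when no `metric_for_best_model` is set, the Trainer ranks checkpoints by evaluation loss, and epoch 3 has the lowest one. A small sketch of that selection, with the losses copied from the log and variable names chosen for illustration:

```python
# Validation loss per epoch, copied from the training log
val_loss_by_epoch = {
    1: 1.396656,
    2: 1.388259,
    3: 1.383503,
    4: 1.388069,
    5: 1.389546,
}

# Lower loss is better, so the best checkpoint is the epoch with the minimum loss.
best_epoch = min(val_loss_by_epoch, key=val_loss_by_epoch.get)
print(best_epoch, val_loss_by_epoch[best_epoch])  # 3 1.383503
```

Note that epoch 4 has the highest ROUGE scores, so choosing the best checkpoint by a ROUGE metric instead of the loss would pick a different epoch.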

In [20]:
# Save our tokenizer and create model card
tokenizer.save_pretrained(repository_id)
trainer.create_model_card()
# Push the results to the hub
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/Arindam1975/flan-t5-base-samsum/commit/48d0a9f793c713bfef5e855505eebbcef550ad09', commit_message='End of training', commit_description='', oid='48d0a9f793c713bfef5e855505eebbcef550ad09', pr_url=None, pr_revision=None, pr_num=None)

### Load the Trained Model for Inference

In [14]:
import torch as T
device = T.device('cuda:0' if T.cuda.is_available() else 'cpu')

In [15]:
from transformers import pipeline
from random import randrange
 
# load model and tokenizer from huggingface hub with pipeline
summarizer = pipeline("summarization", model="Arindam1975/flan-t5-base-samsum", device=device)
 
# select a random test sample (index range taken from the test split itself)
sample = dataset['test'][randrange(len(dataset["test"]))]
print(f"dialogue: \n{sample['dialogue']}\n---------------")
 
# summarize dialogue
res = summarizer(sample["dialogue"])
 
print(f"flan-t5-base summary:\n{res[0]['summary_text']}")

dialogue: 
Michael: hey, how are you
Kai: hey! I am fine, just working too much. what about you? you travel so much!
Michael: haha yes. At airport on my way back. looong trip
Kai: where have you been now?
Michael: argentina brazil and chile
Kai: wow! how long?
Michael: 2 weeks, lots of flights to make it work. I'm in Boston next weekend!
Kai: really??! how come?
Michael: just because I found a cheap ticket 😋
Kai: nice:) but it's cold
Michael: hmm well.. I can deal with the cold now
Kai: are you not tired of all this travelling?
Michael: hmm, a little bit but not really. I’m more scared to stay in London and do nothing, because I’m so bored of it
Kai: I see, a man full of energy :)
Michael: well sort of, for fun stuff, but tired of work. 
Kai: yes, I remember quite well 😋
Michael: Hahah. Thinking of resigning earlier than I was planning
Kai: and then?
Michael: I don’t have an answer to that one yet, and it’s not really a solution because I’d need to work 2 months notice 