# Fine-tune FLAN-T5 for Event Generation  
In this notebook, you will see how to fine-tune [google/flan-t5-base](https://huggingface.co/google/flan-t5-base) for an event generation task using Hugging Face Transformers. FLAN-T5 is built on the T5 language model: for the same number of parameters, the FLAN-T5 models have been fine-tuned on more than 1,000 additional tasks that also cover more languages.

You will learn how to:

1. [Install requirements](#1-install-requirements)
2. [Load Corpus](#2-load-corpus)
3. [Fine-tune and evaluate FLAN-T5](#3-fine-tune-and-evaluate-flan-t5)
4. [Run Inference](#4-run-inference)

## FLAN-T5
FLAN-T5, released with the [Scaling Instruction-Finetuned Language Models](https://arxiv.org/pdf/2210.11416.pdf) paper, is an enhanced version of T5 that has been finetuned on a mixture of tasks. The paper explores instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. It finds that instruction finetuning is, overall, a general method for improving the performance and usability of pretrained language models.

## 1. Install requirements

Our first step is to install the Hugging Face Libraries, including transformers and datasets. Running the following cell will install all the required packages.

In [None]:
# python
!pip install pytesseract transformers==4.28.1 datasets evaluate rouge-score sentence_transformers nltk tensorboard py7zr --upgrade

In [None]:
# from huggingface_hub import notebook_login

# notebook_login()

## Connect to Drive

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import os
import re
import glob
from datasets import load_dataset
import datasets

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 2. Load Corpus

In [None]:
import pickle

# Load the pickled DatasetDict; only unpickle files from a trusted source
with open("/content/drive/MyDrive/Corpus/CG_Corpus/event_extraction_3to1.dat", "rb") as f:
    dataset = pickle.load(f)

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Sentence', 'Events'],
        num_rows: 415
    })
    test: Dataset({
        features: ['Sentence', 'Events'],
        num_rows: 146
    })
})

In [None]:
dataset['train'][65]

{'Sentence': 'B: You know? Now things are just like kind of mellow and I’m just n- you know wrapping things up so it’s not that bad.  ',
 'Events': "Event1: Now things are like kind of mellow\nEvent2: B is just wrapping things up so B' situation is not that bad\nEvent3: B is just wrapping things up\nEvent4: B's situation is not that bad\n"}

In [None]:
from random import randrange


sample = dataset['train'][randrange(len(dataset["train"]))]
print(f"Sentence: \n{sample['Sentence']}\n---------------")
print(f"Events: \n{sample['Events']}\n---------------")

Sentence: 
B: Well anyway.  
---------------
Events: 
Event1: B says Well anyway.  

---------------


Tokenizer

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id="google/flan-t5-base"

# Load tokenizer of FLAN-t5-base
tokenizer = AutoTokenizer.from_pretrained(model_id)


This is a text2text-generation task: our model takes an utterance as input and generates events as output. To batch the data efficiently, we first need to know how long the inputs and outputs are.

In [None]:
from datasets import concatenate_datasets

# Combine both splits so the length limits cover every example.
all_samples = concatenate_datasets([dataset["train"], dataset["test"]])

# Longest tokenized input: used later as max_length for the source text.
tokenized_inputs = all_samples.map(lambda x: tokenizer(x["Sentence"], truncation=True), batched=True, remove_columns=['Sentence', 'Events'])
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Max source length: {max_source_length}")

# Longest tokenized target: used later as max_length for the target text.
tokenized_targets = all_samples.map(lambda x: tokenizer(x["Events"], truncation=True), batched=True, remove_columns=['Sentence', 'Events'])
max_target_length = max([len(x) for x in tokenized_targets["input_ids"]])
print(f"Max target length: {max_target_length}")

Max source length: 96


Max target length: 252
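
Note that these lengths are measured on the raw `Sentence` text, while `preprocess_function` below prepends the `Events: ` prefix before tokenizing, so the longest prefixed inputs may exceed the measured maximum and get truncated. The following is a minimal sketch of measuring the length with the prefix included. It uses a whitespace split as a stand-in for the FLAN-T5 tokenizer, and `max_tokenized_length` is a hypothetical helper that is not part of this notebook.

```python
def max_tokenized_length(texts, tokenize, prefix=""):
    """Return the largest token count over texts, with an optional prefix prepended."""
    return max(len(tokenize(prefix + text)) for text in texts)

# Stand-in tokenizer (assumption): whitespace split, not the real FLAN-T5 tokenizer.
whitespace_tokenize = str.split

sentences = ["B: Well anyway.", "A: What kind of car do you have?"]

without_prefix = max_tokenized_length(sentences, whitespace_tokenize)
with_prefix = max_tokenized_length(sentences, whitespace_tokenize, prefix="Events: ")

# The prefix adds tokens, so a limit measured without it can be too small.
print(without_prefix, with_prefix)
```

With the real tokenizer, the same idea would be to tokenize `"Events: " + sentence` when computing `max_source_length`.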


In [None]:
def preprocess_function(sample, padding="max_length"):
    # add prefix to the input for t5
    inputs = ["Events: " + item for item in sample["Sentence"]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["Events"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=['Sentence', 'Events'])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']
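
The label masking step in `preprocess_function` can be seen in isolation in the sketch below. It replaces every padding id in the labels with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so padded positions do not contribute to the training loss. The token ids are made up for illustration, and `mask_padding` is a hypothetical helper name.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def mask_padding(label_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Replace padding ids with ignore_index in every label sequence."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]

# Illustrative ids only; the T5 tokenizer uses 0 as its pad token id.
labels = [[37, 1021, 1, 0, 0], [8, 1, 0, 0, 0]]
masked = mask_padding(labels, pad_token_id=0)
print(masked)
```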


## 3. Fine-tune and evaluate FLAN-T5

In [None]:
from transformers import AutoModelForSeq2SeqLM

# huggingface hub model id
model_id="google/flan-t5-base"

# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

Trainer: we use the `evaluate` library to compute the ROUGE score during evaluation.

In [None]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Metric
metric = evaluate.load("rouge")

# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Before we can start training, we need to create a `DataCollator` that takes care of padding our inputs and labels. We will use the `DataCollatorForSeq2Seq` from the 🤗 Transformers library.

In [None]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
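
To illustrate what `pad_to_multiple_of=8` means, here is a minimal pure-Python sketch of padding a batch to its longest sequence, rounded up to the next multiple of 8. It is not the collator's actual implementation, and `pad_batch` is a hypothetical helper. Because `preprocess_function` already pads to `max_length`, the collator likely has little left to pad in this notebook.

```python
import math

def pad_batch(sequences, pad_id, multiple_of=8):
    """Pad every sequence to the longest length, rounded up to a multiple."""
    longest = max(len(seq) for seq in sequences)
    target = math.ceil(longest / multiple_of) * multiple_of
    return [seq + [pad_id] * (target - len(seq)) for seq in sequences]

# Illustrative token ids; the longest sequence has 9 tokens, so the target is 16.
batch = pad_batch([[5, 6, 7], [8], [1, 2, 3, 4, 5, 6, 7, 8, 9]], pad_id=0)
print([len(seq) for seq in batch])
```

For the labels, the collator pads with `label_pad_token_id` (here `-100`) instead of the tokenizer's pad id.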


Next, we define the hyperparameters (`Seq2SeqTrainingArguments`) for training. The `Trainer` integrates with the [Hugging Face Hub](https://huggingface.co/models) and can push checkpoints, logs, and metrics to a repository during training. Here `push_to_hub=False`, so checkpoints stay local and metrics are reported to TensorBoard.

In [None]:
from huggingface_hub import HfFolder
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Hugging Face repository id
repository_id = f"{model_id.split('/')[1]}-event-extraction-train-on-3-test-on-1new"

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=repository_id,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    fp16=False, # Overflows with fp16
    learning_rate=5e-5,
    num_train_epochs=16,
    # logging & evaluation strategies
    # logging_dir=f"{repository_id}/logs",
    logging_strategy="epoch",
    # logging_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=False,
    # metric_for_best_model="overall_f1",
    # push to hub parameters
    report_to="tensorboard",
    push_to_hub=False,
    hub_strategy="every_save",
    # hub_model_id=repository_id,
    hub_token=HfFolder.get_token(),
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)

We can start our training by using the `train` method of the `Trainer`.

In [None]:
# Start training
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 2.0352 | 1.266261 | 42.8582 | 24.9182 | 41.0987 | 41.0905 | 17.356164 |
| 2 | 1.4134 | 1.154586 | 44.0834 | 25.9814 | 42.0823 | 42.073 | 17.321918 |
| 3 | 1.2379 | 1.10706 | 44.2618 | 26.2147 | 41.9555 | 42.0339 | 17.075342 |
| 4 | 1.1412 | 1.109151 | 45.7297 | 28.739 | 43.5858 | 43.6439 | 17.267123 |
| 5 | 1.0387 | 1.100821 | 46.1337 | 29.6879 | 44.3288 | 44.4006 | 17.452055 |
| 6 | 0.9733 | 1.111677 | 46.3947 | 29.4918 | 44.7395 | 44.8695 | 17.239726 |
| 7 | 0.9158 | 1.124194 | 46.2706 | 29.9295 | 44.4214 | 44.5271 | 17.253425 |
| 8 | 0.8832 | 1.121005 | 46.7038 | 30.4227 | 44.8752 | 44.9476 | 17.226027 |
| 9 | 0.8324 | 1.130277 | 46.7824 | 30.7373 | 45.116 | 45.1671 | 17.212329 |
| 10 | 0.8045 | 1.14337 | 46.3601 | 29.8314 | 44.3208 | 44.3348 | 17.479452 |


TrainOutput(global_step=832, training_loss=0.9801508050698501, metrics={'train_runtime': 1102.8266, 'train_samples_per_second': 6.021, 'train_steps_per_second': 0.754, 'total_flos': 852522903797760.0, 'train_loss': 0.9801508050698501, 'epoch': 16.0})
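
The reported `global_step=832` can be checked with a little arithmetic from the dataset size and the training arguments. The sketch below assumes a single device and no gradient accumulation, neither of which is stated explicitly in this notebook.

```python
import math

train_rows = 415   # num_rows of the train split shown earlier
batch_size = 8     # per_device_train_batch_size
epochs = 16        # num_train_epochs

# The last, smaller batch of each epoch still counts as one optimizer step.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

Under those assumptions the result matches the `TrainOutput`, which is consistent with all 16 epochs having run.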

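Nice, we have trained our model. 🎉 Let's evaluate it again on the test set. Because `load_best_model_at_end=False`, this evaluates the model as it stands after the final epoch, not the best checkpoint.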


In [None]:
trainer.evaluate()

{'eval_loss': 1.168958067893982,
 'eval_rouge1': 46.6846,
 'eval_rouge2': 30.826,
 'eval_rougeL': 45.0442,
 'eval_rougeLsum': 45.1453,
 'eval_gen_len': 17.26027397260274,
 'eval_runtime': 13.2772,
 'eval_samples_per_second': 10.996,
 'eval_steps_per_second': 1.431,
 'epoch': 16.0}
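
Because `load_best_model_at_end=False` in the training arguments above, the per-epoch log is the only place to see which epoch would have been selected as the best one. The sketch below picks the best epoch by validation loss and by RougeL, using the values copied from the ten logged rows shown earlier. The `history` list is just a hand-copied illustration, not something the `Trainer` produces in this form.

```python
# (epoch, validation loss, RougeL) copied from the per-epoch training log above.
history = [
    (1, 1.266261, 41.0987),
    (2, 1.154586, 42.0823),
    (3, 1.10706, 41.9555),
    (4, 1.109151, 43.5858),
    (5, 1.100821, 44.3288),
    (6, 1.111677, 44.7395),
    (7, 1.124194, 44.4214),
    (8, 1.121005, 44.8752),
    (9, 1.130277, 45.116),
    (10, 1.14337, 44.3208),
]

# Lower is better for the loss, higher is better for RougeL.
best_by_loss = min(history, key=lambda row: row[1])[0]
best_by_rouge = max(history, key=lambda row: row[2])[0]
print(f"best epoch by validation loss: {best_by_loss}, by RougeL: {best_by_rouge}")
```

The two criteria disagree here, so the choice of metric matters. To have the `Trainer` restore the best checkpoint itself, one option would be `load_best_model_at_end=True` together with `metric_for_best_model` set to one of the keys returned by `compute_metrics`, such as `"rougeL"`.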

In [None]:
# # Save our tokenizer and create model card
# tokenizer.save_pretrained(repository_id)
# trainer.create_model_card()
# # Push the results to the hub
# trainer.push_to_hub()

## Test And Evaluate on Rouge and SBERT

### Rouge Score

In [None]:
from rouge_score import rouge_scorer

def calculate_rouge_score(reference, candidate):
  scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL', ], use_stemmer=True)
  scores = scorer.score(reference, candidate)
  return scores['rougeL']
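
To make the metric concrete, here is a minimal sketch of what a ROUGE-L score measures: precision, recall, and F-measure based on the longest common subsequence (LCS) of tokens. It is a simplification and not the `rouge_score` implementation. It lowercases and keeps alphanumeric tokens but does no stemming, so its numbers can differ from `calculate_rouge_score`. The helper names are hypothetical.

```python
import re

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_sketch(reference, candidate):
    """Return (precision, recall, fmeasure) from the token-level LCS."""
    ref = re.findall(r"[a-z0-9]+", reference.lower())
    cand = re.findall(r"[a-z0-9]+", candidate.lower())
    if not ref or not cand:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure

# Example pair taken from the test-set output shown later in this notebook.
precision, recall, fmeasure = rouge_l_sketch(
    "Event1: B has the same car", "Event1: B says Same car"
)
print(round(precision, 4), round(recall, 4), round(fmeasure, 4))
```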

In [None]:
samples_number = len(dataset['test'])

SUM = 0
for sample in dataset['test']:
  TEXT = "Events: " + sample['Sentence']
  ground_truth = sample['Events']
  inputs = tokenizer(TEXT, return_tensors="pt").to('cuda')
  outputs = model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_length=512)
  prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
  rouge = calculate_rouge_score(ground_truth, prediction)
  SUM += rouge.fmeasure # rougeL F-measure

rouge_avg = SUM/samples_number
print(f"\nRougeL average on test set with {samples_number} samples: {rouge_avg}")


RougeL average on test set with 146 samples: 0.487449564110134
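
The same evaluation loop is repeated several times in this notebook with different scorers and models. One possible way to factor it out is sketched below. `average_score` is a hypothetical helper, and the `echo_predict` and `exact_match` functions are stand-ins so that the sketch runs without a model.

```python
PREFIX = "Events: "

def average_score(samples, predict, score):
    """Average score(reference, prediction) over all samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    total = 0.0
    for sample in samples:
        prediction = predict(PREFIX + sample["Sentence"])
        total += score(sample["Events"], prediction)
    return total / len(samples)

# Stand-in for tokenizer + model.generate + decode (assumption, no real model).
def echo_predict(text):
    return "Event1: " + text[len(PREFIX):]

# Stand-in scorer (assumption): 1.0 on an exact match, else 0.0.
def exact_match(reference, prediction):
    return float(reference.strip() == prediction.strip())

samples = [
    {"Sentence": "B says yeah.", "Events": "Event1: B says yeah.\n"},
    {"Sentence": "B: Same car.", "Events": "Event1: B has the same car\n"},
]
print(average_score(samples, echo_predict, exact_match))
```

In this notebook, `predict` would wrap the tokenizer, `model.generate`, and `tokenizer.decode`, and `score` could be `lambda ref, pred: calculate_rouge_score(ref, pred).fmeasure`.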


### SBERT Score

In [None]:
from sentence_transformers import SentenceTransformer, util

sbert_model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
from sentence_transformers import SentenceTransformer, util
import torch

def calculate_sbert_score(sentences1, sentences2):
    # Compute embedding for both lists
    embeddings1 = sbert_model.encode(sentences1, convert_to_tensor=True)
    embeddings2 = sbert_model.encode(sentences2, convert_to_tensor=True)

    # Compute the cosine similarity; for two single strings this is a 1x1 tensor
    cosine_scores = util.cos_sim(embeddings1, embeddings2)
    return round(cosine_scores.item(), 4)
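
For reference, the cosine similarity that `util.cos_sim` computes between two embeddings is the dot product divided by the product of the vector norms. The sketch below shows it in plain Python on small made-up vectors. Real SBERT embeddings have several hundred dimensions, and `cosine_similarity` is a hypothetical helper name.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(parallel, 4), round(orthogonal, 4))
```

A score near 1 means the two texts point in almost the same direction in embedding space. It does not guarantee that the extracted events are factually equivalent.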

In [None]:
samples_number = len(dataset['test'])

SUM = 0
for sample in dataset['test']:
  TEXT = "Events: " + sample['Sentence']
  ground_truth = sample['Events']
  inputs = tokenizer(TEXT, return_tensors="pt").to('cuda')
  outputs = model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_length=512)
  prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
  sbert_score = calculate_sbert_score(ground_truth, prediction)
  SUM += sbert_score

  if sbert_score<0.2: print(f"\n[-] Sentence:{TEXT} \nground_truth: {ground_truth} \nprediction: {prediction} \nsimilarity score: {sbert_score}", '\n------------------------------')
  if sbert_score>0.9: print(f"\n[+] Sentence:{TEXT} \nground_truth: {ground_truth} \nprediction: {prediction} \nsimilarity score: {sbert_score}", '\n------------------------------')

sbert_score_avg = SUM/samples_number
print(f"\n\n\nSBERT Score Cosine Similarity Average on test set with {samples_number} samples: {sbert_score_avg}")


[+] Sentence:Events: B: yeah.   
ground_truth: Event1: B says yeah.  
 
prediction: Event1: B says yeah.  
similarity score: 1.0 
------------------------------

[+] Sentence:Events: A: Darn it. I thought I was going to get to see everybody.   
ground_truth: Event1: A thought B was going to get to see everybody
Event2: B was going to get to see everybody
Event3: B got to see everybody
 
prediction: Event1: A thought B was going to get to see everybody Event2: B was going to get to see everybody  
similarity score: 0.9717 
------------------------------

[+] Sentence:Events: A: What kind of car do you have?  
ground_truth: Event1: A asks B what kind of car B has
Event2: B has a car
 
prediction: Event1: A asks B what kind of car do A have Event2: A has a car  
similarity score: 0.9801 
------------------------------

[+] Sentence:Events: B: Same car.   
ground_truth: Event1: B has the same car
 
prediction: Event1: B says Same car  
similarity score: 0.9378 
---------------------------

## 4. Run Inference

In [None]:
TEXT = "Events: A: What kind of car do you have now?"
inputs = tokenizer.encode_plus(TEXT, padding='max_length', max_length=512, return_tensors='pt').to('cuda')
outputs = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=150, num_beams=4, early_stopping=True)
prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Events Extracted with Flan-T5-Base LLM:\n{prediction}")

Events Extracted with Flan-T5-Base LLM:
Event1: A asks B what kind of car A has now Event2: A has now a car 
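
The decoded prediction above places all events on a single line. If a list of events is needed downstream, one option is to split on the `EventN:` markers, as in the minimal sketch below. `split_events` is a hypothetical helper and assumes the model keeps emitting markers in this exact form.

```python
import re

def split_events(generated):
    """Split a generated string into event texts using the EventN: markers."""
    parts = re.split(r"Event\d+:\s*", generated)
    return [part.strip() for part in parts if part.strip()]

prediction = "Event1: A asks B what kind of car A has now Event2: A has now a car "
events = split_events(prediction)
for number, event in enumerate(events, start=1):
    print(number, event)
```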


In [None]:
from random import randrange

sample = dataset['test'][randrange(len(dataset["test"]))]
print(f"Sentence: {sample['Sentence']}\nGround truth:\n{sample['Events']}\n---------------")
TEXT = "Events: " + sample['Sentence']
inputs = tokenizer.encode_plus(TEXT, padding='max_length', max_length=512, return_tensors='pt').to('cuda')
outputs = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=150, num_beams=4, early_stopping=True)
prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Events Extracted with Flan-T5-Base LLM:\n{prediction}")
calculate_rouge_score(sample['Events'], prediction)

Sentence: A: Right. When do you sleep then? 
Ground truth:
Event1: A asks B when B sleeps then
Event2: B sleeps

---------------
Events Extracted with Flan-T5-Base LLM:
Event1: A asks B when do B and A sleep Event2: B and A sleep 


Score(precision=0.6666666666666666, recall=0.9090909090909091, fmeasure=0.7692307692307692)

### Run this block to show all results on the test set

In [None]:
for sample in dataset['test']:
  print(f"Sentence: {sample['Sentence']}\nGround truth:\n{sample['Events']}\n")
  TEXT = "Events: " + sample['Sentence']
  inputs = tokenizer.encode_plus(TEXT, padding='max_length', max_length=512, return_tensors='pt').to('cuda')
  outputs = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=150, num_beams=4, early_stopping=True)
  prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print(f"Events Extracted with Flan-T5-Base LLM:\n{prediction}")
  print(calculate_rouge_score(sample['Events'], prediction))
  print("-"*80)

## Save the Fine-tuned Model

In [None]:
save_directory = "/content/drive/MyDrive/Common Ground Docs/Models/FlanT5_Event_Extraction_3_to_1"
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

### Load and test the saved model

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(save_directory)
pretrained_model = AutoModelForSeq2SeqLM.from_pretrained(save_directory)

In [None]:
pretrained_model.to('cuda')

Rouge

In [None]:
samples_number = len(dataset['test'])

SUM = 0
for sample in dataset['test']:
  TEXT = "Events: " + sample['Sentence']
  ground_truth = sample['Events']
  inputs = tokenizer(TEXT, return_tensors="pt").to('cuda')
  outputs = pretrained_model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_length=512)
  prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
  rouge = calculate_rouge_score(ground_truth, prediction)
  SUM += rouge.fmeasure # rougeL F-measure

rouge_avg = SUM/samples_number
print(f"\nRougeL average on test set with {samples_number} samples: {rouge_avg}")


RougeL average on test set with 47 samples: 0.5899351297618687


SBERT

In [None]:
samples_number = len(dataset['test'])

SUM = 0
for sample in dataset['test']:
  TEXT = "Events: " + sample['Sentence']
  ground_truth = sample['Events']
  inputs = tokenizer(TEXT, return_tensors="pt").to('cuda')
  outputs = pretrained_model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_length=512)
  prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
  sbert_score = calculate_sbert_score(ground_truth, prediction)
  SUM += sbert_score

  if sbert_score<0.2: print(f"\n[-] Sentence:{TEXT} \nground_truth: {ground_truth} \nprediction: {prediction} \nsimilarity score: {sbert_score}", '\n------------------------------')
  if sbert_score>0.9: print(f"\n[+] Sentence:{TEXT} \nground_truth: {ground_truth} \nprediction: {prediction} \nsimilarity score: {sbert_score}", '\n------------------------------')

sbert_score_avg = SUM/samples_number
print(f"\n\n\nSBERT Score Cosine Similarity Average on test set with {samples_number} samples: {sbert_score_avg}")


[+] Sentence:Events: A: %um but although they solicit, they’re trying to solicit more throughout the, throughout the globe instead of just &Japan  
ground_truth: Event1: Although the company solicits, the company is trying to solicit more throughout the globe instead of just Japan
Event2: The company solicits
Event3: The company is trying to solicit more throughout the globe instead of just Japan
 
prediction: Event1: Although they solicit, they are trying to solicit more throughout the globe instead of just Japan Event2: They solicit more throughout the globe instead of just Japan  
similarity score: 0.914 
------------------------------

[+] Sentence:Events: A: although they’ve had most interest in &Japan.   
ground_truth: Event1: The company have had most interest in Japan
 
prediction: Event1: Although they have had most interest in Japan  
similarity score: 0.9059 
------------------------------

[+] Sentence:Events: A: And I teach probably two classes and then do administrative 