## 1. Setup Development Environment

Our first step is to install the Hugging Face libraries, including `transformers` and `datasets`. Running the following cell installs all the required packages.

In [None]:
# python
!pip install pytesseract transformers datasets rouge-score nltk tensorboard py7zr evaluate contractions accelerate --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# install git-lfs for pushing model and logs to the Hugging Face Hub
!sudo apt-get install git-lfs --yes

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.9.2-1).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.


This example uses the [Hugging Face Hub](https://huggingface.co/models) as a remote model versioning service. To push our model to the Hub, you need a [Hugging Face account](https://huggingface.co/join). 
If you already have an account, you can skip the registration. 
Once you have an account, we use the `notebook_login` utility from the `huggingface_hub` package to log in and store our token (access key) on disk. 

In [None]:
!git config --global user.email "themocktailengineer@gmail.com"
!git config --global user.name "MocktaiLEngineer"

In [None]:
from huggingface_hub import notebook_login

notebook_login()


## 2. Load and prepare QMSum dataset

To load the `QMSum` dataset, we use the `load_dataset()` method from the 🤗 Datasets library.


In [None]:
from datasets import load_dataset

# If the dataset is gated/private, make sure you have run huggingface-cli login
dataset = load_dataset("MocktaiLEngineer/QMSum")


print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")
print(f"Val dataset size: {len(dataset['validation'])}")



  0%|          | 0/3 [00:00<?, ?it/s]

Train dataset size: 162
Test dataset size: 35
Val dataset size: 35


In [None]:
from datasets import Dataset, DatasetDict
import text_preprocessing as tp  # assumed to come from the `text-preprocessing` package on PyPI

custom_stopwords = set(['Uh', 'uh', 'Um', 'um', 'Okay', 'okay', 'Hmm', 'Mm-hmm', 'hmm', 'mm-hmm', 'yeah', 'Yeah', 'vocalsound', 'disfmarker', 'gap'])
punctuations = '#,.$%&\'()*+-/;<=>@[\\]^_{|}~'

def clean_transcript(speaker_content):
    return tp.preprocess_text(speaker_content, [
        tp.expand_contraction,
        lambda text: tp.remove_punctuation(text, punctuations=punctuations),
        lambda text: tp.remove_stopword(text, stop_words=custom_stopwords)
    ])

def create_transcript(entry, query):
    summary = query['answer']
    text_span = query['relevant_text_span'][0]
    start, end = map(int, text_span)

    transcript = ''
    prev_speaker = None

    for j in range(start, end+1):
        speaker_content = entry['meeting_transcripts'][j]['content']
        cleaned_transcript = clean_transcript(speaker_content)

        if len(cleaned_transcript.split()) <= 4:
            continue

        speaker = entry['meeting_transcripts'][j]['speaker']

        # Ensure we add a space at the end of previous speaker's dialogue if it's not there.
        if transcript and not transcript.endswith((' ', '\n')):
            transcript += ' '

        if prev_speaker == speaker:
            transcript += cleaned_transcript
        else:
            # When there's a speaker change, add speaker name, colon and cleaned_transcript to a new line
            transcript += '\n' + speaker + ': ' + cleaned_transcript
            prev_speaker = speaker

    return transcript.strip(), summary


def process_split(split_dataset):
    dataset = {'meeting_transcript': [], 'summary': []}

    for i in range(len(split_dataset)):
        entry = split_dataset[i]
        for query in entry['specific_query_list']:
            transcript, summary = create_transcript(entry, query)
            dataset['meeting_transcript'].append(transcript)
            dataset['summary'].append(summary)

    return Dataset.from_dict(dataset)

processed_dataset = DatasetDict({split: process_split(dataset[split]) for split in ['train', 'validation', 'test']})


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
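
The speaker-merging logic in `create_transcript` can be easier to follow on toy data. The snippet below is a minimal sketch, not the notebook's code: `merge_turns` and `toy_turns` are hypothetical names, and the cleaning step is left out. It keeps only turns with at least five words and joins consecutive turns from the same speaker onto one line.

```python
def merge_turns(turns, min_words=5):
    """Join consecutive turns by the same speaker and drop very short turns."""
    lines = []
    prev_speaker = None
    for speaker, content in turns:
        # Skip short utterances (4 words or fewer), mirroring create_transcript.
        if len(content.split()) < min_words:
            continue
        if speaker == prev_speaker:
            # Same speaker as the last kept turn: extend the current line.
            lines[-1] += " " + content
        else:
            # Speaker change: start a new "Speaker: text" line.
            lines.append(f"{speaker}: {content}")
            prev_speaker = speaker
    return "\n".join(lines)


# Hypothetical turns, made up for illustration.
toy_turns = [
    ("Marketing", "the remote control should be easy to use"),
    ("Marketing", "right"),
    ("Marketing", "and it should look fancy to our customers"),
    ("Project Manager", "we also need to stay within the budget"),
]
print(merge_turns(toy_turns))
```

Because a skipped turn does not reset the previous speaker, the two long Marketing turns end up on the same line even though a short turn sat between them.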


Let's check out an example from the dataset.

In [None]:
from random import randrange        


# Draw the random index from the train split itself, not from the (smaller) test split
sample = processed_dataset['train'][randrange(len(processed_dataset['train']))]
print(f"transcript: \n{sample['meeting_transcript']}\n---------------")
print(f"summary: \n{sample['summary']}\n---------------")

dialogue: 
Marketing: Can I ? So now the recent investigation we we have done fo of the remote control So the most important aspect for remote controls is to be fancy look and feel and not current functional look and feel And the second aspect is that the remote control should be technological innovative And the third most important aspect is to to is that the co remote control should be easy to use So are things we are we have speak about before And so you you can go after And there is a fashion watchers in Paris and Milan that have detected the following trends fruits and vegetables will be the most important theme for clothes shoes and furnitures So maybe if our remote control have to be a fruit form or vegetable form And the mm the material is expected to be spongy
---------------
summary: 
Marketing believed that the trend of fruits and vegetables that fashion watchers have detected in Milan and Paris is a good indication of what kind of style the remote should have. It could make

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "philschmid/flan-t5-base-samsum"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


In [None]:
processed_dataset

DatasetDict({
    train: Dataset({
        features: ['meeting_transcript', 'summary'],
        num_rows: 1095
    })
    validation: Dataset({
        features: ['meeting_transcript', 'summary'],
        num_rows: 237
    })
    test: Dataset({
        features: ['meeting_transcript', 'summary'],
        num_rows: 244
    })
})

In [None]:
from datasets import concatenate_datasets

# The maximum total input sequence length after tokenization. 
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([processed_dataset["train"], processed_dataset["validation"], processed_dataset["test"]]).map(lambda x: tokenizer(x["meeting_transcript"], truncation=True), batched=True, remove_columns=["meeting_transcript", "summary"])
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Max source length: {max_source_length}")

# The maximum total sequence length for target text after tokenization. 
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = concatenate_datasets([processed_dataset["train"], processed_dataset["validation"], processed_dataset["test"]]).map(lambda x: tokenizer(x["summary"], truncation=True), batched=True, remove_columns=["meeting_transcript", "summary"])
max_target_length = max([len(x) for x in tokenized_targets["input_ids"]])
print(f"Max target length: {max_target_length}")

Map:   0%|          | 0/1576 [00:00<?, ? examples/s]

Max source length: 512


Map:   0%|          | 0/1576 [00:00<?, ? examples/s]

Max target length: 262
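
Because the tokenizer is called with `truncation=True`, the lengths measured above are capped at the tokenizer's maximum, so a reported value of 512 is probably a ceiling rather than the length of the longest transcript (the inference warning in section 4 mentions a 1179-token input). The sketch below illustrates the effect with plain integers. `MODEL_MAX_LENGTH` and `raw_lengths` are assumptions made up for the example, not values read from the tokenizer.

```python
MODEL_MAX_LENGTH = 512  # assumed limit, matching the reported "Max source length"


def truncated_length(raw_length, limit=MODEL_MAX_LENGTH):
    """Length of a sequence after truncation to at most `limit` tokens."""
    return min(raw_length, limit)


# Hypothetical token counts; 1179 echoes the warning shown in section 4.
raw_lengths = [87, 430, 1179]
capped = [truncated_length(n) for n in raw_lengths]

print(f"Longest raw sequence: {max(raw_lengths)}")
print(f"Longest sequence after truncation: {max(capped)}")
```

To measure the true distribution of transcript lengths, one would tokenize without truncation and inspect the lengths before choosing `max_source_length`.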


In [None]:
def preprocess_function(sample, padding="max_length"):
    # add prefix to the input for t5
    inputs = ["summarize: " + item for item in sample["meeting_transcript"]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = processed_dataset.map(preprocess_function, batched=True, remove_columns=["meeting_transcript", "summary"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

Map:   0%|          | 0/1095 [00:00<?, ? examples/s]

Map:   0%|          | 0/237 [00:00<?, ? examples/s]

Map:   0%|          | 0/244 [00:00<?, ? examples/s]

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']
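
The label masking inside `preprocess_function` can be sketched without a tokenizer. The example below assumes the T5 convention that the pad token has id 0, and relies on PyTorch's cross-entropy loss ignoring targets equal to -100 by default. The token ids are made up and `mask_padding` is a hypothetical helper name.

```python
PAD_TOKEN_ID = 0     # T5 tokenizers use id 0 for <pad>
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(label_ids, pad_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Replace padding ids in one label sequence so the loss skips them."""
    return [tok if tok != pad_id else ignore_index for tok in label_ids]


# Made-up label ids, already padded to a common length of 5.
padded_labels = [[71, 1712, 1, 0, 0], [363, 19, 48, 1, 0]]
masked = [mask_padding(row) for row in padded_labels]
print(masked)
```

Only the padded positions change; real tokens, including the end-of-sequence id, still contribute to the loss.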


## 3. Fine-tune and evaluate FLAN-T5

With the dataset processed, we can start training our model. We first load the [FLAN-T5](https://huggingface.co/models?search=flan-t5) checkpoint defined in `model_checkpoint` from the Hugging Face Hub. 

In [None]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

We want to evaluate our model during training. The `Trainer` supports this through a `compute_metrics` function.  
The most commonly used metric for summarization tasks is [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)), short for Recall-Oriented Understudy for Gisting Evaluation. Unlike standard accuracy, it compares a generated summary against one or more reference summaries and scores their overlap.

We are going to use the `evaluate` library to compute the ROUGE score.

In [None]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Metric
metric = evaluate.load("rouge")

# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]
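
To build intuition for what the ROUGE metric loaded above measures, here is a deliberately simplified ROUGE-1 F1 computation. It is a sketch only: it lowercases and splits on whitespace, applies no stemming, and is not the implementation used by the `evaluate` library. `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: unigram overlap between two whitespace-tokenized texts."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "the remote should look fancy",
    "the remote control should be fancy",
)
print(f"ROUGE-1 F1: {score * 100:.2f}")
```

Four of the five predicted words appear in the six-word reference, giving a precision of 4/5 and a recall of 4/6. ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of fixed-size n-grams.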

Before we can start training, we need to create a `DataCollator` that takes care of padding our inputs and labels. We will use the `DataCollatorForSeq2Seq` from the 🤗 Transformers library. 

In [None]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
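
The effect of `pad_to_multiple_of=8` and `label_pad_token_id=-100` can be sketched in plain Python. This is an illustration of the padding rule under simple assumptions (right padding, made-up token ids), not the collator's actual implementation, and `pad_batch` is a hypothetical helper name.

```python
def pad_batch(sequences, pad_value, multiple_of=8):
    """Right-pad every sequence to the batch maximum, rounded up to a multiple."""
    longest = max(len(seq) for seq in sequences)
    # Round the longest length up to the next multiple of `multiple_of`.
    target = -(-longest // multiple_of) * multiple_of
    return [seq + [pad_value] * (target - len(seq)) for seq in sequences]


# Made-up token ids for a batch of two examples.
input_ids = [[8774, 10, 1], [8774, 10, 27, 43, 3, 9, 822, 81, 1]]
labels = [[71, 1712, 1], [363, 1]]

padded_inputs = pad_batch(input_ids, pad_value=0)  # inputs use the pad token id
padded_labels = pad_batch(labels, pad_value=-100)  # labels use the ignore index

print([len(row) for row in padded_inputs])
print([len(row) for row in padded_labels])
```

Since `preprocess_function` in this notebook already pads to `max_length`, the collator has little left to pad here; the rule matters more when the dataset is tokenized without padding.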


The last step is to define the hyperparameters (`TrainingArguments`) we want to use for our training. We are leveraging the [Hugging Face Hub](https://huggingface.co/models) integration of the `Trainer` to automatically push our checkpoints, logs and metrics during training into a repository.

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Hugging Face repository id
model_name = model_checkpoint.split("/")[-1]
repository_id = f"{model_name}-finetuned-QMSum-01"

batch_size = 4
num_train_epochs = 5

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=repository_id,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    fp16=False, # Overflows with fp16
    learning_rate=5e-5,
    num_train_epochs=num_train_epochs,
    # logging & evaluation strategies
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=500,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=3,
    load_best_model_at_end=True,
    # metric_for_best_model="overall_f1",
    # push to hub parameters
    push_to_hub=False,
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)

Cloning https://huggingface.co/MocktaiLEngineer/flan-t5-base-samsum-finetuned-QMSum-01 into local empty directory.


We can start our training by using the `train` method of the `Trainer`.

In [None]:
# Start training
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | No log | 2.508598 | 22.3096 | 7.3877 | 17.378 | 19.7119 | 19.0 |
| 2 | 2.602800 | 2.465805 | 22.6907 | 7.9177 | 17.9165 | 20.108 | 19.0 |
| 3 | 2.602800 | 2.443608 | 22.9879 | 7.9476 | 18.0347 | 20.3127 | 19.0 |
| 4 | 2.334600 | 2.441338 | 22.8732 | 8.0775 | 18.0966 | 20.3558 | 18.995902 |
| 5 | 2.334600 | 2.441553 | 23.0559 | 8.257 | 18.1721 | 20.6307 | 18.995902 |


TrainOutput(global_step=1370, training_loss=2.4232081949275774, metrics={'train_runtime': 1936.3638, 'train_samples_per_second': 2.827, 'train_steps_per_second': 0.708, 'total_flos': 3749046504652800.0, 'train_loss': 2.4232081949275774, 'epoch': 5.0})
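
The step count and the "No log" entry in the training results can be checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, which is consistent with the reported `global_step=1370`; the numbers come from the dataset sizes and training arguments above.

```python
import math

train_examples = 1095   # rows in the processed train split
batch_size = 4          # per_device_train_batch_size
num_train_epochs = 5
logging_steps = 500

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(f"Steps per epoch: {steps_per_epoch}, total steps: {total_steps}")

# Step of the most recent training-loss log available at the end of each epoch.
last_logged = {}
for epoch in range(1, num_train_epochs + 1):
    step = epoch * steps_per_epoch
    last_logged[epoch] = (step // logging_steps) * logging_steps
    print(f"epoch {epoch}: ends at step {step}, last log at step {last_logged[epoch] or 'No log'}")
```

The first epoch ends before step 500, so no training loss has been logged yet. Epochs 2 and 3 both show the value logged at step 500, and epochs 4 and 5 the value logged at step 1000, which is why the training loss column repeats.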


![flan-t5-tensorboard](../assets/flan-t5-tensorboard.png)

Nice, we have trained our model. 🎉 Let's evaluate the best model again on the test set.


In [None]:
trainer.evaluate()

{'eval_loss': 2.441338300704956,
 'eval_rouge1': 22.8732,
 'eval_rouge2': 8.0775,
 'eval_rougeL': 18.0966,
 'eval_rougeLsum': 20.3558,
 'eval_gen_len': 18.99590163934426,
 'eval_runtime': 55.1192,
 'eval_samples_per_second': 4.427,
 'eval_steps_per_second': 1.107,
 'epoch': 5.0}
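
The numbers returned by `trainer.evaluate()` match epoch 4 rather than epoch 5. With `load_best_model_at_end=True` and no `metric_for_best_model` set, the best checkpoint is chosen by evaluation loss. The sketch below replays that selection on the per-epoch values reported during training; the dictionaries are copied from the training results, and the variable names are made up for illustration.

```python
# Per-epoch values copied from the training results above.
eval_loss_by_epoch = {1: 2.508598, 2: 2.465805, 3: 2.443608, 4: 2.441338, 5: 2.441553}
rouge1_by_epoch = {1: 22.3096, 2: 22.6907, 3: 22.9879, 4: 22.8732, 5: 23.0559}

# Lower loss is better; higher ROUGE is better.
best_epoch_by_loss = min(eval_loss_by_epoch, key=eval_loss_by_epoch.get)
best_epoch_by_rouge1 = max(rouge1_by_epoch, key=rouge1_by_epoch.get)

print(f"Best epoch by eval loss: {best_epoch_by_loss} "
      f"(rouge1 = {rouge1_by_epoch[best_epoch_by_loss]})")
print(f"Best epoch by rouge1: {best_epoch_by_rouge1} "
      f"(rouge1 = {rouge1_by_epoch[best_epoch_by_rouge1]})")
```

The two criteria disagree by a small margin. If ROUGE should drive checkpoint selection, setting `metric_for_best_model="rouge1"` in the training arguments is one option to consider.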

In [None]:
repository_id

'flan-t5-base-samsum-finetuned-QMSum-01'

In [None]:
trainer.save_model(repository_id + "-final")

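The best `rouge1` score during training is `23.06` (epoch 5). The checkpoint reloaded at the end of training, the one with the lowest evaluation loss (epoch 4), reaches `22.87`, as `trainer.evaluate()` shows above. 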

Let's save our results and tokenizer to the Hugging Face Hub and create a model card. 

In [None]:
# Save our tokenizer and create model card
tokenizer.save_pretrained(repository_id)
trainer.create_model_card()
# Push the results to the hub
trainer.push_to_hub()

Upload file logs/events.out.tfevents.1685314927.4007f3941c51.1367.0:   0%|          | 1.00/8.19k [00:00<?, ?B/…

Upload file logs/events.out.tfevents.1685317097.4007f3941c51.1367.2:   0%|          | 1.00/613 [00:00<?, ?B/s]

To https://huggingface.co/MocktaiLEngineer/flan-t5-base-samsum-finetuned-QMSum-01
   eec3576..cf70bb4  main -> main




'https://huggingface.co/MocktaiLEngineer/flan-t5-base-samsum-finetuned-QMSum-01/commit/cf70bb4b8da717586d6b6aa167cea5eac3acea0a'

## 4. Run Inference

Now that we have a trained model, we can use it to run inference. We will use the `pipeline` API from 🤗 Transformers and a random example from the `test` split of our dataset.

In [None]:
from transformers import pipeline
from random import randrange        

# load model and tokenizer from huggingface hub with pipeline
summarizer = pipeline("summarization", model="MocktaiLEngineer/flan-t5-base-samsum-finetuned-QMSum-01", device=0)

# select a random test sample
sample = processed_dataset['test'][randrange(len(processed_dataset["test"]))]
print(f"dialogue: \n{sample['meeting_transcript']}\n---------------")

# summarize dialogue
res = summarizer(sample["meeting_transcript"])

print(f"flan-t5-base summary:\n{res[0]['summary_text']}")

Token indices sequence length is longer than the specified maximum sequence length for this model (1179 > 512). Running this sequence through the model will result in indexing errors


dialogue: 
Vikki Howells AM: Thank you I have got one further series of questions around outofcourt disposals which you have already mentioned briefly In your written evidence you say there is ongoing work exploring diversion rather than prosecution in respect of this Bill Firstly could you tell us a bit more about the work that is being done to explore this as an option ? 
Barry Hughes: We have been working with the National Police Chiefs Council Their lead is deputy chief constable Sara Glen She is responsible for developing the police approach to outofcourt disposals and simplifying the range of outofcourt disposals There is quite a range and life would be simpler and clearer to have fewer types of disposal with more clarity about what each one of them involved I must say this is primarily a matter for the police because there are a great many offences or reports of crime that do not reach the CPS because they are dealt with by way of an outofcourt disposal Any case that the police 