# Fine-tuning an Encoder-Decoder LLM (T5) for Summarization

### Please refer to the respective sections in the book for further details.


## Step 1. Installing libraries and loading data

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate

In [2]:
from datasets import load_dataset

xsum_dataset = load_dataset("EdinburghNLP/xsum")
xsum_dataset

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

In [3]:
def show_samples(dataset, num_samples=3, seed=42):
    sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
    for example in sample:
        print(f"\n'>> Summary: {example['summary']}'")
        print(f"'>> Document: {example['document']}'")


show_samples(xsum_dataset)


'>> Summary: As Chancellor George Osborne announced all English state schools will become academies, the Welsh Government continues to reject the model here.'
'>> Document: In Wales, councils are responsible for funding and overseeing schools.
But in England, Mr Osborne's plan will mean local authorities will cease to have a role in providing education.
Academies are directly funded by central government and head teachers have more freedom over admissions and to change the way the school works.
It is a significant development in the continued divergence of schools systems on either side of Offa's Dyke.
And although the Welsh Government will get extra cash to match the money for English schools to extend the school day, it can spend it on any devolved policy area.
Ministers have no plans to follow suit.
At the moment, governing bodies are responsible for setting school hours and they need ministerial permission to make significant changes.
There are already more than 2,000 secondary ac

## Step 2. Data pre-processing

In [4]:
from transformers import AutoTokenizer

model_checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [5]:
inputs = tokenizer("Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.")
inputs

{'input_ids': [7433, 18, 413, 2673, 33, 6168, 640, 8, 12580, 17600, 7, 11, 970, 51, 89, 2593, 11, 10987, 32, 1343, 227, 18368, 2953, 57, 16133, 4937, 5, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [6]:
tokenizer.convert_ids_to_tokens(inputs.input_ids)

['▁Clean',
 '-',
 'up',
 '▁operations',
 '▁are',
 '▁continuing',
 '▁across',
 '▁the',
 '▁Scottish',
 '▁Border',
 's',
 '▁and',
 '▁Du',
 'm',
 'f',
 'ries',
 '▁and',
 '▁Gall',
 'o',
 'way',
 '▁after',
 '▁flooding',
 '▁caused',
 '▁by',
 '▁Storm',
 '▁Frank',
 '.',
 '</s>']
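The `▁` prefix in the tokens above marks the start of a new word in the SentencePiece vocabulary, and `</s>` is the end-of-sequence token. As a rough illustration (a sketch only, not how `tokenizer.decode` is implemented), the original sentence can be recovered with plain string operations. The helper name `detokenize` and its `special_tokens` default are assumptions made for this example.

```python
# Tokens copied from the output above.
tokens = ['▁Clean', '-', 'up', '▁operations', '▁are', '▁continuing', '▁across',
          '▁the', '▁Scottish', '▁Border', 's', '▁and', '▁Du', 'm', 'f', 'ries',
          '▁and', '▁Gall', 'o', 'way', '▁after', '▁flooding', '▁caused', '▁by',
          '▁Storm', '▁Frank', '.', '</s>']


def detokenize(tokens, special_tokens=("</s>", "<pad>")):
    """Hypothetical helper: join subword pieces and turn '▁' back into spaces."""
    pieces = [tok for tok in tokens if tok not in special_tokens]
    return "".join(pieces).replace("▁", " ").strip()


print(detokenize(tokens))
```

In practice `tokenizer.decode(inputs.input_ids, skip_special_tokens=True)` is the call to use; the sketch only shows why pieces such as `'▁Border'` and `'s'` belong to one word.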

In [8]:
import numpy as np

def word_count_analysis(dataset):
    document_word_counts = []
    summary_word_counts = []

    for example in dataset:
        document_words = example["document"].split()

        summary_words = example["summary"].split()

        document_word_counts.append(len(document_words))
        summary_word_counts.append(len(summary_words))

    return document_word_counts, summary_word_counts

document_word_counts, summary_word_counts = word_count_analysis(xsum_dataset['train'])

mean_document_word_count = np.mean(document_word_counts)
median_document_word_count = np.median(document_word_counts)
mean_summary_word_count = np.mean(summary_word_counts)
median_summary_word_count = np.median(summary_word_counts)

print("Mean Document Word Count:", mean_document_word_count)
print("Median Document Word Count:", median_document_word_count)
print("Mean Summary Word Count:", mean_summary_word_count)
print("Median Summary Word Count:", median_summary_word_count)


Mean Document Word Count: 373.8646328015879
Median Document Word Count: 295.0
Mean Summary Word Count: 21.09764512730035
Median Summary Word Count: 21.0


In [9]:
max_input_length = 512  # token limit for documents; longer inputs are truncated
max_target_length = 30  # token limit for reference summaries


def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["document"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        examples["summary"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
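Note that the limits above are counted in tokens, whereas the statistics printed earlier are counted in whitespace-separated words. The following sketch compares the two counts for the single example sentence tokenized earlier in this notebook. One sentence is not a reliable estimate for the whole dataset, so treat the ratio as a rough indication only.

```python
# Sentence and token ids copied from the tokenizer example earlier in this notebook.
text = (
    "Clean-up operations are continuing across the Scottish Borders "
    "and Dumfries and Galloway after flooding caused by Storm Frank."
)
input_ids = [7433, 18, 413, 2673, 33, 6168, 640, 8, 12580, 17600, 7, 11, 970, 51,
             89, 2593, 11, 10987, 32, 1343, 227, 18368, 2953, 57, 16133, 4937, 5, 1]

num_words = len(text.split())   # whitespace words, as in word_count_analysis
num_tokens = len(input_ids)     # subword tokens, including the final </s>
tokens_per_word = num_tokens / num_words

print(f"{num_words} words -> {num_tokens} tokens "
      f"({tokens_per_word:.2f} tokens per word)")
```

If a similar ratio held across the dataset, a 21-word summary would sit close to the 30-token target limit, so it may be worth checking tokenized lengths directly before settling on `max_target_length`.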

In [10]:
tokenized_datasets = xsum_dataset.map(preprocess_function, batched=True)

In [11]:
generated_summary = "Hurricane Patricia is a category 5 storm."
reference_summary = "Hurricane Patricia has been rated as a category 5 storm."

In [None]:
!pip install rouge_score

In [13]:
import evaluate

rouge_score = evaluate.load("rouge")

In [14]:
scores = rouge_score.compute(
    predictions=[generated_summary], references=[reference_summary]
)
scores

{'rouge1': 0.7058823529411764,
 'rouge2': 0.5333333333333333,
 'rougeL': 0.7058823529411764,
 'rougeLsum': 0.7058823529411764}
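To see where these numbers come from, the sketch below recomputes the ROUGE-1 and ROUGE-2 F1 scores by hand using only the standard library. It assumes a simplified tokenization (lowercasing and keeping alphanumeric runs) and no stemming, which is enough for this sentence pair but is not a replacement for the `rouge_score` package. The function names `ngrams` and `rouge_n_f1` are made up for this example.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Lowercase, keep alphanumeric tokens, and count the n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(prediction, reference, n):
    """F1 of clipped n-gram overlap between a prediction and a reference."""
    pred, ref = ngrams(prediction, n), ngrams(reference, n)
    overlap = sum((pred & ref).values())  # each n-gram counted at most min(pred, ref) times
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


generated_summary = "Hurricane Patricia is a category 5 storm."
reference_summary = "Hurricane Patricia has been rated as a category 5 storm."

print(rouge_n_f1(generated_summary, reference_summary, 1))  # 6 shared unigrams: 12/17
print(rouge_n_f1(generated_summary, reference_summary, 2))  # 4 shared bigrams: 8/15
```

For ROUGE-1 the prediction has 7 unigrams and the reference has 10, of which 6 match, giving precision 6/7, recall 6/10 and F1 = 12/17 ≈ 0.7059. For ROUGE-2 there are 6 and 9 bigrams with 4 matches, giving F1 = 8/15 ≈ 0.5333.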

In [15]:
scores["rouge1"]

0.7058823529411764

In [16]:
!pip install nltk

In [17]:
import nltk

# Newer NLTK releases may require nltk.download("punkt_tab") for sent_tokenize.
nltk.download("punkt")

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [18]:
from nltk.tokenize import sent_tokenize


def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])


print(three_sentence_summary(xsum_dataset["train"][1]["document"]))

A fire alarm went off at the Holiday Inn in Hope Street at about 04:20 BST on Saturday and guests were asked to leave the hotel.
As they gathered outside they saw the two buses, parked side-by-side in the car park, engulfed by flames.
One of the tour groups is from Germany, the other from China and Taiwan.


In [19]:
def evaluate_baseline(dataset, metric):
    summaries = [three_sentence_summary(text) for text in dataset["document"]]
    return metric.compute(predictions=summaries, references=dataset["summary"])

## Step 3. Model training

In [21]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, device_map="auto")

In [22]:
from transformers import Seq2SeqTrainingArguments

batch_size = 8
num_train_epochs = 8
logging_steps = len(tokenized_datasets["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned-xsum",
    evaluation_strategy="epoch",  # called eval_strategy in newer transformers releases
    learning_rate=5.6e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
)

In [23]:
import numpy as np

def calculate_evaluation_metrics(prediction_data):
    predicted, true_labels = prediction_data
    translated_predictions = tokenizer.batch_decode(predicted, skip_special_tokens=True)
    true_labels = np.where(true_labels != -100, true_labels, tokenizer.pad_token_id)
    translated_labels = tokenizer.batch_decode(true_labels, skip_special_tokens=True)
    formatted_predictions = ["\n".join(sent_tokenize(pred_text.strip())) for pred_text in translated_predictions]
    formatted_labels = ["\n".join(sent_tokenize(label_text.strip())) for label_text in translated_labels]
    rouge_results = rouge_score.compute(
        predictions=formatted_predictions, references=formatted_labels, use_stemmer=True
    )
    scaled_results = {metric: score * 100 for metric, score in rouge_results.items()}
    return {metric_key: round(score_value, 4) for metric_key, score_value in scaled_results.items()}


In [24]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [25]:
tokenized_datasets = tokenized_datasets.remove_columns(
    xsum_dataset["train"].column_names
)

In [26]:
features = [tokenized_datasets["train"][i] for i in range(2)]
data_collator(features)

{'input_ids': tensor([[  37,  423,  583,  ...,   30,  142,    1],
        [  71, 1472, 6196,  ...,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([[ 7433,    18,   413,  2673,    33,  6168,   640,     8, 12580, 17600,
             7,    11,   970,    51,    89,  2593,    11, 10987,    32,  1343,
           227, 18368,  2953,    57, 16133,  4937,     5,     1],
        [ 2759,  8548, 14264,    43,   118, 10932,    57,  1472,    16,     3,
             9, 18024,  1584,   739,  3211,    16, 27874,   690,  2050,     5,
             1,  -100,  -100,  -100,  -100,  -100,  -100,  -100]]), 'decoder_input_ids': tensor([[    0,  7433,    18,   413,  2673,    33,  6168,   640,     8, 12580,
         17600,     7,    11,   970,    51,    89,  2593,    11, 10987,    32,
          1343,   227, 18368,  2953,    57, 16133,  4937,     5],
        [    0,  2759,  8548, 14264,    43,   118, 10932,    57,  1472,    16,
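In the collator output above, padded label positions are filled with `-100` so that the loss ignores them, and `decoder_input_ids` appear to be the labels shifted one position to the right with a start token in front. The following pure-Python sketch mimics that shift on plain lists. It assumes that both the decoder start token id and the pad token id are `0`, as they are for T5, and the function name `shift_right` is chosen for this example only.

```python
def shift_right(labels, decoder_start_token_id=0, pad_token_id=0):
    """Sketch of the label shift used to build decoder inputs for one sequence."""
    # Prepend the start token and drop the last label to keep the length unchanged.
    shifted = [decoder_start_token_id] + labels[:-1]
    # -100 is only meaningful to the loss; the decoder needs a real token id.
    return [pad_token_id if tok == -100 else tok for tok in shifted]


# First label row copied from the output above.
labels = [7433, 18, 413, 2673, 33, 6168, 640, 8, 12580, 17600, 7, 11, 970, 51,
          89, 2593, 11, 10987, 32, 1343, 227, 18368, 2953, 57, 16133, 4937, 5, 1]

print(shift_right(labels))
print(shift_right([5, 1, -100, -100]))  # padded positions become the pad id
```

The first printed list matches the first row of `decoder_input_ids` shown above: at each step the decoder sees the previous target token and is trained to predict the next one.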
        

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=calculate_evaluation_metrics,
)

In [28]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum
1,2.7272,2.380859,30.1122,9.0727,24.192,24.1895
2,2.5379,2.303072,30.8973,9.6536,24.9135,24.9103
3,2.4586,2.262487,31.3401,10.078,25.3178,25.3218
4,2.4075,2.238034,31.4357,10.2628,25.48,25.4768
5,2.3705,2.21962,31.691,10.4986,25.7174,25.7084
6,2.3431,2.209808,32.0841,10.7143,25.9952,25.9915
7,2.3241,2.201328,32.0676,10.7231,25.9653,25.9639
8,2.3117,2.198619,32.1047,10.7836,26.0316,26.0205




TrainOutput(global_step=204048, training_loss=2.4350560664719323, metrics={'train_runtime': 32650.3728, 'train_samples_per_second': 49.995, 'train_steps_per_second': 6.249, 'total_flos': 2.203709646914519e+17, 'train_loss': 2.4350560664719323, 'epoch': 8.0})
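The figures in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes training on a single device without gradient accumulation, and it assumes the Trainer's default checkpoint interval of 500 steps (no `save_steps` is set in the arguments above). All other numbers are copied from this notebook.

```python
import math

num_train_examples = 204045   # size of the XSum train split
batch_size = 8
num_train_epochs = 8
train_runtime = 32650.3728    # seconds, from TrainOutput
save_steps = 500              # assumption: default checkpoint interval

# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
last_checkpoint = (total_steps // save_steps) * save_steps
samples_per_second = num_train_examples * num_train_epochs / train_runtime

print(steps_per_epoch)   # 25506
print(total_steps)       # 204048, matching global_step
print(last_checkpoint)   # 204000
print(round(samples_per_second, 3))
print(round(train_runtime / 3600, 1), "hours")
```

Under these assumptions the last saved checkpoint falls at step 204000, slightly before the final step 204048, which is consistent with the `checkpoint-204000` directory loaded in the inference step.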

## Step 4. Model Evaluation

In [29]:
trainer.evaluate()

{'eval_loss': 2.1986191272735596,
 'eval_rouge1': 32.1047,
 'eval_rouge2': 10.7836,
 'eval_rougeL': 26.0316,
 'eval_rougeLsum': 26.0205,
 'eval_runtime': 873.7662,
 'eval_samples_per_second': 12.969,
 'eval_steps_per_second': 1.622,
 'epoch': 8.0}

### Step 4.1 Inference

In [31]:
from transformers import pipeline

model_id = "t5-small-finetuned-xsum/checkpoint-204000/"
summarizer = pipeline("summarization", model=model_id)

In [42]:
def print_summary(idx):
    # Look up the test document once and reuse it for the summarizer and the printout.
    document = xsum_dataset["test"][idx]["document"]
    summary = summarizer(document)[0]["summary_text"]
    print(f"'>>> Document: {document}'")
    print(f"\n'>>> Summary: {summary}'")

In [43]:
print_summary(100)

'>>> Document: The British Transport Police said the move was a "proportionate response" in the face of a mounting terrorism threat.
Specially trained officers will begin carrying the stun weapons over the next few weeks.
It brings the Scottish force into line with their counterpart in England, where Tasers have been used since 2011.
The weapons are used to incapacitate suspects through the use of an electric current.
Temporary Assistant Chief Constable Alun Thomas said: "This decision is not based on specific intelligence of any criminal behaviour or imminent threat, but will allow us the option to deploy Taser devices where, in the course of their duty, an officer needs to protect the public or themselves by using force.
"The current threat to the UK from international terrorism remains 'severe', meaning an attack is highly likely.
"Recent terrorist attacks across the world are a stark reminder that the threat from terrorism is a genuine risk, and it is important that we keep our sec

In [44]:
print_summary(0)

'>>> Document: Prison Link Cymru had 1,099 referrals in 2015-16 and said some ex-offenders were living rough for up to a year before finding suitable accommodation.
Workers at the charity claim investment in housing would be cheaper than jailing homeless repeat offenders.
The Welsh Government said more people than ever were getting help to address housing problems.
Changes to the Housing Act in Wales, introduced in 2015, removed the right for prison leavers to be given priority for accommodation.
Prison Link Cymru, which helps people find accommodation after their release, said things were generally good for women because issues such as children or domestic violence were now considered.
However, the same could not be said for men, the charity said, because issues which often affect them, such as post traumatic stress disorder or drug dependency, were often viewed as less of a priority.
Andrew Stevens, who works in Welsh prisons trying to secure housing for prison leavers, said the need