<a href="https://colab.research.google.com/github/AndreRab/T5-small-finetuned-for-summarization-task/blob/main/Summarization_task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this notebook, I explore using the T5 model to summarize news articles. Automatically generating concise summaries of long articles can make information much easier to access and digest. The goal is to fine-tune T5-small, a compact variant of T5 that is efficient and performs well across many natural language processing tasks, including summarization.

# Data downloading

In [1]:
!pip install datasets


In [2]:
from datasets import load_dataset

In [3]:
dataset = load_dataset('abisee/cnn_dailymail', '3.0.0')


Let's see what our dataset looks like.

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

In [5]:
dataset['train']['article'][0]

'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details o

In [6]:
dataset['train']['highlights'][0]

"Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .\nYoung actor says he has no plans to fritter his cash away .\nRadcliffe's earnings from first five Potter films have been held in trust fund ."

In [7]:
def show_samples_from_dataset(dataset, n_samples=3):
  samples = dataset['test'].shuffle().select(range(n_samples))
  for sample in samples:
    print(f"Article: {sample['article']}")
    print(f"highlights: {sample['highlights']}")
    print()

In [8]:
show_samples_from_dataset(dataset, 1)

Article: (CNN)The cover-up is often worse than the crime. Henry Louis Gates stands accused of scrubbing part of a segment in his PBS documentary series "Finding Your Roots" because the actor Ben Affleck put pressure on him. Affleck's concern was that the segment would have aired his family's dirty laundry, which includes a slaveholding ancestor, Benjamin Cole. Affleck said, in a statement posted on Facebook, that he "didn't want any television show about my family to include a guy who owned slaves. I was embarrassed." And Gates later explained that he subbed that part of the segment for another that made for more "compelling television." But providing a window into the importance of slavery's past to America's present should never just be about what makes for good television. Gates missed an opportunity. And Affleck's initial reluctance to acknowledge his truth (an impulse, he said on Facebook, he regrets) is surprising. Last month, Affleck lent his star power to support continued fore

Let's shrink the dataset, keeping 5% of each split.

In [9]:
new_size = 0.05

for key in dataset.keys():
  dataset[key] = dataset[key].train_test_split(train_size = new_size, seed = 42)['train']

In [10]:
dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 14355
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 668
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 574
    })
})
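The row counts above can be checked by hand. This is a minimal sketch, assuming that a fractional `train_size` is rounded down to a whole number of rows, which is consistent with the sizes printed above; the helper name `subset_size` is hypothetical.

```python
import math

# Split sizes reported by the notebook before subsampling.
original_sizes = {"train": 287113, "validation": 13368, "test": 11490}
new_size = 0.05


def subset_size(n_rows: int, fraction: float) -> int:
    """Rows kept when a fractional size is rounded down (assumed behaviour)."""
    return math.floor(n_rows * fraction)


subset_sizes = {name: subset_size(n, new_size) for name, n in original_sizes.items()}
print(subset_sizes)
```

Under that assumption the three values match the `num_rows` fields of the reduced `DatasetDict`.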

# Tokenizer and data preprocessing

In [11]:
from transformers import AutoTokenizer


In [12]:
model_checkpoint = 'google-t5/t5-small'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


Let's see how the tokenizer works.

In [13]:
encoded = tokenizer('I love you! jfhdjhfdjkfdhf')  # avoid shadowing the built-in input()
encoded

{'input_ids': [27, 333, 25, 55, 3, 354, 89, 107, 26, 354, 107, 89, 26, 354, 157, 89, 26, 107, 89, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [14]:
tokenizer.convert_ids_to_tokens(encoded['input_ids'])

['▁I',
 '▁love',
 '▁you',
 '!',
 '▁',
 'j',
 'f',
 'h',
 'd',
 'j',
 'h',
 'f',
 'd',
 'j',
 'k',
 'f',
 'd',
 'h',
 'f',
 '</s>']
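The token list above shows two things: common words map to a single piece, while the nonsense string falls back to one piece per character, and `▁` marks where a space preceded the piece. The sketch below is a rough approximation of decoding, not the tokenizer's own implementation; the helper `pieces_to_text` is hypothetical and only joins the pieces back into text.

```python
# Pieces copied from the output above.
tokens = ['▁I', '▁love', '▁you', '!', '▁',
          'j', 'f', 'h', 'd', 'j', 'h', 'f', 'd', 'j', 'k', 'f', 'd', 'h', 'f',
          '</s>']


def pieces_to_text(pieces, eos_token="</s>", space_marker="▁"):
    """Join subword pieces, turning the space marker back into spaces."""
    kept = [piece for piece in pieces if piece != eos_token]
    return "".join(kept).replace(space_marker, " ").strip()


print(pieces_to_text(tokens))
```

In practice `tokenizer.decode(..., skip_special_tokens=True)` does this job, as used later in the notebook.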

In [15]:
max_input_length = 1024
max_output_length = 128

def preprocess_input(examples):
  model_inputs = tokenizer(examples['article'], max_length = max_input_length, truncation=True)
  targets = tokenizer(examples['highlights'], max_length = max_output_length, truncation=True)
  model_inputs['labels'] = targets['input_ids']
  return model_inputs
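One variation worth noting: T5 checkpoints were pretrained with task prefixes, and `summarize: ` is the prefix conventionally used for summarization. The preprocessing above passes the raw article text, and this notebook does not test whether adding the prefix would change the reported scores. The helper below is a hypothetical sketch of how a batch could be prefixed before tokenization; it is not part of the original pipeline.

```python
PREFIX = "summarize: "  # conventional T5 summarization prefix


def add_task_prefix(examples, prefix=PREFIX):
    """Return a copy of a batch with the prefix prepended to every article."""
    prefixed = dict(examples)
    prefixed["article"] = [prefix + article for article in examples["article"]]
    return prefixed


batch = {"article": ["First article.", "Second article."],
         "highlights": ["First summary.", "Second summary."]}
print(add_task_prefix(batch)["article"])
```

The function works on the dict-of-lists batches that `map(..., batched=True)` passes to a preprocessing function, so it could be called at the top of `preprocess_input`.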

In [16]:
dataset_tokenized = dataset.copy()
for key in dataset.keys():
  dataset_tokenized[key] = dataset_tokenized[key].map(preprocess_input, batched = True)


In [17]:
dataset_tokenized

{'train': Dataset({
     features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
     num_rows: 14355
 }),
 'validation': Dataset({
     features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
     num_rows: 668
 }),
 'test': Dataset({
     features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
     num_rows: 574
 })}

Remove unnecessary features from the dataset

In [18]:
# Drop the raw text and id columns by name; the model only needs the tokenized features
columns_to_remove = ['article', 'highlights', 'id']
for key in dataset_tokenized.keys():
    dataset_tokenized[key] = dataset_tokenized[key].remove_columns(columns_to_remove)

In [19]:
dataset_tokenized

{'train': Dataset({
     features: ['input_ids', 'attention_mask', 'labels'],
     num_rows: 14355
 }),
 'validation': Dataset({
     features: ['input_ids', 'attention_mask', 'labels'],
     num_rows: 668
 }),
 'test': Dataset({
     features: ['input_ids', 'attention_mask', 'labels'],
     num_rows: 574
 })}

# Metrics

In [20]:
!pip install rouge_score



In [21]:
!pip install evaluate



In [22]:
import evaluate
rouge_score = evaluate.load('rouge')


Let's check how the ROUGE metric works.

In [23]:
sentence_one = 'I like cars!'
sentence_two = 'I think, cars are so interesting!'

rouge_score.compute(predictions=[sentence_one], references=[sentence_two])

{'rouge1': 0.4444444444444444,
 'rouge2': 0.0,
 'rougeL': 0.4444444444444444,
 'rougeLsum': 0.4444444444444444}

In [24]:
sentence_two = 'I hate cars!'

rouge_score.compute(predictions=[sentence_one], references=[sentence_two])

{'rouge1': 0.6666666666666666,
 'rouge2': 0.0,
 'rougeL': 0.6666666666666666,
 'rougeLsum': 0.6666666666666666}
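The two results above can be reproduced by hand. The sketch below is a simplified re-implementation of the ROUGE-N F1 score, assuming lowercase alphanumeric tokenization and no stemming; it is meant to show the arithmetic, not to replace the `evaluate` implementation. The names `simple_tokens` and `rouge_n_f1` are hypothetical.

```python
import re
from collections import Counter


def simple_tokens(text):
    """Lowercase the text and keep alphanumeric runs only (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge_n_f1(prediction, reference, n=1):
    """F1 over the n-grams shared by prediction and reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(simple_tokens(prediction)), ngrams(simple_tokens(reference))
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(rouge_n_f1('I like cars!', 'I think, cars are so interesting!'))  # precision 2/3, recall 2/6
print(rouge_n_f1('I like cars!', 'I hate cars!'))                       # precision 2/3, recall 2/3
```

Note that "I like cars!" scores higher against "I hate cars!" than against the longer, more similar sentence: ROUGE measures word overlap, not meaning.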

In [25]:
import nltk
nltk.download('punkt_tab')
nltk.download("punkt")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Model fine-tuning

In [26]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)


In [27]:
from transformers import Seq2SeqTrainingArguments

batch_size = 8
train_epochs = 4
logging_steps = len(dataset_tokenized['train']) // batch_size

args = Seq2SeqTrainingArguments(
    output_dir = f'{model_checkpoint}-finetuned-cnn_daily_mail',
    evaluation_strategy='epoch',
    learning_rate=1e-4,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    weight_decay = 0.1,
    num_train_epochs=train_epochs,
    logging_steps=logging_steps,
    predict_with_generate=True
)



In [28]:
import numpy as np
from nltk import sent_tokenize

def compute_metrics(eval_pred):
  predictions, targets = eval_pred

  # Labels are padded with -100 so the loss ignores them; swap in the pad token before decoding
  targets = np.where(targets != -100, targets, tokenizer.pad_token_id)

  decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
  decoded_targets = tokenizer.batch_decode(targets, skip_special_tokens=True)

  # rougeLsum expects one sentence per line
  decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]
  decoded_targets = ["\n".join(sent_tokenize(target.strip())) for target in decoded_targets]

  return rouge_score.compute(predictions=decoded_preds, references=decoded_targets, use_stemmer=True)
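The `-100` handling in `compute_metrics` deserves a closer look. `DataCollatorForSeq2Seq` pads label sequences with `-100` by default so that padded positions are ignored by the loss, but `-100` is not a valid token id and has to be replaced before decoding. The sketch below mirrors the `np.where` call using plain lists; the pad id of `0` is an assumption about the T5 tokenizer, and the first row uses made-up ids for illustration.

```python
IGNORE_INDEX = -100  # label padding value ignored by the loss
PAD_TOKEN_ID = 0     # assumption: pad token id of the T5 tokenizer


def unmask_labels(label_rows, pad_token_id=PAD_TOKEN_ID):
    """Replace every ignored position with the pad token id."""
    return [
        [token if token != IGNORE_INDEX else pad_token_id for token in row]
        for row in label_rows
    ]


# Two label rows padded to the same length (illustrative ids).
batch = [[8774, 55, 1, -100, -100],
         [27, 333, 25, 55, 1]]
print(unmask_labels(batch))
```

After this step, `skip_special_tokens=True` removes the pad tokens during decoding, so they do not leak into the reference text.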

In [29]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [30]:
from transformers import Seq2SeqTrainer

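trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=dataset_tokenized["train"],
    # Per-epoch evaluation runs on the test subset, so the metrics logged during training are test-set scores
    eval_dataset=dataset_tokenized["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)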


In [31]:
trainer.train()


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum |
|---|---|---|---|---|---|---|
| 1 | 1.9261 | 1.764953 | 0.249336 | 0.115679 | 0.206068 | 0.233952 |
| 2 | 1.8406 | 1.751016 | 0.24708 | 0.115698 | 0.204712 | 0.231854 |
| 3 | 1.7986 | 1.749189 | 0.248771 | 0.114355 | 0.204596 | 0.232672 |
| 4 | 1.7705 | 1.751394 | 0.249125 | 0.114944 | 0.204852 | 0.233142 |




TrainOutput(global_step=7180, training_loss=1.8339097680487673, metrics={'train_runtime': 5566.3891, 'train_samples_per_second': 10.315, 'train_steps_per_second': 1.29, 'total_flos': 1.5525127303790592e+16, 'train_loss': 1.8339097680487673, 'epoch': 4.0})
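The `global_step` value and the `logging_steps` setting both follow from the training-set size. This is a short worked calculation, assuming a single device and no gradient accumulation (neither is configured in the arguments above).

```python
import math

n_train = 14355   # rows in the reduced training split
batch_size = 8
train_epochs = 4

# Value passed as logging_steps: integer division drops the final partial batch.
logging_steps = n_train // batch_size

# Optimizer steps per epoch: the final partial batch still counts as a step.
steps_per_epoch = math.ceil(n_train / batch_size)

total_steps = steps_per_epoch * train_epochs

print(logging_steps, steps_per_epoch, total_steps)
```

Because `logging_steps` is one step short of a full epoch, the training loss is logged roughly once per epoch, which is why the results list one training-loss value per epoch.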

Let's define a function to see model predictions

In [32]:
def show_samples_with_model_predictions(dataset, n_samples=3):
  model.to('cpu')
  samples = dataset['validation'].shuffle().select(range(n_samples))
  for sample in samples:
    # Truncate the article the same way as in preprocess_input
    inputs = tokenizer(sample['article'], max_length=max_input_length, truncation=True, return_tensors="pt")
    # Let the summary grow up to the label length used in preprocessing
    outputs = model.generate(**inputs, max_length=max_output_length)
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Article: {sample['article']}\n")
    print(f"highlights: {sample['highlights']}\n")
    print(f"Model prediction: {decoded_output}\n")

In [33]:
show_samples_with_model_predictions(dataset, 1)

Article: Tottenham have announced an agreement has been reached with Archway Sheet Metal Works which clears the path for their new stadium to be built. The club have plans in place to construct a new 56,000-seater stadium on their White Hart Lane site but have faced a lengthy court battle with Archway - who refused to relocate to allow Tottenham to begin the process. Earlier in the month the business decided not to appeal against a High Court ruling which forced them to find new premises and now Spurs have announced they have reached a private deal with Archway that will allow them to take over the land next year. How Tottenham's new stadium will look for night games from 2018-19 season onwards . A short statement on the club's website read: 'Tottenham Hotspur Football Club, Archway Sheet Metal Works Ltd and the Josif Family (Archway) are delighted to announce that a private agreement has been reached for the purchase of Archway's property on Paxton Road by the Club. 'In order to allow

In [34]:
from huggingface_hub import notebook_login

notebook_login()


In [37]:
trainer.push_to_hub(f'{model_checkpoint}-finetuned-cnn_daily_mail')


CommitInfo(commit_url='https://huggingface.co/AndreiRabau/t5-small-finetuned-cnn_daily_mail/commit/0648254bfedb14644d81bb39bb9510128995c9e7', commit_message='google-t5/t5-small-finetuned-cnn_daily_mail', commit_description='', oid='0648254bfedb14644d81bb39bb9510128995c9e7', pr_url=None, repo_url=RepoUrl('https://huggingface.co/AndreiRabau/t5-small-finetuned-cnn_daily_mail', endpoint='https://huggingface.co', repo_type='model', repo_id='AndreiRabau/t5-small-finetuned-cnn_daily_mail'), pr_revision=None, pr_num=None)

# Conclusion
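My findings: after four epochs of fine-tuning on a 5% subset of CNN/DailyMail, T5-small reached a ROUGE-1 of 0.249, a ROUGE-2 of 0.115, and a ROUGE-L of 0.205 on the test subset used for evaluation. Training loss fell from 1.93 to 1.77, while the ROUGE scores stayed essentially flat across epochs, so most of the measured summarization quality was already present after the first epoch.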