# Abstractive Summarization using PEGASUS on CNN/DailyMail

In this notebook, we fine-tune Google's **PEGASUS** model for abstractive summarization on the CNN/DailyMail dataset. PEGASUS was specifically pre-trained for summarization tasks using a novel self-supervised gap-sentence generation objective.


### What is PEGASUS?

* PEGASUS is a transformer-based encoder-decoder model specifically designed for abstractive summarization.

* It was introduced by Google Research in 2020.

* It improves summarization by pretraining with a novel gap-sentence generation objective: masking whole sentences and training the model to generate those sentences from the rest of the text, simulating summarization during pretraining.
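
The gap-sentence idea described above can be illustrated with a toy sketch. This is not the original pre-training pipeline: the sentence splitter is naive, `unigram_f1` is a simplified stand-in for ROUGE-1 F1, and the names `make_gsg_example`, `gap_ratio`, and the `<mask_1>` placeholder are chosen here for illustration only.

```python
import re
from collections import Counter

MASK_SENT = "<mask_1>"  # placeholder string chosen for illustration

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def unigram_f1(candidate, reference):
    # Unigram-overlap F1, a simplified stand-in for ROUGE-1 F1.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def make_gsg_example(document, gap_ratio=0.3):
    """Mask the sentences that overlap most with the rest of the document."""
    sentences = split_sentences(document)
    n_gaps = max(1, round(len(sentences) * gap_ratio))
    # Score each sentence against all the other sentences joined together.
    scores = []
    for i, sent in enumerate(sentences):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scores.append(unigram_f1(sent, rest))
    # Highest score first; ties are broken by position in the document.
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    chosen = set(ranked[:n_gaps])
    # Source: document with chosen sentences masked. Target: the masked sentences.
    source = " ".join(MASK_SENT if i in chosen else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(chosen))
    return source, target

doc = ("PEGASUS masks whole sentences. The model must generate the masked sentences. "
       "The rest of the text is the input.")
source, target = make_gsg_example(doc)
print(source)
print(target)
```

The masked document plays the role of the article and the removed sentences play the role of the summary, which is why this objective resembles summarization.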



### Key Points:

- Architecture: Standard Transformer encoder-decoder like BART and T5.

- Pretraining Objective: Gap Sentence Generation (GSG), which trains the model to predict important sentences that resemble summary sentences.

- Fine-tuning: Done on downstream summarization datasets such as CNN/DailyMail, XSum.

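- Performance: In the original 2020 paper, PEGASUS reported state-of-the-art ROUGE scores on 12 downstream summarization datasets, including CNN/DailyMail and XSum. Its standing relative to BART and T5 varies by dataset and model size.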

### Why PEGASUS is important for Abstractive Summarization:

- Its pretraining objective teaches it to identify which sentences are summary-worthy.

- Fine-tuning on summarization datasets makes it highly effective at generating fluent, concise, and relevant summaries.

- Works well for both short and moderately long documents.

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
!pip install transformers sentencepiece torch datasets accelerate rouge_score --quiet

## 🔹 Step 1: Load the CNN/DailyMail Dataset (1% Subset)

We use a small 1% subset of the dataset to demonstrate fine-tuning for faster execution and debugging.


In [None]:
from datasets import load_dataset
from transformers import PegasusForConditionalGeneration, PegasusTokenizer, Trainer, TrainingArguments
import torch

dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:1%]")  # 1% subset for quick fine-tuning

In [None]:
dataset[0]['article']

'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details o

## 🔹 Step 2: Load PEGASUS Tokenizer and Model

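We load the `google/pegasus-cnn_dailymail` checkpoint, which has already been fine-tuned on CNN/DailyMail for news summarization. The training below therefore continues fine-tuning from that checkpoint rather than starting from the generic pre-trained model.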


In [None]:
model_name = "google/pegasus-cnn_dailymail"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
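
The two weights named in the warning above are position embeddings. As far as we are aware, the Hugging Face PEGASUS implementation uses fixed sinusoidal tables that are computed at load time rather than learned, which would explain why they are missing from the checkpoint and why the warning is generally treated as harmless for this model. The sketch below builds such a table with NumPy. The function name `sinusoidal_positions` and the half-sine, half-cosine column layout are assumptions for illustration and may not match the library's internals exactly.

```python
import numpy as np

def sinusoidal_positions(n_positions, dim):
    """Return an (n_positions, dim) table of fixed sinusoidal position encodings."""
    if dim % 2 != 0:
        raise ValueError("dim must be even for this sketch")
    positions = np.arange(n_positions)[:, None]                # shape (n_positions, 1)
    # Frequencies fall geometrically as the column index grows.
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))   # shape (dim // 2,)
    angles = positions * inv_freq                              # shape (n_positions, dim // 2)
    # Assumed layout: sines in the first half of the columns, cosines in the second.
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

table = sinusoidal_positions(n_positions=512, dim=16)
print(table.shape)
```

Because the table depends only on the position count and the embedding size, rebuilding it yields the same values every time, so nothing is lost when it is not stored in the checkpoint.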


## 🔹 Step 3: Preprocessing - Tokenizing Inputs and Targets

We tokenize both the article (`input`) and the highlights (`target`) with truncation and padding. Label padding tokens are replaced with -100 to be ignored during loss computation.


In [None]:
max_input_length = 512
max_target_length = 128

def preprocess_function(examples):
    inputs = examples["article"]
    targets = examples["highlights"]
    # Tokenize the articles, truncating and padding to a fixed length
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding="max_length")

    # Tokenize the reference summaries; `text_target` replaces the deprecated
    # `as_target_tokenizer()` context manager in recent transformers releases
    labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True, padding="max_length")

    # Replace pad token ids in the labels with -100 so the loss ignores them
    model_inputs["labels"] = [
        [(token_id if token_id != tokenizer.pad_token_id else -100) for token_id in label]
        for label in labels["input_ids"]
    ]
    return model_inputs
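
The `-100` replacement works because PyTorch's cross-entropy loss skips any target equal to its `ignore_index`, whose default is `-100`. The sketch below checks this on random logits. The vocabulary size, pad id, and tensor values are made up for illustration and are not PEGASUS's.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, pad_id = 10, 0                       # toy values for illustration

logits = torch.randn(1, 4, vocab_size)           # (batch, sequence, vocab)
labels = torch.tensor([[5, 7, pad_id, pad_id]])  # two real tokens, two pads

# Same masking as in preprocess_function: pad positions become -100
masked = labels.masked_fill(labels == pad_id, -100)

# Loss over all four positions, with -100 targets ignored by default
loss_masked = F.cross_entropy(logits.view(-1, vocab_size), masked.view(-1))
# Loss computed by hand over only the two real positions
loss_real_only = F.cross_entropy(logits[0, :2], labels[0, :2])
# Loss when pad positions are (wrongly) treated as real targets
loss_with_pad = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))

print(loss_masked.item(), loss_real_only.item(), loss_with_pad.item())
```

The first two values agree, showing that masked positions contribute nothing to the average, while the third also penalizes the model for not predicting padding.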

## 🔹 Step 4: Tokenize Dataset

Apply the preprocessing function to every example in the 1% subset, dropping the original text columns.


In [None]:
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=dataset.column_names)
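
Because `preprocess_function` pads with `padding="max_length"`, every stored row has exactly 512 input ids regardless of article length. A common alternative is to pad per batch to the longest sequence in that batch, which is the approach taken by `DataCollatorForSeq2Seq` in transformers. The pure-Python sketch below contrasts the two. The helpers `pad_to`, `pad_batch_fixed`, and `pad_batch_dynamic`, as well as the token ids, are hypothetical and only illustrate the idea.

```python
def pad_to(ids, length, pad_id):
    """Truncate or right-pad a list of token ids to exactly `length`."""
    return ids[:length] + [pad_id] * max(0, length - len(ids))

def pad_batch_fixed(batch, max_length, pad_id=0):
    # Every sequence is brought to max_length, as with padding="max_length".
    return [pad_to(ids, max_length, pad_id) for ids in batch]

def pad_batch_dynamic(batch, max_length, pad_id=0):
    # Pad only to the longest sequence in this batch, capped at max_length.
    longest = min(max_length, max(len(ids) for ids in batch))
    return [pad_to(ids, longest, pad_id) for ids in batch]

batch = [[11, 12, 13], [21, 22, 23, 24, 25]]
fixed = pad_batch_fixed(batch, max_length=8)
dynamic = pad_batch_dynamic(batch, max_length=8)
print(len(fixed[0]), len(dynamic[0]))  # 8 5
```

Fixed-length padding is simpler and keeps tensor shapes constant, at the cost of computing attention over pad tokens for short articles.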

## 🔹 Step 5: Define Training Arguments

Configure model fine-tuning parameters such as learning rate, batch size, logging, and saving behavior.


In [None]:
training_args = TrainingArguments(
    output_dir="./pegasus_finetuned",
    learning_rate=3e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    weight_decay=0.01,
    save_total_limit=1,
    save_steps=500,
    logging_dir="./logs",
    logging_steps=100,
    fp16=torch.cuda.is_available(),
    report_to="none",
    disable_tqdm=False
)
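
The cell above does not set a scheduler or warmup, so `Trainer` falls back to its defaults, which in current transformers releases are a linear decay to zero with no warmup. The sketch below reproduces that schedule in plain Python. `linear_lr` is a hypothetical helper, and `total_steps=1436` is taken from the `global_step` value in the training output reported later in this notebook.

```python
def linear_lr(step, base_lr=3e-5, total_steps=1436, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 718, 1436):
    print(step, linear_lr(step))
```

With these settings the rate starts at the configured `3e-5`, is about half of that midway through the epoch, and reaches zero at the final step.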

## 🔹 Step 6: Initialize Trainer and Start Training


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
)

# 6. Train!
trainer.train()


| Step | Training Loss |
|------|---------------|
| 100  | 1.5319 |
| 200  | 1.3742 |
| 300  | 1.2023 |
| 400  | 1.1087 |
| 500  | 1.1173 |
| 600  | 1.161  |
| 700  | 1.1327 |
| 800  | 1.1985 |
| 900  | 1.183  |
| 1000 | 1.1001 |




TrainOutput(global_step=1436, training_loss=1.1949089151238996, metrics={'train_runtime': 816.7738, 'train_samples_per_second': 3.515, 'train_steps_per_second': 1.758, 'total_flos': 4147826166595584.0, 'train_loss': 1.1949089151238996, 'epoch': 1.0})
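
The `global_step=1436` above is consistent with simple arithmetic: multiplying `train_samples_per_second` by `train_runtime` gives roughly 2,871 examples, and with a batch size of 2 that is 1,436 optimizer steps for one epoch. The check below assumes a single device and no gradient accumulation, neither of which is stated in the notebook. `steps_per_epoch` is a hypothetical helper.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps per epoch on one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size)

# Values copied from the TrainOutput above
samples_per_second = 3.515
runtime_seconds = 816.7738

num_examples = round(samples_per_second * runtime_seconds)
print(num_examples)                      # 2871
print(steps_per_epoch(num_examples, 2))  # 1436
```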

## 🔹 Step 7: Inference - Generate Summary for a Test Article

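Use the fine-tuned PEGASUS model to summarize an article. Note that `dataset[0]` comes from the training subset, so this is a quick sanity check rather than an evaluation on unseen data.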


In [31]:
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [32]:
text = dataset[0]['article']
# Make sure the model is on the same device as the inputs
model.to(device)
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="longest", max_length=512).to(device)
# Beam search with 4 beams; stop once all beams have finished
summary_ids = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

Harry Potter star Daniel Radcliffe turns 18 on Monday . He insists money won't cast a spell on him . Radcliffe's earnings from the first five Potter films have been held in a trust fund .
