# Abstractive Summarization using BART on CNN/DailyMail

This notebook demonstrates how to fine-tune BART on the CNN/DailyMail dataset for abstractive summarization, starting from the `facebook/bart-large-cnn` checkpoint. BART combines a bidirectional encoder (like BERT) with an autoregressive decoder (like GPT), which makes it well suited to sequence-to-sequence tasks such as summarization.


In [7]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
!pip install transformers torch datasets accelerate rouge_score --quiet

In [23]:
from datasets import load_dataset
from transformers import (
    BartTokenizer,
    BartForConditionalGeneration,
    TrainingArguments,
    Trainer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
)
import torch

### 🔹 Step 1: Load the CNN/DailyMail Dataset

To keep training fast, we use a small subset of the data: 2,000 shuffled training examples and the first 500 validation examples.


In [None]:
dataset = load_dataset("cnn_dailymail", "3.0.0")
train_data = dataset["train"].shuffle(seed=42).select(range(2000))  # Use small subset for quick training
val_data = dataset["validation"].select(range(500))

### 🔹 Step 2: Load Pretrained BART Tokenizer and Model

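We'll use the `facebook/bart-large-cnn` checkpoint, a BART-large model that has already been fine-tuned on CNN/DailyMail, so the training below continues from a model that can already summarize news articles.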


In [None]:
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

### 🔹 Step 3: Tokenize the Dataset

We tokenize the input (article) to at most 512 tokens and the target (summary) to at most 128 tokens. Longer sequences are truncated and shorter ones are padded, so every example has the same length.


In [None]:
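def preprocess(example):
    # Expects a batch of examples (map is called with batched=True below)
    # Tokenize the articles (model inputs), truncated and padded to 512 tokens
    inputs = tokenizer(
        example["article"],
        max_length=512,
        truncation=True,
        padding="max_length"
    )
    # Tokenize the reference summaries (targets), truncated and padded to 128 tokens
    targets = tokenizer(
        example["highlights"],
        max_length=128,
        truncation=True,
        padding="max_length"
    )
    # Replace padding ids in the labels with -100 so the loss ignores padded positions
    inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in targets["input_ids"]
    ]
    return inputs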
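
To make the effect of `max_length`, `truncation=True`, and `padding="max_length"` concrete, here is a minimal pure-Python sketch of what happens to a single sequence of token ids. The helper `pad_or_truncate` and the toy ids are hypothetical. The pad id of 1 matches BART's `<pad>` token, but the real tokenizer also keeps the closing `</s>` token when it truncates, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=1):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation: drop tokens past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)        # how many positions are still empty
    ids += [pad_id] * n_pad              # padding: fill on the right with pad_id
    mask += [0] * n_pad                  # 0 marks a padded position
    return ids, mask


# Toy ids, chosen only for illustration
short_ids, short_mask = pad_or_truncate([0, 713, 16, 2], max_length=6)
long_ids, long_mask = pad_or_truncate([0, 713, 16, 10, 251, 3645, 2], max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore padded input positions; the labels need a separate mechanism, since the loss only skips positions whose label is -100.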

In [None]:
train_encoded = train_data.map(preprocess, batched=True)
val_encoded = val_data.map(preprocess, batched=True)

### 🔹 Step 4: Set Training Arguments

We configure batch size, learning rate, number of epochs, and other training-related hyperparameters using `Seq2SeqTrainingArguments`.


In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./bart-summarizer",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=2,
    save_strategy="epoch",
    logging_dir="./logs",
    predict_with_generate=True,
    fp16=torch.cuda.is_available(),
    report_to="none",
    disable_tqdm=False
)
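
As a quick sanity check on these settings, the number of optimizer steps can be worked out by hand. The sketch below assumes a single device and no gradient accumulation, together with the 2,000-example training subset selected earlier. `expected_steps` is a hypothetical helper, not part of `transformers`.

```python
import math


def expected_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return steps_per_epoch * num_epochs


total_steps = expected_steps(num_examples=2000, batch_size=2, num_epochs=2)
print(total_steps)  # 2000
```

With these values the count is 2,000, which matches the `global_step=2000` reported by the training run.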

### 🔹 Step 5: Define Trainer and Data Collator

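We use Hugging Face's `Seq2SeqTrainer`, the `Trainer` subclass that honors the `predict_with_generate` setting, together with a data collator designed for sequence-to-sequence tasks.


In [None]:
from transformers import Seq2SeqTrainer

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Seq2SeqTrainer pairs with Seq2SeqTrainingArguments; the plain Trainer ignores predict_with_generate
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_encoded,
    eval_dataset=val_encoded,
    tokenizer=tokenizer,
    data_collator=data_collator,
)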
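
For intuition about what the collator does to each batch, the following pure-Python sketch pads every field to the longest sequence in the batch: `input_ids` with the pad id, `attention_mask` with 0, and `labels` with -100, which is the default `label_pad_token_id` of `DataCollatorForSeq2Seq` and the value the loss ignores. `collate` and the toy ids are hypothetical. The real collator additionally returns tensors and builds `decoder_input_ids` when a model is passed.

```python
def collate(features, pad_id=1, label_pad_id=-100):
    """Pad each field of a list of examples to the longest sequence in the batch."""
    def pad(seqs, value):
        width = max(len(s) for s in seqs)
        return [list(s) + [value] * (width - len(s)) for s in seqs]

    return {
        "input_ids": pad([f["input_ids"] for f in features], pad_id),
        "attention_mask": pad([[1] * len(f["input_ids"]) for f in features], 0),
        "labels": pad([f["labels"] for f in features], label_pad_id),
    }


# Two toy examples of different lengths
batch = collate([
    {"input_ids": [0, 11, 12, 13, 2], "labels": [0, 21, 2]},
    {"input_ids": [0, 14, 2], "labels": [0, 22, 23, 24, 2]},
])

print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```

Because the preprocessing step in this notebook already pads to a fixed length, the collator has little left to pad here. Its padding matters most when tokenization leaves sequences at their natural length.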

### 🔹 Step 6: Start Training

Now we fine-tune the model on the summarization task. Runtime depends heavily on the available hardware; the run logged below took about 14 minutes (847 seconds).


In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.7565        | 0.644206        |
| 2     | 0.4817        | 0.684369        |


TrainOutput(global_step=2000, training_loss=0.6297503890991211, metrics={'train_runtime': 846.8051, 'train_samples_per_second': 4.724, 'train_steps_per_second': 2.362, 'total_flos': 4444731538931712.0, 'train_loss': 0.6297503890991211, 'epoch': 2.0})
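
The log above shows training loss falling from 0.7565 to 0.4817 while validation loss rises from 0.6442 to 0.6844, which suggests the model may be starting to overfit the 2,000-example subset by the second epoch. A minimal sketch of picking the epoch with the lowest validation loss from such a log follows. The `history` list simply restates the numbers above, and `best_epoch` is a hypothetical helper.

```python
# Per-epoch losses copied from the training log
history = [
    {"epoch": 1, "train_loss": 0.7565, "val_loss": 0.644206},
    {"epoch": 2, "train_loss": 0.4817, "val_loss": 0.684369},
]


def best_epoch(records):
    """Return the epoch number whose validation loss is lowest."""
    return min(records, key=lambda r: r["val_loss"])["epoch"]


print(best_epoch(history))  # 1
```

`Seq2SeqTrainingArguments` also accepts `load_best_model_at_end=True`, which asks the trainer to reload the best checkpoint when training finishes; it requires matching evaluation and save strategies, as configured in this notebook.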

### 🔹 Step 7: Generate Summary on Test Article

Let’s try summarizing an example article from the test split and print the generated summary.


In [None]:
article = dataset["test"][0]["article"]

inputs = tokenizer(
    article, return_tensors="pt", max_length=512, truncation=True
).to(model.device)

# Pass the attention mask along with the input ids by unpacking the tokenizer output
summary_ids = model.generate(**inputs, max_length=128, num_beams=4)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("\n📌 Generated Summary:\n", summary)


📌 Generated Summary:
 Palestinians become the 123rd member of the International Criminal Court .
The move gives the court jurisdiction over alleged crimes in Palestinian territories .
Israel and the United States opposed the Palestinians' efforts to join the court .
But Human Rights Watch welcomed the development .
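
To put a number on summary quality, generated text is usually compared with the reference `highlights` using ROUGE, which is what the `rouge_score` package installed at the top of the notebook computes. The sketch below is a simplified, pure-Python unigram-overlap F1 in the spirit of ROUGE-1, not the `rouge_score` implementation: it uses whitespace tokenization and no stemming. `rouge1_f1` and the two toy sentences are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Toy sentences for illustration only
score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))  # 0.833
```

For reported results, the `rouge_score` package or an equivalent library is the safer choice, since published ROUGE numbers depend on tokenization and stemming details that this sketch leaves out.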
