# Text Summarization with BART

This notebook shows how to fine-tune a pre-trained BART checkpoint (`facebook/bart-large-cnn`) for text summarization on a small subset of the CNN/DailyMail dataset. We will:

1. Install required packages
2. Load and preprocess the data
3. Fine-tune the BART model
4. Generate summaries
5. Evaluate using ROUGE metrics

## Step 1: Install Required Packages

First, we need to install the necessary packages for our text summarization task:
- `datasets`: For loading and managing the CNN/DailyMail dataset
- `transformers`: For the BART model and tokenizer
- `fsspec`: For file system operations
- `rouge_score`: For evaluation metrics

In [1]:
# Install required packages with upgrade flag to ensure latest versions
!pip install -U datasets transformers fsspec rouge_score

Collecting datasets
  Downloading datasets-4.0.0-py3-none-any.whl.metadata (19 kB)
Collecting transformers
  Downloading transformers-4.53.2-py3-none-any.whl.metadata (40 kB)
Collecting fsspec
  Downloading fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting fsspec
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-4.0.0-py3-none-any.whl (494 kB)
Downloading transformers-4.53.2-py3-none-any.whl (10.8 MB)
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
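
Because `-U` installs whatever versions are current at run time, it can help to record what actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is an illustrative choice, not part of any of the packages above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in this environment
            found[name] = None
    return found


versions = installed_versions(["datasets", "transformers", "fsspec", "rouge_score"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

Pinning the printed versions in a later `pip install` command is one way to make the notebook reproducible.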

## Step 2: Import Libraries and Load Model

Import all necessary libraries and load the pre-trained BART model for text summarization.

In [2]:
# Import PyTorch for deep learning operations
import torch

# Import datasets library for loading CNN/DailyMail dataset
from datasets import load_dataset

# Import transformers for BART model and tokenizer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Import PyTorch utilities for data loading and optimization
from torch.utils.data import DataLoader
from torch.optim import AdamW

# Import tqdm for progress bars during training
from tqdm import tqdm

In [3]:
# Define the pre-trained model name - BART Large CNN is fine-tuned for summarization
model_name = "facebook/bart-large-cnn"

# Load the tokenizer for text preprocessing
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the pre-trained BART model for sequence-to-sequence generation
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


## Step 3: Data Preprocessing

Define a preprocessing function to tokenize articles and summaries for training.

In [4]:
# Define preprocessing function for tokenizing articles and highlights
def preprocess_function(examples):
    # Tokenize input articles with maximum length of 512 tokens
    # Padding ensures all sequences have the same length for batch processing
    # Truncation cuts off text that exceeds the maximum length
    inputs = tokenizer(examples["article"], max_length=512, padding="max_length", truncation=True)

    # Tokenize target summaries with maximum length of 128 tokens
    targets = tokenizer(examples["highlights"], max_length=128, padding="max_length", truncation=True)

    # Set the labels for training (model will learn to generate these)
    inputs["labels"] = targets["input_ids"]

    # Replace padding token ids with -100 so they are ignored during loss calculation
    # This prevents the model from learning to predict padding tokens
    inputs["labels"] = [
        [(label if label != tokenizer.pad_token_id else -100) for label in labels]
        for labels in inputs["labels"]
    ]
    return inputs
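
To make the label masking concrete, here is a small pure-Python sketch of the same idea on toy token ids. It assumes a pad id of 1 and uses made-up token ids; the helper names `pad_or_truncate` and `mask_padding` are hypothetical, and the real tokenizer handles truncation more carefully (for example, it keeps the end-of-sequence token).

```python
PAD_ID = 1            # assumed pad token id for this toy example
IGNORE_INDEX = -100   # label value that PyTorch's cross-entropy loss skips by default


def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Cut a sequence to max_length, then right-pad it to exactly max_length."""
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))


def mask_padding(labels, pad_id=PAD_ID):
    """Replace pad ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [tok if tok != pad_id else IGNORE_INDEX for tok in labels]


# Two made-up summaries of different lengths (token ids are illustrative only)
short_summary = [0, 713, 16, 2]
long_summary = [0, 713, 16, 10, 251, 4819, 2]

padded = [pad_or_truncate(s, max_length=6) for s in (short_summary, long_summary)]
labels = [mask_padding(row) for row in padded]

print(padded)  # both rows now have length 6
print(labels)  # padding positions in the first row become -100
```

Only the padded positions change; real tokens keep their ids, so the loss is computed on the summary text alone.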

## Step 4: Load and Prepare Dataset

Load the CNN/DailyMail dataset and apply preprocessing to prepare it for training.

In [6]:
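# Load the CNN/DailyMail dataset version 3.0.0 from Hugging Face
# download_mode="force_redownload" bypasses the local cache and downloads the data again on every run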
dataset = load_dataset("cnn_dailymail", "3.0.0", download_mode="force_redownload")

# Select only the first 25 samples from training set for faster experimentation
# In practice, you would use the full dataset for better results
dataset["train"] = dataset["train"].select(range(25))

# Apply the preprocessing function to tokenize all samples in the dataset
# batched=True processes multiple samples at once for efficiency
tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Set the format to PyTorch tensors for the required columns
# This converts the data to the format expected by PyTorch models
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])


Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

Map:   0%|          | 0/25 [00:00<?, ? examples/s]

Map:   0%|          | 0/13368 [00:00<?, ? examples/s]

Map:   0%|          | 0/11490 [00:00<?, ? examples/s]

## Step 5: Training Setup

Configure the DataLoader, device, and optimizer for training the model.

In [7]:
# Create DataLoader for batch processing during training
# batch_size=2 processes 2 samples at a time (small batch for memory efficiency)
# shuffle=True randomizes the order of samples in each epoch
train_loader = DataLoader(tokenized_dataset["train"], batch_size=2, shuffle=True)

# Set up device for computation (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Move model to the selected device (GPU/CPU)
model.to(device)

# Initialize the AdamW optimizer (Adam with decoupled weight decay; PyTorch's default weight_decay is 0.01)
# lr=5e-5 is a common learning rate for fine-tuning transformer models
optimizer = AdamW(model.parameters(), lr=5e-5)

Using device: cuda
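
As a quick sanity check on this setup, the number of optimizer steps follows directly from the dataset size and batch size. The sketch below uses the values from this notebook (25 training samples, batch size 2, 5 epochs) and assumes the `DataLoader` keeps the final, smaller batch, which is its default behaviour.

```python
import math

num_samples = 25   # training subset size selected in Step 4
batch_size = 2     # DataLoader batch size from the cell above
num_epochs = 5     # number of epochs used in the training loop of Step 6

# With drop_last=False (the default), the last partial batch is kept
batches_per_epoch = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (batches_per_epoch - 1) * batch_size
total_steps = batches_per_epoch * num_epochs

print(f"Batches per epoch: {batches_per_epoch}")
print(f"Size of last batch: {last_batch_size}")
print(f"Total optimizer steps: {total_steps}")
```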


## Step 6: Training Loop

Fine-tune the BART model on our dataset for 5 epochs.

In [8]:
# Set model to training mode (enables dropout; BART uses layer normalization, not batch normalization)
model.train()

# Train for 5 epochs (complete passes through the dataset)
for epoch in range(5):
    # Initialize total loss for this epoch
    total_loss = 0

    # Iterate through batches with progress bar
    for batch in tqdm(train_loader):
        # Move batch data to the same device as the model
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass: compute model outputs and loss
        outputs = model(**batch)
        loss = outputs.loss

        # Backward pass: compute gradients
        loss.backward()

        # Update model parameters using the optimizer
        optimizer.step()

        # Clear gradients for the next iteration
        optimizer.zero_grad()

        # Accumulate loss for reporting
        total_loss += loss.item()

    # Print the loss summed over all batches in this epoch (not the per-batch mean)
    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")

100%|██████████| 13/13 [00:08<00:00,  1.56it/s]
Epoch 1, Loss: 29.5691
100%|██████████| 13/13 [00:07<00:00,  1.72it/s]
Epoch 2, Loss: 9.7674
100%|██████████| 13/13 [00:07<00:00,  1.70it/s]
Epoch 3, Loss: 4.7316
100%|██████████| 13/13 [00:07<00:00,  1.68it/s]
Epoch 4, Loss: 3.1695
100%|██████████| 13/13 [00:08<00:00,  1.61it/s]
Epoch 5, Loss: 2.5039
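
The values printed above are sums of the batch losses over each epoch, so they depend on how many batches there are. Dividing by the number of batches gives a figure that is easier to compare across runs. A minimal sketch, using the numbers copied from the output above:

```python
# Summed epoch losses copied from the training output above
epoch_loss_sums = [29.5691, 9.7674, 4.7316, 3.1695, 2.5039]
num_batches = 13  # batches per epoch, as shown by the progress bars

# Mean loss per batch for each epoch
mean_losses = [total / num_batches for total in epoch_loss_sums]

for epoch, mean_loss in enumerate(mean_losses, start=1):
    print(f"Epoch {epoch}, mean loss per batch: {mean_loss:.4f}")
```

Note that this is a mean over batches, not over samples: the last batch holds a single sample, so it is weighted slightly more heavily than the others.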





## Step 7: Save the Fine-tuned Model

Save the trained model and tokenizer for later use.

In [9]:
# Save the fine-tuned model to local directory
# This saves the model weights and configuration
model.save_pretrained("./my_bart_summary_model")

# Save the tokenizer to the same directory
# This ensures we can properly tokenize text during inference
tokenizer.save_pretrained("./my_bart_summary_model")

print("Model and tokenizer saved successfully!")



Model and tokenizer saved successfully!
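
Before relying on the saved directory, it can be worth confirming that the expected files are there. The sketch below is an assumption-laden check: the file names in `EXPECTED_FILES` are what recent `transformers` versions typically write and may differ in other versions, and `missing_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumed file names; adjust them if your transformers version writes different ones
EXPECTED_FILES = ["config.json", "model.safetensors", "tokenizer_config.json"]


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_files("./my_bart_summary_model")
print("Missing files:", missing or "none")
```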


## Step 8: Text Summarization Function

Create a function to generate summaries using our fine-tuned model.

In [10]:
# Define function to generate summaries using the fine-tuned model
def summarize(text, model_path="./my_bart_summary_model"):
    # Load the fine-tuned model from the saved directory
    # (this happens on every call; load once and reuse it if you summarize many texts)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(device)

    # Load the corresponding tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Tokenize the input text with proper formatting for the model
    # return_tensors="pt" returns PyTorch tensors
    # max_length=512 matches the truncation length used in preprocessing (BART itself accepts up to 1024 tokens)
    inputs = tokenizer([text], return_tensors="pt", max_length=512, padding=True, truncation=True).to(device)

    # Generate summary using beam search for better quality
    summary_ids = model.generate(
        inputs["input_ids"],
        num_beams=4,           # Use beam search with 4 beams for better results
        max_length=142,        # Maximum summary length
        min_length=56,         # Minimum summary length to avoid too short summaries
        length_penalty=2.0,    # Encourage longer sequences
        no_repeat_ngram_size=3,# Avoid repeating 3-grams for better readability
        early_stopping=True,   # Stop as soon as num_beams complete candidates are found
    )

    # Decode the generated tokens back to text
    # skip_special_tokens=True removes BART's special tokens (<s>, </s>, <pad>)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary
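
Two of the generation settings above are easier to follow with a toy example. The sketch below is a simplified illustration, not the library's implementation: it assumes beams are ranked by the sum of token log-probabilities divided by `length ** length_penalty`, and it checks repeated n-grams on plain word lists. The function names and the example scores are hypothetical.

```python
def beam_score(sum_log_probs, length, length_penalty):
    """Length-normalized beam score: sum of log-probabilities / length ** penalty."""
    return sum_log_probs / (length ** length_penalty)


def repeats_ngram(tokens, candidate, n=3):
    """True if appending `candidate` would repeat an n-gram already in `tokens`."""
    if len(tokens) < n - 1:
        return False
    new_ngram = tuple(tokens[len(tokens) - (n - 1):] + [candidate])
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return new_ngram in seen


# Hypothetical finished beams: (sum of log-probabilities, length in tokens)
short_beam = (-4.0, 4)
long_beam = (-6.0, 8)

winners = {}
for penalty in (0.0, 1.0, 2.0):
    short_score = beam_score(*short_beam, penalty)
    long_score = beam_score(*long_beam, penalty)
    winners[penalty] = "long" if long_score > short_score else "short"
    print(f"length_penalty={penalty}: short={short_score:.4f}, long={long_score:.4f}")

# With no_repeat_ngram_size=3, a token that would recreate a seen 3-gram is blocked
tokens = ["the", "court", "said", "the", "court"]
blocked = repeats_ngram(tokens, "said")    # "the court said" already occurred
allowed = not repeats_ngram(tokens, "ruled")
print(blocked, allowed)
```

Because log-probabilities are negative, dividing by a larger power of the length shrinks the penalty on long beams, which is why a `length_penalty` above 1 favours longer summaries.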

## Step 9: Test Summarization

Test our summarization function on a sample article from the test dataset.

In [11]:
# Get a sample article from the test dataset
sample_text = dataset["test"][0]["article"]

# Display the first 500 characters of the original article
print("Original Article:")
print(sample_text[:500], "...")

# Generate and display the summary
print("\nGenerated Summary:")
summary = summarize(sample_text)
print(summary)

Original Article:
(CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC's founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, includin ...

Generated Summary:




Palestinian Authority officially becomes 123rd member of the International Criminal Court .
It's latest step that gives court jurisdiction over alleged crimes in Palestinian territories .
Palestinian foreign minister: "Today brings us closer to our shared goals of justice and peace"
Israel, U.S. opposed the Palestinians' efforts to join the court .


## Step 10: ROUGE Evaluation

Evaluate the model's performance using ROUGE metrics (ROUGE-1, ROUGE-2, and ROUGE-L).

In [12]:
# Import ROUGE scorer for evaluation metrics
from rouge_score import rouge_scorer

# Define evaluation function using ROUGE metrics
def evaluate_rouge(dataset, model_path="./my_bart_summary_model"):
    # summarize() loads the model and tokenizer from model_path on every call,
    # so nothing needs to be loaded here

    # Initialize ROUGE scorer with different ROUGE variants
    # rouge1: unigram overlap, rouge2: bigram overlap, rougeL: longest common subsequence
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

    # Initialize lists to store reference summaries and model predictions
    references = []
    predictions = []

    # Use a small subset of test data for quick evaluation (10 samples)
    # In practice, you would evaluate on the full test set
    eval_dataset = dataset["test"].select(range(10))

    # Iterate through evaluation samples with progress bar
    for i in tqdm(range(len(eval_dataset))):
        # Get the original article and its reference summary
        article = eval_dataset[i]["article"]
        highlight = eval_dataset[i]["highlights"]

        # Generate summary using our model
        summary = summarize(article, model_path=model_path)

        # Store reference and prediction for ROUGE calculation
        references.append(highlight)
        predictions.append(summary)

    # Calculate ROUGE scores for each reference-prediction pair
    scores = [scorer.score(ref, pred) for ref, pred in zip(references, predictions)]

    # Calculate average scores across all samples
    avg_scores = {
        metric: {
            'precision': sum(s[metric].precision for s in scores) / len(scores),
            'recall': sum(s[metric].recall for s in scores) / len(scores),
            'fmeasure': sum(s[metric].fmeasure for s in scores) / len(scores),
        }
        for metric in ['rouge1', 'rouge2', 'rougeL']
    }
    return avg_scores

# Run ROUGE evaluation
print("Evaluating with ROUGE metrics...")
rouge_results = evaluate_rouge(dataset)

# Display the results in a formatted way
print("\nROUGE Evaluation Results:")
for metric, values in rouge_results.items():
    print(f"  {metric.upper()}:")
    print(f"    Precision: {values['precision']:.4f}")
    print(f"    Recall:    {values['recall']:.4f}")
    print(f"    F1-score:  {values['fmeasure']:.4f}")

Evaluating with ROUGE metrics...


100%|██████████| 10/10 [00:21<00:00,  2.15s/it]


ROUGE Evaluation Results:
  ROUGE1:
    Precision: 0.2966
    Recall:    0.4248
    F1-score:  0.3422
  ROUGE2:
    Precision: 0.1184
    Recall:    0.1748
    F1-score:  0.1380
  ROUGEL:
    Precision: 0.1994
    Recall:    0.2931
    F1-score:  0.2323
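To show what the ROUGE-1 and ROUGE-2 numbers measure, here is a simplified pure-Python sketch of n-gram overlap scoring. It is an illustration under stated assumptions rather than a replacement for `rouge_score`: it lowercases and splits on whitespace, applies no stemming, and the function names `ngrams` and `rouge_n` are hypothetical.

```python
from collections import Counter


def ngrams(text, n):
    """Count the n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, prediction, n=1):
    """Precision, recall and F1 of clipped n-gram overlap between two texts."""
    ref_counts = ngrams(reference, n)
    pred_counts = ngrams(prediction, n)
    # Counter intersection keeps the minimum count per n-gram (clipped overlap)
    overlap = sum((ref_counts & pred_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


reference = "the cat sat on the mat"
prediction = "the cat sat"

rouge1 = rouge_n(reference, prediction, n=1)
rouge2 = rouge_n(reference, prediction, n=2)
print("ROUGE-1:", rouge1)
print("ROUGE-2:", rouge2)
```

Every word of the short prediction appears in the reference, so precision is perfect while recall is low. The results above show the opposite pattern (recall higher than precision), which is consistent with generated summaries that are longer than the reference highlights.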



