# LED Summarization on GovReport Dataset

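This notebook compares baseline (single-pass) and hierarchical (chunk, summarize, then re-summarize) summarization with the LED model (`allenai/led-base-16384`) on the GovReport dataset. Summaries are evaluated on 10 validation samples with ROUGE and BERTScore; a SummaC-based factuality scorer is also defined.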

In [2]:
%pip install datasets

Collecting datasets
  Using cached datasets-4.1.1-py3-none-any.whl.metadata (18 kB)
Collecting filelock (from datasets)
  Using cached filelock-3.19.1-py3-none-any.whl.metadata (2.1 kB)
Collecting numpy>=1.17 (from datasets)
  Using cached numpy-2.2.6-cp310-cp310-win_amd64.whl.metadata (60 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-21.0.0-cp310-cp310-win_amd64.whl.metadata (3.4 kB)
Collecting dill<0.4.1,>=0.3.0 (from datasets)
  Downloading dill-0.4.0-py3-none-any.whl.metadata (10 kB)
Collecting pandas (from datasets)
  Using cached pandas-2.3.2-cp310-cp310-win_amd64.whl.metadata (19 kB)
Collecting requests>=2.32.2 (from datasets)
  Using cached requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting tqdm>=4.66.3 (from datasets)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-win_amd64.whl.metadata (13 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading mu

In [7]:
%pip install transformers torch rouge_score bert_score 

Collecting transformers
  Downloading transformers-4.56.2-py3-none-any.whl.metadata (40 kB)
Collecting torch
  Using cached torch-2.8.0-cp310-cp310-win_amd64.whl.metadata (30 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Collecting regex!=2019.12.17 (from transformers)
  Downloading regex-2025.9.18-cp310-cp310-win_amd64.whl.metadata (41 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Downloading tokenizers-0.22.1-cp39-abi3-win_amd64.whl.metadata (6.9 kB)
Collecting safetensors>=0.4.3 (from transformers)
  Using cached safetensors-0

In [8]:
%pip install summac

Collecting summac
  Downloading summac-0.0.4-py3-none-any.whl.metadata (5.3 kB)
Collecting huggingface-hub<=0.17.0 (from summac)
  Downloading huggingface_hub-0.17.0-py3-none-any.whl.metadata (13 kB)
Collecting sentencepiece (from summac)
  Downloading sentencepiece-0.2.1-cp310-cp310-win_amd64.whl.metadata (10 kB)
Collecting protobuf (from summac)
  Downloading protobuf-6.32.1-cp310-abi3-win_amd64.whl.metadata (593 bytes)
INFO: pip is looking at multiple versions of transformers to determine which version is compatible with other requirements. This could take a while.
Collecting transformers>=4.24.0 (from summac)
  Downloading transformers-4.56.1-py3-none-any.whl.metadata (42 kB)
  Using cached transformers-4.56.0-py3-none-any.whl.metadata (40 kB)
  Downloading transformers-4.55.4-py3-none-any.whl.metadata (41 kB)
  Downloading transformers-4.55.3-py3-none-any.whl.metadata (41 kB)
  Downloading transformers-4.55.2-py3-none-any.whl.metadata (41 kB)
  Downloading transformers-4.55.1-py3-

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 4.1.1 requires huggingface-hub>=0.24.0, but you have huggingface-hub 0.17.0 which is incompatible.


In [12]:
%pip install --upgrade datasets huggingface_hub

Collecting huggingface_hub
  Using cached huggingface_hub-0.35.1-py3-none-any.whl.metadata (14 kB)
Using cached huggingface_hub-0.35.1-py3-none-any.whl (563 kB)
Installing collected packages: huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.17.0
    Uninstalling huggingface-hub-0.17.0:
      Successfully uninstalled huggingface-hub-0.17.0
Successfully installed huggingface_hub-0.35.1
Note: you may need to restart the kernel to use updated packages.


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
summac 0.0.4 requires huggingface-hub<=0.17.0, but you have huggingface-hub 0.35.1 which is incompatible.
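
The two resolver errors above pull in opposite directions: `summac 0.0.4` pins `huggingface-hub<=0.17.0`, while `datasets 4.1.1` requires `huggingface-hub>=0.24.0`. A minimal sketch for checking which versions actually ended up in the kernel, using only the standard library. The helper name `installed_version` is an assumption made for illustration, not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Print what the current environment resolved to for the conflicting packages.
for name in ("datasets", "huggingface-hub", "summac", "transformers"):
    print(f"{name}: {installed_version(name)}")
```

Whether `summac` still works against the newer `huggingface-hub` is not something this check can confirm; it only reports what is installed.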


In [2]:
# Transformers, datasets, evaluation, and utilities
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Evaluation libraries
from rouge_score import rouge_scorer
from bert_score import score as bert_score
from summac.model_summac import SummaCConv
from summac.model_summac import SummaCZS

# Progress bar
from tqdm import tqdm


In [3]:
import numpy as np

In [17]:
import json

In [4]:
# Load the GovReport dataset
ds = load_dataset("ccdv/govreport-summarization")

In [3]:

# Inspect a sample
sample = ds['train'][0]
print("Report:\n", sample['report'][:500], "...\n")
print("Summary:\n", sample['summary'][:500], "...")

Report:
 The structure of the armed forces is based on the Total Force concept, which recognizes that all elements of the structure—active duty military personnel, reservists, defense contractors, host nation military and civilian personnel, and DOD federal civilian employees—contribute to national defense. In recent years, federal civilian personnel have deployed along with military personnel to participate in Operations Joint Endeavor, conducted in the countries of Bosnia-Herzegovina, Croatia, and Hung ...

Summary:
 As the Department of Defense (DOD) has expanded its involvement in overseas military operations, it has grown increasingly reliant on its federal civilian workforce to support contingency operations. The Senate Armed Services Committee required GAO to examine DOD's policies concerning the health care for DOD civilians who deploy in support of contingency operations in Afghanistan and Iraq. GAO analyzed over 3,400 deployment-related records for deployed federal civilians 

In [5]:
# Load LED model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")
model = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-base-16384")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)


In [6]:
def compute_rouge(preds, refs):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}
    for pred, ref in zip(preds, refs):
        score = scorer.score(ref, pred)
        for k in scores:
            scores[k].append(score[k].fmeasure)
    avg_scores = {k: float(np.mean(scores[k])) for k in scores}
    print("ROUGE scores:")
    for k, v in avg_scores.items():
        print(f"  {k}: {v:.4f}")
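
To make the reported numbers easier to read, here is a rough sketch of what a ROUGE-1 F-measure computes: clipped unigram overlap between reference and prediction, turned into precision, recall, and their harmonic mean. It is a simplification under stated assumptions: tokens are lowercased whitespace splits and no stemming is applied, so its values will not match `rouge_scorer` exactly. The function name `rouge1_f` is hypothetical.

```python
from collections import Counter

def rouge1_f(reference, prediction):
    """Simplified ROUGE-1 F-measure on lowercased whitespace tokens (no stemming)."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f("the cat sat on", "the cat"))
```

As in `compute_rouge`, the reference comes first and the prediction second, which is the argument order `RougeScorer.score` expects.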

In [7]:
def compute_bert_score(preds, refs):
    P, R, F1 = bert_score(preds, refs, lang="en", rescale_with_baseline=True)
    scores = {
        "precision": float(P.mean()),
        "recall": float(R.mean()),
        "f1": float(F1.mean())
    }
    print("BERTScore:")
    for k, v in scores.items():
        print(f"  {k}: {v:.4f}")

In [8]:
from summac.model_summac import SummaCConv
from summac.model_summac import SummaCZS

# Load the SummaC zero-shot model (sentence-level NLI with the "vitc" checkpoint)
summac_model = SummaCZS(granularity="sentence", model_name="vitc", device=device)

def compute_factuality(preds, sources):
    """Mean SummaC consistency of each summary against its source report."""
    # Consistency is judged against the source documents, not the reference
    # summaries. score() returns a dict whose "scores" entry holds one value
    # per (source, summary) pair.
    results = summac_model.score(sources, preds)
    return float(np.mean(results["scores"]))

In [9]:
def led_summarize(document, max_input_length=16384, max_output_length=1024):
    inputs = tokenizer(
        document,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_length
    )
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    global_attention_mask = torch.zeros_like(attention_mask)
    global_attention_mask[:, 0] = 1  # global attention on first token

    summary_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        global_attention_mask=global_attention_mask,
        max_length=max_output_length,
        num_beams=4
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)



In [7]:
ds

DatasetDict({
    train: Dataset({
        features: ['report', 'summary'],
        num_rows: 17517
    })
    validation: Dataset({
        features: ['report', 'summary'],
        num_rows: 973
    })
    test: Dataset({
        features: ['report', 'summary'],
        num_rows: 973
    })
})

In [18]:
# Baseline: summarize and evaluate on the first 10 validation samples
N = 10
documents = [ds['validation'][i]['report'] for i in range(N)]
references = [ds['validation'][i]['summary'] for i in range(N)]

baseline_summaries = []
for doc in tqdm(documents, desc="Baseline Summarization"):
    baseline_summaries.append(led_summarize(doc))


Baseline Summarization: 100%|██████████| 10/10 [21:37<00:00, 129.78s/it]


In [19]:
with open("baseline_summaries_val.json", "w", encoding="utf-8") as f:
    json.dump(baseline_summaries, f, ensure_ascii=False, indent=2)

In [None]:
with open("baseline_summaries_val.json", "r", encoding="utf-8") as f:
    baseline_summaries = json.load(f)
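
Because one summarization pass takes over 20 minutes here, the save and reload cells can be folded into a single cache-or-compute helper. This is a minimal sketch; `load_or_compute` is a hypothetical name, and it assumes the result is JSON-serializable (a list of strings, as above).

```python
import json
from pathlib import Path

def load_or_compute(path, compute_fn):
    """Return cached JSON from `path` if it exists, else compute, save, and return."""
    path = Path(path)
    if path.exists():
        with path.open("r", encoding="utf-8") as f:
            return json.load(f)
    result = compute_fn()
    with path.open("w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)
    return result

# Hypothetical usage with the names defined in this notebook:
# baseline_summaries = load_or_compute(
#     "baseline_summaries_val.json",
#     lambda: [led_summarize(doc) for doc in documents],
# )
```

Note that the cache is keyed only by file name, so a stale file must be deleted by hand after changing the model, the sample selection, or the generation settings.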

In [10]:
def print_token_counts(texts, tokenizer, label="Input"):
    """
    Prints the token count for each text in the list.
    """
    for i, text in enumerate(texts):
        tokens = tokenizer.tokenize(text)
        print(f"{label} {i+1}: {len(tokens)} tokens")

In [14]:
print_token_counts(documents, tokenizer, label="Report")
print_token_counts(references, tokenizer, label="Summary")

Report 1: 2204 tokens
Report 2: 4639 tokens
Report 3: 4287 tokens
Report 4: 6353 tokens
Report 5: 8092 tokens
Report 6: 6941 tokens
Report 7: 10364 tokens
Report 8: 8434 tokens
Report 9: 7722 tokens
Report 10: 6630 tokens
Summary 1: 632 tokens
Summary 2: 697 tokens
Summary 3: 737 tokens
Summary 4: 734 tokens
Summary 5: 868 tokens
Summary 6: 604 tokens
Summary 7: 545 tokens
Summary 8: 970 tokens
Summary 9: 672 tokens
Summary 10: 744 tokens


In [20]:
len(baseline_summaries)

10

In [21]:
print_token_counts(baseline_summaries, tokenizer, label="outputs")


outputs 1: 1022 tokens
outputs 2: 1022 tokens
outputs 3: 1022 tokens
outputs 4: 1022 tokens
outputs 5: 1022 tokens
outputs 6: 1022 tokens
outputs 7: 1022 tokens
outputs 8: 1022 tokens
outputs 9: 1022 tokens
outputs 10: 1022 tokens
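
Every baseline output re-tokenizes to 1022 tokens, two short of `max_output_length=1024`, while the reference summaries range from 545 to 970 tokens. That pattern suggests generation ran until the length limit rather than stopping at an end-of-sequence token. The two-token gap is presumably special tokens removed by `skip_special_tokens=True`; that is an assumption, not something verified in this notebook. A small sketch for flagging capped outputs, where `at_length_cap` and its `slack` parameter are hypothetical:

```python
def at_length_cap(token_counts, max_output_length, slack=2):
    """Flag outputs whose length is within `slack` tokens of the generation limit."""
    return [n >= max_output_length - slack for n in token_counts]

# Token counts copied from the output above.
baseline_counts = [1022] * 10
capped = at_length_cap(baseline_counts, max_output_length=1024)
print(f"{sum(capped)} of {len(capped)} baseline summaries hit the length cap")
```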


In [12]:
def chunk_document(document, max_tokens=1500):
    # Simple chunking by paragraphs, keeping each chunk under max_tokens
    paragraphs = document.split('\n')
    chunks = []
    current_chunk = ""
    current_tokens = 0
    for para in paragraphs:
        para_tokens = len(tokenizer.tokenize(para))
        if current_tokens + para_tokens > max_tokens and current_chunk:
            chunks.append(current_chunk)
            current_chunk = para
            current_tokens = para_tokens
        else:
            current_chunk += "\n" + para if current_chunk else para
            current_tokens += para_tokens
    if current_chunk:
        chunks.append(current_chunk)
    return chunks


In [23]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\thula_\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [25]:
from nltk.tokenize import sent_tokenize

def robust_sentence_chunk(document, tokenizer, max_tokens=2000, min_chunk_ratio=0.5):
    sentences = sent_tokenize(document)
    chunks = []
    current_chunk = ""
    current_tokens = 0
    min_chunk_tokens = int(max_tokens * min_chunk_ratio)
    for sent in sentences:
        sent_tokens = len(tokenizer.tokenize(sent))
        if current_tokens + sent_tokens > max_tokens and current_chunk:
            chunks.append(current_chunk)
            current_chunk = sent
            current_tokens = sent_tokens
        else:
            current_chunk += " " + sent if current_chunk else sent
            current_tokens += sent_tokens
    if current_chunk:
        if len(tokenizer.tokenize(current_chunk)) < min_chunk_tokens and len(chunks) > 0:
            chunks[-1] += " " + current_chunk
        else:
            chunks.append(current_chunk)
    return chunks
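
The same greedy packing logic can be traced on a toy input without loading NLTK or the LED tokenizer. This sketch makes two simplifying assumptions that differ from the cell above: sentences are split with a regular expression instead of `sent_tokenize`, and tokens are counted as whitespace-separated words. The names `count_tokens` and `greedy_sentence_chunks` are hypothetical.

```python
import re

def count_tokens(text):
    # Stand-in for len(tokenizer.tokenize(text)): whitespace words, not LED subwords.
    return len(text.split())

def greedy_sentence_chunks(document, max_tokens=10, min_chunk_ratio=0.5):
    """Pack sentences greedily into chunks; merge a short final chunk into the previous one."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]
    chunks, current, current_tokens = [], [], 0
    for sent in sentences:
        n = count_tokens(sent)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sent)
        current_tokens += n
    if current:
        tail = " ".join(current)
        if chunks and count_tokens(tail) < int(max_tokens * min_chunk_ratio):
            chunks[-1] += " " + tail  # the merged chunk may exceed max_tokens
        else:
            chunks.append(tail)
    return chunks

demo = "One two three four five. Six seven eight nine ten. Eleven twelve thirteen. Fourteen."
for max_tokens in (6, 10):
    sizes = [count_tokens(c) for c in greedy_sentence_chunks(demo, max_tokens)]
    print(max_tokens, sizes)
```

With `max_tokens=10` the four-word tail is merged back, giving a single 14-word chunk. The merge step can therefore push the last chunk past `max_tokens`; with the notebook defaults (2000 tokens, ratio 0.5) the last chunk can approach 3000 tokens, which is still well inside LED's 16384-token input window.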

In [28]:
def hierarchical_summarize(document, summary_length=1024):
    chunks = robust_sentence_chunk(document, tokenizer)
    chunk_summaries = [led_summarize(chunk, max_output_length=summary_length//2)
                       for chunk in chunks]
    combined_summary = " ".join(chunk_summaries)
    final_summary = led_summarize(combined_summary, max_output_length=summary_length)
    return final_summary
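
The two-pass structure above (summarize each chunk, join the partial summaries, summarize the result) does not depend on LED. A minimal sketch with the chunker and summarizer passed in as plain callables, so the control flow can be tested with toy stand-ins. All names here (`hierarchical`, `split_paragraphs`, `first_words`) are hypothetical, and the toy summarizer just truncates to a word budget.

```python
def hierarchical(document, chunker, summarize, chunk_budget, final_budget):
    """Two-pass summarization: summarize each chunk, then summarize the joined partials."""
    partials = [summarize(chunk, chunk_budget) for chunk in chunker(document)]
    return summarize(" ".join(partials), final_budget)

def split_paragraphs(document):
    # Toy chunker: one chunk per blank-line-separated paragraph.
    return [p for p in document.split("\n\n") if p.strip()]

def first_words(text, budget):
    # Toy "summarizer": keep the first `budget` words.
    return " ".join(text.split()[:budget])

doc = "a b c d\n\ne f g h"
print(hierarchical(doc, split_paragraphs, first_words, chunk_budget=2, final_budget=3))
```

In the notebook, `chunker` corresponds to `robust_sentence_chunk`, `summarize` to `led_summarize`, and the budgets to `summary_length // 2` and `summary_length`.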

In [29]:
# Hierarchical summarization and evaluation on the first 10 validation samples
hierarchical_summaries = []
for doc in tqdm(documents, desc="Hierarchical Summarization"):
    hierarchical_summaries.append(hierarchical_summarize(doc))

Hierarchical Summarization: 100%|██████████| 10/10 [33:26<00:00, 200.69s/it]


In [30]:
with open("hierarchical_summaries.json", "w", encoding="utf-8") as f:
    json.dump(hierarchical_summaries, f, ensure_ascii=False, indent=2)


In [None]:
with open("hierarchical_summaries.json", "r", encoding="utf-8") as f:
    hierarchical_summaries = json.load(f)

In [31]:
compute_rouge(baseline_summaries, references)

ROUGE scores:
  rouge1: 0.4665
  rouge2: 0.1530
  rougeL: 0.1907


In [33]:
compute_rouge(hierarchical_summaries, references)

ROUGE scores:
  rouge1: 0.5050
  rouge2: 0.1519
  rougeL: 0.1907


In [32]:
compute_bert_score(baseline_summaries, references)


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERTScore:
  precision: -0.1065
  recall: -0.0509
  f1: -0.0782
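
Negative values are possible here because `rescale_with_baseline=True` maps each raw score linearly so that a language- and model-specific baseline lands at 0 and a perfect match stays at 1. Any raw score below the baseline therefore comes out negative. A sketch of that mapping; the baseline value `0.83` below is purely illustrative and is not the value `bert_score` loads for `roberta-large`.

```python
def rescale(raw_score, baseline):
    """Linear rescaling used for baseline-adjusted BERTScore: baseline -> 0, 1 -> 1."""
    return (raw_score - baseline) / (1.0 - baseline)

illustrative_baseline = 0.83  # hypothetical value, for demonstration only
for raw in (0.80, 0.83, 0.90, 1.00):
    print(f"raw={raw:.2f} -> rescaled={rescale(raw, illustrative_baseline):.4f}")
```

A slightly negative rescaled F1 therefore means the summaries score a little below the baseline level, not that the metric is broken.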


In [34]:
compute_bert_score(hierarchical_summaries, references)

BERTScore:
  precision: -0.0129
  recall: 0.0055
  f1: -0.0023


In [35]:
print_token_counts(hierarchical_summaries, tokenizer, label="hierarchical outputs")

hierarchical outputs 1: 625 tokens
hierarchical outputs 2: 1022 tokens
hierarchical outputs 3: 1022 tokens
hierarchical outputs 4: 858 tokens
hierarchical outputs 5: 1022 tokens
hierarchical outputs 6: 927 tokens
hierarchical outputs 7: 1022 tokens
hierarchical outputs 8: 1022 tokens
hierarchical outputs 9: 1022 tokens
hierarchical outputs 10: 1022 tokens


In [None]:
hierarchical_summaries_512 = []
for doc in tqdm(documents, desc="Hierarchical Summarization"):
    hierarchical_summaries_512.append(hierarchical_summarize(doc, 512))

Hierarchical Summarization: 100%|██████████| 10/10 [16:06<00:00, 96.63s/it] 


In [41]:
print_token_counts(hierarchical_summaries_512, tokenizer, label="hierarchical outputs")

hierarchical outputs 1: 294 tokens
hierarchical outputs 2: 510 tokens
hierarchical outputs 3: 510 tokens
hierarchical outputs 4: 510 tokens
hierarchical outputs 5: 510 tokens
hierarchical outputs 6: 510 tokens
hierarchical outputs 7: 510 tokens
hierarchical outputs 8: 509 tokens
hierarchical outputs 9: 510 tokens
hierarchical outputs 10: 510 tokens


In [42]:
with open("hierarchical_summaries_512.json", "w", encoding="utf-8") as f:
    json.dump(hierarchical_summaries_512, f, ensure_ascii=False, indent=2)

In [43]:
compute_rouge(hierarchical_summaries_512, references)

ROUGE scores:
  rouge1: 0.4402
  rouge2: 0.1168
  rougeL: 0.1797


In [44]:
compute_bert_score(hierarchical_summaries_512, references)

BERTScore:
  precision: 0.0231
  recall: 0.0467
  f1: 0.0362
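
The scores printed across the cells above can be collected in one place for a side-by-side view. The numbers below are copied from the notebook outputs; the dictionary layout and the labels are assumptions made for this sketch. With only 10 validation samples, the differences should be read as tentative.

```python
# Metric values copied from the cell outputs above (10 validation samples each).
results = {
    "baseline (1024)":     {"rouge1": 0.4665, "rouge2": 0.1530, "rougeL": 0.1907, "bertscore_f1": -0.0782},
    "hierarchical (1024)": {"rouge1": 0.5050, "rouge2": 0.1519, "rougeL": 0.1907, "bertscore_f1": -0.0023},
    "hierarchical (512)":  {"rouge1": 0.4402, "rouge2": 0.1168, "rougeL": 0.1797, "bertscore_f1": 0.0362},
}

metrics = ["rouge1", "rouge2", "rougeL", "bertscore_f1"]
print(f"{'run':<22}" + "".join(f"{m:>14}" for m in metrics))
for run, scores in results.items():
    print(f"{run:<22}" + "".join(f"{scores[m]:>14.4f}" for m in metrics))

# Which run scores highest on each metric.
best = {m: max(results, key=lambda run: results[run][m]) for m in metrics}
print(best)
```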
