extractPages retains all original PDF objects, resulting in no size reduction #60

@yk-kd

Summary

extractPages() does not reduce file size because all objects from the original PDF are retained in the new document's context. For PDFs with large embedded resources (e.g., full CJK font files), extracting a few pages produces output nearly identical in size to the original.

Reproduction

import { PDF } from '@libpdf/core';
import { readFileSync } from 'fs';

const pdfBytes = readFileSync('large.pdf'); // 80MB, 128 pages
const doc = await PDF.load(pdfBytes);

// Extract only 5 pages
const extracted = await doc.extractPages([0, 1, 2, 3, 4]);
const bytes = await extracted.save({ incremental: false, subsetFonts: true });

console.log(bytes.length); // ~80MB — same as original

Individual page analysis

Pages                              Size
Original (128 pages)               79.7 MB
extractPages([0,1,2,3,4])          79.7 MB
extractPages([0]) (page 1 only)    79.7 MB
extractPages([1]) (page 2 only)    0.6 MB

Page 1 contains a CJK font (~79 MB of stream data across 641 objects). Any extraction that includes page 1 produces a file the same size as the original, even though only the glyphs used on the extracted pages should be needed.
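
For anyone who wants to repeat the per-page measurements, a small harness along these lines might help. This is only a sketch: `ExtractFn`, `SizeRow`, and `measureExtractedSizes` are hypothetical names, the extraction step is passed in as a callback so the harness does not depend on any particular library, and sizes are reported in decimal megabytes (an assumption about how the figures above were computed).

```typescript
// Hypothetical callback: takes zero-based page indices and resolves to the
// bytes of the saved, extracted document.
type ExtractFn = (pageIndices: number[]) => Promise<Uint8Array>;

interface SizeRow {
  pages: number[];
  bytes: number;
  megabytes: string; // decimal MB, one fractional digit
}

// Runs the extraction once per page set and records the output size.
async function measureExtractedSizes(
  extract: ExtractFn,
  pageSets: number[][],
): Promise<SizeRow[]> {
  const rows: SizeRow[] = [];
  for (const pages of pageSets) {
    const bytes = (await extract(pages)).length;
    rows.push({ pages, bytes, megabytes: (bytes / 1_000_000).toFixed(1) });
  }
  return rows;
}
```

With the library from this report, the callback could wrap the same calls used in the reproduction, i.e. `doc.extractPages(pages)` followed by `save({ incremental: false })`, and be run over page sets such as `[[0, 1, 2, 3, 4], [0], [1]]`.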

Attempted workarounds (none reduced size)

  • save({ subsetFonts: true, incremental: false })
  • flattenAnnotations() / flattenAll() / flattenLayers()
  • Reload: PDF.load(await extracted.save()) then save again
  • copyPagesFrom() to a new PDF.create() document
  • Setting pdf.ctx = null before saving extracted document

Comparison with pdf-lib

For reference, pdf-lib (which @libpdf/core is forked from) handles this correctly:

import { PDFDocument } from 'pdf-lib';

const srcDoc = await PDFDocument.load(pdfBytes); // 80MB, 128 pages
const newDoc = await PDFDocument.create();
const pages = await newDoc.copyPages(srcDoc, [0, 1, 2, 3, 4]);
for (const p of pages) newDoc.addPage(p);

const bytes = await newDoc.save();
console.log(bytes.length); // ~2.6MB ✅

Method                        5 pages    Page 1 only
@libpdf/core extractPages     79.7 MB    79.7 MB
@libpdf/core copyPagesFrom    79.7 MB    79.7 MB
pdf-lib copyPages             2.6 MB     0.3 MB

Expected Behavior

extractPages() (and copyPagesFrom()) should only include objects that are actually referenced by the extracted pages, and save() should garbage-collect unreachable objects.
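
To make the expected behavior concrete, here is a minimal sketch of reachability-based pruning (mark, then sweep) over a toy object table. It is not @libpdf/core's implementation or its internal data model: `PdfObjectNode`, `ObjectTable`, `collectReachable`, `sweep`, and `totalSize` are all hypothetical names, and the object sizes are made up for illustration.

```typescript
// Hypothetical, simplified model of a PDF object table (not @libpdf/core's types).
interface PdfObjectNode {
  id: number;
  size: number;   // serialized size in bytes
  refs: number[]; // ids of the objects this one references
}

type ObjectTable = Map<number, PdfObjectNode>;

// Mark phase: collect every object id reachable from the given roots.
function collectReachable(objects: ObjectTable, roots: number[]): Set<number> {
  const reachable = new Set<number>();
  const stack = [...roots];
  while (stack.length > 0) {
    const id = stack.pop()!;
    if (reachable.has(id)) continue; // already visited (also guards against cycles)
    const node = objects.get(id);
    if (node === undefined) continue; // dangling reference: nothing to keep
    reachable.add(id);
    for (const ref of node.refs) stack.push(ref);
  }
  return reachable;
}

// Sweep phase: build a new table holding only the reachable objects.
function sweep(objects: ObjectTable, reachable: Set<number>): ObjectTable {
  const kept: ObjectTable = new Map();
  objects.forEach((node, id) => {
    if (reachable.has(id)) kept.set(id, node);
  });
  return kept;
}

// Sum of the serialized sizes of all objects in a table.
function totalSize(objects: ObjectTable): number {
  let sum = 0;
  objects.forEach((node) => {
    sum += node.size;
  });
  return sum;
}

// Toy document: page 1 (id 1) references a large font (id 3);
// page 2 (id 2) references a small image (id 4). Sizes are invented.
const toyObjects: ObjectTable = new Map<number, PdfObjectNode>([
  [1, { id: 1, size: 1_000, refs: [3] }],
  [2, { id: 2, size: 1_000, refs: [4] }],
  [3, { id: 3, size: 79_000_000, refs: [] }],
  [4, { id: 4, size: 500_000, refs: [] }],
]);

// Keeping only what page 2 needs drops the font entirely.
const keptForPage2 = sweep(toyObjects, collectReachable(toyObjects, [2]));
console.log(totalSize(toyObjects), totalSize(keptForPage2));
```

In a real PDF each page also carries a /Parent entry that points back into the page tree, so a traversal like this would need to skip or rewrite that link; otherwise every page would count as reachable from any single page. Note that this pruning alone would not shrink an extraction that includes page 1, since the font is still referenced there; that case also depends on font subsetting.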

Environment

  • @libpdf/core: 0.3.4
  • Node.js: 24.14.1
  • OS: macOS (also reproduced on Linux via Docker)
