## Summary

`extractPages()` does not reduce file size because all objects from the original PDF are retained in the new document's context. For PDFs with large embedded resources (e.g., full CJK font files), extracting a few pages produces output nearly identical in size to the original.
## Reproduction

```ts
import { PDF } from '@libpdf/core';
import { readFileSync } from 'fs';

const pdfBytes = readFileSync('large.pdf'); // 80 MB, 128 pages
const doc = await PDF.load(pdfBytes);

// Extract only 5 pages
const extracted = await doc.extractPages([0, 1, 2, 3, 4]);
const bytes = await extracted.save({ incremental: false, subsetFonts: true });

console.log(bytes.length); // ~80 MB — same as the original
```
## Individual page analysis

| Pages | Size |
| --- | --- |
| Original (128 pages) | 79.7 MB |
| `extractPages([0, 1, 2, 3, 4])` | 79.7 MB |
| `extractPages([0])` — page 1 only | 79.7 MB |
| `extractPages([1])` — page 2 only | 0.6 MB |
Page 1 contains a CJK font (~79 MB of stream data across 641 objects). Extracting any set of pages that includes page 1 produces a file of the same size as the original, even though only the glyphs used on that page should be needed.
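One rough way to confirm where the bytes live is to tally the raw stream payloads in the file. The helper below is invented for illustration (it is not part of `@libpdf/core`) and assumes the classic uncompressed `stream`/`endstream` layout; payloads hidden inside compressed object streams will be undercounted.

```ts
// Hypothetical helper: count stream objects and sum their payload bytes.
// Decoding as latin1 keeps a 1:1 byte-to-character mapping, so lengths
// measured on the string equal lengths in the underlying buffer.
function totalStreamBytes(bytes: Buffer): { count: number; total: number } {
  const text = bytes.toString('latin1');
  const re = /stream\r?\n([\s\S]*?)\r?\nendstream/g;
  let count = 0;
  let total = 0;
  for (let m = re.exec(text); m !== null; m = re.exec(text)) {
    count += 1;
    total += m[1].length;
  }
  return { count, total };
}
```

Running this over `readFileSync('large.pdf')` is how one would check whether the reported ~79 MB really sits in stream data rather than in dictionary overhead.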
## Attempted workarounds (none reduced size)

- `save({ subsetFonts: true, incremental: false })`
- `flattenAnnotations()` / `flattenAll()` / `flattenLayers()`
- Reload: `PDF.load(await extracted.save())`, then save again
- `copyPagesFrom()` into a new `PDF.create()` document
- Setting `pdf.ctx = null` before saving the extracted document
## Comparison with pdf-lib

For reference, pdf-lib (which @libpdf/core is forked from) handles this correctly:

```ts
import { PDFDocument } from 'pdf-lib';

const srcDoc = await PDFDocument.load(pdfBytes); // 80 MB, 128 pages
const newDoc = await PDFDocument.create();
const pages = await newDoc.copyPages(srcDoc, [0, 1, 2, 3, 4]);
for (const p of pages) newDoc.addPage(p);

const bytes = await newDoc.save();
console.log(bytes.length); // ~2.6 MB ✅
```
| Method | 5 pages | Page 1 only |
| --- | --- | --- |
| @libpdf/core `extractPages` | 79.7 MB | 79.7 MB |
| @libpdf/core `copyPagesFrom` | 79.7 MB | 79.7 MB |
| pdf-lib `copyPages` | 2.6 MB | 0.3 MB |
## Expected Behavior

`extractPages()` (and `copyPagesFrom()`) should include only the objects actually referenced by the extracted pages, and `save()` should garbage-collect unreachable objects.
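The expected fix amounts to a reachability pass over the indirect-object graph before serialization. A minimal sketch of the idea on a toy graph (object numbers and edges are invented for illustration; this is not libpdf's internal representation):

```ts
// Toy object graph: object number -> object numbers it references.
// Shape is illustrative only.
const graph = new Map<number, number[]>([
  [1, [2]], // page 1 -> its /Resources dictionary
  [2, [3]], // resources -> the one font actually used
  [3, []],  // that font's descriptor and streams
  [4, []],  // an unreferenced CJK font megastream
]);

// Mark phase: collect everything reachable from the extracted pages' roots.
function reachable(roots: number[], g: Map<number, number[]>): Set<number> {
  const seen = new Set<number>();
  const stack = [...roots];
  while (stack.length > 0) {
    const n = stack.pop()!;
    if (seen.has(n)) continue;
    seen.add(n);
    for (const m of g.get(n) ?? []) stack.push(m);
  }
  return seen;
}

// A save rooted at page 1 should emit objects {1, 2, 3} and drop 4.
```

This is essentially what pdf-lib's `copyPages` achieves by copying only referenced objects into the destination context, which matches the sizes in the comparison table above.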
## Environment

- @libpdf/core: 0.3.4
- Node.js: 24.14.1
- OS: macOS (also reproduced on Linux via Docker)