Bug Description
PDFs with large amounts of trailing null bytes (e.g., ~2 MB of zero padding after `%%EOF`) fail to load because `findStartXRef()` only searches the last 1024 bytes of the file. When the padding exceeds 1024 bytes, the `startxref` marker is not found, normal parsing fails, and the parser falls back to brute-force recovery, which then fails to resolve compressed objects.
What happens
- `findStartXRef()` searches backwards from the end of the file, but only within the last 1024 bytes (src/parser/xref-parser.ts:52)
- The search window falls entirely within the trailing null padding, so `startxref` is not found
- Normal parsing fails with "Could not find startxref marker"
- The parser falls back to brute-force recovery (`parseWithRecovery`)
- Brute-force recovery scans for `N M obj` patterns, but it creates `IndirectObjectParser` instances without a `lengthResolver`, so all streams whose `/Length` is an indirect reference fail to parse
- The brute-force xref therefore only discovers uncompressed objects (e.g., 243 out of 376)
- Objects inside compressed object streams (type 2 xref entries) are completely missing
- If the `/Pages` root is in a compressed object stream, loading fails:

  ```
  Error: Root Pages object is not a dictionary
      at PDFPageTree.load
      at PDF.load
  ```
Reproduction
Any PDF with more than 1024 bytes of trailing null/zero padding after `%%EOF` will trigger this. Such files are common when PDFs are uploaded through systems that pad to block boundaries.
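The failing input can be reproduced synthetically by appending more than 1024 zero bytes to any valid PDF. A minimal sketch (the PDF bytes here are a stand-in; any real PDF's bytes behave the same):

```typescript
// Stand-in PDF tail; any real PDF's bytes work identically.
const pdfBytes = new TextEncoder().encode(
  "%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\nstartxref\n9\n%%EOF\n"
);

// 2 MB of trailing 0x00 padding, like block-aligned uploads produce
const padding = new Uint8Array(2 * 1024 * 1024);

const padded = new Uint8Array(pdfBytes.length + padding.length);
padded.set(pdfBytes, 0);
padded.set(padding, pdfBytes.length);

// The last 1024 bytes are now all zeros, so a fixed 1024-byte tail search
// can never see the startxref marker.
const tail = padded.subarray(padded.length - 1024);
console.log(tail.every((b) => b === 0)); // true
```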
```js
// A PDF where the actual %%EOF is at offset ~6.5 MB but the file is ~8.7 MB
// (2.17 MB of trailing 0x00 bytes)
const pdf = await PDF.load(bytes);
// Throws: "Root Pages object is not a dictionary"
```

Root Causes
1. `findStartXRef` search window too small

src/parser/xref-parser.ts:52:

```js
const searchStart = Math.max(0, len - 1024);
```

The fix should either:
- Skip trailing null/whitespace bytes before searching, or
- Use a significantly larger search window (pdf.js also uses 1024 bytes but additionally handles trailing garbage)
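The first option can be sketched as follows; this is an illustrative implementation, not the library's actual code, assuming the function receives the raw file bytes:

```typescript
// Skip trailing NUL and PDF whitespace first, then search a 1024-byte
// window that ends at the real content tail. Returns the byte offset of
// the last "startxref" marker, or -1 if none is found.
function findStartXRef(bytes: Uint8Array): number {
  const skippable = [0x00, 0x09, 0x0a, 0x0c, 0x0d, 0x20];
  let end = bytes.length;
  while (end > 0 && skippable.includes(bytes[end - 1])) {
    end--;
  }
  const searchStart = Math.max(0, end - 1024);
  const tail = new TextDecoder("latin1").decode(bytes.subarray(searchStart, end));
  const idx = tail.lastIndexOf("startxref");
  return idx === -1 ? -1 : searchStart + idx;
}
```

With this change, the 2 MB of zero padding is skipped in a single backward scan before the fixed-size window is applied, so the window always covers real file content.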
2. Brute-force recovery doesn't handle compressed objects

`parseWithRecovery()` in src/parser/document-parser.ts:
- Scans for `N M obj` patterns to build an xref
- But `IndirectObjectParser` instances are created without a `lengthResolver`, so streams with an indirect `/Length` fail
- Objects inside object streams (`ObjStm`) are invisible to this scan, because they have no `obj`/`endobj` wrappers
- The recovery should attempt to parse discovered xref streams, or parse object streams to discover their compressed contents
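The `lengthResolver` gap could be closed by resolving indirect `/Length` references against the offset map the brute-force scan already builds. A hypothetical sketch (none of these names are the library's real API; a `/Length` target is assumed to be a bare-integer object):

```typescript
type LengthResolver = (objNum: number) => number | null;

// offsets: object number -> byte offset, as discovered by the N M obj scan
function makeLengthResolver(
  bytes: Uint8Array,
  offsets: Map<number, number>
): LengthResolver {
  const text = new TextDecoder("latin1").decode(bytes);
  return (objNum) => {
    const off = offsets.get(objNum);
    if (off === undefined) return null;
    // A /Length target is a bare integer: "N G obj <int> endobj"
    const m = /obj\s+(\d+)\s+endobj/.exec(text.slice(off, off + 64));
    return m ? parseInt(m[1], 10) : null;
  };
}
```

Passing such a resolver into the recovery-path `IndirectObjectParser` instances would let streams with indirect `/Length` parse, which in turn makes the discovered `ObjStm` and xref streams readable.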
3. (Separate, lower priority) `PdfDict` LRU cache correctness issue

`PdfDict` uses `Map<PdfName, PdfObject>` with `PdfName` instances as keys. Since `PdfName.of()` uses an LRU cache (max 10,000 entries), evicted names get new instances, causing `Map.get()` to silently fail. This can cause "Document has no catalog (missing /Root in trailer)" errors at save time in long-running servers processing many PDFs. Fix: use `Map<string, PdfObject>` internally.
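The proposed fix can be sketched as follows; `PdfName` and `PdfDict` are heavily simplified stand-ins for the library's classes:

```typescript
class PdfName {
  constructor(readonly value: string) {}
  static of(value: string): PdfName {
    // Imagine an LRU cache here: after eviction, of() returns a fresh
    // instance, so instance identity cannot be relied on.
    return new PdfName(value);
  }
}

class PdfDict {
  // Keyed by the name's string value, so lookups survive cache eviction.
  private readonly entries = new Map<string, unknown>();

  set(key: PdfName, value: unknown): void {
    this.entries.set(key.value, value);
  }

  get(key: PdfName): unknown {
    // Works even when this PdfName instance differs from the one used in set()
    return this.entries.get(key.value);
  }
}
```

Keying by string makes lookups depend only on the name's value, not on which `PdfName` instance happens to be alive in the cache.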
Environment
- `@libpdf/core` v2.11
- Node.js (long-running server processing many PDFs)