
findStartXRef fails on PDFs with large trailing null padding, causing brute-force fallback to miss compressed objects #54

@redsuperbat

Bug Description

PDFs with large amounts of trailing null bytes (e.g., ~2MB of zero padding after %%EOF) fail to load because findStartXRef() only searches the last 1024 bytes of the file. When the padding exceeds 1024 bytes, the startxref marker is not found, normal parsing fails, and the parser falls back to brute-force recovery — which then fails to resolve compressed objects.

What happens

  1. findStartXRef() searches backwards from end of file, but only within 1024 bytes (src/parser/xref-parser.ts:52)
  2. The search window is entirely within trailing null padding, so startxref is not found
  3. Normal parsing fails: "Could not find startxref marker"
  4. Parser falls back to brute-force recovery (parseWithRecovery)
  5. Brute-force scans for N M obj patterns but creates IndirectObjectParser instances without a lengthResolver — so all streams with indirect /Length references fail to parse
  6. The brute-force xref only discovers uncompressed objects (e.g., 243 out of 376)
  7. Objects inside compressed object streams (type 2 xref entries) are completely missing
  8. If the /Pages root is in a compressed object stream, loading fails:
Error: Root Pages object is not a dictionary
    at PDFPageTree.load
    at PDF.load
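The window problem in step 1 can be sketched with a minimal stand-alone model (illustrative names, not the library's actual `findStartXRef` internals): the search window is anchored at the physical end of the file, so any padding longer than the window hides the marker entirely.

```typescript
// Minimal model of the fixed-window search described above.
function findStartXRefNaive(bytes: Uint8Array, window = 1024): number {
  const start = Math.max(0, bytes.length - window);
  const tail = new TextDecoder().decode(bytes.subarray(start));
  const idx = tail.lastIndexOf("startxref");
  return idx === -1 ? -1 : start + idx;
}

// A fake PDF tail followed by 2 KB of zero padding (Uint8Array is
// zero-filled by default):
const tail = new TextEncoder().encode("startxref\n12345\n%%EOF\n");
const padded = new Uint8Array(tail.length + 2048);
padded.set(tail, 0);

findStartXRefNaive(padded); // -1: the marker sits outside the 1024-byte window
```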

Reproduction

Any PDF with >1024 bytes of trailing null/zero padding after %%EOF will trigger this. Such files are common when PDFs are uploaded through systems that pad to block boundaries.

// A PDF where the actual %%EOF is at offset ~6.5MB but the file is ~8.7MB
// (2.17MB of trailing 0x00 bytes)
const pdf = await PDF.load(bytes);
// Throws: "Root Pages object is not a dictionary"

Root Causes

1. findStartXRef search window too small

src/parser/xref-parser.ts:52:

const searchStart = Math.max(0, len - 1024);

The fix should either:

  • Skip trailing null/whitespace bytes before searching, or
  • Use a significantly larger search window (pdf.js uses 1024 but also handles trailing garbage)
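A hedged sketch of the first option: trim trailing NUL/whitespace bytes before anchoring the window, so block-boundary padding of any size no longer hides the marker. Function name and signature are illustrative, not the library's actual API.

```typescript
// Trim trailing padding, then search the last `window` bytes before it.
function findStartXRefTolerant(bytes: Uint8Array, window = 1024): number {
  // Skip trailing 0x00, TAB, LF, CR, space — common block-padding bytes.
  const padding = new Set([0x00, 0x09, 0x0a, 0x0d, 0x20]);
  let end = bytes.length;
  while (end > 0 && padding.has(bytes[end - 1])) {
    end--;
  }
  const start = Math.max(0, end - window);
  const tail = new TextDecoder().decode(bytes.subarray(start, end));
  const idx = tail.lastIndexOf("startxref");
  return idx === -1 ? -1 : start + idx;
}
```

The trim loop is O(padding length), but a linear byte scan over even a couple of megabytes of zeros is cheap compared to parsing the document.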

2. Brute-force recovery doesn't handle compressed objects

parseWithRecovery() in src/parser/document-parser.ts:

  • Scans for N M obj patterns to build an xref
  • But IndirectObjectParser instances are created without a lengthResolver, so streams with indirect /Length fail
  • Objects inside object streams (ObjStm) are invisible to this scan — they don't have obj/endobj wrappers
  • The recovery should attempt to parse discovered xref streams, or parse object streams to discover their compressed contents
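For the last point, the PDF spec makes object streams straightforward to enumerate once decoded: the ObjStm dictionary carries /N (number of contained objects) and /First (byte offset of the first object), and the decoded data begins with N pairs of "object number, relative offset". A sketch of enumerating those entries, assuming the stream has already been decoded (illustrative helper, not the library's API):

```typescript
// List the (objNum, absolute offset) pairs inside a decoded ObjStm,
// given its /N and /First values.
function listObjStmEntries(
  decoded: Uint8Array,
  n: number,
  first: number,
): Array<{ objNum: number; offset: number }> {
  // The header region [0, first) holds N whitespace-separated number pairs.
  const header = new TextDecoder().decode(decoded.subarray(0, first));
  const tokens = header.trim().split(/\s+/).map(Number);
  const entries: Array<{ objNum: number; offset: number }> = [];
  for (let i = 0; i < n; i++) {
    entries.push({ objNum: tokens[2 * i], offset: first + tokens[2 * i + 1] });
  }
  return entries;
}
```

Feeding these entries back into the recovered xref (as type-2-style entries) would let brute-force recovery see the compressed objects instead of silently dropping them.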

3. (Separate, lower priority) PdfDict LRU cache correctness issue

PdfDict uses Map<PdfName, PdfObject> with PdfName instances as keys. Since PdfName.of() uses an LRU cache (max 10,000), evicted names get new instances, causing Map.get() to silently fail. This can cause "Document has no catalog (missing /Root in trailer)" errors at save time in long-running servers processing many PDFs. Fix: use Map<string, PdfObject> internally.
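The pitfall can be reproduced with a toy interning cache (a simplified model: eviction is simulated by clearing the cache rather than an LRU, and class names are illustrative):

```typescript
// Toy interned-name class: Name.of() returns a cached instance per string.
class Name {
  private static cache = new Map<string, Name>();
  private constructor(readonly value: string) {}
  static of(value: string): Name {
    let n = Name.cache.get(value);
    if (!n) { n = new Name(value); Name.cache.set(value, n); }
    return n;
  }
  static evictAll(): void { Name.cache.clear(); } // stands in for LRU eviction
}

// Map keyed by Name instances relies on reference identity.
const dict = new Map<Name, string>();
dict.set(Name.of("Root"), "1 0 R");

Name.evictAll(); // after eviction, Name.of("Root") is a fresh instance
dict.get(Name.of("Root")); // undefined: identity-keyed lookup silently misses

// Fix: key the internal map by the underlying string instead.
const safeDict = new Map<string, string>();
safeDict.set(Name.of("Root").value, "1 0 R");
Name.evictAll();
safeDict.get(Name.of("Root").value); // "1 0 R"
```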

Environment

  • @libpdf/core v2.11
  • Node.js (long-running server processing many PDFs)
