[hardening] PdfParser silently produces empty output for scanned / image-only PDFs #60

@MTCMarkFranco

Enhancement

PdfParser.ParseAsync produces an empty ParsedDocument when given a scanned / image-only PDF (no text layer). Downstream, the pipeline silently reports zero findings, which is indistinguishable from the result for a clean document. This is a serious failure mode for real customer uploads.

Acceptance criteria

  • After parse, if total canonical text length is 0 AND the PDF has ≥1 page with non-zero image content, throw a typed ScannedPdfNotSupportedException (or equivalent) with a clear message.
  • Alternatively (tracked here as one acceptable resolution): record the failure in ParsedDocument.Metadata under the key parser_warning, and add an integration check that surfaces it in CLI / API output.
  • Decision documented in the implementation PR.
  • Unit test with a small image-only PDF fixture.
  • Engine genericity guard still passes.
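
To make the two acceptable resolutions concrete, here is a minimal, language-neutral sketch of the post-parse check, written in Python for brevity and not in the project's own language. `PageInfo`, `build_document`, and the `strict` flag are hypothetical names introduced only for illustration. The `ParsedDocument` shape here is a stand-in, not the real type. Treating whitespace-only text as empty is an assumption about what "canonical text length is 0" means.

```python
from dataclasses import dataclass, field


class ScannedPdfNotSupportedException(Exception):
    """Raised when a PDF has image content but no extractable text."""


@dataclass
class PageInfo:
    # Hypothetical per-page summary that a real parser would populate.
    text: str
    image_count: int


@dataclass
class ParsedDocument:
    # Stand-in for the real ParsedDocument; only the fields the check needs.
    text: str
    metadata: dict = field(default_factory=dict)


def build_document(pages, strict=True):
    """Assemble a document and flag the scanned / image-only case.

    strict=True  -> first resolution: raise a typed exception.
    strict=False -> second resolution: record a parser_warning in metadata.
    """
    text = "".join(page.text for page in pages)
    doc = ParsedDocument(text=text)

    has_images = any(page.image_count > 0 for page in pages)
    # Assumption: whitespace-only output counts as "no text".
    no_text = len(text.strip()) == 0

    if no_text and has_images:
        message = (
            "PDF has image content but no text layer; "
            "scanned PDFs are not supported (no OCR)."
        )
        if strict:
            raise ScannedPdfNotSupportedException(message)
        doc.metadata["parser_warning"] = message

    return doc
```

With either option, a PDF that has neither text nor images (for example, blank pages) passes through unflagged, which matches the first criterion as written. Whether that case also deserves a warning is left to the implementation PR.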

Out of scope

  • OCR integration (would be its own feature, not hardening).
