[hardening] PdfParser font-size heading heuristic uses page-wide max instead of segment-local size #59

@MTCMarkFranco

Bug

The font-size heuristic in PdfParser.ClassifySegment compares the largest font size found anywhere on the page against the median, rather than the font size of the segment being classified:

Location: src/LambdaRag.Parsing/PdfParser.cs ~line 127

double maxSize = lineFontSizes.Values.Max();
if (maxSize >= medianFontSize * 1.2 && text.Length <= 200)
    return ContentBlockKind.Heading;

Impact: On any page containing even one title at least 1.2× the median font size, every paragraph of 200 characters or fewer on that page is classified as a heading. This produces phantom headings and corrupts HeadingPath for downstream selectors and projections.

Expected: Map each parsed segment back to the y-range of the letters that composed it (or, equivalently, build paragraph boundaries from Letters rather than from page.Text), and use that segment's own max font size for the comparison.
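
A minimal sketch of the segment-local measurement described above, not the actual PdfParser code. The `Glyph` record, `BlockKind` enum, and both method names are hypothetical stand-ins. It also assumes each segment's y-range is already known and that a glyph's baseline y is enough to assign it to a segment. A real implementation would likely use glyph bounding boxes and may need an x-range as well on multi-column pages.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the parser's own types; names are illustrative only.
public enum BlockKind { Paragraph, Heading }

// One glyph: its baseline y-coordinate and its font size in points.
public sealed record Glyph(double Y, double FontSize);

public static class SegmentFontSizing
{
    // Largest font size among glyphs whose baseline lies in [yMin, yMax].
    // Returns null when no glyph falls inside the range.
    public static double? MaxFontSizeInRange(
        IEnumerable<Glyph> glyphs, double yMin, double yMax)
    {
        var sizes = glyphs
            .Where(g => g.Y >= yMin && g.Y <= yMax)
            .Select(g => g.FontSize)
            .ToList();

        return sizes.Count == 0 ? (double?)null : sizes.Max();
    }

    // Same thresholds as the snippet in this issue (1.2x median, 200 chars),
    // but compared against the segment's own max size, not the page-wide max.
    public static BlockKind Classify(
        string text, double? segmentMaxSize, double medianFontSize)
    {
        if (segmentMaxSize is double size
            && size >= medianFontSize * 1.2
            && text.Length <= 200)
        {
            return BlockKind.Heading;
        }

        return BlockKind.Paragraph;
    }
}
```

For example, take a page with a 24 pt title at y = 700, 12 pt body lines below it, and a median font size of 12. The title's range yields 24, which meets the 14.4 threshold, so the title is classified as a heading. A short body paragraph yields 12 and stays a paragraph. With the page-wide max, both would have been compared using 24.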

Acceptance criteria

  • Segment-level font-size measurement implemented.
  • Unit test with a synthetic two-section PDF (one large title, then several normal paragraphs) verifying only the title is classified as a heading.
  • Existing PDF parser tests still pass.
  • Engine genericity guard still passes.

Metadata

Labels: bug (Something isn't working), phase-1-pattern-def (Phase 1: Pattern definition (writing))
