[hardening] DocxParser: preserve line breaks, tabs, and soft hyphens (stop using InnerText) #58

@MTCMarkFranco

Description

Enhancement

DocxParser.ProcessParagraph calls para.InnerText, which concatenates all descendant text without preserving:

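  • Explicit line breaks (<w:br/>): collapsed, so the text on either side is joined with no separator.
  • Tabs (<w:tab/>): contribute an empty string, so adjacent fields run together.
  • Soft hyphens / non-breaking hyphens (<w:softHyphen/>, <w:noBreakHyphen/>): dropped as well.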
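
To make the failure mode concrete, here is a minimal sketch in Python (standard library only) that mimics the behaviour described above. It is not the DocxParser code: `inner_text_like` is a hypothetical stand-in that joins only the `<w:t>` text nodes, which is the effect attributed to para.InnerText in this issue, and the sample paragraph is invented for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_T = f"{{{W}}}t"

# Invented sample: two address lines separated by <w:br/>, then a tabbed field.
PARA_XML = (
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:t>221B Baker Street</w:t><w:br/><w:t>London</w:t></w:r>"
    "<w:r><w:tab/><w:t>NW1</w:t></w:r>"
    "</w:p>"
)


def inner_text_like(para):
    """Hypothetical stand-in for InnerText: join <w:t> text, ignore everything else."""
    return "".join(t.text or "" for t in para.iter(W_T))


para = ET.fromstring(PARA_XML)
print(inner_text_like(para))  # 221B Baker StreetLondonNW1
```

The break and the tab leave no trace in the output, which is the run-on string described under Impact.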

Impact:

  • Multi-line addresses, signature blocks, and bulleted clauses that contain line breaks become a single run-on string, which breaks the sentence-boundary heuristics in selectors.
  • Tab-separated content (e.g. tabular paragraphs that aren't tables) loses field separation.

Acceptance criteria

  • Replace para.InnerText with a run-walker that emits \n for Break, \t for TabChar, - for SoftHyphen/NoBreakHyphen.
  • Update NormalizeInlineParagraph (or add a new normalizer) so that it keeps intentional \n while still collapsing whitespace runs inside a single visual line.
  • Unit tests covering each token type.
  • Existing parser tests still pass.
  • Engine genericity guard still passes.
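
The first two criteria could be sketched roughly as follows. This is a language-neutral illustration in Python (standard library only), not the C# implementation: `paragraph_text` and `normalize_keep_breaks` are hypothetical names, and the token mapping follows the criteria above (\n for Break, \t for TabChar, - for SoftHyphen/NoBreakHyphen). Two choices are assumptions of this sketch rather than requirements of the issue: every `<w:br/>` becomes \n regardless of its type attribute, and the normalizer collapses runs of spaces only, leaving tabs untouched so field separation survives.

```python
import re
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _tag(name):
    """Qualified ElementTree tag for a WordprocessingML element name."""
    return f"{{{W}}}{name}"


# Non-text run children and the character each one should emit.
TOKENS = {
    _tag("br"): "\n",  # assumption: page/column breaks are not treated specially
    _tag("tab"): "\t",
    _tag("softHyphen"): "-",
    _tag("noBreakHyphen"): "-",
}


def paragraph_text(para):
    """Walk each run's children in document order, emitting text and tokens."""
    parts = []
    for run in para.iter(_tag("r")):
        for child in run:
            if child.tag == _tag("t"):
                parts.append(child.text or "")
            elif child.tag in TOKENS:
                parts.append(TOKENS[child.tag])
    return "".join(parts)


def normalize_keep_breaks(text):
    """Collapse runs of spaces within each visual line; keep \\n and \\t."""
    lines = text.split("\n")
    return "\n".join(re.sub(r" +", " ", line).strip(" ") for line in lines)


# Invented sample covering each token type.
SAMPLE = ET.fromstring(
    f'<w:p xmlns:w="{W}"><w:r>'
    '<w:t xml:space="preserve">Jane   Doe </w:t><w:br/>'
    "<w:t>co</w:t><w:softHyphen/><w:t>operate</w:t>"
    "<w:tab/><w:t>42</w:t>"
    "</w:r></w:p>"
)

print(repr(normalize_keep_breaks(paragraph_text(SAMPLE))))
```

Keeping the walker and the normalizer separate lets the unit tests cover each token type against the raw walker output, while the existing whitespace-collapsing expectations are checked against the normalizer alone.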

Out of scope

  • Hyperlink target preservation (separate concern).
