Enhancement
DocxParser.ProcessParagraph calls para.InnerText, which concatenates all descendant text without preserving:
- Explicit line breaks (<w:br/>): collapsed.
- Tabs (<w:tab/>): emitted as empty string.
- Soft hyphens / non-breaking hyphens (<w:softHyphen/>, <w:noBreakHyphen/>).
Impact:
- Multi-line addresses, signature blocks, and bulleted clauses-with-line-breaks become a single run-on string, breaking sentence-boundary heuristics in selectors.
- Tab-separated content (e.g. tabular paragraphs that aren't tables) loses field separation.
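To make the failure mode concrete, here is a minimal sketch in Python (standard library only, not the C# code in DocxParser) that mimics text-only concatenation over a paragraph's XML. The sample paragraph and the name flatten_text are hypothetical; the sketch assumes only that the flattening keeps <w:t> content and contributes nothing for <w:br/> and <w:tab/>, as described above.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical sample: a two-line address followed by a tab-separated field.
PARA_XML = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t>221B Baker Street</w:t><w:br/><w:t>London</w:t></w:r>'
    '<w:r><w:tab/><w:t>NW1</w:t></w:r>'
    '</w:p>'
)


def flatten_text(para):
    """Concatenate only <w:t> text; breaks and tabs contribute nothing."""
    return "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))


if __name__ == "__main__":
    # The line break and the tab are both lost in the flattened string.
    print(repr(flatten_text(ET.fromstring(PARA_XML))))
```

The address lines and the tab-separated field run together, which is the run-on string described under Impact.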
Acceptance criteria
- Replace para.InnerText with a run-walker that emits \n for Break, \t for TabChar, and - for SoftHyphen/NoBreakHyphen.
- NormalizeInlineParagraph is updated (or a new normalizer is added) to keep intentional \n while still collapsing whitespace runs inside a single visual line.
Out of scope
- Hyperlink target preservation (separate concern).
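The run-walker and the newline-preserving normalizer from the acceptance criteria could look roughly like the sketch below. It is written in Python against raw WordprocessingML with the standard library, not against the Open XML SDK types used by DocxParser, and the names walk_paragraph, normalize_inline, and SAMPLE are hypothetical. It makes three assumptions: every <w:br/> is treated as a line break (page and column breaks included), only runs of spaces are collapsed so that tabs keep their role as field separators, and only direct children of <w:r> are inspected so that tab-stop definitions under <w:pPr> are not mistaken for tab characters.

```python
import re
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _tag(local):
    """Return the namespace-qualified tag name used by ElementTree."""
    return f"{{{W}}}{local}"


# Characters emitted for inline elements, per the acceptance criteria.
INLINE_REPLACEMENTS = {
    _tag("br"): "\n",  # assumption: all break types become a newline
    _tag("tab"): "\t",
    _tag("softHyphen"): "-",
    _tag("noBreakHyphen"): "-",
}


def walk_paragraph(para):
    """Walk each run in document order, keeping text and inline markers."""
    parts = []
    for run in para.iter(_tag("r")):
        # Direct children only: skips run properties and the <w:tab/>
        # tab-stop definitions that live under <w:pPr><w:tabs>.
        for node in run:
            if node.tag == _tag("t"):
                parts.append(node.text or "")
            elif node.tag in INLINE_REPLACEMENTS:
                parts.append(INLINE_REPLACEMENTS[node.tag])
    return "".join(parts)


def normalize_inline(text):
    """Collapse runs of spaces within each visual line; keep \\n and \\t."""
    lines = text.split("\n")
    return "\n".join(re.sub(r" +", " ", line).strip(" ") for line in lines)


# Hypothetical sample: signature block with a break, a tab, and a hyphen.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:pPr><w:tabs><w:tab w:val="left" w:pos="720"/></w:tabs></w:pPr>'
    '<w:r><w:t>Jane   Doe</w:t><w:br/><w:t>Director</w:t></w:r>'
    '<w:r><w:tab/><w:t>co</w:t><w:noBreakHyphen/><w:t>founder</w:t></w:r>'
    '</w:p>'
)

if __name__ == "__main__":
    raw = walk_paragraph(ET.fromstring(SAMPLE))
    print(repr(raw))
    print(repr(normalize_inline(raw)))
```

Normalizing line by line keeps the intentional \n boundaries intact, so sentence-boundary heuristics in selectors can still see where one visual line ends and the next begins.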