Fix: python-docx reader now honors Word heading styles #1
Merged
The [docx] (python-docx) path concatenated paragraph text and dropped Heading1-9/Title styles + w:numPr numbering, so installing the [docx] extra produced an EMPTY clause map on heading-styled Word contracts — worse than the no-extra stdlib reader, which already honored them. Both readers now share one emitter (_emit_docx_paragraph) that turns heading-styled / auto-numbered paragraphs into `## headings` (run-in body split off) and fully-bold lines into `**...**`, so the two paths agree. The stdlib reader is refactored onto the same helper with no behavior change (golden fixtures unchanged). Tests: test_emit_docx_paragraph (CI-safe unit test) and test_docx_readers_agree_on_clause_map (asserts python-docx == stdlib clause map on heading_docx.docx; skips without [docx]). No output-schema change. mypy --strict clean; 132 passed / 1 skipped.
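To make the shared-emitter idea concrete, here is a minimal sketch of what such a helper might look like. It is not the code from this PR: the function name `emit_paragraph`, its parameters, and the rule that a run-in body starts after the first ". " are all assumptions chosen for illustration.

```python
import re
from typing import List

# Hypothetical matcher for "Title" or "Heading1".."Heading9".
# An optional space is allowed because some readers report "Heading 1".
_HEADING_STYLE = re.compile(r"^(Title|Heading\s?[1-9])$")


def emit_paragraph(text: str, style: str = "", numbered: bool = False,
                   all_bold: bool = False) -> List[str]:
    """Return the markdown lines for one Word paragraph (illustrative only)."""
    text = text.strip()
    if not text:
        return []
    if _HEADING_STYLE.match(style) or numbered:
        # Assumption: a run-in body begins after the first ". " in the text.
        head, sep, body = text.partition(". ")
        lines = ["## " + head]
        if sep and body:
            lines.append(body)  # body goes on the line after the heading
        return lines
    if all_bold:
        return ["**" + text + "**"]
    return [text]


if __name__ == "__main__":
    for line in emit_paragraph("Term. This Agreement lasts two years.",
                               numbered=True):
        print(line)
```

Because both readers would only need to supply `text`, `style`, `numbered`, and `all_bold`, routing them through one function like this is what keeps their output identical.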
The bug
`extract`'s `[docx]` (python-docx) reader dropped Word heading styles. It concatenated `para.text` and applied no structure, so `Heading1-9`/`Title` styles and `w:numPr` auto-numbering were lost: installing the `[docx]` extra produced an empty clause map on heading-styled Word contracts. Ironically, the no-extra stdlib reader was better (it already maps heading styles → `## headings`).

Found while wiring a real `.docx` demo into the contract-ops playground: the same NDA gave a 6-clause map via the stdlib reader and 0 clauses via the `[docx]` path.

The fix
Both readers now share one emitter, `_emit_docx_paragraph`, that turns heading-styled / auto-numbered paragraphs into `## headings` (run-in body split onto the next line) and fully-bold lines into `**...**`. The python-docx branch reads style + `w:numPr` off the underlying element (`para._p`) and routes through it; the stdlib reader is refactored onto the same helper with no behavior change.

Verification
- `mypy --strict` clean.
- 132 passed / 1 skipped (`[pdf]` extra absent). The golden fixtures (stdlib reader) are unchanged, proving the stdlib refactor is behavior-equivalent.
- `test_emit_docx_paragraph`: CI-safe unit test of the shared emitter.
- `test_docx_readers_agree_on_clause_map`: asserts the python-docx and stdlib readers produce the same clause map on `heading_docx.docx` (this test fails before the fix). Uses `pytest.importorskip("docx")`.

Notes
- Recorded under a `## [Unreleased]` changelog entry to avoid colliding with the just-released 0.1.9; fold it into the next release (0.1.10).
- The `[dev]` extra doesn't install `python-docx`, so the new agreement test skips in CI. Adding `python-docx` to `[dev]` would exercise the fixed path in CI. Happy to do that here if you want.

🤖 Generated with Claude Code
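For reference, the two signals this PR relies on (the paragraph style and `w:numPr`) live in each paragraph's `w:pPr` element in WordprocessingML. The sketch below shows one way to detect them with the standard library only, using a tiny inline XML sample instead of a real `.docx`. The names `scan_paragraphs` and `SAMPLE` are hypothetical and this is not the reader code from the PR.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written stand-in for word/document.xml (assumption: real files
# carry far more markup, but the relevant elements look like this).
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Confidentiality</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>Term</w:t></w:r></w:p>
<w:p><w:r><w:t>Plain body text.</w:t></w:r></w:p>
</w:body></w:document>"""


def scan_paragraphs(xml_text: str) -> Iterator[Tuple[str, bool, str]]:
    """Yield (style id, has w:numPr, text) for each w:p element."""
    root = ET.fromstring(xml_text)
    for p in root.iter(f"{{{W}}}p"):
        style_el = p.find("w:pPr/w:pStyle", NS)
        style = style_el.get(f"{{{W}}}val", "") if style_el is not None else ""
        numbered = p.find("w:pPr/w:numPr", NS) is not None
        # Join every w:t run so text split across runs is kept whole.
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        yield style, numbered, text


if __name__ == "__main__":
    for row in scan_paragraphs(SAMPLE):
        print(row)
```

A reader that only joins paragraph text never looks at `w:pPr`, which is how the heading and numbering information can be lost.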