Fix: python-docx reader now honors Word heading styles by DrBaher · Pull Request #1 · DrBaher/extract-cli

DrBaher · 2026-05-22T11:14:21Z

The bug

extract's [docx] (python-docx) reader dropped Word heading styles. It concatenated para.text without applying any structure, so Heading1-9/Title styles and w:numPr auto-numbering were lost, and installing the [docx] extra produced an empty clause map on heading-styled Word contracts. Ironically, the no-extra stdlib reader did better: it already mapped heading styles to ## headings.

Found while wiring a real .docx demo into the contract-ops playground: the same NDA gave a 6-clause map via the stdlib reader and 0 clauses via the [docx] path.
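
To illustrate why concatenating paragraph text loses this information: in WordprocessingML the heading style and auto-numbering live in the paragraph properties (w:pPr), not in the text runs. The following is a minimal stdlib sketch, not the project's reader; the SAMPLE document and the describe helper are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# WordprocessingML main namespace (the "w:" prefix in word/document.xml).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical three-paragraph body: a styled heading, an auto-numbered
# paragraph, and an ordinary body paragraph.
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Confidentiality</w:t></w:r></w:p>
<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr><w:r><w:t>Term</w:t></w:r></w:p>
<w:p><w:r><w:t>Each party keeps it secret.</w:t></w:r></w:p>
</w:body></w:document>"""


def describe(xml: str) -> List[Tuple[str, Optional[str], bool]]:
    """Return (text, style id, has w:numPr) for every paragraph."""
    root = ET.fromstring(xml)
    out: List[Tuple[str, Optional[str], bool]] = []
    for p in root.iter(f"{{{W}}}p"):
        # The visible text is the concatenation of all w:t runs.
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        # The style id and numbering sit under w:pPr, separate from the text.
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        numbered = p.find("w:pPr/w:numPr", NS) is not None
        out.append((text, style_id, numbered))
    return out


for row in describe(SAMPLE):
    print(row)
```

A reader that keeps only the first element of each tuple cannot tell the heading from the body paragraph, which is consistent with the empty clause map described above.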

The fix

Both readers now share one emitter, _emit_docx_paragraph, that turns heading-styled / auto-numbered paragraphs into ## headings (run-in body split onto the next line) and fully-bold lines into **...**. The python-docx branch reads style + w:numPr off the underlying element (para._p) and routes through it; the stdlib reader is refactored onto the same helper with no behavior change.
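
The shape of such a shared emitter can be sketched as follows. This is an illustrative approximation under stated assumptions, not the actual _emit_docx_paragraph: the function name emit_paragraph, its parameters, and the run-in heuristic (split after the first word that is followed by a period and whitespace) are all hypothetical.

```python
import re
from typing import List

# Assumed heuristic: the heading ends at the first word followed by ". ".
_RUN_IN = re.compile(r"^(.*?[A-Za-z]{2,})\.\s+(\S.*)$")
# Accept both style ids ("Heading1") and style names ("Heading 1"), plus Title.
_HEADING_STYLE = re.compile(r"^(Heading\s?[1-9]|Title)$", re.IGNORECASE)


def emit_paragraph(text: str, style: str = "", numbered: bool = False,
                   all_bold: bool = False) -> List[str]:
    """Return the Markdown lines for one Word paragraph (illustrative only)."""
    text = text.strip()
    if not text:
        return []
    if _HEADING_STYLE.match(style) or numbered:
        match = _RUN_IN.match(text)
        if match:
            # Run-in body: heading on one line, body text on the next.
            return ["## " + match.group(1), match.group(2)]
        return ["## " + text]
    if all_bold:
        return ["**" + text + "**"]
    return [text]


print(emit_paragraph("Confidentiality. Each party keeps it secret.", style="Heading1"))
print(emit_paragraph("Term", numbered=True))
print(emit_paragraph("IMPORTANT NOTICE", all_bold=True))
```

Because the emitter takes plain values (text, style, numbering flag, bold flag), either reader can feed it once it has pulled those values from its own paragraph representation, which is what lets the two paths agree.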

Verification

mypy --strict clean.
132 passed / 1 skipped (the skip is due to the [pdf] extra being absent). The golden fixtures (stdlib reader) are unchanged, which indicates the stdlib refactor is behavior-equivalent.
New tests:
- test_emit_docx_paragraph — CI-safe unit test of the shared emitter.
- test_docx_readers_agree_on_clause_map — asserts the python-docx and stdlib readers produce the same clause map on heading_docx.docx (this test fails before the fix). Uses pytest.importorskip("docx").

Notes

No output-schema change.
I intentionally did not bump the version (left a ## [Unreleased] changelog entry) to avoid colliding with the just-released 0.1.9 — fold it into the next release (0.1.10).
Optional follow-up: CI's [dev] extra doesn't install python-docx, so the new agreement test skips in CI. Adding python-docx to [dev] would exercise the fixed path in CI. Happy to do that here if you want.

🤖 Generated with Claude Code

The [docx] (python-docx) path concatenated paragraph text and dropped Heading1-9/Title styles + w:numPr numbering, so installing the [docx] extra produced an EMPTY clause map on heading-styled Word contracts. That was worse than the no-extra stdlib reader, which already honored them.

Both readers now share one emitter (_emit_docx_paragraph) that turns heading-styled / auto-numbered paragraphs into `## headings` (run-in body split off) and fully-bold lines into `**...**`, so the two paths agree. The stdlib reader is refactored onto the same helper with no behavior change (golden fixtures unchanged).

Tests: test_emit_docx_paragraph (CI-safe unit test) and test_docx_readers_agree_on_clause_map (asserts python-docx == stdlib clause map on heading_docx.docx; skips without [docx]).

No output-schema change. mypy --strict clean; 132 passed / 1 skipped.

DrBaher merged commit 0021a8f into main May 22, 2026
10 checks passed

DrBaher deleted the claude/docx-heading-fix branch May 22, 2026 11:19

DrBaher merged 1 commit into main from claude/docx-heading-fix
