openspec: text-extraction-office-completeness scaffolding #1590

Open
rjzondervan wants to merge 1 commit into development from feat/text-extraction-office-completeness

@rjzondervan
Member

Summary

  • Scaffolds the openspec change text-extraction-office-completeness — a single OfficeDocumentWalker driving both extraction and anonymisation over the same content surface.
  • Extends the current two-level-deep walker to traverse tables (recursively), lists (ListItemRun), headers and footers (per section, per variant), footnotes, endnotes, and text frames.
  • Adds ODT as a first-class supported format alongside DOCX on both paths.
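
To make the "single walker for both paths" idea concrete, here is a minimal, language-neutral sketch in Python. It is not the proposed OfficeDocumentWalker (that class is PHP and does not exist yet); `TextRun`, `Container`, `walk`, `extract_text`, and `replace_text` are hypothetical names chosen for illustration. The point it shows is that one recursive traversal feeds both the read path and the mutate path, so the two cannot disagree about which content is covered.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Union


@dataclass
class TextRun:
    """Leaf node holding a piece of text (hypothetical model)."""
    text: str


@dataclass
class Container:
    """Any structure that holds children: table, cell, header, footnote, ..."""
    kind: str
    children: List[Union["Container", TextRun]] = field(default_factory=list)


def walk(node: Union[Container, TextRun]) -> Iterator[TextRun]:
    """Yield every text run under node, depth-first, at any nesting depth."""
    if isinstance(node, TextRun):
        yield node
        return
    for child in node.children:
        yield from walk(child)


def extract_text(root: Container) -> str:
    """Read path: join the text of every run the walker reaches."""
    return "\n".join(run.text for run in walk(root))


def replace_text(root: Container, transform: Callable[[str], str]) -> None:
    """Mutate path: rewrite exactly the runs the read path would see."""
    for run in walk(root):
        run.text = transform(run.text)
```

Because both functions consume `walk`, a structure the walker learns to traverse (say, a table nested inside a table cell) becomes visible to extraction and anonymisation at the same time.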

What's in this PR

Pure openspec scaffolding. Four artifacts:

  • proposal.md — gap analysis (the two-level depth, missing structures, ODT-unsupported reality), pipeline shape, capability scope
  • design.md — 10 decisions: single walker class for both read+mutate (D1), element visitation rules (D2), output ordering with section markers (D3), ODT integration via PhpWord's ODText reader/writer (D4), strtr longest-match-first semantics (D5), ADR-005 logging (D6), backwards-compat-for-DOCX-consumers (D7), forceReExtract opt-in for stale records (D8), extractWord → extractOfficeDocument rename with one-cycle alias (D9), fixture strategy (D10)
  • specs/text-extraction-office-completeness/spec.md — 9 ADDED Requirements with scenarios: walker coverage, extractText shape, replace mutation, ODT extraction, ODT writer-back, DOCX entity substitution in all walker-covered structures, ADR-005 logging, pre-change DOCX superset guarantee, BLOCKING reopen-clean validation gate
  • tasks.md — 10 sections from the walker class through TextExtractionService + DocumentProcessingHandler refactors, ODT-specific writer-back path, fixtures, unit + integration tests, the manual Word/LibreOffice validation gate (BLOCKING), docs, and quality gates

openspec validate text-extraction-office-completeness — clean.
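
For reviewers less familiar with the strtr behaviour referenced in D5: when PHP's strtr is given a replacement map, it tries the longest key first at each position and never rescans text it has already replaced. The Python sketch below imitates that behaviour with a regex alternation. `replace_longest_first` is a hypothetical helper for illustration, not code from design.md, and it is case-sensitive like strtr.

```python
import re
from typing import Dict


def replace_longest_first(text: str, replacements: Dict[str, str]) -> str:
    """Single-pass replacement that prefers the longest matching key."""
    # Empty keys would match at every position, so drop them.
    keys = [k for k in replacements if k]
    if not keys:
        return text
    # Sort longest first: a regex alternation tries alternatives in order,
    # so "Jan Jansen" is attempted before "Jan" at the same position.
    keys.sort(key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # re.sub scans left to right once; replaced output is never re-matched.
    return pattern.sub(lambda m: replacements[m.group(0)], text)
```

With the map `{"Jan": "[P1]", "Jan Jansen": "[P2]"}`, the input `"Jan Jansen en Jan"` becomes `"[P2] en [P1]"`: the full name wins where it matches, and the shorter key only applies elsewhere. A naive sequential replace that handled `"Jan"` first would leave `"[P1] Jansen"` behind, leaking the surname.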

Why now

Two specific bugs surfaced in recent operator testing on Woo dossiers:

  1. Entity detection misses names in document headers, footers, and footnotes (case numbers and behandelaar identifiers escape redaction).
  2. ODT inputs hit a replaceWordsInTextDocument path that does str_ireplace on the binary ZIP container — corrupts the file. Operators see "Anonymisation succeeded" then can't open the result.

This change closes both. Per design D7, DOCX extraction is strictly additive — no regression for existing consumers; ODT goes from broken to working.
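
To illustrate why bug 2 corrupts files: an ODT is a ZIP archive, and its text lives in XML entries such as content.xml, so any replacement has to happen inside the unpacked entry rather than on the raw container bytes. The Python sketch below shows that shape only. It is not the D4 approach (which goes through PhpWord's ODText reader/writer); `replace_in_odt` is a hypothetical helper, and the plain string replace is an assumption that ignores names split across XML elements and any text stored outside content.xml.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def replace_in_odt(odt_bytes: bytes, old: str, new: str) -> bytes:
    """Replace text inside content.xml and repackage the archive."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as src, \
            zipfile.ZipFile(out, "w") as dst:
        # infolist() preserves entry order, so "mimetype" stays first.
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "content.xml":
                xml = data.decode("utf-8")
                # Text in XML is entity-escaped, so escape both sides.
                xml = xml.replace(escape(old), escape(new))
                data = xml.encode("utf-8")
            # The mimetype entry is written uncompressed (stored).
            compress = (zipfile.ZIP_STORED if info.filename == "mimetype"
                        else zipfile.ZIP_DEFLATED)
            dst.writestr(info.filename, data, compress_type=compress)
    return out.getvalue()
```

Editing the decompressed entry and letting the ZIP library rebuild the archive keeps sizes, checksums, and the central directory consistent, which a byte-level replace on the container cannot guarantee.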

Composition

Pairs with sister office-document-sanitization (separate PR #1589). Sanitiser strips wrappers; walker traverses surviving content. Both target DOCX + ODT. Can land independently or together.

Status

Ready for team review. No implementation has started — /opsx:apply runs only after this PR merges.

Test plan

  • Review the 9 ADDED Requirements for completeness vs. the actual gap
  • Confirm output ordering decisions (D3) make sense for downstream NER context
  • Confirm extractWord deprecation window (D9) — one cycle, or longer?
  • Confirm capability name text-extraction-office-completeness is acceptable

Scaffold the openspec change for a deeper walker over .docx and .odt
documents, covering tables (recursive), lists, headers, footers, footnotes,
endnotes, and text frames on both extraction and anonymisation paths. Adds
ODT as a first-class supported format alongside DOCX.

Pairs with office-document-sanitization (sister change): sanitiser strips
identity-bearing wrappers; walker traverses surviving content. Walker is
coverage-expansion; sanitiser is correctness-fix.

Pure openspec — proposal, design, capability spec (9 ADDED Requirements),
tasks (10 sections). No implementation in this PR; awaiting team review
before /opsx:apply.
@github-actions
Contributor

Quality Report — ConductionNL/openregister @ 579196d

Check PHP Vue Security License Tests
lint
phpcs
phpmd
psalm
phpstan
phpmetrics
eslint
stylelint
composer ✅ 162/162
npm ✅ 602/602
PHPUnit
Newman ⏭️
Playwright ⏭️

Quality workflow — 2026-05-19 09:29 UTC

Download the full PDF report from the workflow artifacts.
