openspec: office-document-sanitization scaffolding #1589

Open
rjzondervan wants to merge 1 commit into development from feat/office-document-sanitization

@rjzondervan
Member

Summary

  • Scaffolds the openspec change office-document-sanitization — XML-level sanitiser that runs ahead of entity anonymisation on .docx and .odt inputs.
  • Strips identity-bearing structures (comments, tracked changes, revision history, document metadata, person-identifying field codes, custom XML data bindings); flattens hyperlinks.
  • Original file preserved (sanitiser operates on a temp copy); produces a SanitizationReport value object for audit, persisted on AnonymizationLog.sanitization (new JSON column).
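
A minimal sketch of the temp-copy pipeline described above, in Python purely for illustration (this PR contains no implementation). It copies a .docx into a working directory, rewrites identity-bearing core metadata with the D5 sentinel, and returns a report that could be serialised into a JSON column. The function name `sanitize_metadata`, the field list `IDENTITY_FIELDS`, and the report fields are assumptions made for this example; the real report shape is the one specified in design.md (D9).

```python
import json
import shutil
import tempfile
import zipfile
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass, field
from pathlib import Path

SENTINEL = "DocuDesk Anonymisation"  # sentinel value proposed in D5
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "dcmitype": "http://purl.org/dc/dcmitype/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)  # keep the usual prefixes when serialising

# Assumption: which metadata fields count as identity-bearing is decided in design.md.
IDENTITY_FIELDS = ("dc:creator", "cp:lastModifiedBy")


@dataclass
class SanitizationReport:
    """Hypothetical report shape, for illustration only."""
    source_name: str
    rewritten_fields: list = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def sanitize_metadata(original: Path, workdir: Path):
    """Copy `original` into `workdir`, rewrite the copy, return (copy_path, report)."""
    work_copy = workdir / original.name
    shutil.copy2(original, work_copy)  # the original file is never opened for writing
    report = SanitizationReport(source_name=original.name)
    rewritten = workdir / (original.name + ".tmp")
    with zipfile.ZipFile(work_copy) as src, \
            zipfile.ZipFile(rewritten, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "docProps/core.xml":
                root = ET.fromstring(data)
                for name in IDENTITY_FIELDS:
                    for node in root.findall(name, NS):
                        node.text = SENTINEL
                        report.rewritten_fields.append(name)
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            dst.writestr(item, data)  # every other part is copied through unchanged
    rewritten.replace(work_copy)
    return work_copy, report
```

A caller would pass a directory from `tempfile.TemporaryDirectory()` as `workdir` and store `report.to_json()` alongside the anonymisation log entry. Comments, tracked changes, and custom XML need additional handling that this sketch leaves out.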

What's in this PR

Pure openspec scaffolding — no implementation code yet. Four artifacts:

  • proposal.md — why / what changes / capability / impact
  • design.md — 11 decisions including XML-surgery rationale (D1), per-format strategy (D2), tracked-change accept-all semantics (D3), custom XML strip rationale (D4), metadata sentinel "DocuDesk Anonymisation" (D5), person-field-code handling (D6), hyperlink flatten (D7), temp-file pipeline (D8), report shape (D9), and the BLOCKING Word + LibreOffice reopen validation gate (D10)
  • specs/office-document-sanitization/spec.md — 11 ADDED Requirements with scenarios covering all the above + ADR-005 logging compliance and encrypted-document handling
  • tasks.md — 15 sections from value objects through DI wiring, fixtures, unit + integration tests, the manual validation gate (BLOCKING), docs, and spec maintenance
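
To make the accept-all semantics (D3) and hyperlink flattening (D7) concrete, here is a hedged sketch for the DOCX side only, again in Python for illustration. It assumes "accept all" means dropping `w:del` content and unwrapping `w:ins`, and that "flatten" means unwrapping `w:hyperlink` so only its runs remain. The function name and the counter keys are hypothetical; moves, formatting revisions, and paragraph-mark revisions are deliberately not handled.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)

DEL = f"{{{W}}}del"
INS = f"{{{W}}}ins"
HYPERLINK = f"{{{W}}}hyperlink"


def _unwrap(parent, child):
    """Replace `child` with its own children, keeping their order and position."""
    index = list(parent).index(child)
    parent.remove(child)
    for offset, grandchild in enumerate(list(child)):
        parent.insert(index + offset, grandchild)


def _process(parent, counts):
    for child in list(parent):
        if child.tag == DEL:
            # Accepting a deletion: the deleted content and its author attribute go away.
            parent.remove(child)
            counts["deletions"] += 1
            continue
        _process(child, counts)  # clean descendants before unwrapping this node
        if child.tag == INS:
            # Accepting an insertion: keep the runs, drop the wrapper carrying the author.
            _unwrap(parent, child)
            counts["insertions"] += 1
        elif child.tag == HYPERLINK:
            # Flattening: keep the visible runs, drop the link wrapper.
            _unwrap(parent, child)
            counts["hyperlinks"] += 1


def accept_all_and_flatten(document_xml: str):
    """Return (cleaned_xml, counts) for the content of a word/document.xml part."""
    root = ET.fromstring(document_xml)
    counts = {"deletions": 0, "insertions": 0, "hyperlinks": 0}
    _process(root, counts)
    return ET.tostring(root, encoding="unicode"), counts
```

The returned counts are the kind of figure a SanitizationReport could carry. Whether the output still opens cleanly is exactly what the Word + LibreOffice reopen gate (D10) is meant to check.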

openspec validate office-document-sanitization — clean.

Why now

This is part of the "anonimiseren bij de bron" (anonymise at the source) arc. The current DocumentProcessingHandler anonymises text in body paragraphs but leaves comments, tracked-change authors, document metadata, and custom XML data bindings untouched, and each of these can carry PII that survives the operation. This change closes that gap before the deeper walker (sister change) extends content coverage.
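
One way to see the gap on a given file is to survey a .docx package for the structures listed above. The sketch below is illustrative Python, not part of this PR; `survey_docx` is a hypothetical helper, and the part names are the ones Word conventionally writes, which is an assumption since a package may name its parts differently.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Assumption: conventional part names; the authoritative mapping lives in the
# package relationships, which this sketch does not read.
SUSPECT_PARTS = {
    "word/comments.xml": "comments",
    "docProps/core.xml": "document metadata",
    "docProps/app.xml": "document metadata",
}
SUSPECT_PREFIXES = {"customXml/": "custom XML"}


def survey_docx(path):
    """Return (parts, authors): suspect part names by label, and revision author names."""
    parts = {}
    authors = set()
    with zipfile.ZipFile(path) as package:
        names = package.namelist()
        for name in names:
            label = SUSPECT_PARTS.get(name)
            if label is None:
                for prefix, prefix_label in SUSPECT_PREFIXES.items():
                    if name.startswith(prefix):
                        label = prefix_label
            if label is not None:
                parts.setdefault(label, []).append(name)
        if "word/document.xml" in names:
            root = ET.fromstring(package.read("word/document.xml"))
            for element in root.iter():
                # Tracked-change elements carry the author as a w:author attribute.
                author = element.get(f"{{{W}}}author")
                if author:
                    authors.add(author)
    return parts, sorted(authors)
```

Running this on a file before and after sanitisation gives a quick, if coarse, indication of what identity-bearing material remains.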

Composition

This change pairs with the sister change text-extraction-office-completeness (separate PR). The sanitiser strips wrappers; the walker traverses content. Both target DOCX and ODT. The sanitiser is a correctness fix; the walker is a coverage expansion. They can land independently.

Status

Ready for team review. No implementation work has started — /opsx:apply runs only after this PR merges.

Test plan

  • Review artifact set for completeness vs. the discovered scope
  • Validate design.md decisions D1-D10 align with team conventions
  • Confirm capability name office-document-sanitization is acceptable
  • Confirm sentinel value "DocuDesk Anonymisation" (D5) is acceptable, or propose alternative

Scaffold the openspec change for an XML-level sanitiser that runs ahead of the
entity-anonymisation walker on .docx and .odt inputs. Sanitiser strips
identity-bearing structures (comments, tracked changes, revision history,
document metadata, person-identifying field codes, custom XML data bindings)
and flattens hyperlinks. Operates on a temp copy so the original file is
preserved; produces a SanitizationReport for audit.

Pure openspec — proposal, design, capability spec (11 ADDED Requirements),
tasks (15 sections). No implementation in this PR; awaiting team review
before /opsx:apply.
@github-actions
Contributor

Quality Report — ConductionNL/openregister @ 19fd74b

Check PHP Vue Security License Tests
lint
phpcs
phpmd
psalm
phpstan
phpmetrics
eslint
stylelint
composer ✅ 162/162
npm ✅ 602/602
PHPUnit
Newman ⏭️
Playwright ⏭️

Quality workflow — 2026-05-19 09:25 UTC

Download the full PDF report from the workflow artifacts.
