fix(gdpval): normalize python-docx ns0 namespacing before LibreOffice convert #1270

Merged
bxyu-nvidia merged 1 commit into main from agronskiy/gdpval/fix-ns0-package-namespacing
May 8, 2026

@agronskiy
Contributor

Summary

Some .docx files in the GDPVal corpus are emitted by python-docx (or similar lxml-based tools) and serialize the OPC package XML with an explicit ns0: namespace prefix:

<ns0:Relationships xmlns:ns0="http://schemas.openxmlformats.org/...">
<ns0:Types        xmlns:ns0="http://schemas.openxmlformats.org/...">
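For illustration only, the same kind of prefix can be reproduced with Python's standard-library `xml.etree.ElementTree`, which assigns generated prefixes such as `ns0` to namespaces that were never registered with a preferred prefix. Whether the corpus files were produced this way is an assumption; the namespace URI below is a placeholder, not the real OPC URI.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (an assumption for this sketch, not the OPC one).
NS = "http://example.invalid/content-types"


def serialize_without_registered_prefix() -> str:
    """Build a namespaced tree and serialize it without registering a prefix."""
    root = ET.Element(f"{{{NS}}}Types")
    ET.SubElement(root, f"{{{NS}}}Default", {"Extension": "xml"})
    # No ET.register_namespace() call was made for NS, so the serializer
    # has to invent a prefix for it.
    return ET.tostring(root, encoding="unicode")


serialized = serialize_without_registered_prefix()
print(serialized)
```

The printed root element carries an explicit `ns0:` prefix and an `xmlns:ns0` declaration, which is the shape shown above.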

These are valid OOXML (Microsoft Word and pandoc accept them), but LibreOffice 24.2 fails to load them and reports only the generic Error: source file could not be loaded. In the gdpval resources server this surfaces as pdf miss events when preconverting task input documents.

The prefixing appears in BOTH _rels/.rels AND [Content_Types].xml. Rewriting only one of them is not enough — both must be normalized for LibreOffice to load the file.
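A minimal sketch of what rewriting a prefixed part to default-namespace form might look like. This is not the PR's `_rewrite_ns0_namespace`; the function names and the placeholder URI are hypothetical. It does a text-level rewrite, which assumes the prefix is literally `ns0` and only appears on element names and the namespace declaration.

```python
import re
import xml.etree.ElementTree as ET


def rewrite_ns0_prefix(xml_text: str) -> str:
    """Turn `<ns0:Tag xmlns:ns0="...">` into `<Tag xmlns="...">` (hypothetical helper)."""
    # The prefixed declaration becomes a default-namespace declaration.
    xml_text = xml_text.replace("xmlns:ns0=", "xmlns=")
    # Drop the prefix from opening and closing element tags.
    return re.sub(r"<(/?)ns0:", r"<\1", xml_text)


def qualified_tags(xml_text: str) -> list[str]:
    """Namespace-qualified tag names, used to check the rewrite kept the meaning."""
    return [el.tag for el in ET.fromstring(xml_text).iter()]


# Placeholder URI; the real parts use the OPC package namespaces.
prefixed = (
    '<ns0:Relationships xmlns:ns0="http://example.invalid/rels">'
    '<ns0:Relationship Id="rId1" Target="word/document.xml"/>'
    "</ns0:Relationships>"
)
normalized = rewrite_ns0_prefix(prefixed)

# Both forms parse to the same qualified element names.
assert qualified_tags(normalized) == qualified_tags(prefixed)
```

Unprefixed attributes such as `Id` and `Target` are in no namespace in either form, so the rewrite changes only the spelling, not the meaning, of the part. The same helper would be applied to both `_rels/.rels` and `[Content_Types].xml`.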

convert_to_pdf now detects the ns0: prefix and, when present, writes a namespace-normalized copy of the package to a tempdir and runs LibreOffice on that copy. The original file on disk is left untouched and the output PDF still lands next to the original.
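The flow described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: every function name here is hypothetical, the parts are assumed to be UTF-8, and `soffice` is assumed to be on `PATH`.

```python
import re
import subprocess
import tempfile
import zipfile
from pathlib import Path

# The two package parts that the PR description says must both be normalized.
PACKAGE_PARTS = ("_rels/.rels", "[Content_Types].xml")


def _strip_ns0(xml_text: str) -> str:
    """Text-level rewrite to default-namespace form (hypothetical helper)."""
    xml_text = xml_text.replace("xmlns:ns0=", "xmlns=")
    return re.sub(r"<(/?)ns0:", r"<\1", xml_text)


def has_ns0_prefix(src: Path) -> bool:
    """True if either package part contains an `ns0:`-prefixed element."""
    with zipfile.ZipFile(src) as zf:
        names = set(zf.namelist())
        return any(
            part in names and b"<ns0:" in zf.read(part) for part in PACKAGE_PARTS
        )


def write_normalized_copy(src: Path, dest_dir: Path) -> Path:
    """Copy the package into dest_dir, rewriting only the two package parts."""
    dest = dest_dir / src.name  # same file name, so the PDF name is unchanged
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(
        dest, "w", zipfile.ZIP_DEFLATED
    ) as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename in PACKAGE_PARTS:
                # Assumes UTF-8 parts; other encodings would need extra handling.
                data = _strip_ns0(data.decode("utf-8")).encode("utf-8")
            zout.writestr(info.filename, data)
    return dest


def convert_to_pdf_sketch(src: Path) -> None:
    """Run LibreOffice on a normalized copy when needed; src is never modified."""
    with tempfile.TemporaryDirectory() as tmp:
        source = write_normalized_copy(src, Path(tmp)) if has_ns0_prefix(src) else src
        subprocess.run(
            ["soffice", "--headless", "--convert-to", "pdf",
             "--outdir", str(src.parent), str(source)],
            check=True,
        )
```

Keeping the original file name for the temporary copy and passing the original's directory as `--outdir` is what lets the PDF land next to the original in this sketch.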

Empirical impact

Reproduced and verified on the actual vadams Kimi corpus on lustre. Across the 46 source files in 27 affected GDPVal tasks:

| Failure mode | Files | Resolved by this PR |
| --- | --- | --- |
| ns0: prefix in package XML | 43 | Yes |
| ns0: prefix + malformed <Relationship> chain | 2 | No (separate cause) |
| Missing rels target parts (one .pptx) | 1 | No (out of scope) |

So this PR resolves 43/46 (~93%) of the failing source files. The remaining 3 are outside the scope of this fix.

Test plan

  • Unit tests for _rewrite_ns0_namespace, _ooxml_has_ns0_prefix, _normalize_ooxml_zip, and convert_to_pdf selecting the normalized copy when ns0: is present (21 tests pass)
  • End-to-end: pulled Compensation Model Ideas.docx from lustre (a known-failing file), ran convert_to_pdf — produced a valid PDF with the message "converted Compensation Model Ideas.docx (after ns0 normalization)"
  • Lint + format clean (ruff check, ruff format --check)

Notes

  • Out of scope: the 2 files with malformed <Relationship> chains in word/_rels/document.xml.rels (caused by an upstream URL-redaction step that left invalid XML) and the 1 .pptx with missing rels targets. These need separate handling.

🤖 Generated with Claude Code

… convert

Some `.docx` files in the GDPVal corpus are emitted by python-docx
(or similar lxml-based tools) which serialize the OPC package XML
with an explicit `ns0:` namespace prefix in both `_rels/.rels` and
`[Content_Types].xml`:

    <ns0:Relationships xmlns:ns0="http://schemas.openxmlformats.org/...">
    <ns0:Types        xmlns:ns0="http://schemas.openxmlformats.org/...">

These are valid OOXML — Microsoft Word and pandoc accept them — but
LibreOffice 24.2 rejects them with `Error: source file could not be
loaded`, which surfaces in the gdpval resources server as `pdf miss`
events. The prefixing must be rewritten in BOTH parts; touching only
`_rels/.rels` is not sufficient.

`convert_to_pdf` now detects the ns0 prefix and, when present, copies
the file to a tempdir with the package XML rewritten to default-
namespace form before invoking LibreOffice. The original file is left
untouched, and the output PDF still lands next to the original.

Empirically, this resolves 43 of the 46 failing source files across
the 27 affected GDPVal task UUIDs. The remaining 3 files have separate
issues (malformed `<Relationship>` chains, missing rels targets) and
are out of scope for this change.

Signed-off-by: Alex Gronskiy <agronskiy@nvidia.com>
@bxyu-nvidia merged commit ed190cd into main on May 8, 2026
6 checks passed
@bxyu-nvidia deleted the agronskiy/gdpval/fix-ns0-package-namespacing branch on May 8, 2026 16:05