fix(gdpval): normalize python-docx ns0 namespacing before LibreOffice convert #1270
Merged
Some `.docx` files in the GDPVal corpus are emitted by python-docx
(or similar lxml-based tools) which serialize the OPC package XML
with an explicit `ns0:` namespace prefix in both `_rels/.rels` and
`[Content_Types].xml`:
<ns0:Relationships xmlns:ns0="http://schemas.openxmlformats.org/...">
<ns0:Types xmlns:ns0="http://schemas.openxmlformats.org/...">
These are valid OOXML — Microsoft Word and pandoc accept them — but
LibreOffice 24.2 rejects them with `Error: source file could not be
loaded`, which surfaces in the gdpval resources server as `pdf miss`
events. The prefixing must be rewritten in BOTH parts; touching only
`_rels/.rels` is not sufficient.
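To make the rewrite concrete, here is a minimal sketch of turning the `ns0:` form into the default-namespace form for a single XML part. It is not the code from this change: `strip_ns0_prefix` is a hypothetical name, the namespace URI is a placeholder, and the byte-level replace assumes `ns0` is the only prefix bound to that namespace and that no attribute carries the prefix (which holds for the two root elements shown above).

```python
import xml.etree.ElementTree as ET


def strip_ns0_prefix(xml: bytes) -> bytes:
    """Rewrite `ns0:`-prefixed elements to default-namespace form.

    Hypothetical helper for illustration only. Assumes attributes are
    unprefixed, since a default namespace does not apply to attributes.
    """
    if b"xmlns:ns0=" not in xml:
        return xml  # nothing to normalize
    return (
        xml.replace(b"xmlns:ns0=", b"xmlns=")  # prefixed declaration -> default
        .replace(b"</ns0:", b"</")             # closing tags
        .replace(b"<ns0:", b"<")               # opening and self-closing tags
    )


# Placeholder URI; the real package namespaces are elided in the text above.
before = (
    b'<ns0:Relationships xmlns:ns0="urn:example:rels">'
    b'<ns0:Relationship Id="rId1"/>'
    b"</ns0:Relationships>"
)
after = strip_ns0_prefix(before)

# Both forms parse to the same namespace-qualified element name,
# which is why the prefixed form is still valid XML.
assert ET.fromstring(before).tag == ET.fromstring(after).tag
```

The same function would be applied to both `_rels/.rels` and `[Content_Types].xml`, since normalizing only one of them leaves the file unloadable.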
`convert_to_pdf` now detects the ns0 prefix and, when present, copies
the file to a tempdir with the package XML rewritten to default-
namespace form before invoking LibreOffice. The original file is left
untouched, and the output PDF still lands next to the original.
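The detect-then-copy flow described above could look roughly like the following. This is a sketch under assumptions, not the merged implementation: `has_ns0_prefix`, `write_normalized_copy`, and `_strip_ns0` are hypothetical names, and detection is reduced to a byte search for `xmlns:ns0=` in the two package parts.

```python
import zipfile
from pathlib import Path

# The two package parts named in the description above.
PACKAGE_PARTS = ("_rels/.rels", "[Content_Types].xml")


def _strip_ns0(xml: bytes) -> bytes:
    """Hypothetical helper: replace the ns0: prefix with a default namespace."""
    return (
        xml.replace(b"xmlns:ns0=", b"xmlns=")
        .replace(b"</ns0:", b"</")
        .replace(b"<ns0:", b"<")
    )


def has_ns0_prefix(path: Path) -> bool:
    """Return True if either package part declares an ns0 prefix."""
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
        return any(
            part in names and b"xmlns:ns0=" in zf.read(part)
            for part in PACKAGE_PARTS
        )


def write_normalized_copy(src: Path, workdir: Path) -> Path:
    """Write a copy of `src` into `workdir` with both package parts rewritten.

    The source file is only read, never modified.
    """
    dst = workdir / src.name
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(
        dst, "w", zipfile.ZIP_DEFLATED
    ) as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename in PACKAGE_PARTS:
                data = _strip_ns0(data)
            zout.writestr(info.filename, data)
    return dst
```

A caller would create the working directory with `tempfile.TemporaryDirectory()`, pass the normalized copy to LibreOffice only when `has_ns0_prefix` returns true, and point the converter's output directory at the original file's parent so the PDF lands next to the original.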
Empirically, this resolves 43 of the 46 failing source files across
the 27 affected GDPVal task UUIDs. The remaining 3 files have separate
issues (malformed `<Relationship>` chains, missing rels targets) and
are out of scope for this change.
Signed-off-by: Alex Gronskiy <agronskiy@nvidia.com>
bxyu-nvidia
approved these changes
May 8, 2026
Summary
Some `.docx` files in the GDPVal corpus are emitted by `python-docx` (or similar `lxml`-based tools) and serialize the OPC package XML with an explicit `ns0:` namespace prefix on the root elements (`<ns0:Relationships>`, `<ns0:Types>`).
These are valid OOXML (Microsoft Word and pandoc accept them), but LibreOffice 24.2 rejects them silently with `Error: source file could not be loaded`. In the gdpval resources server this surfaces as `pdf miss` events when preconverting task input documents.
The prefixing appears in BOTH `_rels/.rels` AND `[Content_Types].xml`. Rewriting only one of them is not enough; both must be normalized for LibreOffice to load the file.
`convert_to_pdf` now detects the `ns0:` prefix and, when present, writes a namespace-normalized copy of the package to a tempdir and runs LibreOffice on that copy. The original file on disk is left untouched and the output PDF still lands next to the original.
Empirical impact
Reproduced and verified on the actual vadams Kimi corpus on lustre. Across the 46 source files in 27 affected GDPVal tasks:
- `ns0:` prefix in package XML
- `ns0:` prefix + malformed `<Relationship>` chain
- … (`.pptx`)
So this PR resolves 43/46 (~93%) of the failing source files. The remaining 3 are outside the scope of this fix.
Test plan
- … `_rewrite_ns0_namespace`, `_ooxml_has_ns0_prefix`, `_normalize_ooxml_zip`, and `convert_to_pdf` selecting the normalized copy when `ns0:` is present (21 tests pass)
- … `Compensation Model Ideas.docx` from lustre (a known-failing file), ran `convert_to_pdf`; produced a valid PDF with the message "converted Compensation Model Ideas.docx (after ns0 normalization)"
- … (`ruff check`, `ruff format --check`)
Notes
- … `<Relationship>` chains in `word/_rels/document.xml.rels` (caused by an upstream URL-redaction step that left invalid XML) and the 1 `.pptx` with missing rels targets. These need separate handling.
🤖 Generated with Claude Code