feat(python,rust): implement normalize_text, extract_canonical_text, canonicalize_claims #2
Merged
Adds README scaffolds for two new language bindings of the HTMLTrust canonicalization library. Both MUST produce byte-identical output to the existing JS, Go, and PHP implementations for every test vector in a shared conformance suite (TBD). Python uses stdlib unicodedata plus beautifulsoup4/lxml for the extract_canonical_text HTML parser. Rust uses unicode-normalization plus scraper/html5ever. Implementation to follow in separate commits; test vectors must be centralized before implementation to enforce cross-language parity. See TODO-Cleanup.md at the umbrella project root for implementation task tracking.
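The extraction pass described above (strip non-content tags, insert a space at block-element boundaries, then normalize) could be sketched as follows. This is a hypothetical stand-in, not the library's code: it uses the stdlib `html.parser` instead of beautifulsoup4/lxml, a truncated block-tag list, and a one-phase normalizer in place of the full pipeline; all names are illustrative.

```python
from html.parser import HTMLParser
import unicodedata

STRIP_TAGS = {"script", "style", "head", "noscript"}  # containers whose text is dropped
VOID_TAGS = {"meta", "link"}                          # no text content; nothing to track
BLOCK_TAGS = {"p", "div", "article", "section",
              "h1", "h2", "h3", "h4", "h5", "h6"}     # subset of the real block list

class _Extractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts, self.skip_depth = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in STRIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")  # single space at each block boundary

    def handle_endtag(self, tag):
        if tag in STRIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def extract_text_sketch(html: str) -> str:
    p = _Extractor()
    p.feed(html)
    # Stand-in for the full normalization pipeline: NFC + whitespace collapse.
    text = unicodedata.normalize("NFC", "".join(p.parts))
    return " ".join(text.split())
```

For example, `extract_text_sketch("<head><style>p{}</style></head><h1>Title</h1><p>Hello <b>world</b></p>")` yields `"Title Hello world"`: the head/style content is dropped, the inline `<b>` run is preserved, and block boundaries become single spaces.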
feat(python,rust): implement normalize_text, extract_canonical_text, canonicalize_claims

Promotes the Python and Rust scaffolds to working bindings of the HTMLTrust canonicalization library, with byte-identical output to the existing JavaScript / Go / PHP implementations on the 18-case shared normalization conformance suite.

Python (`htmltrust_canonicalization/`):
- `pyproject.toml` (Python >= 3.10, beautifulsoup4 >= 4.12, pytest dev)
- `_normalize.py`: `normalize_text()` -- 8 phases mirroring the JS reference byte-for-byte. Source is pure ASCII with codepoint sets built programmatically via `chr()`/range pairs to survive any editor and to stay auditable without Unicode-aware tooling.
- `_extract.py`: `extract_canonical_text()` -- bs4-based HTML walk; strips script/style/meta/link/head/noscript, inserts space at block-element boundaries (h1-h6, p, div, article, etc.), preserves inline runs, applies the full `normalize_text()` pipeline.
- `_claims.py`: `canonicalize_claims()` -- normalize names + values, sort lexically by name, join with newlines as `name=value`.
- `tests/`: 39 tests including all 18 normalization vectors from the JS test suite, plus extract / claims contract coverage.

Rust (`rust/`):
- `Cargo.toml` (edition 2021, MSRV 1.74, unicode-normalization + scraper + ego-tree).
- `src/lib.rs`: same three functions, same logic as Python/JS. Codepoint sets defined as `const &[(u32, u32)]` ranges and `&[u32]` points; the whitespace pass uses a manual run-collapsing loop to avoid pulling in a regex dependency.
- `tests/conformance.rs`: same 18 normalization vectors plus extract / claims parity tests (14 tests total). `cargo clippy` clean.

Out of scope for this PR: signature verification (`verify_signature`) and keyid resolvers (did:web, direct URL, trust directory) -- those are P1 in TODO-Cleanup and will follow once the JavaScript surface area for verification + resolvers lands on main (currently on the open feat/protocol-conformance PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
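The `chr()`/range technique for keeping the source pure ASCII might look like the sketch below. The set contents are an example subset chosen for illustration, not the real phase tables from `_normalize.py`:

```python
# Codepoint sets built programmatically from (start, end) range pairs and single
# points, so no literal non-ASCII character ever appears in the source file and
# the tables stay auditable without Unicode-aware tooling.
ZERO_WIDTH_RANGES = [(0x200B, 0x200D)]  # ZWSP, ZWNJ, ZWJ (example subset)
ZERO_WIDTH_POINTS = [0xFEFF]            # zero-width no-break space (BOM)

def build_set(ranges, points):
    chars = set()
    for lo, hi in ranges:
        chars.update(chr(cp) for cp in range(lo, hi + 1))
    chars.update(chr(cp) for cp in points)
    return chars

ZERO_WIDTH = build_set(ZERO_WIDTH_RANGES, ZERO_WIDTH_POINTS)

def strip_zero_width(text: str) -> str:
    # One illustrative phase: drop zero-width characters from the input.
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)
```

Because every member is derived from an integer literal, a diff reviewer can audit the tables against the Unicode charts by codepoint value alone.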
Summary
Promotes the Python and Rust scaffolds in this repo to working bindings of the HTMLTrust canonicalization library, with byte-identical output to the existing JavaScript / Go / PHP implementations.
Python (`python/`) now exposes:

- `normalize_text(text, preserve_whitespace=False)` -- the 8-phase pipeline, ported from `javascript/index.js` byte-for-byte. Source is pure ASCII; codepoint sets are built programmatically from `chr()` / range pairs so the file survives any editor and is auditable without Unicode-aware tooling.
- `extract_canonical_text(html, preserve_whitespace=False)` -- BeautifulSoup-based DOM walk; strips `script`/`style`/`meta`/`link`/`head`/`noscript`, inserts a single space at every block-element boundary, runs the full normalization pipeline.
- `canonicalize_claims(claims)` -- normalize names + values, sort lexically, join as `name=value` lines.
- 39 tests (`pytest`), including all 18 normalization vectors from `javascript/test.js`.

Rust (`rust/`) now exposes the same three functions with byte-identical output:

- `normalize_text(text: &str, preserve_whitespace: bool) -> String`
- `extract_canonical_text(html: &str) -> String` (parses with `scraper` / html5ever)
- `canonicalize_claims(claims: &BTreeMap<String, String>) -> String`
- 14 tests (`cargo test`); `cargo clippy --all-targets -- -D warnings` clean.

Out of scope (deferred to follow-up PR)
Signature verification (`verify_signature`) and the three keyid resolvers (did:web, direct URL, trust directory) are P1 in `TODO-Cleanup.md` and will land in a follow-up once the JavaScript surface area for those (currently on the open `feat/protocol-conformance` PR) merges to `main`. Keeping this PR scoped to the canonicalization-only contract matches the existing main-branch JS surface area.

Test plan
- `cd python && pip install -e '.[dev]' && pytest` -- 39 passed locally.
- `cd rust && cargo test` -- 14 passed locally.
- `cd rust && cargo clippy --all-targets -- -D warnings` -- clean locally.
- Re-ran the JS reference suite (`node javascript/test.js`) to confirm byte parity.

🤖 Generated with Claude Code
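For reference, the `canonicalize_claims()` contract exercised by the parity tests (normalize names and values, sort lexically by name, join as `name=value` lines) can be sketched in a few lines. The normalizer here is a deliberately simplified stand-in, not the real 8-phase `normalize_text()`; all names are illustrative:

```python
import unicodedata

def _norm_stub(s: str) -> str:
    # Stand-in normalizer: NFC plus whitespace collapse only.
    return " ".join(unicodedata.normalize("NFC", s).split())

def canonicalize_claims_sketch(claims: dict) -> str:
    # Normalize names and values, sort lexically by name, emit name=value lines.
    pairs = sorted((_norm_stub(k), _norm_stub(v)) for k, v in claims.items())
    return "\n".join(f"{k}={v}" for k, v in pairs)
```

For example, `canonicalize_claims_sketch({"issuer": "Example  Org", "date": "2024-01-01"})` produces the two lines `date=2024-01-01` and `issuer=Example Org` in that order, which is the deterministic byte string the cross-language suite compares.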