feat(python,rust): implement normalize_text, extract_canonical_text, canonicalize_claims by jt55401 · Pull Request #2 · HTMLTrust/htmltrust-canonicalization

jt55401 · 2026-04-29T04:31:21Z

Summary

Promotes the Python and Rust scaffolds in this repo to working bindings of the HTMLTrust canonicalization library, with byte-identical output to the existing JavaScript / Go / PHP implementations.

Python (python/) now exposes:

normalize_text(text, preserve_whitespace=False) -- the 8-phase pipeline, ported from javascript/index.js byte-for-byte. Source is pure ASCII; codepoint sets are built programmatically from chr() / range pairs so the file survives any editor and is auditable without Unicode-aware tooling.
extract_canonical_text(html, preserve_whitespace=False) -- BeautifulSoup-based DOM walk; strips script/style/meta/link/head/noscript, inserts a single space at every block-element boundary, runs the full normalization pipeline.
canonicalize_claims(claims) -- normalize names + values, sort lexically, join as name=value lines.
39 tests pass (pytest), including all 18 normalization vectors from javascript/test.js.
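To illustrate the extraction behavior described above, here is a minimal sketch. It is not the code in this PR: it swaps BeautifulSoup for the standard library's `html.parser` so it runs without dependencies. The names `sketch_extract_text` and `_TextWalker` are hypothetical, the block-tag set is an assumed subset, and the final whitespace collapse is only a stand-in for the full `normalize_text` pipeline.

```python
from html.parser import HTMLParser

# Tags whose content is dropped, per the list in the PR description.
SKIPPED = {"script", "style", "meta", "link", "head", "noscript"}
# meta and link are void elements (no end tag), so they never open a subtree.
VOID_SKIPPED = {"meta", "link"}
# Assumed subset of block-level tags; the real list may be longer.
BLOCK = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "div", "article"}


class _TextWalker(HTMLParser):
    """Collects visible text, emitting a space at block-element boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED and tag not in VOID_SKIPPED:
            self.skip_depth += 1
        elif tag in BLOCK and self.skip_depth == 0:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIPPED and tag not in VOID_SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK and self.skip_depth == 0:
            self.parts.append(" ")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def sketch_extract_text(html: str) -> str:
    walker = _TextWalker()
    walker.feed(html)
    walker.close()
    # Stand-in for normalize_text: collapse whitespace runs and trim.
    return " ".join("".join(walker.parts).split())
```

Without the boundary space, `<p>a</p><p>b</p>` would collapse to `ab`; with it, the two paragraphs stay separate words, while inline tags such as `<b>` do not split a word.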

Rust (rust/) now exposes the same three functions with byte-identical output:

normalize_text(text: &str, preserve_whitespace: bool) -> String
extract_canonical_text(html: &str) -> String (parses with scraper / html5ever)
canonicalize_claims(claims: &BTreeMap<String, String>) -> String
14 tests pass (cargo test), cargo clippy --all-targets -- -D warnings clean.
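The claims contract shared by both bindings (normalize names and values, sort by name, join as `name=value` lines) can be sketched in Python as follows. This is an illustration under stated assumptions, not the PR's code: `_normalize_stub` is a hypothetical stand-in for `normalize_text` (NFKC plus whitespace collapse only), and whether the real output ends with a trailing newline is not stated in this PR, so the sketch emits none.

```python
import unicodedata


def _normalize_stub(text: str) -> str:
    # Hypothetical stand-in for normalize_text: NFKC, then collapse whitespace.
    return " ".join(unicodedata.normalize("NFKC", text).split())


def sketch_canonicalize_claims(claims: dict[str, str]) -> str:
    # Normalize both names and values first, then sort by the normalized name.
    normalized = {_normalize_stub(k): _normalize_stub(v) for k, v in claims.items()}
    # sorted() on str compares by code point, which matches the byte order of
    # UTF-8 keys in a Rust BTreeMap<String, String>.
    return "\n".join(f"{name}={normalized[name]}" for name in sorted(normalized))
```

For example, `sketch_canonicalize_claims({"b": " two ", "a": "one"})` yields the two lines `a=one` and `b=two`, regardless of insertion order.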

Out of scope (deferred to follow-up PR)

Signature verification (verify_signature) and the three keyid resolvers (did:web, direct URL, trust directory) are P1 in TODO-Cleanup.md and will land in a follow-up once the JavaScript surface area for those (currently on the open feat/protocol-conformance PR) merges to main. Keeping this PR scoped to the canonicalization-only contract matches the existing main-branch JS surface area.

Test plan

cd python && pip install -e '.[dev]' && pytest -- 39 passed locally.
cd rust && cargo test -- 14 passed locally.
cd rust && cargo clippy --all-targets -- -D warnings -- clean locally.
Spot-check a few normalization cases against the JS reference (node javascript/test.js) to confirm byte parity.
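For the parity spot-check, comparing encoded bytes rather than rendered strings is what catches differences that look identical on screen. A small helper along these lines could be used; the name `first_byte_mismatch` is hypothetical and is not part of this repo.

```python
from typing import Optional


def first_byte_mismatch(expected: str, actual: str) -> Optional[int]:
    """Return the offset of the first differing UTF-8 byte, or None if equal."""
    a = expected.encode("utf-8")
    b = actual.encode("utf-8")
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        # One output is a strict prefix of the other.
        return min(len(a), len(b))
    return None


# Precomposed vs. decomposed "e with acute" render the same but differ in bytes.
assert first_byte_mismatch("caf\u00e9", "cafe\u0301") == 3
assert first_byte_mismatch("same", "same") is None
```

Feeding it the output of two implementations for the same input gives a byte offset to inspect when they diverge.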

🤖 Generated with Claude Code

Adds README scaffolds for two new language bindings of the HTMLTrust canonicalization library. Both MUST produce byte-identical output to the existing JS, Go, and PHP implementations for every test vector in a shared conformance suite (TBD). Python uses stdlib unicodedata plus beautifulsoup4/lxml for the extract_canonical_text HTML parser. Rust uses unicode-normalization plus scraper/html5ever. Implementation to follow in separate commits; test vectors must be centralized before implementation to enforce cross-language parity. See TODO-Cleanup.md at the umbrella project root for implementation task tracking.

…canonicalize_claims

Promotes the Python and Rust scaffolds to working bindings of the HTMLTrust canonicalization library, with byte-identical output to the existing JavaScript / Go / PHP implementations on the 18-case shared normalization conformance suite.

Python (htmltrust_canonicalization/):
- pyproject.toml (Python >= 3.10, beautifulsoup4 >= 4.12, pytest dev)
- _normalize.py: normalize_text() -- 8 phases mirroring the JS reference byte-for-byte. Source is pure ASCII with codepoint sets built programmatically via chr()/range pairs to survive any editor and to stay auditable without Unicode-aware tooling.
- _extract.py: extract_canonical_text() -- bs4-based HTML walk; strips script/style/meta/link/head/noscript, inserts space at block-element boundaries (h1-h6, p, div, article, etc.), preserves inline runs, applies the full normalize_text() pipeline.
- _claims.py: canonicalize_claims() -- normalize names + values, sort lexically by name, join with newlines as name=value.
- tests/: 39 tests including all 18 normalization vectors from the JS test suite, plus extract / claims contract coverage.

Rust (rust/):
- Cargo.toml (edition 2021, MSRV 1.74, unicode-normalization + scraper + ego-tree).
- src/lib.rs: same three functions, same logic as Python/JS. Codepoint sets defined as const &[(u32, u32)] ranges and &[u32] points; the whitespace pass uses a manual run-collapsing loop to avoid pulling in a regex dependency.
- tests/conformance.rs: same 18 normalization vectors plus extract / claims parity tests (14 tests total). cargo clippy clean.

Out of scope for this PR: signature verification (verify_signature) and keyid resolvers (did:web, direct URL, trust directory) -- those are P1 in TODO-Cleanup and will follow once the JavaScript surface area for verification + resolvers lands on main (currently on the open feat/protocol-conformance PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
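Two techniques named in this commit message, codepoint sets built from range pairs and a regex-free run-collapsing whitespace pass, can be sketched together in Python. The ranges and points below are assumptions chosen for illustration, not the library's actual tables, and `collapse_whitespace` is a hypothetical name; the real pipeline has eight phases and this shows only one idea from it.

```python
# ASCII-only source: every non-ASCII codepoint is written as a hex number.
# Hypothetical tables for illustration; the library's real sets may differ.
_SPACE_RANGES = [(0x2000, 0x200A)]  # EN QUAD through HAIR SPACE
_SPACE_POINTS = [0x0009, 0x000A, 0x000D, 0x0020, 0x00A0, 0x3000]


def _build_set(ranges, points):
    """Expand inclusive (lo, hi) pairs and single points into a set of chars."""
    chars = {chr(cp) for lo, hi in ranges for cp in range(lo, hi + 1)}
    chars.update(chr(cp) for cp in points)
    return frozenset(chars)


WHITESPACE = _build_set(_SPACE_RANGES, _SPACE_POINTS)


def collapse_whitespace(text: str) -> str:
    """Collapse each whitespace run to one space and trim both ends, no regex."""
    out = []
    in_run = False
    for ch in text:
        if ch in WHITESPACE:
            in_run = True
            continue
        # Emit a single separator only between two non-whitespace chars.
        if in_run and out:
            out.append(" ")
        in_run = False
        out.append(ch)
    return "".join(out)
```

Because a pending run is only flushed when a later non-whitespace character arrives, leading and trailing runs are dropped without a separate trim step.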

jt55401 and others added 2 commits April 10, 2026 22:21

jt55401 merged commit 6cea236 into main May 13, 2026
3 checks passed

jt55401 merged 2 commits into main from feat/python-rust-extract-canonical-text
