feat(python,rust): implement normalize_text, extract_canonical_text, canonicalize_claims #2

Merged
jt55401 merged 2 commits into main from feat/python-rust-extract-canonical-text
May 13, 2026

@jt55401 commented Apr 29, 2026

Summary

Promotes the Python and Rust scaffolds in this repo to working bindings of the HTMLTrust canonicalization library, with byte-identical output to the existing JavaScript / Go / PHP implementations.

Python (python/) now exposes:

  • normalize_text(text, preserve_whitespace=False) -- the 8-phase pipeline, ported from javascript/index.js byte-for-byte. Source is pure ASCII; codepoint sets are built programmatically from chr() / range pairs so the file survives any editor and is auditable without Unicode-aware tooling.
  • extract_canonical_text(html, preserve_whitespace=False) -- BeautifulSoup-based DOM walk; strips script/style/meta/link/head/noscript, inserts a single space at every block-element boundary, runs the full normalization pipeline.
  • canonicalize_claims(claims) -- normalize names + values, sort lexically by name, join with newlines as name=value lines.
  • 39 tests pass (pytest), including all 18 normalization vectors from javascript/test.js.
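
The block-boundary behaviour described for extract_canonical_text can be illustrated with a minimal, stdlib-only sketch. It deliberately swaps in `html.parser` for BeautifulSoup so that it runs without dependencies. The tag sets, the `extract_text_sketch` name, and the final whitespace collapse (standing in for the full normalization pipeline) are assumptions for illustration, not the code in this PR.

```python
from html.parser import HTMLParser

# Assumed tag sets for illustration; the lists in the PR may differ.
SKIPPED = {"script", "style", "head", "noscript"}
BLOCK = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "div", "article",
         "section", "li", "ul", "ol", "blockquote"}


class _TextWalker(HTMLParser):
    """Collects text, dropping skipped subtrees and spacing out blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in BLOCK and self.skip_depth == 0:
            self.parts.append(" ")  # boundary before the block

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK and self.skip_depth == 0:
            self.parts.append(" ")  # boundary after the block

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def extract_text_sketch(html):
    walker = _TextWalker()
    walker.feed(html)
    walker.close()
    # Stand-in for normalize_text(): collapse whitespace runs and trim.
    return " ".join("".join(walker.parts).split())
```

meta and link are void elements with no text content, so this sketch drops them simply by never emitting anything for them. Unlike a real HTML5 parser, `html.parser` does not infer omitted end tags, so an unclosed head would swallow the rest of the document here.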

Rust (rust/) now exposes the same three functions with byte-identical output:

  • normalize_text(text: &str, preserve_whitespace: bool) -> String
  • extract_canonical_text(html: &str) -> String (parses with scraper / html5ever)
  • canonicalize_claims(claims: &BTreeMap<String, String>) -> String
  • 14 tests pass (cargo test), cargo clippy --all-targets -- -D warnings clean.
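
The claims contract shared by both bindings (normalize, sort by name, join as name=value lines) might look like the following minimal Python sketch. The `_normalize_stub` helper is a hypothetical stand-in (NFKC plus whitespace collapse), not the 8-phase pipeline, and sorting after normalization rather than before is an assumption.

```python
import unicodedata


def _normalize_stub(text):
    # Hypothetical stand-in for normalize_text(); NOT the 8-phase pipeline.
    return " ".join(unicodedata.normalize("NFKC", text).split())


def canonicalize_claims_sketch(claims, normalize=_normalize_stub):
    # Normalize both names and values first.
    pairs = [(normalize(name), normalize(value))
             for name, value in claims.items()]
    # Python compares str by code point, which gives a lexical order by name.
    pairs.sort(key=lambda pair: pair[0])
    # One name=value line per claim, joined with newlines.
    return "\n".join(f"{name}={value}" for name, value in pairs)
```

Passing the normalizer as a parameter keeps the sketch independent of the real pipeline; in the actual package the call would presumably go to normalize_text directly.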

Out of scope (deferred to follow-up PR)

Signature verification (verify_signature) and the three keyid resolvers (did:web, direct URL, trust directory) are P1 in TODO-Cleanup.md. They will land in a follow-up once the JavaScript surface area for them (currently on the open feat/protocol-conformance PR) merges to main. Scoping this PR to the canonicalization-only contract matches the existing main-branch JS surface area.

Test plan

  • cd python && pip install -e '.[dev]' && pytest -- 39 passed locally.
  • cd rust && cargo test -- 14 passed locally.
  • cd rust && cargo clippy --all-targets -- -D warnings -- clean locally.
  • Spot-check a few normalization cases against the JS reference (node javascript/test.js) to confirm byte parity.

🤖 Generated with Claude Code

jt55401 and others added 2 commits April 10, 2026 22:21
Adds README scaffolds for two new language bindings of the HTMLTrust
canonicalization library. Both MUST produce byte-identical output to
the existing JS, Go, and PHP implementations for every test vector in
a shared conformance suite (TBD).

Python uses stdlib unicodedata plus beautifulsoup4/lxml for the
extract_canonical_text HTML parser.

Rust uses unicode-normalization plus scraper/html5ever.

Implementation to follow in separate commits; test vectors must be
centralized before implementation to enforce cross-language parity.

See TODO-Cleanup.md at the umbrella project root for implementation
task tracking.

…canonicalize_claims

Promotes the Python and Rust scaffolds to working bindings of the
HTMLTrust canonicalization library, with byte-identical output to the
existing JavaScript / Go / PHP implementations on the 18-case shared
normalization conformance suite.

Python (htmltrust_canonicalization/):
- pyproject.toml (Python >= 3.10, beautifulsoup4 >= 4.12, pytest dev)
- _normalize.py: normalize_text() -- 8 phases mirroring the JS reference
  byte-for-byte. Source is pure ASCII with codepoint sets built
  programmatically via chr()/range pairs to survive any editor and to
  stay auditable without Unicode-aware tooling.
- _extract.py: extract_canonical_text() -- bs4-based HTML walk; strips
  script/style/meta/link/head/noscript, inserts space at block-element
  boundaries (h1-h6, p, div, article, etc.), preserves inline runs,
  applies the full normalize_text() pipeline.
- _claims.py: canonicalize_claims() -- normalize names + values, sort
  lexically by name, join with newlines as name=value.
- tests/: 39 tests including all 18 normalization vectors from the JS
  test suite, plus extract / claims contract coverage.
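
As an illustration of the "pure ASCII source" approach in _normalize.py, a codepoint set can be built from inclusive range pairs and chr() as sketched below. The specific ranges and the names `_SPACE_RANGES`, `build_codepoint_set`, and `replace_spaces` are hypothetical; the sets used by the PR are not shown on this page.

```python
# Hypothetical example set: a few Unicode space characters, written as
# inclusive (start, end) codepoint ranges so the source stays pure ASCII.
_SPACE_RANGES = [
    (0x00A0, 0x00A0),  # NO-BREAK SPACE
    (0x2000, 0x200A),  # EN QUAD .. HAIR SPACE
    (0x202F, 0x202F),  # NARROW NO-BREAK SPACE
    (0x3000, 0x3000),  # IDEOGRAPHIC SPACE
]


def build_codepoint_set(ranges):
    # Expand each inclusive range into individual characters.
    return frozenset(
        chr(cp) for start, end in ranges for cp in range(start, end + 1)
    )


SPACE_CHARS = build_codepoint_set(_SPACE_RANGES)


def replace_spaces(text):
    # Map every character in the set to a plain ASCII space.
    return "".join(" " if ch in SPACE_CHARS else ch for ch in text)
```

Because no literal non-ASCII character appears in the file, the set can be reviewed in any editor by reading the hex values against the Unicode charts.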

Rust (rust/):
- Cargo.toml (edition 2021, MSRV 1.74, unicode-normalization +
  scraper + ego-tree).
- src/lib.rs: same three functions, same logic as Python/JS. Codepoint
  sets defined as const &[(u32, u32)] ranges and &[u32] points; the
  whitespace pass uses a manual run-collapsing loop to avoid pulling in
  a regex dependency.
- tests/conformance.rs: same 18 normalization vectors plus extract /
  claims parity tests (14 tests total). cargo clippy clean.
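
The regex-free whitespace pass mentioned for src/lib.rs can be pictured as a single loop that tracks whether it is inside a whitespace run. The sketch below is written in Python for consistency with the other examples on this page, not in Rust. The set of characters treated as whitespace and the trimming of leading and trailing runs are assumptions.

```python
def collapse_whitespace_runs(text):
    out = []
    in_run = False  # True while consuming consecutive whitespace
    for ch in text:
        if ch in " \t\n\r\f\v":  # assumed whitespace set
            in_run = True
            continue
        # Emit one space for the finished run, but never at the start.
        if in_run and out:
            out.append(" ")
        in_run = False
        out.append(ch)
    # A trailing run is never emitted, so the result is trimmed.
    return "".join(out)
```

The loop makes one pass and needs no lookahead, which is why the same shape ports directly to a Rust `for ch in text.chars()` loop pushing onto a String.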

Out of scope for this PR: signature verification (verify_signature) and
keyid resolvers (did:web, direct URL, trust directory) -- those are P1
in TODO-Cleanup and will follow once the JavaScript surface area for
verification + resolvers lands on main (currently on the open
feat/protocol-conformance PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jt55401 jt55401 merged commit 6cea236 into main May 13, 2026
3 checks passed