
contract(unicharset): direction + mirror leaf — byte-parity 112/112, varied-field surface complete #633

Merged
AdaWorldAPI merged 1 commit into main from claude/happy-hamilton-0azlw4
Jul 2, 2026


@AdaWorldAPI


The sixth UNICHARSET leaf: direction + mirror. These are the first columns read past the bbox+stats CSV, which proves the CSV skip and completes the varied-field surface of the character set.

What ships (one commit, cc76557-rebased)

  • UniCharSet::{get_direction, get_mirror} + dump_direction/dump_mirror, backed by directions/mirrors: Vec<i32> parsed by continuing the per-line token walk (the bbox+stats group is ONE whitespace token, so columns land at fixed offsets across all 5 of tesseract's istringstream fallback tiers — no bespoke tier detector).
  • Faithful C++ semantics: direction load default U_LEFT_TO_RIGHT (0) vs out-of-range U_OTHER_NEUTRAL (10) — two distinct defaults for two distinct conditions (unicharset.h:712); mirror clamped like other_case, out-of-range → INVALID_UNICHAR_ID (unicharset.h:721).
  • examples/unicharset_dump.rs gains direction|mirror modes (reproduces the parity diffs).
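
A minimal sketch of the load and getter semantics listed above, not the code in `unicharset.rs`. The `DirMirror` struct and `push_from_walk` are hypothetical names for illustration. Two things here are assumptions rather than facts from this PR: an absent mirror column falls back to the entry's own id, and an unparsable token is treated the same as an absent one.

```rust
// Constants named in the description (ICU UCharDirection codes and the invalid id).
const U_LEFT_TO_RIGHT: i32 = 0;
const U_OTHER_NEUTRAL: i32 = 10;
const INVALID_UNICHAR_ID: i32 = -1;

/// Hypothetical stand-in for the two new per-entry columns.
#[derive(Debug, Default)]
struct DirMirror {
    directions: Vec<i32>,
    mirrors: Vec<i32>,
}

impl DirMirror {
    /// Continue a per-line token walk that has already consumed `other_case`.
    /// `id` is the entry being loaded, `size` the declared entry count.
    fn push_from_walk<'a>(
        &mut self,
        tokens: &mut impl Iterator<Item = &'a str>,
        id: i32,
        size: i32,
    ) {
        // Exhausted walk (a tier without the column): load default is LTR.
        // Assumption: an unparsable token is handled like an absent one.
        let direction = tokens
            .next()
            .and_then(|t| t.parse::<i32>().ok())
            .unwrap_or(U_LEFT_TO_RIGHT);
        // Clamped like other_case: a value >= size maps to self.
        // Assumption: an absent mirror column also maps to self.
        let mirror = tokens
            .next()
            .and_then(|t| t.parse::<i32>().ok())
            .filter(|&m| m < size)
            .unwrap_or(id);
        self.directions.push(direction);
        self.mirrors.push(mirror);
    }

    /// Out-of-range ids return U_OTHER_NEUTRAL, which differs from the load default.
    fn get_direction(&self, id: i32) -> i32 {
        if id < 0 {
            return U_OTHER_NEUTRAL;
        }
        self.directions.get(id as usize).copied().unwrap_or(U_OTHER_NEUTRAL)
    }

    /// Out-of-range ids return INVALID_UNICHAR_ID.
    fn get_mirror(&self, id: i32) -> i32 {
        if id < 0 {
            return INVALID_UNICHAR_ID;
        }
        self.mirrors.get(id as usize).copied().unwrap_or(INVALID_UNICHAR_ID)
    }
}
```

The point of keeping the two direction defaults apart is visible in the getters: an entry whose line lacks the column reads back as `0`, while an id that is not in the set at all reads back as `10`.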

Proof

Byte-identical 112/112 each vs tesseract's own get_direction/get_mirror on real eng.lstm-unicharset, via the self-validating oracle (bijection half re-proves the 5.5.0-header/5.3.4-lib layout before the new field is trusted). Direction is genuinely varied on eng (55× LTR, 33× OTHER_NEUTRAL, plus 2/3/4/6 codes); mirror has 10 real bracket/paren pairs — the parse is exercised, not just defaults.

With this, every UNICHARSET field that varies on real eng data is transcoded and parity-proven (E-CPP-PARITY-1..6). bbox/stats/normed are deferred with reason (uniform on LSTM data = weak falsifier; gated on a legacy unicharset).
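
For readers who want to see what an N/N parity figure amounts to, here is a hedged sketch of a dump comparison. It is not the self-validating oracle used in this PR; `Parity` and `compare_dumps` are hypothetical names, and the sketch assumes each dump holds one value per line in unichar-id order.

```rust
/// Outcome of comparing two per-id dumps.
#[derive(Debug, PartialEq)]
struct Parity {
    matched: usize,
    total: usize,
    first_mismatch: Option<usize>,
    byte_identical: bool,
}

/// Compare our dump against the reference dump, line by line.
/// Assumption: line `i` of each dump holds the value for unichar id `i`.
fn compare_dumps(ours: &str, oracle: &str) -> Parity {
    let a: Vec<&str> = ours.lines().collect();
    let b: Vec<&str> = oracle.lines().collect();
    // A missing line on either side counts as a mismatch.
    let total = a.len().max(b.len());
    let mut matched = 0;
    let mut first_mismatch = None;
    for i in 0..total {
        if a.get(i).is_some() && a.get(i) == b.get(i) {
            matched += 1;
        } else if first_mismatch.is_none() {
            first_mismatch = Some(i);
        }
    }
    Parity {
        matched,
        total,
        first_mismatch,
        // Stricter than the line check: also catches line-ending differences.
        byte_identical: ours == oracle,
    }
}
```

Under these assumptions, a result with `matched == total` and `byte_identical == true` on the 112-entry eng set would correspond to the 112/112 claim, and `first_mismatch` gives the id to inspect when it does not.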

Robustness

This commit has survived six rebases across ~240 commits of main churn (V3 substrate, CanonHigh flip, #624#632) with parity re-verified green each time. Current base: post-#632, 795 contract lib tests, clippy -D warnings + fmt clean on touched files.

Board hygiene in-commit: EPIPHANIES E-CPP-PARITY-6, LATEST_STATE branch-work + D-UNICHARSET-DIR-MIRROR, TECH_DEBT TD-CONTRACT-NOT-FMT-GATED.

Companion: tesseract-rs PR (consumer wiring + awareness artifacts) — merge this one FIRST; its CI builds against lance-graph main.

🤖 Generated with Claude Code

https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1



…arity 112/112

Add get_direction + get_mirror + dump_direction + dump_mirror to UniCharSet,
backed by directions: Vec<i32> + mirrors: Vec<i32>. These are the two columns
after other_case; the bbox+stats group is a single whitespace token, so the
columns land at fixed offsets across all 5 of tesseract's istringstream tiers
(unicharset.cpp:833-868) — the per-line token walk continued one/two positions
past other_case reads them, no bespoke tier detector. A tier without the columns
leaves the walk exhausted -> defaults.

- direction: stored as-is (ICU UCharDirection); load default U_LEFT_TO_RIGHT (0)
  for an absent column; get_direction returns U_OTHER_NEUTRAL (10) out of range
  (unicharset.h:712) -- two distinct "defaults" for two distinct conditions.
- mirror: clamped at load like other_case (>= size -> self); get_mirror returns
  INVALID_UNICHAR_ID (-1) out of range (unicharset.h:721).

Byte-identical 112/112 each vs tesseract's own get_direction/get_mirror on real
eng.lstm-unicharset (self-validating oracle; direction 6 distinct values incl.
55x LTR + 33x OTHER_NEUTRAL, mirror 10 bracket/paren pairs). Sixth leaf of
PROBE-OGAR-ADAPTER-UNICHARSET; first to read past the bbox CSV. Remaining
sub-leaf: the float stats inside the CSV.

- +3 unicharset tests (26 total); my files clippy -D warnings + fmt clean
- examples/unicharset_dump.rs gains direction|mirror modes (reproduce the diffs)
- board: EPIPHANIES E-CPP-PARITY-6; LATEST_STATE branch-work + D-UNICHARSET-DIR-MIRROR;
  TECH_DEBT TD-CONTRACT-NOT-FMT-GATED (contract crate not fmt-checked in CI; the
  rustfmt drift in hhtl/nan_projection/soa_graph is from merged PRs, not this leaf)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
@coderabbitai

coderabbitai Bot commented Jul 2, 2026


Warning

Review limit reached

@AdaWorldAPI, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 17 minutes


Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 63e9fe7e-19aa-4d41-9e5c-f3b6ed3e0d6f

📥 Commits

Reviewing files that changed from the base of the PR and between df36747 and cc76557.

📒 Files selected for processing (5)
  • .claude/board/EPIPHANIES.md
  • .claude/board/LATEST_STATE.md
  • .claude/board/TECH_DEBT.md
  • crates/lance-graph-contract/examples/unicharset_dump.rs
  • crates/lance-graph-contract/src/unicharset.rs


@AdaWorldAPI merged commit 06b5409 into main Jul 2, 2026
6 checks passed
