Skip to content

fix(bind): route site_indices by bucket_key, drop 2 trinity loss-records (7→5)#764

Merged
TSavo merged 1 commit into
mainfrom
fix/bind-classify-bucket-key-collision
May 13, 2026
Merged

fix(bind): route site_indices by bucket_key, drop 2 trinity loss-records (7→5)#764
TSavo merged 1 commit into
mainfrom
fix/bind-classify-bucket-key-collision

Conversation

@TSavo
Copy link
Copy Markdown
Owner

@TSavo TSavo commented May 13, 2026

Summary

Bucket-key collision in cmd_bind.rs's site-attribution loop was zeroing concept:identity and concept:bool-cell. Fixing it drops two below-threshold loss-records from the trinity round-trip verdict (7→5).

Root cause

shape_to_concept: BTreeMap<shape_cid, concept_idx> was built with plain insert(), so when two concepts shared a shape, the LATER one's mapping overwrote the earlier. The second-pass loop then routed every lift with that shape to the wrong concept, leaving the original empty.

Fix

  1. shape_to_concept is now built with entry().or_insert() — first concept to claim a shape wins, deterministic with the cluster loop's insertion order.
  2. Second loop re-derives the same bucket_key priority (annotation > catalog > shape) the first loop used and looks up key_to_concept_idx directly, falling back to shape_to_concept only when no key match exists.

Empirical verification

Before (Rust→Java leg on the trinity fixture):

bind: top concepts:
  concept:pair: 3 site(s)
  [concept:identity, concept:bool-cell absent — 0 sites in gaps.json]

trinity_round_trip: 7 loss entries including
  [below-threshold] concept:identity has 0 site(s)
  [below-threshold] concept:bool-cell has 0 site(s)

After:

bind: top concepts:
  concept:identity: 1 site(s)
  concept:unit: 1 site(s)
  concept:bool-cell: 1 site(s)
  [...10 named concepts, sum = 12 bindings]

trinity_round_trip: 5 loss entries — below-threshold gone.

Tests

Why this matters

The two below-threshold gaps were the only loss-records caused by an actual classifier bug — not by substrate limits. The other five (bind-stub-body-emitted, v0-capability-gap×2, source-language-not-supported, leg-3-not-reached) are genuine deferred-to-v1 limits and remain honestly recorded. Branch 2 LBL stays first-class; Branch 1 (byte-identical exact trinity) is closer to achievable now that 2/7 loss-dims are eliminated.

Tracked as task #153.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Fixed an issue in the bind engine's collision handling for scenarios where multiple distinct concepts share shape characteristics. The binding process now correctly preserves individual concept identity during clustering and assignment, preventing incorrect concept attribution.

Review Change Stack

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

Warning

Rate limit exceeded

@TSavo has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 56 minutes and 5 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7285af33-187a-43ee-8929-6f58144734b5

📥 Commits

Reviewing files that changed from the base of the PR and between 507b0fc and 96e771a.

📒 Files selected for processing (1)
  • implementations/rust/provekit-cli/src/cmd_bind.rs

Walkthrough

The bind engine's concept-to-shape collision handling is fixed in two complementary parts: the shape lookup now preserves the first concept index per shape CID instead of overwriting, and concept resolution during binding recomputes bucket-key priority to match concept creation logic, with fallback to the preserved shape lookup.

Changes

Concept-to-Shape Collision Handling Fix

Layer / File(s) Summary
Preserve first concept index in shape lookup
implementations/rust/provekit-cli/src/cmd_bind.rs
The shape_to_concept map is rebuilt using entry(...).or_insert(...) to preserve the first registered concept index per shape_cid and per shape_cid_alias, preventing overwrites by later concept registrations.
Resolve concept index with bucket key priority
implementations/rust/provekit-cli/src/cmd_bind.rs
During binding, concept_idx is resolved by recomputing the binding's bucket_key with the same priority logic as concept creation (human annotation, catalog match, shape) and looking up via key_to_concept_idx, with fallback to shape_to_concept only when no bucket-key match exists.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 When concepts collide, we hold the first one dear,
Building maps with care to make priorities clear.
Bucket keys guide the path, a matching dance,
Where every binding finds its rightful stance!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly addresses the primary fix in the changeset—routing site indices by bucket_key to resolve concept collision issues—and references the measurable outcome (reducing loss records from 7 to 5).
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/bind-classify-bucket-key-collision

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

The trinity round-trip test consistently emitted two `below-threshold`
gap records reporting that `concept:identity` and `concept:bool-cell`
had 0 sites — despite the fixture containing `wrap_identity` and
`toggle` with the matching `// concept: ...` annotations.

The bug was in cmd_bind.rs's second-pass site-attribution loop:

  let concept_idx = *shape_to_concept.get(&shape_cid).expect(...);

`shape_to_concept` is a BTreeMap<shape_cid, concept_idx> built in a
preceding loop. When two distinct concepts (different bucket_keys,
e.g. different `// concept: X` annotations) happened to share the
same term-shape CID, the later concept's insert *overwrote* the
earlier one's mapping. The second loop then routed every lift with
that shape to the later concept, leaving the earlier one empty.

Two fixes:

1. Build shape_to_concept with `entry().or_insert()` so the FIRST
   concept to claim a shape wins (deterministic; matches the order
   raw_lifts presents to the clustering loop).

2. In the second loop, re-derive the same bucket_key the first loop
   used (annotation > catalog > shape) and look up key_to_concept_idx
   first, falling back to shape_to_concept only when no key match
   exists. This routes each lift to the concept ITS annotation/catalog
   match created, not whichever concept happens to currently own its
   shape_cid in the global map.

Trinity round-trip fixture, before:
  bind: top concepts:
    concept:pair: 3 site(s)
    [concept:identity and concept:bool-cell absent — 0 sites]

  trinity_round_trip verdict: 7 loss entries including
    [below-threshold] concept:identity has 0 site(s)
    [below-threshold] concept:bool-cell has 0 site(s)

After:
  bind: top concepts:
    concept:identity: 1 site(s)
    concept:bool-cell: 1 site(s)
    concept:pair: 1 site(s)
    [...10 named concepts, each 1 site, sum = 12 bindings]

  trinity_round_trip verdict: 5 loss entries; below-threshold gone.

- trinity_roundtrip_test: PASS (5 entries vs 7; honestly LBL on the
  three substrate-level gaps that remain: bind-stub-body-emitted,
  v0-capability-gap×2, source-language-not-supported, leg-3-not-reached).
- cmd_bind_integration: 27/27 PASS.
- All other provekit-cli test groups GREEN (224 passes total).
- Pre-existing polyglot_smoke failure (missing provekit-linkerd
  binary) unrelated and unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@TSavo TSavo force-pushed the fix/bind-classify-bucket-key-collision branch from 507b0fc to 96e771a Compare May 13, 2026 00:16
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
implementations/rust/provekit-cli/src/cmd_bind.rs (1)

605-620: ⚡ Quick win

Deduplicate bucket_key derivation between the two passes.

The bucket_key priority block at Lines 610-616 is a near-verbatim copy of the logic at Lines 522-539 (and the comment at Lines 605-608 acknowledges this is the whole point of the fix). The original collision bug was effectively a "two sites of truth that fell out of sync"; keeping two hand-rolled copies of the priority chain leaves the same trap for the next change (e.g., adding a 4th priority tier, renaming the human: / catalog: / shape: prefixes, or changing how concept_annotation is normalized).

Consider extracting the derivation into a small helper so both passes resolve through the exact same function. As a side benefit, you avoid calling catalog.match_shape() when a concept_annotation is present (currently both loops compute it unconditionally before the if let).

♻️ Sketch of the refactor
+fn derive_bucket_key(
+    lift: &RawLift,
+    shape_cid: &str,
+    catalog: &Catalog,
+) -> (String, Option<String>) {
+    if let Some(human) = lift.concept_annotation.as_ref() {
+        (format!("human:{human}"), None)
+    } else if let Some(c) = catalog.match_shape(shape_cid, &lift.term_shape) {
+        (format!("catalog:{}", c.id), Some(c.id.clone()))
+    } else {
+        (format!("shape:{shape_cid}"), None)
+    }
+}

Then both the clustering pass and the binding pass call derive_bucket_key(...) and can never drift apart. (The first pass still needs the final_name branch; that can stay inline or be folded into the helper's return tuple.)

Alternatively — and arguably cleaner — cache the resolved bucket_key (and the matched catalog_id) on RawLift during the lift verb, so the binding pass becomes a single key_to_concept_idx[&lift.bucket_key] lookup with no recomputation and no possibility of divergence.

Also worth noting: with the fix in place, every lift's bucket_key is guaranteed to have been inserted into key_to_concept_idx by the first loop, so the .or_else(|| shape_to_concept.get(&shape_cid)) fallback at Line 619 is unreachable in practice. Keeping it as defense-in-depth is fine; just be aware the .expect("shape was clustered") panic message would be misleading if the invariant ever broke (it would more likely indicate a bucket_key derivation mismatch, not an unclustered shape).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@implementations/rust/provekit-cli/src/cmd_bind.rs` around lines 605 - 620,
The bucket_key derivation is duplicated between the clustering and binding
passes (see the logic using catalog.match_shape, lift.concept_annotation,
shape_cid, key_to_concept_idx and shape_to_concept) and should be centralized:
add a small helper function (e.g., derive_bucket_key(shape_cid: &Cid, lift:
&RawLift, catalog: &Catalog) -> String) or store the resolved bucket_key on
RawLift during the lift pass, then replace both copies with calls to that helper
(or a direct lookup of lift.bucket_key) so both passes use identical logic and
avoid recomputing catalog.match_shape when concept_annotation is present; keep
the defensive fallback to shape_to_concept if you want, but remove duplicate
logic to prevent drifting invariants.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@implementations/rust/provekit-cli/src/cmd_bind.rs`:
- Around line 605-620: The bucket_key derivation is duplicated between the
clustering and binding passes (see the logic using catalog.match_shape,
lift.concept_annotation, shape_cid, key_to_concept_idx and shape_to_concept) and
should be centralized: add a small helper function (e.g.,
derive_bucket_key(shape_cid: &Cid, lift: &RawLift, catalog: &Catalog) -> String)
or store the resolved bucket_key on RawLift during the lift pass, then replace
both copies with calls to that helper (or a direct lookup of lift.bucket_key) so
both passes use identical logic and avoid recomputing catalog.match_shape when
concept_annotation is present; keep the defensive fallback to shape_to_concept
if you want, but remove duplicate logic to prevent drifting invariants.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c2d218b0-d268-4b4e-9aa9-a37b30099571

📥 Commits

Reviewing files that changed from the base of the PR and between 4628e51 and 507b0fc.

📒 Files selected for processing (1)
  • implementations/rust/provekit-cli/src/cmd_bind.rs

@TSavo TSavo merged commit 9965b8a into main May 13, 2026
18 of 19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants