fix(bind): route site_indices by bucket_key, drop 2 trinity loss-records (7→5) by TSavo · Pull Request #764 · TSavo/provekit

TSavo · 2026-05-13T00:14:20Z

Summary

Bucket-key collision in cmd_bind.rs's site-attribution loop was zeroing concept:identity and concept:bool-cell. Fixing it drops two below-threshold loss-records from the trinity round-trip verdict (7→5).

Root cause

shape_to_concept: BTreeMap<shape_cid, concept_idx> was built with plain insert(), so when two concepts shared a shape, the LATER one's mapping overwrote the earlier. The second-pass loop then routed every lift with that shape to the wrong concept, leaving the original empty.

Fix

shape_to_concept is now built with entry().or_insert() — first concept to claim a shape wins, deterministic with the cluster loop's insertion order.
Second loop re-derives the same bucket_key priority (annotation > catalog > shape) the first loop used and looks up key_to_concept_idx directly, falling back to shape_to_concept only when no key match exists.

Empirical verification

Before (Rust→Java leg on the trinity fixture):

bind: top concepts:
  concept:pair: 3 site(s)
  [concept:identity, concept:bool-cell absent — 0 sites in gaps.json]

trinity_round_trip: 7 loss entries including
  [below-threshold] concept:identity has 0 site(s)
  [below-threshold] concept:bool-cell has 0 site(s)

After:

bind: top concepts:
  concept:identity: 1 site(s)
  concept:unit: 1 site(s)
  concept:bool-cell: 1 site(s)
  [...10 named concepts, sum = 12 bindings]

trinity_round_trip: 5 loss entries — below-threshold gone.

Tests

trinity_roundtrip_test: PASS, verdict drops 7→5 entries.
cmd_bind_integration: 27/27 PASS.
All other provekit-cli test groups: 29 + 114 + 15 + 13 + 27 + 1 + 2 + 19 + 4 = 224 PASS.
Pre-existing test_daemon_polyglot_smoke failure (missing provekit-linkerd binary, documented in feat(wave-c): bind emits real function bodies for trinity-roundtrip slice (#748) #759/feat(760-slice2): Java realize plugin + java-canonical.json + byte-identical trinity test #761 PR bodies) unrelated and unaffected.

Why this matters

The two below-threshold gaps were the only loss-records caused by an actual classifier bug — not by substrate limits. The other five (bind-stub-body-emitted, v0-capability-gap×2, source-language-not-supported, leg-3-not-reached) are genuine deferred-to-v1 limits and remain honestly recorded. Branch 2 LBL stays first-class; Branch 1 (byte-identical exact trinity) is closer to achievable now that 2/7 loss-dims are eliminated.

Tracked as task #153.

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Fixed an issue in the bind engine's collision handling for scenarios where multiple distinct concepts share shape characteristics. The binding process now correctly preserves individual concept identity during clustering and assignment, preventing incorrect concept attribution.

chatgpt-codex-connector · 2026-05-13T00:14:25Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

coderabbitai · 2026-05-13T00:14:34Z

Warning

Rate limit exceeded

@TSavo has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 56 minutes and 5 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7285af33-187a-43ee-8929-6f58144734b5

📥 Commits

Reviewing files that changed from the base of the PR and between 507b0fc and 96e771a.

📒 Files selected for processing (1)

implementations/rust/provekit-cli/src/cmd_bind.rs

Walkthrough

The bind engine's concept-to-shape collision handling is fixed in two complementary parts: the shape lookup now preserves the first concept index per shape CID instead of overwriting, and concept resolution during binding recomputes bucket-key priority to match concept creation logic, with fallback to the preserved shape lookup.

Changes

Concept-to-Shape Collision Handling Fix

Layer / File(s)	Summary
Preserve first concept index in shape lookup `implementations/rust/provekit-cli/src/cmd_bind.rs`	The `shape_to_concept` map is rebuilt using `entry(...).or_insert(...)` to preserve the first registered concept index per `shape_cid` and per `shape_cid_alias`, preventing overwrites by later concept registrations.
Resolve concept index with bucket key priority `implementations/rust/provekit-cli/src/cmd_bind.rs`	During binding, `concept_idx` is resolved by recomputing the binding's `bucket_key` with the same priority logic as concept creation (human annotation, catalog match, shape) and looking up via `key_to_concept_idx`, with fallback to `shape_to_concept` only when no bucket-key match exists.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 When concepts collide, we hold the first one dear,
Building maps with care to make priorities clear.
Bucket keys guide the path, a matching dance,
Where every binding finds its rightful stance!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly addresses the primary fix in the changeset—routing site indices by bucket_key to resolve concept collision issues—and references the measurable outcome (reducing loss records from 7 to 5).
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/bind-classify-bucket-key-collision

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

The trinity round-trip test consistently emitted two `below-threshold` gap records reporting that `concept:identity` and `concept:bool-cell` had 0 sites — despite the fixture containing `wrap_identity` and `toggle` with the matching `// concept: ...` annotations. The bug was in cmd_bind.rs's second-pass site-attribution loop: let concept_idx = *shape_to_concept.get(&shape_cid).expect(...); `shape_to_concept` is a BTreeMap<shape_cid, concept_idx> built in a preceding loop. When two distinct concepts (different bucket_keys, e.g. different `// concept: X` annotations) happened to share the same term-shape CID, the later concept's insert *overwrote* the earlier one's mapping. The second loop then routed every lift with that shape to the later concept, leaving the earlier one empty. Two fixes: 1. Build shape_to_concept with `entry().or_insert()` so the FIRST concept to claim a shape wins (deterministic; matches the order raw_lifts presents to the clustering loop). 2. In the second loop, re-derive the same bucket_key the first loop used (annotation > catalog > shape) and look up key_to_concept_idx first, falling back to shape_to_concept only when no key match exists. This routes each lift to the concept ITS annotation/catalog match created, not whichever concept happens to currently own its shape_cid in the global map. Trinity round-trip fixture, before: bind: top concepts: concept:pair: 3 site(s) [concept:identity and concept:bool-cell absent — 0 sites] trinity_round_trip verdict: 7 loss entries including [below-threshold] concept:identity has 0 site(s) [below-threshold] concept:bool-cell has 0 site(s) After: bind: top concepts: concept:identity: 1 site(s) concept:bool-cell: 1 site(s) concept:pair: 1 site(s) [...10 named concepts, each 1 site, sum = 12 bindings] trinity_round_trip verdict: 5 loss entries; below-threshold gone. - trinity_roundtrip_test: PASS (5 entries vs 7; honestly LBL on the three substrate-level gaps that remain: bind-stub-body-emitted, v0-capability-gap×2, source-language-not-supported, leg-3-not-reached). - cmd_bind_integration: 27/27 PASS. - All other provekit-cli test groups GREEN (224 passes total). - Pre-existing polyglot_smoke failure (missing provekit-linkerd binary) unrelated and unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

🧹 Nitpick comments (1)

implementations/rust/provekit-cli/src/cmd_bind.rs (1)
605-620: ⚡ Quick win

Deduplicate bucket_key derivation between the two passes.

The bucket_key priority block at Lines 610-616 is a near-verbatim copy of the logic at Lines 522-539 (and the comment at Lines 605-608 acknowledges this is the whole point of the fix). The original collision bug was effectively a "two sites of truth that fell out of sync"; keeping two hand-rolled copies of the priority chain leaves the same trap for the next change (e.g., adding a 4th priority tier, renaming the human: / catalog: / shape: prefixes, or changing how concept_annotation is normalized).

Consider extracting the derivation into a small helper so both passes resolve through the exact same function. As a side benefit, you avoid calling catalog.match_shape() when a concept_annotation is present (currently both loops compute it unconditionally before the if let).
♻️ Sketch of the refactor
+fn derive_bucket_key(
+    lift: &RawLift,
+    shape_cid: &str,
+    catalog: &Catalog,
+) -> (String, Option<String>) {
+    if let Some(human) = lift.concept_annotation.as_ref() {
+        (format!("human:{human}"), None)
+    } else if let Some(c) = catalog.match_shape(shape_cid, &lift.term_shape) {
+        (format!("catalog:{}", c.id), Some(c.id.clone()))
+    } else {
+        (format!("shape:{shape_cid}"), None)
+    }
+}
Then both the clustering pass and the binding pass call derive_bucket_key(...) and can never drift apart. (The first pass still needs the final_name branch; that can stay inline or be folded into the helper's return tuple.)

Alternatively — and arguably cleaner — cache the resolved bucket_key (and the matched catalog_id) on RawLift during the lift verb, so the binding pass becomes a single key_to_concept_idx[&lift.bucket_key] lookup with no recomputation and no possibility of divergence.
Also worth noting: with the fix in place, every lift's bucket_key is guaranteed to have been inserted into key_to_concept_idx by the first loop, so the .or_else(|| shape_to_concept.get(&shape_cid)) fallback at Line 619 is unreachable in practice. Keeping it as defense-in-depth is fine; just be aware the .expect("shape was clustered") panic message would be misleading if the invariant ever broke (it would more likely indicate a bucket_key derivation mismatch, not an unclustered shape).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@implementations/rust/provekit-cli/src/cmd_bind.rs` around lines 605 - 620,
The bucket_key derivation is duplicated between the clustering and binding
passes (see the logic using catalog.match_shape, lift.concept_annotation,
shape_cid, key_to_concept_idx and shape_to_concept) and should be centralized:
add a small helper function (e.g., derive_bucket_key(shape_cid: &Cid, lift:
&RawLift, catalog: &Catalog) -> String) or store the resolved bucket_key on
RawLift during the lift pass, then replace both copies with calls to that helper
(or a direct lookup of lift.bucket_key) so both passes use identical logic and
avoid recomputing catalog.match_shape when concept_annotation is present; keep
the defensive fallback to shape_to_concept if you want, but remove duplicate
logic to prevent drifting invariants.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@implementations/rust/provekit-cli/src/cmd_bind.rs`:
- Around line 605-620: The bucket_key derivation is duplicated between the
clustering and binding passes (see the logic using catalog.match_shape,
lift.concept_annotation, shape_cid, key_to_concept_idx and shape_to_concept) and
should be centralized: add a small helper function (e.g.,
derive_bucket_key(shape_cid: &Cid, lift: &RawLift, catalog: &Catalog) -> String)
or store the resolved bucket_key on RawLift during the lift pass, then replace
both copies with calls to that helper (or a direct lookup of lift.bucket_key) so
both passes use identical logic and avoid recomputing catalog.match_shape when
concept_annotation is present; keep the defensive fallback to shape_to_concept
if you want, but remove duplicate logic to prevent drifting invariants.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c2d218b0-d268-4b4e-9aa9-a37b30099571

📥 Commits

Reviewing files that changed from the base of the PR and between 4628e51 and 507b0fc.

📒 Files selected for processing (1)

implementations/rust/provekit-cli/src/cmd_bind.rs

TSavo force-pushed the fix/bind-classify-bucket-key-collision branch from 507b0fc to 96e771a Compare May 13, 2026 00:16

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

TSavo merged commit 9965b8a into main May 13, 2026
18 of 19 checks passed

coderabbitai Bot mentioned this pull request May 15, 2026

fix(provekit-cli): wire concept-shape catalog from menagerie into bind #1011

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(bind): route site_indices by bucket_key, drop 2 trinity loss-records (7→5)#764

fix(bind): route site_indices by bucket_key, drop 2 trinity loss-records (7→5)#764
TSavo merged 1 commit into
mainfrom
fix/bind-classify-bucket-key-collision

TSavo commented May 13, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

chatgpt-codex-connector Bot commented May 13, 2026

Uh oh!

coderabbitai Bot commented May 13, 2026 •

edited

Loading

Rate limit exceeded

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

TSavo commented May 13, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Fix

Empirical verification

Tests

Why this matters

Summary by CodeRabbit

Uh oh!

chatgpt-codex-connector Bot commented May 13, 2026

Uh oh!

coderabbitai Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

TSavo commented May 13, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 13, 2026 •

edited

Loading