fix(bind): route site_indices by bucket_key, drop 2 trinity loss-records (7→5)#764
Conversation
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughThe bind engine's concept-to-shape collision handling is fixed in two complementary parts: the shape lookup now preserves the first concept index per shape CID instead of overwriting, and concept resolution during binding recomputes bucket-key priority to match concept creation logic, with fallback to the preserved shape lookup. ChangesConcept-to-Shape Collision Handling Fix
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
The trinity round-trip test consistently emitted two `below-threshold`
gap records reporting that `concept:identity` and `concept:bool-cell`
had 0 sites — despite the fixture containing `wrap_identity` and
`toggle` with the matching `// concept: ...` annotations.
The bug was in cmd_bind.rs's second-pass site-attribution loop:
let concept_idx = *shape_to_concept.get(&shape_cid).expect(...);
`shape_to_concept` is a BTreeMap<shape_cid, concept_idx> built in a
preceding loop. When two distinct concepts (different bucket_keys,
e.g. different `// concept: X` annotations) happened to share the
same term-shape CID, the later concept's insert *overwrote* the
earlier one's mapping. The second loop then routed every lift with
that shape to the later concept, leaving the earlier one empty.
Two fixes:
1. Build shape_to_concept with `entry().or_insert()` so the FIRST
concept to claim a shape wins (deterministic; matches the order
raw_lifts presents to the clustering loop).
2. In the second loop, re-derive the same bucket_key the first loop
used (annotation > catalog > shape) and look up key_to_concept_idx
first, falling back to shape_to_concept only when no key match
exists. This routes each lift to the concept ITS annotation/catalog
match created, not whichever concept happens to currently own its
shape_cid in the global map.
Trinity round-trip fixture, before:
bind: top concepts:
concept:pair: 3 site(s)
[concept:identity and concept:bool-cell absent — 0 sites]
trinity_round_trip verdict: 7 loss entries including
[below-threshold] concept:identity has 0 site(s)
[below-threshold] concept:bool-cell has 0 site(s)
After:
bind: top concepts:
concept:identity: 1 site(s)
concept:bool-cell: 1 site(s)
concept:pair: 1 site(s)
[...10 named concepts, each 1 site, sum = 12 bindings]
trinity_round_trip verdict: 5 loss entries; below-threshold gone.
- trinity_roundtrip_test: PASS (5 entries vs 7; honestly LBL on the
three substrate-level gaps that remain: bind-stub-body-emitted,
v0-capability-gap×2, source-language-not-supported, leg-3-not-reached).
- cmd_bind_integration: 27/27 PASS.
- All other provekit-cli test groups GREEN (224 passes total).
- Pre-existing polyglot_smoke failure (missing provekit-linkerd
binary) unrelated and unaffected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
507b0fc to
96e771a
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
implementations/rust/provekit-cli/src/cmd_bind.rs (1)
605-620: ⚡ Quick winDeduplicate
bucket_keyderivation between the two passes.The bucket_key priority block at Lines 610-616 is a near-verbatim copy of the logic at Lines 522-539 (and the comment at Lines 605-608 acknowledges this is the whole point of the fix). The original collision bug was effectively a "two sites of truth that fell out of sync"; keeping two hand-rolled copies of the priority chain leaves the same trap for the next change (e.g., adding a 4th priority tier, renaming the
human:/catalog:/shape:prefixes, or changing howconcept_annotationis normalized).Consider extracting the derivation into a small helper so both passes resolve through the exact same function. As a side benefit, you avoid calling
catalog.match_shape()when aconcept_annotationis present (currently both loops compute it unconditionally before theif let).♻️ Sketch of the refactor
+fn derive_bucket_key( + lift: &RawLift, + shape_cid: &str, + catalog: &Catalog, +) -> (String, Option<String>) { + if let Some(human) = lift.concept_annotation.as_ref() { + (format!("human:{human}"), None) + } else if let Some(c) = catalog.match_shape(shape_cid, &lift.term_shape) { + (format!("catalog:{}", c.id), Some(c.id.clone())) + } else { + (format!("shape:{shape_cid}"), None) + } +}Then both the clustering pass and the binding pass call
derive_bucket_key(...)and can never drift apart. (The first pass still needs thefinal_namebranch; that can stay inline or be folded into the helper's return tuple.)Alternatively — and arguably cleaner — cache the resolved
bucket_key(and the matchedcatalog_id) onRawLiftduring the lift verb, so the binding pass becomes a singlekey_to_concept_idx[&lift.bucket_key]lookup with no recomputation and no possibility of divergence.Also worth noting: with the fix in place, every lift's
bucket_keyis guaranteed to have been inserted intokey_to_concept_idxby the first loop, so the.or_else(|| shape_to_concept.get(&shape_cid))fallback at Line 619 is unreachable in practice. Keeping it as defense-in-depth is fine; just be aware the.expect("shape was clustered")panic message would be misleading if the invariant ever broke (it would more likely indicate a bucket_key derivation mismatch, not an unclustered shape).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@implementations/rust/provekit-cli/src/cmd_bind.rs` around lines 605 - 620, The bucket_key derivation is duplicated between the clustering and binding passes (see the logic using catalog.match_shape, lift.concept_annotation, shape_cid, key_to_concept_idx and shape_to_concept) and should be centralized: add a small helper function (e.g., derive_bucket_key(shape_cid: &Cid, lift: &RawLift, catalog: &Catalog) -> String) or store the resolved bucket_key on RawLift during the lift pass, then replace both copies with calls to that helper (or a direct lookup of lift.bucket_key) so both passes use identical logic and avoid recomputing catalog.match_shape when concept_annotation is present; keep the defensive fallback to shape_to_concept if you want, but remove duplicate logic to prevent drifting invariants.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@implementations/rust/provekit-cli/src/cmd_bind.rs`:
- Around line 605-620: The bucket_key derivation is duplicated between the
clustering and binding passes (see the logic using catalog.match_shape,
lift.concept_annotation, shape_cid, key_to_concept_idx and shape_to_concept) and
should be centralized: add a small helper function (e.g.,
derive_bucket_key(shape_cid: &Cid, lift: &RawLift, catalog: &Catalog) -> String)
or store the resolved bucket_key on RawLift during the lift pass, then replace
both copies with calls to that helper (or a direct lookup of lift.bucket_key) so
both passes use identical logic and avoid recomputing catalog.match_shape when
concept_annotation is present; keep the defensive fallback to shape_to_concept
if you want, but remove duplicate logic to prevent drifting invariants.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c2d218b0-d268-4b4e-9aa9-a37b30099571
📒 Files selected for processing (1)
implementations/rust/provekit-cli/src/cmd_bind.rs
Summary
Bucket-key collision in
cmd_bind.rs's site-attribution loop was zeroingconcept:identityandconcept:bool-cell. Fixing it drops twobelow-thresholdloss-records from the trinity round-trip verdict (7→5).Root cause
shape_to_concept: BTreeMap<shape_cid, concept_idx>was built with plaininsert(), so when two concepts shared a shape, the LATER one's mapping overwrote the earlier. The second-pass loop then routed every lift with that shape to the wrong concept, leaving the original empty.Fix
shape_to_conceptis now built withentry().or_insert()— first concept to claim a shape wins, deterministic with the cluster loop's insertion order.bucket_keypriority (annotation > catalog > shape) the first loop used and looks upkey_to_concept_idxdirectly, falling back toshape_to_conceptonly when no key match exists.Empirical verification
Before (Rust→Java leg on the trinity fixture):
After:
Tests
trinity_roundtrip_test: PASS, verdict drops 7→5 entries.cmd_bind_integration: 27/27 PASS.provekit-clitest groups: 29 + 114 + 15 + 13 + 27 + 1 + 2 + 19 + 4 = 224 PASS.test_daemon_polyglot_smokefailure (missingprovekit-linkerdbinary, documented in feat(wave-c): bind emits real function bodies for trinity-roundtrip slice (#748) #759/feat(760-slice2): Java realize plugin + java-canonical.json + byte-identical trinity test #761 PR bodies) unrelated and unaffected.Why this matters
The two below-threshold gaps were the only loss-records caused by an actual classifier bug — not by substrate limits. The other five (
bind-stub-body-emitted,v0-capability-gap×2,source-language-not-supported,leg-3-not-reached) are genuine deferred-to-v1 limits and remain honestly recorded. Branch 2 LBL stays first-class; Branch 1 (byte-identical exact trinity) is closer to achievable now that 2/7 loss-dims are eliminated.Tracked as task #153.
🤖 Generated with Claude Code
Summary by CodeRabbit