fix: prevent tool_use/tool_result mismatch at gradient prefix/raw boundary (#424) #428
Merged
…ism (#424)

Core recall improvements:
- Raise temporal SOURCE_WEIGHT 0.5→0.8 (parity with distillation for display budget)
- Raise charBudget 8K→12K chars for recall results (more room for specific details)
- Raise MAX_RRF_LISTS 10→14 (accommodate distillation recency list)
- Lower vectorBoostMinTerms 3→2 (activate vector boost for 2-term queries)
- Add distillation recency RRF list (structural parity with temporal)
- Show [lossy] hint on distillation recall results when r_compression < 1.0

Distillation transparency:
- formatDistillations() now renders compression signal + IDs + source count so the model can see how lossy each distillation is and drill into details
- gradient.ts loads r_compression, c_norm, source_ids from DB for distillations
- Recall tool description updated with drill-down guidance for d:xxx/t:xxx IDs

Eval realism:
- compactionBaseline() now iterates (2-4 passes) matching real Claude Code auto-compact behavior at 83.5% of context window threshold
- QA_SYSTEM prompt made baseline-agnostic: no recall-specific coaching
- buildQAPrompt() preamble neutralized across all baselines

Closes #424
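The commit message refers to RRF lists and a `MAX_RRF_LISTS` cap without showing how the lists are combined. As a rough illustration only, here is a minimal sketch of plain reciprocal rank fusion with a list cap. The function name `fuseRankings`, the `RankedList` type, and the smoothing constant `RRF_K = 60` are assumptions for this example, not taken from the project. How `SOURCE_WEIGHT` or the vector boost interact with the fusion is not shown in the source and is left out.

```typescript
// Hypothetical types and names for illustration; not the project's actual code.
interface RankedList {
  name: string; // e.g. "bm25", "vector", "distillation-recency"
  ids: string[]; // candidate IDs ordered best-first
}

const RRF_K = 60; // conventional RRF smoothing constant (assumed)
const MAX_RRF_LISTS = 14; // cap named in the commit message

// Plain reciprocal rank fusion: each list contributes 1 / (k + rank)
// for every ID it contains, and contributions are summed per ID.
function fuseRankings(lists: RankedList[]): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  // Lists beyond the cap are ignored in this sketch.
  for (const list of lists.slice(0, MAX_RRF_LISTS)) {
    list.ids.forEach((id, index) => {
      const contribution = 1 / (RRF_K + index + 1); // rank is 1-based
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

Under this scheme, adding a recency list means one more `RankedList` entry; an ID that ranks well in several lists accumulates a higher fused score than one that appears in a single list.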
BYK added a commit that referenced this pull request on May 20, 2026
#433)

## Summary

Takes the CM-1 eval score from 4.39 → **4.71** at 400K inflation (14 of 15 questions >= 4.7).

## Changes

### 1. Scripted upstream interceptor for eval replay (`harness.ts`)

**Problem:** The eval replays scripted conversation turns through the gateway, but the gateway forwards to the real API which generates its own response. The gateway stores the API's response in temporal (not the scripted content), so the scenario's ground-truth details were never in Lore's memory.

**Fix:** `buildScriptedInterceptor()` creates an upstream interceptor that returns the scenario's scripted assistant turns as Anthropic-format responses. Benefits:
- No upstream API calls during replay (saves cost)
- Gateway stores the SCRIPTED content in temporal
- Distillation processes the ground-truth content
- Recall can find the exact details questions test

Assumptions documented: scripted responses never contain recall tool_use blocks (no follow-up calls), non-filler user/assistant turns alternate 1:1.

### 2. Distillation prompt tuning (`prompt.ts`)

Two new sections in `DISTILLATION_SYSTEM` (per `docs/PROMPT_CHANGES.md`):
- **DECISIONS AND ALTERNATIVES:** Record ALL options evaluated with names, chosen/rejected reasoning
- **DEBUGGING AND INVESTIGATION:** Record hypothesis sequence including wrong ones with evidence

Token delta: +400 tokens (~10%). No length caps, no structural changes. Affects gen-0 distillation only.

### 3. Stale `.d.ts` fix (`prompt.d.ts`)

Updated `formatDistillations` type declaration to include the optional `id`, `r_compression`, `source_ids` fields added in #428.

## Review Fixes

- Added assumption documentation to `buildScriptedInterceptor` JSDoc
- Removed unnecessary `(b: any)` cast; uses `as const` for proper type narrowing
- Updated stale `prompt.d.ts` declaration

## Eval Results (CM-1, 400K inflation)

| Metric | Before | After |
|---|---|---|
| Score | 4.39 | **4.71** |
| Questions >= 4.0 | 13/15 | **14/15** |
| Questions = 5.0 | 9/15 | **11/15** |

Remaining failure: m3 (1.5). The model doesn't invoke recall because distilled context gives false confidence.

## Tests

- 1752 pass, 0 fail
- Typecheck clean across all 4 packages
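The scripted upstream interceptor described under Changes could look roughly like the sketch below. The real `buildScriptedInterceptor()` signature is not shown in this commit message, so `buildScriptedInterceptorSketch`, its parameters, the placeholder model name, and the zeroed `usage` values are all hypothetical. The response shape follows the text-only subset of an Anthropic Messages API response.

```typescript
// Text-only subset of an Anthropic Messages API response.
interface AnthropicMessage {
  id: string;
  type: "message";
  role: "assistant";
  model: string;
  content: Array<{ type: "text"; text: string }>;
  stop_reason: "end_turn";
  stop_sequence: null;
  usage: { input_tokens: number; output_tokens: number };
}

// Hypothetical sketch: hands back the scenario's scripted assistant turns
// in order instead of calling the upstream API.
function buildScriptedInterceptorSketch(
  scriptedTurns: string[],
  model = "scripted-replay", // placeholder model name (assumed)
): () => AnthropicMessage {
  let next = 0;
  return function intercept(): AnthropicMessage {
    const text = scriptedTurns[next];
    if (text === undefined) {
      // Mirrors the documented 1:1 alternation assumption: one scripted
      // assistant turn per intercepted request, and no follow-up calls.
      throw new Error("No scripted assistant turn left for this request");
    }
    next += 1;
    return {
      id: `msg_scripted_${next}`,
      type: "message",
      role: "assistant",
      model,
      content: [{ type: "text", text }],
      stop_reason: "end_turn",
      stop_sequence: null,
      usage: { input_tokens: 0, output_tokens: 0 }, // placeholder counts
    };
  };
}
```

Because the gateway receives the scripted text as if it came from upstream, whatever it stores and later distills is the scenario's ground-truth content rather than a freshly generated reply.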
## Summary

Fixes the tool_use/tool_result pairing mismatch (#424) and improves recall search quality, distillation transparency, and eval baseline realism. Eval score: 3.69 → 4.09 at 400K inflation, 4.86 without inflation.

Closes #424
## 1. Fix: tool_use/tool_result pairing mismatch (#424)

**Root cause:** The gradient context manager assembles `[...distilledPrefix, ...rawWindow]`. The prefix ends with an assistant message. When the budget-driven cutoff starts the raw window with an assistant message containing `tool_use` blocks, the back-to-back assistant messages cause the Anthropic API to reject the request with "tool_use ids found without tool_result blocks."

**Fixes:**
- `gradient.ts`: All three assembly points (`tryFit`, `tryFitStable`, layer 4) advance past leading assistant messages when a prefix is present
- `pipeline.ts`: `removeOrphanedToolResults()` gains a bidirectional Pass 2 that validates tool_use→tool_result (not just tool_result→tool_use)

**Verified:** Zero `upstream error: 400` in the live inflated eval (was: hundreds).

## 2. Recall search quality
- `SOURCE_WEIGHT` temporal 0.5→0.8 (parity with distillation for display budget)
- `charBudget` 8K→12K chars for richer recall output
- `MAX_RRF_LISTS` 10→14 to accommodate new distillation recency list
- `vectorBoostMinTerms` 3→2 (vector boost activates for 2-term queries)
- `[lossy]` hint on distillation recall results when `r_compression < 1.0`

## 3. Distillation transparency (uniform citation format)
All recall-able references use a uniform `(prefix:id)` format throughout context:
- `(d:UUID | lossy | N sources)`: shows compression quality and source count
- `[tool results provided] (t:msgID)`: model can fetch original tool output via recall
- `(d:xxx)`, `(t:xxx)`, `(k:xxx)` on truncated results (unchanged)
- `(prefix:id)` citations can be fetched via the `id` parameter

## 4. Temporal storage fix
`temporal.store()` now runs BEFORE `resolveToolResults()` so the original tool_result content (e.g., file contents, command output) is preserved in temporal storage and searchable via recall, rather than the useless `[tool results provided]` placeholder.

## 5. Eval baseline realism
- `compactionBaseline()` now iterates (2-4 passes) matching Claude Code auto-compact at 83.5% threshold
- `QA_SYSTEM` and `buildQAPrompt` made baseline-agnostic: no recall-specific coaching

## Test coverage
- `removeOrphanedToolResults` Pass 2 tests (5 cases)
- `(t:msgID)` placeholder format
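To make the two parts of the pairing fix concrete, here is a minimal sketch of both ideas: skipping leading assistant messages in the raw window when a prefix is present, and removing orphaned blocks in both directions. The types and the helper names `skipLeadingAssistants` and `removeOrphanedToolBlocks` are hypothetical, and the matching is simplified to a global ID lookup rather than the adjacent-message pairing the API actually enforces. It is not the code in `gradient.ts` or `pipeline.ts`.

```typescript
// Simplified message model (assumed for this sketch).
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string }
  | { type: "tool_result"; tool_use_id: string; content: string };

interface ChatMessage {
  role: "user" | "assistant";
  content: ContentBlock[];
}

// When a distilled prefix (ending in an assistant message) is present,
// drop leading assistant messages so the raw window starts with a user turn.
function skipLeadingAssistants(rawWindow: ChatMessage[], hasPrefix: boolean): ChatMessage[] {
  if (!hasPrefix) return rawWindow;
  let start = 0;
  while (start < rawWindow.length && rawWindow[start]?.role === "assistant") {
    start += 1;
  }
  return rawWindow.slice(start);
}

// Bidirectional cleanup: remove tool_result blocks without a matching
// tool_use (pass 1) and tool_use blocks without a matching tool_result (pass 2).
function removeOrphanedToolBlocks(messages: ChatMessage[]): ChatMessage[] {
  const useIds = new Set<string>();
  const resultIds = new Set<string>();
  for (const message of messages) {
    for (const block of message.content) {
      if (block.type === "tool_use") useIds.add(block.id);
      if (block.type === "tool_result") resultIds.add(block.tool_use_id);
    }
  }
  const keep = (block: ContentBlock): boolean => {
    if (block.type === "tool_result") return useIds.has(block.tool_use_id); // pass 1
    if (block.type === "tool_use") return resultIds.has(block.id); // pass 2
    return true;
  };
  // Messages left with no content are dropped entirely.
  return messages
    .map((message) => ({ ...message, content: message.content.filter(keep) }))
    .filter((message) => message.content.length > 0);
}
```

Running `skipLeadingAssistants` first can itself orphan a `tool_result` whose `tool_use` lived in the skipped assistant message, which is why the cleanup follows it here. This sketch does not repair role alternation if dropping an emptied message leaves two same-role messages adjacent.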