fix: prevent tool_use/tool_result mismatch at gradient prefix/raw boundary (#424) by BYK · Pull Request #428 · BYK/loreai

BYK · 2026-05-20T13:20:26Z

Summary

Fixes the tool_use/tool_result pairing mismatch (#424) and improves recall search quality, distillation transparency, and eval baseline realism. Eval score: 3.69 → 4.09 at 400K inflation, 4.86 without inflation.

Closes #424

1. Fix: tool_use/tool_result pairing mismatch (#424)

Root cause: The gradient context manager assembles [...distilledPrefix, ...rawWindow]. The prefix ends with an assistant message. When the budget-driven cutoff makes the raw window start with an assistant message that contains tool_use blocks, the result is two back-to-back assistant messages, and the Anthropic API rejects the request with "tool_use ids found without tool_result blocks."
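
The boundary condition above can be illustrated with a minimal sketch. This is not the code in gradient.ts: the `Message` shape and the names `skipLeadingAssistants` and `assembleContext` are hypothetical, and the real assembly points (tryFit, tryFitStable, layer 4) also handle budgeting that is omitted here.

```typescript
// Hypothetical, simplified message shape; the real Lore types are richer.
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string }
  | { type: "tool_result"; tool_use_id: string };

interface Message {
  role: "user" | "assistant";
  content: Block[];
}

// Drop assistant messages from the front of the raw window so that the
// window starts on a user turn (or is empty).
function skipLeadingAssistants(rawWindow: Message[]): Message[] {
  let start = 0;
  while (start < rawWindow.length && rawWindow[start]?.role === "assistant") {
    start++;
  }
  return rawWindow.slice(start);
}

// Assemble [...distilledPrefix, ...rawWindow]. Per the PR description the
// prefix ends with an assistant message, so when a prefix is present the
// raw window must not begin with another assistant message.
function assembleContext(distilledPrefix: Message[], rawWindow: Message[]): Message[] {
  const raw = distilledPrefix.length > 0 ? skipLeadingAssistants(rawWindow) : rawWindow;
  return [...distilledPrefix, ...raw];
}
```

Skipping an assistant message that carried tool_use blocks can leave its tool_result blocks behind in the following user message, which is presumably why the pairing validation in pipeline.ts is part of the same fix.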

Fixes:

gradient.ts: All three assembly points (tryFit, tryFitStable, layer 4) advance past leading assistant messages when prefix is present
pipeline.ts: removeOrphanedToolResults() gains bidirectional Pass 2 — validates tool_use→tool_result (not just tool_result→tool_use)
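
A rough sketch of what a bidirectional pairing check can look like follows. It assumes adjacency-based matching (a tool_result must answer a tool_use in the immediately preceding assistant message, and vice versa) and assumes that unpaired blocks are simply dropped. The function name `dropUnpairedToolBlocks` and the types are hypothetical; the actual behavior of removeOrphanedToolResults() may differ.

```typescript
// Hypothetical, simplified message shape for illustration only.
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string }
  | { type: "tool_result"; tool_use_id: string };

interface Message {
  role: "user" | "assistant";
  content: Block[];
}

// Ids of tool_use blocks in an assistant message (empty set otherwise).
function toolUseIds(msg: Message | undefined): Set<string> {
  const ids = new Set<string>();
  if (msg?.role === "assistant") {
    for (const b of msg.content) if (b.type === "tool_use") ids.add(b.id);
  }
  return ids;
}

// Ids answered by tool_result blocks in a user message (empty set otherwise).
function toolResultIds(msg: Message | undefined): Set<string> {
  const ids = new Set<string>();
  if (msg?.role === "user") {
    for (const b of msg.content) if (b.type === "tool_result") ids.add(b.tool_use_id);
  }
  return ids;
}

function dropUnpairedToolBlocks(messages: Message[]): Message[] {
  // Pass 1 (tool_result -> tool_use): keep a tool_result only if the
  // previous message is an assistant message containing the matching tool_use.
  const pass1 = messages.map((msg, i): Message => {
    if (msg.role !== "user") return msg;
    const uses = toolUseIds(messages[i - 1]);
    const content = msg.content.filter(
      (b) => b.type !== "tool_result" || uses.has(b.tool_use_id),
    );
    return { ...msg, content };
  });

  // Pass 2 (tool_use -> tool_result): keep a tool_use only if the next
  // message is a user message that answers it.
  return pass1.map((msg, i): Message => {
    if (msg.role !== "assistant") return msg;
    const results = toolResultIds(pass1[i + 1]);
    const content = msg.content.filter(
      (b) => b.type !== "tool_use" || results.has(b.id),
    );
    return { ...msg, content };
  });
}
```

This sketch leaves messages in place even when filtering empties them; a real implementation would also need to decide what to do with those.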

Verified: zero "upstream error: 400" occurrences in the live inflated eval (previously hundreds).

2. Recall search quality

SOURCE_WEIGHT temporal 0.5→0.8 (parity with distillation for display budget)
charBudget 8K→12K chars for richer recall output
MAX_RRF_LISTS 10→14 to accommodate new distillation recency list
vectorBoostMinTerms 3→2 (vector boost activates for 2-term queries)
Distillation recency RRF list (structural parity with temporal)
[lossy] hint on distillation recall results when r_compression < 1.0
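
For readers unfamiliar with reciprocal rank fusion (RRF), the sketch below shows how several ranked lists can be merged under a list cap. The constant `RRF_K = 60` is the conventional default, not a value taken from this PR, and applying a per-source weight as a score multiplier is an assumption of this sketch: the PR only says SOURCE_WEIGHT provides parity for the display budget. All names here are hypothetical.

```typescript
// Conventional RRF smoothing constant; the value Lore uses is not stated in the PR.
const RRF_K = 60;

interface RankedList {
  source: string; // e.g. "temporal" or "distillation"
  ids: string[];  // result ids, best first
}

interface Fused {
  id: string;
  score: number;
}

// Merge ranked lists with reciprocal rank fusion. Only the first `maxLists`
// lists are considered. ASSUMPTION: the per-source weight multiplies each
// list's contribution; unknown sources default to weight 1.
function fuseRankedLists(
  lists: RankedList[],
  sourceWeight: Record<string, number>,
  maxLists: number,
): Fused[] {
  const scores = new Map<string, number>();
  for (const list of lists.slice(0, maxLists)) {
    const weight = sourceWeight[list.source] ?? 1;
    list.ids.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based in the RRF formula
      scores.set(id, (scores.get(id) ?? 0) + weight / (RRF_K + rank));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

An id that appears in several lists accumulates score from each of them, which is why adding a distillation recency list requires room under the list cap.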

3. Distillation transparency (uniform citation format)

All recall-able references use uniform (prefix:id) format throughout context:

Distillation headers: (d:UUID | lossy | N sources) — shows compression quality and source count
Tool result placeholders: [tool results provided] (t:msgID) — model can fetch original tool output via recall
Recall results: (d:xxx), (t:xxx), (k:xxx) on truncated results (unchanged)
Recall tool description: Documents that (prefix:id) citations can be fetched via the id parameter
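
The uniform (prefix:id) format lends itself to a small formatter and parser, sketched below. The function names are hypothetical, and using r_compression < 1.0 as the condition for the "lossy" marker in headers is an assumption carried over from the recall hint in section 2.

```typescript
// Prefixes that appear in the PR description.
type CitationPrefix = "d" | "t" | "k";

interface Citation {
  prefix: CitationPrefix;
  id: string;
}

// Render a bare citation such as (t:msg123).
function formatCitation(c: Citation): string {
  return `(${c.prefix}:${c.id})`;
}

// Render a distillation header such as (d:UUID | lossy | 3 sources).
// ASSUMPTION: "lossy" is shown when rCompression < 1.0.
function formatDistillationHeader(id: string, rCompression: number, sourceCount: number): string {
  const parts = [`d:${id}`];
  if (rCompression < 1.0) parts.push("lossy");
  parts.push(`${sourceCount} sources`);
  return `(${parts.join(" | ")})`;
}

// Find every citation in a piece of text. The id ends at whitespace, "|" or ")",
// so both bare citations and distillation headers are recognized.
function extractCitations(text: string): Citation[] {
  const pattern = /\(([dtk]):([^\s|)]+)/g;
  const found: Citation[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const prefix = match[1];
    const id = match[2];
    if (prefix && id) found.push({ prefix: prefix as CitationPrefix, id });
  }
  return found;
}
```

With one format, the same extraction step can feed the recall tool's id parameter regardless of where the citation appeared in context.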

4. Temporal storage fix

temporal.store() now runs BEFORE resolveToolResults() so the original tool_result content (e.g., file contents, command output) is preserved in temporal storage and searchable via recall — not the useless [tool results provided] placeholder.
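
The ordering matters because the placeholder replaces the content in place. A minimal sketch of the idea, using a hypothetical in-memory store and hypothetical names (`InMemoryTemporal`, `toPlaceholder`, `ingest`) rather than Lore's real temporal storage or resolveToolResults():

```typescript
interface StoredMessage {
  id: string;
  toolOutput: string;
}

// Hypothetical stand-in for temporal storage with naive substring search.
class InMemoryTemporal {
  private rows: StoredMessage[] = [];

  store(msg: StoredMessage): void {
    this.rows.push({ ...msg }); // copy so later edits do not affect storage
  }

  search(term: string): StoredMessage[] {
    return this.rows.filter((row) => row.toolOutput.includes(term));
  }
}

// Replace the tool output with the placeholder format named in the PR.
function toPlaceholder(msg: StoredMessage): StoredMessage {
  return { ...msg, toolOutput: `[tool results provided] (t:${msg.id})` };
}

// Store the original first, then hand the placeholder version to the context.
function ingest(temporal: InMemoryTemporal, msg: StoredMessage): StoredMessage {
  temporal.store(msg);
  return toPlaceholder(msg);
}
```

If the two steps ran in the opposite order, the store would only ever see the placeholder text and a search for the original output would find nothing.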

5. Eval baseline realism

compactionBaseline() now iterates (2-4 passes) matching Claude Code auto-compact at 83.5% threshold
QA_SYSTEM and buildQAPrompt made baseline-agnostic — no recall-specific coaching
Post-replay embedding backfill for vector search availability
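
To see why an iterative baseline produces several compaction passes over a long transcript, consider the toy simulation below. It only counts passes; the summary size, the per-turn token counts, and the name `countCompactionPasses` are assumptions of this sketch and not taken from compactionBaseline().

```typescript
// Threshold from the PR description: compact at 83.5% of the context window.
const COMPACT_THRESHOLD = 0.835;

// Replay per-turn token counts and count how often the running context
// crosses the threshold. ASSUMPTION: each compaction replaces the whole
// context with a summary of fixed size `summaryTokens`.
function countCompactionPasses(
  turnTokens: number[],
  contextWindow: number,
  summaryTokens: number,
): number {
  const limit = contextWindow * COMPACT_THRESHOLD;
  let used = 0;
  let passes = 0;
  for (const tokens of turnTokens) {
    used += tokens;
    if (used > limit) {
      used = summaryTokens; // context collapses to the summary
      passes++;
    }
  }
  return passes;
}
```

A single-pass baseline summarizes the transcript once, whereas each additional pass in this model summarizes a context that already contains an earlier summary, so detail loss compounds.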

Test coverage

1741 pass, 0 fail across all 1746 tests
New gradient prefix/raw boundary test (bug: inflated eval produces tool_use/tool_result mismatch errors through gateway #424)
New removeOrphanedToolResults Pass 2 tests (5 cases)
End-to-end integration test: gateway format → Lore format → resolveToolResults → gradient → loreMessagesToGateway → removeOrphanedToolResults → Anthropic API compliance validation
Updated temporal-adapter tests for new (t:msgID) placeholder format
Typecheck clean across all 4 packages

…ism (#424)

Core recall improvements:
- Raise temporal SOURCE_WEIGHT 0.5→0.8 (parity with distillation for display budget)
- Raise charBudget 8K→12K chars for recall results (more room for specific details)
- Raise MAX_RRF_LISTS 10→14 (accommodate distillation recency list)
- Lower vectorBoostMinTerms 3→2 (activate vector boost for 2-term queries)
- Add distillation recency RRF list (structural parity with temporal)
- Show [lossy] hint on distillation recall results when r_compression < 1.0

Distillation transparency:
- formatDistillations() now renders compression signal + IDs + source count so the model can see how lossy each distillation is and drill into details
- gradient.ts loads r_compression, c_norm, source_ids from DB for distillations
- Recall tool description updated with drill-down guidance for d:xxx/t:xxx IDs

Eval realism:
- compactionBaseline() now iterates (2-4 passes) matching real Claude Code auto-compact behavior at 83.5% of context window threshold
- QA_SYSTEM prompt made baseline-agnostic: no recall-specific coaching
- buildQAPrompt() preamble neutralized across all baselines

Closes #424

#433)

## Summary

Takes the CM-1 eval score from 4.39 → **4.71** at 400K inflation (14 of 15 questions >= 4.7).

## Changes

### 1. Scripted upstream interceptor for eval replay (`harness.ts`)

**Problem:** The eval replays scripted conversation turns through the gateway, but the gateway forwards to the real API, which generates its own response. The gateway stores the API's response in temporal (not the scripted content), so the scenario's ground-truth details were never in Lore's memory.

**Fix:** `buildScriptedInterceptor()` creates an upstream interceptor that returns the scenario's scripted assistant turns as Anthropic-format responses. Benefits:

- No upstream API calls during replay (saves cost)
- Gateway stores the SCRIPTED content in temporal
- Distillation processes the ground-truth content
- Recall can find the exact details questions test

Assumptions documented: scripted responses never contain recall tool_use blocks (no follow-up calls), non-filler user/assistant turns alternate 1:1.

### 2. Distillation prompt tuning (`prompt.ts`)

Two new sections in `DISTILLATION_SYSTEM` (per `docs/PROMPT_CHANGES.md`):

- **DECISIONS AND ALTERNATIVES:** Record ALL options evaluated with names, chosen/rejected reasoning
- **DEBUGGING AND INVESTIGATION:** Record hypothesis sequence including wrong ones with evidence

Token delta: +400 tokens (~10%). No length caps, no structural changes. Affects gen-0 distillation only.

### 3. Stale `.d.ts` fix (`prompt.d.ts`)

Updated `formatDistillations` type declaration to include the optional `id`, `r_compression`, `source_ids` fields added in #428.

## Review Fixes

- Added assumption documentation to `buildScriptedInterceptor` JSDoc
- Removed unnecessary `(b: any)` cast; uses `as const` for proper type narrowing
- Updated stale `prompt.d.ts` declaration

## Eval Results (CM-1, 400K inflation)

| Metric | Before | After |
|---|---|---|
| Score | 4.39 | **4.71** |
| Questions >= 4.0 | 13/15 | **14/15** |
| Questions = 5.0 | 9/15 | **11/15** |

Remaining failure: m3 (1.5). The model doesn't invoke recall because distilled context gives false confidence.

## Tests

- 1752 pass, 0 fail
- Typecheck clean across all 4 packages
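
The scripted upstream interceptor described above can be pictured roughly as follows. This is a sketch under stated assumptions, not the `buildScriptedInterceptor()` in `harness.ts`: the interceptor signature, the `ScriptedTurn` type, the placeholder model name, and the zeroed usage counts are all hypothetical. The response fields follow the general shape of an Anthropic Messages API response.

```typescript
interface ScriptedTurn {
  role: "user" | "assistant";
  text: string;
}

// Simplified shape of an Anthropic Messages API response.
interface AnthropicMessage {
  id: string;
  type: "message";
  role: "assistant";
  content: Array<{ type: "text"; text: string }>;
  model: string;
  stop_reason: "end_turn";
  stop_sequence: null;
  usage: { input_tokens: number; output_tokens: number };
}

// Build a function that answers each upstream call with the next scripted
// assistant turn instead of contacting the real API.
// ASSUMPTION (mirrors the PR): one upstream call per scripted assistant turn,
// and scripted turns contain plain text only (no tool_use blocks).
function buildScriptedInterceptorSketch(turns: ScriptedTurn[]): () => AnthropicMessage {
  const replies = turns.filter((t) => t.role === "assistant").map((t) => t.text);
  let next = 0;

  return function intercept(): AnthropicMessage {
    const text = replies[next];
    if (text === undefined) {
      throw new Error("scripted replies exhausted");
    }
    next++;
    return {
      id: `msg_scripted_${next}`,
      type: "message",
      role: "assistant",
      content: [{ type: "text", text }],
      model: "scripted", // placeholder, not a real model name
      stop_reason: "end_turn",
      stop_sequence: null,
      usage: { input_tokens: 0, output_tokens: 0 },
    };
  };
}
```

Because the gateway sees these responses as if they came from upstream, whatever it stores and distills is the scenario's ground truth rather than a freshly generated reply.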

BYK self-assigned this May 20, 2026

BYK force-pushed the fix-tool-pairing-424 branch 9 times, most recently from aa6f958 to 6259a23 Compare May 20, 2026 18:23

BYK force-pushed the fix-tool-pairing-424 branch from 6259a23 to 65c0df9 Compare May 20, 2026 19:05

BYK merged commit 17289f6 into main May 20, 2026
10 checks passed

BYK deleted the fix-tool-pairing-424 branch May 20, 2026 19:19

BYK mentioned this pull request May 20, 2026

fix: scripted interceptor for eval replay + distillation prompt tuning #433

Merged

BYK mentioned this pull request May 20, 2026

docs: update eval results — context retention 3.9→4.6, +77% vs tail-window #435

Open

BYK merged 1 commit into main from fix-tool-pairing-424
