fix: prevent tool_use/tool_result mismatch at gradient prefix/raw boundary (#424) #428

Merged
BYK merged 1 commit into main from fix-tool-pairing-424 on May 20, 2026


Owner

@BYK BYK commented May 20, 2026

Summary

Fixes the tool_use/tool_result pairing mismatch (#424) and improves recall search quality, distillation transparency, and eval baseline realism. Eval score: 3.69 → 4.09 at 400K inflation, 4.86 without inflation.

Closes #424

1. Fix: tool_use/tool_result pairing mismatch (#424)

Root cause: The gradient context manager assembles [...distilledPrefix, ...rawWindow]. The prefix ends with an assistant message. When the budget-driven cutoff starts the raw window with an assistant message containing tool_use blocks, two assistant messages end up back to back, and the Anthropic API rejects the request with "tool_use ids found without tool_result blocks."
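
To make the boundary problem concrete, here is a minimal sketch of joining a prefix and a raw window so that the seam never places two assistant messages next to each other. The `Block` and `Msg` shapes and the name `joinPrefixAndRaw` are assumptions for illustration and are not taken from gradient.ts; the sketch only mirrors the described behaviour of advancing past leading assistant messages when a prefix is present.

```typescript
// Hypothetical minimal shapes; the real Lore message types are richer.
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string }
  | { type: "tool_result"; tool_use_id: string; content: string };

interface Msg {
  role: "user" | "assistant";
  content: Block[];
}

// Join a distilled prefix with a raw window. When a prefix is present,
// skip assistant messages at the start of the raw window so the seam
// does not produce back-to-back assistant turns.
function joinPrefixAndRaw(prefix: Msg[], rawWindow: Msg[]): Msg[] {
  if (prefix.length === 0) return [...rawWindow];
  const firstNonAssistant = rawWindow.findIndex((m) => m.role !== "assistant");
  const start = firstNonAssistant === -1 ? rawWindow.length : firstNonAssistant;
  return [...prefix, ...rawWindow.slice(start)];
}
```

Skipping an assistant message that carried tool_use blocks can leave its tool_result in the following user message without a partner, which is why a pairing cleanup step is still needed after assembly.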

Fixes:

  • gradient.ts: All three assembly points (tryFit, tryFitStable, layer 4) advance past leading assistant messages when prefix is present
  • pipeline.ts: removeOrphanedToolResults() gains bidirectional Pass 2 — validates tool_use→tool_result (not just tool_result→tool_use)
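
A bidirectional pairing check of the kind described for Pass 2 could look roughly like the sketch below. It is a simplification under stated assumptions: the `Block` and `Msg` shapes and the name `dropUnpairedToolBlocks` are hypothetical, and pairing is checked across the whole message list rather than between adjacent messages as the real removeOrphanedToolResults() may do.

```typescript
// Hypothetical minimal shapes; not the actual pipeline.ts types.
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string }
  | { type: "tool_result"; tool_use_id: string; content: string };

interface Msg {
  role: "user" | "assistant";
  content: Block[];
}

function dropUnpairedToolBlocks(messages: Msg[]): Msg[] {
  // Pass 1: drop tool_result blocks whose tool_use is missing.
  const useIds = new Set<string>();
  for (const m of messages) {
    for (const b of m.content) if (b.type === "tool_use") useIds.add(b.id);
  }
  const pass1: Msg[] = messages.map((m) => ({
    ...m,
    content: m.content.filter(
      (b) => b.type !== "tool_result" || useIds.has(b.tool_use_id),
    ),
  }));

  // Pass 2: drop tool_use blocks whose tool_result is missing.
  const resultIds = new Set<string>();
  for (const m of pass1) {
    for (const b of m.content) if (b.type === "tool_result") resultIds.add(b.tool_use_id);
  }
  return pass1
    .map((m) => ({
      ...m,
      content: m.content.filter(
        (b) => b.type !== "tool_use" || resultIds.has(b.id),
      ),
    }))
    // Messages left with no content are removed entirely.
    .filter((m) => m.content.length > 0);
}
```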

Verified: zero "upstream error: 400" responses in the live inflated eval (previously hundreds).

2. Recall search quality

  • SOURCE_WEIGHT temporal 0.5→0.8 (parity with distillation for display budget)
  • charBudget 8K→12K chars for richer recall output
  • MAX_RRF_LISTS 10→14 to accommodate new distillation recency list
  • vectorBoostMinTerms 3→2 (vector boost activates for 2-term queries)
  • Distillation recency RRF list (structural parity with temporal)
  • [lossy] hint on distillation recall results when r_compression < 1.0
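
For readers unfamiliar with reciprocal rank fusion (RRF), the sketch below shows the generic technique of merging several ranked lists, with a cap on how many lists are fused. It is not Lore's implementation: the function name, the smoothing constant `k = 60` (a common default in RRF write-ups), and the choice to truncate at `maxLists` are assumptions, and the sketch does not model SOURCE_WEIGHT or the vector boost.

```typescript
// Generic reciprocal rank fusion: each list contributes 1 / (k + rank)
// to every id it contains, where rank starts at 1.
// `maxLists` caps how many input lists are considered (assumed behaviour).
function fuseRankings(
  lists: string[][],
  k = 60,
  maxLists = 14,
): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const list of lists.slice(0, maxLists)) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries()).sort((a, b) => b[1] - a[1]);
}
```

Adding another ranked list, such as one ordered by recency, gives items that appear in it an extra contribution without changing how the other lists are scored.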

3. Distillation transparency (uniform citation format)

All recall-able references use uniform (prefix:id) format throughout context:

  • Distillation headers: (d:UUID | lossy | N sources) — shows compression quality and source count
  • Tool result placeholders: [tool results provided] (t:msgID) — model can fetch original tool output via recall
  • Recall results: (d:xxx), (t:xxx), (k:xxx) on truncated results (unchanged)
  • Recall tool description: Documents that (prefix:id) citations can be fetched via the id parameter
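
As an illustration of the uniform (prefix:id) format, the sketch below builds a distillation header and extracts citations from text. The function names are hypothetical, and omitting the "lossy" segment when r_compression is 1.0 or higher is an assumption inferred from the [lossy] hint rule, not something confirmed by the source.

```typescript
type CitationPrefix = "d" | "t" | "k";

interface Citation {
  prefix: CitationPrefix;
  id: string;
}

// Build a header such as "(d:abc | lossy | 3 sources)".
// Assumption: the "lossy" segment is omitted when rCompression >= 1.0.
function formatDistillationHeader(
  id: string,
  rCompression: number,
  sourceCount: number,
): string {
  const parts = [`d:${id}`];
  if (rCompression < 1.0) parts.push("lossy");
  parts.push(`${sourceCount} sources`);
  return `(${parts.join(" | ")})`;
}

// Find every "(d:...)", "(t:...)" or "(k:...)" citation in a piece of text.
function extractCitations(text: string): Citation[] {
  const pattern = /\((d|t|k):([^\s|)]+)/g;
  const found: Citation[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    found.push({ prefix: match[1] as CitationPrefix, id: match[2] });
  }
  return found;
}
```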

4. Temporal storage fix

temporal.store() now runs before resolveToolResults(), so the original tool_result content (e.g., file contents, command output) is preserved in temporal storage and searchable via recall, instead of only the uninformative [tool results provided] placeholder.
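
The ordering matters because the placeholder replaces the content. A minimal sketch, with a `Map` standing in for temporal storage and hypothetical names throughout:

```typescript
// Hypothetical shape of a message carrying tool output.
interface ToolResultMessage {
  id: string;
  toolOutput: string;
}

// Store first, then substitute. Reversing these two steps would persist
// the placeholder instead of the original tool output.
function ingestToolResult(
  msg: ToolResultMessage,
  temporal: Map<string, string>, // stand-in for temporal storage
): string {
  // 1. Persist the original output so recall can search it later.
  temporal.set(msg.id, msg.toolOutput);
  // 2. Only then produce the compact placeholder for the live context.
  return `[tool results provided] (t:${msg.id})`;
}
```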

5. Eval baseline realism

  • compactionBaseline() now iterates (2-4 passes) matching Claude Code auto-compact at 83.5% threshold
  • QA_SYSTEM and buildQAPrompt made baseline-agnostic — no recall-specific coaching
  • Post-replay embedding backfill for vector search availability
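
An iterative compaction baseline can be sketched as a loop that appends turns and summarizes whenever the context crosses a fraction of the window. Everything here is an assumption for illustration: the function name, the `summarize` callback, and the rough four-characters-per-token estimate are not taken from compactionBaseline(); only the 83.5% threshold comes from the description above.

```typescript
interface CompactionResult {
  context: string;
  passes: number;
}

// Append turns one by one; whenever the estimated token count exceeds
// thresholdRatio * windowTokens, replace the context with its summary.
function compactIteratively(
  turns: string[],
  windowTokens: number,
  summarize: (text: string) => string,
  thresholdRatio = 0.835,
  // Crude token estimate (assumption): about 4 characters per token.
  countTokens: (text: string) => number = (t) => Math.ceil(t.length / 4),
): CompactionResult {
  const threshold = thresholdRatio * windowTokens;
  let context = "";
  let passes = 0;
  for (const turn of turns) {
    context = context.length > 0 ? `${context}\n${turn}` : turn;
    if (countTokens(context) > threshold) {
      context = summarize(context);
      passes += 1;
    }
  }
  return { context, passes };
}
```

Because each pass summarizes a context that already contains earlier summaries, details from early turns are compressed repeatedly, which is the effect a single-pass baseline would miss.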

Test coverage

  • 1741 pass, 0 fail across all 1746 tests
  • New gradient prefix/raw boundary test (bug: inflated eval produces tool_use/tool_result mismatch errors through gateway #424)
  • New removeOrphanedToolResults Pass 2 tests (5 cases)
  • End-to-end integration test: gateway format → Lore format → resolveToolResults → gradient → loreMessagesToGateway → removeOrphanedToolResults → Anthropic API compliance validation
  • Updated temporal-adapter tests for new (t:msgID) placeholder format
  • Typecheck clean across all 4 packages

@BYK BYK self-assigned this May 20, 2026
@BYK BYK force-pushed the fix-tool-pairing-424 branch 9 times, most recently from aa6f958 to 6259a23 Compare May 20, 2026 18:23
…ism (#424)

Core recall improvements:
- Raise temporal SOURCE_WEIGHT 0.5→0.8 (parity with distillation for display budget)
- Raise charBudget 8K→12K chars for recall results (more room for specific details)
- Raise MAX_RRF_LISTS 10→14 (accommodate distillation recency list)
- Lower vectorBoostMinTerms 3→2 (activate vector boost for 2-term queries)
- Add distillation recency RRF list (structural parity with temporal)
- Show [lossy] hint on distillation recall results when r_compression < 1.0

Distillation transparency:
- formatDistillations() now renders compression signal + IDs + source count
  so the model can see how lossy each distillation is and drill into details
- gradient.ts loads r_compression, c_norm, source_ids from DB for distillations
- Recall tool description updated with drill-down guidance for d:xxx/t:xxx IDs

Eval realism:
- compactionBaseline() now iterates (2-4 passes) matching real Claude Code
  auto-compact behavior at 83.5% of context window threshold
- QA_SYSTEM prompt made baseline-agnostic: no recall-specific coaching
- buildQAPrompt() preamble neutralized across all baselines

Closes #424
@BYK BYK force-pushed the fix-tool-pairing-424 branch from 6259a23 to 65c0df9 Compare May 20, 2026 19:05
@BYK BYK merged commit 17289f6 into main May 20, 2026
10 checks passed
@BYK BYK deleted the fix-tool-pairing-424 branch May 20, 2026 19:19
BYK added a commit that referenced this pull request May 20, 2026
#433)

## Summary

Takes the CM-1 eval score from 4.39 → **4.71** at 400K inflation (14 of
15 questions >= 4.7).

## Changes

### 1. Scripted upstream interceptor for eval replay (`harness.ts`)

**Problem:** The eval replays scripted conversation turns through the
gateway, but the gateway forwards to the real API which generates its
own response. The gateway stores the API's response in temporal (not the
scripted content), so the scenario's ground-truth details were never in
Lore's memory.

**Fix:** `buildScriptedInterceptor()` creates an upstream interceptor
that returns the scenario's scripted assistant turns as Anthropic-format
responses. Benefits:
- No upstream API calls during replay (saves cost)
- Gateway stores the SCRIPTED content in temporal
- Distillation processes the ground-truth content
- Recall can find the exact details questions test

Assumptions documented: scripted responses never contain recall tool_use
blocks (no follow-up calls), non-filler user/assistant turns alternate
1:1.
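
A rough sketch of the idea, not the actual `buildScriptedInterceptor()` from `harness.ts`: a factory that hands back the scripted assistant turns one at a time, wrapped in the general shape of an Anthropic Messages response. The names `makeScriptedResponder` and `ScriptedResponse`, the placeholder model string, and the zeroed usage counts are assumptions; how the real interceptor hooks into the gateway is not shown here.

```typescript
// Subset of the Anthropic Messages response shape used for text-only replies.
interface ScriptedResponse {
  id: string;
  type: "message";
  role: "assistant";
  model: string;
  content: Array<{ type: "text"; text: string }>;
  stop_reason: "end_turn";
  stop_sequence: null;
  usage: { input_tokens: number; output_tokens: number };
}

// Returns a function that yields the next scripted assistant turn on each call.
// Assumes text-only turns (no tool_use blocks), matching the stated assumption
// that scripted responses never trigger follow-up calls.
function makeScriptedResponder(
  turns: string[],
  model = "scripted-eval", // placeholder model name (assumption)
): () => ScriptedResponse {
  let next = 0;
  return () => {
    if (next >= turns.length) {
      throw new Error("scripted assistant turns exhausted");
    }
    const text = turns[next];
    next += 1;
    return {
      id: `msg_scripted_${next}`,
      type: "message",
      role: "assistant",
      model,
      content: [{ type: "text", text }],
      stop_reason: "end_turn",
      stop_sequence: null,
      usage: { input_tokens: 0, output_tokens: 0 },
    };
  };
}
```

Throwing when the script runs out makes a violation of the 1:1 alternation assumption visible immediately instead of silently falling through to a real upstream call.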

### 2. Distillation prompt tuning (`prompt.ts`)

Two new sections in `DISTILLATION_SYSTEM` (per
`docs/PROMPT_CHANGES.md`):

- **DECISIONS AND ALTERNATIVES:** Record ALL options evaluated with
names, chosen/rejected reasoning
- **DEBUGGING AND INVESTIGATION:** Record hypothesis sequence including
wrong ones with evidence

Token delta: +400 tokens (~10%). No length caps, no structural changes.
Affects gen-0 distillation only.

### 3. Stale `.d.ts` fix (`prompt.d.ts`)

Updated `formatDistillations` type declaration to include the optional
`id`, `r_compression`, `source_ids` fields added in #428.

## Review Fixes
- Added assumption documentation to `buildScriptedInterceptor` JSDoc
- Removed unnecessary `(b: any)` cast — uses `as const` for proper type
narrowing
- Updated stale `prompt.d.ts` declaration

## Eval Results (CM-1, 400K inflation)

| Metric | Before | After |
|---|---|---|
| Score | 4.39 | **4.71** |
| Questions >= 4.0 | 13/15 | **14/15** |
| Questions = 5.0 | 9/15 | **11/15** |

Remaining failure: m3 (1.5) — model doesn't invoke recall because
distilled context gives false confidence.

## Tests
- 1752 pass, 0 fail
- Typecheck clean across all 4 packages