feat: improve recall tool description + add cross-session cue eval scenarios #396
Merged
Rewrite RECALL_TOOL_DESCRIPTION with dual-trigger structure so the LLM
uses recall at layer 0 (early session) when users reference past sessions:
(1) Cross-session references — explicit cue phrases like 'last time',
'we discussed', 'earlier', 'remember'. Prior sessions are never
in context.
(2) Missing details — file paths, decisions, preferences not visible
in the current window.
Extend the eval suite to test cross-session recall trigger sensitivity:
- Add x-lore-recall-invoked response header to non-streaming recall paths
- Add RECALL_TRIGGER scoring criterion and crossSessionCueRecall rubric
- Pass recallInvoked metadata through judge for recall_trigger scoring
- Add 8 new MSR-1 questions using conversational cross-session cues
(msr1-q13 through msr1-q20)
This was referenced May 21, 2026
Problem
When a user says something like "we had this thing from earlier sessions" at the start of a conversation (layer 0, no compression), the LLM does not use the recall tool. The current tool description frames everything around "trimmed context", but nothing has been trimmed at layer 0, so the LLM dismisses the need to search.
Solution
1. Recall Tool Description Rewrite
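Rewrote RECALL_TOOL_DESCRIPTION with a dual-trigger structure that separates two distinct cases: cross-session references (explicit cue phrases such as 'last time', 'we discussed', 'earlier', 'remember') and missing details (file paths, decisions, preferences not visible in the current window).
This works at every gradient layer because cross-session content is never in context, regardless of compression state.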
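To make the dual-trigger structure concrete, here is a rough sketch of how such a description could be laid out. The wording and the constant name RECALL_TOOL_DESCRIPTION_SKETCH are illustrative assumptions; the actual RECALL_TOOL_DESCRIPTION text is not reproduced in this description. Only the two triggers and the cue phrases are taken from the commit message.

```typescript
// Illustrative wording only (assumption): not the real RECALL_TOOL_DESCRIPTION.
// The two numbered triggers and the cue phrases mirror the commit message.
const RECALL_TOOL_DESCRIPTION_SKETCH: string = [
  "Search memory from prior sessions and earlier context. Use this tool when:",
  "(1) Cross-session references: the user says 'last time', 'we discussed', " +
    "'earlier', or 'remember'. Prior sessions are never in context, so search " +
    "even at the start of a conversation.",
  "(2) Missing details: a file path, decision, or preference is not visible " +
    "in the current window.",
].join("\n");
```

Putting the cross-session trigger first, with no mention of trimming, is what lets the description apply at layer 0 as well as at deeper layers.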
2. Eval Extension for Cross-Session Recall Cue Detection
Extended the eval suite to test whether the LLM uses recall when given conversational cross-session references:
- x-lore-recall-invoked response header: added to all non-streaming recall return paths in the gateway pipeline so the eval harness can detect recall usage (the gateway handles recall transparently, so clients never see tool_use blocks)
- RECALL_TRIGGER scoring criterion: new judge criterion that scores whether the LLM appropriately used recall for cross-session references (1-5 scale)
- crossSessionCueRecall rubric: factual_accuracy (0.25), completeness (0.25), recall_trigger (0.3), temporal_attribution (0.2)
- recallInvoked metadata flow: propagated from askQuestionViaGateway() → judge() → EvalResult.metadata, conditionally included in the judge prompt only when the rubric has a recall_trigger criterion
Files Changed
- packages/core/src/recall.ts: RECALL_TOOL_DESCRIPTION
- packages/gateway/src/pipeline.ts: extraHeaders to nonStreamHttpResponse(), set x-lore-recall-invoked header
- packages/core/eval/judge.ts: RECALL_TRIGGER criterion, crossSessionCueRecall rubric, metadata param to judge()
- packages/core/eval/harness.ts: recallInvoked through scoring pipeline
- packages/core/eval/scenarios/multi-session-recall.ts
Verification
- bun run typecheck: all 4 packages pass
- bun test: all 1630 tests pass
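For reviewers who want a feel for how the eval pieces in the Solution section fit together, the following is a minimal sketch, not the code in this PR. The header name, criterion names, and weights come from the description above. The helper names (wasRecallInvoked, weightedScore, recallPromptLine), the header value "true", the prompt wording, and the weighted-average aggregation are all assumptions.

```typescript
// Minimal structural type satisfied by the fetch Headers class
// (and by a Map in tests). Hypothetical helper type.
interface HeaderSource {
  get(name: string): string | null | undefined;
}

// Assumption: the gateway sets the header value to "true" when recall ran.
function wasRecallInvoked(headers: HeaderSource): boolean {
  return headers.get("x-lore-recall-invoked") === "true";
}

type Criterion =
  | "factual_accuracy"
  | "completeness"
  | "recall_trigger"
  | "temporal_attribution";

// Weights as listed for the crossSessionCueRecall rubric; they sum to 1.0.
const crossSessionCueRecallWeights: Record<Criterion, number> = {
  factual_accuracy: 0.25,
  completeness: 0.25,
  recall_trigger: 0.3,
  temporal_attribution: 0.2,
};

// Assumption: per-criterion scores (1-5 scale) combine as a weighted average.
function weightedScore(scores: Record<Criterion, number>): number {
  let total = 0;
  for (const name of Object.keys(crossSessionCueRecallWeights) as Criterion[]) {
    total += crossSessionCueRecallWeights[name] * scores[name];
  }
  return total;
}

// Mention the recall signal in the judge prompt only when the rubric
// actually scores recall_trigger; otherwise return null and add nothing.
function recallPromptLine(
  criteria: readonly string[],
  recallInvoked: boolean,
): string | null {
  if (criteria.indexOf("recall_trigger") === -1) return null;
  return "Recall tool invoked: " + (recallInvoked ? "yes" : "no");
}
```

Under these assumptions, an answer that scores 5 on every other criterion but 1 on recall_trigger lands at 3.8 overall, so a missed recall is clearly visible in the aggregate rather than being averaged away.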