Problem
In the CM-1 eval (live mode), the model sometimes writes text that resembles a recall marker (e.g. 📚 Fetching details for t:... and t:... simultaneously…) as its final answer instead of actually calling the recall tool.
Evidence
Question cm-1-e1: "What test file was created for the upload abort handling?"
- Expected: src/__tests__/upload-abort.test.ts
- Got: 📚 Fetching details for t:e8a949d7... and t:36d96080... simultaneously…
- Score: 1/5
The model generated free-form text that looks like a marker but never invoked the recall tool. The actual buildRecallMarker() produces 📚 Fetching detail for <id>… (singular), so this isn't a marker leak — the model composed this text itself.
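The distinction above can be checked mechanically. The following is a minimal sketch, assuming the marker format is exactly the one quoted for buildRecallMarker() and that <id> is a single token without whitespace; classifyMarkerText and MarkerKind are hypothetical names for illustration, not part of the codebase.

```typescript
// Hypothetical helper for illustration; not part of the codebase.
type MarkerKind = "tool-format" | "imitation" | "none";

// Format quoted in this issue for buildRecallMarker(): "📚 Fetching detail for <id>…"
// Assumption: <id> is a single token with no whitespace.
const TOOL_FORMAT = /^📚 Fetching detail for \S+…$/;

// Anything else that opens like a marker is treated as model-written text.
const MARKER_LIKE = /^📚\s*Fetching details?\b/;

function classifyMarkerText(text: string): MarkerKind {
  const trimmed = text.trim();
  if (TOOL_FORMAT.test(trimmed)) return "tool-format";
  if (MARKER_LIKE.test(trimmed)) return "imitation";
  return "none";
}

// The answer observed for cm-1-e1 is classified as an imitation.
const observed =
  "📚 Fetching details for t:e8a949d7... and t:36d96080... simultaneously…";
console.log(classifyMarkerText(observed));
```

A check like this could flag a final answer as "imitation" when no recall tool call was recorded for the turn, which would separate model-composed text from a genuine marker leak.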
Root Cause Hypothesis
The model sees recall markers from previous turns in the conversation and mimics the format instead of using the tool. This could be addressed by:
- Making the marker format less "tool-like" so the model doesn't confuse it with an action
- Adding explicit instructions in the recall tool description to always use the tool, never write markers manually
- Adjusting the QA prompt to discourage this behavior
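As an illustration of the second option, the sketch below appends an explicit rule to a tool description. The rule wording and the withNoManualMarkerRule helper are assumptions made for this example; the actual recall tool description is not shown in this issue.

```typescript
// Hypothetical wording; the real recall tool description is not shown in this issue.
const NO_MANUAL_MARKER_RULE =
  'Always call this tool to fetch details. Never write a "📚 Fetching…" line ' +
  "yourself as an answer.";

// Appends the rule once, so applying the helper twice does not duplicate the text.
function withNoManualMarkerRule(description: string): string {
  if (description.indexOf(NO_MANUAL_MARKER_RULE) !== -1) return description;
  return description.trim() + "\n\n" + NO_MANUAL_MARKER_RULE;
}

// Example with a placeholder description.
const updated = withNoManualMarkerRule("Fetches details for a stored item by id.");
console.log(updated);
```

Whether an instruction like this is enough on its own is an open question; it may need to be combined with a marker format change or a QA prompt adjustment.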
Impact
This affected 1 of 15 CM-1 questions in live eval. Score impact: ~0.27 points on the overall CM-1 average.
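The ~0.27 figure is consistent with a simple calculation, sketched here under two assumptions not stated in the issue: a correct answer would have scored 5/5, and the CM-1 average is the unweighted mean of the 15 per-question scores. The averageImpact name is hypothetical.

```typescript
// Assumptions: a correct answer scores maxScore, and the overall average is the
// unweighted mean over questionCount questions.
function averageImpact(
  maxScore: number,
  actualScore: number,
  questionCount: number
): number {
  return (maxScore - actualScore) / questionCount;
}

// One question scored 1/5 instead of 5/5, out of 15 questions.
const impact = averageImpact(5, 1, 15);
console.log(impact.toFixed(2)); // prints "0.27"
```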
Context
Discovered during live eval of #404 (multi-turn recall).