Problem
In the CM-1 eval, "easy" questions about late-session details (turns 38-45) average 1.9, lower than "hard" questions about early-session details (turns 0-10), which average 3.6. This is the inverse of what we expect.
Evidence
| Question | Expected Answer | Model Answer | Score |
|---|---|---|---|
| cm-1-e1 | `src/__tests__/upload-abort.test.ts` | (marker text, not a real answer) | 1 |
| cm-1-e2 | `fix: resolve stale temp file accumulation and ENOSPC handling` | "PR #342" (hallucinated) | 1 |
| cm-1-e3 | 17 tests passed | "not explicitly listed" | 2 |
| cm-1-e4 | `fix/upload-cleanup-lock` | "branch name was not recorded" | 1 |
| cm-1-e5 | unused 'open' import in cleanup.ts | (correct) | 4.3 |

Meanwhile hard questions (Sentry issue ID, user who reported bug, stack trace) scored 4-5.
Root Cause Hypothesis
The easy questions ask about details from the last few turns of the conversation. In the eval, the QA phase runs in a new session after the conversation phase. The late-session details exist in:
- The temporal messages table (but may be filtered out by `distilled=0` if distillation ran)
- Distillation summaries (but specific details like branch names and PR titles may be compressed away)
The recall search may not rank these recent-but-distilled details highly enough, or the distillation may lose granular facts (exact branch names, test counts, PR titles).
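
To make the first failure mode concrete, here is a minimal in-memory sketch of how a `distilled=0` filter could hide late-session rows once distillation has run. The `MessageRow` shape, the `searchTemporal` helper, the substring matching, and the turn numbers are all assumptions for illustration; the real schema and recall query may differ.

```typescript
// Hypothetical row shape; the real temporal messages schema may differ.
interface MessageRow {
  turn: number;
  content: string;
  distilled: 0 | 1; // assumed: set to 1 once the row is folded into a summary
}

// Mimics a `WHERE distilled = 0` filter followed by a naive substring match.
function searchTemporal(
  rows: MessageRow[],
  query: string,
  onlyUndistilled: boolean
): MessageRow[] {
  const needle = query.toLowerCase();
  return rows
    .filter((r) => !onlyUndistilled || r.distilled === 0)
    .filter((r) => r.content.toLowerCase().includes(needle));
}

// Illustrative data: every row was distilled before the QA session started.
const rows: MessageRow[] = [
  { turn: 3, content: "Sentry issue reported by a user", distilled: 1 },
  { turn: 41, content: "pushed branch fix/upload-cleanup-lock", distilled: 1 },
  { turn: 44, content: "17 tests passed", distilled: 1 },
];

const filtered = searchTemporal(rows, "branch", true);
const unfiltered = searchTemporal(rows, "branch", false);
console.log(filtered.length, unfiltered.length); // prints: 0 1
```

If this matches what happens in the eval, the exact branch name is only reachable through the distillation summary, so any detail the summary dropped is unrecoverable in the QA session.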
Possible Fixes
- Ensure distillation preserves specific identifiers (branch names, file paths, version numbers)
- Check whether the `distilled=0` filter on temporal search is too aggressive for QA questions
- Boost recency in recall scoring for within-session queries
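
Two of the fixes above can be sketched in a few lines. The first helper checks whether path-like tokens and integer counts from the source messages survive in a distillation summary; the second adds a recency bonus to a recall score. Every name here (`extractIdentifiers`, `missingIdentifiers`, `recencyBoost`), the regexes, and the half-life and weight parameters are hypothetical and would need tuning against the real pipeline.

```typescript
// Extracts path-like or branch-like tokens (containing "/") and standalone integers.
// These patterns are an assumption about what counts as a "specific identifier".
function extractIdentifiers(text: string): string[] {
  const pathLike = text.match(/[\w.-]+(?:\/[\w.-]+)+/g) || [];
  const numbers = text.match(/\b\d+\b/g) || [];
  return Array.from(new Set(pathLike.concat(numbers)));
}

// Returns identifiers present in the source text but absent from the summary.
function missingIdentifiers(source: string, summary: string): string[] {
  return extractIdentifiers(source).filter((id) => !summary.includes(id));
}

// Adds a bonus that halves every `halfLifeTurns` turns of age.
// The bonus is additive so that early-session hits keep their base score.
function recencyBoost(
  baseScore: number,
  turn: number,
  lastTurn: number,
  halfLifeTurns: number,
  weight: number
): number {
  const age = Math.max(0, lastTurn - turn);
  return baseScore + weight * Math.pow(0.5, age / halfLifeTurns);
}

const source =
  "Created branch fix/upload-cleanup-lock and ran " +
  "src/__tests__/upload-abort.test.ts with 17 tests passed";
const summary = "Created a cleanup branch and ran the upload abort tests, all passing";

const missing = missingIdentifiers(source, summary);
console.log(missing);
// prints: [ 'fix/upload-cleanup-lock', 'src/__tests__/upload-abort.test.ts', '17' ]

console.log(recencyBoost(1, 45, 45, 5, 0.5)); // prints: 1.5
console.log(recencyBoost(1, 40, 45, 5, 0.5)); // prints: 1.25
```

A check like `missingIdentifiers` could run as a post-distillation assertion, flagging summaries that drop branch names, file paths, or test counts. The additive form of `recencyBoost` is a deliberate choice: a multiplicative decay would also lower scores for early-session hits, which currently perform well.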
Impact
Easy questions: avg 1.9 (should be ~4+)
Overall CM-1 impact: dragging score from ~3.6 down to 2.8
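
For reference, the easy-question average follows directly from the five scores in the Evidence table:

```latex
\bar{s}_{\text{easy}} = \frac{1 + 1 + 2 + 1 + 4.3}{5} = \frac{9.3}{5} = 1.86 \approx 1.9
```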
Context
Discovered during live eval of #404 (multi-turn recall). The multi-turn recall mechanism is working well for hard questions (early-session detail retrieval), but late-session details are paradoxically harder to find.