bug: recall fails to find late-session details (easy questions score worst in CM-1) #410

@BYK

Problem

In the CM-1 eval, "easy" questions about late-session details (turns 38-45) average 1.9, worse than "hard" questions about early-session details (turns 0-10), which average 3.6. This is the inverse of what we would expect.

Evidence

| Question | Expected Answer | Model Answer | Score |
|---|---|---|---|
| cm-1-e1 | src/__tests__/upload-abort.test.ts | (marker text, not a real answer) | 1 |
| cm-1-e2 | fix: resolve stale temp file accumulation and ENOSPC handling | "PR #342" (hallucinated) | 1 |
| cm-1-e3 | 17 tests passed | "not explicitly listed" | 2 |
| cm-1-e4 | fix/upload-cleanup-lock | "branch name was not recorded" | 1 |
| cm-1-e5 | unused 'open' import in cleanup.ts | (correct) | 4.3 |
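
As a quick sanity check, the 1.9 average can be reproduced from the five scores in the rows above. The variable names here are only illustrative.

```python
# Scores copied from the evidence rows above (cm-1-e1 through cm-1-e5).
easy_scores = {
    "cm-1-e1": 1,
    "cm-1-e2": 1,
    "cm-1-e3": 2,
    "cm-1-e4": 1,
    "cm-1-e5": 4.3,
}

easy_avg = sum(easy_scores.values()) / len(easy_scores)
print(f"easy average: {easy_avg:.2f}")  # 1.86, which rounds to the reported 1.9

# Questions scoring 2 or below: the failures that pull the average down.
low = [q for q, score in easy_scores.items() if score <= 2]
print(low)
```

Four of the five easy questions score 2 or below, so the average is held up almost entirely by cm-1-e5.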

Meanwhile, the hard questions (Sentry issue ID, user who reported the bug, stack trace) scored 4-5.

Root Cause Hypothesis

The easy questions ask about details from the last few turns of the conversation. In the eval, the QA phase runs in a new session after the conversation phase. The late-session details exist in:

  1. The temporal messages table (but may be filtered by distilled=0 if distillation ran)
  2. Distillation summaries (but specific details like branch names and PR titles may be compressed away)
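
To make the first point concrete, here is a minimal sketch of how a `distilled=0` filter could hide late-session rows once distillation has marked them. The table name, columns, turn numbers, and the `search` helper are all assumptions for illustration; the actual schema and query may differ.

```python
import sqlite3

# Hypothetical schema: the real temporal messages table may look different.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (turn INTEGER, content TEXT, distilled INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO messages (turn, content) VALUES (?, ?)",
    [
        (5, "Sentry issue reported by user"),            # early-session detail
        (41, "created branch fix/upload-cleanup-lock"),  # late-session detail
        (44, "17 tests passed"),                         # late-session detail
    ],
)

# Simulate a distillation pass that marks every message as distilled.
conn.execute("UPDATE messages SET distilled = 1")


def search(term: str, only_undistilled: bool) -> list[tuple[int, str]]:
    """Substring search over messages, optionally restricted to distilled = 0."""
    sql = "SELECT turn, content FROM messages WHERE content LIKE ?"
    if only_undistilled:
        sql += " AND distilled = 0"
    return conn.execute(sql + " ORDER BY turn", (f"%{term}%",)).fetchall()


filtered = search("branch", only_undistilled=True)
unfiltered = search("branch", only_undistilled=False)
print(filtered)    # []
print(unfiltered)  # [(41, 'created branch fix/upload-cleanup-lock')]
```

If the real query behaves like the filtered variant, the exact branch name is only reachable through the distillation summary, which is where the second point comes in.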

The recall search may not rank these recent-but-distilled details highly enough, or the distillation may lose granular facts (exact branch names, test counts, PR titles).
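
One way to test the "distillation loses granular facts" half of this hypothesis is to diff identifiers between the raw messages and the summary. The sketch below is a rough heuristic, not the project's code: the regex, the function names, and the sample messages and summary are all made up for illustration.

```python
import re

# Hypothetical heuristic: a token counts as an identifier if it contains a
# slash, a hash, a digit, or a dot between word characters (e.g. cleanup.ts).
TOKEN_RE = re.compile(r"[\w./#-]+")
IDENTIFIER_HINT_RE = re.compile(r"[/#\d]|\w\.\w")


def extract_identifiers(text: str) -> set[str]:
    identifiers = set()
    for token in TOKEN_RE.findall(text):
        token = token.strip(".-")  # drop sentence punctuation stuck to the token
        if token and IDENTIFIER_HINT_RE.search(token):
            identifiers.add(token)
    return identifiers


def lost_identifiers(messages: list[str], summary: str) -> set[str]:
    """Identifiers present in the raw messages but absent from the summary."""
    source_ids = set().union(*(extract_identifiers(m) for m in messages))
    return {ident for ident in source_ids if ident not in summary}


# Illustrative inputs only; not taken from the actual eval transcript.
messages = [
    "Pushed branch fix/upload-cleanup-lock.",
    "Ran src/__tests__/upload-abort.test.ts: 17 tests passed.",
]
summary = "Fixed upload cleanup and confirmed the test suite passes."

lost = sorted(lost_identifiers(messages, summary))
print(lost)
# ['17', 'fix/upload-cleanup-lock', 'src/__tests__/upload-abort.test.ts']
```

Running a check like this over the real distillation output for turns 38-45 would show whether the branch name, test count, and file path are dropped at distillation time or lost later at ranking time.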

Possible Fixes

  1. Ensure distillation preserves specific identifiers (branch names, file paths, version numbers)
  2. Check whether the distilled=0 filter on temporal search is too aggressive for QA questions
  3. Boost recency in recall scoring for within-session queries
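
The recency boost in fix 3 could take many forms. The sketch below shows one option, an exponential decay on turn age multiplied into the base relevance score. The `Hit` type, the `weight` and `half_life` parameters, and the sample scores are assumptions, not the existing recall scoring.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    turn: int         # turn index the hit came from
    relevance: float  # base score from the existing search, assumed in [0, 1]


def boosted_score(hit: Hit, last_turn: int, weight: float = 0.5,
                  half_life: float = 10.0) -> float:
    """Scale relevance by a recency factor that halves every `half_life` turns."""
    age = max(last_turn - hit.turn, 0)
    recency = 0.5 ** (age / half_life)  # 1.0 for the newest turn, decays toward 0
    return hit.relevance * (1.0 + weight * recency)


# Illustrative hits: an early-session match with slightly higher base relevance
# and a late-session match with slightly lower base relevance.
hits = [Hit(turn=5, relevance=0.60), Hit(turn=42, relevance=0.50)]
last_turn = 45

unboosted = sorted(hits, key=lambda h: h.relevance, reverse=True)
boosted = sorted(hits, key=lambda h: boosted_score(h, last_turn), reverse=True)

print([h.turn for h in unboosted])  # [5, 42]
print([h.turn for h in boosted])    # [42, 5]
```

A multiplicative boost leaves hits with zero relevance at zero, so recency alone cannot surface unrelated recent messages. The weight would need tuning so that the hard early-session questions, which currently score well, do not regress.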

Impact

Easy questions: avg 1.9 (should be ~4+)
Overall CM-1 impact: dragging score from ~3.6 down to 2.8

Context

Discovered during live eval of #404 (multi-turn recall). The multi-turn recall mechanism is working well for hard questions (early-session detail retrieval), but late-session details are paradoxically harder to find.
