fix: downweight knowledge in recall when session content exists #432

Merged: BYK merged 1 commit into main from fix-recall-ranking on May 20, 2026

BYK commented on May 20, 2026

Summary

When scope=all and temporal or distillation results exist, apply a 0.6x weight to the knowledge BM25 and vector lists before reciprocal rank fusion (RRF). This deprioritizes cross-session long-term memory (LTM) entries when session-specific content is available.

Problem

When the model invokes recall with a query like "alternative approach flock locking", an LTM knowledge entry about proper-lockfile (which contains terms like "flock", "advisory locking") outranks the actual session temporal message about "lock file with staleness check". The LTM entry has perfect keyword overlap with the query while the temporal message has weaker BM25 relevance.

Fix

Track whether temporal/distillation results exist across the query expansion loop (hasSessionResults flag). When scope === 'all' and session-specific results exist:

  • Knowledge BM25 list: weight: 0.6 (was implicit 1.0)
  • Knowledge vector list: weight: vectorWeight * 0.6 (was vectorWeight)

This ensures session-specific content ranks higher when both sources match, while knowledge still surfaces when no session content exists (e.g., cross-session queries, scope: 'knowledge').
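The weighting rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `RankedList` type and the `knowledgeWeight` and `fuse` helpers are hypothetical names, and `RRF_K = 60` and `vectorWeight = 1.5` are assumptions inferred from the score table rather than values stated in the description. Only `hasSessionResults`, `vectorWeight`, the `scope === 'all'` check, and the 0.6 factor come from the text.

```typescript
// Hypothetical sketch of weighted reciprocal rank fusion (RRF).
type Scope = "all" | "session" | "knowledge";

interface RankedList {
  ids: string[]; // document ids, best match first
  weight: number; // multiplier applied to this list's RRF contribution
}

const RRF_K = 60; // assumed smoothing constant
const KNOWLEDGE_DOWNWEIGHT = 0.6; // factor from the PR description

// Downweight knowledge lists only when scope is "all" AND session results exist.
function knowledgeWeight(base: number, scope: Scope, hasSessionResults: boolean): number {
  return scope === "all" && hasSessionResults ? base * KNOWLEDGE_DOWNWEIGHT : base;
}

// Sum weight / (k + rank) over every list in which a document appears.
function fuse(lists: RankedList[]): Map<string, number> {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.ids.forEach((id, rank) => {
      const contribution = list.weight / (RRF_K + rank);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }
  return scores;
}

// Example: one knowledge entry and one temporal message, both at rank 0.
const hasSessionResults = true;
const vectorWeight = 1.5; // assumed value
const fused = fuse([
  { ids: ["ltm-entry"], weight: knowledgeWeight(1.0, "all", hasSessionResults) },
  { ids: ["ltm-entry"], weight: knowledgeWeight(vectorWeight, "all", hasSessionResults) },
  { ids: ["session-msg"], weight: 1.0 }, // temporal BM25
  { ids: ["session-msg"], weight: 1.0 }, // temporal recency
  { ids: ["session-msg"], weight: vectorWeight }, // temporal vector
]);
```

Because the multiplier applies only when both conditions hold, a query with `scope: 'knowledge'`, or one that finds no session content, leaves the knowledge lists at their original weights.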

RRF Score Impact (both at rank 0, 2-term query)

| Source | Before | After |
| --- | --- | --- |
| Knowledge (BM25 + vector) | 0.0167 + 0.0250 = 0.042 | 0.0100 + 0.0150 = 0.025 |
| Temporal (BM25 + recency + vector) | 0.0167 + 0.0167 + 0.0250 = 0.058 | unchanged |
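
The figures above are consistent with the standard weighted RRF formula if one assumes a smoothing constant of k = 60 and vectorWeight = 1.5. Neither value is stated in this PR; both are inferred from the numbers in the table. Under those assumptions, with every list placing the document at rank 0, the arithmetic works out as:

```latex
\[
\mathrm{score}(d) = \sum_{i} \frac{w_i}{k + r_i(d)}, \qquad k = 60,\; r_i(d) = 0
\]
\begin{align*}
\text{Knowledge, before:} \quad & \frac{1}{60} + \frac{1.5}{60} \approx 0.0167 + 0.0250 = 0.0417 \\
\text{Knowledge, after:} \quad & \frac{0.6}{60} + \frac{0.9}{60} = 0.0100 + 0.0150 = 0.0250 \\
\text{Temporal:} \quad & \frac{1}{60} + \frac{1}{60} + \frac{1.5}{60} \approx 0.0167 + 0.0167 + 0.0250 \approx 0.0583
\end{align*}
```

At equal ranks the temporal entry was already ahead before the change; the downweight widens its margin from roughly 0.017 to roughly 0.033, which gives it more room to stay on top when the knowledge entry ranks better within its own lists.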

Eval Results (CM-1, 400K inflation)

Score: 4.39 (up from 3.69 baseline). 12 of 15 questions >= 4.0.

Remaining failures (m3, h4) are a query generation problem, not ranking: the distillation compresses away the distinguishing term ("staleness check") so the model can't include it in its recall query.

Tests

  • 1752 pass, 0 fail
  • Typecheck clean across all 4 packages

When scope=all and temporal/distillation results exist, apply 0.6x weight to
knowledge BM25 and vector RRF lists. This deprioritizes cross-session LTM
entries when session-specific content is available — temporal details about
what actually happened are more likely the answer than general knowledge.

Also adds session-affinity RRF boost and scripted assistant storage in eval.

Eval score: 3.69 → 4.39 at 400K inflation.

BYK self-assigned this on May 20, 2026
BYK merged commit 26fe451 into main on May 20, 2026
10 checks passed
BYK deleted the fix-recall-ranking branch on May 20, 2026 21:56
BYK added a commit that referenced this pull request May 21, 2026
…indow (#435)

Updates marketing copy with the latest eval results from the recall
quality + distillation transparency work (#428, #430, #431, #432, #433,
#434).

### README.md
- Context retention table: Medium 2.3→4.1, Hard 3.3→4.8, Average 3.9→4.6
- Lore vs tail-window delta: +50%→+77%
- Added footnote: Lore scores averaged across multiple runs;
TW/compaction baselines from a prior eval run with the same scenarios
- Added v6 to version history

### docs/index.html
- Hero stat: +50%→+77% vs tail-window
- Detail retention: 4.8→4.6 (overall average across difficulty levels,
multiple runs)

### Review corrections
- Fixed Medium from 4.3→4.1 (honest multi-run average, not
cherry-picked)
- Average row (4.6) now self-consistent with column values:
(5.0+4.1+4.8)/3=4.63≈4.6
- Added footnote clarifying that TW/compaction columns are from a prior
eval run
This was referenced May 21, 2026