fix: store scripted assistant content in eval + session-affinity recall boost#431

Merged: BYK merged 1 commit into main from fix-eval-scripted-storage on May 20, 2026


BYK commented May 20, 2026

Summary

Two fixes for eval recall quality that raise the CM-1 score at 400K inflation from 3.69 → 4.38.

Changes

1. Store scripted assistant turns in eval replay (harness.ts)

The eval replays scripted conversation turns through the gateway, but the gateway stores its own API response (which differs from the scripted content) as the assistant temporal message. The scenario's actual content (e.g., "staleness check vs flock", "5000ms timeout error") was never stored in temporal and therefore never searchable via recall.

Fix: After each scripted assistant turn is added to history, store it directly in temporal with the replay session ID. This makes the scenario's ground-truth content available for recall search.

Impact: m4 (exact error message) went from 1.2 → 4.0.
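The fix above can be sketched roughly as follows. This is a minimal illustration, not the actual code in harness.ts: `Turn`, `InMemoryTemporalStore`, and `replayScriptedTurns` are hypothetical names, the in-memory store stands in for the real temporal storage, and the gateway call that the real replay makes is omitted.

```typescript
// Hypothetical types; the real harness.ts may use different shapes.
type Role = "user" | "assistant";

interface Turn {
  role: Role;
  content: string;
}

interface TemporalMessage extends Turn {
  sessionID: string;
}

// Assumed stand-in for the temporal store, with a naive substring search.
class InMemoryTemporalStore {
  private messages: TemporalMessage[] = [];

  store(message: TemporalMessage): void {
    this.messages.push(message);
  }

  search(sessionID: string, query: string): TemporalMessage[] {
    const needle = query.toLowerCase();
    return this.messages.filter(
      (m) => m.sessionID === sessionID && m.content.toLowerCase().includes(needle),
    );
  }
}

// Replays scripted turns into history. Each scripted assistant turn is also
// written to temporal under the replay session ID, so the scenario's own
// content (not only the gateway's response) is searchable later.
function replayScriptedTurns(
  turns: Turn[],
  replaySessionID: string,
  temporal: InMemoryTemporalStore,
): Turn[] {
  const history: Turn[] = [];
  for (const turn of turns) {
    history.push(turn);
    if (turn.role === "assistant") {
      temporal.store({ ...turn, sessionID: replaySessionID });
    }
  }
  return history;
}
```

With this shape, a recall query for the exact error text finds the scripted assistant turn within the replay session, which is the behavior the m4 question depends on.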

2. Session-affinity RRF boost (recall.ts)

When scope=all and sessionID is known, add extra RRF lists for same-session temporal and distillation results. This gives current-session content a ranking boost over cross-session LTM entries that may match keywords but lack session-specific context.
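One way to picture the boost is with standard reciprocal rank fusion, where each ranked list contributes 1 / (k + rank) to an item's score. The sketch below is an assumption about the general shape, not the code in recall.ts: `Hit`, `rrfFuse`, and `fuseWithSessionAffinity` are hypothetical names, k = 60 is the conventional RRF constant rather than a value taken from this PR, and the function is assumed to be called only in the scope=all case.

```typescript
// Hypothetical result shape; real recall results carry more fields.
interface Hit {
  id: string;
  sessionID: string;
}

// Standard reciprocal rank fusion: sum 1 / (k + rank) over every list
// in which an item appears, using 1-based ranks.
function rrfFuse(lists: Hit[][], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((hit, index) => {
      const rank = index + 1;
      scores.set(hit.id, (scores.get(hit.id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// When the session ID is known, same-session temporal and distillation hits
// are added as extra lists, so they accumulate a second RRF contribution.
function fuseWithSessionAffinity(
  temporal: Hit[],
  distillations: Hit[],
  ltm: Hit[],
  sessionID?: string,
): { id: string; score: number }[] {
  const lists = [temporal, distillations, ltm];
  if (sessionID !== undefined) {
    lists.push(temporal.filter((h) => h.sessionID === sessionID));
    lists.push(distillations.filter((h) => h.sessionID === sessionID));
  }
  return rrfFuse(lists);
}
```

Because the extra lists contain only same-session items, cross-session entries keep their original score while current-session entries gain an additional term, which is what lets them outrank keyword-matching LTM entries.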

Eval Results (CM-1, 400K inflation)

| Metric | Before | After |
| --- | --- | --- |
| Average score | 3.69 | 4.38 |
| Questions >= 4.0 | 7/15 | 13/15 |
| Questions = 5.0 | 5/15 | 9/15 |

Remaining failures: m3 (2.2, LTM entry outranks session content), h4 (1.7, initial wrong hypothesis not found).

Tests

  • 1752 pass, 0 fail
  • Typecheck clean across all 4 packages

…ll boost

Two fixes for eval recall quality:

1. Store scripted assistant turns in temporal during eval replay. The gateway
   stores its own API response (which differs from the scripted content), so
   the scenario's actual content was never searchable via recall. This fixes
   m4 (5000ms timeout error) going from 1.2 → 4.0.

2. Session-affinity RRF boost in recall: when scope=all and sessionID is known,
   add extra RRF lists for same-session temporal/distillation results. This
   boosts current-session content over cross-session LTM entries.

Eval score: 3.69 → 4.38 at 400K inflation (13 of 15 questions >= 4.0).
BYK self-assigned this May 20, 2026
BYK merged commit 48ffead into main May 20, 2026
10 checks passed
BYK deleted the fix-eval-scripted-storage branch May 20, 2026 21:05
BYK added a commit that referenced this pull request May 21, 2026
…indow (#435)

Updates marketing copy with the latest eval results from the recall
quality + distillation transparency work (#428, #430, #431, #432, #433,
#434).

### README.md
- Context retention table: Medium 2.3→4.1, Hard 3.3→4.8, Average 3.9→4.6
- Lore vs tail-window delta: +50%→+77%
- Added footnote: Lore scores averaged across multiple runs;
TW/compaction baselines from a prior eval run with the same scenarios
- Added v6 to version history

### docs/index.html
- Hero stat: +50%→+77% vs tail-window
- Detail retention: 4.8→4.6 (overall average across difficulty levels,
multiple runs)

### Review corrections
- Fixed Medium from 4.3→4.1 (honest multi-run average, not
cherry-picked)
- Average row (4.6) now self-consistent with column values:
(5.0+4.1+4.8)/3=4.63≈4.6
- Added footnote clarifying that TW/compaction columns are from a prior
eval run
This was referenced May 21, 2026