fix: store scripted assistant content in eval + session-affinity recall boost by BYK · Pull Request #431 · BYK/loreai

BYK · 2026-05-20T20:35:04Z

Summary

Two fixes for eval recall quality that raise the average CM-1 score at 400K inflation from 3.69 to 4.38.

Changes

1. Store scripted assistant turns in eval replay (`harness.ts`)

The eval replays scripted conversation turns through the gateway, but the gateway stores its own API response (which differs from the scripted content) as the assistant temporal message. The scenario's actual content (e.g., "staleness check vs flock", "5000ms timeout error") was never stored in temporal and therefore never searchable via recall.

Fix: After each scripted assistant turn is added to history, store it directly in temporal with the replay session ID. This makes the scenario's ground-truth content available for recall search.

Impact: m4 (exact error message) went from 1.2 → 4.0.
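The replay change can be sketched roughly as follows. This is a minimal illustration and not the actual `harness.ts` code: `InMemoryTemporalStore`, `Gateway`, and `replayScenario` are hypothetical names, and the substring search only stands in for the real recall search.

```typescript
// Hypothetical types: the real harness and temporal store APIs are not shown in this PR.
type Role = "user" | "assistant";

interface Turn {
  role: Role;
  content: string;
}

interface TemporalMessage extends Turn {
  sessionID: string;
}

// In-memory stand-in for the temporal store (assumption, for illustration only).
class InMemoryTemporalStore {
  private messages: TemporalMessage[] = [];

  store(message: TemporalMessage): void {
    this.messages.push(message);
  }

  // Naive case-insensitive substring match, standing in for recall search.
  search(query: string, sessionID: string): TemporalMessage[] {
    const needle = query.toLowerCase();
    return this.messages.filter(
      (m) => m.sessionID === sessionID && m.content.toLowerCase().indexOf(needle) !== -1,
    );
  }
}

// Stand-in for the gateway call. Per the description above, the gateway
// stores its own API response, not the scripted assistant content.
type Gateway = (history: Turn[], sessionID: string, temporal: InMemoryTemporalStore) => void;

function replayScenario(
  turns: Turn[],
  sessionID: string,
  temporal: InMemoryTemporalStore,
  sendToGateway: Gateway,
): Turn[] {
  const history: Turn[] = [];
  for (const turn of turns) {
    history.push(turn);
    if (turn.role === "user") {
      sendToGateway(history, sessionID, temporal);
    } else {
      // The fix: also store the scripted assistant turn directly in temporal,
      // tagged with the replay session ID, so its content is searchable.
      temporal.store({ role: turn.role, content: turn.content, sessionID });
    }
  }
  return history;
}
```

With a fake gateway that stores only a generic reply, a search for "5000ms timeout" in the replay session finds the scripted turn only because of the extra `store` call in the `else` branch.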

2. Session-affinity RRF boost (`recall.ts`)

When scope=all and sessionID is known, add extra RRF lists for same-session temporal and distillation results. This gives current-session content a ranking boost over cross-session LTM entries that may match keywords but lack session-specific context.
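A rough sketch of how extra same-session lists shift a reciprocal rank fusion (RRF) ranking is shown here. It is not the actual `recall.ts` code: the `Hit` shape, the function names, and the constant `k = 60` (a commonly used RRF default) are assumptions for illustration.

```typescript
// Hypothetical result shape: LTM hits carry no sessionID in this sketch.
interface Hit {
  id: string;
  sessionID?: string;
}

interface RecallSources {
  temporal: Hit[];
  distillation: Hit[];
  ltm: Hit[];
}

// Standard RRF: each list contributes 1 / (k + rank) per hit, with rank starting at 1.
function rrfFuse(lists: Hit[][], k: number = 60): Record<string, number> {
  const scores: Record<string, number> = {};
  for (const list of lists) {
    list.forEach((hit, index) => {
      scores[hit.id] = (scores[hit.id] || 0) + 1 / (k + index + 1);
    });
  }
  return scores;
}

function recallWithSessionAffinity(
  sources: RecallSources,
  scope: "all" | "session",
  sessionID?: string,
): string[] {
  const lists: Hit[][] = [sources.temporal, sources.distillation, sources.ltm];
  if (scope === "all" && sessionID !== undefined) {
    // Session affinity: same-session hits appear in one extra list each,
    // so they collect an additional RRF contribution.
    const sameSession = (hit: Hit) => hit.sessionID === sessionID;
    lists.push(sources.temporal.filter(sameSession));
    lists.push(sources.distillation.filter(sameSession));
  }
  const scores = rrfFuse(lists);
  // Highest fused score first.
  return Object.keys(scores).sort((a, b) => scores[b] - scores[a]);
}
```

Because the boost is expressed as additional ranked lists rather than a score multiplier, it stays within plain RRF: a same-session hit simply gets counted once more, at its rank among same-session results.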

Eval Results (CM-1, 400K inflation)

| Metric | Before | After |
| --- | --- | --- |
| Average score | 3.69 | 4.38 |
| Questions >= 4.0 | 7/15 | 13/15 |
| Questions = 5.0 | 5/15 | 9/15 |

Remaining failures: m3 (2.2, LTM entry outranks session content), h4 (1.7, initial wrong hypothesis not found).

Tests

1752 pass, 0 fail
Typecheck clean across all 4 packages

…ll boost

Two fixes for eval recall quality:

1. Store scripted assistant turns in temporal during eval replay. The gateway stores its own API response (which differs from the scripted content), so the scenario's actual content was never searchable via recall. This fixes m4 (5000ms timeout error) going from 1.2 → 4.0.
2. Session-affinity RRF boost in recall: when scope=all and sessionID is known, add extra RRF lists for same-session temporal/distillation results. This boosts current-session content over cross-session LTM entries.

Eval score: 3.69 → 4.38 at 400K inflation (13 of 15 questions >= 4.0).

…indow (#435)

Updates marketing copy with the latest eval results from the recall quality + distillation transparency work (#428, #430, #431, #432, #433, #434).

### README.md

- Context retention table: Medium 2.3→4.1, Hard 3.3→4.8, Average 3.9→4.6
- Lore vs tail-window delta: +50%→+77%
- Added footnote: Lore scores averaged across multiple runs; TW/compaction baselines from a prior eval run with the same scenarios
- Added v6 to version history

### docs/index.html

- Hero stat: +50%→+77% vs tail-window
- Detail retention: 4.8→4.6 (overall average across difficulty levels, multiple runs)

### Review corrections

- Fixed Medium from 4.3→4.1 (honest multi-run average, not cherry-picked)
- Average row (4.6) now self-consistent with column values: (5.0+4.1+4.8)/3=4.63≈4.6
- Added footnote clarifying that TW/compaction columns are from a prior eval run

BYK self-assigned this May 20, 2026

BYK merged commit 48ffead into main May 20, 2026
10 checks passed

BYK deleted the fix-eval-scripted-storage branch May 20, 2026 21:05

BYK mentioned this pull request May 20, 2026

docs: update eval results — context retention 3.9→4.6, +77% vs tail-window #435

Merged

This was referenced May 21, 2026

publish: BYK/loreai@0.23.0 #439

Closed

publish: BYK/loreai@0.23.0 #448

Closed

BYK merged 1 commit into main from fix-eval-scripted-storage
