feat: wire --inflate flag into eval CLI for 400K token scenario testing #386

Merged
BYK merged 1 commit into main from feat-inflate-cli on May 19, 2026

BYK (Owner) commented May 19, 2026

Summary

Adds --inflate <tokens> flag to the eval CLI that inflates scenarios to a target token count before running them. This enables fair baseline comparison at realistic conversation lengths.

First Results at 400K Tokens

PR-2 (implicit preferences):

| Baseline    | Score | Delta vs. Lore |
|-------------|-------|----------------|
| Lore        | 4.20  |                |
| Tail-window | 2.90  | -1.30          |

Lore decisively outperforms tail-window at realistic conversation lengths. At 400K tokens, tail-window can only fit the last 80K tokens (dropping preferences stated early in the conversation), while Lore's distillation preserves them.

Lore wins 6 of 8 questions. Tail-window wins only the 2 questions whose relevant facts happen to fall inside its surviving 80K-token window.
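To make the truncation effect concrete, here is a minimal sketch of a tail-window baseline. It is an illustration under stated assumptions, not the code in this repo: the `Message` type, the `tailWindow` function, and the uniform 1,000-token messages are all hypothetical.

```typescript
// Hypothetical message shape; not taken from the eval harness.
interface Message {
  id: number;
  tokens: number; // pre-computed token count for this message
}

// Keep the most recent messages whose combined size fits the budget,
// walking backwards from the end of the conversation.
function tailWindow(messages: Message[], budget: number): Message[] {
  const kept: Message[] = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const next = messages[i];
    if (used + next.tokens > budget) break;
    kept.push(next);
    used += next.tokens;
  }
  return kept.reverse();
}

// 400 messages of 1,000 tokens each: a 400K-token conversation.
const conversation: Message[] = Array.from({ length: 400 }, (_, i) => ({
  id: i,
  tokens: 1000,
}));

// With an 80K budget only the last 80 messages survive.
const surviving = tailWindow(conversation, 80_000);
console.log(surviving.length); // 80
console.log(surviving[0].id); // 320
```

In this toy setup, anything said in messages 0 through 319 (the first 320K tokens) is invisible to the tail-window baseline, which is the failure mode the scores above reflect.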

Usage

bun packages/core/eval/run.ts --mode live --inflate 400000 --baselines lore,tail-window

Files Changed

  • packages/core/eval/run.ts: --inflate arg parsing + logging
  • packages/core/eval/types.ts: inflateTokens field on EvalConfig
  • packages/core/eval/harness.ts: inflation before scenario execution
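The three changes fit together roughly as sketched below. This is a hedged illustration rather than the merged code: only the `--inflate` flag and the `inflateTokens` field name come from this PR. The `Scenario` type, `parseInflate`, `prepareScenario`, and the `inflate` callback (standing in for `inflateScenario()`, whose real signature is not shown here) are assumptions.

```typescript
// Only the inflateTokens field name comes from the PR; the rest is hypothetical.
interface EvalConfig {
  inflateTokens?: number;
}

// Hypothetical scenario shape for illustration.
interface Scenario {
  name: string;
  tokens: number;
}

// Read `--inflate <tokens>` from an argv-style array.
function parseInflate(argv: string[]): EvalConfig {
  const config: EvalConfig = {};
  const i = argv.indexOf("--inflate");
  if (i >= 0) {
    const raw = argv[i + 1];
    const tokens = Number(raw);
    if (!Number.isInteger(tokens) || tokens <= 0) {
      throw new Error(`--inflate expects a positive integer, got "${raw}"`);
    }
    config.inflateTokens = tokens;
  }
  return config;
}

// Inflate before execution only when the flag was given.
// `inflate` is a stand-in for the real inflateScenario().
function prepareScenario(
  scenario: Scenario,
  config: EvalConfig,
  inflate: (s: Scenario, target: number) => Scenario,
): Scenario {
  return config.inflateTokens === undefined
    ? scenario
    : inflate(scenario, config.inflateTokens);
}

const config = parseInflate(["--mode", "live", "--inflate", "400000"]);
const padded = prepareScenario(
  { name: "PR-2", tokens: 12_000 },
  config,
  (s, target) => ({ ...s, tokens: Math.max(s.tokens, target) }),
);
console.log(config.inflateTokens, padded.tokens); // 400000 400000
```

Leaving `inflateTokens` optional keeps runs without the flag unchanged, which matches the flag being opt-in in the usage line above.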

Adds --inflate <tokens> flag to eval run.ts that inflates scenarios
before running them. Uses inflateScenario() from inflate.ts.

First results at 400K tokens (PR-2 implicit preferences):
  Lore:        4.20
  Tail-window: 2.90

At realistic conversation lengths, Lore decisively outperforms
tail-window on 6/8 questions.

BYK self-assigned this May 19, 2026
BYK merged commit 1e3ab93 into main May 19, 2026
10 checks passed
BYK deleted the feat-inflate-cli branch May 19, 2026 10:42
This was referenced May 21, 2026