Feat: Pin prompt versions in benchmark run artifacts #169
Merged
Surface FACT_EXTRACTION_PROMPT_VERSION (extractor.rs) and ASK_SYSTEM_PROMPT_VERSION (admin.rs) on the /health response so the benchmark harness can pin them into result-artifact metadata at run start. Closes the attribution gap where two LOCOMO runs with different extractor prompts produced indistinguishable JSON on disk.
HealthResponse gains a prompt_versions: PromptVersions block with extract + ask fields. Both fields are always populated; there is no "version unknown" state for a running server. ASK_SYSTEM_PROMPT_VERSION loses its #[allow(dead_code)] since it's now load-bearing.
Pinned by health_response_serializes_prompt_versions_block so a future rename can't silently break the harness pipeline. 188/188 tests pass (was 187).
MEM-56
Read prompt_versions from GET /health at run start, fail fast if the
server doesn't expose them (no silent fallback to empty metadata), and
thread the dict onto RunArtifact so every result JSON records which
extract.v* / ask.v* produced it. Comparison table renders a
'prompt versions' row so a future 'score jumped in week N' delta is
attributable to the prompt change vs the weights change rather than
guessed at from git history.
Changes:
- core/types.py: RunArtifact gains prompt_versions: dict[str, str]
with empty-dict default (legacy artifacts loaded by 'compare' still
parse). Fresh runs always populate because the harness fails fast
at startup when the field is missing.
- run.py: at server boot check, after the mode validation, abort with
a clear error if health.prompt_versions doesn't carry both extract
and ask. On success, log the versions and stash them on config
under _server_prompt_versions so stage_eval picks them up without a
signature change.
- core/report.py: generate_comparison_table renders a 'prompt versions'
row showing extract.vN/ask.vM per preset. Empty cells for legacy
artifacts so cross-cycle comparisons stay readable.
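The run.py boot check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function names, the exception class, and the error wording are hypothetical, and only the prompt_versions keys, the _server_prompt_versions config key, and the log line format come from this PR.

```python
REQUIRED_PROMPT_KEYS = ("extract", "ask")


class MissingPromptVersionsError(RuntimeError):
    """Hypothetical error type: /health did not expose both prompt versions."""


def require_prompt_versions(health: dict) -> dict[str, str]:
    """Return the prompt_versions block, or abort if it is absent or incomplete."""
    versions = health.get("prompt_versions")
    if not isinstance(versions, dict):
        # Pre-MEM-56 servers have no prompt_versions field at all.
        raise MissingPromptVersionsError(
            "server /health has no prompt_versions block"
        )
    missing = [key for key in REQUIRED_PROMPT_KEYS if not versions.get(key)]
    if missing:
        raise MissingPromptVersionsError(
            "server /health prompt_versions is missing: " + ", ".join(missing)
        )
    return {key: str(versions[key]) for key in REQUIRED_PROMPT_KEYS}


def pin_prompt_versions(config: dict, health: dict) -> None:
    """Validate, log, and stash the versions on config for later stages."""
    versions = require_prompt_versions(health)
    print(f"prompt versions: extract={versions['extract']} ask={versions['ask']}")
    # Stashing on config lets stage_eval read the dict without a signature change.
    config["_server_prompt_versions"] = versions
```

Raising instead of returning an empty dict is what rules out the silent fallback to empty metadata: a run against an older server stops before any benchmark spend.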
Manually verified end-to-end against the running server:
- /health returns {extract:extract.v1, ask:ask.v1}
- Harness fail-fast triggered against a pre-MEM-56 server (no
prompt_versions field) with the documented error message
- Stand-alone Python script confirmed: server -> harness -> artifact
JSON contains the prompt_versions block
- Comparison table renders the row with synthetic data
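The synthetic-data check in the last item could look something like the sketch below. The helper names, the preset names, and the pipe-delimited row layout are assumptions for illustration; the PR only states that the row shows extract.vN/ask.vM per preset and leaves legacy artifacts' cells empty.

```python
def prompt_versions_cell(prompt_versions: dict) -> str:
    """Format one preset's cell; empty for legacy artifacts without versions."""
    extract = prompt_versions.get("extract")
    ask = prompt_versions.get("ask")
    if not extract or not ask:
        return ""
    return f"{extract}/{ask}"


def prompt_versions_row(versions_by_preset: dict) -> str:
    """Render a single 'prompt versions' row, one cell per preset."""
    cells = [prompt_versions_cell(pv) for pv in versions_by_preset.values()]
    return "| prompt versions | " + " | ".join(cells) + " |"


# Synthetic data: two fresh runs and one legacy artifact (hypothetical presets).
synthetic = {
    "baseline": {"extract": "extract.v1", "ask": "ask.v1"},
    "ranker": {"extract": "extract.v2", "ask": "ask.v1"},
    "legacy": {},
}
row = prompt_versions_row(synthetic)
print(row)
```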
Not benchmarked end-to-end because this PR doesn't touch scoring,
extraction, or retrieval — pure metadata plumbing. A full LOCOMO +
LongMemEval re-run would reproduce yesterday's ranker numbers within
judge noise at ~$8 spend. Saving that budget for MEM-54 / MEM-55 /
MEM-57 where benchmarks actually buy signal.
MEM-56
ducnmm
approved these changes
May 19, 2026
Summary
Why
Two benchmark runs with different extractor prompts currently produce indistinguishable result JSONs on disk — both just say "LOCOMO 54.3 J overall". When we go back in a month and ask "why did the score jump in week N?", we have to grep commit history to figure out which prompt version was live. That's attribution debt that compounds, and it lands hard on MEM-54 (extract prompt v2 for the importance signal) and MEM-55 (extract prompt revision for assistant-side fact scope), both of which are about to change the extractor.
What
- GET /health response gains a prompt_versions: { extract, ask } block sourced from the existing FACT_EXTRACTION_PROMPT_VERSION (services/extractor.rs) and ASK_SYSTEM_PROMPT_VERSION (routes/admin.rs) constants.
- The harness reads prompt_versions at run start. It fails fast if the server doesn't expose both fields (no silent fallback to empty metadata).
- RunArtifact gains a prompt_versions: dict[str, str] field that's serialised into every result JSON.
- generate_comparison_table renders a prompt versions row showing extract.vN/ask.vM per preset.
Solution
Server-side (commit da85894): minimal. It adds a PromptVersions struct and populates it in the health() handler from the existing version constants. The constants already lived in the right places from Phase 3 of ENG-1747 (with documentation explicitly anticipating this exposure); ASK_SYSTEM_PROMPT_VERSION loses its #[allow(dead_code)] because it's now load-bearing.
Harness-side (commit ba2d945): the mode == "benchmark" check already happens at server boot validation, and prompt_versions is checked in the same block. On success, the dict is stashed on config["_server_prompt_versions"] so stage_eval picks it up without a signature change. The dataclass default of field(default_factory=dict) is for backward-compat with legacy artifacts that compare might still load; fresh runs always populate because the boot check aborts otherwise.
Manually verified end-to-end against the running server:
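- /health returns {"extract":"extract.v1","ask":"ask.v1"}
- Harness fail-fast triggered against a pre-MEM-56 server (no prompt_versions field) with the documented error message
- Stand-alone Python script confirmed: server -> harness -> artifact JSON contains the prompt_versions block
- Comparison table renders the row with synthetic data
Types of Changes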
Testing
Unit tests: 188/188 pass (was 187 — one new test pins the wire shape).
- health_response_serializes_prompt_versions_block (types.rs): pins the JSON keys (prompt_versions.extract / prompt_versions.ask) so a future rename can't silently break the harness pipeline.
Manual integration verification:
- curl localhost:3001/health returns prompt_versions block {"extract":"extract.v1","ask":"ask.v1"}
- run.py startup logs prompt versions: extract=extract.v1 ask=ask.v1
- RunArtifact serializes the field via dataclasses.asdict
- generate_comparison_table renders the prompt versions row
Not benchmarked end-to-end because this PR doesn't touch scoring, extraction, or retrieval; it is pure metadata plumbing. A full LOCOMO + LongMemEval re-run would reproduce yesterday's ranker numbers within judge noise at ~$8 spend. Saving that budget for MEM-54 / MEM-55 / MEM-57 where benchmarks actually buy signal.
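For reference, the RunArtifact serialization check can be reproduced with a stripped-down dataclass like the one below. It is a sketch under stated assumptions: the preset and score fields and the load_artifact helper are hypothetical stand-ins, and only the prompt_versions field, its default_factory=dict default, and the use of dataclasses.asdict are taken from this PR.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunArtifact:
    preset: str   # hypothetical stand-in for the real artifact fields
    score: float  # hypothetical stand-in for the real artifact fields
    # An empty-dict default keeps legacy JSON (no prompt_versions key) loadable.
    prompt_versions: dict[str, str] = field(default_factory=dict)


def load_artifact(raw: str) -> RunArtifact:
    """Hypothetical loader: rebuild an artifact from its JSON text."""
    return RunArtifact(**json.loads(raw))


# Fresh run: the block is written into the result JSON.
fresh = RunArtifact(
    preset="baseline",
    score=54.3,
    prompt_versions={"extract": "extract.v1", "ask": "ask.v1"},
)
fresh_json = json.dumps(asdict(fresh))

# Legacy artifact written before this field existed still parses.
legacy = load_artifact('{"preset": "baseline", "score": 54.3}')
```

Using default_factory rather than a shared dict literal gives every artifact its own dict, so mutating one artifact's versions cannot leak into another.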
Checklist
Related Issues
Additional Notes
This is the infrastructure prerequisite for the next three sub-issues. Without it, MEM-54's importance-signal benchmark and MEM-55's scope-fix benchmark would produce result JSONs indistinguishable from each other or from MEM-53's baseline — making the cycle's J-Score deltas non-attributable a month from now.
Two logical commits to keep the server-side change reviewable separately from the harness-side change:
- da85894 - feat(server): expose prompt versions on /health (MEM-56) - types + handler wiring + one wire-shape test
- ba2d945 - feat(benchmarks): pin prompt_versions into run artifacts (MEM-56) - RunArtifact field, harness fail-fast, comparison table row