Feat: Pin prompt versions in benchmark run artifacts by hungtranphamminh · Pull Request #169 · MystenLabs/MemWal

hungtranphamminh · 2026-05-19T02:21:26Z

Summary

Why

Two benchmark runs with different extractor prompts currently produce indistinguishable result JSONs on disk — both just say "LOCOMO 54.3 J overall". When we go back in a month and ask "why did the score jump in week N?", we have to grep commit history to figure out which prompt version was live. That's attribution debt that compounds, and it lands hard on MEM-54 (extract prompt v2 for the importance signal) and MEM-55 (extract prompt revision for assistant-side fact scope), both of which are about to change the extractor.

What

Server's GET /health response gains a prompt_versions: { extract, ask } block sourced from the existing FACT_EXTRACTION_PROMPT_VERSION (services/extractor.rs) and ASK_SYSTEM_PROMPT_VERSION (routes/admin.rs) constants.
Benchmark harness reads prompt_versions at run start. Fails fast if the server doesn't expose both fields (no silent fallback to empty metadata).
RunArtifact gains a prompt_versions: dict[str, str] field that's serialised into every result JSON.
generate_comparison_table renders a prompt versions row showing extract.vN/ask.vM per preset.
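
As a rough illustration of the artifact side of this list, the sketch below shows how a `prompt_versions` field with an empty-dict default could be serialised into a result JSON while legacy files still load. Only `RunArtifact`, `prompt_versions`, and the `extract`/`ask` keys come from this PR; the other fields (`preset`, `overall_score`) and the `load_artifact` helper are hypothetical stand-ins, not the harness's actual code.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunArtifact:
    # Hypothetical subset of fields; only prompt_versions is named in the PR.
    preset: str
    overall_score: float
    # Empty-dict default so legacy artifacts without the block still construct.
    prompt_versions: dict[str, str] = field(default_factory=dict)


def load_artifact(raw: dict) -> RunArtifact:
    """Hypothetical loader: tolerate result JSONs written before this change."""
    return RunArtifact(
        preset=raw["preset"],
        overall_score=raw["overall_score"],
        prompt_versions=dict(raw.get("prompt_versions", {})),
    )


# A fresh run records the versions reported by the server at run start.
fresh = RunArtifact("baseline", 54.3, {"extract": "extract.v1", "ask": "ask.v1"})
fresh_json = json.dumps(asdict(fresh), sort_keys=True)

# A legacy result JSON has no prompt_versions key and falls back to {}.
legacy = load_artifact({"preset": "old", "overall_score": 50.0})
```

With this shape, two runs that score the same but used different extractor prompts differ on disk in the `prompt_versions` block.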

Solution

Server-side (commit da85894): minimal — just add a PromptVersions struct and populate it in the health() handler from the existing version constants. The constants already lived in the right places from Phase 3 of ENG-1747 (with documentation explicitly anticipating this exposure); ASK_SYSTEM_PROMPT_VERSION loses its #[allow(dead_code)] because it's now load-bearing.

Harness-side (commit ba2d945): the mode == "benchmark" check already happens at server boot validation, and prompt_versions is checked in the same block. On success, the dict is stashed on config["_server_prompt_versions"] so stage_eval picks it up without a signature change. The dataclass default of field(default_factory=dict) is for backward-compat with legacy artifacts that compare might still load — fresh runs always populate because the boot check aborts otherwise.
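
The boot check described above might look roughly like the following. This is a sketch under assumptions: the function names, the exception type, the error wording, and the idea that `mode` is a top-level key of the parsed `/health` body are all hypothetical; only the `prompt_versions` block, its `extract`/`ask` keys, and the `_server_prompt_versions` config key are taken from this PR.

```python
REQUIRED_PROMPT_KEYS = ("extract", "ask")


class PromptVersionsMissing(RuntimeError):
    """Hypothetical error raised when /health lacks usable prompt versions."""


def require_prompt_versions(health: dict) -> dict:
    """Return the prompt versions from a parsed /health body, or abort."""
    versions = health.get("prompt_versions")
    if not isinstance(versions, dict):
        raise PromptVersionsMissing("/health has no prompt_versions block")
    missing = [key for key in REQUIRED_PROMPT_KEYS if not versions.get(key)]
    if missing:
        raise PromptVersionsMissing(
            "prompt_versions is missing: " + ", ".join(missing)
        )
    return {key: str(versions[key]) for key in REQUIRED_PROMPT_KEYS}


def boot_check(health: dict, config: dict) -> None:
    """Validate the server at run start and stash the versions on config."""
    # Assumed shape: the existing mode check reads a top-level "mode" key.
    if health.get("mode") != "benchmark":
        raise RuntimeError("server is not running in benchmark mode")
    # No silent fallback: this raises instead of recording empty metadata.
    config["_server_prompt_versions"] = require_prompt_versions(health)
```

Stashing the result on `config` rather than passing it as an argument is what lets a later stage read it without a signature change.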

Manually verified end-to-end against the running server:

/health returns a prompt_versions block: {"extract":"extract.v1","ask":"ask.v1"}
Harness fail-fast triggered against a pre-MEM-56 server with the documented error message
Stand-alone Python script confirmed: server → harness → artifact JSON contains the prompt_versions block
Comparison table renders the row with synthetic data
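
For the last check, a hand test with synthetic data could be as small as the sketch below. The `prompt_versions_row` helper and the pipe-delimited row layout are assumptions for illustration; the real `generate_comparison_table` output format is not shown in this PR. The `extract.vN/ask.vM` cell format and the empty cell for legacy artifacts follow the description in the commit message.

```python
def prompt_versions_row(versions_by_preset: dict) -> str:
    """Render one 'prompt versions' row, one cell per preset (hypothetical helper)."""
    cells = []
    for versions in versions_by_preset.values():
        if versions.get("extract") and versions.get("ask"):
            # Both values are already prefixed, e.g. "extract.v1" and "ask.v1".
            cells.append(f"{versions['extract']}/{versions['ask']}")
        else:
            # Legacy artifacts carry no versions, so the cell stays empty.
            cells.append("")
    return "| prompt versions | " + " | ".join(cells) + " |"


# Synthetic data: one fresh preset and one legacy preset with no metadata.
synthetic = {
    "fresh": {"extract": "extract.v1", "ask": "ask.v1"},
    "legacy": {},
}
row = prompt_versions_row(synthetic)
```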

Types of Changes

Breaking change (fix or feature that would cause existing functionality to not work as expected)
New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Performance optimization (non-breaking change which addresses a performance issue)
Refactor (non-breaking change which does not change existing behavior or add new functionality)
Library update (non-breaking change that will update one or more libraries to newer versions)
Documentation (non-breaking change that doesn't change code behavior, can skip testing)
Test (non-breaking change related to testing)
Security awareness (changes that affect permission scope, security scenarios)

Testing

I have tested this code locally
I have added/updated unit tests
I have added/updated integration tests
I have tested in multiple browsers (if applicable)

Unit tests: 188/188 pass (was 187 — one new test pins the wire shape).

health_response_serializes_prompt_versions_block (types.rs) — pins the JSON keys (prompt_versions.extract / prompt_versions.ask) so a future rename can't silently break the harness pipeline.

Manual integration verification:

| Check | Result |
| --- | --- |
| `curl localhost:3001/health` returns `prompt_versions` block | ✅ `{"extract":"extract.v1","ask":"ask.v1"}` |
| Harness fail-fast against pre-MEM-56 server (no field) | ✅ Aborts with the documented error message |
| `run.py` startup logs `prompt versions: extract=extract.v1 ask=ask.v1` | ✅ Visible in boot output |
| `RunArtifact` serializes the field via `dataclasses.asdict` | ✅ Confirmed in stand-alone Python smoke |
| `generate_comparison_table` renders the `prompt versions` row | ✅ Hand-tested with synthetic data |

Not benchmarked end-to-end because this PR doesn't touch scoring, extraction, or retrieval — pure metadata plumbing. A full LOCOMO + LongMemEval re-run would reproduce yesterday's ranker numbers within judge noise at ~$8 spend. Saving that budget for MEM-54 / MEM-55 / MEM-57 where benchmarks actually buy signal.

Checklist

My code follows the code style of this project
My change requires a change to the documentation
I have updated the documentation accordingly
I have added tests to cover my changes
All new and existing tests passed

Related Issues

Closes MEM-56
Part of MEM-52 (RAG quality, cycle 13)
Unblocks MEM-54, MEM-55, MEM-57 (all change the extract prompt and need their J-Score deltas to be attributable)

Additional Notes

This is the infrastructure prerequisite for the next three sub-issues. Without it, MEM-54's importance-signal benchmark and MEM-55's scope-fix benchmark would produce result JSONs indistinguishable from each other or from MEM-53's baseline — making the cycle's J-Score deltas non-attributable a month from now.

Two logical commits to keep the server-side change reviewable separately from the harness-side change:

da85894 — feat(server): expose prompt versions on /health (MEM-56) — types + handler wiring + one wire-shape test
ba2d945 — feat(benchmarks): pin prompt_versions into run artifacts (MEM-56) — RunArtifact field, harness fail-fast, comparison table row

Surface FACT_EXTRACTION_PROMPT_VERSION (extractor.rs) and ASK_SYSTEM_PROMPT_VERSION (admin.rs) on the /health response so the benchmark harness can pin them into result-artifact metadata at run start. Closes the attribution gap where two LOCOMO runs with different extractor prompts produced indistinguishable JSON on disk.

HealthResponse gains a prompt_versions: PromptVersions block with extract + ask fields. Both fields are always populated; there is no "version unknown" state for a running server. ASK_SYSTEM_PROMPT_VERSION loses its #[allow(dead_code)] since it's now load-bearing.

Pinned by health_response_serializes_prompt_versions_block so a future rename can't silently break the harness pipeline. 188/188 tests pass (was 187).

MEM-56

Read prompt_versions from GET /health at run start, fail fast if the server doesn't expose them (no silent fallback to empty metadata), and thread the dict onto RunArtifact so every result JSON records which extract.v* / ask.v* produced it. Comparison table renders a 'prompt versions' row so a future 'score jumped in week N' delta is attributable to the prompt change vs the weights change rather than guessed at from git history.

Changes:
- core/types.py: RunArtifact gains prompt_versions: dict[str, str] with empty-dict default (legacy artifacts loaded by 'compare' still parse). Fresh runs always populate because the harness fails fast at startup when the field is missing.
- run.py: at server boot check, after the mode validation, abort with a clear error if health.prompt_versions doesn't carry both extract and ask. On success, log the versions and stash them on config under _server_prompt_versions so stage_eval picks them up without a signature change.
- core/report.py: generate_comparison_table renders a 'prompt versions' row showing extract.vN/ask.vM per preset. Empty cells for legacy artifacts so cross-cycle comparisons stay readable.

Manually verified end-to-end against the running server:
- /health returns {extract:extract.v1, ask:ask.v1}
- Harness fail-fast triggered against a pre-MEM-56 server (no prompt_versions field) with the documented error message
- Stand-alone Python script confirmed: server -> harness -> artifact JSON contains the prompt_versions block
- Comparison table renders the row with synthetic data

Not benchmarked end-to-end because this PR doesn't touch scoring, extraction, or retrieval; it is pure metadata plumbing. A full LOCOMO + LongMemEval re-run would reproduce yesterday's ranker numbers within judge noise at ~$8 spend. Saving that budget for MEM-54 / MEM-55 / MEM-57 where benchmarks actually buy signal.

MEM-56

hungtranphamminh added 2 commits May 19, 2026 09:17

ducnmm approved these changes May 19, 2026

hungtranphamminh merged commit 47a1f6f into dev May 19, 2026
7 checks passed

hungtranphamminh deleted the feat/MEM-56-pin-prompt-versions branch May 19, 2026 05:50

hungtranphamminh mentioned this pull request May 19, 2026

Feat: extract.v2 — relax fact-extraction scope to both parties #173

Merged

18 tasks

hungtranphamminh merged 2 commits into dev from feat/MEM-56-pin-prompt-versions
