
Feat: Pin prompt versions in benchmark run artifacts #169

Merged: hungtranphamminh merged 2 commits into dev from feat/MEM-56-pin-prompt-versions on May 19, 2026

@hungtranphamminh (Collaborator)

Summary

Why

Two benchmark runs with different extractor prompts currently produce indistinguishable result JSONs on disk — both just say "LOCOMO 54.3 J overall". When we go back in a month and ask "why did the score jump in week N?", we have to grep commit history to figure out which prompt version was live. That's attribution debt that compounds, and it lands hard on MEM-54 (extract prompt v2 for the importance signal) and MEM-55 (extract prompt revision for assistant-side fact scope), both of which are about to change the extractor.

What

  • Server's GET /health response gains a prompt_versions: { extract, ask } block sourced from the existing FACT_EXTRACTION_PROMPT_VERSION (services/extractor.rs) and ASK_SYSTEM_PROMPT_VERSION (routes/admin.rs) constants.
  • Benchmark harness reads prompt_versions at run start. Fails fast if the server doesn't expose both fields (no silent fallback to empty metadata).
  • RunArtifact gains a prompt_versions: dict[str, str] field that's serialised into every result JSON.
  • generate_comparison_table renders a prompt versions row showing extract.vN/ask.vM per preset.
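A minimal sketch of the fail-fast check described in the list above, written as a pure function over an already-parsed /health payload. The names require_prompt_versions, ServerCheckError, and REQUIRED_PROMPT_KEYS are hypothetical, and the error text is illustrative rather than the message run.py actually prints; only the prompt_versions.extract / prompt_versions.ask wire shape comes from this PR.

```python
from typing import Any, Mapping

# Keys the harness expects under /health -> prompt_versions (from this PR).
REQUIRED_PROMPT_KEYS = ("extract", "ask")


class ServerCheckError(RuntimeError):
    """Hypothetical error type for an unusable /health payload."""


def require_prompt_versions(health: Mapping[str, Any]) -> dict[str, str]:
    """Return the prompt versions, or raise instead of falling back to {}."""
    versions = health.get("prompt_versions")
    if not isinstance(versions, Mapping):
        raise ServerCheckError(
            "/health has no prompt_versions block; is the server older than MEM-56?"
        )

    # A key counts as missing if it is absent, not a string, or empty.
    missing = [
        key
        for key in REQUIRED_PROMPT_KEYS
        if not isinstance(versions.get(key), str) or not versions[key]
    ]
    if missing:
        raise ServerCheckError(
            f"/health prompt_versions is missing: {', '.join(missing)}"
        )

    return {key: versions[key] for key in REQUIRED_PROMPT_KEYS}
```

Raising rather than returning an empty dict is the point: a run that cannot name its prompt versions never gets as far as writing an artifact.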

Solution

Server-side (commit da85894): minimal — just add a PromptVersions struct and populate it in the health() handler from the existing version constants. The constants already lived in the right places from Phase 3 of ENG-1747 (with documentation explicitly anticipating this exposure); ASK_SYSTEM_PROMPT_VERSION loses its #[allow(dead_code)] because it's now load-bearing.

Harness-side (commit ba2d945): the mode == "benchmark" check already happens at server boot validation, and prompt_versions is checked in the same block. On success, the dict is stashed on config["_server_prompt_versions"] so stage_eval picks it up without a signature change. The dataclass default of field(default_factory=dict) exists for backward compatibility with legacy artifacts that 'compare' might still load; fresh runs always populate the field because the boot check aborts otherwise.
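As a rough illustration of the harness-side flow, the sketch below threads the stashed dict onto an artifact and shows why the empty-dict default keeps legacy JSON loadable. This RunArtifact is a hypothetical two-field stand-in (the real dataclass in core/types.py has more fields), and build_artifact / load_artifact are illustrative helpers, not functions from the harness.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunArtifact:
    # Hypothetical minimal subset of the real RunArtifact.
    preset: str
    overall_j: float
    # Empty-dict default so artifacts written before MEM-56 still parse.
    prompt_versions: dict[str, str] = field(default_factory=dict)


def build_artifact(config: dict, preset: str, overall_j: float) -> RunArtifact:
    """Fresh run: the boot check has already stashed the versions on config."""
    versions = dict(config["_server_prompt_versions"])  # copy, don't alias config
    return RunArtifact(preset=preset, overall_j=overall_j, prompt_versions=versions)


def load_artifact(raw: str) -> RunArtifact:
    """Load a result JSON; legacy files simply lack the prompt_versions key."""
    data = json.loads(raw)
    return RunArtifact(
        preset=data["preset"],
        overall_j=data["overall_j"],
        prompt_versions=data.get("prompt_versions", {}),
    )


if __name__ == "__main__":
    config = {"_server_prompt_versions": {"extract": "extract.v1", "ask": "ask.v1"}}
    fresh = build_artifact(config, preset="baseline", overall_j=54.3)
    print(json.dumps(asdict(fresh), indent=2))
```

Reading the versions with config["..."] rather than config.get(...) is deliberate in this sketch: if the boot check were ever skipped, a KeyError is louder than an artifact with silently empty metadata.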

Manually verified end-to-end against the running server:

  • /health returns {"extract":"extract.v1","ask":"ask.v1"}
  • Harness fail-fast triggered against a pre-MEM-56 server with the documented error message
  • Stand-alone Python script confirmed: server → harness → artifact JSON contains the prompt_versions block
  • Comparison table renders the row with synthetic data

Types of Changes

  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Performance optimization (non-breaking change which addresses a performance issue)
  • Refactor (non-breaking change which does not change existing behavior or add new functionality)
  • Library update (non-breaking change that will update one or more libraries to newer versions)
  • Documentation (non-breaking change that doesn't change code behavior, can skip testing)
  • Test (non-breaking change related to testing)
  • Security awareness (changes that affect permission scope, security scenarios)

Testing

  • I have tested this code locally
  • I have added/updated unit tests
  • I have added/updated integration tests
  • I have tested in multiple browsers (if applicable)

Unit tests: 188/188 pass (was 187 — one new test pins the wire shape).

  • health_response_serializes_prompt_versions_block (types.rs) — pins the JSON keys (prompt_versions.extract / prompt_versions.ask) so a future rename can't silently break the harness pipeline.

Manual integration verification:

| Check | Result |
| --- | --- |
| curl localhost:3001/health returns prompt_versions block | {"extract":"extract.v1","ask":"ask.v1"} |
| Harness fail-fast against pre-MEM-56 server (no field) | ✅ Aborts with the documented error message |
| run.py startup logs prompt versions: extract=extract.v1 ask=ask.v1 | ✅ Visible in boot output |
| RunArtifact serializes the field via dataclasses.asdict | ✅ Confirmed in stand-alone Python smoke |
| generate_comparison_table renders the prompt versions row | ✅ Hand-tested with synthetic data |
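For reviewers who want a picture of the last check, here is a small sketch of how a 'prompt versions' row could be rendered per preset, with an empty cell for legacy artifacts. The helper names and the pipe-delimited layout are assumptions for illustration; the actual formatting inside generate_comparison_table in core/report.py may differ.

```python
def prompt_versions_cell(versions: dict[str, str]) -> str:
    """Format one preset's cell as 'extract.vN/ask.vM', or '' for legacy runs."""
    extract = versions.get("extract")
    ask = versions.get("ask")
    if not extract or not ask:
        # Legacy artifact (pre-MEM-56): leave the cell empty rather than guess.
        return ""
    return f"{extract}/{ask}"


def prompt_versions_row(versions_by_preset: dict[str, dict[str, str]]) -> str:
    """Render the row in preset insertion order (hypothetical pipe layout)."""
    cells = [prompt_versions_cell(v) for v in versions_by_preset.values()]
    return "| prompt versions | " + " | ".join(cells) + " |"


if __name__ == "__main__":
    print(
        prompt_versions_row(
            {
                "baseline": {"extract": "extract.v1", "ask": "ask.v1"},
                "legacy-run": {},
            }
        )
    )
```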

Not benchmarked end-to-end because this PR doesn't touch scoring, extraction, or retrieval — pure metadata plumbing. A full LOCOMO + LongMemEval re-run would reproduce yesterday's ranker numbers within judge noise at ~$8 spend. Saving that budget for MEM-54 / MEM-55 / MEM-57 where benchmarks actually buy signal.

Checklist

  • My code follows the code style of this project
  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed

Related Issues

  • Closes MEM-56
  • Part of MEM-52 (RAG quality, cycle 13)
  • Unblocks MEM-54, MEM-55, MEM-57 (all change the extract prompt and need their J-Score deltas to be attributable)

Additional Notes

This is the infrastructure prerequisite for the next three sub-issues. Without it, MEM-54's importance-signal benchmark and MEM-55's scope-fix benchmark would produce result JSONs indistinguishable from each other or from MEM-53's baseline — making the cycle's J-Score deltas non-attributable a month from now.

Two logical commits to keep the server-side change reviewable separately from the harness-side change:

  1. da85894 feat(server): expose prompt versions on /health (MEM-56): types + handler wiring + one wire-shape test
  2. ba2d945 feat(benchmarks): pin prompt_versions into run artifacts (MEM-56): RunArtifact field, harness fail-fast, comparison table row

Surface FACT_EXTRACTION_PROMPT_VERSION (extractor.rs) and
ASK_SYSTEM_PROMPT_VERSION (admin.rs) on the /health response so the
benchmark harness can pin them into result-artifact metadata at run
start. Closes the attribution gap where two LOCOMO runs with different
extractor prompts produced indistinguishable JSON on disk.

HealthResponse gains a prompt_versions: PromptVersions block with
extract + ask fields. Both fields are always populated — there is no
"version unknown" state for a running server.
ASK_SYSTEM_PROMPT_VERSION loses its #[allow(dead_code)] since it's now
load-bearing.

Pinned by health_response_serializes_prompt_versions_block so a future
rename can't silently break the harness pipeline.

188/188 tests pass (was 187).

MEM-56
Read prompt_versions from GET /health at run start, fail fast if the
server doesn't expose them (no silent fallback to empty metadata), and
thread the dict onto RunArtifact so every result JSON records which
extract.v* / ask.v* produced it. Comparison table renders a
'prompt versions' row so a future 'score jumped in week N' delta is
attributable to the prompt change vs the weights change rather than
guessed at from git history.

Changes:

- core/types.py: RunArtifact gains prompt_versions: dict[str, str]
  with empty-dict default (legacy artifacts loaded by 'compare' still
  parse). Fresh runs always populate because the harness fails fast
  at startup when the field is missing.
- run.py: at server boot check, after the mode validation, abort with
  a clear error if health.prompt_versions doesn't carry both extract
  and ask. On success, log the versions and stash them on config
  under _server_prompt_versions so stage_eval picks them up without a
  signature change.
- core/report.py: generate_comparison_table renders a 'prompt versions'
  row showing extract.vN/ask.vM per preset. Empty cells for legacy
  artifacts so cross-cycle comparisons stay readable.

Manually verified end-to-end against the running server:
- /health returns {"extract":"extract.v1","ask":"ask.v1"}
- Harness fail-fast triggered against a pre-MEM-56 server (no
  prompt_versions field) with the documented error message
- Stand-alone Python script confirmed: server -> harness -> artifact
  JSON contains the prompt_versions block
- Comparison table renders the row with synthetic data

Not benchmarked end-to-end because this PR doesn't touch scoring,
extraction, or retrieval — pure metadata plumbing. A full LOCOMO +
LongMemEval re-run would reproduce yesterday's ranker numbers within
judge noise at ~$8 spend. Saving that budget for MEM-54 / MEM-55 /
MEM-57 where benchmarks actually buy signal.

MEM-56
@hungtranphamminh merged commit 47a1f6f into dev on May 19, 2026 (7 checks passed)
@hungtranphamminh deleted the feat/MEM-56-pin-prompt-versions branch on May 19, 2026 05:50
