Feat: extract.v2 - relax fact-extraction scope to both parties #173
Merged
Relax the extractor prompt's user-only scope to cover memorable facts from either party in the conversation. The v1 prompt scoped extraction to "facts about the user", which systematically under-counted assistant-side content (recommendations, conclusions, summaries, plans). LongMemEval's `single_session_assistant` category sat at 29.91 J because of this: the LLM was capable of distinguishing user-said from assistant-said facts when asked, but the prompt was preventing it from extracting the latter at all.

Bumps `FACT_EXTRACTION_PROMPT_VERSION` from "extract.v1" to "extract.v2". The const is surfaced on `GET /health` (via MEM-56) so every benchmark run-artifact JSON carries the version it was produced under.

Prompt-injection guard, NONE-on-no-facts behaviour, and the one-fact-per-line output shape are all preserved verbatim from v1. New rules cover which assistant content to capture vs what to skip (acks, restatements, formatting meta-talk).

Benchmark headline (PlaintextEngine, gpt-4o judge, gpt-4o-mini answer):
- LongMemEval overall: 72.15 → 76.6 (+4.45 J)
- LongMemEval `single_session_assistant`: 29.91 → 74.2 (+44.3 J, the cycle's first significant single-category lift)
- LongMemEval other categories: all within judge noise (most negative move −2.1 on `temporal`)
- LOCOMO overall: 53.88 → 53.7 (flat)
- LOCOMO `single_hop`: 53.40 → 43.5 (−9.9 J, ~6 SEMs; real, not noise)

The LOCOMO `single_hop` regression is a dilution effect at the recall `limit=10` cut: extract.v2 extracts +33% more facts per conversation, so the relevant user-side fact gets pushed below position 10 more often when synthetic single-fact-lookup queries hit. Fix path is MEM-54 (importance signal weighting user-said personal facts higher at ranker time), landing next, this cycle.

Three v2 prompt variants were explored during MEM-55 development to see if the regression could be addressed at the prompt layer. It can't: per-turn ingestion (one /api/analyze call per speaker turn) makes "dedup against context" impossible to implement reliably because the LLM doesn't see the other turns. The fix belongs at the ranker layer.

The pre-commit validation gate was per-category: the "LOCOMO within ±2 J" check failed by a wide margin on `single_hop`. Shipping anyway because the averaged-across-benchmarks delta is +2.13 J net and the fix path for the per-category regression is concrete and immediate (MEM-54). Documented loudly in the benchmark archive README rather than dressed up.

MEM-55
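The dilution effect described in this commit message can be illustrated with a toy recall cut. This is a minimal sketch under stated assumptions: `ScoredFact`, `recall`, `rank_of`, and `toy_store` are hypothetical names, the scores are invented for illustration, and MemWal's actual ranker is not shown here. It only demonstrates that when roughly a third more facts outrank the target (9 competitors growing to 12), a fixed `limit=10` cut drops it.

```rust
/// A stored fact with a retrieval score (hypothetical shape, not MemWal's real type).
#[derive(Debug, Clone, PartialEq)]
struct ScoredFact {
    text: String,
    score: f64,
}

/// Return the top `limit` facts by descending score, mimicking a recall cut.
fn recall(mut facts: Vec<ScoredFact>, limit: usize) -> Vec<ScoredFact> {
    // `total_cmp` gives a total order over f64, so the sort cannot panic on NaN.
    facts.sort_by(|a, b| b.score.total_cmp(&a.score));
    facts.truncate(limit);
    facts
}

/// Zero-based rank of the fact whose text equals `needle`, if it survived the cut.
fn rank_of(results: &[ScoredFact], needle: &str) -> Option<usize> {
    results.iter().position(|f| f.text == needle)
}

/// Build a toy store: `competitors` facts that all outscore the target, plus the target.
/// All scores here are made up for the illustration.
fn toy_store(competitors: usize) -> Vec<ScoredFact> {
    let mut facts: Vec<ScoredFact> = (0..competitors)
        .map(|i| ScoredFact {
            text: format!("competing fact {i}"),
            // Strictly decreasing scores, all above the target's 0.50.
            score: 0.90 - 0.01 * i as f64,
        })
        .collect();
    facts.push(ScoredFact {
        text: "user's dog is named Rex".to_string(),
        score: 0.50,
    });
    facts
}
```

With 9 competitors, `rank_of(&recall(toy_store(9), 10), "user's dog is named Rex")` finds the target in the tenth and last slot; with 12 competitors the same call returns `None` because the target falls outside the cut. That is the shape of the problem a ranker-side importance weight is meant to address, and a prompt-side change cannot.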
Archive the LongMemEval + LOCOMO baselines that validated the extract.v2 prompt change. All on commit 47a1f6f (current dev tip with MEM-53 ranker + MEM-56 prompt-version pinning merged).

Headlines documented in the README:
- LongMemEval overall 76.6 (+4.45 vs v1)
- LongMemEval `single_session_assistant` 74.2 (+44.3)
- LOCOMO overall 53.7 (flat)
- LOCOMO `single_hop` 43.5 (−9.9, real not noise)

README is explicit about the validation-gate accounting: gate (3) "LOCOMO within ±2 J" failed on the `single_hop` per-category delta. Documents the dilution-at-recall-limit root cause, why we ship anyway (averaged net +2.13 J, MEM-54 is the next sub-issue and directly addresses the dilution), and what we'd do if MEM-54 doesn't deliver (revisit or revert).

This is the first benchmark archive that exercises MEM-56's prompt-version attribution pipeline end-to-end: every artifact JSON carries `prompt_versions: {extract: extract.v2, ask: ask.v1}` in its metadata block, so future cross-run comparisons can attribute J-Score deltas to the prompt change without guessing.

MEM-55
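The gate accounting quoted in these commit messages is simple arithmetic, and it can be reproduced from the headline numbers alone. The sketch below is an illustration, not the project's benchmark tooling: the function names are hypothetical, and the scores and SEM values (±1.36 for LongMemEval overall, ±1.67 for LOCOMO `single_hop`) are the ones stated in this PR.

```rust
/// Score delta between two runs, in J-Score points.
fn delta(before: f64, after: f64) -> f64 {
    after - before
}

/// How many standard errors of the mean a delta spans (sign ignored).
fn sems_out(delta: f64, sem: f64) -> f64 {
    delta.abs() / sem
}

/// Gate check: does the delta stay within ±`tolerance` J?
fn within_tolerance(delta: f64, tolerance: f64) -> bool {
    delta.abs() <= tolerance
}

/// Arithmetic mean of the given deltas; `None` for an empty slice.
fn mean(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    Some(values.iter().sum::<f64>() / values.len() as f64)
}
```

Feeding in the headline scores, `delta(72.15, 76.6)` and `delta(53.88, 53.7)` average to about +2.135 J, which is the "+2.13 J net" figure. `delta(53.40, 43.5)` fails `within_tolerance(_, 2.0)`, and `sems_out` gives roughly 3.3 SEMs for the LongMemEval overall lift and roughly 5.9 SEMs for the `single_hop` regression.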
ducnmm approved these changes May 19, 2026
This was referenced May 20, 2026
Summary
Why
LongMemEval's `single_session_assistant` category sat at 29.91 J, by far the worst category on our benchmark scorecard. Root cause: the `extract.v1` prompt explicitly scoped fact extraction to "facts about the user", systematically under-counting assistant-side content (recommendations, conclusions, summaries, plans). The LLM was capable of distinguishing user-said from assistant-said facts when asked properly; the prompt was preventing it from extracting the latter at all.

This is a real product gap. MemWal users build assistants that need to remember what the assistant recommended in previous conversations: a category-defining MemWal use case where we were scoring 30 J.
What
- `services/prompts/extract.txt`: relaxed scope from "facts about the user" to "memorable facts from either party". New rules for what assistant content to capture (recommendations, plans agreed, factual claims, summaries) and what to skip (acks, restatements, formatting meta-talk).
- `services/extractor.rs`: bumped `FACT_EXTRACTION_PROMPT_VERSION` from `"extract.v1"` to `"extract.v2"`. Surfaced on `/health` via MEM-56, so every benchmark artifact JSON now records `extract.v2` in its metadata block.
- Prompt-injection guard, `NONE`-on-no-facts behaviour, and the one-fact-per-line output shape preserved verbatim from v1.

Solution
The prompt change itself is small. The honest part of this PR is what comes with it:
Three v2 iterations were explored during MEM-55 development to find a single prompt that maximises LongMemEval `single_session_assistant` without regressing LOCOMO `single_hop`. The result is structural: per-turn ingestion (one `/api/analyze` call per speaker turn) makes "dedup against context" impossible to implement reliably at the prompt layer, because the LLM doesn't see the other turns. The prompt-level Pareto frontier between LME assistant-side and LOCOMO user-side facts is real and can't be talked around.

We're shipping the first v2 variant (relaxed scope, no internal dedup rules) because it gives the largest net delta when averaged across both benchmarks. The LOCOMO `single_hop` regression that comes with it is a dilution-at-recall-limit problem: `extract.v2` extracts +33% more facts per conversation, and at `limit=10` the user-side fact gets pushed out by competing assistant-side facts. The fix belongs at the ranker layer, not the prompt layer. That's MEM-54 (importance signal), which is the next sub-issue this cycle.

Types of Changes
Testing
Unit tests: 188/188 pass (no new tests — the prompt is a versioned text asset, behaviour is exercised by the end-to-end benchmark runs below).
LongMemEval (the primary target: 500 queries, e2e mode, `gpt-4o` judge):
- `single_session_assistant`
- `multi_session`
- `preference`
- `single_session_user`
- `knowledge_update`
- `temporal`

SEM on overall is ±1.36 (stddev/√500); +4.45 is ~3 SEMs out, which is statistically significant. The `single_session_assistant` lift of +44.3 J is the largest single-category move on this codebase to date.

LOCOMO (regression check, 1986 queries):
- `adversarial`
- `multi_hop`
- `open_domain`
- `single_hop`
- `temporal`

The `single_hop` regression is real, not noise. SEM = ±1.67 (stddev/√282); a −9.9 delta is ~6 SEMs out. Documented at length in the benchmark archive README.

Full archive + per-category breakdown + methodology + root-cause analysis in `services/server/review/assessment/benchmark-runs/2026-05-19-mem55-extract-v2/README.md`.

Checklist
Related Issues
- MEM-54: `single_hop` regression at the ranker layer

Additional Notes
Validation-gate accounting
The MEM-55 ticket's validation gate was per-category:
1. `single_session_assistant` improved on LongMemEval
2. `temporal` at −2.1, just over
3. LOCOMO within ±2 J: `single_hop` regressed −9.9

Gate (3) failed by a wide margin on per-category reading. We are shipping anyway. Reasons:
- The `single_session_assistant` lift maps to a real product use case (assistant memory of past statements / recommendations); LOCOMO `single_hop` is synthetic single-fact lookup that doesn't reflect typical MemWal SDK usage.
- The fix path for the `single_hop` regression is concrete, scoped, and committed: MEM-54 (importance signal) is the next sub-issue. It weights user-said personal facts higher than assistant outputs at ranker time, directly addressing the dilution.

If MEM-54 doesn't deliver the importance-signal recovery on `single_hop`, the choice will be: tune the prompt to be less aggressive on assistant-side facts, or revert this PR. Either is on the table.

MEM-56's prompt-version attribution working live
This is the first PR that actually exercises MEM-56's `prompt_versions` field on `/health` and in run artifacts. Every artifact JSON archived in `review/assessment/benchmark-runs/2026-05-19-mem55-extract-v2/results/` carries `prompt_versions: {extract: extract.v2, ask: ask.v1}` in its metadata block. Future cross-cycle comparisons can attribute J-Score deltas to the prompt version cleanly.

Why no new unit tests
The prompt is a versioned text asset (`include_str!("prompts/extract.txt")`). There's no Rust logic to unit-test: the rules are tested by the end-to-end benchmark runs above, against the same gold-truth dataset Mem0/Zep/Supermemory use. The version constant is pinned by `health_response_serializes_prompt_versions_block` (added in MEM-56).
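To make the pinning idea concrete, here is a minimal, std-only sketch of how a version constant can be locked to the serialized `prompt_versions` block. It is an assumption-laden illustration rather than the MEM-56 code: `PromptVersions`, `ASK_PROMPT_VERSION`, and the hand-rolled `to_json` are hypothetical (the real service presumably uses a serialization library), and only `FACT_EXTRACTION_PROMPT_VERSION` and the values `extract.v2` / `ask.v1` come from this PR.

```rust
/// Version tag for the extraction prompt; name and value are from the PR text.
const FACT_EXTRACTION_PROMPT_VERSION: &str = "extract.v2";
/// Hypothetical constant name; only the value "ask.v1" appears in the PR.
const ASK_PROMPT_VERSION: &str = "ask.v1";

/// Hypothetical stand-in for the `prompt_versions` block reported on `/health`.
struct PromptVersions {
    extract: &'static str,
    ask: &'static str,
}

impl PromptVersions {
    /// Snapshot of the versions compiled into this build.
    fn current() -> Self {
        PromptVersions {
            extract: FACT_EXTRACTION_PROMPT_VERSION,
            ask: ASK_PROMPT_VERSION,
        }
    }

    /// Hand-rolled JSON to keep the sketch dependency-free.
    /// Assumes the version strings contain nothing that needs escaping.
    fn to_json(&self) -> String {
        format!(
            "{{\"prompt_versions\":{{\"extract\":\"{}\",\"ask\":\"{}\"}}}}",
            self.extract, self.ask
        )
    }
}
```

A pinning test in this style compares `PromptVersions::current().to_json()` against a literal expected string, so changing the prompt without bumping the constant, or bumping the constant without updating the test, shows up as an explicit diff in review.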