Feat: extract.v2 — relax fact-extraction scope to both parties by hungtranphamminh · Pull Request #173 · MystenLabs/MemWal

hungtranphamminh · 2026-05-19T16:40:21Z

Summary

Why

LongMemEval's single_session_assistant category sat at 29.91 J — by far the worst category on our benchmark scorecard. Root cause: the extract.v1 prompt explicitly scoped fact extraction to "facts about the user", systematically under-counting assistant-side content (recommendations, conclusions, summaries, plans). The LLM was capable of distinguishing user-said from assistant-said facts when asked properly; the prompt was preventing it from extracting the latter at all.

This is a real product gap. MemWal users build assistants that need to remember what the assistant recommended in previous conversations — a category-defining MemWal use case where we were scoring 30 J.

What

services/prompts/extract.txt: relaxed scope from "facts about the user" to "memorable facts from either party". New rules for what assistant content to capture (recommendations, plans agreed, factual claims, summaries) and what to skip (acks, restatements, formatting meta-talk).
services/extractor.rs: bumped FACT_EXTRACTION_PROMPT_VERSION from "extract.v1" to "extract.v2". Surfaced on /health via MEM-56, so every benchmark artifact JSON now records extract.v2 in its metadata block.
Prompt-injection guard, NONE-on-no-facts behaviour, and the one-fact-per-line output shape preserved verbatim from v1.
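
The output contract preserved from v1 (one fact per line, `NONE` when nothing is memorable) could be consumed roughly as follows. This is a minimal sketch, not the code in services/extractor.rs: `parse_extracted_facts` is a hypothetical helper, and the tolerance for blank lines and surrounding whitespace is an assumption.

```rust
/// Hypothetical helper: turn raw extractor output into a list of facts.
/// Assumes the contract described above: one fact per line, or the single
/// sentinel `NONE` when the turn contains nothing worth remembering.
pub fn parse_extracted_facts(raw: &str) -> Vec<String> {
    let trimmed = raw.trim();
    // Sentinel case: the model reported no memorable facts.
    if trimmed.is_empty() || trimmed == "NONE" {
        return Vec::new();
    }
    trimmed
        .lines()
        .map(str::trim)
        // Skip blank lines between facts (assumed to be tolerated).
        .filter(|line| !line.is_empty())
        .map(str::to_owned)
        .collect()
}
```

Calling `parse_extracted_facts("NONE")` yields an empty vector, while two non-empty lines yield two facts, whichever party they describe.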

Solution

The prompt change itself is small. The honest part of this PR is what comes with it:

Three v2 iterations were explored during MEM-55 development to find a single prompt that maximises LongMemEval single_session_assistant without regressing LOCOMO single_hop. None did, and the reason is structural: per-turn ingestion (one /api/analyze call per speaker turn) makes "dedup against context" impossible to implement reliably at the prompt layer, because the LLM doesn't see the other turns. The prompt-level Pareto frontier between LME assistant-side and LOCOMO user-side facts is real and can't be talked around.

We're shipping the first v2 variant (relaxed scope, no internal dedup rules) because it gives the largest net delta when averaged across both benchmarks. The LOCOMO single_hop regression that comes with it is a dilution-at-recall-limit problem: extract.v2 extracts 33% more facts per conversation, and at limit=10 the relevant user-side fact gets pushed out of the top 10 by competing assistant-side facts. The fix belongs at the ranker layer, not the prompt layer. That's MEM-54 (importance signal), which is the next sub-issue this cycle.
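
The dilution effect described above can be illustrated with a toy ranker. This is a sketch under stated assumptions, not MemWal's retrieval code: `ScoredFact`, `recall`, and `toy_corpus` are hypothetical names, and the scores are made-up illustration values rather than benchmark data.

```rust
/// A stored fact with a similarity score against some query.
/// Scores here are invented for illustration only.
#[derive(Debug, Clone)]
pub struct ScoredFact {
    pub text: String,
    pub score: f64,
}

/// Return the texts of the `limit` highest-scoring facts.
pub fn recall(facts: &[ScoredFact], limit: usize) -> Vec<String> {
    let mut ranked: Vec<&ScoredFact> = facts.iter().collect();
    // Sort descending by score; total_cmp gives a total order on f64.
    ranked.sort_by(|a, b| b.score.total_cmp(&a.score));
    ranked.into_iter().take(limit).map(|f| f.text.clone()).collect()
}

/// Build a toy corpus: one target user-side fact plus `competitors`
/// other facts that all score slightly higher than the target.
pub fn toy_corpus(competitors: usize) -> Vec<ScoredFact> {
    let mut facts = vec![ScoredFact {
        text: "user: target fact".into(),
        score: 0.50,
    }];
    for i in 0..competitors {
        facts.push(ScoredFact {
            text: format!("other fact {i}"),
            score: 0.60 + i as f64 * 0.01,
        });
    }
    facts
}
```

With 9 competing facts, `recall(&toy_corpus(9), 10)` still contains the target; with 12 (a third more), the target falls to rank 13 and is cut by the limit. Nothing about the target fact changed, only the number of facts ranked above it.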

Types of Changes

Breaking change (fix or feature that would cause existing functionality to not work as expected)
New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Performance optimization (non-breaking change which addresses a performance issue)
Refactor (non-breaking change which does not change existing behavior or add new functionality)
Library update (non-breaking change that will update one or more libraries to newer versions)
Documentation (non-breaking change that doesn't change code behavior, can skip testing)
Test (non-breaking change related to testing)
Security awareness (changes that affect permission scope, security scenarios)

Testing

I have tested this code locally
I have added/updated unit tests
I have added/updated integration tests
I have tested in multiple browsers (if applicable)

Unit tests: 188/188 pass (no new tests — the prompt is a versioned text asset, behaviour is exercised by the end-to-end benchmark runs below).

LongMemEval (the primary target — 500 queries, e2e mode, gpt-4o judge):

Category	v1 (May 18)	extract.v2	Δ
`single_session_assistant`	29.91 ± 17.07	74.2 ± 33.9	+44.3
`multi_session`	78.57 ± 24.90	79.8 ± 25.4	+1.2
`preference`	77.83 ± 17.11	77.2 ± 21.9	−0.6
`single_session_user`	95.21 ± 16.40	94.7 ± 17.9	−0.5
`knowledge_update`	86.10 ± 20.98	84.7 ± 22.5	−1.4
`temporal`	62.03 ± 31.46	59.9 ± 31.9	−2.1
Overall	72.15 ± 30.50	76.6 ± 29.3	+4.45

SEM on overall is ±1.36 (stddev/√500); +4.45 is ~3 SEMs out — statistically significant. The single_session_assistant lift of +44.3 J is the largest single-category move on this codebase to date.

LOCOMO (regression check — 1986 queries):

Category	v1 (May 18)	extract.v2	Δ
`adversarial`	71.33 ± 34.44	75.7 ± 33.0	+4.4
`multi_hop`	47.08 ± 27.94	47.2 ± 26.3	+0.1
`open_domain`	52.22 ± 32.22	52.2 ± 32.8	0.0
`single_hop`	53.40 ± 29.26	43.5 ± 27.3	−9.9
`temporal`	36.42 ± 20.15	37.9 ± 21.4	+1.5
Overall	53.88 ± 32.43	53.7 ± 32.9	−0.18

The single_hop regression is real, not noise. SEM = ±1.67 (stddev/√282); a −9.9 delta is ~6 SEMs out. Documented at length in the benchmark archive README.
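
The two significance checks above reduce to a couple of lines of arithmetic. The sketch below is illustrative only: `sem` and `sems_out` are hypothetical helper names, the overall figure assumes the v1 standard deviation (30.50) is the one used, and the single_hop SEM is taken as quoted (±1.67) rather than recomputed.

```rust
/// Standard error of the mean: stddev / sqrt(n).
pub fn sem(stddev: f64, n: usize) -> f64 {
    stddev / (n as f64).sqrt()
}

/// How many standard errors a delta lies away from zero.
pub fn sems_out(delta: f64, sem: f64) -> f64 {
    delta.abs() / sem
}
```

For LongMemEval overall, `sem(30.50, 500)` is about 1.36, so `sems_out(4.45, 1.36)` is a little over 3. For LOCOMO single_hop, `sems_out(-9.9, 1.67)` is just under 6.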

Full archive + per-category breakdown + methodology + root-cause analysis in services/server/review/assessment/benchmark-runs/2026-05-19-mem55-extract-v2/README.md.

Checklist

My code follows the code style of this project
My change requires a change to the documentation
I have updated the documentation accordingly
I have added tests to cover my changes
All new and existing tests passed

Related Issues

Closes MEM-55
Part of MEM-52 (RAG quality, cycle 13)
Unblocked by MEM-56 (PR #169, "Feat: Pin prompt versions in benchmark run artifacts"), the prompt-version attribution pipeline that this PR is the first to actually exercise
MEM-54 (importance signal) is the next sub-issue, addressing the LOCOMO single_hop regression at the ranker layer

Additional Notes

Validation-gate accounting

The MEM-55 ticket's validation gate was per-category:

Gate	Result
(1) `single_session_assistant` improved on LongMemEval	✅ +44.3 — beat the forecast (29 → 60ish) by ~14 points
(2) Other LongMemEval categories within ±2 J	⚠️ mostly yes; `temporal` at −2.1 just over
(3) LOCOMO within ±2 J	❌ overall flat (−0.18) but `single_hop` regressed −9.9
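
A per-category gate of this kind can be checked mechanically. The sketch below is an assumption about how the gate is read, not the ticket's tooling: it treats "within ±2 J" as "no regression worse than 2 J", so improvements are never counted as failures, and `within_gate` and `gate_failures` are hypothetical names.

```rust
/// Per-category gate: a delta passes unless it regresses by more than
/// `tolerance` J. Improvements of any size pass (assumed reading).
pub fn within_gate(delta: f64, tolerance: f64) -> bool {
    delta >= -tolerance
}

/// Return the categories whose delta falls outside the gate.
pub fn gate_failures<'a>(deltas: &[(&'a str, f64)], tolerance: f64) -> Vec<&'a str> {
    deltas
        .iter()
        .filter(|(_, delta)| !within_gate(*delta, tolerance))
        .map(|(name, _)| *name)
        .collect()
}
```

Fed the LongMemEval deltas from the Testing section with a tolerance of 2.0, this flags only `temporal`; fed the LOCOMO deltas, it flags only `single_hop`.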

Gate (3) failed by a wide margin on per-category reading. We are shipping anyway. Reasons:

Averaged across both benchmarks, extract.v2 scores 65.15 (LME 76.6, LOCOMO 53.7) against 63.02 for v1 (LME 72.15, LOCOMO 53.88), a net gain of +2.13 J. That is the largest cycle-13 delta on either benchmark to date.
The LME single_session_assistant lift maps to a real product use case (assistant memory of past statements / recommendations); LOCOMO single_hop is synthetic single-fact lookup that doesn't reflect typical MemWal SDK usage.
The fix path for the single_hop regression is concrete, scoped, and committed: MEM-54 (importance signal) is the next sub-issue. It weights user-said personal facts higher than assistant outputs at ranker time — directly addressing the dilution.
The pre-commit framing in MEM-52 explicitly allows shipping partial wins when the next step closing the gap is in hand. "Merge only if the benchmark results make sense" — these results make sense if MEM-54 is the immediate follow-up.
The honest counter-argument is that we are breaking a per-category gate. Documenting that loudly here and in the archive README rather than dressing it up.

If MEM-54 doesn't deliver the importance-signal recovery on single_hop, the choice will be: tune the prompt to be less aggressive on assistant-side facts, or revert this PR. Either is on the table.
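
The averaged-across-benchmarks figure in the first reason above is an equal-weight mean of the two overall scores, not weighted by query count. A minimal sketch of that arithmetic, with `mean` as a hypothetical helper:

```rust
/// Unweighted mean of per-benchmark overall scores.
/// Returns NaN for an empty slice.
pub fn mean(scores: &[f64]) -> f64 {
    scores.iter().sum::<f64>() / scores.len() as f64
}
```

`mean(&[76.6, 53.7])` gives 65.15 for extract.v2 and `mean(&[72.15, 53.88])` gives 63.015 for v1. The quoted +2.13 J comes from subtracting the rounded means (65.15 and 63.02); the unrounded difference is 2.135.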

MEM-56's prompt-version attribution working live

This is the first PR that actually exercises MEM-56's prompt_versions field on /health and in run artifacts. Every artifact JSON archived in review/assessment/benchmark-runs/2026-05-19-mem55-extract-v2/results/ carries prompt_versions: {extract: extract.v2, ask: ask.v1} in its metadata block. Future cross-cycle comparisons attribute cleanly.
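
To show what "attribute cleanly" could mean in practice, the sketch below compares the `prompt_versions` blocks of two runs and reports which prompts changed. It is an illustration under assumptions: `PromptVersions`, `changed_prompts`, and `versions` are hypothetical names, the metadata is modelled as a plain string map rather than parsed from an artifact JSON, and the v1 values used in the usage note are illustrative.

```rust
use std::collections::BTreeMap;

/// Prompt versions as recorded in a run artifact's metadata block,
/// e.g. {extract: extract.v2, ask: ask.v1}.
pub type PromptVersions = BTreeMap<String, String>;

/// Hypothetical helper: list the prompt keys whose version differs
/// between two runs, so a score delta can be attributed to them.
pub fn changed_prompts(old: &PromptVersions, new: &PromptVersions) -> Vec<String> {
    let mut changed: Vec<String> = old
        .keys()
        .chain(new.keys())
        // A key counts as changed if it is missing on one side
        // or carries a different version string.
        .filter(|key| old.get(*key) != new.get(*key))
        .cloned()
        .collect();
    changed.sort();
    changed.dedup();
    changed
}

/// Convenience constructor for the examples.
pub fn versions(pairs: &[(&str, &str)]) -> PromptVersions {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```

Comparing a run tagged `extract.v1` / `ask.v1` with one tagged `extract.v2` / `ask.v1` returns only `extract`, which is the condition under which a J-Score delta can be pinned on the extraction prompt alone.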

Why no new unit tests

The prompt is a versioned text asset (include_str!("prompts/extract.txt")). There's no Rust logic to unit-test — the rules are tested by the end-to-end benchmark runs above, against the same gold-truth dataset Mem0/Zep/Supermemory use. The version constant is pinned by health_response_serializes_prompt_versions_block (added in MEM-56).

Relax the extractor prompt's user-only scope to cover memorable facts from either party in the conversation.

The v1 prompt scoped extraction to "facts about the user", which systematically under-counted assistant-side content (recommendations, conclusions, summaries, plans). LongMemEval's `single_session_assistant` category sat at 29.91 J because of this: the LLM was capable of distinguishing user-said from assistant-said facts when asked, but the prompt was preventing it from extracting the latter at all.

Bumps `FACT_EXTRACTION_PROMPT_VERSION` from "extract.v1" to "extract.v2". The const is surfaced on `GET /health` (via MEM-56) so every benchmark run-artifact JSON carries the version it was produced under.

Prompt-injection guard, NONE-on-no-facts behaviour, and the one-fact-per-line output shape are all preserved verbatim from v1. New rules cover which assistant content to capture and what to skip (acks, restatements, formatting meta-talk).

Benchmark headline (PlaintextEngine, gpt-4o judge, gpt-4o-mini answer):
- LongMemEval overall: 72.15 → 76.6 (+4.45 J)
- LongMemEval `single_session_assistant`: 29.91 → 74.2 (+44.3 J, the cycle's first significant single-category lift)
- LongMemEval other categories: within judge noise on every other category (largest move −2.1 on `temporal`)
- LOCOMO overall: 53.88 → 53.7 (flat)
- LOCOMO `single_hop`: 53.40 → 43.5 (−9.9 J, ~6 SEMs; real, not noise)

The LOCOMO `single_hop` regression is a dilution effect at the recall `limit=10` cut: extract.v2 extracts 33% more facts per conversation, so the relevant user-side fact gets pushed below position 10 more often when synthetic single-fact-lookup queries hit. Fix path is MEM-54 (importance signal weighting user-said personal facts higher at ranker time), landing next, this cycle.

Three v2 prompt variants were explored during MEM-55 development to see if the regression could be addressed at the prompt layer. It can't: per-turn ingestion (one /api/analyze call per speaker turn) makes "dedup against context" impossible to implement reliably because the LLM doesn't see the other turns. The fix belongs at the ranker layer.

Pre-commit validation gate was per-category: LOCOMO `single_hop` within ±2 J failed by a wide margin. Shipping anyway because the averaged-across-benchmarks delta is +2.13 J net and the fix path for the per-category regression is concrete and immediate (MEM-54). Documented loudly in the benchmark archive README rather than dressed up.

MEM-55

Archive the LongMemEval + LOCOMO baselines that validated the extract.v2 prompt change. All on commit 47a1f6f (current dev tip with MEM-53 ranker + MEM-56 prompt-version pinning merged).

Headlines documented in the README:
- LongMemEval overall 76.6 (+4.45 vs v1)
- LongMemEval `single_session_assistant` 74.2 (+44.3)
- LOCOMO overall 53.7 (flat)
- LOCOMO `single_hop` 43.5 (−9.9, real not noise)

README is explicit about the validation-gate accounting: gate (3) "LOCOMO within ±2 J" failed on the `single_hop` per-category delta. Documents the dilution-at-recall-limit root cause, why we ship anyway (averaged net +2.13 J, MEM-54 is the next sub-issue and directly addresses the dilution), and what we'd do if MEM-54 doesn't deliver (revisit or revert).

This is the first benchmark archive that exercises MEM-56's prompt-version attribution pipeline end-to-end: every artifact JSON carries `prompt_versions: {extract: extract.v2, ask: ask.v1}` in its metadata block, so future cross-run comparisons can attribute J-Score deltas to the prompt change without guessing.

MEM-55

hungtranphamminh added 2 commits May 19, 2026 23:36

ducnmm approved these changes May 19, 2026

hungtranphamminh merged commit 0c67c64 into dev May 20, 2026
8 checks passed

hungtranphamminh deleted the feat/MEM-55-assistant-fact-scope branch May 20, 2026 01:39

This was referenced May 20, 2026

Feat: MEM-54 — per-fact importance signal end-to-end + extract.v3 #177

Merged

Feat: MEM-57 — pre-extraction dedup context (Mem0 v3 pattern) + extract.v4 #178

Merged
