Feat: extract.v2 - relax fact-extraction scope to both parties #173

Merged
hungtranphamminh merged 2 commits into dev from feat/MEM-55-assistant-fact-scope on May 20, 2026

@hungtranphamminh
Collaborator

Summary

Why

LongMemEval's single_session_assistant category sat at 29.91 J — by far the worst category on our benchmark scorecard. Root cause: the extract.v1 prompt explicitly scoped fact extraction to "facts about the user", systematically under-counting assistant-side content (recommendations, conclusions, summaries, plans). The LLM was capable of distinguishing user-said from assistant-said facts when asked properly; the prompt was preventing it from extracting the latter at all.

This is a real product gap. MemWal users build assistants that need to remember what the assistant recommended in previous conversations — a category-defining MemWal use case where we were scoring 30 J.

What

  • services/prompts/extract.txt: relaxed scope from "facts about the user" to "memorable facts from either party". New rules for what assistant content to capture (recommendations, plans agreed, factual claims, summaries) and what to skip (acks, restatements, formatting meta-talk).
  • services/extractor.rs: bumped FACT_EXTRACTION_PROMPT_VERSION from "extract.v1" to "extract.v2". Surfaced on /health via MEM-56, so every benchmark artifact JSON now records extract.v2 in its metadata block.
  • Prompt-injection guard, NONE-on-no-facts behaviour, and the one-fact-per-line output shape preserved verbatim from v1.
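The output contract that v2 keeps from v1 (one fact per line, NONE when a turn has nothing worth storing) can be consumed in a few lines of Rust. This is a minimal sketch, not the code in services/extractor.rs: `parse_extracted_facts` is a hypothetical helper, and trimming whitespace, skipping blank lines, and matching NONE case-insensitively are assumptions about how lenient the parser should be.

```rust
/// Hypothetical helper: turn the raw LLM reply into a list of facts.
/// Assumes the contract described above: one fact per line, or the
/// single token NONE when there is nothing memorable in the turn.
pub fn parse_extracted_facts(raw: &str) -> Vec<String> {
    let trimmed = raw.trim();

    // NONE-on-no-facts: treat the sentinel as an empty result.
    // Case-insensitive matching is an assumption, not confirmed behaviour.
    if trimmed.eq_ignore_ascii_case("NONE") {
        return Vec::new();
    }

    trimmed
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty()) // tolerate blank lines between facts
        .map(String::from)
        .collect()
}
```

Because the output shape did not change between v1 and v2, a parser like this would not need to change with the version bump; only the number and kind of lines it receives does.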

Solution

The prompt change itself is small. The honest part of this PR is what comes with it:

Three v2 iterations were explored during MEM-55 development to find a single prompt that maximises LongMemEval single_session_assistant without regressing LOCOMO single_hop. The result is structural — per-turn ingestion (one /api/analyze call per speaker turn) makes "dedup against context" impossible to implement reliably at the prompt layer, because the LLM doesn't see the other turns. The prompt-level Pareto frontier between LME assistant-side and LOCOMO user-side facts is real and can't be talked around.

We're shipping the first v2 variant (relaxed scope, no internal dedup rules) because it gives the largest net delta when averaged across both benchmarks. The LOCOMO single_hop regression that comes with it is a dilution-at-recall-limit problem: extract.v2 extracts 33% more facts per conversation, and at limit=10 the user-side fact gets pushed out by competing assistant-side facts. The fix belongs at the ranker layer, not the prompt layer. That's MEM-54 (importance signal), which is the next sub-issue this cycle.
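The dilution effect can be illustrated with a toy recall step. This is a sketch under loud assumptions: `Fact`, `recall_top_k`, `toy_store`, and every score below are invented for illustration, and the real ranker is not shown in this PR. It models the worst case, where every additional extracted fact outscores the user-side fact the query needs.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Fact {
    pub text: String,
    pub score: f64, // hypothetical relevance score assigned by the ranker
}

/// Return the `limit` highest-scoring facts, best first.
pub fn recall_top_k(mut facts: Vec<Fact>, limit: usize) -> Vec<Fact> {
    facts.sort_by(|a, b| b.score.total_cmp(&a.score));
    facts.truncate(limit);
    facts
}

/// Toy store: `competitors` facts that all outscore one user-side target fact.
pub fn toy_store(competitors: usize) -> Vec<Fact> {
    let mut facts: Vec<Fact> = (0..competitors)
        .map(|i| Fact {
            text: format!("competing fact {i}"),
            score: 0.60 + 0.01 * i as f64,
        })
        .collect();
    facts.push(Fact {
        text: "user-side target fact".to_string(),
        score: 0.50,
    });
    facts
}

/// Does the target fact survive the recall cut?
pub fn target_recalled(competitors: usize, limit: usize) -> bool {
    recall_top_k(toy_store(competitors), limit)
        .iter()
        .any(|f| f.text == "user-side target fact")
}
```

With 9 competitors the target sits at position 10 and survives `limit=10`; with 12 competitors (33% more) it falls to position 13 and is cut, even though its own score never changed. That is why the prompt cannot fix this on its own: the loss happens at ranking time.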

Types of Changes

  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Performance optimization (non-breaking change which addresses a performance issue)
  • Refactor (non-breaking change which does not change existing behavior or add new functionality)
  • Library update (non-breaking change that will update one or more libraries to newer versions)
  • Documentation (non-breaking change that doesn't change code behavior, can skip testing)
  • Test (non-breaking change related to testing)
  • Security awareness (changes that affect permission scope, security scenarios)

Testing

  • I have tested this code locally
  • I have added/updated unit tests
  • I have added/updated integration tests
  • I have tested in multiple browsers (if applicable)

Unit tests: 188/188 pass (no new tests — the prompt is a versioned text asset, behaviour is exercised by the end-to-end benchmark runs below).

LongMemEval (the primary target — 500 queries, e2e mode, gpt-4o judge):

| Category | v1 (May 18) | extract.v2 | Δ (J) |
| --- | --- | --- | --- |
| single_session_assistant | 29.91 ± 17.07 | 74.2 ± 33.9 | +44.3 |
| multi_session | 78.57 ± 24.90 | 79.8 ± 25.4 | +1.2 |
| preference | 77.83 ± 17.11 | 77.2 ± 21.9 | −0.6 |
| single_session_user | 95.21 ± 16.40 | 94.7 ± 17.9 | −0.5 |
| knowledge_update | 86.10 ± 20.98 | 84.7 ± 22.5 | −1.4 |
| temporal | 62.03 ± 31.46 | 59.9 ± 31.9 | −2.1 |
| Overall | 72.15 ± 30.50 | 76.6 ± 29.3 | +4.45 |

SEM on overall is ±1.36 (stddev/√500); +4.45 is ~3 SEMs out — statistically significant. The single_session_assistant lift of +44.3 J is the largest single-category move on this codebase to date.
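The significance check above is plain arithmetic, sketched here for reproducibility. The function names are illustrative, and this is a rough rule of thumb rather than a formal paired test: it compares the delta against the SEM of a single run, which is how the figures in this description were derived.

```rust
/// Standard error of the mean for `n` per-query scores with
/// standard deviation `stddev`.
pub fn sem(stddev: f64, n: usize) -> f64 {
    stddev / (n as f64).sqrt()
}

/// How many SEMs a score delta is away from zero.
pub fn delta_in_sems(delta: f64, stddev: f64, n: usize) -> f64 {
    delta / sem(stddev, n)
}
```

Plugging in the v1 overall standard deviation, `sem(30.50, 500)` comes out at about 1.36, and `delta_in_sems(4.45, 30.50, 500)` at about 3.3, matching the "~3 SEMs" quoted above.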

LOCOMO (regression check — 1986 queries):

| Category | v1 (May 18) | extract.v2 | Δ (J) |
| --- | --- | --- | --- |
| adversarial | 71.33 ± 34.44 | 75.7 ± 33.0 | +4.4 |
| multi_hop | 47.08 ± 27.94 | 47.2 ± 26.3 | +0.1 |
| open_domain | 52.22 ± 32.22 | 52.2 ± 32.8 | 0.0 |
| single_hop | 53.40 ± 29.26 | 43.5 ± 27.3 | −9.9 |
| temporal | 36.42 ± 20.15 | 37.9 ± 21.4 | +1.5 |
| Overall | 53.88 ± 32.43 | 53.7 ± 32.9 | −0.18 |

The single_hop regression is real, not noise. SEM = ±1.67 (stddev/√282); a −9.9 delta is ~6 SEMs out. Documented at length in the benchmark archive README.

Full archive + per-category breakdown + methodology + root-cause analysis in services/server/review/assessment/benchmark-runs/2026-05-19-mem55-extract-v2/README.md.

Checklist

  • My code follows the code style of this project
  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed

Related Issues

  • Closes MEM-55
  • Part of MEM-52 (RAG quality, cycle 13)
  • Unblocked by MEM-56 (PR #169, "Feat: Pin prompt versions in benchmark run artifacts", the prompt-version attribution pipeline that this PR is the first to actually exercise)
  • MEM-54 (importance signal) is the next sub-issue, addressing the LOCOMO single_hop regression at the ranker layer

Additional Notes

Validation-gate accounting

The MEM-55 ticket's validation gate was per-category:

| Gate | Result |
| --- | --- |
| (1) single_session_assistant improved on LongMemEval | ✅ +44.3, beating the forecast (29 → 60ish) by ~14 points |
| (2) Other LongMemEval categories within ±2 J | ⚠️ mostly yes; temporal at −2.1 is just over |
| (3) LOCOMO within ±2 J | ❌ overall flat (−0.18) but single_hop regressed −9.9 |
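For reference, the per-category reading of gates (2) and (3) can be expressed as a small check over the deltas reported in the tables above. This is a sketch, not tooling from the repo: `failing_categories` is a hypothetical helper, and treating "within ±2 J" as "no category regresses by more than 2 J" (so that improvements such as LOCOMO adversarial at +4.4 still pass) is an assumption about the gate's intent.

```rust
/// Per-category deltas (extract.v2 minus v1), copied from the tables above.
pub const LME_OTHER_DELTAS: [(&str, f64); 5] = [
    ("multi_session", 1.2),
    ("preference", -0.6),
    ("single_session_user", -0.5),
    ("knowledge_update", -1.4),
    ("temporal", -2.1),
];

pub const LOCOMO_DELTAS: [(&str, f64); 5] = [
    ("adversarial", 4.4),
    ("multi_hop", 0.1),
    ("open_domain", 0.0),
    ("single_hop", -9.9),
    ("temporal", 1.5),
];

/// Hypothetical gate: a category fails if it regressed by more than
/// `tolerance` J. Improvements always pass (an assumption, see lead-in).
pub fn failing_categories<'a>(deltas: &[(&'a str, f64)], tolerance: f64) -> Vec<&'a str> {
    deltas
        .iter()
        .filter(|(_, delta)| *delta < -tolerance)
        .map(|(name, _)| *name)
        .collect()
}
```

With a tolerance of 2.0, the LongMemEval list flags only `temporal` and the LOCOMO list flags only `single_hop`, which matches the ⚠️ and ❌ rows.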

Gate (3) failed by a wide margin on per-category reading. We are shipping anyway. Reasons:

  • Averaged across both benchmarks, extract.v2 is +2.13 J net (LME 76.6, LOCOMO 53.7) vs v1 (72.15, 53.88) → 65.15 vs 63.02. The largest cycle-13 delta on either benchmark to date.
  • The LME single_session_assistant lift maps to a real product use case (assistant memory of past statements / recommendations); LOCOMO single_hop is synthetic single-fact lookup that doesn't reflect typical MemWal SDK usage.
  • The fix path for the single_hop regression is concrete, scoped, and committed: MEM-54 (importance signal) is the next sub-issue. It weights user-said personal facts higher than assistant outputs at ranker time — directly addressing the dilution.
  • The pre-commit framing in MEM-52 explicitly allows shipping partial wins when the next step closing the gap is in hand. "Merge only if the benchmark results make sense" — these results make sense if MEM-54 is the immediate follow-up.
  • The honest counter-argument is that we are breaking a per-category gate. Documenting that loudly here and in the archive README rather than dressing it up.

If MEM-54 doesn't deliver the importance-signal recovery on single_hop, the choice will be: tune the prompt to be less aggressive on assistant-side facts, or revert this PR. Either is on the table.

MEM-56's prompt-version attribution working live

This is the first PR that actually exercises MEM-56's prompt_versions field on /health and in run artifacts. Every artifact JSON archived in review/assessment/benchmark-runs/2026-05-19-mem55-extract-v2/results/ carries prompt_versions: {extract: extract.v2, ask: ask.v1} in its metadata block. Future cross-cycle comparisons attribute cleanly.

Why no new unit tests

The prompt is a versioned text asset (include_str!("prompts/extract.txt")). There's no Rust logic to unit-test — the rules are tested by the end-to-end benchmark runs above, against the same gold-truth dataset Mem0/Zep/Supermemory use. The version constant is pinned by health_response_serializes_prompt_versions_block (added in MEM-56).
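A rough sketch of what "pinning the version constant" amounts to is shown below. Only `FACT_EXTRACTION_PROMPT_VERSION` and the values `extract.v2` / `ask.v1` come from this PR; `ASK_PROMPT_VERSION`, `prompt_versions_block`, and the exact JSON shape are assumptions, and the real handler presumably uses a serializer rather than hand-built strings.

```rust
/// Bumped whenever the extract prompt changes in a way that can move scores.
pub const FACT_EXTRACTION_PROMPT_VERSION: &str = "extract.v2";

/// Hypothetical constant name; only the value "ask.v1" appears in this PR.
pub const ASK_PROMPT_VERSION: &str = "ask.v1";

/// Hand-rolled JSON for illustration only. The field layout is an
/// assumption modelled on the metadata block quoted in this description.
pub fn prompt_versions_block() -> String {
    format!(
        r#"{{"prompt_versions":{{"extract":"{}","ask":"{}"}}}}"#,
        FACT_EXTRACTION_PROMPT_VERSION, ASK_PROMPT_VERSION
    )
}
```

A test in the spirit of `health_response_serializes_prompt_versions_block` would then assert that the serialized block contains the expected version strings, so a prompt edit without a version bump is at least visible in review.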

Relax the extractor prompt's user-only scope to cover memorable facts
from either party in the conversation. The v1 prompt scoped extraction
to "facts about the user", which systematically under-counted
assistant-side content (recommendations, conclusions, summaries,
plans). LongMemEval's `single_session_assistant` category sat at
~29.91 J because of this — the LLM was capable of distinguishing
user-said from assistant-said facts when asked, but the prompt was
preventing it from extracting the latter at all.

Bumps `FACT_EXTRACTION_PROMPT_VERSION` from "extract.v1" to
"extract.v2". The const is surfaced on `GET /health` (via MEM-56)
so every benchmark run-artifact JSON carries the version it was
produced under.

Prompt-injection guard, NONE-on-no-facts behaviour, and the
one-fact-per-line output shape are all preserved verbatim from v1.
New rules cover which assistant content to capture and which to
skip (acks, restatements, formatting meta-talk).

Benchmark headline (PlaintextEngine, gpt-4o judge, gpt-4o-mini answer):

- LongMemEval overall: 72.15 → 76.6  (+4.45 J)
- LongMemEval `single_session_assistant`: 29.91 → 74.2  (+44.3 J,
  the cycle's first significant single-category lift)
- LongMemEval other categories: all within judge noise (largest
  move −2.1 on `temporal`)
- LOCOMO overall: 53.88 → 53.7  (flat)
- LOCOMO `single_hop`: 53.40 → 43.5  (−9.9 J, ~6 SEMs — real,
  not noise)

The LOCOMO `single_hop` regression is a dilution effect at the
recall `limit=10` cut: extract.v2 extracts +33% more facts per
conversation, so the relevant user-side fact gets pushed below
position 10 more often when synthetic single-fact-lookup queries
hit. Fix path is MEM-54 (importance signal weighting user-said
personal facts higher at ranker time) — landing next, this cycle.

Three v2 prompt variants were explored during MEM-55 development
to see if the regression could be addressed at the prompt layer.
It can't — per-turn ingestion (one /api/analyze call per speaker
turn) makes "dedup against context" impossible to implement
reliably because the LLM doesn't see the other turns. The fix
belongs at the ranker layer.

Pre-commit validation gate was per-category: LOCOMO `single_hop`
within ±2 J failed by a wide margin. Shipping anyway because the
averaged-across-benchmarks delta is +2.13 J net and the fix path
for the per-category regression is concrete and immediate
(MEM-54). Documented loudly in the benchmark archive README
rather than dressed up.

MEM-55
Archive the LongMemEval + LOCOMO baselines that validated the
extract.v2 prompt change. All on commit 47a1f6f (current dev tip
with MEM-53 ranker + MEM-56 prompt-version pinning merged).

Headlines documented in the README:

- LongMemEval overall 76.6 (+4.45 vs v1)
- LongMemEval `single_session_assistant` 74.2 (+44.3)
- LOCOMO overall 53.7 (flat)
- LOCOMO `single_hop` 43.5 (−9.9, real not noise)

README is explicit about the validation-gate accounting: gate (3)
"LOCOMO within ±2 J" failed on the `single_hop` per-category
delta. Documents the dilution-at-recall-limit root cause, why we
ship anyway (averaged net +2.13 J, MEM-54 is the next sub-issue
and directly addresses the dilution), and what we'd do if MEM-54
doesn't deliver (revisit or revert).

This is the first benchmark archive that exercises MEM-56's
prompt-version attribution pipeline end-to-end: every artifact
JSON carries `prompt_versions: {extract: extract.v2, ask: ask.v1}`
in its metadata block, so future cross-run comparisons can
attribute J-Score deltas to the prompt change without guessing.

MEM-55
@hungtranphamminh hungtranphamminh merged commit 0c67c64 into dev May 20, 2026
8 checks passed
@hungtranphamminh hungtranphamminh deleted the feat/MEM-55-assistant-fact-scope branch May 20, 2026 01:39