
fix: temporal metadata in search results + eval flow #28

Closed

7xuanlu wants to merge 20 commits into feature/fullpipeline-eval from feature/temporal-metadata

Conversation

7xuanlu (Owner) commented Apr 27, 2026

Summary

Threads benchmark session dates through Origin's eval pipeline so the LLM-judged accuracy harness can reason about temporal questions (LongMemEval-TR was 42.1%, LoCoMo-temporal was 1.6% — the worst categories on each benchmark).

  • SearchResult exposes created_at: i64 populated from chunks.created_at. #[derive(Default)] added so downstream MIT consumers (origin-mcp) can fill the new field via struct-update syntax (..Default::default()).
  • New crates/origin-core/src/eval/dates.rs module: parse_lme_date ("2023/04/10 (Mon) 23:07"), parse_locomo_date ("1:56 pm on 8 May, 2023"), format_ymd, and a seed_last_modified helper that deduplicates the 13 verbatim seed-site copies (sketched after this list).
  • LongMemEvalMemory and LocomoMemory carry session_date: Option<String> populated from haystack_dates[i] and conversation.session_N_date_time respectively.
  • All 13 RawDocument seed sites in longmemeval.rs, locomo.rs, answer_quality.rs, and pipeline.rs now use seed_last_modified(mem.session_date.as_deref(), parser) instead of now().
  • generate_e2e_answers_for_question rewrites flat + structured context as "On YYYY-MM-DD: <content>" lines and accepts question_date: Option<&str> (LME passes Some(sample.question_date), LoCoMo passes None). Date-prefix preamble in the user prompt; "today is X" anchor in the system prompt for LME.
  • Adversarial review before merge surfaced two fixes, both landed pre-merge: the missing Default derive and the question_date plumbing.
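
A minimal sketch of what the dates.rs helpers could look like, assuming a chrono dependency and the UTC interpretation flagged under "Out of scope"; the names come from this PR, the bodies are illustrative:

```rust
// Hypothetical reconstruction of eval/dates.rs; not the actual implementation.
use chrono::NaiveDateTime;

/// Parse a LongMemEval session date like "2023/04/10 (Mon) 23:07"
/// into a unix timestamp (assumed UTC).
pub fn parse_lme_date(s: &str) -> Option<i64> {
    NaiveDateTime::parse_from_str(s, "%Y/%m/%d (%a) %H:%M")
        .ok()
        .map(|dt| dt.and_utc().timestamp())
}

/// Parse a LoCoMo session date like "1:56 pm on 8 May, 2023" (assumed UTC).
pub fn parse_locomo_date(s: &str) -> Option<i64> {
    NaiveDateTime::parse_from_str(s, "%-I:%M %P on %-d %B, %Y")
        .ok()
        .map(|dt| dt.and_utc().timestamp())
}

/// Format a unix timestamp as YYYY-MM-DD for the "On YYYY-MM-DD:" lines.
pub fn format_ymd(ts: i64) -> String {
    chrono::DateTime::from_timestamp(ts, 0)
        .map(|dt| dt.format("%Y-%m-%d").to_string())
        .unwrap_or_default()
}

/// Shared fallback used at every RawDocument seed site: parse the session
/// date if present, otherwise stamp the document with "now".
pub fn seed_last_modified(date: Option<&str>, parser: fn(&str) -> Option<i64>) -> i64 {
    date.and_then(parser)
        .unwrap_or_else(|| chrono::Utc::now().timestamp())
}
```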

Side fixes (CI / pre-push hygiene)

  • Strip the cargo llvm-cov 90% coverage gate from pre-push. The instrumented rebuild took 5-15 min, overloaded memory on macOS, and main CI didn't even run it. Pre-push now runs cargo clippy --workspace --all-targets + cargo test --workspace --lib --quiet (~30s).
  • Add .github/workflows/coverage.yml — non-blocking informational coverage report on PRs (continue-on-error, scoped to origin-core + origin-server, skipping the Tauri app).
  • CLAUDE.md now documents the L1-L8 local/CI test responsibility matrix so future sessions don't re-litigate this.

Test plan

  • cargo test --workspace --lib — 1052 lib tests pass (origin-types: 17, origin-core: 959, origin-server: 40, origin lib: 76; 21 ignored are network-restricted)
  • cargo clippy --workspace --all-targets -- -D warnings — clean
  • Per-task code review (spec compliance + code quality, both Opus) for each of Tasks 1, 2, 3, 4+5, 7
  • Final adversarial integrated review — found 2 critical issues, both fixed in last cleanup commit
  • CI lane on this PR (clippy + workspace tests + frontend tests)
  • Coverage workflow lane on this PR (informational)
  • Local-only manual: ANTHROPIC_API_KEY=... cargo test -p origin --test eval_harness judge_e2e_batch -- --ignored — to measure the LME-TR / LoCoMo-temporal lift after this lands
  • Local-only manual: cargo test -p origin --test eval_harness benchmark_longmemeval_pipeline -- --ignored — full pipeline through the dated context

Out of scope (follow-up)

  • Wiring the dormant judge::lme_answer_prompt into generate_e2e_answers_for_question for full per-task-type prompt branches (today they share one generic dated prompt).
  • Tightening the timezone assumption in parse_lme_date / parse_locomo_date if dataset spec says non-UTC.
  • Adding an assert_contains test that verifies the user prompt actually contains "On YYYY-MM-DD:" lines (today only format_ymd's output is unit-tested); one possible shape is sketched below.
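
For that last follow-up item, one possible shape, using a stand-in formatter rather than the real generate_e2e_answers_for_question:

```rust
/// Stand-in for the dated-context formatter (hypothetical helper; the real
/// logic lives inside generate_e2e_answers_for_question).
fn format_dated_context(lines: &[(&str, &str)]) -> String {
    lines
        .iter()
        .map(|(ymd, content)| format!("On {ymd}: {content}"))
        .collect::<Vec<_>>()
        .join("\n")
}

#[test]
fn user_prompt_contains_dated_lines() {
    let ctx = format_dated_context(&[("2023-04-10", "met Alice at the gym")]);
    assert!(
        ctx.contains("On 2023-04-10:"),
        "dated lines must survive prompt assembly"
    );
}
```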

🤖 Generated with Claude Code

7xuanlu and others added 19 commits April 26, 2026 18:30
Add batch-based full-pipeline eval that runs Origin's complete enrichment
pipeline (entity extraction + concept distillation) then generates answers
via Anthropic Batch API (50% cheaper, parallel processing).

Three-phase architecture:
1. Enrich on-device (free): seed DB, extract entities, distill concepts
2. Batch generate (cheap): submit all answer prompts in one API batch
3. Merge (instant): combine batch results + cached flat answers

Key features:
- Dual-LLM: on-device for enrichment, API for answers (saves cost)
- Cache reuse: existing flat answers (lme_answered_haiku.json, etc.)
  converted to JudgmentTuples, skipping redundant API calls
- Resume support: skips already-processed conversations/questions
- Cost cap: configurable via EVAL_COST_CAP env (default $5)

New harness tests:
- generate_fullpipeline_locomo: all 10 convs, 1540 questions
- generate_fullpipeline_lme: all 500 questions
- judge_fullpipeline_locomo/lme: Batch API judging

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
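
The cost cap mentioned above plausibly reduces to a small env read; a sketch under that assumption (not the actual code):

```rust
/// Eval spend ceiling in dollars, read from EVAL_COST_CAP (default $5 here;
/// a later commit in this branch raises the default to $10).
fn eval_cost_cap() -> f64 {
    std::env::var("EVAL_COST_CAP")
        .ok()
        .and_then(|v| v.parse::<f64>().ok())
        .unwrap_or(5.0)
}
```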
Prefer Qwen3.5-9B for on-device enrichment: better entity extraction and
concept distillation quality than 4B. Falls back to 4B if 9B is
unavailable, then to the API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The model registry uses "qwen3.5-9b" with a dot. The typo caused
silent fallback to 4B.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: when the LLM returns a single JSON object like
`{"entities": [...], "observations": [...]}`, `extract_json_array`
found the inner `[` from `"entities": [` before the `{` and extracted
a garbage substring. The object-wrapping fallback never triggered.

This silently broke all entity extraction in the refinery's
per-memory path (extract_single_memory_entities) since PR #5.
The knowledge graph, concepts, and enrichment pipeline were no-ops.

Fix: check if `{` appears before `[` (indicating a single object
response), validate extracted JSON actually parses, and fall through
to the object-wrapping path when array extraction produces invalid
JSON.

Regression tests added for both single-object and array responses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
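
A hedged sketch of the fixed extraction logic this commit describes, using serde_json for the validity check; the real extract_json_array may differ in shape:

```rust
use serde_json::Value;

/// Extract a JSON array from raw LLM output. If `{` appears before `[`,
/// the response is a single object (e.g. {"entities": [...]}) and the
/// first `[` is an inner field, so skip straight to object wrapping.
fn extract_json_array(raw: &str) -> Option<String> {
    let brace = raw.find('{');
    let bracket = raw.find('[');
    if let Some(k) = bracket {
        if brace.map_or(true, |b| k < b) {
            if let Some(end) = raw.rfind(']') {
                if end > k {
                    let candidate = &raw[k..=end];
                    // Validate before accepting; garbage substrings fall
                    // through to object wrapping instead of propagating.
                    if serde_json::from_str::<Value>(candidate).is_ok() {
                        return Some(candidate.to_string());
                    }
                }
            }
        }
    }
    // Object-wrapping fallback: turn {...} into [{...}].
    let start = brace?;
    let end = raw.rfind('}')?;
    if end <= start {
        return None;
    }
    let obj = &raw[start..=end];
    serde_json::from_str::<Value>(obj).ok()?;
    Some(format!("[{obj}]"))
}
```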
Two changes that make the eval mirror production:

1. Single DB per benchmark (no per-conversation isolation):
   LoCoMo seeds all 10 conversations into one DB, LME seeds all 500
   questions' memories into one DB. No domain filter on search.
   This matches production where Origin has one DB per user.

2. Mark memories as enriched after entity extraction:
   find_distillation_clusters requires enrichment_steps rows
   (production writes these in the async post-ingest flow).
   Without them, 0 concepts were produced even with 176 entities.
   Added mark_all_memories_enriched_for_eval() bulk helper.

The older eval pipelines (locomo.rs, longmemeval.rs, context_path.rs,
pipeline.rs) still use per-conversation DBs. Migration tracked as
follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Probe results (LoCoMo observations):
- 4B (Qwen3-4B): batch_size=1 only. Cannot follow multi-observation
  extraction instructions. Returns single object for first observation.
- 9B (Qwen3.5-9B): batch_size=5 works (11 entities, 6 observations).
  batch_size=10 hits 30s timeout (model can generate but too slow).

Conclusion: on-device enrichment is 2-3 hours for LoCoMo (2531 obs).
Batch API (Haiku) is ~5 min for ~$1. Cloud path is the clear winner
for eval; on-device is for production (one memory at a time).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace sequential on-device entity extraction (~2 hours) with
Anthropic Batch API (~5 min, ~$1). Collects all extraction prompts,
submits one batch, parses results back into DB.

Probe results showed on-device limits:
- 4B: batch_size=1 only (can't follow multi-obs instructions)
- 9B: batch_size=5 max (30s timeout at 10)
- Batch API: unlimited parallel, no timeout

Concept distillation stays on-device (few calls, quality-sensitive).
Entity extraction + answer generation + judging all go through API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete pipeline: entity extraction, title enrichment, concept
synthesis, and concept titles all go through Anthropic Batch API.
No on-device LLM dependency for eval. ~$2 total, ~15 min.

Non-LLM phases (embedding, clustering, search) stay on-device,
testing Origin's actual infrastructure quality.

No production code modified.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Title enrichment: expanded query to match generic "X session N" titles
(not just truncated ones). All 184 LoCoMo observations get semantic titles.

Concept hallucination check: compared output against source_ids (just IDs)
instead of actual memory content. Fixed to use cluster.contents.

Smoke test results (1 conversation):
- 472 entities, 184 titles enriched, 3 concepts distilled
- Structured context +144 tokens over flat (real concept content)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tempfile::tempdir() deletes on exit/panic, losing all enrichment
work (~$3 in Batch API calls). Now uses a stable path alongside
the output file (e.g. fullpipeline_locomo_tuples.db/).

On re-run, checks if DB has data. If yes, skips enrichment and
goes straight to context collection + answer generation.

Also raised default cost cap from $5 to $10 (answer generation
for full LoCoMo exceeds $5).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous check (memory_count > 0) would skip enrichment on partial
data. Now checks enriched_memory_count == memory_count. If partial,
clears DB and starts fresh. If complete, skips enrichment.

Tested: forced failure via cost cap, re-run correctly detects
"0/2531 enriched" and starts fresh.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
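
The completeness check reduces to a comparison like this (illustrative names; the real code reads the counts from the eval DB):

```rust
/// Decide what to do with an existing eval DB on re-run.
enum ResumeAction {
    /// Everything enriched: reuse the DB, skip straight to answers.
    SkipEnrichment,
    /// Empty or partial (e.g. "0/2531 enriched"): clear and start fresh.
    ClearAndRestart,
}

fn resume_action(enriched_memory_count: usize, memory_count: usize) -> ResumeAction {
    // The old `memory_count > 0` check skipped enrichment even on partial
    // data; requiring enriched == total closes that hole.
    if memory_count > 0 && enriched_memory_count == memory_count {
        ResumeAction::SkipEnrichment
    } else {
        ResumeAction::ClearAndRestart
    }
}
```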
Adds an i64 created_at field to SearchResult and populates it from
chunks.created_at in row_to_search_result. Foundation for date filtering
and date-aware eval prompts. Existing last_modified semantics unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Followup to the prior commit's struct change — origin-types' own
search_result_serializes test still constructs SearchResult literally
and needed the new field.
Carries the per-session haystack_dates through LongMemEvalMemory and
into RawDocument.last_modified during retrieve_for_accuracy_eval.
Adds parse_lme_date helper and round-trip tests.

Foundation for date-aware temporal-reasoning prompts (Task 4).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds parse_locomo_date for the dataset's '1:56 pm on 8 May, 2023' format
and threads conversation.session_N_date_time through LocomoMemory into
RawDocument.last_modified at every seed site. Mirrors Task 2's LME
treatment.
Threads session dates into the 5 remaining RawDocument seed sites in
answer_quality.rs and pipeline.rs (covers both LoCoMo and LongMemEval
E2E and pipeline runners). Adds eval::shared::format_ymd and rewrites
generate_e2e_answers_for_question's context to emit 'On YYYY-MM-DD: ...'
lines so the LLM judge can reason about temporal questions.

Targets the temporal-reasoning weakness on both benchmarks (LME-TR 42.1%,
LoCoMo-temporal 1.6% pre-change).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dified

Code-review followup:
- Moves parse_lme_date, parse_locomo_date, and format_ymd to a single
  dates.rs module instead of three different homes.
- Replaces the 13 verbatim copies of the
  '.as_deref().and_then(parser).unwrap_or_else(|| now())' chain with
  one seed_last_modified helper.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two adversarial-review followups before merge:

1. SearchResult was missing a Default derive; downstream MIT crates
   (origin-mcp) construct it field-by-field and would break when
   origin-types 0.1.5 publishes with the new created_at i64. Adding
   the derive lets them spread defaults.
2. LongMemEval per-question dates (sample.question_date) were not
   reaching the LLM. The seeded memories now carry session dates
   thanks to earlier commits, but without a 'today' anchor the LLM
   cannot ground relative phrases ('yesterday', 'a week ago') in
   the question. generate_e2e_answers_for_question now accepts
   Option<&str> and prepends 'The question was asked on X.' to the
   system prompt. LoCoMo passes None (no per-question date in dataset).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
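
Illustratively, the question_date plumbing amounts to an optional system-prompt prefix (assumed shape; the prompt string is quoted from this commit):

```rust
/// Prepend the per-question "today" anchor when the dataset provides one.
/// LME passes Some(sample.question_date); LoCoMo passes None.
fn anchor_system_prompt(base: &str, question_date: Option<&str>) -> String {
    match question_date {
        Some(date) => format!("The question was asked on {date}. {base}"),
        None => base.to_string(),
    }
}
```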
Pre-push previously enforced a 90% cargo llvm-cov gate. That ran an
instrumented rebuild of the whole workspace including the Tauri app,
peaking at 8-16GB RSS and 5-15min wall time. CI doesn't even run that
check, so the local gate added friction without upstream protection.

Three changes drawn from the long-run pattern:

1. .githooks/pre-push now runs only the fast checks:
   cargo clippy --workspace + cargo test --workspace + pnpm vitest --bail.
   No instrumented rebuild, no coverage gate, no memory pressure.
2. .github/workflows/coverage.yml posts coverage as an informational PR
   comment (continue-on-error: true) for origin-core + origin-server.
   The Tauri app crate is excluded; its surface is GUI proxies that are
   untestable without a runtime.
3. CLAUDE.md documents the L1-L8 matrix: who gates correctness (pre-push,
   CI), who measures quality (coverage workflow, manual scripts/coverage.sh),
   what stays laptop-only (GPU evals, Anthropic batch judge), and why.

The gist: gates block on pass/fail, never on percentages. The slowest
pre-push command sets push latency, so keep it under 60s. GPU/API work is
human-paced. Mirror CI; don't duplicate it slowly.
7xuanlu force-pushed the feature/temporal-metadata branch from a061561 to 199ad89 on April 27, 2026 07:03
7xuanlu changed the base branch from main to feature/fullpipeline-eval on April 27, 2026 07:03
Mirror the LoCoMo _api pattern for LongMemEval: ClaudeCliProvider::haiku()
for answers (no API key, uses Max plan via OAuth) and judge_with_claude_model
'haiku' for judging. Same answer/judge model on both sides keeps
LME and LoCoMo numbers comparable.

Both new test entry points exercise the dated-context formatter and the
question_date system-prompt anchor introduced earlier in this branch:
  - generate_e2e_context_tuples_longmemeval_api
  - judge_e2e_context_longmemeval_api_haiku
7xuanlu force-pushed the feature/fullpipeline-eval branch from 4e10d40 to 4bb6b48 on April 28, 2026 03:59
7xuanlu deleted the branch feature/fullpipeline-eval on April 28, 2026 04:12
7xuanlu closed this on Apr 28, 2026
