fix: full-pipeline eval + source overlap concept gate + 3 production bugs#29

Merged

7xuanlu merged 22 commits into main from feature/fullpipeline-eval on Apr 28, 2026
Conversation

7xuanlu (Owner) commented on Apr 27, 2026

Summary

  • Full-pipeline eval infrastructure: entity extraction, title enrichment, concept distillation, answer generation, and judging all via Anthropic Batch API
  • Single-DB architecture per benchmark (mirrors production: one DB per user)
  • Found and fixed 3 production bugs that made enrichment a complete no-op since PR #5 (feat: knowledge graph quality - extraction, aliases, verification, rethink):
    1. extract_json_array misparses single KG objects with inner arrays (entities never extracted)
    2. Missing enrichment_steps rows (concepts never distilled)
    3. Per-conversation DB isolation (doesn't match production)
  • Source overlap gate: filters irrelevant concepts at search time by checking if concept's source memories overlap with search results. Threshold configurable via DistillationConfig.concept_min_overlap (default 2). Applied to both production (/api/chat-context) and eval.
  • Task-specific judge prompts: verified against LongMemEval paper source code (character-for-character). Calibration: 100% Haiku/Sonnet agreement.
  • Relevance scores: search_concepts now returns normalized RRF scores on Concept.relevance_score (was discarded).
  • CLI judge with configurable concurrency (EVAL_CLI_CONCURRENCY, default 8)
  • Persistent enriched DB for crash-safe resume + extensive eval data preservation

Final Results (post-gate)

LME (500 questions, 5533 memories single-DB):

| State | Task-avg | Δ vs pre-gate |
|---|---|---|
| Flat baseline (memories only) | 57.2% | |
| Pre-gate | 33.7% | |
| Post-gate (default min_overlap=2) | 39.9% | +6.2pp |

LoCoMo (1540 questions, 2531 memories single-DB):

| State | Task-avg | Δ vs pre-gate |
|---|---|---|
| Pre-gate, generic judge | 29.7% | |
| Pre-gate, task-specific judge | 32.0% | |
| Post-gate, task-specific judge | 30.5% | -1.5pp |

Context tokens dropped from ~5500 → ~1500 in both benchmarks.

Honest tradeoff

  • Gate is net positive on noisy/diverse data (LME +6.2pp), small loss on coherent topical data (LoCoMo open-domain -5.2pp).
  • LME post-gate still 17pp below flat — concept content is too generic when synthesized from 183-405 source memories. Upstream fix (cap entity-linked clusters) is follow-up work.
  • Threshold is configurable so operators can adjust per use case.

Verification

  • cargo check --workspace clean
  • cargo test -p origin-core --lib concepts::tests — 6/6 pass (filter_concepts_by_source_overlap unit tests)
  • cargo test -p origin-core --lib tuning — 10/10 pass including new concept_min_overlap default test
  • Full LoCoMo + LME post-gate eval batches completed (~$0.50 total cost)
  • All eval data preserved with _pregate.json and _postgate.json backups

Follow-ups (separate PRs)

  • Re-judge LoCoMo with task-specific prompts when API credits allow (eval-only, doesn't affect runtime)
  • Cap entity-linked clusters in distillation (the LME concept quality issue)
  • Search precision at scale (LoCoMo single-hop 41.9%, LME SSA 57.1% retrieval misses) — tracked in #13 (chore(main): release 0.3.0)

Generated with Claude Code

7xuanlu changed the title from "fix: full-pipeline eval with Batch API enrichment + 3 production bugs" to "fix: full-pipeline eval + source overlap concept gate + 3 production bugs" on Apr 28, 2026
7xuanlu and others added 22 commits April 27, 2026 20:46
Add batch-based full-pipeline eval that runs Origin's complete enrichment
pipeline (entity extraction + concept distillation) then generates answers
via Anthropic Batch API (50% cheaper, parallel processing).

Three-phase architecture:
1. Enrich on-device (free): seed DB, extract entities, distill concepts
2. Batch generate (cheap): submit all answer prompts in one API batch
3. Merge (instant): combine batch results + cached flat answers

Key features:
- Dual-LLM: on-device for enrichment, API for answers (saves cost)
- Cache reuse: existing flat answers (lme_answered_haiku.json, etc.)
  converted to JudgmentTuples, skipping redundant API calls
- Resume support: skips already-processed conversations/questions
- Cost cap: configurable via EVAL_COST_CAP env (default $5)

New harness tests:
- generate_fullpipeline_locomo: all 10 convs, 1540 questions
- generate_fullpipeline_lme: all 500 questions
- judge_fullpipeline_locomo/lme: Batch API judging

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Better entity extraction and concept distillation quality than 4B.
Falls back to 4B if 9B unavailable, then to API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The model registry uses "qwen3.5-9b" with a dot. The typo caused
silent fallback to 4B.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: when the LLM returns a single JSON object like
`{"entities": [...], "observations": [...]}`, `extract_json_array`
found the inner `[` from `"entities": [` before the `{` and extracted
a garbage substring. The object-wrapping fallback never triggered.

This silently broke all entity extraction in the refinery's
per-memory path (extract_single_memory_entities) since PR #5.
The knowledge graph, concepts, and enrichment pipeline were no-ops.

Fix: check if `{` appears before `[` (indicating a single object
response), validate extracted JSON actually parses, and fall through
to the object-wrapping path when array extraction produces invalid
JSON.

Regression tests added for both single-object and array responses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
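The ordering check at the heart of this fix can be sketched in isolation. This is an illustrative standalone version, not the real `extract_json_array` from origin-core: it only demonstrates the "does `{` appear before `[`" decision that routes a single-object response away from the inner-array path.

```rust
/// How a raw LLM response should be treated before extraction.
#[derive(Debug, PartialEq)]
enum JsonShape {
    /// Response leads with a bare array: extract it directly.
    BareArray,
    /// Response is a single object (possibly with inner arrays):
    /// wrap it in an array rather than grabbing the first inner `[`.
    SingleObject,
    /// No JSON delimiters found: nothing to extract.
    None,
}

fn classify_llm_json(raw: &str) -> JsonShape {
    match (raw.find('{'), raw.find('[')) {
        // `{` appears before `[`: a single object like
        // {"entities": [...]}; taking the inner `[` would yield garbage.
        (Some(obj), Some(arr)) if obj < arr => JsonShape::SingleObject,
        (Some(_), None) => JsonShape::SingleObject,
        (_, Some(_)) => JsonShape::BareArray,
        (None, None) => JsonShape::None,
    }
}

fn main() {
    // The buggy path from the commit: object first, arrays inside.
    let single = r#"{"entities": ["a"], "observations": ["b"]}"#;
    assert_eq!(classify_llm_json(single), JsonShape::SingleObject);

    // A genuine array response still extracts as an array.
    let array = r#"[{"entities": ["a"]}]"#;
    assert_eq!(classify_llm_json(array), JsonShape::BareArray);

    println!("ok");
}
```

The real fix additionally validates that the extracted slice parses as JSON before trusting it, falling through to the object-wrapping path when it does not.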
Two changes that make the eval mirror production:

1. Single DB per benchmark (no per-conversation isolation):
   LoCoMo seeds all 10 conversations into one DB, LME seeds all 500
   questions' memories into one DB. No domain filter on search.
   This matches production where Origin has one DB per user.

2. Mark memories as enriched after entity extraction:
   find_distillation_clusters requires enrichment_steps rows
   (production writes these in the async post-ingest flow).
   Without them, 0 concepts were produced even with 176 entities.
   Added mark_all_memories_enriched_for_eval() bulk helper.

The older eval pipelines (locomo.rs, longmemeval.rs, context_path.rs,
pipeline.rs) still use per-conversation DBs. Migration tracked as
follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Probe results (LoCoMo observations):
- 4B (Qwen3-4B): batch_size=1 only. Cannot follow multi-observation
  extraction instructions. Returns single object for first observation.
- 9B (Qwen3.5-9B): batch_size=5 works (11 entities, 6 observations).
  batch_size=10 hits 30s timeout (model can generate but too slow).

Conclusion: on-device enrichment is 2-3 hours for LoCoMo (2531 obs).
Batch API (Haiku) is ~5 min for ~$1. Cloud path is the clear winner
for eval; on-device is for production (one memory at a time).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace sequential on-device entity extraction (~2 hours) with
Anthropic Batch API (~5 min, ~$1). Collects all extraction prompts,
submits one batch, parses results back into DB.

Probe results showed on-device limits:
- 4B: batch_size=1 only (can't follow multi-obs instructions)
- 9B: batch_size=5 max (30s timeout at 10)
- Batch API: unlimited parallel, no timeout

Concept distillation stays on-device (few calls, quality-sensitive).
Entity extraction + answer generation + judging all go through API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete pipeline: entity extraction, title enrichment, concept
synthesis, and concept titles all go through Anthropic Batch API.
No on-device LLM dependency for eval. ~$2 total, ~15 min.

Non-LLM phases (embedding, clustering, search) stay on-device,
testing Origin's actual infrastructure quality.

No production code modified.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Title enrichment: expanded query to match generic "X session N" titles
(not just truncated ones). All 184 LoCoMo observations get semantic titles.

Concept hallucination check: compared output against source_ids (just IDs)
instead of actual memory content. Fixed to use cluster.contents.

Smoke test results (1 conversation):
- 472 entities, 184 titles enriched, 3 concepts distilled
- Structured context +144 tokens over flat (real concept content)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tempfile::tempdir() deletes on exit/panic, losing all enrichment
work (~$3 in Batch API calls). Now uses a stable path alongside
the output file (e.g. fullpipeline_locomo_tuples.db/).

On re-run, checks if DB has data. If yes, skips enrichment and
goes straight to context collection + answer generation.

Also raised default cost cap from $5 to $10 (answer generation
for full LoCoMo exceeds $5).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous check (memory_count > 0) would skip enrichment on partial
data. Now checks enriched_memory_count == memory_count. If partial,
clears DB and starts fresh. If complete, skips enrichment.

Tested: forced failure via cost cap, re-run correctly detects
"0/2531 enriched" and starts fresh.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
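The resume decision described above reduces to a small pure function. A minimal sketch, with names and types assumed (the real check lives in the eval harness):

```rust
/// What to do with an existing enriched DB on re-run.
#[derive(Debug, PartialEq)]
enum ResumeAction {
    /// Fully enriched: skip straight to context collection + answers.
    SkipEnrichment,
    /// Partial data from a crashed run: clear the DB and start fresh.
    ClearAndRestart,
    /// Empty DB: seed and enrich from scratch.
    FreshRun,
}

fn resume_action(memory_count: u64, enriched_count: u64) -> ResumeAction {
    if memory_count == 0 {
        ResumeAction::FreshRun
    } else if enriched_count == memory_count {
        ResumeAction::SkipEnrichment
    } else {
        // Covers the "0/2531 enriched" forced-failure case from the test.
        ResumeAction::ClearAndRestart
    }
}

fn main() {
    assert_eq!(resume_action(2531, 0), ResumeAction::ClearAndRestart);
    assert_eq!(resume_action(2531, 2531), ResumeAction::SkipEnrichment);
    assert_eq!(resume_action(0, 0), ResumeAction::FreshRun);
    println!("ok");
}
```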
Entity extraction and title enrichment are independent — submit
both Batch API requests simultaneously via tokio::join!, then
wait for concept distillation (depends on enrichment_steps).

Cuts enrichment time by ~40% (one fewer sequential batch wait).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Entity extraction + title enrichment now run in parallel via
  tokio::join! (cuts enrichment time ~40%)
- Added judge_fullpipeline_lme_cli test using claude -p (Max plan,
  no API key needed) for when Batch API credits are exhausted

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three pre-merge fixes for PR #29:

1. Judge prompt: removed system prompt from CLI path (was double
   instructions). Both CLI and Batch API now use identical user
   prompt via shared judge_prompt() function. No system prompt.

2. Concept cap: reverted token budget in build_contexts to match
   production. /api/chat-context includes top-3 concepts with no
   cap. LME -26pp regression is a real product gap (concept noise
   from unrelated data), not an eval bug to mask.

3. Flat dropped: enriched pipeline only generates structured answers.
   Flat baseline exists in retrieval-only caches. Halves cost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace generic "strict judge" prompt with task_judge_prompt() dispatcher
- LME prompts match LongMemEval paper source code (evaluate_qa.py) verbatim
- LoCoMo "temporal" mapped to same off-by-one tolerance as LME temporal-reasoning
- Add category field to JudgmentTuple (#[serde(default)] for backward compat)
- Both CLI and Batch API paths call same dispatcher (single source of truth)
- Collapse redundant SSU/SSA/MS branch into default (identical text)
- Default branch now includes equivalence+subset guidance (was missing)

Calibration: 100% Haiku/Sonnet agreement on 16 valid comparisons (20 sample).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
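The dispatcher shape described above can be sketched as follows. Note this is illustrative only: the real task_judge_prompt reproduces the LongMemEval paper's evaluate_qa.py wording verbatim, whereas the guidance strings below are placeholder text, and only the category mapping follows the commit message.

```rust
/// Sketch of a per-task judge prompt dispatcher. Prompt wording here is
/// placeholder text, NOT the LongMemEval verbatim prompts.
fn task_judge_prompt(category: &str, question: &str, gold: &str, hypothesis: &str) -> String {
    let guidance = match category {
        // LoCoMo "temporal" maps onto the same tolerance as LME
        // temporal-reasoning: small off-by-one date errors still count.
        "temporal-reasoning" | "temporal" => {
            "Allow off-by-one errors in dates or durations."
        }
        // SSU/SSA/MS collapse into the default branch (identical text),
        // which also carries the equivalence + subset guidance.
        _ => "Accept equivalent phrasings and answers containing the gold answer as a subset.",
    };
    format!(
        "Question: {question}\nGold answer: {gold}\nCandidate: {hypothesis}\n{guidance}\nReply yes or no."
    )
}

fn main() {
    let p = task_judge_prompt("temporal", "When did she move?", "May 2023", "June 2023");
    assert!(p.contains("off-by-one"));
    let q = task_judge_prompt("single-session-user", "Pet?", "a dog", "a small dog");
    assert!(q.contains("subset"));
    println!("ok");
}
```

Routing both the CLI and Batch API paths through one such function is what gives the "single source of truth" property the commit claims.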
- Add EVAL_CLI_CONCURRENCY env var (default 8) for parallel claude -p calls
- Add EVAL_CLI_MODEL env var (default haiku) for model selection
- Add judge_fullpipeline_locomo_cli test (was missing)
- Extract print_judge_report() with task-averaged accuracy
- DRY: Batch API judge tests reuse print_judge_report()

Usage:
  cargo test -p origin --test eval_harness judge_fullpipeline_lme_cli -- --ignored --nocapture
  EVAL_CLI_CONCURRENCY=4 cargo test ... judge_fullpipeline_locomo_cli -- --ignored --nocapture

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
search_concepts computed RRF scores internally but discarded them before
returning Vec<Concept>. This prevented callers from filtering irrelevant
concepts, causing -23.5pp accuracy drop on LME (concept noise).

- Add relevance_score: f32 to Concept (serde default 0.0, skip if zero)
- Normalize RRF scores to 0.0-1.0 (same formula as search_memory)
- Attach scores to concepts before returning (no API break)
- Log concept scores in eval build_structured_context for distribution analysis

Next step: run score distributions on LoCoMo vs LME enriched DBs,
then set data-driven threshold.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
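A minimal sketch of attaching normalized scores. The commit says the normalization uses the "same formula as search_memory", which is not shown here, so the divide-by-max mapping below is an assumption, as is the trimmed-down Concept struct.

```rust
/// Simplified stand-in for the real Concept type.
#[derive(Debug)]
struct Concept {
    title: String,
    relevance_score: f32, // serde default 0.0, skipped when zero
}

/// Map raw RRF scores into 0.0..=1.0 and attach them to the concepts
/// (assumed max-normalization; raw_rrf[i] pairs with concepts[i]).
fn attach_normalized_scores(concepts: &mut [Concept], raw_rrf: &[f32]) {
    let max = raw_rrf.iter().cloned().fold(0.0_f32, f32::max);
    for (c, &raw) in concepts.iter_mut().zip(raw_rrf) {
        c.relevance_score = if max > 0.0 { raw / max } else { 0.0 };
    }
}

fn main() {
    let mut cs = vec![
        Concept { title: "travel".into(), relevance_score: 0.0 },
        Concept { title: "career".into(), relevance_score: 0.0 },
    ];
    attach_normalized_scores(&mut cs, &[0.032, 0.016]);
    assert!((cs[0].relevance_score - 1.0).abs() < 1e-6);
    assert!((cs[1].relevance_score - 0.5).abs() < 1e-6);
    println!("ok");
}
```

The point of returning scores at all is that callers can then threshold on them, which is exactly what the next commit's overlap gate does by a different route.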
search_concepts returned top-3 concepts regardless of whether they were
about the query's topic. With diverse memories in one DB, this added
~5500 tokens of garbage concepts that caused -23.5pp accuracy drop.

The fix: only include a concept if its source memories overlap with the
memories that search_memory returned for the same query. This answers
"is this concept about the thing I'm searching for?" by construction.

Data-driven validation (from enriched eval DBs):
- LoCoMo: relevant concepts have 10/10 overlap (all search results
  came from the concept's sources). Correctly kept.
- LME: noise concepts have 0-1/10 overlap (random hits from 183-405
  source pool). Correctly filtered at min_overlap=2.

Applied to both production (/api/chat-context) and eval
(build_structured_context). Also adds probe_concept_scores test for
inspecting score distributions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
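The gate itself is a set-intersection count. A standalone sketch, assuming string memory IDs and a simplified Concept (the real filter_concepts_by_source_overlap lives in origin-core and its exact signature may differ):

```rust
use std::collections::HashSet;

struct Concept {
    title: String,
    source_ids: Vec<String>,
}

/// Keep a concept only if at least `min_overlap` of its source memories
/// also appeared in search_memory's results for the same query.
fn filter_concepts_by_source_overlap(
    concepts: Vec<Concept>,
    search_hits: &HashSet<String>,
    min_overlap: usize,
) -> Vec<Concept> {
    concepts
        .into_iter()
        .filter(|c| {
            // "Is this concept about the thing I'm searching for?"
            let overlap = c
                .source_ids
                .iter()
                .filter(|id| search_hits.contains(*id))
                .count();
            overlap >= min_overlap
        })
        .collect()
}

fn main() {
    let hits: HashSet<String> =
        ["m1", "m2", "m3"].iter().map(|s| s.to_string()).collect();
    let concepts = vec![
        Concept { title: "relevant".into(), source_ids: vec!["m1".into(), "m2".into()] },
        Concept { title: "noise".into(), source_ids: vec!["m9".into()] },
    ];
    let kept = filter_concepts_by_source_overlap(concepts, &hits, 2);
    assert_eq!(kept.len(), 1);
    assert_eq!(kept[0].title, "relevant");
    println!("ok");
}
```

Note that `min_overlap = 0` keeps everything (overlap >= 0 is always true), which matches the "zero threshold" unit test listed later in the PR.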
Addresses adversarial code review findings:
- Remove dead flat_cache_path param from both fullpipeline functions
  (flat was dropped in earlier commit, param was unused)
- Remove dead enrichment_llm param (all enrichment via Batch API)
- Remove dead _prompts/_tuning construction (filesystem I/O for nothing)
- Remove dead load_flat_cache_locomo/lme functions (0 warnings now)
- Add 6 unit tests for filter_concepts_by_source_overlap:
  keeps matching, filters low overlap, empty sources, empty search,
  zero threshold, mixed keeps/filters
- Fix misleading normalization comment in search_concepts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The source overlap gate threshold was hardcoded to 2 in two places.
Make it tunable via DistillationConfig.concept_min_overlap (default 2)
so operators can adjust per use case based on their data structure.

Production: routes.rs reads from ServerState.tuning.distillation
Eval: build_structured_context reads same default, with EVAL_CONCEPT_MIN_OVERLAP
env override for sweep testing.

Empirical defaults at min_overlap=2 (2026-04-27 batch API run, ~$0.50):
- LME (noisy diverse data, 5533 mems): 33.7% -> 39.9% (+6.2pp)
- LoCoMo (coherent topical data, 2531 mems): 32.0% -> 30.5% (-1.5pp)
- Both: ~70% concept token reduction

Tradeoff documented in tuning.rs doc comments. Default chosen because
LME-style data is the more dangerous failure mode (concepts can drag
accuracy down by tens of pp) while LoCoMo-style loss is bounded (~1-2pp).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
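The threshold resolution order (env override for eval sweeps, else config default) can be sketched as a pure function. Names follow the PR (EVAL_CONCEPT_MIN_OVERLAP, default 2); the surrounding DistillationConfig type is assumed and the env value is passed in explicitly to keep the function testable:

```rust
/// Resolve the overlap-gate threshold: an eval-sweep env override wins,
/// otherwise fall back to the configured default (2 in DistillationConfig).
fn concept_min_overlap(env_value: Option<&str>, config_value: usize) -> usize {
    env_value
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(config_value)
}

fn main() {
    // Production path: no override, use the config default.
    assert_eq!(concept_min_overlap(None, 2), 2);

    // Sweep testing: override per run without touching config.
    assert_eq!(concept_min_overlap(Some("4"), 2), 4);

    // Garbage values fall back safely to the default.
    assert_eq!(concept_min_overlap(Some("not-a-number"), 2), 2);

    // How a caller would wire it up (hypothetical call site):
    let env = std::env::var("EVAL_CONCEPT_MIN_OVERLAP").ok();
    let _threshold = concept_min_overlap(env.as_deref(), 2);
    println!("ok");
}
```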
Pre-push previously enforced a 90% cargo llvm-cov gate. That ran an
instrumented rebuild of the whole workspace including the Tauri app,
peaking at 8-16GB RSS and 5-15min wall time. CI doesn't even run that
check, so the local gate added friction without upstream protection.

Three changes drawn from the long-run pattern:

1. .githooks/pre-push now runs only the fast checks:
   cargo clippy --workspace + cargo test --workspace + pnpm vitest --bail.
   No instrumented rebuild, no coverage gate, no memory pressure.
2. .github/workflows/coverage.yml posts coverage as an informational PR
   comment (continue-on-error: true) for origin-core + origin-server.
   The Tauri app crate is excluded; its surface is GUI proxies untestable
   without a runtime.
3. CLAUDE.md documents the L1-L8 matrix: who gates correctness (pre-push,
   CI), who measures quality (coverage workflow, manual scripts/coverage.sh),
   what stays laptop-only (GPU evals, Anthropic batch judge), and why.

The gist: gates block on pass/fail, never on percentages. The slowest
pre-push command sets push latency, so keep it under 60s. GPU/API work is
human-paced. Mirror don't duplicate-slowly.
The new pre-push runs cargo clippy --workspace --all-targets which catches
this in the probe_concept_scores test. Inline the "Question" header literal
into the format string instead of passing it as a positional arg.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7xuanlu force-pushed the feature/fullpipeline-eval branch from 4e10d40 to 4bb6b48 on April 28, 2026 03:59
7xuanlu merged commit e8923b7 into main on Apr 28, 2026
1 of 2 checks passed
7xuanlu deleted the feature/fullpipeline-eval branch on April 28, 2026 04:12
7xuanlu added a commit that referenced this pull request Apr 28, 2026
Adversarial review caught that PR #29's new run_fullpipeline_locomo_batch
and run_fullpipeline_lme_batch seeded all memories with chrono::Utc::now()
and used a date-blind system prompt — silently nullifying the temporal
metadata work for the fullpipeline path (every memory would have been
prefixed "On 2026-04-27" regardless of original session date).

Fixes:
- Extract build_e2e_system_prompt(question_date) helper; refactor
  generate_e2e_answers_for_question to use it.
- Thread mem.session_date through seed_last_modified at both batch seed
  sites (LoCoMo: parse_locomo_date, LME: parse_lme_date).
- Thread question_date into batch system prompts: LoCoMo uses the latest
  session_date in the sample (questions follow the conversation); LME
  uses sample.question_date directly.
- seed_last_modified now log::warn!s on parse failures so future silent
  regressions to now() are visible in eval logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
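The LoCoMo question-date rule above (questions follow the conversation, so use the latest session_date in the sample) is a one-liner once dates are comparable. A sketch assuming ISO "YYYY-MM-DD" strings, where lexicographic max equals chronological max:

```rust
/// Pick the question_date for a LoCoMo sample: the latest session_date.
/// Assumes ISO "YYYY-MM-DD" formatting (the real code parses via
/// parse_locomo_date, so this string comparison is a simplification).
fn latest_session_date<'a>(session_dates: &[&'a str]) -> Option<&'a str> {
    session_dates.iter().copied().max()
}

fn main() {
    let dates = ["2023-05-20", "2023-08-11", "2023-06-02"];
    assert_eq!(latest_session_date(&dates), Some("2023-08-11"));
    assert_eq!(latest_session_date(&[]), None);
    println!("ok");
}
```

LME needs no such rule because each sample carries question_date directly.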