
feat: LLM-as-judge evaluation pipeline + observability cleanup (v2.2.3)#63

Merged
Hidden-History merged 26 commits into main from feature/v2.2.3-cleanup on Mar 14, 2026

Conversation

Hidden-History (Owner) commented Mar 14, 2026

Summary

  • Wave 1: Langfuse observability cleanup (TD-275 semantic tags, TD-228 per-search traces, TD-217/218 detect-secrets + GH Actions)
  • Wave 2: LLM-as-judge evaluation pipeline (PLAN-012 Phase 2+3) — engine, 6 evaluators, 5 golden datasets, regression tests + CI gate
  • 6 V3 SDK bug fixes discovered during live testing (BUG-211 to BUG-216)
  • Ollama cloud API support for evaluator pipeline
  • Thinking model support (Qwen3 reasoning field fallback)

Test plan

  • 2,490 unit tests pass (0 failures)
  • CI green — Lint, Unit Tests (3.10/3.11/3.12), Integration, Regression, CodeQL
  • Live evaluation: 4 scores attached to Langfuse via Ollama cloud (qwen3.5:397b)
  • Score configs verified in Langfuse (6 configs: 3 NUMERIC, 2 BOOLEAN, 1 CATEGORICAL)
  • Installer Option 1 verified — evaluator module installs correctly

Note

Merging to main WITHOUT tagging v2.2.3. A follow-up sprint (v2.2.3B) will address Langfuse observability gaps (TD-280 to TD-288: tag propagation, observation-level evaluation, cron scheduler, retry logic) before the release tag.

WB Solutions and others added 26 commits March 13, 2026 17:58
1. update_shared_scripts() now copies the templates/ dir alongside scripts,
   ensuring new templates (e.g., decision-log.md) deploy on Option 1
   installs, not just full installs.

2. story-complete.md: observability section with structured logging,
   Prometheus metrics, Grafana dashboard checklist items.

3. production-ready.md: specific observability items replacing generic
   "Monitoring in place" / "Logging adequate" lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ersions, TD-219 screenshot dir

- TD-100: sanitize_log_input already applied to all user inputs in monitoring/main.py (pre-existing)
- TD-217: move detect-secrets from prod deps to dev optional-dependencies
- TD-218: downgrade GH Actions to confirmed stable versions (checkout@v4, setup-python@v5, github-script@v7, release-changelog-builder-action@v5)
- TD-219: ensure_screenshot_dir fixture already exists in tests/e2e/conftest.py (pre-existing)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…D-275)

Updates tags= parameters across 21 files to use the canonical tag mapping:
- user_prompt_capture/store_async: ["capture", "user_prompt"]
- agent_response_capture/store_async: ["capture", "agent_response"]
- error_detection/pattern_capture/store_async: ["capture", "error_detection"]
- context_injection_tier2: ["injection", "tier2"]
- session_start: ["injection", "tier1", "bootstrap"]
- pre_compact_save: ["capture", "session_summary"]
- post_tool_capture/new_file_trigger/first_edit_trigger/store_async: ["capture", "trigger"]
- best_practices_retrieval: ["retrieval", "best_practices"]
- manual_save_memory: ["capture", "manual"]
- classification_worker/process_classification_queue: ["classification"]
- jira/sync: ["sync", "jira"]
- src/memory/search: ["search", "retrieval"]
- src/memory/decay: ["search", "decay"]
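The canonical mapping above boils down to a lookup table keyed by event source. A minimal sketch (names and module layout are assumptions for illustration, not the actual codebase):

```python
# Hypothetical sketch of the canonical tag mapping described above (TD-275).
# Entries mirror the list in this commit message; the real module may
# organise them differently.
CANONICAL_TAGS = {
    "user_prompt_capture": ["capture", "user_prompt"],
    "agent_response_capture": ["capture", "agent_response"],
    "context_injection_tier2": ["injection", "tier2"],
    "session_start": ["injection", "tier1", "bootstrap"],
    "best_practices_retrieval": ["retrieval", "best_practices"],
}

def tags_for(event: str) -> list[str]:
    """Return the canonical tags for an event, empty if unmapped."""
    return CANONICAL_TAGS.get(event, [])
```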

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ath (TD-228)

- Fix resume trace tags: ["injection", "tier1", "bootstrap"] → ["injection", "resume"] (DEC-054)
- Add per-search trace (memory_retrieval_search) for session summaries in non-Parzival compact path
- Add per-search trace (memory_retrieval_search) for decisions in non-Parzival compact path
- Add greedy-fill trace after inject_with_priority with budget/selected/dropped counts
- All new events tagged ["injection", "compact"] with session_id, guarded try/except
- Non-Parzival compact path now has 4 emit_trace_event calls (up from 1)
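The guarded try/except pattern above can be sketched as follows (emit_trace_event is passed in here for illustration; the real call site and signature are assumptions):

```python
# Hypothetical sketch of the guarded per-search trace emission described
# above: observability is best-effort and must never break the compact path.
def emit_search_trace(emit_trace_event, session_id: str, event_type: str,
                      budget: int, selected: int, dropped: int) -> bool:
    try:
        emit_trace_event(
            event_type,
            tags=["injection", "compact"],
            metadata={"session_id": session_id, "budget": budget,
                      "selected": selected, "dropped": dropped},
        )
        return True
    except Exception:
        # Swallow tracing errors so injection keeps working even if
        # the observability backend is down.
        return False
```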

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ding F-5)

- F-1: Complete TD-275 — 12 remaining single-element tags updated to dual-element
- F-2: store_async.py tags restored to ["capture", "store"]
- F-3: error_detection.py injection event correctly tagged ["injection", "error_detection"]
- F-4: Unique event_types for compact session/decision traces
- F-6: best_practices_retrieval.py tag aligned to ["search", "best_practices"]
- F-7: GH Actions upload-artifact@v4, cache@v4
- F-8: Removed hardcoded score from session summary traces

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-012 Phase 2)

- EV-01 retrieval_relevance: NUMERIC 0-1, filters search/retrieval tags, 5% sampling
- EV-02 injection_value: BOOLEAN, filters injection/tier2 + injection/compact tags, 5% sampling
- EV-03 capture_completeness: BOOLEAN, filters capture tags, 5% sampling
- EV-04 classification_accuracy: CATEGORICAL (correct/partially_correct/incorrect), filters classification tags, 10% sampling
- EV-05 bootstrap_quality: NUMERIC 0-1, filters injection/tier1/bootstrap tags, 100% sampling
- EV-06 session_coherence: NUMERIC 0-1, filters capture/session_summary tags, 100% sampling

Each evaluator has a companion prompt with chain-of-thought, rubric, and JSON response format.
Prompts are model-agnostic (Ollama llama3.2:8b compatible) with clear scoring criteria.
Tag filters use canonical values from codebase (TD-275 WP-0a).
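The filter-plus-sampling behaviour each evaluator applies can be sketched as a single predicate (a minimal illustration; the real config loading and trace shape are assumptions):

```python
# Hypothetical sketch of the per-evaluator gating described above: a trace
# is considered only if its tags match the evaluator's filter, and then
# only for the configured sampling fraction (e.g. 0.05 for EV-01, 1.0 for EV-05).
import random

def should_evaluate(trace_tags: list[str], filter_tags: list[str],
                    sample_rate: float, rng: random.Random) -> bool:
    if not all(t in trace_tags for t in filter_tags):
        return False
    return rng.random() < sample_rate
```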

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…er support (PLAN-012 Phase 2)

- src/memory/evaluator/__init__.py: Package init exporting EvaluatorConfig, EvaluatorRunner
- src/memory/evaluator/provider.py: Multi-provider client (Ollama, OpenRouter, Anthropic, OpenAI, custom)
  - Ollama: OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
  - Anthropic: native Anthropic SDK (NOT OpenAI compat) per PM #190
  - Cloud providers raise ValueError if env var not set (fail-fast)
  - ZERO hardcoded API keys
- src/memory/evaluator/runner.py: Core pipeline with cursor-based pagination (PM #190 fix)
  - Deterministic score_id via md5(trace_id:evaluator_name:since) for idempotency
  - Uses get_client() V3 singleton, never Langfuse() constructor
  - Calls langfuse.flush() after all evaluations, shutdown() takes no args
- scripts/run_evaluations.py: CLI with --config, --evaluator, --dry-run, --since, --batch-size
- scripts/create_score_configs.py: Idempotent 6-config Score Config setup in Langfuse
- evaluator_config.yaml: Default config (Ollama/llama3.2:8b), ZERO secrets
- tests/test_evaluator_provider.py, test_evaluator_runner.py, test_evaluator_config.py: 73 tests passing
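The deterministic score_id scheme mentioned above can be sketched directly from its description:

```python
# Sketch of the idempotent score_id described above: md5 over
# "trace_id:evaluator_name:since", so re-running the same evaluation window
# yields the same id instead of duplicating scores. (Function name assumed.)
import hashlib

def deterministic_score_id(trace_id: str, evaluator_name: str, since: str) -> str:
    key = f"{trace_id}:{evaluator_name}:{since}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()
```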

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ew (PLAN-012)

UF-1: Remove evaluators/ prefix from prompt_file in all 6 evaluator YAMLs
UF-2: Flatten list-of-lists tags to flat string lists in all 6 evaluator YAMLs
UF-3: Wrap per-trace eval block in try/except to isolate individual failures
UF-4: Move _matches_filter check before sampling so sampling draws only from matching traces
UF-5: Move langfuse.flush() into finally block so it runs even on error
UF-6: Fix EV-01 tags to [search, retrieval, best_practices] per AC-2 spec
UF-7: Remove [:TRACE_CONTENT_MAX] truncation from LLM prompt in provider.py
UF-8: Replace all {{...}} Jinja placeholders in 6 prompts with trace section refs
UF-9: Fix missing | on 0.5-0.69 rubric row in ev05_bootstrap_quality_prompt.md
UF-10: Define TRACE_CONTENT_MAX once in __init__.py, import in provider/runner
UF-11: Align max_tokens to 1024 in evaluator_config.yaml
UF-12: Add _client caching to get_client() to avoid recreating clients per call
UF-13: Make since a required keyword-only arg in run(); remove internal default
UF-14: Simplify test_no_sk_prefix_keys assertion to assert "sk-" not in source
UF-15: Add test_load_prompt_uses_actual_file_not_fallback to test_evaluator_runner.py
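UF-12's client caching can be sketched like this (the factory indirection and cache key are illustrative assumptions, not the actual provider.py code):

```python
# Hypothetical sketch of UF-12: cache provider clients per (provider, base_url)
# so repeated get_client() calls reuse one object instead of reconstructing
# it for every trace evaluated.
_client_cache: dict = {}

def get_client(provider: str, base_url: str, factory):
    key = (provider, base_url)
    if key not in _client_cache:
        _client_cache[key] = factory()
    return _client_cache[key]
```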

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…12 Phase 3)

Creates scripts/create_datasets.py with 5 Langfuse golden datasets:
- DS-01: Retrieval Golden Set (25 items, all 5 collections, >=3 per collection)
- DS-02: Error Pattern Match (13 items, Python/JavaScript/bash)
- DS-03: Bootstrap Round-Trip (7 items, agent_id=parzival tenant isolation)
- DS-04: Keyword Trigger Routing (68 items, all patterns from triggers.py)
- DS-05: Chunking Quality (10 items, IntelligentChunker routing decisions)

Also creates tests/test_create_datasets.py with 35 tests covering:
- Dataset completeness (all 5 defined, required keys/metadata)
- DS-04 item count matches triggers.py exactly (68 patterns: error_detection=5,
  decision_keywords=20, session_history_keywords=16, best_practices_keywords=27)
- Item structure validation per dataset type
- No placeholder data (case-sensitive check)
- Idempotency/dry-run mode works without Langfuse connection
- V3 SDK compliance (get_client(), flush(), no constructor with explicit creds)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hase 3)

- tests/test_regression.py: 4 regression tests against golden datasets
  DS-01 (retrieval relevance >= 0.7), DS-02 (error match >= 80%),
  DS-03 (bootstrap round-trip), DS-04 (keyword routing 100%)
- .github/workflows/regression-tests.yml: CI workflow triggers on PR
  to main/develop when src/memory/** or .claude/hooks/scripts/** change;
  blocks merge on regression; ZERO hardcoded secrets
- pyproject.toml: add 'regression' marker; default pytest excludes
  regression tests (-m 'not regression')
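The marker setup described above corresponds roughly to this pyproject.toml fragment (section and option names sketched from memory of pytest conventions; the actual file may differ):

```toml
# Sketch of the pytest configuration described above.
[tool.pytest.ini_options]
addopts = "-m 'not regression'"
markers = [
    "regression: golden-dataset regression tests (run only in the CI gate)",
]
```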

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ale counts

- F-1: Update all DS-04 "63" references to "68" (module docstring, section
  header, breakdown comment, dataset description string)
- F-2: Generate deterministic id="{dataset_name}-item-{i:03d}" for each
  create_dataset_item() call to prevent duplicate creation on re-runs
- F-3: Update DS-02 fabricated error message to "assert 60 == 68" and fix_summary
  to include error_detection (5) = 68 total
- F-4: Remove unused bare `import importlib` from test_create_datasets.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ndings

- F-1+F-2 (CRITICAL): add detect_error_signal import+branch, fix trigger
  name strings to match dataset exact values (best_practices_keywords,
  session_history_keywords, error_detection) so all 68 DS-04 items route
- F-3: update stale "63" → "68" in comment and docstring
- NF-1: remove misleading {input}/{output} placeholders from fallback prompt
  in runner.py (replaced with plain "Evaluate the following trace data.")
- NF-2: add test_single_trace_error_does_not_kill_run to test_evaluator_runner.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…atibility

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove: RAG-CHUNKING-BEST-PRACTICES-2026.md, claude-code-hooks-best-practices.md,
conversation-memory-best-practices.md, sdk-vs-hooks-comparison.md

Update 5 files with broken references to point to HOOKS.md instead.
…a cloud support

- create_score_configs.py: Replace non-existent lf.create_score_config() with
  lf.api.score_configs.create(request=CreateScoreConfigRequest(...)) (V3 SDK)
- provider.py: Read OLLAMA_API_KEY from env instead of hardcoded "ollama",
  enabling Ollama cloud API access while preserving local default
- evaluator_config.yaml: Document cloud base_url (https://ollama.com/v1) and
  OLLAMA_API_KEY env var
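The key resolution described above can be sketched as a small helper (name and env-injection parameter are illustrative; provider.py may do this inline):

```python
# Hypothetical sketch of the env-driven key lookup described above: use
# OLLAMA_API_KEY when set (cloud API), otherwise fall back to the dummy
# "ollama" key that a local Ollama server accepts.
import os

def resolve_ollama_api_key(env=None) -> str:
    env = os.environ if env is None else env
    return env.get("OLLAMA_API_KEY") or "ollama"
```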

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PLAN-012 evaluator pipeline supports Ollama cloud models which require
an API key from https://ollama.com/settings/keys. Local Ollama users
can leave this blank.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- trace.list: from_timestamp/to_timestamp (not start_time/end_time)
- trace.list: page-based pagination (not cursor-based)
- create_score: score_id parameter (not id)
- Updated tests to match corrected pagination pattern
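The corrected page-based pagination pattern looks roughly like this (fetch_page stands in for the Langfuse trace-list call; its exact signature here is an assumption):

```python
# Hypothetical sketch of page-based pagination as described above: request
# page 1, 2, ... until a short or empty page signals the end.
def iter_traces(fetch_page, from_timestamp, to_timestamp, limit: int = 50):
    page = 1
    while True:
        batch = fetch_page(page=page, limit=limit,
                           from_timestamp=from_timestamp,
                           to_timestamp=to_timestamp)
        if not batch:
            return
        yield from batch
        if len(batch) < limit:
            return  # short page: nothing further to fetch
        page += 1
```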

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- provider.py: Fall back to msg.reasoning when msg.content is empty,
  supporting Qwen3 and other thinking models that use a reasoning field
- evaluator_config.yaml: Increase max_tokens from 1024 to 4096 to
  accommodate thinking tokens + output for reasoning models
- Updated test assertion for new default max_tokens
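The content-to-reasoning fallback can be sketched as follows (attribute access is duck-typed here for illustration; the real code reads the provider's response object):

```python
# Hypothetical sketch of the fallback described above: thinking models such
# as Qwen3 may leave message.content empty and put their text in a
# reasoning field instead.
def extract_text(message) -> str:
    content = getattr(message, "content", None)
    if content:
        return content
    return getattr(message, "reasoning", "") or ""
```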

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Hidden-History Hidden-History merged commit 78892ef into main Mar 14, 2026
13 checks passed
@Hidden-History Hidden-History deleted the feature/v2.2.3-cleanup branch March 14, 2026 14:10