feat: LLM-as-judge evaluation pipeline + observability cleanup (v2.2.3) #63
Merged
Hidden-History merged 26 commits into main, Mar 14, 2026
Conversation
1. update_shared_scripts() now copies the templates/ dir alongside scripts, ensuring new templates (e.g., decision-log.md) deploy on Option 1 installs, not just full installs.
2. story-complete.md: observability section with structured logging, Prometheus metrics, and Grafana dashboard checklist items.
3. production-ready.md: specific observability items replacing the generic "Monitoring in place" / "Logging adequate" lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ersions, TD-219 screenshot dir

- TD-100: sanitize_log_input already applied to all user inputs in monitoring/main.py (pre-existing)
- TD-217: move detect-secrets from prod deps to dev optional-dependencies
- TD-218: downgrade GH Actions to confirmed stable versions (checkout@v4, setup-python@v5, github-script@v7, release-changelog-builder-action@v5)
- TD-219: ensure_screenshot_dir fixture already exists in tests/e2e/conftest.py (pre-existing)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…D-275)

Updates tags= parameters across 21 files to use the canonical tag mapping:
- user_prompt_capture/store_async: ["capture", "user_prompt"]
- agent_response_capture/store_async: ["capture", "agent_response"]
- error_detection/pattern_capture/store_async: ["capture", "error_detection"]
- context_injection_tier2: ["injection", "tier2"]
- session_start: ["injection", "tier1", "bootstrap"]
- pre_compact_save: ["capture", "session_summary"]
- post_tool_capture/new_file_trigger/first_edit_trigger/store_async: ["capture", "trigger"]
- best_practices_retrieval: ["retrieval", "best_practices"]
- manual_save_memory: ["capture", "manual"]
- classification_worker/process_classification_queue: ["classification"]
- jira/sync: ["sync", "jira"]
- src/memory/search: ["search", "retrieval"]
- src/memory/decay: ["search", "decay"]

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
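The canonical mapping above can be centralized as a single lookup table so that hook scripts stop hardcoding tag lists. This is a hypothetical sketch (module layout and fallback behavior are assumptions; the tag values themselves come from the commit message):

```python
# Hypothetical central tag registry; values taken from the TD-275 commit above.
CANONICAL_TAGS: dict[str, list[str]] = {
    "user_prompt_capture": ["capture", "user_prompt"],
    "agent_response_capture": ["capture", "agent_response"],
    "error_detection": ["capture", "error_detection"],
    "context_injection_tier2": ["injection", "tier2"],
    "session_start": ["injection", "tier1", "bootstrap"],
    "pre_compact_save": ["capture", "session_summary"],
    "best_practices_retrieval": ["retrieval", "best_practices"],
    "manual_save_memory": ["capture", "manual"],
    "classification_worker": ["classification"],
}

def tags_for(hook_name: str) -> list[str]:
    """Return the canonical tag list for a hook; a bare ["capture"]
    fallback for unknown hooks is an assumption, not codebase behavior."""
    return CANONICAL_TAGS.get(hook_name, ["capture"])
```

With a single source of truth, a drift check like UF-6/F-6 becomes a one-line test instead of a 21-file audit.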
…ath (TD-228)

- Fix resume trace tags: ["injection", "tier1", "bootstrap"] → ["injection", "resume"] (DEC-054)
- Add per-search trace (memory_retrieval_search) for session summaries in the non-Parzival compact path
- Add per-search trace (memory_retrieval_search) for decisions in the non-Parzival compact path
- Add greedy-fill trace after inject_with_priority with budget/selected/dropped counts
- All new events tagged ["injection", "compact"] with session_id, guarded try/except
- The non-Parzival compact path now has 4 emit_trace_event calls (up from 1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
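The "guarded try/except" pattern mentioned above can be sketched as a small wrapper: observability must never break the injection path itself. The wrapper name and emit signature are assumptions for illustration; only the tags and the guard requirement come from the commit:

```python
import logging

log = logging.getLogger(__name__)

def emit_trace_event_safely(emit, event_type, session_id, **fields):
    """Emit a trace event, swallowing (but logging) any failure.

    `emit` stands in for the real emit_trace_event callable; the
    ["injection", "compact"] tags match the TD-228 commit above.
    """
    try:
        emit(event_type, tags=["injection", "compact"],
             session_id=session_id, **fields)
        return True
    except Exception:
        # A broken tracing backend must not abort context injection.
        log.warning("trace emit failed for %s", event_type, exc_info=True)
        return False
```

Returning a boolean keeps the caller's control flow explicit without re-raising.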
…ding F-5)

- F-1: Complete TD-275: 12 remaining single-element tags updated to dual-element
- F-2: store_async.py tags restored to ["capture", "store"]
- F-3: error_detection.py injection event correctly tagged ["injection", "error_detection"]
- F-4: Unique event_types for compact session/decision traces
- F-6: best_practices_retrieval.py tag aligned to ["search", "best_practices"]
- F-7: GH Actions upload-artifact@v4, cache@v4
- F-8: Removed hardcoded score from session summary traces

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…assification_queue.py
…-012 Phase 2)

- EV-01 retrieval_relevance: NUMERIC 0-1, filters search/retrieval tags, 5% sampling
- EV-02 injection_value: BOOLEAN, filters injection/tier2 + injection/compact tags, 5% sampling
- EV-03 capture_completeness: BOOLEAN, filters capture tags, 5% sampling
- EV-04 classification_accuracy: CATEGORICAL (correct/partially_correct/incorrect), filters classification tags, 10% sampling
- EV-05 bootstrap_quality: NUMERIC 0-1, filters injection/tier1/bootstrap tags, 100% sampling
- EV-06 session_coherence: NUMERIC 0-1, filters capture/session_summary tags, 100% sampling

Each evaluator has a companion prompt with chain-of-thought, rubric, and JSON response format. Prompts are model-agnostic (Ollama llama3.2:8b compatible) with clear scoring criteria. Tag filters use canonical values from the codebase (TD-275 WP-0a).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
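The three score data types listed above (NUMERIC 0-1, BOOLEAN, CATEGORICAL) imply a validation step before scores are written back. A minimal sketch, assuming the categorical vocabulary is exactly the EV-04 set; the function name is hypothetical:

```python
def validate_score(data_type: str, value) -> bool:
    """Check an LLM-judge score against its declared Langfuse data type.

    Data types and the EV-04 category set come from the evaluator
    definitions above; everything else is an illustrative assumption.
    """
    if data_type == "NUMERIC":
        # EV-01/EV-05/EV-06 are declared as 0-1 ranges.
        return isinstance(value, (int, float)) and not isinstance(value, bool) \
            and 0.0 <= value <= 1.0
    if data_type == "BOOLEAN":
        return isinstance(value, bool)
    if data_type == "CATEGORICAL":
        return value in {"correct", "partially_correct", "incorrect"}
    return False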
…er support (PLAN-012 Phase 2)

- src/memory/evaluator/__init__.py: Package init exporting EvaluatorConfig, EvaluatorRunner
- src/memory/evaluator/provider.py: Multi-provider client (Ollama, OpenRouter, Anthropic, OpenAI, custom)
  - Ollama: OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
  - Anthropic: native Anthropic SDK (NOT OpenAI compat) per PM #190
  - Cloud providers raise ValueError if env var not set (fail-fast)
  - ZERO hardcoded API keys
- src/memory/evaluator/runner.py: Core pipeline with cursor-based pagination (PM #190 fix)
  - Deterministic score_id via md5(trace_id:evaluator_name:since) for idempotency
  - Uses get_client() V3 singleton, never Langfuse() constructor
  - Calls langfuse.flush() after all evaluations; shutdown() takes no args
- scripts/run_evaluations.py: CLI with --config, --evaluator, --dry-run, --since, --batch-size
- scripts/create_score_configs.py: Idempotent 6-config Score Config setup in Langfuse
- evaluator_config.yaml: Default config (Ollama/llama3.2:8b), ZERO secrets
- tests/test_evaluator_provider.py, test_evaluator_runner.py, test_evaluator_config.py: 73 tests passing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
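Two details from the commit above are worth a concrete sketch: the fail-fast env-var check for cloud providers and the deterministic score_id. Function names and the env-var table are illustrative assumptions; the md5(trace_id:evaluator_name:since) recipe is quoted from the commit:

```python
import hashlib
import os

def resolve_api_key(provider: str) -> str:
    """Fail-fast key resolution: cloud providers must have their env var set;
    local Ollama gets the dummy "ollama" key by default (per provider.py)."""
    if provider == "ollama":
        return os.environ.get("OLLAMA_API_KEY", "ollama")
    env_var = {
        "openrouter": "OPENROUTER_API_KEY",
        "anthropic": "ANTHROPIC_API_KEY",
        "openai": "OPENAI_API_KEY",
    }[provider]
    key = os.environ.get(env_var)
    if not key:
        raise ValueError(f"{env_var} is not set")  # ZERO hardcoded keys
    return key

def score_id(trace_id: str, evaluator_name: str, since: str) -> str:
    """Deterministic id so re-running the same window upserts instead of
    duplicating scores (md5 recipe from the runner.py commit)."""
    return hashlib.md5(
        f"{trace_id}:{evaluator_name}:{since}".encode()
    ).hexdigest()
```

Because the id is a pure function of its inputs, a crashed run can simply be restarted over the same --since window.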
…ew (PLAN-012)
UF-1: Remove evaluators/ prefix from prompt_file in all 6 evaluator YAMLs
UF-2: Flatten list-of-lists tags to flat string lists in all 6 evaluator YAMLs
UF-3: Wrap per-trace eval block in try/except to isolate individual failures
UF-4: Move the _matches_filter check before sampling so sampling draws from matching traces only
UF-5: Move langfuse.flush() into finally block so it runs even on error
UF-6: Fix EV-01 tags to [search, retrieval, best_practices] per AC-2 spec
UF-7: Remove [:TRACE_CONTENT_MAX] truncation from LLM prompt in provider.py
UF-8: Replace all {{...}} Jinja placeholders in 6 prompts with trace section refs
UF-9: Fix missing | on 0.5-0.69 rubric row in ev05_bootstrap_quality_prompt.md
UF-10: Define TRACE_CONTENT_MAX once in __init__.py, import in provider/runner
UF-11: Align max_tokens to 1024 in evaluator_config.yaml
UF-12: Add _client caching to get_client() to avoid recreating clients per call
UF-13: Make since a required keyword-only arg in run(); remove internal default
UF-14: Simplify test_no_sk_prefix_keys assertion to assert "sk-" not in source
UF-15: Add test_load_prompt_uses_actual_file_not_fallback to test_evaluator_runner.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
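UF-3, UF-4, and UF-5 above all concern the shape of the run loop: filter before sampling, isolate per-trace failures, and flush in a finally block. A minimal sketch of that loop under assumed names (the real runner uses the Langfuse client, trace objects, and config; here they are reduced to plain callables):

```python
import random

def run_evaluations(traces, matches_filter, evaluate, flush,
                    sample_rate=0.05, seed=None):
    """Evaluation loop sketch implementing UF-3/UF-4/UF-5.

    `traces`, `matches_filter`, `evaluate`, and `flush` stand in for the
    real Langfuse-backed objects; only the ordering logic is the point.
    """
    rng = random.Random(seed)
    results = []
    try:
        for trace in traces:
            if not matches_filter(trace):      # UF-4: filter first...
                continue
            if rng.random() >= sample_rate:    # ...then sample matches only
                continue
            try:                               # UF-3: one bad trace must not
                results.append(evaluate(trace))  # kill the whole run
            except Exception:
                continue
    finally:
        flush()                                # UF-5: flush even on error
    return results
```

Filtering before sampling matters because sampling first would silently shrink the effective rate on sparsely tagged trace streams.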
…12 Phase 3)

Creates scripts/create_datasets.py with 5 Langfuse golden datasets:
- DS-01: Retrieval Golden Set (25 items, all 5 collections, >=3 per collection)
- DS-02: Error Pattern Match (13 items, Python/JavaScript/bash)
- DS-03: Bootstrap Round-Trip (7 items, agent_id=parzival tenant isolation)
- DS-04: Keyword Trigger Routing (68 items, all patterns from triggers.py)
- DS-05: Chunking Quality (10 items, IntelligentChunker routing decisions)

Also creates tests/test_create_datasets.py with 35 tests covering:
- Dataset completeness (all 5 defined, required keys/metadata)
- DS-04 item count matches triggers.py exactly (68 patterns: error_detection=5, decision_keywords=20, session_history_keywords=16, best_practices_keywords=27)
- Item structure validation per dataset type
- No placeholder data (case-sensitive check)
- Idempotency/dry-run mode works without Langfuse connection
- V3 SDK compliance (get_client(), flush(), no constructor with explicit creds)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hase 3)

- tests/test_regression.py: 4 regression tests against golden datasets: DS-01 (retrieval relevance >= 0.7), DS-02 (error match >= 80%), DS-03 (bootstrap round-trip), DS-04 (keyword routing 100%)
- .github/workflows/regression-tests.yml: CI workflow triggers on PR to main/develop when src/memory/** or .claude/hooks/scripts/** change; blocks merge on regression; ZERO hardcoded secrets
- pyproject.toml: add 'regression' marker; default pytest excludes regression tests (-m 'not regression')

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
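The pyproject.toml change described above might look like the following sketch. Only the 'regression' marker name and the -m 'not regression' default come from the commit message; the marker description text is an assumption:

```toml
[tool.pytest.ini_options]
markers = [
    "regression: runs against Langfuse golden datasets (excluded by default)",
]
# Skip regression tests in the default local run; CI opts back in
# with an explicit `pytest -m regression`.
addopts = "-m 'not regression'"
```

This keeps the slow, network-dependent golden-dataset tests out of everyday developer runs while letting the CI workflow select them explicitly.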
…ale counts
- F-1: Update all DS-04 "63" references to "68" (module docstring, section
header, breakdown comment, dataset description string)
- F-2: Generate deterministic id="{dataset_name}-item-{i:03d}" for each
create_dataset_item() call to prevent duplicate creation on re-runs
- F-3: Update DS-02 fabricated error message to "assert 60 == 68" and fix_summary
to include error_detection (5) = 68 total
- F-4: Remove unused bare `import importlib` from test_create_datasets.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
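The F-2 fix above (deterministic item ids to prevent duplicates on re-runs) can be sketched as a small helper. The function name is hypothetical; the id format string is quoted from the commit:

```python
def build_items(dataset_name: str, items: list[dict]) -> list[dict]:
    """Attach deterministic ids of the form {dataset_name}-item-{i:03d}.

    Re-running create_dataset_item with a stable id is assumed to upsert
    rather than duplicate; only the id format comes from the F-2 fix above.
    """
    return [
        dict(item, id=f"{dataset_name}-item-{i:03d}")
        for i, item in enumerate(items)
    ]
```

Positional ids do mean that inserting an item mid-list shifts every id after it, so appending is the safe way to grow a dataset under this scheme.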
…ndings
- F-1+F-2 (CRITICAL): add detect_error_signal import+branch, fix trigger
name strings to match dataset exact values (best_practices_keywords,
session_history_keywords, error_detection) so all 68 DS-04 items route
- F-3: update stale "63" → "68" in comment and docstring
- NF-1: remove misleading {input}/{output} placeholders from fallback prompt
in runner.py (replaced with plain "Evaluate the following trace data.")
- NF-2: add test_single_trace_error_does_not_kill_run to test_evaluator_runner.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
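The F-1+F-2 routing fix above checks the error signal first, then falls through to the keyword sets, returning trigger names that must match the DS-04 dataset values exactly. A sketch under assumed signatures (the real detect_error_signal and keyword sets live in triggers.py; here they are parameters):

```python
def route_trigger(prompt, detect_error_signal, keyword_sets):
    """Route a prompt to a trigger name, error detection first.

    `keyword_sets` maps trigger names (e.g. "decision_keywords",
    "best_practices_keywords", "session_history_keywords") to keyword
    lists; names must match the DS-04 dataset strings exactly.
    """
    if detect_error_signal(prompt):
        return "error_detection"
    lowered = prompt.lower()
    for name, keywords in keyword_sets.items():
        if any(k in lowered for k in keywords):
            return name
    return None  # no trigger fires
```

Checking the error branch before keywords matters because an error traceback can easily contain incidental keyword matches.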
…atibility Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r to dry-run step
Remove: RAG-CHUNKING-BEST-PRACTICES-2026.md, claude-code-hooks-best-practices.md, conversation-memory-best-practices.md, sdk-vs-hooks-comparison.md

Update 5 files with broken references to point to HOOKS.md instead.
…a cloud support

- create_score_configs.py: Replace non-existent lf.create_score_config() with lf.api.score_configs.create(request=CreateScoreConfigRequest(...)) (V3 SDK)
- provider.py: Read OLLAMA_API_KEY from env instead of hardcoded "ollama", enabling Ollama cloud API access while preserving the local default
- evaluator_config.yaml: Document cloud base_url (https://ollama.com/v1) and OLLAMA_API_KEY env var

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PLAN-012 evaluator pipeline supports Ollama cloud models which require an API key from https://ollama.com/settings/keys. Local Ollama users can leave this blank. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- trace.list: from_timestamp/to_timestamp (not start_time/end_time)
- trace.list: page-based pagination (not cursor-based)
- create_score: score_id parameter (not id)
- Updated tests to match the corrected pagination pattern

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
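The corrected page-based pagination above can be sketched as a generator. The client object is faked down to an api.trace.list call with the parameter names the commit confirms (from_timestamp, to_timestamp, page); the limit parameter and empty-page termination are assumptions:

```python
def iter_traces(api, from_timestamp, to_timestamp, page_size=50):
    """Yield traces page by page until an empty page signals the end.

    Parameter names from_timestamp/to_timestamp/page match the corrected
    trace.list call above; the rest of the client shape is assumed.
    """
    page = 1
    while True:
        batch = api.trace.list(from_timestamp=from_timestamp,
                               to_timestamp=to_timestamp,
                               page=page, limit=page_size)
        if not batch:
            return
        yield from batch
        page += 1
```

A generator keeps memory flat regardless of how many traces fall inside the window, which matters for the 100%-sampled evaluators.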
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- provider.py: Fall back to msg.reasoning when msg.content is empty, supporting Qwen3 and other thinking models that use a reasoning field
- evaluator_config.yaml: Increase max_tokens from 1024 to 4096 to accommodate thinking tokens plus output for reasoning models
- Updated test assertion for the new default max_tokens

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
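The msg.reasoning fallback above amounts to a small extraction helper. A sketch, assuming the message object exposes content and reasoning attributes as the commit describes (the helper name is hypothetical):

```python
def extract_text(msg) -> str:
    """Return the judge's answer text, preferring content.

    Thinking models (e.g. Qwen3) may leave content empty and put the
    answer in a `reasoning` field; fall back to it, then to "".
    """
    content = getattr(msg, "content", None)
    if content:
        return content
    return getattr(msg, "reasoning", None) or ""
```

Defaulting to an empty string (rather than None) lets downstream JSON parsing fail with a clear decode error instead of an attribute error.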
Summary
Test plan
Note
Merging to main WITHOUT tagging v2.2.3. A follow-up sprint (v2.2.3B) will address Langfuse observability gaps (TD-280 to TD-288: tag propagation, observation-level evaluation, cron scheduler, retry logic) before the release tag.