
feat: LLM-as-judge evaluation pipeline + observability cleanup (v2.2.3)#63

Merged
Hidden-History merged 26 commits into main from feature/v2.2.3-cleanup on Mar 14, 2026

Conversation

Hidden-History (Owner) commented Mar 14, 2026

Summary

  • Wave 1: Langfuse observability cleanup (TD-275 semantic tags, TD-228 per-search traces, TD-217/218 detect-secrets + GH Actions)
  • Wave 2: LLM-as-judge evaluation pipeline (PLAN-012 Phase 2+3) — engine, 6 evaluators, 5 golden datasets, regression tests + CI gate
  • 6 V3 SDK bug fixes discovered during live testing (BUG-211 to BUG-216)
  • Ollama cloud API support for evaluator pipeline
  • Thinking model support (Qwen3 reasoning field fallback)

Test plan

  • 2,490 unit tests pass (0 failures)
  • CI green — Lint, Unit Tests (3.10/3.11/3.12), Integration, Regression, CodeQL
  • Live evaluation: 4 scores attached to Langfuse via Ollama cloud (qwen3.5:397b)
  • Score configs verified in Langfuse (6 configs: 3 NUMERIC, 2 BOOLEAN, 1 CATEGORICAL)
  • Installer Option 1 verified — evaluator module installs correctly

Note

Merging to main WITHOUT tagging v2.2.3. A follow-up sprint (v2.2.3B) will address Langfuse observability gaps (TD-280 to TD-288: tag propagation, observation-level evaluation, cron scheduler, retry logic) before the release tag.

WB Solutions and others added 26 commits March 13, 2026 17:58
1. update_shared_scripts() now copies the templates/ dir alongside scripts,
   ensuring new templates (e.g., decision-log.md) deploy on Option 1
   installs, not just full installs.

2. story-complete.md: observability section with structured logging,
   Prometheus metrics, Grafana dashboard checklist items.

3. production-ready.md: specific observability items replacing generic
   "Monitoring in place" / "Logging adequate" lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ersions, TD-219 screenshot dir

- TD-100: sanitize_log_input already applied to all user inputs in monitoring/main.py (pre-existing)
- TD-217: move detect-secrets from prod deps to dev optional-dependencies
- TD-218: downgrade GH Actions to confirmed stable versions (checkout@v4, setup-python@v5, github-script@v7, release-changelog-builder-action@v5)
- TD-219: ensure_screenshot_dir fixture already exists in tests/e2e/conftest.py (pre-existing)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…D-275)

Updates tags= parameters across 21 files to use the canonical tag mapping:
- user_prompt_capture/store_async: ["capture", "user_prompt"]
- agent_response_capture/store_async: ["capture", "agent_response"]
- error_detection/pattern_capture/store_async: ["capture", "error_detection"]
- context_injection_tier2: ["injection", "tier2"]
- session_start: ["injection", "tier1", "bootstrap"]
- pre_compact_save: ["capture", "session_summary"]
- post_tool_capture/new_file_trigger/first_edit_trigger/store_async: ["capture", "trigger"]
- best_practices_retrieval: ["retrieval", "best_practices"]
- manual_save_memory: ["capture", "manual"]
- classification_worker/process_classification_queue: ["classification"]
- jira/sync: ["sync", "jira"]
- src/memory/search: ["search", "retrieval"]
- src/memory/decay: ["search", "decay"]
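The canonical mapping above boils down to a lookup table keyed by event source. A minimal sketch (names and module layout are assumptions for illustration, not the actual codebase):

```python
# Hypothetical sketch of the canonical tag mapping described above (TD-275).
# Entries mirror the list in this commit message; the real module may
# organise them differently.
CANONICAL_TAGS = {
    "user_prompt_capture": ["capture", "user_prompt"],
    "agent_response_capture": ["capture", "agent_response"],
    "context_injection_tier2": ["injection", "tier2"],
    "session_start": ["injection", "tier1", "bootstrap"],
    "best_practices_retrieval": ["retrieval", "best_practices"],
}

def tags_for(event: str) -> list[str]:
    """Return the canonical tags for an event, empty if unmapped."""
    return CANONICAL_TAGS.get(event, [])
```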

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ath (TD-228)

- Fix resume trace tags: ["injection", "tier1", "bootstrap"] → ["injection", "resume"] (DEC-054)
- Add per-search trace (memory_retrieval_search) for session summaries in non-Parzival compact path
- Add per-search trace (memory_retrieval_search) for decisions in non-Parzival compact path
- Add greedy-fill trace after inject_with_priority with budget/selected/dropped counts
- All new events tagged ["injection", "compact"] with session_id, guarded try/except
- Non-Parzival compact path now has 4 emit_trace_event calls (up from 1)
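The guarded try/except pattern above can be sketched as follows (emit_trace_event is passed in here for illustration; the real call site and signature are assumptions):

```python
# Hypothetical sketch of the guarded per-search trace emission described
# above: observability is best-effort and must never break the compact path.
def emit_search_trace(emit_trace_event, session_id: str, event_type: str,
                      budget: int, selected: int, dropped: int) -> bool:
    try:
        emit_trace_event(
            event_type,
            tags=["injection", "compact"],
            metadata={"session_id": session_id, "budget": budget,
                      "selected": selected, "dropped": dropped},
        )
        return True
    except Exception:
        # Swallow tracing errors so injection keeps working even if
        # the observability backend is down.
        return False
```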

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ding F-5)

- F-1: Complete TD-275 — 12 remaining single-element tags updated to dual-element
- F-2: store_async.py tags restored to ["capture", "store"]
- F-3: error_detection.py injection event correctly tagged ["injection", "error_detection"]
- F-4: Unique event_types for compact session/decision traces
- F-6: best_practices_retrieval.py tag aligned to ["search", "best_practices"]
- F-7: GH Actions upload-artifact@v4, cache@v4
- F-8: Removed hardcoded score from session summary traces

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-012 Phase 2)

- EV-01 retrieval_relevance: NUMERIC 0-1, filters search/retrieval tags, 5% sampling
- EV-02 injection_value: BOOLEAN, filters injection/tier2 + injection/compact tags, 5% sampling
- EV-03 capture_completeness: BOOLEAN, filters capture tags, 5% sampling
- EV-04 classification_accuracy: CATEGORICAL (correct/partially_correct/incorrect), filters classification tags, 10% sampling
- EV-05 bootstrap_quality: NUMERIC 0-1, filters injection/tier1/bootstrap tags, 100% sampling
- EV-06 session_coherence: NUMERIC 0-1, filters capture/session_summary tags, 100% sampling

Each evaluator has a companion prompt with chain-of-thought, rubric, and JSON response format.
Prompts are model-agnostic (Ollama llama3.2:8b compatible) with clear scoring criteria.
Tag filters use canonical values from codebase (TD-275 WP-0a).
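The filter-plus-sampling behaviour each evaluator applies can be sketched as a single predicate (a minimal illustration; the real config loading and trace shape are assumptions):

```python
# Hypothetical sketch of the per-evaluator gating described above: a trace
# is considered only if its tags match the evaluator's filter, and then
# only for the configured sampling fraction (e.g. 0.05 for EV-01, 1.0 for EV-05).
import random

def should_evaluate(trace_tags: list[str], filter_tags: list[str],
                    sample_rate: float, rng: random.Random) -> bool:
    if not all(t in trace_tags for t in filter_tags):
        return False
    return rng.random() < sample_rate
```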

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…er support (PLAN-012 Phase 2)

- src/memory/evaluator/__init__.py: Package init exporting EvaluatorConfig, EvaluatorRunner
- src/memory/evaluator/provider.py: Multi-provider client (Ollama, OpenRouter, Anthropic, OpenAI, custom)
  - Ollama: OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
  - Anthropic: native Anthropic SDK (NOT OpenAI compat) per PM #190
  - Cloud providers raise ValueError if env var not set (fail-fast)
  - ZERO hardcoded API keys
- src/memory/evaluator/runner.py: Core pipeline with cursor-based pagination (PM #190 fix)
  - Deterministic score_id via md5(trace_id:evaluator_name:since) for idempotency
  - Uses get_client() V3 singleton, never Langfuse() constructor
  - Calls langfuse.flush() after all evaluations, shutdown() takes no args
- scripts/run_evaluations.py: CLI with --config, --evaluator, --dry-run, --since, --batch-size
- scripts/create_score_configs.py: Idempotent 6-config Score Config setup in Langfuse
- evaluator_config.yaml: Default config (Ollama/llama3.2:8b), ZERO secrets
- tests/test_evaluator_provider.py, test_evaluator_runner.py, test_evaluator_config.py: 73 tests passing
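The deterministic score_id scheme mentioned above can be sketched directly from its description:

```python
# Sketch of the idempotent score_id described above: md5 over
# "trace_id:evaluator_name:since", so re-running the same evaluation window
# yields the same id instead of duplicating scores. (Function name assumed.)
import hashlib

def deterministic_score_id(trace_id: str, evaluator_name: str, since: str) -> str:
    key = f"{trace_id}:{evaluator_name}:{since}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()
```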

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ew (PLAN-012)

UF-1: Remove evaluators/ prefix from prompt_file in all 6 evaluator YAMLs
UF-2: Flatten list-of-lists tags to flat string lists in all 6 evaluator YAMLs
UF-3: Wrap per-trace eval block in try/except to isolate individual failures
UF-4: Move _matches_filter check before sampling so sampling draws only from matching traces
UF-5: Move langfuse.flush() into finally block so it runs even on error
UF-6: Fix EV-01 tags to [search, retrieval, best_practices] per AC-2 spec
UF-7: Remove [:TRACE_CONTENT_MAX] truncation from LLM prompt in provider.py
UF-8: Replace all {{...}} Jinja placeholders in 6 prompts with trace section refs
UF-9: Fix missing | on 0.5-0.69 rubric row in ev05_bootstrap_quality_prompt.md
UF-10: Define TRACE_CONTENT_MAX once in __init__.py, import in provider/runner
UF-11: Align max_tokens to 1024 in evaluator_config.yaml
UF-12: Add _client caching to get_client() to avoid recreating clients per call
UF-13: Make since a required keyword-only arg in run(); remove internal default
UF-14: Simplify test_no_sk_prefix_keys assertion to assert "sk-" not in source
UF-15: Add test_load_prompt_uses_actual_file_not_fallback to test_evaluator_runner.py
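UF-12's client caching can be sketched like this (the factory indirection and cache key are illustrative assumptions, not the actual provider.py code):

```python
# Hypothetical sketch of UF-12: cache provider clients per (provider, base_url)
# so repeated get_client() calls reuse one object instead of reconstructing
# it for every trace evaluated.
_client_cache: dict = {}

def get_client(provider: str, base_url: str, factory):
    key = (provider, base_url)
    if key not in _client_cache:
        _client_cache[key] = factory()
    return _client_cache[key]
```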

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…12 Phase 3)

Creates scripts/create_datasets.py with 5 Langfuse golden datasets:
- DS-01: Retrieval Golden Set (25 items, all 5 collections, >=3 per collection)
- DS-02: Error Pattern Match (13 items, Python/JavaScript/bash)
- DS-03: Bootstrap Round-Trip (7 items, agent_id=parzival tenant isolation)
- DS-04: Keyword Trigger Routing (68 items, all patterns from triggers.py)
- DS-05: Chunking Quality (10 items, IntelligentChunker routing decisions)

Also creates tests/test_create_datasets.py with 35 tests covering:
- Dataset completeness (all 5 defined, required keys/metadata)
- DS-04 item count matches triggers.py exactly (68 patterns: error_detection=5,
  decision_keywords=20, session_history_keywords=16, best_practices_keywords=27)
- Item structure validation per dataset type
- No placeholder data (case-sensitive check)
- Idempotency/dry-run mode works without Langfuse connection
- V3 SDK compliance (get_client(), flush(), no constructor with explicit creds)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hase 3)

- tests/test_regression.py: 4 regression tests against golden datasets
  DS-01 (retrieval relevance >= 0.7), DS-02 (error match >= 80%),
  DS-03 (bootstrap round-trip), DS-04 (keyword routing 100%)
- .github/workflows/regression-tests.yml: CI workflow triggers on PR
  to main/develop when src/memory/** or .claude/hooks/scripts/** change;
  blocks merge on regression; ZERO hardcoded secrets
- pyproject.toml: add 'regression' marker; default pytest excludes
  regression tests (-m 'not regression')
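The marker setup described above corresponds roughly to this pyproject.toml fragment (section and option names sketched from memory of pytest conventions; the actual file may differ):

```toml
# Sketch of the pytest configuration described above.
[tool.pytest.ini_options]
addopts = "-m 'not regression'"
markers = [
    "regression: golden-dataset regression tests (run only in the CI gate)",
]
```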

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ale counts

- F-1: Update all DS-04 "63" references to "68" (module docstring, section
  header, breakdown comment, dataset description string)
- F-2: Generate deterministic id="{dataset_name}-item-{i:03d}" for each
  create_dataset_item() call to prevent duplicate creation on re-runs
- F-3: Update DS-02 fabricated error message to "assert 60 == 68" and fix_summary
  to include error_detection (5) = 68 total
- F-4: Remove unused bare `import importlib` from test_create_datasets.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ndings

- F-1+F-2 (CRITICAL): add detect_error_signal import+branch, fix trigger
  name strings to match dataset exact values (best_practices_keywords,
  session_history_keywords, error_detection) so all 68 DS-04 items route
- F-3: update stale "63" → "68" in comment and docstring
- NF-1: remove misleading {input}/{output} placeholders from fallback prompt
  in runner.py (replaced with plain "Evaluate the following trace data.")
- NF-2: add test_single_trace_error_does_not_kill_run to test_evaluator_runner.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…atibility

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove: RAG-CHUNKING-BEST-PRACTICES-2026.md, claude-code-hooks-best-practices.md,
conversation-memory-best-practices.md, sdk-vs-hooks-comparison.md

Update 5 files with broken references to point to HOOKS.md instead.
…a cloud support

- create_score_configs.py: Replace non-existent lf.create_score_config() with
  lf.api.score_configs.create(request=CreateScoreConfigRequest(...)) (V3 SDK)
- provider.py: Read OLLAMA_API_KEY from env instead of hardcoded "ollama",
  enabling Ollama cloud API access while preserving local default
- evaluator_config.yaml: Document cloud base_url (https://ollama.com/v1) and
  OLLAMA_API_KEY env var
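The key resolution described above can be sketched as a small helper (name and env-injection parameter are illustrative; provider.py may do this inline):

```python
# Hypothetical sketch of the env-driven key lookup described above: use
# OLLAMA_API_KEY when set (cloud API), otherwise fall back to the dummy
# "ollama" key that a local Ollama server accepts.
import os

def resolve_ollama_api_key(env=None) -> str:
    env = os.environ if env is None else env
    return env.get("OLLAMA_API_KEY") or "ollama"
```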

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PLAN-012 evaluator pipeline supports Ollama cloud models which require
an API key from https://ollama.com/settings/keys. Local Ollama users
can leave this blank.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- trace.list: from_timestamp/to_timestamp (not start_time/end_time)
- trace.list: page-based pagination (not cursor-based)
- create_score: score_id parameter (not id)
- Updated tests to match corrected pagination pattern
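The corrected page-based pagination pattern looks roughly like this (fetch_page stands in for the Langfuse trace-list call; its exact signature here is an assumption):

```python
# Hypothetical sketch of page-based pagination as described above: request
# page 1, 2, ... until a short or empty page signals the end.
def iter_traces(fetch_page, from_timestamp, to_timestamp, limit: int = 50):
    page = 1
    while True:
        batch = fetch_page(page=page, limit=limit,
                           from_timestamp=from_timestamp,
                           to_timestamp=to_timestamp)
        if not batch:
            return
        yield from batch
        if len(batch) < limit:
            return  # short page: nothing further to fetch
        page += 1
```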

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- provider.py: Fall back to msg.reasoning when msg.content is empty,
  supporting Qwen3 and other thinking models that use a reasoning field
- evaluator_config.yaml: Increase max_tokens from 1024 to 4096 to
  accommodate thinking tokens + output for reasoning models
- Updated test assertion for new default max_tokens
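The content-to-reasoning fallback can be sketched as follows (attribute access is duck-typed here for illustration; the real code reads the provider's response object):

```python
# Hypothetical sketch of the fallback described above: thinking models such
# as Qwen3 may leave message.content empty and put their text in a
# reasoning field instead.
def extract_text(message) -> str:
    content = getattr(message, "content", None)
    if content:
        return content
    return getattr(message, "reasoning", "") or ""
```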

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Hidden-History Hidden-History merged commit 78892ef into main Mar 14, 2026
13 checks passed
@Hidden-History Hidden-History deleted the feature/v2.2.3-cleanup branch March 14, 2026 14:10