
feat: v2.1.0 — Langfuse V3 migration, data quality, observability#52

Merged
Hidden-History merged 11 commits into main from feature/v2.1.0-quality-sprint on Mar 6, 2026

Summary

  • Langfuse V3 SDK migration — All instrumentation migrated from V2 to V3 SDK. Compliance review resolved 2 critical, 6 standard, and 9 warning-level issues across 18 files.
  • Data quality improvements — Fixed false-positive error pattern capture, added type filters to context injection tier 2, renamed error_fix → error_pattern across 36 files.
  • Trace observability — Standardized TRACE_CONTENT_MAX constant, propagated session_id to search traces, added atexit handlers for graceful Langfuse shutdown in Docker services.
  • Infrastructure — ClickHouse memory cap set to 16 GiB (up from 4 GiB), installer chmod +x fix for scripts subdirectories.
  • Testing — Fixed flaky rate limiter integration test (BUG-175) with deterministic mock approach.
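
The TRACE_CONTENT_MAX item above replaces scattered slice literals with one shared limit. A minimal sketch of that idea follows. The constant's value is not stated in this PR, so 10_000 is a placeholder, and truncate_for_trace() is a hypothetical helper name, not one taken from the codebase.

```python
# Placeholder value: the PR names the constant but does not state its value.
TRACE_CONTENT_MAX = 10_000


def truncate_for_trace(value: object, limit: int = TRACE_CONTENT_MAX) -> str:
    """Coerce a value to text and cap it at the shared trace limit."""
    # Non-string payloads (dicts, lists) are rendered with repr() before slicing.
    text = value if isinstance(value, str) else repr(value)
    return text[:limit]
```

Routing every call site through one helper (or one constant) means a later change to the limit touches a single line instead of each [:300] or [:10000] literal.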

Scope

  • 75 files changed (+1,005 / -581 lines)
  • 5 commits covering quality sprint (PM #144–148)
  • 2,111 tests pass locally

Test Plan

  • CI pipeline passes (unit + integration tests)
  • Langfuse V3 compliance: no V2 patterns (Langfuse() constructor, start_span(), langfuse_context)
  • TRACE_CONTENT_MAX used consistently (no hardcoded [:300] or [:10000] literals)
  • Error pattern capture uses regex indicators, not substring matching
  • Context injection tier 2 filters by memory_type
  • Docker services have atexit Langfuse shutdown handlers
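
To illustrate the "regex indicators, not substring matching" item, here is a small sketch of why a substring check produces false positives. The patterns are hypothetical examples; the project's actual indicator list is not shown in this PR.

```python
import re

# Hypothetical indicator patterns, chosen only for illustration.
ERROR_INDICATORS = [
    re.compile(r"^Traceback \(most recent call last\):", re.MULTILINE),
    re.compile(r"\b\w*(?:Error|Exception):\s"),
    re.compile(r"\bexit (?:code|status) [1-9]\d*\b", re.IGNORECASE),
]


def looks_like_error_substring(output: str) -> bool:
    """Naive check: flags any output that merely mentions the word 'error'."""
    return "error" in output.lower()


def looks_like_error_regex(output: str) -> bool:
    """Structured check: requires an error-shaped pattern, not just the word."""
    return any(pattern.search(output) for pattern in ERROR_INDICATORS)


benign = "0 errors, 12 warnings\nerror_pattern renamed in 36 files"
real = 'Traceback (most recent call last):\n  File "x.py", line 1\nValueError: bad input'
```

With these inputs the substring check flags the benign output, while the regex check flags only the real traceback.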

WB Solutions and others added 11 commits March 2, 2026 18:20
PART A — Zero claude_code_session traces (langfuse_stop_hook.py):
- Fix transcript format mismatch: add _get_entry_role()/_get_entry_content()
  helpers to handle Claude Code V2.x format (type/message.content) alongside
  older role/content format. _pair_turns() and first/last text extraction
  updated to use helpers — was silently producing empty traces.
- Increase FLUSH_TIMEOUT_SECONDS 5→15, configurable via env var
  LANGFUSE_FLUSH_TIMEOUT_SECONDS. 5s SIGALRM was firing before flush could
  complete, causing all queued trace data to be discarded.
- Upgrade credential-missing diagnostic from INFO→WARNING so it's visible
  without DEBUG_HOOKS; add transcript parse counts to INFO log.

PART B — TD-241: 26/44 hook pipeline traces missing session_id:
- agent_response_capture, user_prompt_capture, post_tool_capture,
  error_pattern_capture: set os.environ["CLAUDE_SESSION_ID"] = session_id
  in the hook process after reading from hook_input, so downstream
  library calls using emit_trace_event() pick it up via env fallback.
- Propagate CLAUDE_SESSION_ID into subprocess_env for background
  store_async subprocesses, alongside the existing LANGFUSE_TRACE_ID.
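
A rough sketch of the propagation described in Part B, assuming the hook payload carries a "session_id" key. The helper names are hypothetical; only the CLAUDE_SESSION_ID and LANGFUSE_TRACE_ID variable names come from the commit message.

```python
import os
from typing import Any, Dict, Optional


def export_session_id(hook_input: Dict[str, Any]) -> Optional[str]:
    """Copy session_id from the hook payload into this process's environment."""
    session_id = hook_input.get("session_id")
    if not session_id:
        return None
    os.environ["CLAUDE_SESSION_ID"] = str(session_id)
    return str(session_id)


def resolve_session_id(explicit: Optional[str] = None) -> Optional[str]:
    """Env fallback: prefer an explicit value, else read the exported one."""
    return explicit or os.environ.get("CLAUDE_SESSION_ID")


def build_subprocess_env(session_id: Optional[str], trace_id: Optional[str]) -> Dict[str, str]:
    """Build the environment handed to a background store subprocess."""
    env = os.environ.copy()
    if session_id:
        env["CLAUDE_SESSION_ID"] = session_id
    if trace_id:
        env["LANGFUSE_TRACE_ID"] = trace_id
    return env
```

The returned mapping would be passed as env= to subprocess.Popen, so the child process resolves the same session as the hook that spawned it.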

Fixes: BUG-200 (zero session traces), BUG-201 (TD-241 session_id orphans)
Refs: SPEC-022 S2, BUG-151–BUG-157

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ClickHouse memory 4→16 GiB (Langfuse recommended). Fix langfuse_stop_hook
V2.x transcript format (root cause of zero session traces). Fix
post_tool_capture validate_hook_input (root cause of zero implementation
patterns). Add agent_id to all 4 store hook payloads. Move Tier 2 type
exclusion to Qdrant must_not filter. Propagate session_id to store
subprocesses (fix orphaned spans). Fix [:500] trace limits in new_file
and first_edit triggers. Add chmod +x for scripts subdirectories in both
install paths.

Rename error_fix → error_pattern across 36 files (TD-236/238/239).
Merge duplicate LLM prompt type definitions. Deduplicate VALID_TYPES and
SKIP_RECLASSIFICATION_TYPES. Fix cross-agent collision artifacts.

Dual adversarial review (opus + sonnet): 5 MUST-FIX resolved. 2137 tests
pass, 0 failures.

Resolves: TD-236, TD-238, TD-239, TD-240, TD-241, TD-243

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mpliance

Enforce Langfuse Python SDK V3 across the entire codebase per
LANGFUSE-INTEGRATION-SPEC.md (created this session, 766 lines).

Key changes:
- langfuse_stop_hook.py: V2 start_span() → V3 start_as_current_observation()
  + propagate_attributes() for session/user linking
- langfuse_config.py: V2 Langfuse() constructor → V3 get_client() singleton
- github/sync.py + code_sync.py: add propagate_attributes() + flush()
- jira/sync.py: add session_id to all 3 emit_trace_event calls, graceful
  import fallback, try/except guards, TRACE_CONTENT_MAX constant
- 26 files: add LANGFUSE-INTEGRATION-SPEC header comments (Path A/B/Infra)
- tests: update 5 stop hook tests for V3 mock patterns (40/40 pass)
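
The get_client() singleton and the graceful import fallback mentioned above can be combined roughly as follows. This is a sketch, not the contents of langfuse_config.py; get_langfuse_client() is a hypothetical wrapper name.

```python
import logging

logger = logging.getLogger(__name__)

try:
    # V3 entry point named in the commit message.
    from langfuse import get_client
except ImportError:  # SDK not installed, or an older SDK without get_client
    get_client = None


def get_langfuse_client():
    """Return the shared Langfuse client, or None when tracing is unavailable."""
    if get_client is None:
        return None
    try:
        return get_client()
    except Exception as exc:  # tracing must never break the caller
        logger.warning("Langfuse client unavailable: %s", exc)
        return None
```

Callers then check for None once and skip tracing, rather than wrapping every call site in its own import guard.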

Review: 9 findings (3 MUST-FIX resolved, 3 SHOULD-FIX deferred as TD-244/245/246)
Tests: 2109 passed, 0 failures

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ARNINGS resolved

PM #146: Full Langfuse V3 SDK compliance review sprint.
77 files audited by 6 parallel review agents + 2 adversarial opus reviewers.
2 dev→review cycles converged to zero remaining issues.

CRITICAL fixes:
- trace_flush_worker.py: V2 start_span/start_generation → V3 start_observation
  with trace_context, observation.end() TypeError safety
- test_langfuse_config.py: V2 Langfuse() constructor mocks → V3 get_client mocks
  + ImportError path test

ISSUES fixed:
- decay.py, freshness.py: guarded Langfuse imports, TRACE_CONTENT_MAX slicing,
  session_id propagation
- injection.py: session_id propagation, contextlib.suppress → try/except
- process_classification_queue.py: TRACE_CONTENT_MAX, dict serialization, session_id
- github/sync.py: flush try/finally, atexit shutdown handler (TD-245)
- github/code_sync.py: atexit shutdown handler (TD-246)
- jira/sync.py: TRACE_CONTENT_MAX slicing

WARNINGS fixed:
- search.py, config.py: Langfuse spec header comments
- 4 classifier providers: header corrected to "Path A upstream"
- langfuse_config.py: dead module-level imports removed

Tests: 2111 pass (62/62 Langfuse-specific tests pass).
TD-245 FIXED, TD-246 FIXED.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… BUG-175

Category A code fixes from PM #147 comprehensive review:

- search.py: Add session_id to 4 emit_trace_event() calls so search
  traces link to their Claude Code session in Langfuse UI. Replace 4x
  hardcoded [:10000] with [:TRACE_CONTENT_MAX] constant per spec §9.2.
- process_classification_queue.py: Add atexit Langfuse shutdown handler
  for classification worker daemon, matching sync.py TD-245 pattern.
  Ensures buffered OTel spans are flushed on Docker SIGTERM.
- test_async_sdk_wrapper.py: Fix BUG-175 flaky test by mocking
  asyncio.sleep instead of relying on real timing. Root cause: 2 req/min
  rate limiter needs 30s refill, conflicting with 30s pytest timeout.
  Remove @pytest.mark.quarantine — test now deterministic.
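
The BUG-175 fix can be illustrated with a toy limiter. The RateLimiter class below is a stand-in written for this example, not the project's wrapper. The injectable clock is also an assumption: a patched sleep alone is not enough if the limiter still reads real time.

```python
import asyncio
import time
from unittest.mock import patch


class RateLimiter:
    """Toy token bucket: rate_per_min tokens per minute, capacity equal to the rate."""

    def __init__(self, rate_per_min: int, clock=time.monotonic):
        self.capacity = float(rate_per_min)
        self.seconds_per_token = 60.0 / rate_per_min
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()

    async def acquire(self) -> None:
        while True:
            now = self.clock()
            refill = (now - self.last) / self.seconds_per_token
            self.tokens = min(self.capacity, self.tokens + refill)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return
            await asyncio.sleep((1.0 - self.tokens) * self.seconds_per_token)


async def _demo() -> list:
    fake_now = 0.0
    slept = []

    async def fake_sleep(seconds: float) -> None:
        nonlocal fake_now
        slept.append(seconds)
        fake_now += seconds  # advance virtual time instead of waiting

    with patch("asyncio.sleep", new=fake_sleep):
        limiter = RateLimiter(rate_per_min=2, clock=lambda: fake_now)
        for _ in range(3):
            await limiter.acquire()
    return slept


slept = asyncio.run(_demo())
print(slept)  # → [30.0]
```

The third acquire requests a 30-second wait, but the test finishes immediately because only virtual time advances, so it no longer races a 30-second pytest timeout.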

All changes reviewed (sonnet adversarial + Parzival verification).
2111 tests pass, 0 failures.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add agent_name/agent_role to all emit_trace_event() metadata dicts
(23 files, 89 occurrences) for per-agent Langfuse filtering.
Defaults to main/user for non-team sessions.
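
One way to apply the main/user defaults is a small helper that enriches each metadata dict before it is emitted. The helper name and the two environment variable names are hypothetical; the PR does not say where the agent identity comes from.

```python
import os
from typing import Any, Dict, Optional


def with_agent_metadata(metadata: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a copy of metadata with agent_name/agent_role filled in."""
    enriched = dict(metadata or {})
    # Hypothetical env var names, used only for this illustration.
    enriched.setdefault("agent_name", os.environ.get("EXAMPLE_AGENT_NAME", "main"))
    enriched.setdefault("agent_role", os.environ.get("EXAMPLE_AGENT_ROLE", "user"))
    return enriched
```

Using setdefault() keeps any value a caller already set, so team sessions that pass explicit names are not overwritten.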

Review fixes (adversarial review, 2x sonnet+opus):
- F-1: Add error_fix to must_not_types in context_injection_tier2.py
- F-2: Move atexit handler outside AnthropicInstrumentor try block
- F-3: Add error_fix to error_detection.py search filter
- F-4: Add warning log to trace_flush_worker end_time fallback
- F-5: Add 3 must_not_types test cases (2114 total tests pass)
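
For reference, a type exclusion expressed as a Qdrant must_not condition looks roughly like this in the JSON filter shape. The payload key "type" and the helper name are assumptions; the real field name and excluded types live in context_injection_tier2.py and are not shown here.

```python
from typing import Any, Dict, Iterable


def build_type_exclusion_filter(must_not_types: Iterable[str]) -> Dict[str, Any]:
    """Build a Qdrant filter (JSON shape) that drops unwanted types server-side."""
    excluded = sorted(set(must_not_types))  # de-duplicate and keep a stable order
    return {
        "must_not": [
            # Assumed payload key; "any" matches a point whose value is in the list.
            {"key": "type", "match": {"any": excluded}},
        ]
    }
```

Filtering inside the query means excluded points never occupy result slots, unlike a post-filter applied after the top-k results come back.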

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix I001 (import sorting), E402 (noqa for intentional lazy imports),
SIM105 (contextlib.suppress), RUF100 (unused noqa), RUF003 (unicode).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add agent_name/agent_role to 4 missing emit_trace_event() calls
  in classification_worker.py and process_classification_queue.py
- Wrap FLUSH_TIMEOUT_SECONDS int() parse in try/except (FINDING-1)
- Add v2.1.0 upgrade instructions to CHANGELOG
- Update ClickHouse resource table (1 GB → 16 GiB cap)
- Fix error_detection.py docstring to match dual-type filter
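
The FLUSH_TIMEOUT_SECONDS guard (FINDING-1) might look like the sketch below. The env var name and the default of 15 come from the earlier commit in this PR. Falling back to the default for non-positive values is an extra assumption, and read_flush_timeout() is a hypothetical name.

```python
import os

DEFAULT_FLUSH_TIMEOUT_SECONDS = 15


def read_flush_timeout(default: int = DEFAULT_FLUSH_TIMEOUT_SECONDS) -> int:
    """Read LANGFUSE_FLUSH_TIMEOUT_SECONDS, falling back on bad input."""
    raw = os.environ.get("LANGFUSE_FLUSH_TIMEOUT_SECONDS")
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default  # e.g. "abc" or "1.5"
    # Assumption: zero or negative timeouts are treated as misconfiguration.
    return value if value > 0 else default
```

Without the try/except, a malformed value in the environment would raise at import time and take the whole hook down with it.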

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Run black formatter on 18 files (9 from our changes, 9 pre-existing).
Clarify CHANGELOG ClickHouse memory entry: 16 GiB cap (up from 4 GiB,
down from unlimited default).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Bump pyproject.toml version from 2.0.8 to 2.1.0
- Fix test_generate_hook_config_stop: isolate LANGFUSE_ENABLED env var
- Add test_generate_hook_config_stop_with_langfuse: verify dual Stop hooks
- Fix TestLangfuseDefaults: autouse fixture clears Langfuse env vars
  (pydantic reads env vars even with _env_file=None)
- Fix 6 test files: replace relative fixture/sys.path with
  Path(__file__)-based resolution for cwd-independent execution
- All 2115 tests pass, black + ruff clean
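
The cwd-independent resolution mentioned above can be sketched as a helper that anchors paths to the test file itself. resolve_from() and the "fixtures" directory name are hypothetical.

```python
from pathlib import Path


def resolve_from(anchor_file: str, *parts: str) -> Path:
    """Resolve a path relative to the directory containing anchor_file."""
    return Path(anchor_file).resolve().parent.joinpath(*parts)


# Typical use inside a test module (directory name is illustrative):
#   FIXTURES_DIR = resolve_from(__file__, "fixtures")
#   sys.path.insert(0, str(resolve_from(__file__, "..", "src")))
```

Because the anchor is the file's own location, the same paths resolve whether pytest is started from the repository root, the tests directory, or a CI working directory.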

Reviewed-by: reviewer-sonnet (PASS, 0 findings)
Reviewed-by: reviewer-opus (PASS, 2 LOW non-blocking)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hidden-History merged commit 83035ec into main on Mar 6, 2026
12 checks passed