
Add Langfuse observability and improve Parzival context injection#45

Merged
Hidden-History merged 27 commits into main from
feature/v2.0.9-observability
Mar 2, 2026

@Hidden-History Hidden-History commented Mar 1, 2026

Summary

v2.0.9 — Injection quality sprint (PLAN-010) + Langfuse observability + installer fixes.

  • Dedicated github Qdrant collection — eliminates 79.6% noise from discussions
  • Structured error pattern detection — eliminates false positives in code-patterns
  • Tier 2 context injection type filters — prevents low-value content injection
  • Langfuse trace visibility restored — 15 hardcoded [:300] truncations removed, TRACE_CONTENT_MAX=10000 standardized
  • Parzival layered priority bootstrap (L1-L4 deterministic + semantic)
  • Content quality gate for low-value messages
  • Installer Option 1 recursive copy fix (BUG-205)
  • Migration script: purge false positives + rename error_fix → error_pattern

Test Results

  • CI: All checks green (Lint, Unit Tests 3.10/3.11/3.12, Integration, CodeQL, Installation Ubuntu+macOS)
  • Local: 2,108 pass, 0 fail, 477 skipped
  • Live verification: 277 tests across 11 domains — 249 pass (89.9%), 0 critical issues
  • Langfuse audit: 9 traces, 42 spans — zero truncation found, full pipeline visible
  • Hook verification: All 14 scripts checksummed, zero [:300] instances, TRACE_CONTENT_MAX=10000 confirmed
  • System health: 16/16 Docker services healthy, 5/5 Qdrant collections verified

Commits (27)

27 commits covering: PLAN-010 injection quality, Langfuse observability, Parzival bootstrap, BUG-197/198/199/200/201/204/205, TD-237, CodeQL fix, installer recursive copy + chmod.

Upgrade Path

See CHANGELOG.md for complete upgrade instructions (Option 1 + container rebuild + migration).

WB Solutions and others added 24 commits March 1, 2026 09:33
…ion, and session_start

Instruments the retrieval pipeline with Langfuse trace events via the
fire-and-forget trace buffer pattern (emit_trace_event). Previously,
search.py and injection.py had zero Langfuse visibility, and
session_start.py only emitted aggregate counts.

- search.py: search_query, dual_collection_search, cascading_search events
  with collection, model, duration, and score metadata
- injection.py: bootstrap_retrieval (per-category counts + per-result scores),
  greedy_fill (budget utilization, dedup/score-gap skip counts, cached token
  counts), format_injection (output size tracking)
- session_start.py: enhanced 4 existing trace events with per-result detail
  (type, collection, score, tokens) for both startup and resume/compact paths

All emit calls guarded by if emit_trace_event + try/except Exception: pass.
No direct Langfuse SDK imports. No existing signatures or behavior changed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
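
The guard pattern described in this commit could look roughly like the sketch below. Only the name `emit_trace_event` comes from the commit message; the import path, the keyword arguments, and the `traced_search` helper are assumptions made for illustration.

```python
import time

try:
    # Hypothetical import path; the commit only names emit_trace_event.
    from memory.trace_buffer import emit_trace_event
except ImportError:
    emit_trace_event = None  # tracing is optional, callers must still work


def traced_search(query, run_search):
    """Run a search and emit a trace event without ever failing the caller."""
    start = time.perf_counter()
    results = run_search(query)
    duration_ms = (time.perf_counter() - start) * 1000.0

    if emit_trace_event:
        try:
            # Assumed keyword names; the real signature is not shown in this PR.
            emit_trace_event(
                event_type="search_query",
                data={
                    "query": query,
                    "result_count": len(results),
                    "duration_ms": duration_ms,
                },
            )
        except Exception:
            pass  # fire-and-forget: a tracing failure must not break retrieval
    return results
```

The search result is returned unchanged whether tracing is unavailable or the emit call raises, which is what keeps existing behavior intact.
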
…maries

Previous trace events only showed aggregate counts like "Found 5 results,
top score: 0.8283" — making it impossible to see WHAT was actually
retrieved and injected. Now all trace event outputs include the actual
content with per-result previews (type, collection, score, content).

- search_query: output now shows each result's content (500 char preview)
- bootstrap_retrieval: output shows per-result content with type/score
- greedy_fill: output shows selected results with token counts
- format_injection: output shows the actual <retrieved_context> content
- cascading_search + dual_collection_search: same content preview pattern

Also added TRACE_CONTENT_MAX constants (2000/10000 chars) for truncation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…n_id

All hook scripts now propagate LANGFUSE_TRACE_ID and CLAUDE_SESSION_ID
env vars before calling library functions (search.py, injection.py).
This ensures all emit_trace_event() calls within a single hook execution
share the same Langfuse trace, enabling proper span nesting and session
correlation.

Changes:
- trace_buffer.py: Add CLAUDE_SESSION_ID env var fallback for session_id
- session_start.py: Propagate env vars in startup + resume/compact paths,
  remove random trace_id on session_bootstrap events
- context_injection_tier2.py: Propagate env vars before search loop
- best_practices_retrieval.py: Generate trace_id + extract session_id
  from hook_input, propagate via env vars

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
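
A minimal sketch of the propagation step, assuming the hook receives a dict with a `session_id` key (an assumption; the commit does not show the hook input shape). The two environment variable names are from the commit; both helper functions are hypothetical.

```python
import os
import uuid


def propagate_trace_env(hook_input, environ=os.environ):
    """Hook side: set IDs before calling library functions."""
    trace_id = uuid.uuid4().hex  # 32 hex characters
    session_id = hook_input.get("session_id", "")
    environ["LANGFUSE_TRACE_ID"] = trace_id
    if session_id:
        environ["CLAUDE_SESSION_ID"] = session_id
    return trace_id


def resolve_session_id(session_id=None, environ=os.environ):
    """Library side: fall back to the env var when no session_id is passed."""
    return session_id or environ.get("CLAUDE_SESSION_ID")
```

Passing `environ` explicitly keeps the sketch testable without touching the real process environment.
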
Root cause: _make_parent_context() with INVALID_SPAN_ID causes OTel to
create separate traces per event instead of nesting under one trace.

Fix: Each hook generates a root_span_id (LANGFUSE_ROOT_SPAN_ID env var).
Library functions (search.py, injection.py) get this as parent_span_id
via env fallback in trace_buffer.py, creating proper parent→child nesting.

Root span events pass parent_span_id=None explicitly to skip env fallback.
Sentinel _UNSET differentiates "not provided" (use env) from "explicitly
None" (root span, no parent).

Changes:
- trace_buffer.py: _UNSET sentinel for parent_span_id env fallback
- session_start.py: Root span IDs for startup + resume/compact paths
- context_injection_tier2.py: Root span ID + wall-clock start_time
- best_practices_retrieval.py: Root span ID for PreToolUse hook

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
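
The `_UNSET` sentinel distinction can be sketched as follows. `_UNSET` and `LANGFUSE_ROOT_SPAN_ID` are named in the commit; the `resolve_parent_span_id` helper and its signature are assumptions, not the actual code in trace_buffer.py.

```python
import os

_UNSET = object()  # sentinel meaning "caller did not pass parent_span_id"


def resolve_parent_span_id(parent_span_id=_UNSET, environ=os.environ):
    """Return the parent span ID to use for a trace event.

    - not provided  -> fall back to the LANGFUSE_ROOT_SPAN_ID env var
    - explicit None -> root span, no parent (env fallback is skipped)
    - any other value is used as given
    """
    if parent_span_id is _UNSET:
        return environ.get("LANGFUSE_ROOT_SPAN_ID")
    return parent_span_id
```

A plain `None` default could not express both cases, because "no argument" and "explicitly no parent" would be indistinguishable.
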
…ntext

Root cause: SpanContext with INVALID_SPAN_ID fails is_valid(), causing OTel
to ignore the trace_id and create a new trace. Root spans and child spans
ended up in different Langfuse traces.

Fix: Generate random valid span_id for root spans (no parent_span_id).
With is_remote=True, OTel won't look for this span locally, but will
inherit the trace_id ensuring all spans share the same Langfuse trace.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
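
As a rough illustration of the "random valid span_id" part of this fix: in OpenTelemetry an all-zero span ID is the invalid value, so a generator only needs to avoid zero. The helper name is hypothetical, and the sketch deliberately uses only the standard library rather than the OTel SDK calls the commit refers to.

```python
import secrets


def new_span_id():
    """Return a random 64-bit span ID as 16 lowercase hex characters.

    The all-zero ID is invalid, so it is rejected and another value is drawn.
    """
    while True:
        span_id = secrets.randbits(64)
        if span_id != 0:
            return format(span_id, "016x")
```
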
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace flat score-sorted pool with layered priority retrieval when
parzival_enabled=True. Layers are processed in order by greedy fill:
Layer 1 (handoff via get_recent), Layer 2 (decisions via get_recent),
Layer 3 (insights via search), Layer 4 (GitHub via search).
Conventions query removed from Parzival path (noise for PM oversight).
Non-Parzival path unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
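
The difference between a flat score-sorted pool and layered priority retrieval can be sketched like this. The function name, the item shape (`tokens` key), and the skip-if-too-large rule are assumptions; the PR does not show the real greedy fill.

```python
def layered_greedy_fill(layers, token_budget):
    """Fill the budget layer by layer, so layer order outranks score."""
    selected = []
    remaining = token_budget
    for layer in layers:  # e.g. handoff, decisions, insights, GitHub
        for item in layer:
            # Skip items that no longer fit; later, smaller items may still fit.
            if item["tokens"] <= remaining:
                selected.append(item)
                remaining -= item["tokens"]
    return selected
```

With a budget of 100 tokens, a 60-token handoff in Layer 1 is taken first, a 50-token decision in Layer 2 is skipped, and a 30-token insight in Layer 3 still fits.
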
When parzival_enabled, the resume/compact path now uses get_recent()
for deterministic timestamp-sorted retrieval of session summaries
and decisions only (skipping conventions and code-patterns noise).
Non-Parzival path remains exactly as before with all three vector
search queries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
F-1: Score gap filter excludes deterministic score=1.0 from best_score
F-2: Session summary get_recent() wrapped in try/except
F-3: searcher.close() guaranteed via try/finally
F-4: Success-path Prometheus metrics added to get_recent()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BUG-197: The top-level import of async_sdk_wrapper eagerly pulls in
anthropic as a transitive dependency. Environments without anthropic
(e.g. embedding container) crash on `import memory`. Wrap the import
in try/except ImportError so the rest of the module loads cleanly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
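
The try/except ImportError guard generalizes to a small helper. This is a sketch of the technique, not the code in the memory package; `optional_import` is a hypothetical name.

```python
import importlib


def optional_import(module_name):
    """Import a module if it is available, otherwise return None."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None  # dependency missing: the caller degrades gracefully
```

Callers then check for `None` before using the optional feature, so containers without the dependency can still import the rest of the package.
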
TD-219: E2E tests fail with FileNotFoundError when saving screenshots
because tests/e2e/screenshots/ doesn't exist. Add a session-scoped
autouse fixture that creates the directory with os.makedirs(exist_ok).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TD-225: When searching the discussions collection with a memory_type
filter that includes github_code_blob, route to the code embedding
model instead of the default prose model. This matches the storage-side
routing in MemoryStorage._get_embedding_model (SPEC-010 Section 4.2),
ensuring query embeddings use the same model as stored embeddings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
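
A minimal sketch of the query-side routing, assuming the models are identified by the placeholder labels "code" and "prose" (the real model identifiers are not given in this PR). The function name and signature are hypothetical.

```python
CODE_TYPES = {"github_code_blob"}


def select_embedding_model(collection, memory_types=None):
    """Pick the query embedding model so it matches the storage-side model."""
    if collection == "discussions" and memory_types:
        # Any overlap with code types routes the query to the code model.
        if CODE_TYPES & set(memory_types):
            return "code"
    return "prose"
```
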
TD-228: The compact/resume path only had a high-level session_bootstrap
trace event with no per-retrieval detail. Add memory_retrieval_* trace
events for each get_recent() and retrieve_session_summaries() call:

- Parzival path: trace events for session summaries (get_recent) and
  decisions (get_recent) with timing and result counts
- Non-Parzival path: trace event for session summaries (scroll) —
  the search() calls already self-trace via SPEC-021

Each event includes trigger, collection, method, result_count, and
retrieval_ms for full observability in Langfuse.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add strict=False to zip() calls in select_results_greedy trace block
- Replace try/except Exception: pass with contextlib.suppress in format_injection_output
- Move E402 imports (activity_log, metrics_push) to top of search.py
- Remove f-prefix from f-string without placeholders in dual_collection_search trace

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eval

Update 4 failing tests to match PM #136 changes where Parzival bootstrap
switched from flat-pool vector search to layered priority retrieval:
- test_parzival_bootstrap_uses_agent_id: handoff now uses get_recent, not search
- test_parzival_bootstrap_full_qdrant_down: mock get_recent + add github_sync_enabled
- test_parzival_bootstrap_graceful_degradation: same mock fixes
- test_parzival_bootstrap_includes_github_enrichment: route handoff through get_recent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…r regex, add quality gate

P10-5: L4 GitHub enrichment now queries COLLECTION_GITHUB instead of COLLECTION_DISCUSSIONS
P10-9: Tier 2 discussions search filters to high-value types only (decision, guideline, session, agent_insight, agent_handoff, agent_memory)
P10-7: Rewrite detect_error_indicators() to use structured error patterns instead of bare keyword matching, preventing false positives from filenames containing 'error'
P10-10: Add quality gate to user_prompt_store_async.py and agent_response_store_async.py to skip low-value short messages (< 4 words or known low-value phrases)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
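
P10-7 and P10-10 can be illustrated together. The function name `detect_error_indicators` and the 4-word threshold come from the commit; the specific regexes, the low-value phrase list, and `passes_quality_gate` are assumptions made for this sketch.

```python
import re

# Illustrative patterns only; the project's actual pattern list is not shown here.
ERROR_PATTERNS = [
    re.compile(r"Traceback \(most recent call last\):"),
    re.compile(r"^\w+(Error|Exception): ", re.MULTILINE),
    re.compile(r"\bexit (code|status) [1-9]\d*\b", re.IGNORECASE),
]

# Hypothetical phrase list.
LOW_VALUE_PHRASES = {"ok", "thanks", "yes", "continue"}


def detect_error_indicators(output):
    """Match structured error shapes instead of the bare word 'error'."""
    return any(pattern.search(output) for pattern in ERROR_PATTERNS)


def passes_quality_gate(message):
    """Reject known low-value phrases and messages shorter than 4 words."""
    text = message.strip().lower()
    if text in LOW_VALUE_PHRASES:
        return False
    return len(text.split()) >= 4
```

A filename such as `src/error_handler.py` no longer triggers detection, while a line such as `ValueError: bad input` still does.
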
Add dedicated `github` Qdrant collection to separate GitHub data from
discussions. Updates config constants, setup-collections indexes,
and all github connector modules (schema, code_sync, sync, __init__).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
F-1: Fix test import DISCUSSIONS_COLLECTION → GITHUB_COLLECTION
F-2: Add missing error patterns to migration ERROR_INDICATORS regex
F-3: Replace hardcoded "discussions" string with COLLECTION_DISCUSSIONS constant
F-4: Replace hardcoded GITHUB_COLLECTION with import from memory.config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add v2.0.9 CHANGELOG section covering PLAN-010 injection quality sprint,
Langfuse observability, and Parzival layered priority bootstrap. Update
architecture doc from three-collection to five-collection (github +
jira-data). Fix GitHub integration docs to reference github collection.
Update error detection docs for structured pattern matching rewrite.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Data flow diagram: Stop hook → PreCompact hook for session summaries
- Mistake 2: align with Mistake 9 (PreCompact is correct, not Stop)
- Summary table: Stop hook → PreCompact hook
- Comparison table: mark code-patterns/conventions as conditionally
  searched in Tier 2 and SessionStart (non-Parzival path)
- GitHub indexes: memory_type → type (matches Qdrant field name)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
test_get_collection_stats_includes_jira_data_when_exists expected 4
collections but PLAN-010 added the github collection making it 5.
Also verify github collection is present in stats.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… TD-237

BUG-204: Remove 15 hardcoded [:300] truncation limits from 5 hook scripts.
Standardize TRACE_CONTENT_MAX=10000 across all files (search.py,
langfuse_stop_hook.py, classification_worker.py).
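
As a sketch of the standardization, a single shared constant replaces the scattered `[:300]` slices. `TRACE_CONTENT_MAX` and its value are from the commit; the `trace_preview` helper is hypothetical.

```python
TRACE_CONTENT_MAX = 10000  # characters kept in trace payloads


def trace_preview(content, limit=TRACE_CONTENT_MAX):
    """Truncate trace content to one shared limit instead of a hardcoded slice."""
    return content[:limit]
```
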

BUG-200 (completion): Correct type="error_fix" to "error_pattern" in
error_store_async.py (5 locations). Update retrieval in error_detection.py,
triggers.py, and classifier config.py to handle both type names.

BUG-201 (completion): Add post-filter in context_injection_tier2.py to
exclude error_fix/error_pattern from Tier 2 code-patterns injection.

TD-237: Add error_pattern type to classifier LLM prompt template so the
classifier can distinguish automated error captures from user-reported fixes.

TD-235: Fix install.sh log message "3 collections" to "5 collections".

BUG-197: Confirmed fixed (lazy imports in memory/__init__.py).

3-round adversarial review (Sonnet+Opus): Round 1 fixed error_detection.py,
triggers.py, classifier config.py. Round 2 fixed test_triggers.py assertion.
Round 3: ZERO ISSUES — APPROVED.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WB Solutions and others added 3 commits March 2, 2026 10:55
CodeQL CWE-117: migrate_v209_github_collection.py logged last 3 chars
of QDRANT_API_KEY in mismatch warning. Replace with key length to
preserve debugging utility without exposing secret material.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
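
The fix amounts to logging a property of the key rather than any of its characters. A minimal sketch, assuming hypothetical helper names and message wording (the actual warning text in the migration script is not shown):

```python
import logging

logger = logging.getLogger("migration")


def describe_key(api_key):
    """Describe a secret for logs without revealing any of its characters."""
    if not api_key:
        return "unset"
    return f"set (length={len(api_key)})"


def warn_key_mismatch(env_key, config_key):
    """Log a mismatch using key lengths only."""
    logger.warning(
        "QDRANT_API_KEY mismatch: env %s, config %s",
        describe_key(env_key),
        describe_key(config_key),
    )
```
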
update_shared_scripts() used a non-recursive glob (*.py) that only
copied top-level Python files, missing scripts/memory/ (33 files)
and all .sh files (6 files). Replaced with cp -r matching
copy_files() pattern. Added chmod +x for executable permissions
parity with copy_files().

Fixes: BUG-205

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Migration script now renames any error_fix entries that survive the
false-positive purge to the correct error_pattern type (BUG-200).
Updated CHANGELOG with BUG-204/205, TD-237, CodeQL fix, and
classifier-worker rebuild requirement in upgrade instructions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
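
The rename step can be sketched on plain payload dicts. This is an assumption-laden illustration: the real migration works against Qdrant, and the `rename_error_type` helper and the `type` payload key usage here are hypothetical.

```python
def rename_error_type(payloads):
    """Rename surviving error_fix entries to error_pattern, in place.

    Returns the number of payloads that were changed.
    """
    renamed = 0
    for payload in payloads:
        if payload.get("type") == "error_fix":
            payload["type"] = "error_pattern"
            renamed += 1
    return renamed
```

Running it a second time changes nothing, so the step is safe to repeat if a migration is re-run.
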
@Hidden-History Hidden-History merged commit b99b510 into main Mar 2, 2026
12 checks passed
@Hidden-History Hidden-History deleted the feature/v2.0.9-observability branch March 2, 2026 21:24