feat(instrument): Error-aware event emission on framework exceptions across 10 lighter adapters (cross-poll #2)#115
Closed
mmercuri wants to merge 4 commits into
* SDK samples: 70+ production-ready samples, docs, and tests (rebased on main)

  Rebased onto latest main (e8a8033) which includes:
  - CLI with auth (PR #72)
  - layerlens.instrument tracing + adapters (PR #66, #69)
  - Scorers resource, integrations resource
  - API naming convention fixes (PR #61)

  No impact on samples: Stratix() constructor is backward-compatible, use_bearer_auth defaults to False, all existing API signatures unchanged.

  Samples include: core (18), industry (10), cowork (5), modalities (3), integrations (2), cicd (2+workflow), openclaw (10+skill), mcp (1), copilotkit (2+UI), claude-code skills (6), sample data (23 files).

  469 non-live tests passing. 54 live tests available.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove marc-only/ from tracking, add to .gitignore

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Move examples/cli/ to samples/cli/

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add instrumentation and integration management samples from examples/

  Copy 3 new files from examples/ that had no equivalent in samples/:
  - samples/integrations/openai_instrumented.py (instrument_openai + @trace + span)
  - samples/integrations/langchain_instrumented.py (LangChainCallbackHandler)
  - samples/core/integration_management.py (client.integrations CRUD)

  Update docs/instrumentation/providers.md and frameworks.md with Related Samples links. Update samples/integrations/README.md and samples/core/README.md. Update samples/README.md integrations count (2 → 4).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Consolidate examples/ into samples/: remove duplicates, integrate unique patterns

  - Remove 14 examples/ files already covered by samples/core equivalents
  - Create samples/core/benchmark_evaluation.py for model+benchmark workflow (evaluations.create → wait_for_completion → results.get/get_all)
  - Integrate 12 unique patterns from remaining examples/ into samples/:
    - trace_evaluation.py: add get_results().steps iteration, get_many() without filter
    - compare_evaluations.py: add compare_models(), outcome_filter, result field access
    - judge_optimization.py: add BadRequestError catch, optimization result fields
    - model_benchmark_management.py: add models.add/remove, benchmarks.add/remove, filters
    - evaluation_filtering.py: document both camelCase and snake_case sort_by conventions
    - paginated_results.py: add results.get_by_id() alternative
    - public_catalog.py: add evaluation summary fields, get_prompts search/sort params
    - async_workflow.py: add evaluation instance methods (wait_for_completion_async, etc)
  - Add Related Samples to docs/examples/creating-evaluations.md
  - Add Related Samples to docs/instrumentation/providers.md and frameworks.md
  - Update all READMEs for new files

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove hardcoded retrieval score from rag_assessment.py (CLAUDE.md Rule 3)

  The "0.92" similarity score was fabricated and displayed as if computed by a real retrieval engine. Removed the fake score -- retrieval is by document ID, and actual quality scoring comes from the judge evaluation.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add per-sample SDK call assertions for all 58 samples (10/10 compliance)

  Every sample now has specific assertions verifying which SDK methods it calls (not just "didn't crash"). Covers:
  - 20 core samples (benchmark_evaluation, integration_management added)
  - 5 cowork samples (code_review, pair_programming, rag_assessment, etc)
  - 3 modality samples (text, brand, document evaluation)
  - 4 integration samples (openai/anthropic traced + instrumented)
  - 2 cicd samples

  Also adds mock setup for client.integrations and client.trace_evaluations.get_many. 495 non-live tests passing, 58 live tests deselected.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove examples/ entirely, remap all 53 doc references to samples/

  All example files have been either:
  - Removed (14 duplicates already covered by samples/core equivalents)
  - Removed after integrating unique patterns into samples/ (12 files)
  - Replaced by samples/core/benchmark_evaluation.py (3 client workflow files)

  Updated all 53 doc references in docs/examples/ to point to samples/core/. Updated docs/examples/README.md with new file table. examples/ directory no longer exists.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add comprehensive MCP server tests (29 tests)

  Tests cover all 6 tool handlers, dispatch logic, error handling, asyncio.to_thread wrapping, and helper functions:
  - TestToolCatalogue: server creation and handler existence
  - TestHandleListTraces: summary output, default limit, empty/null responses
  - TestHandleGetTrace: detail output, not-found handling
  - TestHandleRunEvaluation: creation output, failure handling
  - TestHandleGetEvaluation: status+results, not-found, pending state
  - TestHandleCreateJudge: creation output, failure handling
  - TestHandleListJudges: list output, empty/null responses
  - TestDispatchAndErrors: unknown tool, SDK exceptions, helper functions
  - TestAsyncWrapping: all 5 handlers verified to use asyncio.to_thread

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix samples

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: m-peko <marinpeko5@gmail.com>
Bootstraps the LayerLens instrument layer with the abstract base classes,
adapter registry, capture configuration, event sinks, vendored event
schemas, and pydantic v1/v2 compatibility shim that every concrete
adapter (frameworks, protocols, providers) will depend on.
Scope
-----
- src/layerlens/instrument/__init__.py: lean re-export surface
- src/layerlens/instrument/_vendored/: frozen ateam event schemas (no
runtime ateam dependency)
- src/layerlens/instrument/adapters/_base/: BaseAdapter, AdapterRegistry,
AdapterStatus, AdapterHealth, AdapterCapability, ReplayableTrace,
CaptureConfig, EventSink, TraceStoreSink, IngestionPipelineSink,
PydanticCompat
- src/layerlens/_compat/pydantic.py: model_dump/model_validate shim
spanning pydantic v1 + v2
- scripts/{port_adapter,port_protocol,emit_adapter_manifest,
regen_dep_baselines}.py: codegen helpers used to port the rest of M1
- tests/instrument/{test_base_layer,test_lazy_imports,
test_default_install,test_resolved_dep_tree}.py + _baselines/
- .github/workflows/dep-tree-guard.yaml: CI gate that locks the default
install footprint
- docs/adapters/: CONTRIBUTING, STATUS, pydantic-compatibility, testing,
PERSONA_REVIEW
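
The model_dump/model_validate shim listed in the scope above can be pictured with a small duck-typed sketch. This is not the contents of src/layerlens/_compat/pydantic.py; it only assumes the well-known pydantic method names (v2: `model_dump` / `model_validate`, v1: `dict` / `parse_obj`). The `FakeV1Model` and `FakeV2Model` classes are hypothetical stand-ins so the example runs without pydantic installed.

```python
from typing import Any, Dict, Type, TypeVar

T = TypeVar("T")


def model_dump(obj: Any) -> Dict[str, Any]:
    """Serialize a model: prefer the v2 method, fall back to the v1 one."""
    if hasattr(obj, "model_dump"):
        return obj.model_dump()
    return obj.dict()


def model_validate(cls: Type[T], data: Dict[str, Any]) -> T:
    """Build a model from a dict: prefer the v2 classmethod, fall back to v1."""
    if hasattr(cls, "model_validate"):
        return cls.model_validate(data)
    return cls.parse_obj(data)


class FakeV1Model:
    """Hypothetical stand-in exposing only the pydantic v1 method names."""

    def __init__(self, **fields: Any) -> None:
        self._fields = {**fields}

    def dict(self) -> Dict[str, Any]:
        return {**self._fields}

    @classmethod
    def parse_obj(cls, data: Dict[str, Any]) -> "FakeV1Model":
        return cls(**data)


class FakeV2Model:
    """Hypothetical stand-in exposing only the pydantic v2 method names."""

    def __init__(self, **fields: Any) -> None:
        self._fields = {**fields}

    def model_dump(self) -> Dict[str, Any]:
        return {**self._fields}

    @classmethod
    def model_validate(cls, data: Dict[str, Any]) -> "FakeV2Model":
        return cls(**data)
```

Callers would use `model_dump(x)` and `model_validate(Cls, data)` everywhere and never touch the version-specific methods directly.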
Blast radius
------------
- Pure additions. No public surface changes outside the new
layerlens.instrument namespace.
- Default `pip install layerlens` install set is unchanged (verified by
test_default_install.py against the new baseline).
- Lazy adapter discovery: importing layerlens.instrument MUST NOT pull
in any optional adapter dep (verified by test_lazy_imports.py).
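
A minimal sketch of the lazy-discovery property described above, assuming a registry that stores module paths as strings and imports only on lookup. `LazyAdapterRegistry` and its methods are hypothetical names for illustration; the real AdapterRegistry may be structured differently. The stdlib module `colorsys` stands in for an optional adapter dependency.

```python
import importlib
from typing import Any, Dict, List, Tuple


class LazyAdapterRegistry:
    """Hypothetical registry: registration stores strings, lookup imports."""

    def __init__(self) -> None:
        self._targets: Dict[str, Tuple[str, str]] = {}
        self._loaded: Dict[str, Any] = {}

    def register(self, name: str, module_path: str, attr: str) -> None:
        # No import happens here, so registering an adapter is free.
        self._targets[name] = (module_path, attr)

    def get(self, name: str) -> Any:
        # The module is imported on first lookup and cached afterwards.
        if name not in self._loaded:
            module_path, attr = self._targets[name]
            module = importlib.import_module(module_path)
            self._loaded[name] = getattr(module, attr)
        return self._loaded[name]

    def loaded_names(self) -> List[str]:
        return sorted(self._loaded)


registry = LazyAdapterRegistry()
registry.register("demo", "colorsys", "rgb_to_hls")
```

Until `registry.get("demo")` is called, nothing from the target module has been imported by the registry, which is the behavior test_lazy_imports.py is described as guarding.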
Test plan
---------
- uv run pytest tests/instrument/test_base_layer.py
tests/instrument/test_lazy_imports.py -x -> 45 passed
- The dep-tree-guard workflow exercises test_default_install.py and
test_resolved_dep_tree.py against the new baselines on every PR.
LAY-3400 umbrella: this PR is the prerequisite for the M1.B/M1.C/M1.D
adapter ports, M7 protocol certification, and M8 Cohere/Mistral.
Ports the twelve agent-tier framework adapters from the ateam
reference implementation onto the new layerlens.instrument base layer:
Semantic Kernel, LlamaIndex, OpenAI Agents, Pydantic-AI, Agno,
Strands, SmolAgents, MS Agent Framework, Google ADK,
Bedrock Agents, Embedding (vector store hooks), Benchmark Import
Pairs with feat/instrument-frameworks-orchestration (M1.C part 1)
which lands LangChain, LangGraph, CrewAI, AutoGen, Langfuse, and
Agentforce. Together they complete M1.C.
Scope
-----
- src/layerlens/instrument/adapters/frameworks/{semantic_kernel,
llama_index,openai_agents,pydantic_ai,agno,strands,smolagents,
ms_agent_framework,google_adk,bedrock_agents,embedding,
benchmark_import}/: per-framework packages
- tests/instrument/adapters/frameworks/test_*_adapter.py + the
test_bulk_ported_smoke.py harness (which exercises every ported
adapter against canned trace fixtures so partial framework SDKs
on a given runner don't drop coverage to zero)
- samples/instrument/<framework>/: runnable per-framework samples
- docs/adapters/frameworks-<framework>.md: per-framework integration
guide
- pyproject.toml: twelve new optional extras
(semantic-kernel, llama-index, openai-agents, pydantic-ai, agno,
strands, smolagents, ms-agent-framework, google-adk,
bedrock-agents, embedding, benchmark-import) with python_version
markers; pyright/ruff exclusions for the dynamic monkey-patching
framework code
Blast radius
------------
- Default `pip install layerlens` install set is unchanged. Each
framework's heavy deps are gated behind their own extra.
- No changes to existing public API surface.
- Importing layerlens.instrument still does NOT pull in any framework
module (lazy registry lookup).
Test plan
---------
- uv run pytest tests/instrument/adapters/frameworks/ -x ->
184 passed, 1 skipped (test_bulk_ported_smoke.py covers all 12
agent-tier adapters plus the orchestration-tier ones from part 1
via the same harness)
Stacks on
---------
- feat/instrument-base-foundation (M1.A) — required for the
BaseAdapter surface this PR consumes.
Sibling of
----------
- feat/instrument-frameworks-orchestration (M1.C part 1) — both
branches stack on the base foundation independently and don't
conflict; they can land in either order.
LAY-3400 umbrella (M1.C part 2).
Cross-pollination wave #2 from A:/tmp/adapter-cross-pollination-audit.md §2.3.

When a framework callback raises (rate limit, API down, malformed prompt, tool exception), the corresponding lifecycle event used to appear as a "start" with no matching "end", so dashboards rendered it as a hung request, not a failure. The mature adapters (LangChain on_*_error, AutoGen wrappers.py:94-108, Agentforce trust layer) already surface the exception as a structured event before re-raise; this PR ports that contract to the ten lighter runtime adapters.

Shared helper
-------------
- `src/layerlens/instrument/adapters/_base/errors.py`: `emit_error_event()` with PII-safe message truncation (500 chars), traceback truncation (8 frames / 4000 chars), secret-pattern scrubbing (api_key=, Bearer, sk-*), allow-listed framework-context keys, and multi-tenant org_id propagation from the adapter's stratix client.
- `_base/capture.py`: adds `agent.error`, `tool.error`, `model.error` to ALWAYS_ENABLED_EVENT_TYPES so error events bypass capture-config gating (silent error drops are exactly the failure mode the helper exists to prevent).
- `_base/__init__.py`: re-exports the public surface.

Per-adapter wiring (all 10 targets)
-----------------------------------
- agno: `_create_traced_run{,_sync}` emit policy.violation on raise; `on_tool_use` emits tool.error when the optional `error` kwarg is set.
- openai_agents: span-end handlers detect `span.error` on agent / generation / function spans and route through a shared `_emit_span_error` helper; `on_run_end` / `on_tool_use` likewise.
- llama_index: `_handle_event` checks for `event.exception` on every routed event and emits model.error / tool.error / agent.error based on the LlamaIndex event type prefix.
- google_adk: new `_maybe_emit_callback_error` helper inspects the callback context / llm_response / tool_output for `error`, `exception`, `error_message` attributes (or dict keys); also wires `on_agent_end` / `on_tool_use`.
- strands: `_create_traced_call` and `on_tool_use` mirror the agno pattern.
- pydantic_ai: both async and sync run wrappers, plus `on_tool_use`.
- smolagents: `_create_traced_run` and `on_tool_use`.
- bedrock_agents: `_after_invoke_agent` detects `parsed.failureTrace` and top-level SDK error keys via new `_maybe_emit_invoke_error`; `on_invoke_end` / `on_tool_use` also wired.
- ms_agent_framework: both `traced_invoke` (async generator) and `traced_invoke_stream` emit on raise; `on_tool_use` covers the programmatic surface.
- embedding: all three provider wrappers (OpenAI, Cohere, SentenceTransformer) catch upstream exceptions, emit model.error, and re-raise.

Re-raise semantics are preserved everywhere: the helper only emits; callers are responsible for `raise`. PII safety is enforced by the allow-list of context keys (no raw user input ever propagates), by the secret pattern scrubber on message + traceback, and by hard length caps. Multi-tenant deployments get org_id automatically from the stratix client.

Tests
-----
- `tests/instrument/adapters/_base/test_errors.py`: 28 tests covering helper correctness: default event type, exception type / message / module, org_id propagation + override + omission, message + traceback truncation, context-key filtering, secret redaction (api_key, Bearer, sk-*, idempotency), re-raise pattern, custom event type, exception with broken __str__, exception with no traceback, circuit-breaker respect, trace-buffer participation, build_error_payload immutability, empty-framework defense, SAFE_CONTEXT_KEYS lint, internal helper white-box tests.
- 10 per-adapter test extensions adding 32 new tests across the lighter adapters, verifying both the framework-callback path (raise → policy.violation / agent.error / tool.error / model.error) and the explicit `on_*` lifecycle hook path. New `test_embedding_adapter.py` is the first test file for the embedding wrapper.
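
The emit-then-re-raise contract described above can be sketched as a small decorator. This is an illustration under assumptions, not the adapters' actual wrappers: `traced` is a hypothetical name, the `emit` callable stands in for whatever sink the adapter uses, and the `agent.start` / `agent.end` event names are placeholders (only `agent.error` is named in this PR).

```python
import functools
from typing import Any, Callable, Dict

Event = Dict[str, Any]


def traced(emit: Callable[[Event], None], framework: str) -> Callable:
    """Hypothetical wrapper: emit a structured error event, then re-raise."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            emit({"type": "agent.start", "framework": framework})
            try:
                result = fn(*args, **kwargs)
            except Exception as exc:
                # Surface the failure so the "start" is not left dangling.
                emit({
                    "type": "agent.error",
                    "framework": framework,
                    "exception_type": type(exc).__name__,
                    "message": str(exc)[:500],
                })
                raise  # the original exception still reaches the caller
            emit({"type": "agent.end", "framework": framework})
            return result

        return wrapper

    return decorator
```

The bare `raise` keeps the original exception and traceback intact, so instrumentation changes what the dashboard sees without changing what the caller sees.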
Acceptance gates (all passing)
------------------------------
- `pytest tests/instrument/adapters/_base/test_errors.py -x`: 28/28
- `pytest tests/instrument/adapters/frameworks/ -x`: 149/149 (no regression on the 120 baseline tests + 29 new)
- `mypy --strict src/layerlens/instrument/adapters/_base/errors.py`: clean
- `ruff check src/layerlens/instrument/adapters/ tests/instrument/adapters/`: clean
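
For readers who want a concrete picture of the payload rules in the commit message above (500-char message cap, 8-frame / 4000-char traceback cap, secret scrubbing, allow-listed context keys), here is a rough sketch. It is not the code in `_base/errors.py`: `sketch_error_payload`, `scrub`, the regexes, and the contents of `DEMO_SAFE_KEYS` are all assumptions chosen for illustration.

```python
import re
import traceback
from typing import Any, Dict

MAX_MESSAGE_CHARS = 500
MAX_TB_FRAMES = 8
MAX_TB_CHARS = 4000
# Hypothetical allow-list; the real SAFE_CONTEXT_KEYS is not shown in this PR text.
DEMO_SAFE_KEYS = frozenset({"framework", "phase"})

# Assumed patterns for the three secret shapes named above.
_SECRET_RULES = [
    (re.compile(r"(api_key=)[^\s&]+", re.IGNORECASE), r"\1[REDACTED]"),
    (re.compile(r"(Bearer\s+)[A-Za-z0-9._\-]+"), r"\1[REDACTED]"),
    (re.compile(r"\bsk-[A-Za-z0-9_\-]+"), "sk-[REDACTED]"),
]


def scrub(text: str) -> str:
    """Replace secret-shaped substrings; running it twice changes nothing."""
    for pattern, replacement in _SECRET_RULES:
        text = pattern.sub(replacement, text)
    return text


def sketch_error_payload(exc: BaseException, context: Dict[str, Any]) -> Dict[str, Any]:
    try:
        message = str(exc)
    except Exception:  # defend against a broken __str__
        message = "<unprintable exception>"
    # Keep only the innermost frames; an exception without a traceback yields "".
    frames = traceback.extract_tb(exc.__traceback__)[-MAX_TB_FRAMES:]
    tb_text = "".join(traceback.format_list(frames))
    return {
        "exception_type": type(exc).__name__,
        "exception_module": type(exc).__module__,
        "message": scrub(message)[:MAX_MESSAGE_CHARS],
        "traceback": scrub(tb_text)[:MAX_TB_CHARS],
        # Drop every context key that is not explicitly allow-listed.
        "context": {k: v for k, v in context.items() if k in DEMO_SAFE_KEYS},
    }
```

Scrubbing happens before truncation in this sketch so that a cap never cuts a secret in half and leaves a partial token behind.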
Merged
7 tasks
m-peko pushed a commit that referenced this pull request on May 12, 2026
…laceholder from #116) (#126)

Replaces the M7 placeholder shipped in PR #116 (truncation policy) with the full BrowserUseAdapter: every lifecycle hook wired, every event emitted, and every cross-cutting CLAUDE.md contract enforced from day one.

What changed
------------
Full lifecycle adapter (src/layerlens/instrument/adapters/frameworks/browser_use/lifecycle.py):
* connect / disconnect / health_check / get_adapter_info / serialize_for_replay (all five abstract BaseAdapter methods).
* on_session_start, on_session_end, on_navigation, on_action, on_screenshot, on_dom_extraction, on_llm_call (every spec'd hook).
* Capability declaration: TRACE_TOOLS + TRACE_MODELS + TRACE_STATE + STREAMING + REPLAY (no longer the placeholder's TRACE_TOOLS-only set).
* Canonical events: browser.session.start, browser.navigate, browser.action, browser.screenshot, browser.dom.extract, tool.call, model.invoke, agent.input/output/state.change, cost.record, environment.config, plus agent.error / tool.error / model.error per the PR #115 error-aware emission contract.
* Per-callback resilience wrapper per PR #117: observability errors NEVER crash the customer's agent, surfaced via resilience_snapshot().
* Multi-tenant org_id propagation per PR #118: bound at construction (kwarg or resolved from stratix.org_id), stamped defensively on every emit, caller-supplied values overwritten to prevent cross-tenant leaks.
* Truncation policy from day one (DEFAULT_POLICY): screenshot bytes DROPPED to deterministic SHA-256 references, DOM/HTML capped at 16 KiB, prompts/completions/tool I/O at 4/2 KiB.
* Browser-event layer mapping (_BROWSER_EVENT_LAYERS) so unknown browser.* event types respect CaptureConfig gating without falling through the unknown-event-drops-by-default path.
* requires_pydantic = PydanticCompat.V2_ONLY (browser_use is a v2 lib).

Public surface (src/layerlens/instrument/adapters/frameworks/browser_use/__init__.py):
* ADAPTER_CLASS = BrowserUseAdapter (registry).
* instrument_agent(agent, stratix=, capture_config=, org_id=) one-liner returning the connected, wrapping adapter.
* STRATIXBrowserUseAdapter top-level binding (legacy alias): fires DeprecationWarning on construction. Exposed as a static binding so the manifest consistency lint's AST walk finds it.

Pyproject:
* Adds 'browser-use' optional extra: browser-use>=0.1.0,<2 with the python_version >= '3.11' marker (browser_use's own constraint).

Tests (tests/instrument/adapters/frameworks/test_browser_use_adapter.py):
* Replaces the 7-test scaffold from #116 with 40 tests covering: wiring + alias + lifecycle round-trip + truncation (screenshot drop, hash determinism, HTML cap, short-payload no-audit) + multi-tenancy (kwarg, client attribute, defensive overwrite) + resilience (poison stratix, exploding agent attribute access) + error-aware emission (agent.error / tool.error / model.error) + per-hook coverage + sync + async wrapping + replay round-trip + 10-case provider detection table.

Sample (samples/instrument/browser_use/{main.py,__init__.py,README.md}):
* Runs OFFLINE: no browser-use install, no Playwright, no API key, no network. Three-step duck-typed agent + happy/--fail paths exercise the full event surface and demonstrate screenshot drop + org_id stamping + agent.error emission before re-raise.

Doc (docs/adapters/frameworks-browser_use.md):
* Install + quickstart + capabilities matrix + 14-event reference table + truncation policy table + multi-tenancy + resilience + error-aware emission + capture config + browser_use specifics + BYOK + replay sections.

Manifest (scripts/emit_adapter_manifest.py):
* Promotes browser_use from _LIFECYCLE_PREVIEW to _MATURE: every required artifact (test file with >= 12 funcs, sample, doc, STRATIX→LayerLens deprecation alias) ships in this PR.
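
The legacy-alias behavior mentioned above (a top-level class that fires DeprecationWarning on construction) follows a common Python pattern, sketched here with hypothetical class names. The real STRATIXBrowserUseAdapter and BrowserUseAdapter are not reproduced; `BrowserUseAdapterSketch` is a stand-in with a single assumed `org_id` kwarg.

```python
import warnings
from typing import Any, Optional


class BrowserUseAdapterSketch:
    """Hypothetical stand-in for the current adapter class."""

    def __init__(self, org_id: Optional[str] = None) -> None:
        self.org_id = org_id


class LegacyBrowserUseAdapterSketch(BrowserUseAdapterSketch):
    """Hypothetical legacy alias: same behavior, plus a warning on construction."""

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        warnings.warn(
            "LegacyBrowserUseAdapterSketch is deprecated; "
            "use BrowserUseAdapterSketch instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this file
        )
        super().__init__(*args, **kwargs)
```

Defining the alias as a real class statement at module top level, rather than creating it dynamically, is one way to keep it visible to an AST-based lint such as the manifest consistency check described above.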
Verification
------------
* uv run pytest tests/instrument/adapters/frameworks/test_browser_use_adapter.py → 40 passed
* mypy --strict src/layerlens/instrument/adapters/frameworks/browser_use → Success: no issues found in 2 source files
* ruff check on src + test + script → All checks passed!
* Sample runs cleanly offline (happy + --fail)
* pip install -e .[browser-use] resolves cleanly (browser-use only pulled on Python 3.11+ per the env marker)
* tests/instrument/adapters/test_manifest_consistency.py::test_mature_adapters_have_required_artifacts[browser_use] passes
* Full instrument suite (excl. pre-existing crewai/protocols references not on this branch): 312 passed, 1 skipped, 12 xfailed
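
The truncation policy in the commit above (screenshot bytes replaced by a deterministic SHA-256 reference, DOM/HTML capped at 16 KiB) could look roughly like the following. This is a sketch under assumptions: the function names, the `sha256:` reference format, and the returned field names are hypothetical and not taken from DEFAULT_POLICY.

```python
import hashlib
from typing import Dict, Tuple, Union

HTML_CAP_BYTES = 16 * 1024  # 16 KiB, per the policy described above


def screenshot_reference(image_bytes: bytes) -> Dict[str, Union[str, int]]:
    """Drop the raw bytes and keep only a content-derived reference."""
    digest = hashlib.sha256(image_bytes).hexdigest()
    return {"screenshot_ref": f"sha256:{digest}", "size_bytes": len(image_bytes)}


def cap_html(html: str, cap: int = HTML_CAP_BYTES) -> Tuple[str, bool]:
    """Return (possibly truncated html, whether truncation happened)."""
    raw = html.encode("utf-8")
    if len(raw) <= cap:
        return html, False
    # errors="ignore" discards a multi-byte character split by the cut.
    return raw[:cap].decode("utf-8", errors="ignore"), True
```

Because the reference depends only on the image content, identical screenshots map to identical references, which is what makes the "hash determinism" test case in the list above possible.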
Summary
Cross-pollination wave #2 from A:/tmp/adapter-cross-pollination-audit.md §2.3. Mature adapters (LangChain on_*_error, AutoGen wrappers.py:94-108, Agentforce trust layer) emit a structured event when framework callbacks raise, so the trace dashboard shows a real failure with exception type, message, and bounded traceback instead of a hung "start" event with no matching "end". This PR ports that contract to all 10 lighter runtime adapters: agno, openai_agents, llama_index, google_adk, strands, pydantic_ai, smolagents, bedrock_agents, ms_agent_framework, embedding.

What's in the PR
Shared helper
- src/layerlens/instrument/adapters/_base/errors.py: emit_error_event(adapter, exc, context, severity, event_type, org_id) with:
  - Message truncation (500 chars) and traceback truncation (8 frames / 4000 chars)
  - Secret-pattern scrubbing (api_key=, Bearer …, sk-…)
  - SAFE_CONTEXT_KEYS: only framework attribution keys propagate; PII-shaped keys are dropped
  - org_id automatically resolved from adapter._stratix.org_id (or tenant_id); explicit override supported
  - BaseAdapter.emit_dict_event: errors in error-handling never mask the original framework error
  - The helper only emits; the caller is responsible for raise
- _base/capture.py: adds agent.error / tool.error / model.error to ALWAYS_ENABLED_EVENT_TYPES so error events are never silently dropped by capture-config gating
- _base/__init__.py: re-exports the public surface

Per-adapter wiring (all 10 targets)
- agno: traced_run (async) + traced_run_sync; on_tool_use(error=…)
- openai_agents: span.error on agent / generation / function spans (shared _emit_span_error helper); on_run_end / on_tool_use
- llama_index: _handle_event checks event.exception on every routed event and emits model.error / tool.error / agent.error based on event-type prefix; on_agent_end / on_tool_use
- google_adk: _maybe_emit_callback_error helper inspects error / exception / error_message on callback context, llm_response, and tool_output; on_agent_end / on_tool_use
- strands: _create_traced_call and on_tool_use
- pydantic_ai: async and sync run wrappers; on_tool_use
- smolagents: _create_traced_run and on_tool_use
- bedrock_agents: _maybe_emit_invoke_error reads parsed.trace.failureTrace.failureReason and top-level errorMessage / errorCode; on_invoke_end / on_tool_use
- ms_agent_framework: traced_invoke (async generator) and traced_invoke_stream; on_tool_use
- embedding: all three provider wrappers (OpenAI, Cohere, SentenceTransformer) catch upstream exceptions, emit model.error, and re-raise

Tests
- tests/instrument/adapters/_base/test_errors.py: 28 helper tests (default event type, exception metadata, org_id propagation/override/omission, message + traceback truncation, context-key filtering, secret redaction in message + tb, idempotency, re-raise pattern, custom event type, broken __str__ defense, no-traceback defense, circuit-breaker respect, trace-buffer participation, build_error_payload immutability, empty-framework defense, SAFE_CONTEXT_KEYS lint, internal-helper white-box)
- Per-adapter test extensions plus new test_embedding_adapter.py: 29 new tests covering both framework-callback and explicit on_* lifecycle paths, asserting event type / framework / phase / exception_type fields

CLAUDE.md commercial-grade compliance
- … (org_id on every event)

Test plan
- pytest tests/instrument/adapters/_base/test_errors.py -x: 28/28 pass
- pytest tests/instrument/adapters/frameworks/ -x: 149/149 pass (no regression on 120 baseline + 29 new)
- mypy --strict src/layerlens/instrument/adapters/_base/errors.py: clean
- ruff check src/layerlens/instrument/adapters/ tests/instrument/adapters/: clean
- … policy.violation consumer expectations on the atlas-app side
- … SAFE_CONTEXT_KEYS allow-list does not need additions for any tenant-specific adapter the team has internally

Cross-references
- A:/tmp/adapter-cross-pollination-audit.md §2.3 (origin: AutoGen wrappers.py:94-108, LangChain callbacks.py:402-436)

Notes for reviewer
- … development (not main) since main does not yet contain src/layerlens/instrument/. Same base as the other adapter PRs.
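
As a reading aid for the org_id rule in the shared-helper description (resolved from adapter._stratix.org_id, or tenant_id, with an explicit override), a minimal sketch follows. `resolve_org_id` is a hypothetical name and the lookup order is an assumption based on that one sentence; the real helper may differ.

```python
from types import SimpleNamespace
from typing import Any, Optional


def resolve_org_id(adapter: Any, override: Optional[str] = None) -> Optional[str]:
    """Hypothetical resolution order: override, then org_id, then tenant_id."""
    if override is not None:
        return override
    client = getattr(adapter, "_stratix", None)
    for attr in ("org_id", "tenant_id"):
        value = getattr(client, attr, None)
        if value:
            return str(value)
    # No client or no tenant attribute: the event simply carries no org_id.
    return None


# Duck-typed stand-ins for an adapter and its client.
adapter_with_org = SimpleNamespace(_stratix=SimpleNamespace(org_id="org-42"))
adapter_with_tenant = SimpleNamespace(_stratix=SimpleNamespace(tenant_id="tenant-7"))
adapter_without_client = SimpleNamespace()
```

Returning None instead of raising matches the "omission" case listed among the helper tests, where an event is still emitted when no org_id can be found.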