fix(instrument): Propagate org_id through all event emissions (multi-tenancy CLAUDE.md fix) by mmercuri · Pull Request #118 · LayerLens/stratix-python

mmercuri · 2026-04-27T00:16:09Z

Summary

Closes the cross-cutting CLAUDE.md multi-tenancy gap surfaced by A:/tmp/adapter-depth-audit.md (cross-cutting finding #3): all 203 adapter emissions in the stratix-python SDK shipped without org_id propagation, violating the platform-wide "EVERY data operation must be scoped by tenant" mandate.

Note: this PR diff is large because the underlying instrument/ port lives only on per-adapter feat/instrument-* branches and is not yet on main. The substantive change set is limited to the 26 source files + 25 test files listed below. Reviewers should focus on the new contract + the per-adapter __init__ deltas.

What changed

Foundation (`_base/`)

BaseAdapter.__init__ now requires a resolvable org_id (explicit kwarg or stratix.org_id / stratix.organization_id). Construction without a non-empty value raises ValueError — fail-fast, no silent fallback.
BaseAdapter.emit_event and emit_dict_event stamp the bound org_id into every payload via _stamp_org_id before forwarding to the client. Caller-supplied values are overwritten with the adapter's own tenant binding (defensive overwrite — prevents cross-tenant leaks via misuse).
The replay trace record and EventSink dispatch path both carry org_id at the envelope level.
EventSink.send signature now requires org_id as a keyword-only arg. IngestionPipelineSink uses per-event org_id as the tenant_id for downstream ingest.
New public constant: ORG_ID_FIELD = "org_id" exported from _base.
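
A minimal sketch of the contract described above, assuming a resolution order of explicit kwarg, then `stratix.org_id`, then `stratix.organization_id`. `SketchAdapter` and its `sent` list are hypothetical stand-ins for illustration. Only `ORG_ID_FIELD`, `_stamp_org_id`, and `emit_dict_event` are names taken from this PR, and the real `BaseAdapter` may differ in signature and behaviour.

```python
from typing import Any, Dict, List, Optional

ORG_ID_FIELD = "org_id"


class SketchAdapter:
    """Hypothetical stand-in for BaseAdapter; not the SDK class."""

    def __init__(self, stratix: Any = None, *, org_id: Optional[str] = None) -> None:
        # Assumed resolution order: explicit kwarg, then attributes on the client.
        resolved = (
            org_id
            or getattr(stratix, "org_id", None)
            or getattr(stratix, "organization_id", None)
        )
        # Fail fast: no silent fallback when the tenant cannot be resolved.
        if not isinstance(resolved, str) or not resolved.strip():
            raise ValueError("org_id is required and must be a non-empty string")
        self._org_id = resolved
        self.sent: List[Dict[str, Any]] = []  # stands in for the client

    def _stamp_org_id(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        # Defensive overwrite: any caller-supplied value is replaced.
        stamped = dict(payload)
        stamped[ORG_ID_FIELD] = self._org_id
        return stamped

    def emit_dict_event(self, payload: Dict[str, Any]) -> None:
        self.sent.append(self._stamp_org_id(payload))
```

Copying the payload before stamping keeps the caller's dict untouched, which is a choice made for this sketch rather than something the PR states.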

Per-adapter wiring (17 framework + 1 protocol base + 9 LLM providers)

Every adapter __init__ now accepts a keyword-only org_id: str | None = None and forwards it to super().__init__. The 12 instrument_* helper functions in each framework adapter package were updated to accept and forward the kwarg.

| Group | Adapters touched |
| --- | --- |
| Frameworks | agno, autogen, bedrock_agents, crewai, embedding (×2), google_adk, langfuse, langgraph, langchain, llama_index, ms_agent_framework, openai_agents, pydantic_ai, semantic_kernel, smolagents, strands, agentforce |
| Protocols | `BaseProtocolAdapter` (concrete protocol adapters use `**kwargs`) |
| Providers | anthropic, azure_openai, bedrock, cohere, google_vertex, litellm, mistral, ollama, openai (+ `LLMProviderAdapter` base) |
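
The per-adapter wiring can be pictured roughly as follows. This is a sketch under assumptions: `_SketchBase`, `ExampleFrameworkAdapter`, `instrument_example`, and the `capture_config` parameter are hypothetical names, not taken from the SDK. The only point illustrated is the keyword-only `org_id` being forwarded unchanged to `super().__init__`.

```python
from typing import Any, Optional


class _SketchBase:
    """Hypothetical minimal base; stands in for BaseAdapter."""

    def __init__(self, stratix: Any = None, *, org_id: Optional[str] = None) -> None:
        resolved = org_id or getattr(stratix, "org_id", None)
        if not resolved:
            raise ValueError("org_id is required")
        self.org_id = resolved


class ExampleFrameworkAdapter(_SketchBase):
    """Hypothetical framework adapter showing the forwarding pattern."""

    def __init__(
        self,
        stratix: Any = None,
        *,
        org_id: Optional[str] = None,
        capture_config: Any = None,
    ) -> None:
        # The subclass does not resolve or validate org_id itself;
        # that stays in one place, the base class.
        super().__init__(stratix, org_id=org_id)
        self.capture_config = capture_config


def instrument_example(
    target: Any, stratix: Any = None, *, org_id: Optional[str] = None
) -> ExampleFrameworkAdapter:
    """Hypothetical instrument_* helper that accepts and forwards the kwarg."""
    adapter = ExampleFrameworkAdapter(stratix, org_id=org_id)
    adapter.target = target
    return adapter
```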

Tests

tests/instrument/adapters/_base/test_org_id_propagation.py — 17 new tests covering construction-time fail-fast (5), per-emission propagation (5), cross-tenant isolation (2 — the explicit cross-tenant scenario the audit recommended), and public surface stability (2).
tests/instrument/adapters/frameworks/test_per_adapter_org_id.py — 37 parametrized tests (one accept + one fail-fast pair per adapter, plus an audit-cardinality guard) covering all 17 framework adapters.
All 25 existing _RecordingStratix test stand-ins updated with an org_id = "test-org" class attribute. instrument_* helper-call sites in tests pass org_id="test-org".
test_bulk_ported_smoke.py updated to pass org_id at the smoke construction sites.
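
A cross-tenant isolation test of the kind listed above might look roughly like this. It is a sketch, not the shipped test: `RecordingStratix`, `RecordingSink`, `TinyAdapter`, and the `model.invoke` event type are hypothetical, and the sink only mimics the keyword-only `org_id` argument that `EventSink.send` now requires.

```python
from typing import Any, Dict, List, Optional, Tuple


class RecordingStratix:
    """Hypothetical test stand-in with a class-level org_id."""

    org_id = "test-org"


class RecordingSink:
    """Hypothetical sink; records the envelope-level org_id per event."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def send(self, event: Dict[str, Any], *, org_id: str) -> None:
        self.calls.append((org_id, event))


class TinyAdapter:
    """Hypothetical adapter: stamps the payload and the envelope."""

    def __init__(
        self, stratix: Any, sink: RecordingSink, *, org_id: Optional[str] = None
    ) -> None:
        resolved = org_id or getattr(stratix, "org_id", None)
        if not resolved:
            raise ValueError("org_id is required")
        self._org_id = resolved
        self._sink = sink

    def emit_dict_event(self, payload: Dict[str, Any]) -> None:
        stamped = {**payload, "org_id": self._org_id}
        self._sink.send(stamped, org_id=self._org_id)


def test_cross_tenant_isolation() -> None:
    sink = RecordingSink()
    tenant_a = TinyAdapter(RecordingStratix(), sink, org_id="tenant-a")
    tenant_b = TinyAdapter(RecordingStratix(), sink, org_id="tenant-b")
    tenant_a.emit_dict_event({"type": "model.invoke"})
    # Misuse: tenant B's caller tries to smuggle tenant A's id into the payload.
    tenant_b.emit_dict_event({"type": "model.invoke", "org_id": "tenant-a"})
    assert [org for org, _ in sink.calls] == ["tenant-a", "tenant-b"]
    assert [event["org_id"] for _, event in sink.calls] == ["tenant-a", "tenant-b"]
```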

Documentation

docs/adapters/multi-tenancy.md — full contract for future adapters (resolution order, fail-fast semantics, defensive overwrite rationale, wiring template, test obligations).
BaseAdapter class docstring updated to call out the multi-tenant binding.

Acceptance (per task spec)

uv run pytest tests/instrument/adapters/_base/test_org_id_propagation.py -x — 17/17 passed
uv run pytest tests/instrument/adapters/frameworks/ -x (excluding 5 pre-existing collection failures unrelated to this PR: test_agentforce.py, test_langchain.py, test_langfuse.py, test_langgraph.py, test_haystack.py — all fail to import without optional deps that are not installed in this venv) — 258/258 in scope passed, 13 skipped
uv run mypy --strict src/layerlens/instrument/adapters/_base — clean
uv run ruff check src/layerlens/instrument/adapters/_base/ src/layerlens/instrument/adapters/frameworks/ tests/instrument/adapters/_base/ tests/instrument/adapters/frameworks/test_per_adapter_org_id.py — clean

CLAUDE.md compliance

No emission anywhere without org_id — enforced at the central _stamp_org_id choke point
Fail-fast on missing org_id at init — BaseAdapter.__init__ raises ValueError, never silently skips
All 17 adapters fixed, not a subset
No co-author trailers
Draft PR

References

A:/tmp/adapter-depth-audit.md (audit, 2026-04-25), cross-cutting finding #3
CLAUDE.md "Multi-Tenancy" section
docs/adapters/multi-tenancy.md — new contract doc shipped in this PR

…rrupt()

The evaluator agent calls ``interrupt()`` in ``confirm_judge_node`` for human-in-the-loop judge confirmation. A checkpointer is mandatory for that to work -- without one, a ``Command(resume=...)`` call produces zero events, ``ag-ui-langgraph`` never emits ``RUN_FINISHED``, and the CopilotKit frontend blocks all subsequent messages with "Cannot send 'RUN_STARTED' while a run is still active".

Changes:

- ``evaluator_agent.py``: compile with ``InMemorySaver`` and a commented Postgres swap block. Convert ``JudgeInfo`` / ``TraceInfo`` / ``EvaluationInfo`` from ``@dataclass`` to ``pydantic.BaseModel`` so LangGraph's default ``JsonPlusSerializer`` can persist state across the pause boundary (dataclasses raise ``TypeError: Type is not msgpack serializable``).
- ``samples/copilotkit/README.md``: add full FastAPI backend wiring with ``add_langgraph_fastapi_endpoint``, Next.js frontend wiring with ``LangGraphHttpAgent``, a checkpointer options matrix (InMemory / SQLite / Postgres / Redis / LangGraph Platform) with per-option migration snippets, a version-compatibility table pinning the versions the bug reporter used, and a troubleshooting section mapping the observed frontend errors back to the backend cause.
- ``docs/samples-guide.md``: cross-reference the checkpointer requirement.
- ``tests/test_samples_e2e.py``: add ``test_copilotkit_evaluator_interrupt_resume`` that imports the real ``langgraph`` (not ``MagicMock``), asserts the compiled graph has a non-None checkpointer, and drives a full ``astream -> interrupt -> Command(resume=...) -> astream`` cycle with a patched Stratix client. Confirmed this test fails on the pre-fix code and passes on the fix. Also extended the existing mock-modules dicts so the import-smoke tests include ``langgraph.checkpoint.memory``.

The existing tests missed this because they mock ``langgraph``, ``langgraph.graph``, and ``langgraph.types`` with ``MagicMock()`` and then only call ``main()`` (which prints usage). They never build or execute the graph, so they cannot observe the missing checkpointer.

Follow-ups to the interrupt/checkpointer fix, addressing the open items flagged in the prior commit:

1. Deserialize warning resolved. The Pydantic DTOs (``JudgeInfo`` / ``TraceInfo`` / ``EvaluationInfo``) are now registered on ``JsonPlusSerializer(allowed_msgpack_modules=...)`` via a custom serde passed to ``InMemorySaver``. Verified the sample passes with ``LANGGRAPH_STRICT_MSGPACK=true``, so it survives LangGraph's planned tightening of checkpoint deserialization.

2. End-to-end AG-UI wire validation. New integration test ``test_copilotkit_evaluator_agui_wire`` wires the evaluator graph into a FastAPI app through ``ag_ui_langgraph.add_langgraph_fastapi_endpoint``, drives the full user flow in-process via ``httpx.ASGITransport``, and asserts:
   - Phase 1 (initial run, hits ``interrupt()``): emits RUN_STARTED and RUN_FINISHED on the SSE stream.
   - Phase 2 (resume with user confirmation, same ``threadId``): emits RUN_STARTED and RUN_FINISHED.
   - Phase 3 (follow-up message after resume): not blocked -- RUN_STARTED and RUN_FINISHED fire again.

   This is the exact symptom the reporter hit, tested through the exact protocol path. Gated on ``pytest.importorskip`` for the heavy deps so the test skips cleanly when they are absent. Side benefit: running the same scenario against pre-fix code produces ``ValueError: No checkpointer set`` directly from ``graph.aget_state()``, giving operators a much louder error than the silent "stream ends without RUN_FINISHED" path.

3. README backend-wiring snippet corrected. The actual ``add_langgraph_fastapi_endpoint`` signature takes an ``agent=LangGraphAgent(...)`` wrapper, not a bare ``graph=`` kwarg -- the example in the previous commit would have failed at import. Also expanded the SSE-protocol explanation to match what the new e2e test observes on the wire.

4. Investigator graph annotated. ``investigator_graph`` does not call ``interrupt()`` so it does not need a checkpointer, but without an explicit note future contributors adding a HITL step would silently regress. Added a short comment at the ``.compile()`` call pointing at the evaluator pattern.

Follow-ups addressing the remaining open items from the previous two commits:

1. Rename ``error`` node to ``handle_error`` (evaluator + investigator). The old name collided with the ``error`` field on the state dataclass. LangGraph 1.x accepts the collision; earlier versions reject it with "'error' is already being used as a state key". Renaming the node (and the conditional-edge routing targets) keeps the routing token ``"error"`` purely an edge key and sidesteps the conflict on any LangGraph version the sample may be copied into.

2. Guard the ``allowed_msgpack_modules`` kwarg behind try/except so the sample still imports cleanly on langgraph<1.0 (where the kwarg does not exist and the strict-msgpack warning is not emitted either). Verified the sample now imports on both langgraph 0.2.56 and 1.1.9.

3. Ruff-clean the changed files (import sort I001 fixes on the new test additions; unrelated warnings in pre-existing ``main()`` / ``error_node`` code are out of scope per "only change what was asked").

4. New ``samples/copilotkit/tests/browser/`` harness:
   - ``backend/server.py`` -- FastAPI app that patches ``layerlens.Stratix`` before importing the evaluator module and mounts ``evaluator_graph`` via ``add_langgraph_fastapi_endpoint(..., path="/evaluator")``.
   - ``frontend/`` -- Next.js 16.2.4 app pinned to the reporter's exact CopilotKit versions (``@copilotkit/react-core``, ``@copilotkit/react-ui``, ``@copilotkit/runtime`` all at 1.56.3), with the CopilotKit runtime wired to ``LangGraphHttpAgent`` against the FastAPI backend.
   - ``frontend/tests/interrupt-resume.spec.ts`` -- Playwright spec that drives CopilotChat through the three-turn scenario the reporter hit ("evaluate" -> "ok" -> "thanks") and asserts the exact string "Cannot send 'RUN_STARTED' while a run is still active" appears in neither the visible DOM nor the browser console.

Known limitation documented in the harness README: CopilotChat 1.56's textarea reports as aria-hidden / non-"visible" under Playwright strict actionability checks in **headless** Chromium, and multiple input-driving patterns (``fill``, ``keyboard.type + Enter``, ``pressSequentially``, DOM-setter + bubbled input event) failed to reliably enable the Send button headlessly. The harness works with ``--headed`` for human verification and is structurally complete. The authoritative regression coverage for the fix is the Python test suite (``test_copilotkit_evaluator_interrupt_resume`` and ``test_copilotkit_evaluator_agui_wire``); the browser harness is corroborating / demo value, not gate-keeping.
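
The try/except guard from item 2 follows a common pattern: attempt the newer constructor signature and fall back when the kwarg is rejected. The sketch below uses hypothetical `OldSerde` / `NewSerde` classes in place of the real serializer across LangGraph versions, and treats the allowlist entries as opaque values, so it illustrates the control flow only.

```python
from typing import Any, Callable, Sequence


def build_serde(factory: Callable[..., Any], allowed: Sequence[Any]) -> Any:
    """Pass the allowlist when the constructor supports it, else fall back."""
    try:
        return factory(allowed_msgpack_modules=list(allowed))
    except TypeError:
        # Older constructor: the kwarg does not exist, so build without it.
        return factory()


class OldSerde:
    """Hypothetical pre-1.0 serializer: no allowlist kwarg."""

    def __init__(self) -> None:
        self.allowed = None


class NewSerde:
    """Hypothetical 1.x serializer: accepts the allowlist kwarg."""

    def __init__(self, allowed_msgpack_modules: Any = None) -> None:
        self.allowed = allowed_msgpack_modules
```

One trade-off of catching `TypeError` broadly is that a genuine `TypeError` raised inside the newer constructor would also trigger the fallback; inspecting the signature first would be stricter.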

…nggraph runId bug

DevRel surfaced that the backend fix in the previous commits got the Python side working but the frontend still locked up with "Cannot send 'RUN_STARTED' while a run is still active. ... INCOMPLETE_STREAM" on the second message. Raw SSE capture confirmed the root cause:

    RUN_STARTED  runId = "r1_aca59ad1"        (client-supplied)
    RUN_FINISHED runId = "019dc049-14ba-..."  (LangGraph's internal chain UUID)

This is an upstream bug in ag-ui-langgraph (ag-ui-protocol/ag-ui#1582): ``_handle_stream_events`` overwrites ``self.active_run['id']`` with every LangGraph event's internal ``run_id``, so RUN_FINISHED emits LangGraph's UUID instead of the client-supplied ``input.run_id``. ``@copilotkit/runtime`` tracks active runs by client runId and raises RUN_ERROR/INCOMPLETE_STREAM.

Verified the bug is present in both ag-ui-langgraph 0.0.22 (CopilotKit's officially-pinned version) and 0.0.34 (the reporter's version), and also in ``copilotkit.LangGraphAGUIAgent`` which inherits the broken method.

Changes in this commit, all aligned with CopilotKit's own ``examples/integrations/langgraph-fastapi`` reference sample:

1. ``evaluator_agent.py``:
   - State class converted from ``@dataclass`` to a ``TypedDict`` inheriting from ``copilotkit.CopilotKitState``. This gives us ``MessagesState``'s ``add_messages`` reducer for free (nodes return NEW messages; they are appended, not replaced) and the ``copilotkit`` field the frontend injects.
   - All node functions updated from ``state.X`` / ``state.messages + [m]`` to ``state.get('X')`` / ``{'messages': [m]}``.
   - HITL interrupt now uses ``copilotkit.langgraph.copilotkit_interrupt`` (wraps ``interrupt()`` with ``__copilotkit_messages__`` so the prompt renders as a real AIMessage in the chat UI). The bare ``langgraph.types.interrupt(prompt)`` emitted a CUSTOM event the UI ignored -- why the reporter said "the agent stops and never reaches the human-in-the-loop confirmation step."
   - New ``RunIdPreservingAgent`` subclass (lazy factory ``_build_langgraph_agui_agent``) overrides ``_dispatch_event`` to restore ``input.run_id`` on RUN_FINISHED / RUN_ERROR terminal events. Clearly commented with a "remove when upstream ships" TODO pointing at the ag-ui-protocol/ag-ui issue.

2. ``samples/copilotkit/README.md``:
   - Version matrix re-pinned to CopilotKit's exact tested set (``copilotkit==0.1.74``, ``langchain==1.0.1``, ``langgraph==1.0.1``, ``ag-ui-langgraph==0.0.22``, ``@copilotkit/*==1.56.3``, Python ``>=3.10,<3.13``).
   - Upstream-bug callout explaining the runId workaround.
   - Backend wiring snippet updated to show the factory import and the ``LangGraphAGUIAgent`` path for non-interrupt graphs (investigator).

3. ``tests/test_samples_e2e.py``:
   - ``test_copilotkit_evaluator_interrupt_resume`` now sends ``Command(resume=[HumanMessage(content='ok')])`` rather than ``Command(resume='ok')``, matching ``copilotkit_interrupt``'s expected resume payload shape.
   - ``test_copilotkit_evaluator_agui_wire`` rewritten. The previous version had a blind spot: it only asserted RUN_FINISHED was PRESENT, not that ``RUN_STARTED.runId == RUN_FINISHED.runId == input.run_id``. Now it uses the ``RunIdPreservingAgent`` factory and asserts runId continuity end-to-end. Without the workaround this test would catch the upstream bug immediately.
   - Mock-module dict extended with ``copilotkit.langgraph`` for the import-smoke test.

4. ``samples/copilotkit/tests/browser/``:
   - ``backend/requirements.txt`` re-pinned to CopilotKit's set.
   - ``backend/server.py`` switched from raw ``ag_ui_langgraph.LangGraphAgent`` to the sample's factory so the browser harness also benefits from the runId workaround.

All 9 copilotkit tests pass in the pinned venv. Empirical verification scripts in /tmp/ (not committed) show raw SSE with matching runIds end-to-end.
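
The runId continuity assertion (``RUN_STARTED.runId == RUN_FINISHED.runId == input.run_id``) can be sketched over a captured SSE body as follows. This assumes each event arrives as a ``data: {json}`` line carrying ``type`` and ``runId`` fields, as in the capture quoted in this commit message. The helper names and the sample run ids in the usage are hypothetical, not the ones in ``tests/test_samples_e2e.py``.

```python
import json
from typing import Any, Dict, List


def parse_sse_events(body: str) -> List[Dict[str, Any]]:
    """Collect the JSON payload of every ``data:`` line in an SSE body."""
    events = []
    for line in body.splitlines():
        if line.startswith("data:"):
            events.append(json.loads(line[len("data:"):].strip()))
    return events


def run_ids_match(events: List[Dict[str, Any]], client_run_id: str) -> bool:
    """True only when the single start and single finish both carry the client id."""
    started = [e.get("runId") for e in events if e.get("type") == "RUN_STARTED"]
    finished = [e.get("runId") for e in events if e.get("type") == "RUN_FINISHED"]
    return started == [client_run_id] and finished == [client_run_id]
```

A presence-only check (``any(e["type"] == "RUN_FINISHED" ...)``) would pass on the buggy stream, which is the blind spot the rewritten test closes.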

…rrupt path

DevRel's diagnostic bundle (ag-ui-langgraph==0.0.34, copilotkit==0.1.87, @ag-ui/client==0.0.52 transitively) confirmed commit 542002b did not fix the browser symptom. Raw SSE from the Network tab showed:

    RUN_STARTED runId=d0b9d6c5-...
    [graph reaches step="confirm_judge" -- interrupt IS being hit]
    RUN_ERROR {code: "INCOMPLETE_STREAM", message: "Cannot send 'RUN_STARTED' while a run is still active. The previous run must be finished with 'RUN_FINISHED' before starting a new run."}

Same error text as before, different root cause. A second bug in ag-ui-langgraph: when a request arrives on a thread whose graph is already paused at ``interrupt()`` and the request does NOT carry ``forwardedProps.command.resume``, the ``has_active_interrupts`` branch of ``prepare_stream`` (agent.py:491) emits a second ``RunStartedEvent`` to ``events_to_dispatch`` -- after ``_handle_stream_events`` (line 209) already emitted one at the top of the stream. The server's own AG-UI encoder validator catches the duplicate and converts it into a ``RUN_ERROR`` with the exact "Cannot send 'RUN_STARTED'..." message, terminating the stream before ``RUN_FINISHED`` can be dispatched. On ``@ag-ui/client@0.0.52`` (the newer protocol-state validator, which enforces within-stream start/finish invariants rather than the runId correlation the previous version used) this is what lands as INCOMPLETE_STREAM in the browser.

Extended the sample's workaround subclass to filter at the agent boundary rather than override ``_dispatch_event`` (which expects to return an Event, not None/""). The filter:

1. Drops any RUN_STARTED after the first within a single stream -- fixes the duplicate-emission bug on the ``has_active_interrupts`` path.
2. Restamps ``input.run_id`` on RUN_FINISHED / RUN_ERROR -- preserves the existing ag-ui-protocol/ag-ui#1582 fix for older clients that correlate by runId.

Verified on both pin matrices:

- copilotkit==0.1.74 / ag-ui-langgraph==0.0.22 (CopilotKit's own reference sample pins): all tests pass.
- copilotkit==0.1.87 / ag-ui-langgraph==0.0.34 (DevRel / reporter): all tests pass.

Tightened ``test_copilotkit_evaluator_agui_wire`` accordingly:

- asserts exactly one RUN_STARTED per stream (catches bug b)
- asserts no RUN_ERROR
- asserts RUN_STARTED.runId == RUN_FINISHED.runId == input.run_id (catches bug a, ag-ui-protocol/ag-ui#1582)

Without either half of the workaround the test fails with a precise message pointing at which bug regressed.

Follow-up: file the duplicate-RUN_STARTED bug upstream as a separate issue on ag-ui-protocol/ag-ui.
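
The two filter rules can be sketched as a generator over the event stream. This is an illustration only: it operates on plain dicts with ``type`` / ``runId`` keys rather than the Event objects the real subclass sees, and ``filter_run_events`` is a hypothetical name, not the method in ``evaluator_agent.py``.

```python
from typing import Any, Dict, Iterable, Iterator

TERMINAL_TYPES = {"RUN_FINISHED", "RUN_ERROR"}


def filter_run_events(
    events: Iterable[Dict[str, Any]], client_run_id: str
) -> Iterator[Dict[str, Any]]:
    """Apply both workarounds to one stream of AG-UI-style event dicts."""
    seen_started = False
    for event in events:
        kind = event.get("type")
        if kind == "RUN_STARTED":
            if seen_started:
                continue  # rule 1: drop any RUN_STARTED after the first
            seen_started = True
        elif kind in TERMINAL_TYPES:
            # rule 2: restamp the client-supplied run id on terminal events
            event = {**event, "runId": client_run_id}
        yield event
```

Filtering by skipping an item in the generator avoids the problem noted above with ``_dispatch_event``, which has to return an event and so cannot drop one.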

… ship lockfile

Replaces the earlier pinning to CopilotKit's reference-sample versions (copilotkit==0.1.74 / ag-ui-langgraph==0.0.22) with the current published set customers actually install:

    copilotkit==0.1.87
    langchain==1.2.15
    langchain-core==1.3.0
    langgraph==1.1.9
    ag-ui-langgraph==0.0.34

Frontend transitive ``@ag-ui/client==0.0.52`` now matches what ``@copilotkit/react-core==1.56.3`` actually pulls in (DevRel's environment per their diagnostic bundle).

Changes:

- ``samples/copilotkit/tests/browser/backend/requirements.txt`` -- pins updated to the latest set above.
- ``samples/copilotkit/tests/browser/backend/requirements.lock`` -- NEW, committed pip-freeze of the verified environment. ``pip install -r requirements.lock`` now gives byte-identical transitive deps.
- ``samples/copilotkit/README.md`` -- version matrix and install snippets updated to the latest set; upstream-bug callout now lists both issues (``ag-ui-protocol/ag-ui#1582`` runId overwrite, ``ag-ui-protocol/ag-ui#1584`` duplicate RUN_STARTED).
- ``samples/copilotkit/agents/evaluator_agent.py`` -- renamed the factory from ``_build_langgraph_agui_agent`` to the public ``build_agui_agent``; added a ``_version_guard_ag_ui_langgraph`` helper that emits a ``RuntimeWarning`` when the installed version is outside the tested range ``[0.0.22, 0.0.34]`` so silent behavior drift does not hide a regression. A backwards-compatible alias keeps the old private name importable for internal tests during the rename window.
- ``samples/copilotkit/tests/browser/backend/server.py`` and ``tests/test_samples_e2e.py`` -- call sites updated to the public name.

Verified end-to-end against the latest version matrix:

- pytest -k copilotkit: 9 passed, 2 skipped (live-only).
- Manual HTTP drive against a running backend with the reporter's exact flow (turn 1 initial -> interrupt, turn 2 re-entry on paused graph): both turns emit exactly one RUN_STARTED and one RUN_FINISHED, both with matching client runIds, no RUN_ERROR / INCOMPLETE_STREAM.
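
A version guard of the kind described for ``_version_guard_ag_ui_langgraph`` might look like the sketch below. It takes the installed version as a string argument so it stays self-contained; the real helper presumably reads it from package metadata (for example via ``importlib.metadata.version``), which is an assumption. ``check_tested_range`` is a hypothetical name, and the parser only handles plain ``X.Y.Z`` numeric versions.

```python
import warnings
from typing import Tuple

TESTED_MIN = (0, 0, 22)
TESTED_MAX = (0, 0, 34)


def _parse(version: str) -> Tuple[int, ...]:
    """Parse a plain numeric X.Y.Z string; pre-release suffixes are not handled."""
    return tuple(int(part) for part in version.split(".")[:3])


def check_tested_range(installed: str) -> bool:
    """Warn (but do not fail) when the version is outside the tested range."""
    in_range = TESTED_MIN <= _parse(installed) <= TESTED_MAX
    if not in_range:
        warnings.warn(
            f"ag-ui-langgraph {installed} is outside the tested range "
            "[0.0.22, 0.0.34]; the workaround may not apply cleanly.",
            RuntimeWarning,
            stacklevel=2,
        )
    return in_range
```

Warning instead of raising keeps the sample importable on newer releases while still surfacing the drift.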

… resume heuristic

DevRel confirmed the Apr-24 push resolved the turn-1 INCOMPLETE_STREAM (backend now emits a clean RUN_STARTED -> STEP_* -> RUN_FINISHED for the initial interrupt turn). Remaining gap: when the user replies to the interrupt, plain ``<CopilotChat>`` sends the reply as an ordinary new chat message, not as ``forwardedProps.command.resume`` -- so the graph stayed paused and the same error returned on the follow-up.

Correct fix is on the frontend, not the backend: ``@copilotkit/react-core@1.56.3`` ships ``useLangGraphInterrupt``, the hook specifically designed for this case. It renders a UI when the graph pauses at ``interrupt()`` and calls ``resolve(...)`` with the user's answer -- which the runtime forwards as the proper ``command.resume`` payload. This is the supported AG-UI protocol path: the frontend must explicitly signal a resume rather than a new turn.

Changes:

- ``samples/copilotkit/tests/browser/frontend/app/page.tsx``: wires ``useLangGraphInterrupt`` with a dedicated prompt widget (``data-testid`` stable for automation), and a "Start evaluation" test-hook button that uses ``useCopilotChat().appendMessage`` to kick off the graph without having to type into CopilotChat's textarea (which Playwright can't reliably drive on 1.56.3 + React 19). The ``resolve([{role:"user", content}])`` shape matches what ``copilotkit_interrupt`` expects server-side (``answer = response[-1].content``).
- ``samples/copilotkit/tests/browser/frontend/app/globals.css``: styles for the interrupt widget and the test-hook start button.
- ``samples/copilotkit/agents/evaluator_agent.py``: reverts the backend auto-resume heuristic I had shipped as a stopgap. It was overloading the protocol semantics ("any user message during active interrupt == resume answer") which is incorrect for anything beyond a simple sample -- breaks cancel flows and multi-interrupt scenarios. The backend now only does the two genuine protocol-bug workarounds (runId overwrite, duplicate RUN_STARTED). Resume belongs to the frontend.

Test plan:

- Python test suite: ``pytest -k copilotkit`` -- 9 passed / 2 skipped (live) on DevRel's exact version matrix.
- Backend HTTP round-trip with a programmatic ``command.resume`` payload: both turns emit matched ``RUN_STARTED``/``RUN_FINISHED`` with client runId, no ``RUN_ERROR`` (verified on 2026-04-24).
- Browser end-to-end: the hook wiring in page.tsx matches CopilotKit's own showcase pattern and the hook source I inspected. I could not self-verify the full browser round-trip because (a) Playwright cannot reliably drive CopilotChat's textarea on 1.56.3 + React 19 (tracked at CopilotKit/CopilotKit#4215), and (b) my attempted programmatic appendMessage test-hook did not trigger a runtime POST in my local venv for reasons I have not yet pinned down. **DevRel re-test in a real browser is the authoritative check for the frontend round-trip.**

Follow-up (per "#2" in the user's plan): rewrite the evaluator HITL to use CopilotKit's current idiom (``useCopilotAction`` / ``useHumanInTheLoop`` -- frontend-defined tool + UI render + resolve) instead of backend ``interrupt()``. That's the pattern CopilotKit's active samples use; it avoids the ag-ui-langgraph interrupt path bugs entirely and is where customers should be pointed for new work.

…TL tool

Replaces the custom StateGraph + ``langgraph.types.interrupt()`` pattern with CopilotKit's current HITL idiom: ``langchain.agents.create_agent`` driving an LLM that calls backend tools, with the human-in-the-loop step wired as a **frontend** tool via ``useCopilotAction`` + ``renderAndWaitForResponse``. This matches what CopilotKit's active showcases (``hitl_in_chat_agent.py``, ``interrupt_agent.py``) use.

Why the rearchitect: the ``interrupt()`` code path in ``ag-ui-langgraph`` has two protocol-level bugs (tracked upstream as ``ag-ui-protocol/ag-ui#1582`` and ``#1584``) that the previous revision worked around by subclassing ``LangGraphAGUIAgent`` and reaching into private internals. That ships, but it's not the pattern CopilotKit themselves exercise, and the workaround is fragile across upstream bumps. Moving off the ``interrupt()`` path sidesteps both bugs by construction and aligns with CopilotKit's active direction.

Design (three-role review):

- **AI engineer**: LLM drives. Backend tools (``list_judges``, ``list_recent_traces``, ``run_trace_evaluation``, ``get_evaluation_result``) are thin wrappers over the LayerLens SDK. A tight system prompt guides the flow. ``confirm_judge`` is a frontend tool declared via ``useCopilotAction``; ``CopilotKitMiddleware()`` bridges it into the agent's toolbelt so the LLM can "call" it like any other tool.
- **Designer**: HITL renders as a card list -- each judge shows name, id, and evaluation goal, with a ``Select <Name>`` button. Keyboard accessible, visible focus states, compact "Judge selected." state after the user chooses. ``data-testid`` attributes throughout for deterministic automation.
- **SDK engineer**: ~160 LoC for the evaluator (down from ~560). No private-API reach. No workaround subclass. No checkpointer needed (``create_agent`` owns state). Lockfile updated for ``langchain-openai``. Frontend pins unchanged.

The old ``build_agui_agent`` factory, ``build_graph`` with a custom ``StateGraph``, ``EvaluatorState`` TypedDict, all node functions, the msgpack DTO allowlist, and the version-guard helpers are all gone -- replaced by one ``build_graph(model=...)`` that returns the compiled ``create_agent`` graph.

Tests:

- ``tests/test_samples_e2e.py`` rewritten. ``test_copilotkit_evaluator_interrupt_resume`` and ``test_copilotkit_evaluator_agui_wire`` (both specific to the old ``interrupt()`` architecture) replaced by ``test_copilotkit_evaluator_tools``, which exercises each backend tool against a patched Stratix client and verifies the system prompt references ``confirm_judge``.
- Import-smoke test mock list extended for ``langchain.agents`` / ``langchain.tools`` / ``langchain_core.tools`` / ``langchain_openai``.
- ``pytest -k copilotkit``: 8 passed, 2 skipped (live).

Frontend:

- ``page.tsx``: ``useCopilotAction("confirm_judge", ...)`` with a rich judge-card list; ``useLangGraphInterrupt`` removed.
- ``globals.css``: styles for ``judge-picker`` / ``judge-card`` / complete / empty states.
- ``Evaluate my traces`` quick-action button retained for direct user triggering and automation.

Backend server:

- ``samples/copilotkit/tests/browser/backend/server.py`` swaps ``build_agui_agent(...)`` for plain ``LangGraphAGUIAgent(...)`` -- no workaround needed on this code path.

README:

- Full rewrite around the new architecture. Version matrix unchanged. The two upstream ``ag-ui-langgraph`` bugs are preserved in the "informational" section for customers building their own ``interrupt()``-based graphs.

Per user direction: no backwards compatibility for the old sample (no customer has it). The workaround subclass is removed, not deprecated.

The previous commit's tests verified the new architecture against mocks; this one verifies it against a real LLM through the actual AG-UI FastAPI endpoint.

New test ``test_copilotkit_evaluator_live_llm``:

- Loads credentials from a gitignored ``.env`` (or real env vars in CI), with OpenRouter convenience: if only ``OPENROUTER_API_KEY`` is set, the loader auto-points ``OPENAI_BASE_URL`` at OpenRouter.
- Builds a FastAPI app with the patched Stratix client + the real evaluator graph (real LLM, no fake model).
- POSTs an AG-UI ``RunAgentInput`` whose ``tools`` array declares the ``confirm_judge`` frontend tool, exactly as the browser would.
- Asserts: tool sequence is ``list_recent_traces`` -> ``list_judges`` -> ``confirm_judge``; agent halts at ``confirm_judge`` (never calls ``run_trace_evaluation``); single ``RUN_STARTED`` + ``RUN_FINISHED`` with matching client ``runId``; no ``RUN_ERROR``.
- Marked ``@pytest.mark.live`` and ``pytest.skip``s when no key is available, so the default ``pytest`` run is unaffected.

Verified locally: passes against ``openrouter:openai/gpt-4o-mini``.

Other changes in this commit:

- ``evaluator_agent.py``:
  - ``_default_model()`` honours ``OPENAI_API_KEY``, ``OPENAI_BASE_URL``, and ``OPENAI_MODEL`` so any OpenAI-compatible endpoint works (OpenAI, Ollama, LM Studio, OpenRouter, vLLM, ...). For non-compatible providers, customers pass any LangChain ``BaseChatModel`` to ``build_graph(model=...)``.
  - ``create_agent`` now compiles with ``InMemorySaver``. ``ag-ui-langgraph``'s ``add_langgraph_fastapi_endpoint`` calls ``graph.aget_state(config)`` on every request, which fails with ``ValueError("No checkpointer set")`` if the graph wasn't compiled with one -- regardless of whether ``interrupt()`` is used.
  - ``build_agui_agent`` reintroduced as a *minimal* runId-only workaround for ``ag-ui-protocol/ag-ui#1582``. Bug #1584 (duplicate RUN_STARTED) is unreachable on this code path because the evaluator never calls ``langgraph.types.interrupt()``, so we only need the runId fix. Live test confirms the workaround restores runId continuity end-to-end.
- ``samples/copilotkit/tests/browser/backend/server.py``: switched back to ``build_agui_agent(...)`` so the runId workaround is active in the harness backend. The earlier "no workaround needed" claim was wrong; @ag-ui/client@0.0.52 doesn't enforce runId continuity but older clients did and future strict ones likely will.
- ``tests/.env.example``: documents the supported env vars (OPENAI, OpenRouter convenience, LayerLens). Real ``tests/.env`` is gitignored.
- ``samples/copilotkit/README.md``: documents the live-test setup and links the .env.example. Also documents the ``OPENAI_API_KEY``/``OPENAI_BASE_URL``/``OPENAI_MODEL`` env-var triplet for OpenAI-compatible providers (Ollama, LM Studio, OpenRouter).
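
The env-var handling can be sketched as a pure function over a mapping. Several details here are assumptions rather than facts from the commit: that the OpenRouter key is reused as the API key, that an explicit ``OPENAI_BASE_URL`` wins over the OpenRouter default, and the ``https://openrouter.ai/api/v1`` endpoint. ``resolve_model_settings`` is a hypothetical name, not the sample's ``_default_model()``.

```python
from typing import Dict, Mapping, Optional

# Assumed OpenRouter OpenAI-compatible endpoint.
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def resolve_model_settings(env: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Resolve key, base URL, and model name from an environment mapping."""
    api_key = env.get("OPENAI_API_KEY")
    base_url = env.get("OPENAI_BASE_URL")
    # OpenRouter convenience: only an OpenRouter key is present.
    if not api_key and env.get("OPENROUTER_API_KEY"):
        api_key = env["OPENROUTER_API_KEY"]
        base_url = base_url or OPENROUTER_BASE_URL
    return {
        "api_key": api_key,
        "base_url": base_url,
        "model": env.get("OPENAI_MODEL"),  # None lets the caller pick a default
    }
```

Passing ``os.environ`` as ``env`` gives the live behaviour, while a plain dict keeps the function easy to exercise in tests without touching the process environment.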

DevRel hit a "page renders but every button is dead, textarea won't accept input" failure mode while running the harness locally. Diagnosis took several iterations because there was no client-side error:

- Backend was healthy; ``/healthz`` returned 200.
- ``/api/copilotkit`` was up; an ``info`` JSON-RPC probe listed the evaluator agent.
- Direct POSTs to the backend at :8123 streamed real LLM events.
- The page HTML had every expected ``data-testid``.
- Browser console showed only one repeating warning: ``WebSocket connection to 'ws://127.0.0.1:3000/_next/webpack-hmr' failed: Error during WebSocket handshake``

Root cause: Next 16 enforces a cross-origin allowlist for dev resources (including the webpack-hmr WebSocket). When the user serves on ``127.0.0.1`` but the allowlist is implicit ``localhost``, HMR fails to connect and Next leaves React in a half-hydrated state. The page renders from the server but client React never wires up event handlers or controlled-input state -- so buttons and textareas are visually present but inert. No error is surfaced beyond the WebSocket warning.

Fix:

- Add ``allowedDevOrigins: ["127.0.0.1", "localhost"]`` to ``samples/copilotkit/tests/browser/frontend/next.config.js``. Both origins are the supported way to load the harness; without this, whichever the user picks tends to break.

Also, to make this kind of failure self-diagnosing rather than requiring DevTools-paste skills:

- New ``samples/copilotkit/tests/browser/frontend/public/diag.html`` -- a static page (no React) that runs three probes on load and renders results inline: runtime ``info`` reachability, an ``agent/run`` round-trip through ``/api/copilotkit``, and a direct ``/healthz`` ping against the backend. Visit ``http://127.0.0.1:3000/diag.html`` to see green/red labels for each. This bypasses the React app entirely, so it stays useful even when hydration is broken.
- New "Run diagnostic" button on the harness page (next to "Evaluate my traces") that runs the same probes plus a couple of React-only checks (textarea state, isLoading, intercepted ``appendMessage`` POST body) and renders the report directly on the page. Useful for users who can't (or don't want to) paste JS into DevTools console.

Verified locally: after the cache + allowedDevOrigins fix, both buttons fire, ``appendMessage`` POSTs to ``/api/copilotkit`` and gets back a real ``RUN_STARTED`` SSE stream end-to-end.

CopilotKit's ``renderAndWaitForResponse`` re-renders the action UI progressively as the LLM streams the tool-call JSON, so for the first render tick or two ``judge.id`` (and sometimes ``judge.name``) can be undefined even though the surrounding React state is stable. That tripped two issues in our judge picker:

1. ``key={judge.id}`` warned "Each child in a list should have a unique key prop" when id was undefined.
2. The Select button was clickable with an undefined id, which would ``respond({ id: undefined, name: undefined })`` and break the resume.

Fix:

- Fall back to ``pending-{index}`` for the React key while id is pending. Quiet warning + stable row identity.
- Mark each row "ready" only when both id and name are present and ``respond`` is non-null. Disable the Select button and show "Loading..." until ready. The button text and ``data-testid`` follow the ready state so automated tests don't grab a half-loaded row by accident.
- Hide the dim id-pill (``judge-card-id``) while id is pending so the card doesn't flash an empty grey box.

…tionCard

DevRel asked: "where is the tool indicator I should see?" CopilotChat only renders user/assistant text and frontend HITL widgets by default; backend tool calls fire invisibly. Surface them with the ``useCopilotAction`` + ``available: "remote"`` + ``render`` pattern -- the same pattern CopilotKit's ``tool_rendering_agent.py`` showcase uses.

Changes:

- All four backend tools (``list_recent_traces``, ``list_judges``, ``run_trace_evaluation``, ``get_evaluation_result``) now render inline cards with a pulsing-dot "Running" status pill, transitioning to a green "Done" pill when the tool resolves. Each card has a stable ``data-testid`` for automated tests.
- ``get_evaluation_result`` (the final result) renders the polished ``EvaluationCard`` from ``samples/copilotkit/components/`` -- the production-grade SDK card with the score donut and pass-rate ring. Imported via a tsconfig path alias (``@layerlens/copilotkit-cards``) so the harness can reuse the upstream SDK components without copying them.
- ``confirm_judge`` HITL picker restyled with matching Tailwind tokens to keep the visual language consistent across all tool cards.
- Tailwind 4 added (``@tailwindcss/postcss``, ``tailwindcss``) + ``postcss.config.mjs`` + ``@import "tailwindcss"`` in ``globals.css``. Inline custom CSS removed in favour of Tailwind utilities, matching CopilotKit's own showcase samples.
- ``html className="dark"`` + ``color-scheme: dark`` so the SDK reference cards (which key off the ``.dark`` ancestor) render in dark mode by default.
- ``<CopilotKit showDevConsole={false}>`` -- DevRel reported the default web-inspector "kite" obscured the harness header; suppressed for the sample.
- ``tsconfig.json`` includes ``../../../components/**/*`` so Next's bundler picks up the SDK card sources, and adds the ``@layerlens/copilotkit-cards`` path alias.

The pattern (frontend ``useCopilotAction`` for backend tools with ``available: "remote"``) is what customers should copy. The harness demonstrates it in two flavours: lightweight inline cards (for the first three tools) and full SDK-component composition (for the result). Both styles are valid; teams pick based on the visual weight they want.

Reshaped the CopilotKit sample so it reads as a commercial-grade SDK demo rather than a test fixture, and brought the visual language into line with CopilotKit's own samples (research-canvas, travel, banking, with-shadcn-ui).

Structure

- Move sample out of `samples/copilotkit/tests/browser/{backend,frontend}` to `samples/copilotkit/app/{backend,frontend}` so customers see "the app" rather than "a test harness". Update README + path constants.
- Add `app/frontend/.gitignore` for `.next/`, `node_modules/`, and Playwright artefacts.

Backend (`app/backend/server.py`, `agents/evaluator_agent.py`)

- Real LayerLens only: missing `LAYERLENS_STRATIX_API_KEY` is a hard startup error. No fake-fixture path, no `MagicMock`, no env-var flag -- fixtures only ever existed for an earlier Playwright fixture and conflicted with the SDK posture in CLAUDE.md.
- Agent built with `create_agent` + `CopilotKitMiddleware`, real `@tool` impls returning `Command(update={...})` so each tool emits state into `state.{traces,judges,evaluations,results}`. Async tools call `copilotkit_emit_state` so the canvas updates live during a run.
- New `GET /evaluations/{id}` endpoint for out-of-band polling: the agent kicks off evaluations, ends in seconds, and the frontend folds completed verdicts into the canvas as each evaluation resolves on LayerLens. Fixes the 30s-evaluation-vs-LLM-polling-loop hallucination.
- `LangGraphAGUIAgent` constructor gets `config={"recursion_limit": 200}` so a 5-trace fan-out doesn't trip the default 25-hop limit (tested via `with_config` first; that path is dropped by ag-ui's internal config merge).
- System prompt rewritten: strict tool order; `confirm_judge` takes no args (frontend reads candidates from `state.judges` to avoid the `tool_argument_parse_failed: Unterminated string in JSON` we hit when streaming 38 judges through tool args); evaluations capped at 5 traces; pending != failed; final summary template branches on whether anything completed.

SDK card library (`samples/copilotkit/components/`)

- Rewritten on top of shadcn/ui primitives. Cards now compose `Card`, `CardHeader`, `CardContent`, `CardFooter`, `Badge`, `Button`, `Separator`, `Progress` from `@/components/ui/*`. Status pills use the `bg-{color}-50 text-{color}-600 dark:bg-{color}-900/20` pattern CopilotKit's banking sample uses, not custom ring/shadow chrome.
- Stock shadcn neutral OKLCH palette (`baseColor: neutral`). Brand accent `#6766FC` applied via Tailwind class strings on CTAs/links -- same approach research-canvas takes for its accent. No edits to `--primary` / shadcn theme variables.
- Score bars solid (`bg-green-500` / `bg-red-500` / `bg-amber-500`), not gradients. Sparklines color-coded by pass-rate threshold.
- `dashboardBaseUrl` is now strictly opt-in across `TraceCard` and `EvaluationCard`: the "Trace Explorer →" / "Agent Graph →" / "View in Dashboard →" footers only render when a real URL is configured via `NEXT_PUBLIC_LAYERLENS_DASHBOARD_URL`. Stops 404s on routes that aren't deployed yet.

Frontend (`app/frontend/`)

- shadcn primitives installed via `npx shadcn@latest add card button badge progress separator`. Deps: `radix-ui`, `class-variance-authority`, `clsx`, `tailwind-merge`, `tw-animate-css`. Tailwind 4 + React 19. `components.json` aliases `ui` to the SDK card library.
- New `globals.css` with shadcn neutral tokens (`--background`, `--card`, `--muted-foreground`, etc.), `@theme inline` mapping for Tailwind 4, and a `--copilot-kit-*` bridge so `<CopilotChat>` reads the same neutral tokens as the canvas. Brand accent set on `--copilot-kit-secondary-color`. Drops the previous "force dark" CSS.
- Layout split-pane, **light by default** to match every official CopilotKit sample. New `theme-toggle.tsx` segmented control (Light / System / Dark) persists to `localStorage` and reacts to OS-level theme changes when set to System.
- `useCoAgent({ name: "evaluator" })` reads live agent state. New out-of-band poller (`useEffect` against `/evaluations/{id}` every 5 s) folds verdicts that arrive after the agent run ends into the canvas. `state.results` (agent) and `polledResults` (frontend) are merged via `useMemo` so MetricStrip / EvaluationCard / JudgeVerdictCard all see one consistent results array.
- Picker: `JudgePicker` is its own component subscribed to `useCoAgent` so it re-renders when `state.judges` populates after the LLM streams out the tool call. `confirm_judge` uses `available: "remote"` + `renderAndWaitForResponse` per the canonical research-canvas HITL pattern.

Cleanup

- Strip every dev artefact: agent's `[tool] X INVOKED` prints, the page's debug-state `<pre>`, the `console.log("[evaluator state]"…)` effect, the "Run diagnostic" button + panel + state, and the `probe_e2e.py` SSE diagnostic script. Header is now just the title, theme toggle, and the primary CTA.

…n reasoning

Polish pass after first review:

- Chat token bridge fixed. Re-read CopilotKit's ``react-ui/colors.css`` semantics: ``primary-color`` is the user-bubble + interactive accent, ``secondary-color`` is the assistant message background, not a brand slot. Earlier mapping made the assistant greeting render as solid indigo and clip out of view in light mode. Now mapped onto shadcn tokens semantically: ``primary → --primary``, ``contrast → --primary-foreground``, ``secondary → --card``, ``secondary-contrast → --card-foreground``. Brand accent ``#6766FC`` stays only on actual CTA buttons via Tailwind class strings.
- ``JudgePicker`` "selected" pill now uses light + dark variants (``bg-green-50 text-green-700 dark:bg-green-900/20 dark:text-green-300``) instead of dark-mode-only emerald that disappeared on a light page.
- ``JudgeVerdictCard`` redesign:
  * Pass / Fail / Error are now solid-filled badges (``bg-green-600``, ``bg-red-600``, ``bg-amber-600`` with white text), readable at a glance instead of subtle ghost pills.
  * Severity rendered as a colored pill with a triangle alert glyph, not a dot. Severity is a status (impact-of-failure level), not a trend, so an "alert" shape is correct; chevrons would imply direction. Hide the severity chip when verdict=pass AND severity=low -- nothing meaningful to flag.
  * Reasoning rendered through a tiny inline ``MarkdownLite`` that handles paragraph breaks, line breaks, ``**bold**``, and ``*italic*`` -- the cases LayerLens API actually emits. No ``react-markdown`` dep (the SDK card library lives outside the Next app's node_modules so it can't resolve packages there); no raw HTML injection. Fixes the wall-of-text rendering of judge reasoning.
- Tailwind 4 ``@source`` directive added to ``globals.css`` so it scans ``samples/copilotkit/components/**/*.{ts,tsx}``. Without this, classes used inside the SDK card library (``bg-amber-500``, ``bg-green-600``, etc.) get tree-shaken out of the generated CSS and pills silently flatten to plain text.
- ``TraceCardProps.status`` made optional. The LayerLens ``traces.get_many`` API doesn't expose per-trace lifecycle today, so the sample no longer hardcodes ``status="ok"`` -- that was rendering a misleading green pill on every trace regardless of reality. The status pill is hidden when the prop is omitted; restore it once the API surfaces real status.

When the agent kicks off N evaluations and only K complete on the first poll, the remaining (N - K) used to disappear from the ``Verdicts`` grid even though the run-summary card still counted them -- verdict count would say "5", grid would show 4, and the trailing pending one looked like it had been lost.

Add ``PendingVerdictCard``: same shadcn ``Card`` chrome as ``JudgeVerdictCard``, with a "Running" pill, a pulsing skeleton bar for the score, and copy explaining that real LayerLens evaluations can take a minute or two. Render one per evaluation that doesn't have a matching entry in ``state.results`` yet.

Side effects:

- ``Verdicts`` section count now reflects total evaluations (not just completed) so the grid count matches what's actually rendered.
- Section now renders even when ``results.length === 0`` as long as there are evaluations in flight (previously fell through to a textual placeholder).
- Run summary picks the judge name from the first pending evaluation if no result has come back yet.

The polling loop is unchanged -- it keeps polling ``/evaluations/{id}`` every 5 s and replaces a pending card with the real ``JudgeVerdictCard`` the moment LayerLens returns a verdict.

The judge ``evaluation_goal`` field LayerLens returns is markdown-formatted (paragraph breaks, ``**bold**`` headers, numbered lists). Both the in-chat picker and the canvas's "Available judges" card were rendering it through plain ``<p>{text}</p>`` so each judge collapsed into one indented wall of text -- same problem the verdict card's reasoning had before.

Pull the inline markdown renderer that previously lived inside ``JudgeVerdictCard.tsx`` into its own ``markdown-lite.tsx`` module, re-export it from the SDK card library's ``index.ts``, and use it in:

- JudgeVerdictCard reasoning (already)
- JudgePicker goal description (chat-side)
- JudgesCard goal description (canvas-side)

Output is the same as before for the verdict card; the picker and the canvas judges card now show structured goal text. Still no ``react-markdown`` dependency -- the SDK card library has to stay resolvable without the Next.js app's node_modules in scope, so we keep the small built-in renderer instead.

The README still described the previous incarnation of the sample -- the create_agent + frontend HITL design from before the canvas / out-of-band-polling rewrite. Rewrite top-to-bottom to reflect what actually ships:

- New layout section showing ``samples/copilotkit/{agents,components,app}`` with the SDK card library and the customer-facing app side-by-side.
- Architecture diagram updated for the canvas + chat split-pane, ``useCoAgent`` driving state-driven cards, and the ``GET /evaluations/{id}`` polling endpoint that the frontend hits every 5s for in-flight verdicts.
- Step-by-step "How the demo flows" walkthrough so a customer can read the README and predict what each click will do.
- "Why this pattern" updated to highlight the canvas + frontend polling + ``copilotkit_emit_state`` triad. Old text framed the choice as ``create_agent`` vs ``interrupt()``; new text frames it as the research-canvas pattern.
- Tools section updated for the async + ``Command(update={...})`` return shape and the no-arg ``confirm_judge`` (frontend reads candidates from ``state.judges``).
- Frontend section adds: shadcn/ui foundation, ``components.json``, light-default theme + ``ThemeToggle``, ``--copilot-kit-*`` token bridge, brand accent ``#6766FC``, the SDK card matrix (5 cards + ``MarkdownLite``).
- Backend section adds: ``recursion_limit: 200`` config, the ``GET /evaluations/{id}`` polling handler, and the "no fake fixture" guardrail.

Drive-by: ``ruff format`` brought ``evaluator_agent.py`` and ``server.py`` in line with the project's ruff style. (The repo's ``[tool.ruff]`` ``exclude = ["samples"]`` would skip these on discovery, but reformatting locally keeps them tidy and avoids contributors re-doing it.)

Fixes both red CI checks on PR #92:

- ``Check Lint`` was failing because tests/test_samples_e2e.py used the walrus operator (``:=``) at line 1446 and ruff's ``[tool.ruff].target-version`` is pinned to ``py37``. Replace with a regular assignment + boolean check -- same semantics, py37 compatible. The package's runtime support (``Python >=3.10,<3.13``) doesn't dictate ruff's syntax target; bumping the ruff target is out of scope for this PR.
- ``Check Format`` was failing because the same file had pre-existing multi-line wrapping that ruff's auto-format collapses to single lines under the 120-char limit. Apply ``ruff format``.
- ``ruff check --fix`` also normalised one import block (I001).

CI's ``test (3.9..3.12)`` jobs were cancelled after the lint pre-step failed -- they should now actually run.
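The walrus replacement is mechanical. A minimal sketch of the before/after shape, assuming a hypothetical regex helper (the actual expression at line 1446 is not shown in this PR, so `_SCORE` and `extract_score` are illustrative names only):

```python
import re
from typing import Optional

# Hypothetical pattern; stands in for whatever the real test matched.
_SCORE = re.compile(r"score=(\d+)")


def extract_score(line: str) -> Optional[int]:
    """Return the integer after 'score=' or None when absent."""
    # Python 3.8+ spelling, rejected by ruff when target-version = "py37":
    #     if (match := _SCORE.search(line)) is not None:
    #         return int(match.group(1))
    # py37-compatible spelling: plain assignment, then the boolean check.
    match = _SCORE.search(line)
    if match is not None:
        return int(match.group(1))
    return None
```

Both spellings bind `match` once and branch on it, so behaviour is unchanged; only the syntax target differs.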

Per existing repo policy: the SDK sample and tests should not name a specific OpenAI-compatible provider. Configuring OpenRouter (or any other gateway) is the user's job in their own .env -- the docs and test code stay vendor-neutral.

Removes:

- OpenRouter row from ``_default_model``'s docstring table.
- OpenRouter mention in ``build_graph``'s docstring.
- ``OpenRouter, vLLM`` aside in the CLI ``main()`` print block.
- OpenRouter URL in ``samples/copilotkit/README.md`` env-var example. Replaced with a placeholder ``your-openai-compatible-host``.
- ``OPENROUTER_API_KEY`` auto-mapping in ``test_copilotkit_evaluator_live_llm`` (the test now expects ``OPENAI_API_KEY`` and lets the user set ``OPENAI_BASE_URL`` / ``OPENAI_MODEL`` themselves if pointing at a non-OpenAI endpoint).
- Skip-message reference to ``OPENROUTER_API_KEY``.

The sample still works against any OpenAI-compatible endpoint -- the generic env vars (``OPENAI_API_KEY`` / ``OPENAI_BASE_URL`` / ``OPENAI_MODEL``) carry the configuration. The user's own gitignored ``.env`` is where provider-specific URLs (OpenRouter, Ollama, LM Studio, …) live.

Three test failures from the previous CI run, all addressed here:

1. ``tests/test_samples.py::test_sample_has_main[copilotkit/app/backend/server.py]`` expects every sample's entry-point file to expose a ``main()`` function. ``server.py`` had a bare ``if __name__ == "__main__":`` block instead. Lift the uvicorn.run call into a ``main()`` and call it from the ``if __name__`` guard.
2. ``test_copilotkit_agent_import[evaluator_agent]`` and
3. ``test_copilotkit_without_langchain[evaluator_agent]`` both stub the heavy deps via ``patch.dict("sys.modules", ...)`` so the agent module imports cleanly without langchain / copilotkit installed. The mock dict was missing the new submodules the agent now imports (``langgraph.prebuilt``, ``langchain.agents.middleware``, ``langchain_core.runnables``, ``langchain_core.tools.base``). Add them to both mock dicts.

Locally ``ruff check`` and ``ruff format --check`` are clean on all touched files.
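For readers unfamiliar with the stubbing technique in items 2 and 3, here is a rough sketch of how ``patch.dict("sys.modules", ...)`` lets a dotted import succeed without the package installed. The helper names ``stub_modules`` and ``import_with_stubs`` are hypothetical; the real tests build their mock dicts inline and may differ in detail.

```python
import importlib
from typing import Dict, Iterable
from unittest.mock import MagicMock, patch

# The submodules this commit says were missing from the mock dicts.
NEW_SUBMODULES = [
    "langgraph.prebuilt",
    "langchain.agents.middleware",
    "langchain_core.runnables",
    "langchain_core.tools.base",
]


def stub_modules(names: Iterable[str]) -> Dict[str, MagicMock]:
    """Build a sys.modules overlay covering each dotted name and its parents."""
    stubs: Dict[str, MagicMock] = {}
    for dotted in names:
        parts = dotted.split(".")
        # "a.b.c" needs "a", "a.b" and "a.b.c" present for the import to resolve.
        for depth in range(1, len(parts) + 1):
            stubs.setdefault(".".join(parts[:depth]), MagicMock())
    return stubs


def import_with_stubs(target: str, names: Iterable[str]):
    """Import `target` while the stubs are temporarily installed."""
    with patch.dict("sys.modules", stub_modules(names)):
        return importlib.import_module(target)
```

``patch.dict`` restores ``sys.modules`` on exit, so the stubs do not leak into other tests. Forgetting a parent or sibling submodule is exactly the failure this commit fixes: the import falls through to the real finder and raises ``ModuleNotFoundError``.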

…success

Bug repro: the same evaluation reliably stayed "Running" across multiple demo runs. Root cause was the polling filter on the frontend:

    const completed = updates.filter(
      (u) => u.status === "success" && typeof u.score === "number",
    );

This rejected any LayerLens response that wasn't a clean success with a numeric score -- including ``status: "failure"``, ``status: "error"``, ``status: "cancelled"``, and the ``status: "success"`` case where ``trace_evaluations.get_results`` returned ``score: null`` (which some judges legitimately do). The poller would then keep firing every 5s forever and the verdict card would sit in "Running" indefinitely.

Two-sided fix:

Backend (``GET /evaluations/{id}``):

- New ``done: bool`` field -- true for any of ``success | failure | error | cancelled | not_found``, false while the evaluation is still ``in_progress`` / ``pending`` / ``queued``.
- Always include ``passed`` / ``score`` / ``reasoning`` once ``done: true``, even for terminal failures and ``success``-without-score: defaults are ``passed: false``, ``score: 0.0``, and a ``reasoning`` string explaining the terminal state.
- ``try/except`` around ``trace_evaluations.get`` so a malformed / unauthorized id surfaces as ``status: "error", done: true`` instead of a 500 that the frontend retries forever.

Frontend (``page.tsx``):

- Polling filter is now ``u.done === true`` instead of ``status === "success" && typeof score === "number"``.
- ``ResultRecord`` type gains an optional ``done?: boolean`` field (the agent's own ``state.results`` entries don't carry it; only the ``/evaluations/{id}`` polling responses do).

Verified against a real eval id (clean success path → ``done: true``, score returned) and a deadbeef id (error path → ``done: true``, ``status: "error"``, no 500). The 5th-eval-stuck symptom is from the non-success terminal cases -- frontend now folds them into the canvas as a verdict card with the appropriate fail/error styling instead of spinning forever.
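The backend half of the fix can be sketched as a small normalisation step. This is an illustration under assumptions, not the code in ``server.py``: the function name ``normalize_evaluation``, its parameters, and the wording of the default ``reasoning`` string are hypothetical; only the status sets and the default values come from the description above.

```python
from typing import Any, Dict, List, Optional

# Status values taken from the commit message.
TERMINAL_STATUSES = {"success", "failure", "error", "cancelled", "not_found"}


def normalize_evaluation(
    status: str,
    score: Optional[float] = None,
    passed: Optional[bool] = None,
    reasoning: Optional[str] = None,
) -> Dict[str, Any]:
    """Shape a polling response so the client can key off `done` alone."""
    done = status in TERMINAL_STATUSES
    payload: Dict[str, Any] = {"status": status, "done": done}
    if done:
        # Terminal responses always carry the three verdict fields,
        # falling back to the documented defaults when missing.
        payload["passed"] = bool(passed) if passed is not None else False
        payload["score"] = float(score) if score is not None else 0.0
        payload["reasoning"] = reasoning or (
            "Evaluation reached terminal status '%s' without reasoning." % status
        )
    return payload


def completed(updates: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Python analogue of the new frontend filter `u.done === true`."""
    return [u for u in updates if u.get("done") is True]
```

With this shape, a ``success`` with ``score: null`` and a ``cancelled`` run both stop the poller, which is the behaviour the old filter could not express.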

Adds the assistant resource handler so SDK users can drive the Stratix Assistant programmatically. Mirrors the REST surface from atlas-app's DOCS/api/assistant-openapi.yaml and the SSE event channel from DOCS/api/assistant-asyncapi.yaml.

Surface (sync + async parity):

- list_conversations() → AssistantConversationList
- create_conversation(title=None) → AssistantConversation
- get_conversation(id) → AssistantConversation
- rename_conversation(id, title=...) → AssistantConversation
- delete_conversation(id) → None
- list_messages(conv_id, limit=None) → AssistantMessageList
- chat(conv_id, content) → Iterator[AssistantStreamEvent]

The chat() iterator parses the SSE stream and yields one event per block. Six event types are recognized (token, tool_call, tool_result, done, moderation_refused, error). Unknown event types are silently skipped so a forward-compat addition on the server doesn't crash SDK clients. The iterator stops on any terminal event (done, moderation_refused, error).

Models (mirrors server-side Pydantic shape):

- AssistantConversation, AssistantMessage, AssistantToolCall
- AssistantConversationList, AssistantMessageList
- AssistantStreamEvent (with .is_terminal() and .text() helpers)
- AssistantTokenUsage

Access control (server-side, surfaced to SDK callers as exceptions):

- 403 PermissionDeniedError when the org's tier does not have AssistantSDKEnabled = true. Default-deny -- contact LayerLens to request enablement.
- 429 RateLimitError when the per-org daily token cap is exhausted (or 0, which is the default for every plan). Headers X-Token-Budget-Used / X-Token-Budget-Cap reported on success.
- 503 when Redis (rate-limit + budget backend) is unreachable -- fail-closed posture, no in-memory fallback.

Tests:

- 9 SSE-block parser tests (token, done, moderation_refused, error, unknown event forward-compat, malformed JSON, missing event/data, text() accessor for non-text events).
- 8 resource-method tests (list/create/get/rename/delete + envelope unwrapping + edge cases).
- 2 streaming tests (real SSE flow with token+done sequence; 403 raises HTTPStatusError).

Mypy strict clean across the new files.
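The block-level parsing rules described above (one event per SSE block, skip unknown types, stop on a terminal event) can be sketched roughly as follows. ``SketchStreamEvent``, ``parse_block`` and ``iter_events`` are hypothetical stand-ins rather than the SDK's ``AssistantStreamEvent`` API, and dropping malformed JSON silently is an assumption; the PR only says malformed JSON is covered by tests.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Iterator, List, Optional

KNOWN_EVENTS = {"token", "tool_call", "tool_result", "done", "moderation_refused", "error"}
TERMINAL_EVENTS = {"done", "moderation_refused", "error"}


@dataclass
class SketchStreamEvent:
    type: str
    data: Dict[str, Any] = field(default_factory=dict)

    def is_terminal(self) -> bool:
        return self.type in TERMINAL_EVENTS


def parse_block(block: str) -> Optional[SketchStreamEvent]:
    """Parse one SSE block ("event: ..." / "data: ..." lines) or return None."""
    event_type: Optional[str] = None
    data_lines: List[str] = []
    for line in block.splitlines():
        if line.startswith("event:"):
            event_type = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
    # Unknown or incomplete blocks are skipped for forward compatibility.
    if event_type not in KNOWN_EVENTS or not data_lines:
        return None
    try:
        data = json.loads("\n".join(data_lines))
    except json.JSONDecodeError:
        return None  # assumption: malformed payloads are dropped, not raised
    if not isinstance(data, dict):
        return None
    return SketchStreamEvent(type=event_type, data=data)


def iter_events(blocks: Iterable[str]) -> Iterator[SketchStreamEvent]:
    """Yield recognised events and stop after the first terminal one."""
    for block in blocks:
        event = parse_block(block)
        if event is None:
            continue
        yield event
        if event.is_terminal():
            return
```

Stopping inside the iterator, rather than leaving it to the caller, means a consumer can simply loop over ``chat()`` without checking event types to know when the stream is finished.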

Closes the cross-cutting CLAUDE.md multi-tenancy gap surfaced by A:/tmp/adapter-depth-audit.md (finding #3): all 203 adapter emissions in the stratix-python SDK shipped without org_id propagation, violating the platform-wide 'EVERY data operation must be scoped by tenant' mandate.

What changed:

* BaseAdapter.__init__ now requires a resolvable org_id (explicit kwarg or stratix.org_id / stratix.organization_id). Construction without a non-empty value raises ValueError -- fail-fast, no silent fallback.
* BaseAdapter.emit_event and emit_dict_event stamp the bound org_id into every payload before forwarding to the client. Any caller-supplied value is overwritten with the adapter's own tenant binding (defensive overwrite -- prevents cross-tenant leaks via misuse).
* The replay trace record and EventSink dispatch path both carry org_id at the envelope level. EventSink.send signature now requires org_id as a keyword-only arg.
* All 17 framework adapters, 1 protocol adapter base, and 9 LLM provider adapters thread org_id through their __init__ to super().__init__. instrument_* helper functions in each adapter package accept and forward the kwarg.
* IngestionPipelineSink uses per-event org_id as the tenant_id for downstream ingest.

Tests:

* tests/instrument/adapters/_base/test_org_id_propagation.py -- 17 tests covering construction-time fail-fast (5), per-emission propagation (5), cross-tenant isolation (2), and public surface stability (2). Includes the cross-tenant test the audit recommended (org A's adapter never tags events with org B).
* tests/instrument/adapters/frameworks/test_per_adapter_org_id.py -- 37 parametrized tests (one accept + one fail-fast pair per adapter, plus an audit-cardinality guard) covering all 17 framework adapters.
* All existing _RecordingStratix test stand-ins updated with an org_id class attribute. instrument_* helper-call sites in tests pass org_id="test-org".

Acceptance: 17/17 base tests + 37/37 per-adapter tests + 258 existing framework adapter tests pass; mypy --strict on _base clean; ruff clean.

Docs: docs/adapters/multi-tenancy.md documents the contract for future adapters.
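A minimal sketch of the contract described above, assuming a simplified adapter that records events in a list instead of forwarding to a client. ``SketchAdapter`` and its ``sent`` attribute are hypothetical; ``ORG_ID_FIELD`` and ``_stamp_org_id`` reuse names from this PR, but their real implementations in ``_base`` may differ.

```python
from typing import Any, Dict, List, Optional, Tuple

ORG_ID_FIELD = "org_id"


class SketchAdapter:
    """Illustrative stand-in for BaseAdapter's tenant binding."""

    def __init__(self, stratix: Any = None, *, org_id: Optional[str] = None) -> None:
        # Resolution order from the PR: explicit kwarg, then client attributes.
        resolved = (
            org_id
            or getattr(stratix, "org_id", None)
            or getattr(stratix, "organization_id", None)
        )
        if not isinstance(resolved, str) or not resolved.strip():
            # Fail fast: no silent fallback to a default tenant.
            raise ValueError("org_id is required and must be a non-empty string")
        self._org_id = resolved
        self.sent: List[Tuple[str, Dict[str, Any]]] = []

    def _stamp_org_id(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        # Copy so the caller's dict is untouched, then overwrite defensively.
        stamped = dict(payload)
        stamped[ORG_ID_FIELD] = self._org_id
        return stamped

    def emit_dict_event(self, event_type: str, payload: Dict[str, Any]) -> None:
        self.sent.append((event_type, self._stamp_org_id(payload)))
```

The overwrite is the important design choice: a caller that passes ``{"org_id": "org-b"}`` to an adapter bound to ``org-a`` still produces an ``org-a`` event, so a bug in calling code cannot move data across tenants.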

…laceholder from #116) (#126)

Replaces the M7 placeholder shipped in PR #116 (truncation policy) with the full BrowserUseAdapter -- every lifecycle hook wired, every event emitted, and every cross-cutting CLAUDE.md contract enforced from day one.

What changed
------------

Full lifecycle adapter (src/layerlens/instrument/adapters/frameworks/browser_use/lifecycle.py):

* connect / disconnect / health_check / get_adapter_info / serialize_for_replay (all five abstract BaseAdapter methods).
* on_session_start, on_session_end, on_navigation, on_action, on_screenshot, on_dom_extraction, on_llm_call (every spec'd hook).
* Capability declaration: TRACE_TOOLS + TRACE_MODELS + TRACE_STATE + STREAMING + REPLAY (no longer the placeholder's TRACE_TOOLS-only set).
* Canonical events: browser.session.start, browser.navigate, browser.action, browser.screenshot, browser.dom.extract, tool.call, model.invoke, agent.input/output/state.change, cost.record, environment.config -- plus agent.error / tool.error / model.error per the PR #115 error-aware emission contract.
* Per-callback resilience wrapper per PR #117 -- observability errors NEVER crash the customer's agent, surfaced via resilience_snapshot().
* Multi-tenant org_id propagation per PR #118 -- bound at construction (kwarg or resolved from stratix.org_id), stamped defensively on every emit, caller-supplied values overwritten to prevent cross-tenant leaks.
* Truncation policy from day one (DEFAULT_POLICY) -- screenshot bytes DROPPED to deterministic SHA-256 references, DOM/HTML capped at 16 KiB, prompts/completions/tool I/O at 4/2 KiB.
* Browser-event layer mapping (_BROWSER_EVENT_LAYERS) so unknown browser.* event types respect CaptureConfig gating without falling through the unknown-event-drops-by-default path.
* requires_pydantic = PydanticCompat.V2_ONLY (browser_use is a v2 lib).

Public surface (src/layerlens/instrument/adapters/frameworks/browser_use/__init__.py):

* ADAPTER_CLASS = BrowserUseAdapter (registry).
* instrument_agent(agent, stratix=, capture_config=, org_id=) one-liner returning the connected, wrapping adapter.
* STRATIXBrowserUseAdapter top-level binding (legacy alias) -- fires DeprecationWarning on construction. Exposed as a static binding so the manifest consistency lint's AST walk finds it.

Pyproject:

* Adds 'browser-use' optional extra: browser-use>=0.1.0,<2 with the python_version >= '3.11' marker (browser_use's own constraint).

Tests (tests/instrument/adapters/frameworks/test_browser_use_adapter.py):

* Replaces the 7-test scaffold from #116 with 40 tests covering: wiring + alias + lifecycle round-trip + truncation (screenshot drop, hash determinism, HTML cap, short-payload no-audit) + multi-tenancy (kwarg, client attribute, defensive overwrite) + resilience (poison stratix, exploding agent attribute access) + error-aware emission (agent.error / tool.error / model.error) + per-hook coverage + sync + async wrapping + replay round-trip + 10-case provider detection table.

Sample (samples/instrument/browser_use/{main.py,__init__.py,README.md}):

* Runs OFFLINE -- no browser-use install, no Playwright, no API key, no network. Three-step duck-typed agent + happy/--fail paths exercise the full event surface and demonstrate screenshot drop + org_id stamping + agent.error emission before re-raise.

Doc (docs/adapters/frameworks-browser_use.md):

* Install + quickstart + capabilities matrix + 14-event reference table + truncation policy table + multi-tenancy + resilience + error-aware emission + capture config + browser_use specifics + BYOK + replay sections.

Manifest (scripts/emit_adapter_manifest.py):

* Promotes browser_use from _LIFECYCLE_PREVIEW to _MATURE -- every required artifact (test file with >= 12 funcs, sample, doc, STRATIX→LayerLens deprecation alias) ships in this PR.

Verification
------------

* uv run pytest tests/instrument/adapters/frameworks/test_browser_use_adapter.py → 40 passed
* mypy --strict src/layerlens/instrument/adapters/frameworks/browser_use → Success: no issues found in 2 source files
* ruff check on src + test + script → All checks passed!
* Sample runs cleanly offline (happy + --fail)
* pip install -e .[browser-use] resolves cleanly (browser-use only pulled on Python 3.11+ per the env marker)
* tests/instrument/adapters/test_manifest_consistency.py::test_mature_adapters_have_required_artifacts[browser_use] passes
* Full instrument suite (excl. pre-existing crewai/protocols references not on this branch): 312 passed, 1 skipped, 12 xfailed
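To make the truncation bullet concrete, here is a rough sketch of the two behaviours it names: replacing screenshot bytes with a deterministic SHA-256 reference, and capping DOM/HTML at 16 KiB. The function names, the shape of the reference dict, and the byte-based (rather than character-based) cap are assumptions; the PR does not show how DEFAULT_POLICY is implemented.

```python
import hashlib
from typing import Any, Dict, Tuple

HTML_CAP_BYTES = 16 * 1024  # 16 KiB cap named in the commit message


def screenshot_reference(image: bytes) -> Dict[str, Any]:
    """Drop the raw bytes and keep only a deterministic content reference."""
    return {
        "sha256": hashlib.sha256(image).hexdigest(),
        "size_bytes": len(image),
        "dropped": True,
    }


def cap_text(text: str, cap: int = HTML_CAP_BYTES) -> Tuple[str, bool]:
    """Return (possibly truncated text, whether truncation happened)."""
    raw = text.encode("utf-8")
    if len(raw) <= cap:
        # Short payloads pass through untouched and need no audit record.
        return text, False
    # errors="ignore" discards a multi-byte character split by the cut.
    return raw[:cap].decode("utf-8", errors="ignore"), True
```

Because the reference depends only on the image bytes, two emissions of the same screenshot produce the same hash, which is what makes the dropped payload still comparable across events and replays.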

…or 6 lighter adapters (cross-poll #1) (#130) Implements cross-pollination item #1 from A:/tmp/adapter-cross-pollination-audit.md section 2 #1. The four mature framework adapters (LangChain, AutoGen, CrewAI, Semantic Kernel) carry ad-hoc memory plumbing — episodic recent turns, procedural learned patterns, semantic long-lived facts — that lets agents recall context across runs. The lighter adapters (agno, ms_agent_framework, openai_agents, llama_index, google_adk, bedrock_agents, browser_use) all behave as goldfish agents — every run starts from a blank slate. This PR ports the pattern into a shared, replay-safe primitive that the lighter adapters plug into uniformly. ## What is new ### Shared memory primitive src/layerlens/instrument/adapters/_base/memory.py — new - MemorySnapshot — frozen dataclass with turn_index, episodic (recent turns), procedural (detected patterns), semantic (key/value facts), content_hash (SHA-256 of canonical-JSON encoding), org_id (tenant binding). to_dict / from_dict round-trip preserves identity. - MemoryRecorder — thread-safe accumulator. record_turn(...) is the per-turn entry point; set_semantic(key, value) for long-lived facts; snapshot() returns the immutable view; restore(snap) rebuilds state from a previous snapshot. All buckets bounded (defaults 200/16/64); episodic FIFO eviction, semantic LRU, procedural keep-top-by-count. - Procedural pattern detection: O(window) per turn, scans the recent episodic window for recurring (prev_tools, current_tools) pairs. - Multi-tenant: recorder requires non-empty org_id at construction; restore() rejects cross-tenant snapshots and tampered snapshots (content-hash mismatch). - Replay-safe: snapshot -> restore -> snapshot round-trip produces byte-identical content_hash. ### BaseAdapter integration src/layerlens/instrument/adapters/_base/adapter.py - Constructor builds self._memory_recorder = MemoryRecorder(org_id=self._org_id). - New record_memory_turn(...) 
helper: best-effort wrapper that swallows recorder failures so memory persistence never breaks the host framework call stack (CLAUDE.md "tracing never breaks user code").
- memory_recorder property, memory_snapshot() and memory_snapshot_dict() convenience accessors.

### Per-adapter wiring (6 adapters)

- agno: Agent.run/arun finally-block; episodic input from args/kwargs; tool list from _collect_tool_names(result.messages).
- ms_agent_framework: Chat.invoke/invoke_stream finally-block; episodic input from kwargs; tool list from streamed message items.
- openai_agents: _on_agent_span_end (TraceProcessor) + on_run_end (Runner wrap); episodic input cached at span_start per span_id; tool list rolled up from _on_function_span_end per parent_id.
- llama_index: _on_agent_step_end; episodic input cached at step_start per thread id; tool list rolled up from _on_tool_call.
- google_adk: after_agent_callback + on_agent_end; episodic input cached at before_agent_callback per thread id; tool list rolled up from after_tool_callback per thread id.
- bedrock_agents: _after_invoke_agent (boto3 hook); episodic input cached at _before_invoke_agent per thread id; tool list rolled up from _process_trace action-group / KB step names.

Each adapter's serialize_for_replay() now embeds the snapshot under ReplayableTrace.metadata["memory_snapshot"] so replay engines can reconstruct memory state via MemorySnapshot.from_dict(...) -> recorder.restore(snapshot) before re-execution.

## Tests (57 new)

### tests/instrument/adapters/_base/test_memory.py (27 tests)

- Recorder construction (empty/non-string org_id rejected; zero buffer sizes rejected; initial state empty).
- Snapshot determinism (identical content -> identical hash; different org_id -> different hash; mutating recorder doesn't affect prior snapshot; to_dict/from_dict round-trip preserves hash; from_dict rejects missing required fields).
- Replay round-trip (snapshot -> restore -> snapshot byte-identical hash; deterministic next-state under matching inputs; cross-tenant restore raises; tampered-content-hash restore raises).
- Bounded eviction (episodic FIFO at cap; semantic LRU at cap; semantic overwrite refreshes LRU; procedural cap).
- Procedural detection (repeated tool sequences accumulate count; no-tool turns produce no patterns).
- Per-turn truncation (multi-megabyte values capped with deterministic suffix).
- Thread safety (8 threads x 50 turns produces unbroken 1..400 sequence).
- Clear preserves binding; defaults positive; extra metadata sorted for hash determinism.

### tests/instrument/adapters/frameworks/test_memory_persistence_wiring.py (30 tests: 5 x 6 adapters, parametrized)

- Each adapter exposes a recorder bound to its org_id.
- record_memory_turn advances the episodic buffer.
- serialize_for_replay() embeds metadata["memory_snapshot"].
- Replay engine can restore the recorder from the serialised trace (content-hash match end-to-end).
- Cross-tenant snapshot is rejected at the per-adapter recorder boundary.

## Documentation

docs/adapters/memory-contract.md explains the three buckets, the contract (tenant binding, bounded buffers, tamper-evident snapshots, replay-safe round-trip, best-effort recording, thread safety), the per-adapter wiring matrix, and audit hooks. It includes the replay-engine integration recipe and the honest scope disclosure for browser_use.

## Honest scope disclosure

The cross-pollination audit section 2 #1 enumerates seven target adapters. Six are wired here. The seventh, browser_use, does NOT exist on this PR's base branch (feat/instrument-multitenancy-org-id-propagation); it lives on the parallel feat/instrument-frameworks-browser-use-full history. It will be wired when that adapter is ported to this base or when the histories merge. This follows the same honest-disclosure pattern as PR #120 (state filters, which omitted ms_agent_framework for the same reason).

The future browser_use wiring (per audit section 2 #1) will be:

- Episodic: page navigation events (URL, action, selector)
- Procedural: recurring (prev_action, current_action) patterns
- Semantic: long-lived page-content cache keyed by URL/DOM hash

## Acceptance

- uv run pytest tests/instrument/adapters/_base/test_memory.py -x -> 27 passed
- uv run pytest tests/instrument/adapters/frameworks/test_memory_persistence_wiring.py -x -> 30 passed
- uv run pytest tests/instrument/adapters/_base/ -> 44 passed (no regressions)
- uv run pytest tests/instrument/adapters/frameworks/{agno,bedrock_agents,google_adk,llama_index,ms_agent_framework,openai_agents}_adapter.py -> 72 passed (no regressions)
- uv run mypy --strict src/layerlens/instrument/adapters/_base/memory.py -> Success: no issues found in 1 source file
- uv run mypy src/layerlens/instrument/adapters/_base/adapter.py src/layerlens/instrument/adapters/frameworks/{6 adapters}/lifecycle.py -> Success: no issues found in 7 source files
- uv run ruff check src/layerlens/instrument/adapters/_base/memory.py tests/instrument/adapters/_base/test_memory.py tests/instrument/adapters/frameworks/test_memory_persistence_wiring.py -> All checks passed!
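For readers who want a concrete picture of the snapshot contract described above (tenant binding, bounded episodic buffer, tamper-evident content hash, replay round-trip), here is a minimal sketch. It is not the code in memory.py: `SketchRecorder`, `SketchSnapshot`, `record_turn`, and `max_episodic` are hypothetical stand-ins, and the choice of SHA-256 over canonical JSON is an assumption made for illustration. Only the episodic bucket is modelled.

```python
import hashlib
import json
from collections import deque
from dataclasses import dataclass


def _content_hash(org_id: str, episodic: list[str]) -> str:
    # Canonical JSON (sorted keys, fixed separators) keeps the hash deterministic.
    payload = json.dumps(
        {"org_id": org_id, "episodic": episodic},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class SketchSnapshot:
    """Hypothetical stand-in for a snapshot; immutable once taken."""

    org_id: str
    episodic: tuple[str, ...]
    content_hash: str

    def to_dict(self) -> dict:
        return {
            "org_id": self.org_id,
            "episodic": list(self.episodic),
            "content_hash": self.content_hash,
        }

    @classmethod
    def from_dict(cls, data: dict) -> "SketchSnapshot":
        missing = {"org_id", "episodic", "content_hash"} - data.keys()
        if missing:
            raise ValueError(f"missing required fields: {sorted(missing)}")
        return cls(data["org_id"], tuple(data["episodic"]), data["content_hash"])


class SketchRecorder:
    """Hypothetical tenant-bound recorder with a FIFO-capped episodic buffer."""

    def __init__(self, org_id: str, max_episodic: int = 3) -> None:
        if not isinstance(org_id, str) or not org_id:
            raise ValueError("org_id must be a non-empty string")
        if max_episodic <= 0:
            raise ValueError("max_episodic must be positive")
        self.org_id = org_id
        # deque(maxlen=...) drops the oldest entry once the cap is reached.
        self._episodic: deque[str] = deque(maxlen=max_episodic)

    def record_turn(self, text: str) -> None:
        self._episodic.append(text)

    def snapshot(self) -> SketchSnapshot:
        items = list(self._episodic)
        return SketchSnapshot(self.org_id, tuple(items), _content_hash(self.org_id, items))

    def restore(self, snap: SketchSnapshot) -> None:
        # Reject snapshots bound to a different tenant.
        if snap.org_id != self.org_id:
            raise ValueError("cross-tenant restore rejected")
        # Recompute the hash so edited content or an edited hash is detected.
        if snap.content_hash != _content_hash(snap.org_id, list(snap.episodic)):
            raise ValueError("content hash mismatch")
        self._episodic = deque(snap.episodic, maxlen=self._episodic.maxlen)


# Usage: record past the cap, serialise, then restore into a fresh recorder.
recorder = SketchRecorder("org-a", max_episodic=2)
for turn in ("t1", "t2", "t3"):
    recorder.record_turn(turn)
snap = recorder.snapshot()

replayed = SketchRecorder("org-a", max_episodic=2)
replayed.restore(SketchSnapshot.from_dict(snap.to_dict()))
```

Recomputing the hash at restore time, rather than trusting the stored value, is what makes the snapshot tamper-evident in this sketch; including `org_id` in the hashed payload is why identical content under different tenants yields different hashes.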

mmercuri added 23 commits April 23, 2026 09:06

mmercuri requested a review from m-peko April 27, 2026 00:16

This was referenced May 10, 2026

test(instrument): tier-2 test port for langfuse (port-as-is) #147

Closed

docs(adapters): embedding + benchmark_import specs #148

Closed

test(instrument): tier-2 test ports for ms_agent_framework + strands + pydantic_ai + smolagents (port-as-is) #149

Closed

mmercuri mentioned this pull request May 10, 2026

docs(samples): backfill READMEs for 11 framework samples on agents branch #161

Closed

4 tasks

m-peko approved these changes May 12, 2026

m-peko self-requested a review May 12, 2026 19:16

m-peko closed this May 21, 2026

mmercuri wants to merge 24 commits into main from feat/instrument-multitenancy-org-id-propagation


2 participants


Foundation (`_base/`)