fix(instrument): Propagate org_id through all event emissions (multi-tenancy CLAUDE.md fix)#118
Closed
mmercuri wants to merge 24 commits into
Closed
fix(instrument): Propagate org_id through all event emissions (multi-tenancy CLAUDE.md fix)#118mmercuri wants to merge 24 commits into
mmercuri wants to merge 24 commits into
Conversation
…rrupt() The evaluator agent calls ``interrupt()`` in ``confirm_judge_node`` for human-in-the-loop judge confirmation. A checkpointer is mandatory for that to work -- without one, a ``Command(resume=...)`` call produces zero events, ``ag-ui-langgraph`` never emits ``RUN_FINISHED``, and the CopilotKit frontend blocks all subsequent messages with "Cannot send 'RUN_STARTED' while a run is still active". Changes: - ``evaluator_agent.py``: compile with ``InMemorySaver`` and a commented Postgres swap block. Convert ``JudgeInfo`` / ``TraceInfo`` / ``EvaluationInfo`` from ``@dataclass`` to ``pydantic.BaseModel`` so LangGraph's default ``JsonPlusSerializer`` can persist state across the pause boundary (dataclasses raise ``TypeError: Type is not msgpack serializable``). - ``samples/copilotkit/README.md``: add full FastAPI backend wiring with ``add_langgraph_fastapi_endpoint``, Next.js frontend wiring with ``LangGraphHttpAgent``, a checkpointer options matrix (InMemory / SQLite / Postgres / Redis / LangGraph Platform) with per-option migration snippets, a version-compatibility table pinning the versions the bug reporter used, and a troubleshooting section mapping the observed frontend errors back to the backend cause. - ``docs/samples-guide.md``: cross-reference the checkpointer requirement. - ``tests/test_samples_e2e.py``: add ``test_copilotkit_evaluator_interrupt_resume`` that imports the real ``langgraph`` (not ``MagicMock``), asserts the compiled graph has a non-None checkpointer, and drives a full ``astream -> interrupt -> Command(resume=...) -> astream`` cycle with a patched Stratix client. Confirmed this test fails on the pre-fix code and passes on the fix. Also extended the existing mock-modules dicts so the import-smoke tests include ``langgraph.checkpoint.memory``. The existing tests missed this because they mock ``langgraph``, ``langgraph.graph``, and ``langgraph.types`` with ``MagicMock()`` and then only call ``main()`` (which prints usage). They never build or execute the graph, so they cannot observe the missing checkpointer.
Follow-ups to the interrupt/checkpointer fix, addressing the open items
flagged in the prior commit:
1. Deserialize warning resolved. The Pydantic DTOs
(``JudgeInfo`` / ``TraceInfo`` / ``EvaluationInfo``) are now registered
on ``JsonPlusSerializer(allowed_msgpack_modules=...)`` via a custom
serde passed to ``InMemorySaver``. Verified the sample passes with
``LANGGRAPH_STRICT_MSGPACK=true``, so it survives LangGraph's planned
tightening of checkpoint deserialization.
2. End-to-end AG-UI wire validation. New integration test
``test_copilotkit_evaluator_agui_wire`` wires the evaluator graph into
a FastAPI app through ``ag_ui_langgraph.add_langgraph_fastapi_endpoint``,
drives the full user flow in-process via ``httpx.ASGITransport``, and
asserts:
- Phase 1 (initial run, hits ``interrupt()``): emits RUN_STARTED and
RUN_FINISHED on the SSE stream.
- Phase 2 (resume with user confirmation, same ``threadId``): emits
RUN_STARTED and RUN_FINISHED.
- Phase 3 (follow-up message after resume): not blocked -- RUN_STARTED
and RUN_FINISHED fire again.
This is the exact symptom the reporter hit, tested through the exact
protocol path. Gated on ``pytest.importorskip`` for the heavy deps so
the test skips cleanly when they are absent.
Side benefit: running the same scenario against pre-fix code produces
``ValueError: No checkpointer set`` directly from
``graph.aget_state()``, giving operators a much louder error than the
silent "stream ends without RUN_FINISHED" path.
3. README backend-wiring snippet corrected. The actual
``add_langgraph_fastapi_endpoint`` signature takes an
``agent=LangGraphAgent(...)`` wrapper, not a bare ``graph=`` kwarg --
the example in the previous commit would have failed at import.
Also expanded the SSE-protocol explanation to match what the new e2e
test observes on the wire.
4. Investigator graph annotated. ``investigator_graph`` does not call
``interrupt()`` so it does not need a checkpointer, but without an
explicit note future contributors adding a HITL step would silently
regress. Added a short comment at the ``.compile()`` call pointing at
the evaluator pattern.
Follow-ups addressing the remaining open items from the previous two
commits:
1. Rename ``error`` node to ``handle_error`` (evaluator + investigator).
The old name collided with the ``error`` field on the state
dataclass. LangGraph 1.x accepts the collision; earlier versions
reject it with "'error' is already being used as a state key".
Renaming the node (and the conditional-edge routing targets) keeps
the routing token ``"error"`` purely an edge key and sidesteps the
conflict on any LangGraph version the sample may be copied into.
2. Guard the ``allowed_msgpack_modules`` kwarg behind try/except so the
sample still imports cleanly on langgraph<1.0 (where the kwarg does
not exist and the strict-msgpack warning is not emitted either).
Verified the sample now imports on both langgraph 0.2.56 and 1.1.9.
3. Ruff-clean the changed files (import sort I001 fixes on the new
test additions; unrelated warnings in pre-existing ``main()`` /
``error_node`` code are out of scope per "only change what was
asked").
4. New ``samples/copilotkit/tests/browser/`` harness:
- ``backend/server.py`` -- FastAPI app that patches
``layerlens.Stratix`` before importing the evaluator module and
mounts ``evaluator_graph`` via
``add_langgraph_fastapi_endpoint(..., path="/evaluator")``.
- ``frontend/`` -- Next.js 16.2.4 app pinned to the reporter's exact
CopilotKit versions (``@copilotkit/react-core``,
``@copilotkit/react-ui``, ``@copilotkit/runtime`` all at 1.56.3),
with the CopilotKit runtime wired to ``LangGraphHttpAgent`` against
the FastAPI backend.
- ``frontend/tests/interrupt-resume.spec.ts`` -- Playwright spec
that drives CopilotChat through the three-turn scenario the
reporter hit ("evaluate" -> "ok" -> "thanks") and asserts the
exact string "Cannot send 'RUN_STARTED' while a run is still
active" appears in neither the visible DOM nor the browser
console.
Known limitation documented in the harness README: CopilotChat
1.56's textarea reports as aria-hidden / non-"visible" under
Playwright strict actionability checks in **headless** Chromium,
and multiple input-driving patterns (``fill``, ``keyboard.type +
Enter``, ``pressSequentially``, DOM-setter + bubbled input event)
failed to reliably enable the Send button headlessly. The harness
works with ``--headed`` for human verification and is structurally
complete. The authoritative regression coverage for the fix is the
Python test suite (``test_copilotkit_evaluator_interrupt_resume``
and ``test_copilotkit_evaluator_agui_wire``); the browser harness
is corroborating / demo value, not gate-keeping.
…nggraph runId bug DevRel surfaced that the backend fix in the previous commits got the Python side working but the frontend still locked up with "Cannot send 'RUN_STARTED' while a run is still active. ... INCOMPLETE_STREAM" on the second message. Raw SSE capture confirmed the root cause: RUN_STARTED runId = "r1_aca59ad1" (client-supplied) RUN_FINISHED runId = "019dc049-14ba-..." (LangGraph's internal chain UUID) This is an upstream bug in ag-ui-langgraph (ag-ui-protocol/ag-ui#1582): ``_handle_stream_events`` overwrites ``self.active_run['id']`` with every LangGraph event's internal ``run_id``, so RUN_FINISHED emits LangGraph's UUID instead of the client-supplied ``input.run_id``. ``@copilotkit/runtime`` tracks active runs by client runId and raises RUN_ERROR/INCOMPLETE_STREAM. Verified the bug is present in both ag-ui-langgraph 0.0.22 (CopilotKit's officially-pinned version) and 0.0.34 (the reporter's version), and also in ``copilotkit.LangGraphAGUIAgent`` which inherits the broken method. Changes in this commit, all aligned with CopilotKit's own ``examples/integrations/langgraph-fastapi`` reference sample: 1. ``evaluator_agent.py``: - State class converted from ``@dataclass`` to a ``TypedDict`` inheriting from ``copilotkit.CopilotKitState``. This gives us ``MessagesState``'s ``add_messages`` reducer for free (nodes return NEW messages; they are appended, not replaced) and the ``copilotkit`` field the frontend injects. - All node functions updated from ``state.X`` / ``state.messages + [m]`` to ``state.get('X')`` / ``{'messages': [m]}``. - HITL interrupt now uses ``copilotkit.langgraph.copilotkit_interrupt`` (wraps ``interrupt()`` with ``__copilotkit_messages__`` so the prompt renders as a real AIMessage in the chat UI). The bare ``langgraph.types.interrupt(prompt)`` emitted a CUSTOM event the UI ignored -- why the reporter said "the agent stops and never reaches the human-in-the-loop confirmation step." - New ``RunIdPreservingAgent`` subclass (lazy factory ``_build_langgraph_agui_agent``) overrides ``_dispatch_event`` to restore ``input.run_id`` on RUN_FINISHED / RUN_ERROR terminal events. Clearly commented with a "remove when upstream ships" TODO pointing at the ag-ui-protocol/ag-ui issue. 2. ``samples/copilotkit/README.md``: - Version matrix re-pinned to CopilotKit's exact tested set (``copilotkit==0.1.74``, ``langchain==1.0.1``, ``langgraph==1.0.1``, ``ag-ui-langgraph==0.0.22``, ``@copilotkit/*==1.56.3``, Python ``>=3.10,<3.13``). - Upstream-bug callout explaining the runId workaround. - Backend wiring snippet updated to show the factory import and the ``LangGraphAGUIAgent`` path for non-interrupt graphs (investigator). 3. ``tests/test_samples_e2e.py``: - ``test_copilotkit_evaluator_interrupt_resume`` now sends ``Command(resume=[HumanMessage(content='ok')])`` rather than ``Command(resume='ok')``, matching ``copilotkit_interrupt``'s expected resume payload shape. - ``test_copilotkit_evaluator_agui_wire`` rewritten. The previous version had a blind spot: it only asserted RUN_FINISHED was PRESENT, not that ``RUN_STARTED.runId == RUN_FINISHED.runId == input.run_id``. Now it uses the ``RunIdPreservingAgent`` factory and asserts runId continuity end-to-end. Without the workaround this test would catch the upstream bug immediately. - Mock-module dict extended with ``copilotkit.langgraph`` for the import-smoke test. 4. ``samples/copilotkit/tests/browser/``: - ``backend/requirements.txt`` re-pinned to CopilotKit's set. - ``backend/server.py`` switched from raw ``ag_ui_langgraph.LangGraphAgent`` to the sample's factory so the browser harness also benefits from the runId workaround. All 9 copilotkit tests pass in the pinned venv. Empirical verification scripts in /tmp/ (not committed) show raw SSE with matching runIds end-to-end.
…rrupt path DevRel's diagnostic bundle (ag-ui-langgraph==0.0.34, copilotkit==0.1.87, @ag-ui/client==0.0.52 transitively) confirmed commit 542002b did not fix the browser symptom. Raw SSE from the Network tab showed: RUN_STARTED runId=d0b9d6c5-... [graph reaches step="confirm_judge" -- interrupt IS being hit] RUN_ERROR {code: "INCOMPLETE_STREAM", message: "Cannot send 'RUN_STARTED' while a run is still active. The previous run must be finished with 'RUN_FINISHED' before starting a new run."} Same error text as before, different root cause. A second bug in ag-ui-langgraph: when a request arrives on a thread whose graph is already paused at ``interrupt()`` and the request does NOT carry ``forwardedProps.command.resume``, the ``has_active_interrupts`` branch of ``prepare_stream`` (agent.py:491) emits a second ``RunStartedEvent`` to ``events_to_dispatch`` -- after ``_handle_stream_events`` (line 209) already emitted one at the top of the stream. The server's own AG-UI encoder validator catches the duplicate and converts it into a ``RUN_ERROR`` with the exact "Cannot send 'RUN_STARTED'..." message, terminating the stream before ``RUN_FINISHED`` can be dispatched. On ``@ag-ui/client@0.0.52`` (the newer protocol-state validator, which enforces within-stream start/finish invariants rather than the runId correlation the previous version used) this is what lands as INCOMPLETE_STREAM in the browser. Extended the sample's workaround subclass to filter at the agent boundary rather than override ``_dispatch_event`` (which expects to return an Event, not None/""). The filter: 1. Drops any RUN_STARTED after the first within a single stream -- fixes the duplicate-emission bug on the ``has_active_interrupts`` path. 2. Restamps ``input.run_id`` on RUN_FINISHED / RUN_ERROR -- preserves the existing ag-ui-protocol/ag-ui#1582 fix for older clients that correlate by runId. Verified on both pin matrices: - copilotkit==0.1.74 / ag-ui-langgraph==0.0.22 (CopilotKit's own reference sample pins): all tests pass. - copilotkit==0.1.87 / ag-ui-langgraph==0.0.34 (DevRel / reporter): all tests pass. Tightened ``test_copilotkit_evaluator_agui_wire`` accordingly: - asserts exactly one RUN_STARTED per stream (catches bug b) - asserts no RUN_ERROR - asserts RUN_STARTED.runId == RUN_FINISHED.runId == input.run_id (catches bug a, ag-ui-protocol/ag-ui#1582) Without either half of the workaround the test fails with a precise message pointing at which bug regressed. Follow-up: file the duplicate-RUN_STARTED bug upstream as a separate issue on ag-ui-protocol/ag-ui.
… ship lockfile Replaces the earlier pinning to CopilotKit's reference-sample versions (copilotkit==0.1.74 / ag-ui-langgraph==0.0.22) with the current published set customers actually install: copilotkit==0.1.87 langchain==1.2.15 langchain-core==1.3.0 langgraph==1.1.9 ag-ui-langgraph==0.0.34 Frontend transitive ``@ag-ui/client==0.0.52`` now matches what ``@copilotkit/react-core==1.56.3`` actually pulls in (DevRel's environment per their diagnostic bundle). Changes: - ``samples/copilotkit/tests/browser/backend/requirements.txt`` -- pins updated to the latest set above. - ``samples/copilotkit/tests/browser/backend/requirements.lock`` -- NEW, committed pip-freeze of the verified environment. ``pip install -r requirements.lock`` now gives byte-identical transitive deps. - ``samples/copilotkit/README.md`` -- version matrix and install snippets updated to the latest set; upstream-bug callout now lists both issues (``ag-ui-protocol/ag-ui#1582`` runId overwrite, ``ag-ui-protocol/ag-ui#1584`` duplicate RUN_STARTED). - ``samples/copilotkit/agents/evaluator_agent.py`` -- renamed the factory from ``_build_langgraph_agui_agent`` to the public ``build_agui_agent``; added a ``_version_guard_ag_ui_langgraph`` helper that emits a ``RuntimeWarning`` when the installed version is outside the tested range ``[0.0.22, 0.0.34]`` so silent behavior drift does not hide a regression. A backwards-compatible alias keeps the old private name importable for internal tests during the rename window. - ``samples/copilotkit/tests/browser/backend/server.py`` and ``tests/test_samples_e2e.py`` -- call sites updated to the public name. Verified end-to-end against the latest version matrix: - pytest -k copilotkit: 9 passed, 2 skipped (live-only). - Manual HTTP drive against a running backend with the reporter's exact flow (turn 1 initial -> interrupt, turn 2 re-entry on paused graph): both turns emit exactly one RUN_STARTED and one RUN_FINISHED, both with matching client runIds, no RUN_ERROR / INCOMPLETE_STREAM.
… resume heuristic
DevRel confirmed the Apr-24 push resolved the turn-1 INCOMPLETE_STREAM
(backend now emits a clean RUN_STARTED -> STEP_* -> RUN_FINISHED for the
initial interrupt turn). Remaining gap: when the user replies to the
interrupt, plain ``<CopilotChat>`` sends the reply as an ordinary new
chat message, not as ``forwardedProps.command.resume`` -- so the graph
stayed paused and the same error returned on the follow-up.
Correct fix is on the frontend, not the backend:
``@copilotkit/react-core@1.56.3`` ships ``useLangGraphInterrupt``, the
hook specifically designed for this case. It renders a UI when the
graph pauses at ``interrupt()`` and calls ``resolve(...)`` with the
user's answer -- which the runtime forwards as the proper
``command.resume`` payload. This is the supported AG-UI protocol path:
the frontend must explicitly signal a resume rather than a new turn.
Changes:
- ``samples/copilotkit/tests/browser/frontend/app/page.tsx``: wires
``useLangGraphInterrupt`` with a dedicated prompt widget
(``data-testid`` stable for automation), and a "Start evaluation"
test-hook button that uses ``useCopilotChat().appendMessage`` to
kick off the graph without having to type into CopilotChat's
textarea (which Playwright can't reliably drive on 1.56.3 + React 19).
The ``resolve([{role:"user", content}])`` shape matches what
``copilotkit_interrupt`` expects server-side
(``answer = response[-1].content``).
- ``samples/copilotkit/tests/browser/frontend/app/globals.css``: styles
for the interrupt widget and the test-hook start button.
- ``samples/copilotkit/agents/evaluator_agent.py``: reverts the
backend auto-resume heuristic I had shipped as a stopgap. It was
overloading the protocol semantics ("any user message during active
interrupt == resume answer") which is incorrect for anything beyond
a simple sample -- breaks cancel flows and multi-interrupt
scenarios. The backend now only does the two genuine protocol-bug
workarounds (runId overwrite, duplicate RUN_STARTED). Resume
belongs to the frontend.
Test plan:
- Python test suite: ``pytest -k copilotkit`` -- 9 passed / 2 skipped
(live) on DevRel's exact version matrix.
- Backend HTTP round-trip with a programmatic ``command.resume``
payload: both turns emit matched ``RUN_STARTED``/``RUN_FINISHED``
with client runId, no ``RUN_ERROR`` (verified on 2026-04-24).
- Browser end-to-end: the hook wiring in page.tsx matches CopilotKit's
own showcase pattern and the hook source I inspected. I could not
self-verify the full browser round-trip because (a) Playwright
cannot reliably drive CopilotChat's textarea on 1.56.3 + React 19
(tracked at CopilotKit/CopilotKit#4215), and (b) my attempted
programmatic appendMessage test-hook did not trigger a runtime
POST in my local venv for reasons I have not yet pinned down.
**DevRel re-test in a real browser is the authoritative check for
the frontend round-trip.**
Follow-up (per "#2" in the user's plan): rewrite the evaluator HITL
to use CopilotKit's current idiom (``useCopilotAction`` /
``useHumanInTheLoop`` -- frontend-defined tool + UI render + resolve)
instead of backend ``interrupt()``. That's the pattern CopilotKit's
active samples use; it avoids the ag-ui-langgraph interrupt path
bugs entirely and is where customers should be pointed for new work.
…TL tool Replaces the custom StateGraph + ``langgraph.types.interrupt()`` pattern with CopilotKit's current HITL idiom: ``langchain.agents.create_agent`` driving an LLM that calls backend tools, with the human-in-the-loop step wired as a **frontend** tool via ``useCopilotAction`` + ``renderAndWaitForResponse``. This matches what CopilotKit's active showcases (``hitl_in_chat_agent.py``, ``interrupt_agent.py``) use. Why the rearchitect: the ``interrupt()`` code path in ``ag-ui-langgraph`` has two protocol-level bugs (tracked upstream as ``ag-ui-protocol/ag-ui#1582`` and ``#1584``) that the previous revision worked around by subclassing ``LangGraphAGUIAgent`` and reaching into private internals. That ships, but it's not the pattern CopilotKit themselves exercise, and the workaround is fragile across upstream bumps. Moving off the ``interrupt()`` path sidesteps both bugs by construction and aligns with CopilotKit's active direction. Design (three-role review): - **AI engineer**: LLM drives. Backend tools (``list_judges``, ``list_recent_traces``, ``run_trace_evaluation``, ``get_evaluation_result``) are thin wrappers over the LayerLens SDK. A tight system prompt guides the flow. ``confirm_judge`` is a frontend tool declared via ``useCopilotAction``; ``CopilotKitMiddleware()`` bridges it into the agent's toolbelt so the LLM can "call" it like any other tool. - **Designer**: HITL renders as a card list -- each judge shows name, id, and evaluation goal, with a ``Select <Name>`` button. Keyboard accessible, visible focus states, compact "Judge selected." state after the user chooses. ``data-testid`` attributes throughout for deterministic automation. - **SDK engineer**: ~160 LoC for the evaluator (down from ~560). No private-API reach. No workaround subclass. No checkpointer needed (``create_agent`` owns state). Lockfile updated for ``langchain-openai``. Frontend pins unchanged. The old ``build_agui_agent`` factory, ``build_graph`` with a custom ``StateGraph``, ``EvaluatorState`` TypedDict, all node functions, the msgpack DTO allowlist, and the version-guard helpers are all gone -- replaced by one ``build_graph(model=...)`` that returns the compiled ``create_agent`` graph. Tests: - ``tests/test_samples_e2e.py`` rewritten. ``test_copilotkit_evaluator_ interrupt_resume`` and ``test_copilotkit_evaluator_agui_wire`` (both specific to the old ``interrupt()`` architecture) replaced by ``test_copilotkit_evaluator_tools``, which exercises each backend tool against a patched Stratix client and verifies the system prompt references ``confirm_judge``. - Import-smoke test mock list extended for ``langchain.agents`` / ``langchain.tools`` / ``langchain_core.tools`` / ``langchain_openai``. - ``pytest -k copilotkit``: 8 passed, 2 skipped (live). Frontend: - ``page.tsx``: ``useCopilotAction("confirm_judge", ...)`` with a rich judge-card list; ``useLangGraphInterrupt`` removed. - ``globals.css``: styles for ``judge-picker`` / ``judge-card`` / complete / empty states. - ``Evaluate my traces`` quick-action button retained for direct user triggering and automation. Backend server: - ``samples/copilotkit/tests/browser/backend/server.py`` swaps ``build_agui_agent(...)`` for plain ``LangGraphAGUIAgent(...)`` -- no workaround needed on this code path. README: - Full rewrite around the new architecture. Version matrix unchanged. The two upstream ``ag-ui-langgraph`` bugs are preserved in the "informational" section for customers building their own ``interrupt()``-based graphs. Per user direction: no backwards compatibility for the old sample (no customer has it). The workaround subclass is removed, not deprecated.
The previous commit's tests verified the new architecture against
mocks; this one verifies it against a real LLM through the actual AG-UI
FastAPI endpoint.
New test ``test_copilotkit_evaluator_live_llm``:
- Loads credentials from a gitignored ``.env`` (or real env vars in CI),
with OpenRouter convenience: if only ``OPENROUTER_API_KEY`` is set,
the loader auto-points ``OPENAI_BASE_URL`` at OpenRouter.
- Builds a FastAPI app with the patched Stratix client + the real
evaluator graph (real LLM, no fake model).
- POSTs an AG-UI ``RunAgentInput`` whose ``tools`` array declares the
``confirm_judge`` frontend tool, exactly as the browser would.
- Asserts: tool sequence is ``list_recent_traces`` -> ``list_judges``
-> ``confirm_judge``; agent halts at ``confirm_judge`` (never calls
``run_trace_evaluation``); single ``RUN_STARTED`` + ``RUN_FINISHED``
with matching client ``runId``; no ``RUN_ERROR``.
- Marked ``@pytest.mark.live`` and ``pytest.skip``s when no key is
available, so the default ``pytest`` run is unaffected.
Verified locally: passes against ``openrouter:openai/gpt-4o-mini``.
Other changes in this commit:
- ``evaluator_agent.py``:
- ``_default_model()`` honours ``OPENAI_API_KEY``,
``OPENAI_BASE_URL``, and ``OPENAI_MODEL`` so any OpenAI-compatible
endpoint works (OpenAI, Ollama, LM Studio, OpenRouter, vLLM, ...).
For non-compatible providers, customers pass any LangChain
``BaseChatModel`` to ``build_graph(model=...)``.
- ``create_agent`` now compiles with ``InMemorySaver``. ``ag-ui-
langgraph``'s ``add_langgraph_fastapi_endpoint`` calls
``graph.aget_state(config)`` on every request, which fails with
``ValueError("No checkpointer set")`` if the graph wasn't compiled
with one -- regardless of whether ``interrupt()`` is used.
- ``build_agui_agent`` reintroduced as a *minimal* runId-only
workaround for ``ag-ui-protocol/ag-ui#1582``. Bug #1584 (duplicate
RUN_STARTED) is unreachable on this code path because the
evaluator never calls ``langgraph.types.interrupt()``, so we only
need the runId fix. Live test confirms the workaround restores
runId continuity end-to-end.
- ``samples/copilotkit/tests/browser/backend/server.py``: switched back
to ``build_agui_agent(...)`` so the runId workaround is active in
the harness backend. The earlier "no workaround needed" claim was
wrong; @ag-ui/client@0.0.52 doesn't enforce runId continuity but
older clients did and future strict ones likely will.
- ``tests/.env.example``: documents the supported env vars (OPENAI,
OpenRouter convenience, LayerLens). Real ``tests/.env`` is
gitignored.
- ``samples/copilotkit/README.md``: documents the live-test setup and
links the .env.example. Also documents the
``OPENAI_API_KEY``/``OPENAI_BASE_URL``/``OPENAI_MODEL`` env-var
triplet for OpenAI-compatible providers (Ollama, LM Studio,
OpenRouter).
DevRel hit a "page renders but every button is dead, textarea won't
accept input" failure mode while running the harness locally. Diagnosis
took several iterations because there was no client-side error:
- Backend was healthy; ``/healthz`` returned 200.
- ``/api/copilotkit`` was up; an ``info`` JSON-RPC probe listed the
evaluator agent.
- Direct POSTs to the backend at :8123 streamed real LLM events.
- The page HTML had every expected ``data-testid``.
- Browser console showed only one repeating warning:
``WebSocket connection to 'ws://127.0.0.1:3000/_next/webpack-hmr'
failed: Error during WebSocket handshake``
Root cause: Next 16 enforces a cross-origin allowlist for dev resources
(including the webpack-hmr WebSocket). When the user serves on
``127.0.0.1`` but the allowlist is implicit ``localhost``, HMR fails to
connect and Next leaves React in a half-hydrated state. The page
renders from the server but client React never wires up event handlers
or controlled-input state -- so buttons and textareas are visually
present but inert. No error is surfaced beyond the WebSocket warning.
Fix:
- Add ``allowedDevOrigins: ["127.0.0.1", "localhost"]`` to
``samples/copilotkit/tests/browser/frontend/next.config.js``. Both
origins are the supported way to load the harness; without this,
whichever the user picks tends to break.
Also, to make this kind of failure self-diagnosing rather than
requiring DevTools-paste skills:
- New ``samples/copilotkit/tests/browser/frontend/public/diag.html``
-- a static page (no React) that runs three probes on load and
renders results inline: runtime ``info`` reachability, an
``agent/run`` round-trip through ``/api/copilotkit``, and a direct
``/healthz`` ping against the backend. Visit
``http://127.0.0.1:3000/diag.html`` to see green/red labels for
each. This bypasses the React app entirely, so it stays useful even
when hydration is broken.
- New "Run diagnostic" button on the harness page (next to "Evaluate
my traces") that runs the same probes plus a couple of React-only
checks (textarea state, isLoading, intercepted ``appendMessage`` POST
body) and renders the report directly on the page. Useful for users
who can't (or don't want to) paste JS into DevTools console.
Verified locally: after the cache + allowedDevOrigins fix, both
buttons fire, ``appendMessage`` POSTs to ``/api/copilotkit`` and gets
back a real ``RUN_STARTED`` SSE stream end-to-end.
CopilotKit's ``renderAndWaitForResponse`` re-renders the action UI
progressively as the LLM streams the tool-call JSON, so for the first
render tick or two ``judge.id`` (and sometimes ``judge.name``) can be
undefined even though the surrounding React state is stable. That
tripped two issues in our judge picker:
1. ``key={judge.id}`` warned "Each child in a list should have a
unique key prop" when id was undefined.
2. The Select button was clickable with an undefined id, which would
``respond({ id: undefined, name: undefined })`` and break the
resume.
Fix:
- Fall back to ``pending-{index}`` for the React key while id is
pending. Quiet warning + stable row identity.
- Mark each row "ready" only when both id and name are present and
``respond`` is non-null. Disable the Select button and show
"Loading..." until ready. The button text and ``data-testid``
follow the ready state so automated tests don't grab a half-loaded
row by accident.
- Hide the dim id-pill (``judge-card-id``) while id is pending so the
card doesn't flash an empty grey box.
…tionCard
DevRel asked: "where is the tool indicator I should see?" CopilotChat
only renders user/assistant text and frontend HITL widgets by default;
backend tool calls fire invisibly. Surface them with the
``useCopilotAction`` + ``available: "remote"`` + ``render`` pattern --
the same pattern CopilotKit's ``tool_rendering_agent.py`` showcase
uses.
Changes:
- All four backend tools (``list_recent_traces``, ``list_judges``,
``run_trace_evaluation``, ``get_evaluation_result``) now render
inline cards with a pulsing-dot "Running" status pill, transitioning
to a green "Done" pill when the tool resolves. Each card has a
stable ``data-testid`` for automated tests.
- ``get_evaluation_result`` (the final result) renders the polished
``EvaluationCard`` from ``samples/copilotkit/components/`` -- the
production-grade SDK card with the score donut and pass-rate ring.
Imported via a tsconfig path alias
(``@layerlens/copilotkit-cards``) so the harness can reuse the
upstream SDK components without copying or duplicating them.
- ``confirm_judge`` HITL picker restyled with matching Tailwind tokens
to keep the visual language consistent across all tool cards.
- Tailwind 4 added (``@tailwindcss/postcss``, ``tailwindcss``) +
``postcss.config.mjs`` + ``@import "tailwindcss"`` in ``globals.css``.
Inline custom CSS removed in favour of Tailwind utilities, matching
CopilotKit's own showcase samples.
- ``html className="dark"`` + ``color-scheme: dark`` so the SDK
reference cards (which key off the ``.dark`` ancestor) render in
dark mode by default.
- ``<CopilotKit showDevConsole={false}>`` -- DevRel reported the
default web-inspector "kite" obscured the harness header; suppressed
for the sample.
- ``tsconfig.json`` includes ``../../../components/**/*`` so Next's
bundler picks up the SDK card sources, and adds the
``@layerlens/copilotkit-cards`` path alias.
The pattern (frontend ``useCopilotAction`` for backend tools with
``available: "remote"``) is what customers should copy. The harness
demonstrates it in two flavours: lightweight inline cards (for the
first three tools) and full SDK-component composition (for the
result). Both styles are valid; teams pick based on visual weight
they want.
Reshaped the CopilotKit sample so it reads as a commercial-grade SDK
demo rather than a test fixture, and brought the visual language into
line with CopilotKit's own samples (research-canvas, travel, banking,
with-shadcn-ui).
Structure
- Move sample out of `samples/copilotkit/tests/browser/{backend,frontend}`
to `samples/copilotkit/app/{backend,frontend}` so customers see "the
app" rather than "a test harness". Update README + path constants.
- Add `app/frontend/.gitignore` for `.next/`, `node_modules/`, and
Playwright artefacts.
Backend (`app/backend/server.py`, `agents/evaluator_agent.py`)
- Real LayerLens only: missing `LAYERLENS_STRATIX_API_KEY` is a hard
startup error. No fake-fixture path, no `MagicMock`, no env-var
flag — fixtures only ever existed for an earlier Playwright fixture
and conflicted with the SDK posture in CLAUDE.md.
- Agent built with `create_agent` + `CopilotKitMiddleware`, real `@tool`
impls returning `Command(update={...})` so each tool emits state into
`state.{traces,judges,evaluations,results}`. Async tools call
`copilotkit_emit_state` so the canvas updates live during a run.
- New `GET /evaluations/{id}` endpoint for out-of-band polling: the
agent kicks off evaluations, ends in seconds, and the frontend folds
completed verdicts into the canvas as each evaluation resolves on
LayerLens. Fixes the 30s-evaluation-vs-LLM-polling-loop hallucination.
- `LangGraphAGUIAgent` constructor gets `config={"recursion_limit":
200}` so a 5-trace fan-out doesn't trip the default 25-hop limit
(tested via `with_config` first; that path is dropped by ag-ui's
internal config merge).
- System prompt rewritten: strict tool order; `confirm_judge` takes no
args (frontend reads candidates from `state.judges` to avoid the
`tool_argument_parse_failed: Unterminated string in JSON` we hit
when streaming 38 judges through tool args); evaluations capped at
5 traces; pending != failed; final summary template branches on
whether anything completed.
SDK card library (`samples/copilotkit/components/`)
- Rewritten on top of shadcn/ui primitives. Cards now compose `Card`,
`CardHeader`, `CardContent`, `CardFooter`, `Badge`, `Button`,
`Separator`, `Progress` from `@/components/ui/*`. Status pills use
the `bg-{color}-50 text-{color}-600 dark:bg-{color}-900/20` pattern
CopilotKit's banking sample uses, not custom ring/shadow chrome.
- Stock shadcn neutral OKLCH palette (`baseColor: neutral`). Brand
accent `#6766FC` applied via Tailwind class strings on CTAs/links —
same approach research-canvas takes for its accent. No edits to
`--primary` / shadcn theme variables.
- Score bars solid (`bg-green-500` / `bg-red-500` / `bg-amber-500`)
not gradients. Sparklines color-coded by pass-rate threshold.
- `dashboardBaseUrl` is now strictly opt-in across `TraceCard` and
`EvaluationCard`: the "Trace Explorer →" / "Agent Graph →" / "View
in Dashboard →" footers only render when a real URL is configured
via `NEXT_PUBLIC_LAYERLENS_DASHBOARD_URL`. Stops 404s on routes
that aren't deployed yet.
Frontend (`app/frontend/`)
- shadcn primitives installed via `npx shadcn@latest add card button
badge progress separator`. Deps: `radix-ui`, `class-variance-
authority`, `clsx`, `tailwind-merge`, `tw-animate-css`. Tailwind 4 +
React 19. `components.json` aliases `ui` to the SDK card library.
- New `globals.css` with shadcn neutral tokens (`--background`,
`--card`, `--muted-foreground`, etc.), `@theme inline` mapping for
Tailwind 4, and a `--copilot-kit-*` bridge so `<CopilotChat>` reads
the same neutral tokens as the canvas. Brand accent set on
`--copilot-kit-secondary-color`. Drops the previous "force dark"
CSS.
- Layout split-pane, **light by default** to match every official
CopilotKit sample. New `theme-toggle.tsx` segmented control
(Light / System / Dark) persists to `localStorage` and reacts to
OS-level theme changes when set to System.
- `useCoAgent({ name: "evaluator" })` reads live agent state. New
out-of-band poller (`useEffect` against `/evaluations/{id}` every
5 s) folds verdicts that arrive after the agent run ends into the
canvas. `state.results` (agent) and `polledResults` (frontend) are
merged via `useMemo` so MetricStrip / EvaluationCard / JudgeVerdict-
Card all see one consistent results array.
- Picker: `JudgePicker` is its own component subscribed to `useCoAgent`
so it re-renders when `state.judges` populates after the LLM streams
out the tool call. `confirm_judge` uses `available: "remote"` +
`renderAndWaitForResponse` per the canonical research-canvas HITL
pattern.
Cleanup
- Strip every dev artefact: agent's `[tool] X INVOKED` prints, the
page's debug-state `<pre>`, the `console.log("[evaluator state]"…)`
effect, the "Run diagnostic" button + panel + state, and the
`probe_e2e.py` SSE diagnostic script. Header is now just the title,
theme toggle, and the primary CTA.
…n reasoning
Polish pass after first review:
- Chat token bridge fixed. Re-read CopilotKit's ``react-ui/colors.css``
semantics: ``primary-color`` is the user-bubble + interactive accent,
``secondary-color`` is the assistant message background, not a brand
slot. Earlier mapping made the assistant greeting render as solid
indigo and clip out of view in light mode. Now mapped onto shadcn
tokens semantically: ``primary → --primary``, ``contrast → --primary-
foreground``, ``secondary → --card``, ``secondary-contrast →
--card-foreground``. Brand accent ``#6766FC`` stays only on actual
CTA buttons via Tailwind class strings.
- ``JudgePicker`` "selected" pill now uses light + dark variants
(``bg-green-50 text-green-700 dark:bg-green-900/20 dark:text-green-300``)
instead of dark-mode-only emerald that disappeared on a light page.
- ``JudgeVerdictCard`` redesign:
* Pass / Fail / Error are now solid-filled badges (``bg-green-600``,
``bg-red-600``, ``bg-amber-600`` with white text), readable at a
glance instead of subtle ghost pills.
* Severity rendered as a colored pill with a triangle alert glyph,
not a dot. Severity is a status (impact-of-failure level), not a
trend, so an "alert" shape is correct; chevrons would imply
direction. Hide the severity chip when verdict=pass AND
severity=low — nothing meaningful to flag.
* Reasoning rendered through a tiny inline ``MarkdownLite`` that
handles paragraph breaks, line breaks, ``**bold**``, and
``*italic*`` — the cases LayerLens API actually emits. No
``react-markdown`` dep (the SDK card library lives outside the
Next app's node_modules so it can't resolve packages there); no
raw HTML injection. Fixes the wall-of-text rendering of judge
reasoning.
- Tailwind 4 ``@source`` directive added to ``globals.css`` so it
scans ``samples/copilotkit/components/**/*.{ts,tsx}``. Without this,
classes used inside the SDK card library (``bg-amber-500``,
``bg-green-600``, etc.) get tree-shaken out of the generated CSS
and pills silently flatten to plain text.
- ``TraceCardProps.status`` made optional. The LayerLens
``traces.get_many`` API doesn't expose per-trace lifecycle today, so
the sample no longer hardcodes ``status="ok"`` — that was rendering
a misleading green pill on every trace regardless of reality. The
status pill is hidden when the prop is omitted; restore it once the
API surfaces real status.
When the agent kicks off N evaluations and only K complete on the
first poll, the remaining (N - K) used to disappear from the
``Verdicts`` grid even though the run-summary card still counted
them — verdict count would say "5", grid would show 4, and the
trailing pending one looked like it had been lost.
Add ``PendingVerdictCard``: same shadcn ``Card`` chrome as
``JudgeVerdictCard``, with a "Running" pill, a pulsing skeleton bar
for the score, and copy explaining real LayerLens evaluations can take
a minute or two. Render one per evaluation that doesn't have a
matching entry in ``state.results`` yet.
Side effects:
- ``Verdicts`` section count now reflects total evaluations (not just
completed) so the grid count matches what's actually rendered.
- Section now renders even when ``results.length === 0`` as long as
there are evaluations in flight (previously fell through to a
textual placeholder).
- Run summary picks the judge name from the first pending evaluation
if no result has come back yet.
The polling loop is unchanged — it keeps polling
``/evaluations/{id}`` every 5 s and replaces a pending card with the
real ``JudgeVerdictCard`` the moment LayerLens returns a verdict.
The judge ``evaluation_goal`` field LayerLens returns is markdown-
formatted (paragraph breaks, ``**bold**`` headers, numbered lists).
Both the in-chat picker and the canvas's "Available judges" card
were rendering it through plain ``<p>{text}</p>`` so each judge
collapsed into one indented wall of text — same problem the verdict
card's reasoning had before.
Pull the inline markdown renderer that previously lived inside
``JudgeVerdictCard.tsx`` into its own ``markdown-lite.tsx`` module,
re-export it from the SDK card library's ``index.ts``, and use it in:
- JudgeVerdictCard reasoning (already)
- JudgePicker goal description (chat-side)
- JudgesCard goal description (canvas-side)
Output is the same as before for the verdict card; the picker and the
canvas judges card now show structured goal text. Still no
``react-markdown`` dependency — the SDK card library has to stay
resolvable without the Next.js app's node_modules in scope, so we
keep the small built-in renderer instead.
The README still described the previous incarnation of the sample —
the create_agent + frontend HITL design from before the canvas /
out-of-band-polling rewrite. Rewrite top-to-bottom to reflect what
actually ships:
- New layout section showing ``samples/copilotkit/{agents,components,app}``
with the SDK card library and the customer-facing app side-by-side.
- Architecture diagram updated for the canvas + chat split-pane,
``useCoAgent`` driving state-driven cards, and the
``GET /evaluations/{id}`` polling endpoint that the frontend hits
every 5s for in-flight verdicts.
- Step-by-step "How the demo flows" walkthrough so a customer can
read the README and predict what each click will do.
- "Why this pattern" updated to highlight the canvas + frontend
polling + ``copilotkit_emit_state`` triad. Old text framed the
choice as ``create_agent`` vs ``interrupt()``; new text frames it
as the research-canvas pattern.
- Tools section updated for the async + ``Command(update={...})``
return shape and the no-arg ``confirm_judge`` (frontend reads
candidates from ``state.judges``).
- Frontend section adds: shadcn/ui foundation, ``components.json``,
light-default theme + ``ThemeToggle``, ``--copilot-kit-*`` token
bridge, brand accent ``#6766FC``, the SDK card matrix
(5 cards + ``MarkdownLite``).
- Backend section adds: ``recursion_limit: 200`` config, the
``GET /evaluations/{id}`` polling handler, and the "no fake
fixture" guardrail.
Drive-by: ``ruff format`` brought ``evaluator_agent.py`` and
``server.py`` in line with the project's ruff style. (The repo's
``[tool.ruff]`` ``exclude = ["samples"]`` would skip these on
discovery, but reformatting locally keeps them tidy and avoids
contributors re-doing it.)
Fixes both red CI checks on PR #92: - ``Check Lint`` was failing because tests/test_samples_e2e.py used the walrus operator (``:=``) at line 1446 and ruff's ``[tool.ruff].target-version`` is pinned to ``py37``. Replace with a regular assignment + boolean check — same semantics, py37 compatible. The package's runtime support (``Python >=3.10,<3.13``) doesn't dictate ruff's syntax target; bumping the ruff target is out of scope for this PR. - ``Check Format`` was failing because the same file had pre-existing multi-line wrapping that ruff's auto-format collapses to single lines under the 120-char limit. Apply ``ruff format``. - ``ruff check --fix`` also normalised one import block (I001). CI's ``test (3.9..3.12)`` jobs cancelled out after the lint pre-step failed — they should now actually run.
Per existing repo policy: the SDK sample and tests should not name a specific OpenAI-compatible provider. Configuring OpenRouter (or any other gateway) is the user's job in their own .env — the docs and test code stay vendor-neutral. Removes: - OpenRouter row from ``_default_model``'s docstring table. - OpenRouter mention in ``build_graph``'s docstring. - ``OpenRouter, vLLM`` aside in the CLI ``main()`` print block. - OpenRouter URL in ``samples/copilotkit/README.md`` env-var example. Replaced with a placeholder ``your-openai-compatible-host``. - ``OPENROUTER_API_KEY`` auto-mapping in ``test_copilotkit_evaluator _live_llm`` (the test now expects ``OPENAI_API_KEY`` and lets the user set ``OPENAI_BASE_URL`` / ``OPENAI_MODEL`` themselves if pointing at a non-OpenAI endpoint). - Skip-message reference to ``OPENROUTER_API_KEY``. The sample still works against any OpenAI-compatible endpoint — the generic env vars (``OPENAI_API_KEY`` / ``OPENAI_BASE_URL`` / ``OPENAI_MODEL``) carry the configuration. The user's own gitignored ``.env`` is where provider-specific URLs (OpenRouter, Ollama, LM Studio, …) live.
Three test failures from the previous CI run, all addressed here:
1. ``tests/test_samples.py::test_sample_has_main[copilotkit/app/backend
/server.py]`` expects every sample's entry-point file to expose a
``main()`` function. ``server.py`` had a bare ``if __name__ ==
"__main__":`` block instead. Lift the uvicorn.run call into a
``main()`` and call it from the ``if __name__`` guard.
2. ``test_copilotkit_agent_import[evaluator_agent]`` and
3. ``test_copilotkit_without_langchain[evaluator_agent]`` both stub
the heavy deps via ``patch.dict("sys.modules", ...)`` so the agent
module imports cleanly without langchain / copilotkit installed.
The mock dict was missing the new submodules the agent now imports
(``langgraph.prebuilt``, ``langchain.agents.middleware``,
``langchain_core.runnables``, ``langchain_core.tools.base``).
Add them to both mock dicts.
Locally ``ruff check`` and ``ruff format --check`` are clean on all
touched files.
…success
Bug repro: same evaluation reliably stayed "Running" across multiple
demo runs. Root cause was the polling filter on the frontend:
const completed = updates.filter(
(u) => u.status === "success" && typeof u.score === "number",
);
This rejected any LayerLens response that wasn't a clean success with
a numeric score — including ``status: "failure"``, ``status: "error"``,
``status: "cancelled"``, and the ``status: "success"`` case where
``trace_evaluations.get_results`` returned ``score: null`` (which
some judges legitimately do). The poller would then keep firing every
5s forever and the verdict card would sit in "Running" indefinitely.
Two-sided fix:
Backend (``GET /evaluations/{id}``):
- New ``done: bool`` field — true for any of
``success | failure | error | cancelled | not_found``, false while
the evaluation is still ``in_progress`` / ``pending`` / ``queued``.
- Always include ``passed`` / ``score`` / ``reasoning`` once
``done: true``, even for terminal failures and ``success``-without-
score: defaults are ``passed: false``, ``score: 0.0``, and a
``reasoning`` string explaining the terminal state.
- ``try/except`` around ``trace_evaluations.get`` so a malformed /
unauthorized id surfaces as ``status: "error", done: true`` instead
of a 500 that the frontend retries forever.
Frontend (``page.tsx``):
- Polling filter is now ``u.done === true`` instead of
``status === "success" && typeof score === "number"``.
- ``ResultRecord`` type gains an optional ``done?: boolean`` field
(the agent's own ``state.results`` entries don't carry it; only the
``/evaluations/{id}`` polling responses do).
Verified against a real eval id (clean success path → ``done: true``,
score returned) and a deadbeef id (error path → ``done: true``,
``status: "error"``, no 500). The 5th-eval-stuck symptom is from the
non-success terminal cases — frontend now folds them into the canvas
as a verdict card with the appropriate fail/error styling instead of
spinning forever.
Adds the assistant resource handler so SDK users can drive the Stratix Assistant programmatically. Mirrors the REST surface from atlas-app's DOCS/api/assistant-openapi.yaml and the SSE event channel from DOCS/api/assistant-asyncapi.yaml. Surface (sync + async parity): - list_conversations() → AssistantConversationList - create_conversation(title=None) → AssistantConversation - get_conversation(id) → AssistantConversation - rename_conversation(id, title=...) → AssistantConversation - delete_conversation(id) → None - list_messages(conv_id, limit=None) → AssistantMessageList - chat(conv_id, content) → Iterator[AssistantStreamEvent] The chat() iterator parses the SSE stream and yields one event per block. Six event types are recognized (token, tool_call, tool_result, done, moderation_refused, error). Unknown event types are silently skipped so a forward-compat addition on the server doesn't crash SDK clients. The iterator stops on any terminal event (done, moderation_refused, error). Models (mirrors server-side Pydantic shape): - AssistantConversation, AssistantMessage, AssistantToolCall - AssistantConversationList, AssistantMessageList - AssistantStreamEvent (with .is_terminal() and .text() helpers) - AssistantTokenUsage Access control (server-side, surfaced to SDK callers as exceptions): - 403 PermissionDeniedError when the org's tier does not have AssistantSDKEnabled = true. Default-deny — contact LayerLens to request enablement. - 429 RateLimitError when the per-org daily token cap is exhausted (or 0, which is the default for every plan). Headers X-Token-Budget-Used / X-Token-Budget-Cap reported on success. - 503 when Redis (rate-limit + budget backend) is unreachable — fail-closed posture, no in-memory fallback. Tests: - 9 SSE-block parser tests (token, done, moderation_refused, error, unknown event forward-compat, malformed JSON, missing event/data, text() accessor for non-text events). - 8 resource-method tests (list/create/get/rename/delete + envelope unwrapping + edge cases). - 2 streaming tests (real SSE flow with token+done sequence; 403 raises HTTPStatusError). Mypy strict clean across the new files.
Closes the cross-cutting CLAUDE.md multi-tenancy gap surfaced by A:/tmp/adapter-depth-audit.md (finding #3): all 203 adapter emissions in the stratix-python SDK shipped without org_id propagation, violating the platform-wide 'EVERY data operation must be scoped by tenant' mandate. What changed: * BaseAdapter.__init__ now requires a resolvable org_id (explicit kwarg or stratix.org_id / stratix.organization_id). Construction without a non-empty value raises ValueError — fail-fast, no silent fallback. * BaseAdapter.emit_event and emit_dict_event stamp the bound org_id into every payload before forwarding to the client. Any caller-supplied value is overwritten with the adapter's own tenant binding (defensive overwrite — prevents cross-tenant leaks via misuse). * The replay trace record and EventSink dispatch path both carry org_id at the envelope level. EventSink.send signature now requires org_id as a keyword-only arg. * All 17 framework adapters, 1 protocol adapter base, and 9 LLM provider adapters thread org_id through their __init__ to super().__init__. instrument_* helper functions in each adapter package accept and forward the kwarg. * IngestionPipelineSink uses per-event org_id as the tenant_id for downstream ingest. Tests: * tests/instrument/adapters/_base/test_org_id_propagation.py — 17 tests covering construction-time fail-fast (5), per-emission propagation (5), cross-tenant isolation (2), and public surface stability (2). Includes the cross-tenant test the audit recommended (org A's adapter never tags events with org B). * tests/instrument/adapters/frameworks/test_per_adapter_org_id.py — 37 parametrized tests (one accept + one fail-fast pair per adapter, plus an audit-cardinality guard) covering all 17 framework adapters. * All existing _RecordingStratix test stand-ins updated with an org_id class attribute. instrument_* helper-call sites in tests pass org_id="test-org". Acceptance: 17/17 base tests + 37/37 per-adapter tests + 258 existing framework adapter tests pass; mypy --strict on _base clean; ruff clean. Docs: docs/adapters/multi-tenancy.md documents the contract for future adapters.
This was referenced Apr 27, 2026
Merged
4 tasks
m-peko
pushed a commit
that referenced
this pull request
May 12, 2026
…laceholder from #116) (#126) Replaces the M7 placeholder shipped in PR #116 (truncation policy) with the full BrowserUseAdapter — every lifecycle hook wired, every event emitted, and every cross-cutting CLAUDE.md contract enforced from day one. What changed ------------ Full lifecycle adapter (src/layerlens/instrument/adapters/frameworks/ browser_use/lifecycle.py): * connect / disconnect / health_check / get_adapter_info / serialize_for_replay (all five abstract BaseAdapter methods). * on_session_start, on_session_end, on_navigation, on_action, on_screenshot, on_dom_extraction, on_llm_call (every spec'd hook). * Capability declaration: TRACE_TOOLS + TRACE_MODELS + TRACE_STATE + STREAMING + REPLAY (no longer the placeholder's TRACE_TOOLS-only set). * Canonical events: browser.session.start, browser.navigate, browser.action, browser.screenshot, browser.dom.extract, tool.call, model.invoke, agent.input/output/state.change, cost.record, environment.config — plus agent.error / tool.error / model.error per the PR #115 error-aware emission contract. * Per-callback resilience wrapper per PR #117 — observability errors NEVER crash the customer's agent, surfaced via resilience_snapshot(). * Multi-tenant org_id propagation per PR #118 — bound at construction (kwarg or resolved from stratix.org_id), stamped defensively on every emit, caller-supplied values overwritten to prevent cross-tenant leaks. * Truncation policy from day one (DEFAULT_POLICY) — screenshot bytes DROPPED to deterministic SHA-256 references, DOM/HTML capped at 16 KiB, prompts/completions/tool I/O at 4/2 KiB. * Browser-event layer mapping (_BROWSER_EVENT_LAYERS) so unknown browser.* event types respect CaptureConfig gating without falling through the unknown-event-drops-by-default path. * requires_pydantic = PydanticCompat.V2_ONLY (browser_use is a v2 lib). Public surface (src/layerlens/instrument/adapters/frameworks/ browser_use/__init__.py): * ADAPTER_CLASS = BrowserUseAdapter (registry). * instrument_agent(agent, stratix=, capture_config=, org_id=) one-liner returning the connected, wrapping adapter. * STRATIXBrowserUseAdapter top-level binding (legacy alias) — fires DeprecationWarning on construction. Exposed as a static binding so the manifest consistency lint's AST walk finds it. Pyproject: * Adds 'browser-use' optional extra: browser-use>=0.1.0,<2 with the python_version >= '3.11' marker (browser_use's own constraint). Tests (tests/instrument/adapters/frameworks/test_browser_use_adapter.py): * Replaces the 7-test scaffold from #116 with 40 tests covering: wiring + alias + lifecycle round-trip + truncation (screenshot drop, hash determinism, HTML cap, short-payload no-audit) + multi-tenancy (kwarg, client attribute, defensive overwrite) + resilience (poison stratix, exploding agent attribute access) + error-aware emission (agent.error / tool.error / model.error) + per-hook coverage + sync + async wrapping + replay round-trip + 10-case provider detection table. Sample (samples/instrument/browser_use/{main.py,__init__.py,README.md}): * Runs OFFLINE — no browser-use install, no Playwright, no API key, no network. Three-step duck-typed agent + happy/--fail paths exercise the full event surface and demonstrate screenshot drop + org_id stamping + agent.error emission before re-raise. Doc (docs/adapters/frameworks-browser_use.md): * Install + quickstart + capabilities matrix + 14-event reference table + truncation policy table + multi-tenancy + resilience + error-aware emission + capture config + browser_use specifics + BYOK + replay sections. Manifest (scripts/emit_adapter_manifest.py): * Promotes browser_use from _LIFECYCLE_PREVIEW to _MATURE — every required artifact (test file with >= 12 funcs, sample, doc, STRATIX→LayerLens deprecation alias) ships in this PR. Verification ------------ * uv run pytest tests/instrument/adapters/frameworks/test_browser_use_adapter.py → 40 passed * mypy --strict src/layerlens/instrument/adapters/frameworks/browser_use → Success: no issues found in 2 source files * ruff check on src + test + script → All checks passed! * Sample runs cleanly offline (happy + --fail) * pip install -e .[browser-use] resolves cleanly (browser-use only pulled on Python 3.11+ per the env marker) * tests/instrument/adapters/test_manifest_consistency.py:: test_mature_adapters_have_required_artifacts[browser_use] passes * Full instrument suite (excl. pre-existing crewai/protocols references not on this branch): 312 passed, 1 skipped, 12 xfailed
…or 6 lighter adapters (cross-poll #1) (#130) Implements cross-pollination item #1 from A:/tmp/adapter-cross-pollination-audit.md section 2 #1. The four mature framework adapters (LangChain, AutoGen, CrewAI, Semantic Kernel) carry ad-hoc memory plumbing — episodic recent turns, procedural learned patterns, semantic long-lived facts — that lets agents recall context across runs. The lighter adapters (agno, ms_agent_framework, openai_agents, llama_index, google_adk, bedrock_agents, browser_use) all behave as goldfish agents — every run starts from a blank slate. This PR ports the pattern into a shared, replay-safe primitive that the lighter adapters plug into uniformly. ## What is new ### Shared memory primitive src/layerlens/instrument/adapters/_base/memory.py — new - MemorySnapshot — frozen dataclass with turn_index, episodic (recent turns), procedural (detected patterns), semantic (key/value facts), content_hash (SHA-256 of canonical-JSON encoding), org_id (tenant binding). to_dict / from_dict round-trip preserves identity. - MemoryRecorder — thread-safe accumulator. record_turn(...) is the per-turn entry point; set_semantic(key, value) for long-lived facts; snapshot() returns the immutable view; restore(snap) rebuilds state from a previous snapshot. All buckets bounded (defaults 200/16/64); episodic FIFO eviction, semantic LRU, procedural keep-top-by-count. - Procedural pattern detection: O(window) per turn, scans the recent episodic window for recurring (prev_tools, current_tools) pairs. - Multi-tenant: recorder requires non-empty org_id at construction; restore() rejects cross-tenant snapshots and tampered snapshots (content-hash mismatch). - Replay-safe: snapshot -> restore -> snapshot round-trip produces byte-identical content_hash. ### BaseAdapter integration src/layerlens/instrument/adapters/_base/adapter.py - Constructor builds self._memory_recorder = MemoryRecorder(org_id=self._org_id). - New record_memory_turn(...) helper — best-effort wrapper that swallows recorder failures so memory persistence never breaks the host framework call stack (CLAUDE.md "tracing never breaks user code"). - memory_recorder property, memory_snapshot() and memory_snapshot_dict() convenience accessors. ### Per-adapter wiring (6 adapters) - agno: Agent.run/arun finally-block; episodic input from args/kwargs; tool list from _collect_tool_names(result.messages). - ms_agent_framework: Chat.invoke/invoke_stream finally-block; episodic input from kwargs; tool list from streamed message items. - openai_agents: _on_agent_span_end (TraceProcessor) + on_run_end (Runner wrap); episodic input cached at span_start per span_id; tool list rolled up from _on_function_span_end per parent_id. - llama_index: _on_agent_step_end; episodic input cached at step_start per thread id; tool list rolled up from _on_tool_call. - google_adk: after_agent_callback + on_agent_end; episodic input cached at before_agent_callback per thread id; tool list rolled up from after_tool_callback per thread id. - bedrock_agents: _after_invoke_agent (boto3 hook); episodic input cached at _before_invoke_agent per thread id; tool list rolled up from _process_trace action-group / KB step names. Each adapter serialize_for_replay() now embeds the snapshot under ReplayableTrace.metadata["memory_snapshot"] so replay engines can reconstruct memory state via MemorySnapshot.from_dict(...) -> recorder.restore(snapshot) before re-execution. ## Tests (57 new) ### tests/instrument/adapters/_base/test_memory.py — 27 tests Recorder construction (empty/non-string org_id rejected; zero buffer sizes rejected; initial state empty). Snapshot determinism (identical content -> identical hash; different org_id -> different hash; mutating recorder doesnt affect prior snapshot; to_dict/from_dict round-trip preserves hash; from_dict rejects missing required fields). Replay round-trip (snapshot -> restore -> snapshot byte-identical hash; deterministic next-state under matching inputs; cross-tenant restore raises; tampered-content-hash restore raises). Bounded eviction (episodic FIFO at cap; semantic LRU at cap; semantic overwrite refreshes LRU; procedural cap). Procedural detection (repeated tool sequences accumulate count; no-tool turns produce no patterns). Per-turn truncation (multi-megabyte values capped with deterministic suffix). Thread safety (8 threads x 50 turns produces unbroken 1..400 sequence). Clear preserves binding; defaults positive; extra metadata sorted for hash determinism. ### tests/instrument/adapters/frameworks/test_memory_persistence_wiring.py — 30 tests (5 x 6 adapters parametrized) - Each adapter exposes a recorder bound to its org_id. - record_memory_turn advances the episodic buffer. - serialize_for_replay() embeds metadata["memory_snapshot"]. - Replay engine can restore the recorder from the serialised trace (content-hash match end-to-end). - Cross-tenant snapshot is rejected at the per-adapter recorder boundary. ## Documentation docs/adapters/memory-contract.md — explains the three buckets, the contract (tenant binding, bounded buffers, tamper-evident snapshots, replay-safe round-trip, best-effort recording, thread safety), per-adapter wiring matrix, and audit hooks. Includes the replay-engine integration recipe and the honest scope disclosure for browser_use. ## Honest scope disclosure The cross-pollination audit section 2 #1 enumerates seven target adapters. Six are wired here. The seventh — browser_use — does NOT exist on this PR base branch (feat/instrument-multitenancy-org-id-propagation); it lives on the parallel feat/instrument-frameworks-browser-use-full history. It will be wired when that adapter is ported to this base or when the histories merge. This follows the same honest-disclosure pattern as PR #120 (state filters, which omitted ms_agent_framework for the same reason). The future browser_use wiring (per audit section 2 #1) will be: - Episodic: page navigation events (URL, action, selector) - Procedural: recurring (prev_action, current_action) patterns - Semantic: long-lived page-content cache keyed by URL/DOM hash ## Acceptance uv run pytest tests/instrument/adapters/_base/test_memory.py -x -> 27 passed uv run pytest tests/instrument/adapters/frameworks/test_memory_persistence_wiring.py -x -> 30 passed uv run pytest tests/instrument/adapters/_base/ -> 44 passed (no regressions) uv run pytest tests/instrument/adapters/frameworks/{agno,bedrock_agents,google_adk,llama_index,ms_agent_framework,openai_agents}_adapter.py -> 72 passed (no regressions) uv run mypy --strict src/layerlens/instrument/adapters/_base/memory.py -> Success: no issues found in 1 source file uv run mypy src/layerlens/instrument/adapters/_base/adapter.py src/layerlens/instrument/adapters/frameworks/{6 adapters}/lifecycle.py -> Success: no issues found in 7 source files uv run ruff check src/layerlens/instrument/adapters/_base/memory.py tests/instrument/adapters/_base/test_memory.py tests/instrument/adapters/frameworks/test_memory_persistence_wiring.py -> All checks passed!
m-peko
approved these changes
May 12, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the cross-cutting CLAUDE.md multi-tenancy gap surfaced by
A:/tmp/adapter-depth-audit.md(cross-cutting finding #3): all 203 adapter emissions in the stratix-python SDK shipped withoutorg_idpropagation, violating the platform-wide "EVERY data operation must be scoped by tenant" mandate.What changed
Foundation (
_base/)BaseAdapter.__init__now requires a resolvableorg_id(explicit kwarg orstratix.org_id/stratix.organization_id). Construction without a non-empty value raisesValueError— fail-fast, no silent fallback.BaseAdapter.emit_eventandemit_dict_eventstamp the boundorg_idinto every payload via_stamp_org_idbefore forwarding to the client. Caller-supplied values are overwritten with the adapter's own tenant binding (defensive overwrite — prevents cross-tenant leaks via misuse).EventSinkdispatch path both carryorg_idat the envelope level.EventSink.sendsignature now requiresorg_idas a keyword-only arg.IngestionPipelineSinkuses per-eventorg_idas thetenant_idfor downstream ingest.ORG_ID_FIELD = "org_id"exported from_base.Per-adapter wiring (17 framework + 1 protocol base + 9 LLM providers)
Every adapter
__init__now accepts a keyword-onlyorg_id: str | None = Noneand forwards it tosuper().__init__. The 12instrument_*helper functions in each framework adapter package were updated to accept and forward the kwarg.BaseProtocolAdapter(concrete protocol adapters use**kwargs)LLMProviderAdapterbase)Tests
tests/instrument/adapters/_base/test_org_id_propagation.py— 17 new tests covering construction-time fail-fast (5), per-emission propagation (5), cross-tenant isolation (2 — the explicit cross-tenant scenario the audit recommended), and public surface stability (2).tests/instrument/adapters/frameworks/test_per_adapter_org_id.py— 37 parametrized tests (one accept + one fail-fast pair per adapter, plus an audit-cardinality guard) covering all 17 framework adapters._RecordingStratixtest stand-ins updated with anorg_id = "test-org"class attribute.instrument_*helper-call sites in tests passorg_id="test-org".test_bulk_ported_smoke.pyupdated to passorg_idat the smoke construction sites.Documentation
docs/adapters/multi-tenancy.md— full contract for future adapters (resolution order, fail-fast semantics, defensive overwrite rationale, wiring template, test obligations).BaseAdapterclass docstring updated to call out the multi-tenant binding.Acceptance (per task spec)
uv run pytest tests/instrument/adapters/_base/test_org_id_propagation.py -x— 17/17 passeduv run pytest tests/instrument/adapters/frameworks/ -x(excluding 5 pre-existing collection failures unrelated to this PR:test_agentforce.py,test_langchain.py,test_langfuse.py,test_langgraph.py,test_haystack.py— all fail to import without optional deps that are not installed in this venv) — 258/258 in scope passed, 13 skippeduv run mypy --strict src/layerlens/instrument/adapters/_base— cleanuv run ruff check src/layerlens/instrument/adapters/_base/ src/layerlens/instrument/adapters/frameworks/ tests/instrument/adapters/_base/ tests/instrument/adapters/frameworks/test_per_adapter_org_id.py— cleanCLAUDE.md compliance
org_id— enforced at the central_stamp_org_idchoke pointorg_idat init —BaseAdapter.__init__raisesValueError, never silently skipsReferences
A:/tmp/adapter-depth-audit.md— audit (2026-04-25), cross-cutting finding docs | LAY-881 Fix wrong yaml structure #3docs/adapters/multi-tenancy.md— new contract doc shipped in this PR