feat: save agent traces to disk in OTLP-JSON for benchmark evaluation #272
Merged
DhavalRepo18 merged 28 commits into main on Apr 27, 2026
Conversation
Adds opt-in OpenTelemetry tracing so each agent run emits a single correlated trace spanning the agent graph, every LLM call, and the LiteLLM proxy.
- src/observability/ package exposes init_tracing() + agent_run_span() helpers; no-op when OTEL env is not configured.
- All four runners (plan-execute, claude-agent, openai-agent, deep-agent) wrap run() in a root span with gen_ai.* semconv attributes and trajectory-derived usage totals.
- httpx auto-instrumentation propagates traceparent to the LiteLLM proxy so its spans nest under the agent trace.
- Tests use InMemorySpanExporter; no collector required.
Closes #270
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
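A minimal sketch of how a runner might wire these helpers. init_tracing and agent_run_span are named in this commit, but the signatures, the attribute values, and the _agent_loop helper below are illustrative assumptions, not the merged code.

```python
from observability import agent_run_span, init_tracing

init_tracing(service_name="plan-execute")  # no-op when no OTEL env var is configured

async def run(question: str):
    with agent_run_span(runner="plan-execute", model="litellm_proxy/...") as span:
        result = await _agent_loop(question)  # hypothetical inner loop
        # usage totals are derived from the collected trajectory, not re-counted
        span.set_attribute("gen_ai.usage.input_tokens", result.input_tokens)
        return result
```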
The three SDK runners (claude-agent, openai-agent, deep-agent) each
shipped byte-identical Trajectory/TurnRecord/ToolCall dataclasses and
their own _resolve_model / _LITELLM_PREFIX helpers. Consolidates into:
- src/agent/models.py: canonical ToolCall, TurnRecord, Trajectory
alongside the existing AgentResult.
- src/agent/_litellm.py: shared LITELLM_PREFIX + resolve_model().
- Removed src/agent/{claude,openai,deep}_agent/models.py.
- Collapsed six duplicated per-runner _resolve_model tests into one
parametrized suite at src/agent/tests/test_litellm.py.
Net -110 lines, no behaviour change.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
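For context, an illustrative shape of the consolidated models in src/agent/models.py; the class names come from this commit, while the exact fields shown are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    input: dict
    output: str | None = None

@dataclass
class TurnRecord:
    text: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    input_tokens: int = 0   # per-turn usage, recorded by the SDK runners
    output_tokens: int = 0

@dataclass
class Trajectory:
    turns: list[TurnRecord] = field(default_factory=list)
```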
- src/agent/_prompts.py: single AGENT_SYSTEM_PROMPT used by the three SDK runners (claude-agent, openai-agent, deep-agent). plan_execute keeps its own planning/summarisation prompts.
- src/agent/_cli_common.py: setup_logging, add_common_args, print_trajectory, print_answer, print_result. The three SDK CLIs now only encode their prog name, default model, epilog text, and runner-specific arg (--max-turns vs --recursion-limit). A sketch of the shared-args helper follows below.
- Extract _WATSONX_PREFIX constant in LiteLLMBackend.
Net -110 lines; each CLI shrinks from ~140 LoC to ~60 LoC.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
refactor: share system prompt and CLI boilerplate
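A minimal sketch of the shared CLI surface, assuming argparse. The flag set (--show-trajectory / --json / --verbose, plus --max-turns as a runner-specific example) is taken from this PR's docs changes; the positional argument name and help strings are assumptions.

```python
import argparse

def add_common_args(parser: argparse.ArgumentParser) -> None:
    # Flags shared by every runner CLI; runner-specific flags stay in each CLI.
    parser.add_argument("query", help="question to ask the agent")
    parser.add_argument("--show-trajectory", action="store_true")
    parser.add_argument("--json", action="store_true")
    parser.add_argument("--verbose", action="store_true")

# Each SDK CLI then only encodes its own identity:
parser = argparse.ArgumentParser(prog="claude-agent", epilog="…")
parser.add_argument("--max-turns", type=int, default=None)  # runner-specific flag
add_common_args(parser)
```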
- src/agent/_cli_common.py: new run_sdk_cli(service_name, build_parser, run_coro) that bundles dotenv → parse → logging → init_tracing → asyncio.run. The three SDK main() bodies shrink from 9 lines each to one. A sketch of the bundling follows below.
- DeepAgentRunner._chat_model is now a cached_property so _build_chat_model runs once per runner instance instead of once per run(). Matches the ClaudeAgentRunner / OpenAIAgentRunner pattern of pre-building per-instance config, with lazy init so constructor tests don't need env set.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
refactor: deduplicate models, LiteLLM helpers, prompts, and CLI across agent runners
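A sketch of run_sdk_cli's bundling under the names given in this commit; the observability import path, the setup_logging body, and the parameter handling are assumptions.

```python
import asyncio
import logging

from dotenv import load_dotenv  # python-dotenv

from observability import init_tracing  # import path assumed

def setup_logging(verbose: bool) -> None:
    # Stand-in for the shared helper defined earlier in _cli_common.
    logging.basicConfig(level=logging.DEBUG if verbose else logging.INFO)

def run_sdk_cli(service_name, build_parser, run_coro) -> None:
    load_dotenv()                       # 1. dotenv
    args = build_parser().parse_args()  # 2. parse
    setup_logging(args.verbose)         # 3. logging
    init_tracing(service_name)          # 4. tracing
    asyncio.run(run_coro(args))         # 5. run the SDK runner coroutine
```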
- AgentRunner.__init__ now resolves server_paths into self._server_paths (always a concrete dict). DEFAULT_SERVER_PATHS moves from plan_execute/executor.py to agent/runner.py so the base runner owns its own default; the three SDK runners drop their duplicated _resolved_server_paths attribute and the cross-package import.
- Replace openai_agent._managed_servers (a custom 20-line async context manager) with stdlib contextlib.AsyncExitStack. Enters each MCPServerStdio once and closes them in LIFO order on success or exception. Three test sites that mocked the removed class now run against the real stack with an empty server list.
Net -30 lines, no behavior change.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
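A sketch of the AsyncExitStack pattern this commit adopts, assuming MCPServerStdio is the openai-agents SDK's async context manager (import path and params shape assumed); the surrounding function is illustrative.

```python
from contextlib import AsyncExitStack

from agents.mcp import MCPServerStdio  # openai-agents SDK; path assumed

async def run_with_servers(server_configs, run_inner):
    async with AsyncExitStack() as stack:
        servers = []
        for cfg in server_configs:
            # Enter each MCPServerStdio once; the stack closes them in LIFO
            # order whether run_inner succeeds or raises.
            servers.append(await stack.enter_async_context(MCPServerStdio(params=cfg)))
        return await run_inner(servers)
```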
The observability layer was originally shaped around live tracing (ship spans to a collector, inspect in Jaeger). That's the wrong primary story for a benchmark — what we actually need is per-run records saved to disk in a standard format for later analysis and replay. This commit reorients without ripping anything out:
- agent_run_span() now accepts run_id / scenario_id kwargs and additionally reads from ambient contextvars set via the new set_run_context(). The CLI layer seeds the contextvars, so runners need no signature changes.
- All SDK CLIs and plan-execute gain --run-id (auto-UUID4 when omitted) and --scenario-id, both recorded as root-span attributes (agent.run_id, agent.scenario_id).
- otel-collector.yaml: drop-in Collector config that persists spans to ./traces/traces.jsonl in OTLP-JSON (the canonical, replayable format every OTEL backend can ingest).
- docs/observability.md: full workflow for save-then-replay, live Jaeger as a secondary option, jq recipes, and troubleshooting.
Net +149 LoC across Python + 2 new files (YAML + docs). Tests cover the new kwargs, contextvars, and precedence (kwarg > context).
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
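A minimal sketch of the contextvar-seeding pattern at this point in the PR (a later commit below drops the kwargs path and keeps only the contextvars). set_run_context and agent_run_span are named in this commit; the variable names and precedence wiring here are illustrative.

```python
from contextlib import contextmanager
from contextvars import ContextVar

from opentelemetry import trace

# Ambient run context; the CLI seeds these once before dispatching the runner.
_run_id: ContextVar[str | None] = ContextVar("run_id", default=None)
_scenario_id: ContextVar[str | None] = ContextVar("scenario_id", default=None)

def set_run_context(run_id: str | None = None, scenario_id: str | None = None) -> None:
    if run_id is not None:
        _run_id.set(run_id)
    if scenario_id is not None:
        _scenario_id.set(scenario_id)

@contextmanager
def agent_run_span(runner: str, run_id: str | None = None, scenario_id: str | None = None):
    tracer = trace.get_tracer("observability")
    with tracer.start_as_current_span("agent.run") as span:
        span.set_attribute("agent.runner", runner)
        rid = run_id or _run_id.get()          # explicit kwarg wins over context
        sid = scenario_id or _scenario_id.get()
        if rid:
            span.set_attribute("agent.run_id", rid)
        if sid:
            span.set_attribute("agent.scenario_id", sid)
        yield span
```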
Drop-in Collector config for trace persistence to disk (OTLP-JSON via the file exporter), plus full user-facing docs for enable/persist/replay/troubleshoot flows. These files were intended to land with the previous commit but were missed by "git add -u" since they were untracked.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Review feedback (see PR #272 discussion) surfaced several over-engineered helpers and one real bug.
Bug fix
- tracing.py: BatchSpanProcessor buffers spans, and the final agent run's root span could be dropped on CLI exit. Register atexit.register(provider.shutdown) so the exporter flushes.
Drops with no behavior change
- Custom _NoopTracer / _NoopSpan shims + try/except ImportError around the OTEL import. opentelemetry.trace.get_tracer() already returns a ProxyTracer with NonRecordingSpan no-ops when no provider is installed, and opentelemetry-api is a hard transitive dep.
- _safe_getattr_int / _set_error_status helpers (each used once, now inline).
- annotate_result helper: inlined into all four runners; each runner already knows its Trajectory shape.
- agent_run_span run_id / scenario_id kwargs: no production caller uses them; contextvar-sourced values are the only real path.
- _reset_for_tests public symbol: tests monkeypatch _initialized directly.
- src/observability/attributes.py module: every constant was used once inside runspan.py; inlined the six literal strings.
- Alias table in _system_from_model: speculative aws/gcp/bedrock/vertex_ai → anthropic mappings that were never emitted. Kept the aws alias that the repo actually uses.
Reuse / consistency
- plan-execute CLI now uses _cli_common (setup_logging, add_common_args, HR, run_sdk_cli). Drops its private _setup_logging, LOG_FORMAT, LOG_DATE_FORMAT, and duplicate main() body.
Perf
- ClaudeAgentRunner caches the mcp_servers dict in __init__ (matches the deep-agent cached_property pattern). openai-agent intentionally skipped: its MCPServerStdio instances are entered/exited per run and can't be safely cached.
Net -347 lines, 255 tests pass.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
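A sketch of the flush-on-exit fix, assuming init_tracing builds the provider roughly like this; the stand-in exporter is illustrative (the real code wires the file / OTLP exporters described in the next commit).

```python
import atexit

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

def init_tracing(service_name: str) -> None:
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))  # stand-in
    trace.set_tracer_provider(provider)
    # BatchSpanProcessor buffers spans; without this, the last root span can be
    # dropped when the CLI process exits before the batch flushes.
    atexit.register(provider.shutdown)
```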
Observability no longer requires running a Docker Collector. Set OTEL_TRACES_FILE=./traces/traces.jsonl and each span batch is appended to disk in canonical OTLP-JSON — the same format the OpenTelemetry Collector's file exporter produces, so saved traces remain replayable into any OTLP backend (Jaeger, Tempo, Honeycomb, …) via the Collector's otlpjsonfile receiver when network access is available.
Changes
- src/observability/file_exporter.py: OTLPJsonFileExporter — thread-safe append writer backed by the OTLP common trace encoder.
- tracing.py: init_tracing now wires the file exporter when OTEL_TRACES_FILE is set. HTTP and file exporters are independent; set either or both.
- Drop otel-collector.yaml — the Collector is now optional, not the primary story.
- docs/observability.md rewritten around the file-exporter workflow; live Jaeger demoted to a side note.
- Tests cover encoding, directory creation, append semantics, and the file-only enablement path.
259 tests pass.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
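A sketch of an append-only OTLP-JSON exporter in the spirit of this commit, assuming the encoder from opentelemetry-exporter-otlp-proto-common (the "OTLP common trace encoder" the commit mentions); the real OTLPJsonFileExporter may differ in details.

```python
import json
import threading
from pathlib import Path
from typing import Sequence

from google.protobuf.json_format import MessageToDict
from opentelemetry.exporter.otlp.proto.common.trace_encoder import encode_spans
from opentelemetry.sdk.trace import ReadableSpan
from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult

class OTLPJsonFileExporter(SpanExporter):
    def __init__(self, path: str) -> None:
        self._path = Path(path)
        self._path.parent.mkdir(parents=True, exist_ok=True)
        self._lock = threading.Lock()  # BatchSpanProcessor flushes from a worker thread

    def export(self, spans: Sequence[ReadableSpan]) -> SpanExportResult:
        # One OTLP ExportTraceServiceRequest per batch, one JSON object per line.
        payload = MessageToDict(encode_spans(spans))
        line = json.dumps(payload, separators=(",", ":"))
        with self._lock, self._path.open("a", encoding="utf-8") as fh:
            fh.write(line + "\n")
        return SpanExportResult.SUCCESS

    def shutdown(self) -> None:
        pass
```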
Clean separation between the two observability artifacts so each has a
single responsibility and neither duplicates information held by the
other.
* **Span** (traces.jsonl) carries *metadata*: runner, model, run_id,
scenario_id, question/answer lengths, timing. The previously
duplicated gen_ai.usage.input_tokens / gen_ai.usage.output_tokens /
agent.turns / agent.tool_calls attributes are removed — those values
are derived from the trajectory and adding them to the span was
redundant work + divergence risk.
* **Trajectory** (AGENT_TRAJECTORY_DIR/{run_id}.json) carries *content*:
per-turn text, tool call inputs/outputs, per-turn token usage. New
persistence module writes one file per run when the env var is set.
Joins to the trace by run_id.
Changes
- src/observability/persistence.py: persist_trajectory() — reads
run_id / scenario_id from the same contextvars used by agent_run_span,
so no public signature change on runners. Handles both SDK runners'
Trajectory dataclass and plan-execute's list[StepResult].
- All four runners call persist_trajectory() after building AgentResult
and drop the four derived span attributes.
- docs/observability.md rewritten around the two-artifact model; jq
examples split into metadata-via-trace and content-via-trajectory.
- .gitignore: traces/
Tests cover: disabled-without-env no-op, happy path, list trajectory
(plan-execute shape), missing run_id warn-and-skip, nested dir creation.
264 tests pass.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
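A sketch of the persist_trajectory() idea described above, assuming dataclass trajectories and a run_id contextvar shared with agent_run_span; the exact serialisation is illustrative. It mirrors the tested behaviours: no-op without the env var, warn-and-skip on missing run_id, nested directory creation, and both trajectory shapes.

```python
import dataclasses
import json
import logging
import os
from contextvars import ContextVar
from pathlib import Path

log = logging.getLogger(__name__)
_run_id: ContextVar[str | None] = ContextVar("run_id", default=None)  # shared with agent_run_span

def persist_trajectory(trajectory) -> None:
    out_dir = os.environ.get("AGENT_TRAJECTORY_DIR")
    if not out_dir:
        return  # disabled without the env var
    run_id = _run_id.get()
    if run_id is None:
        log.warning("AGENT_TRAJECTORY_DIR set but no run_id in context; skipping")
        return
    # SDK runners pass a Trajectory dataclass; plan-execute passes list[StepResult].
    if dataclasses.is_dataclass(trajectory):
        payload = dataclasses.asdict(trajectory)
    else:
        payload = [dataclasses.asdict(step) for step in trajectory]
    path = Path(out_dir) / f"{run_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
```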
Reversing the "pure metadata on span" framing from the prior commit. Token totals and turn/tool_call counts are aggregate metrics, not content, and belong on the span per OTEL GenAI semantic conventions. Removing them made saved traces opaque to Jaeger, Tempo, Langfuse, Grafana Cloud AI, Honeycomb, and every other OTEL-aware backend that expects gen_ai.usage.* to display cost / tokens.
The "avoid duplication" argument was weak — both the span attribute and the trajectory property are derived from the same Trajectory object in the same function call; they cannot diverge.
Better separation:
- Span: metadata + aggregates (runner, model, IDs, latency, token totals, turn and tool_call counts).
- Trajectory: per-turn content only (text, tool inputs / outputs, per-turn tokens).
No overlap: totals appear once on the span; per-turn numbers appear once in the trajectory file.
docs/observability.md updated accordingly; the jq example for token totals now reads the trace alone instead of iterating trajectory files.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Aggregates on the root span (queryable by any OTEL UI); per-unit durations on trajectory fields (nullable where the runner's SDK doesn't expose clean callbacks). Same invariant as tokens: no double-accounting.
Span attributes
- agent.duration_ms — all runners, wall-clock of run().
- agent.tool_time_ms — claude-agent only (via PreToolUse/PostToolUse hooks).
- agent.llm_time_ms — plan-execute only (planning + summarisation).
- agent.planning_time_ms, agent.summarization_time_ms — plan-execute only.
Trajectory fields
- Trajectory.started_at — ISO-8601 UTC timestamp (SDK runners).
- TurnRecord.duration_ms — wall-clock per turn (claude-agent only for now).
- ToolCall.duration_ms — wall-clock per tool (claude-agent only for now).
- StepResult.duration_ms — wall-clock per plan step (plan-execute).
Deferred: per-turn / per-tool timing for openai-agent and deep-agent — their SDKs don't expose clean callback surfaces at that granularity. Start with agent.duration_ms on the span; add finer hooks later when needed.
Tests
- 5 new tests verifying nullable defaults on the four duration fields, plus that plan-execute's executor always populates StepResult.duration_ms.
269 tests pass.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
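An illustrative sketch of the wall-clock measurement behind agent.duration_ms; agent_run_span comes from this PR, while the runner object and its _run_inner helper are assumptions.

```python
import time

from observability import agent_run_span  # import path assumed

async def run(runner, question: str):
    start = time.monotonic()
    with agent_run_span(runner=runner.name) as span:
        result = await runner._run_inner(question)  # hypothetical inner loop
        span.set_attribute("agent.duration_ms", (time.monotonic() - start) * 1000.0)
        return result
```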
Adding a PreToolUse hook alongside PostToolUse for per-tool timing broke subprocess launch on user-reported @anthropic-ai/claude-code CLI versions (the subprocess exits with code 1 during config parse). The Python claude_agent_sdk types PreToolUse as valid, but the installed CLI binary is a separate artifact shipped via npm and need not agree.
Revert: only PostToolUse is registered, matching pre-#272 behavior. Per-tool duration_ms and the derived agent.tool_time_ms span attribute are no longer captured for claude-agent — this matches openai-agent and deep-agent, which never had per-tool timing.
Turn-level timing (TurnRecord.duration_ms, measured from AssistantMessage arrival times) and run-level timing (agent.duration_ms) still work — they're computed in-process from observable events without adding subprocess hook registrations.
docs/observability.md: removed agent.tool_time_ms from the span attributes table, updated the trajectory coverage matrix, and noted the compatibility constraint in the SDK-specific section.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
…ctory shape
- Token counts and turn/tool-call totals are set by SDK runners only
(claude-agent, openai-agent, deep-agent), not plan-execute.
- plan-execute persists trajectory as a flat list of StepResult, not the
{started_at, turns} object; document both shapes explicitly.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
LLMBackend.generate() returns only text; tokens are absent from both span and trajectory for plan-execute — not recoverable from either artifact.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
LiteLLM's completion response already carries prompt/completion token
counts; LiteLLMBackend was discarding them. Surface them via a new
LLMBackend.generate_with_usage() returning LLMResult, and wrap the
runner's backend in a _TokenMeter that accumulates across planner,
per-step arg-resolution, and summarise calls. Totals land on
gen_ai.usage.{input,output}_tokens alongside the SDK runners.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
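A sketch of the _TokenMeter idea under stated assumptions: LLMResult and generate_with_usage are named in this commit, but their exact shapes and the wrapper's method surface here are illustrative.

```python
from dataclasses import dataclass

@dataclass
class LLMResult:
    text: str
    input_tokens: int = 0
    output_tokens: int = 0

class _TokenMeter:
    """Wraps an LLMBackend and accumulates usage across the planner,
    per-step arg-resolution, and summarise calls."""

    def __init__(self, backend) -> None:
        self._backend = backend
        self.input_tokens = 0
        self.output_tokens = 0

    def generate(self, prompt: str) -> str:
        result: LLMResult = self._backend.generate_with_usage(prompt)
        self.input_tokens += result.input_tokens
        self.output_tokens += result.output_tokens
        return result.text
```

When the root span closes, the accumulated totals land on gen_ai.usage.{input,output}_tokens, matching the SDK runners.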
Drop _TokenMeter / LLMResult / generate_with_usage implementation detail from the user-facing doc — describe only the observable contract (sum across plan / arg-resolution / summarise calls).
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
…bility.md
Brief pointer (two-artifact model + env vars + --run-id/--scenario-id) with a link to the full reference for span attrs, trajectory layout, and replay workflows.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
INSTRUCTIONS.md kept the full per-tool tables for all six servers (110 lines), which dominated the file and pushed agent / observability content below the fold.
- New docs/mcp-servers.md owns the per-server tool tables, env requirements, and direct-launch instructions.
- INSTRUCTIONS.md MCP section is now an at-a-glance overview table (server, tool count, backing service) plus a link.
- Quick Start step 4 reframed from "run servers manually" to "run an agent" — matches actual usage; servers are stdio-spawned on demand.
- Collapsed three duplicate LITELLM_API_KEY / LITELLM_BASE_URL blocks (one per SDK runner) into a single proxy block.
- Trimmed TOC: dropped per-server and per-agent sub-bullets that just duplicated section headers.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
The four runner boxes restated what each per-agent section above already covers (planner / executor / summariser stages, SDK loop descriptions, trajectory collection). Keep just the runner names plus the agent → MCP-server fan-out so the diagram shows topology, not duplicated prose.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Plan-Execute / Claude / OpenAI / Deep each had a near-identical ~50-line section: one-line description, "How it works" diagram, flags table, model-prefix table, examples. Most of that was duplicated across all four — same --show-trajectory / --json / --verbose flags, same litellm_proxy/ prefix table, same "uv run X $query" CLI shape.
Replace with one Agents section:
- Comparison table (runner, source, loop, default model).
- Shared flags and runner-specific flags split into two tables so each flag appears once.
- Single model-prefix table covering all runners.
- Plan-Execute's loop kept as the only diagram (the SDK runners delegate to upstream; their loops belong in the SDK docs, not here).
- Consolidated example block hits each runner's distinguishing flag.
INSTRUCTIONS.md: 481 → 313 lines.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
The flag exposed an internal customisation hook (override the MCP server registry per-run from the command line). No benchmark scenario or doc example uses it, and PlanExecuteRunner.__init__ already accepts server_paths programmatically for callers that actually need the override.
- src/agent/cli.py: remove the argparse entry, the _parse_servers helper, the unused Path import, and the pass-through into PlanExecuteRunner.
- INSTRUCTIONS.md: drop the row from the runner-specific flags table.
102 agent + observability tests pass; no test exercised the flag.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
The runner table's 'Loop' column already names the four loop styles; the ASCII diagram restated plan-execute's stages in prose without adding architectural information not already in src/agent/plan_execute/. Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
The provider-prefix → env-var mapping is already covered by the Environment Variables section above (WatsonX and LiteLLM proxy blocks); the third row was the only new info and is sufficiently implied by the runner table's default model column. Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
The previous version repeated the same 'uv run pytest src/ -v' command in two subsections (top-level + 'Integration tests'), enumerated six near-identical per-server commands (and missed vibration), and split work-order integration tests into their own subsection for no clear reason.
- Lead with the two commands actually worth memorising: unit-only (-k "not integration") and the full suite.
- Single table mapping suite → skip-unless condition (CouchDB up, WatsonX env, TSFM paths) — replaces the bullet list, the per-server command list, and the two integration subsections.
- Three narrowing examples: path, single file, -k pattern.
Verified -k "integration" actually collects 50/320 tests; dropped the -m requires_couchdb example that didn't work because those are skipif marks, not collection markers.
47 → 19 lines.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
DhavalRepo18
approved these changes
Apr 27, 2026
Adds an evaluation-records persistence layer for AssetOpsBench: every agent run emits two artifacts joined by run_id — an OpenTelemetry root span (metadata + aggregate metrics) and a per-run trajectory JSON (per-turn content) — both written directly by the agent process. No Docker, no Collector, no live backend required. Closes #270.
Motivation
Per-run token usage and tool-call detail used to live only in the in-process AgentResult.trajectory object and vanish when the command exited. For a benchmark we need durable records that survive runs, are keyed to a scenario, and can be analysed offline or replayed into an observability backend later. Rather than invent a custom JSON schema, this PR uses OTLP — the format every OTEL backend already speaks — for the trace half, and a small companion JSON file for the per-turn content.
Two-artifact model
Spans and trajectories carry disjoint data:
- Span (OTEL_TRACES_FILE=…/traces.jsonl) — root span per run with metadata + aggregate metrics: runner, model, IDs, span duration, token totals, turn / tool-call counts, plus auto-instrumented HTTPX child spans per outbound LiteLLM request.
- Trajectory (AGENT_TRAJECTORY_DIR=…) — {run_id}.json per run with per-turn content: turn text, tool call inputs / outputs, per-turn token usage and timing.
Aggregate numbers live on the span; per-turn numbers live in the trajectory. Nothing is repeated.
What lands
Instrumentation (src/observability/, src/agent/*/runner.py)
- init_tracing(service_name) wires a global TracerProvider plus HTTPXClientInstrumentor (auto-propagates traceparent to the LiteLLM proxy). When OTEL_TRACES_FILE is set, attaches the in-process OTLPJsonFileExporter (canonical OTLP-JSON, replayable via the OTel Collector's otlpjsonfile receiver). When OTEL_EXPORTER_OTLP_ENDPOINT is set, attaches the OTLP/HTTP exporter. Either / both / neither — all valid.
- agent_run_span(...) wraps every runner's run() with a root span carrying GenAI semconv attributes (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens) plus agent.runner / agent.turns / agent.tool_calls / agent.question.length / agent.answer.length / agent.duration_ms / agent.run_id / agent.scenario_id. plan-execute additionally records agent.plan.steps / agent.planning_time_ms / agent.summarization_time_ms / agent.llm_time_ms.
- set_run_context(run_id=..., scenario_id=...) seeds contextvars so the runner signature stays clean.
- persist_trajectory(...) writes {AGENT_TRAJECTORY_DIR}/{run_id}.json per run with the runner's full Trajectory (SDK runners) or list[StepResult] (plan-execute).
- atexit.register(provider.shutdown) ensures the BatchSpanProcessor flushes the final root span on CLI exit.
Token usage on plan-execute — LLMBackend.generate_with_usage() returns an LLMResult (text + input/output tokens). LiteLLMBackend populates it from response.usage; mocks default to zero. A _TokenMeter wrapper inside the plan-execute runner accumulates across planning, per-step arg-resolution, and summarisation calls so plan-execute spans now report gen_ai.usage.* alongside the SDK runners.
Timing metrics
- agent.duration_ms (all), agent.planning_time_ms / agent.summarization_time_ms / agent.llm_time_ms (plan-execute). Trajectory.started_at (SDK runners), TurnRecord.duration_ms (claude-agent), StepResult.duration_ms (plan-execute).
- Per-tool timing intentionally not captured — adding the PreToolUse hook to claude-agent broke compatibility with several @anthropic-ai/claude-code CLI versions; openai-agent / deep-agent SDKs don't expose clean per-tool callback surfaces either.
CLI flags — --run-id (auto-UUID4 if omitted) and --scenario-id on every entry point, seeded into the run context before dispatch.
Docs (docs/observability.md) — full enable → persist → query → replay workflow, span attribute table with per-runner coverage, trajectory layout for both shapes, jq recipes, log rotation guidance, optional Jaeger / Collector replay sections.
Bundled refactor (merged in from the now-closed #273 / #274 / additional commits on this branch) — consolidates what were previously copies across runners:
- src/agent/models.py — canonical ToolCall / TurnRecord / Trajectory.
- src/agent/_litellm.py — LITELLM_PREFIX + resolve_model().
- src/agent/_prompts.py — shared AGENT_SYSTEM_PROMPT.
- src/agent/_cli_common.py — setup_logging / add_common_args / print_result / run_sdk_cli; each SDK CLI's main() is one line.
- src/agent/runner.py — DEFAULT_SERVER_PATHS lives on the base; AgentRunner.__init__ resolves server_paths into a concrete dict once.
- openai_agent — custom _managed_servers replaced with stdlib contextlib.AsyncExitStack.
- deep_agent — _chat_model is now a cached_property so the LangChain client is built once per runner instance.
- LiteLLMBackend — extracted _WATSONX_PREFIX constant.
- Six duplicated _resolve_model tests collapsed into one parametrized suite.
Dependencies
New optional group [dependency-groups.otel] keeps the base install lean. Enable with uv sync --group otel. Trajectories need no extra deps. Both env vars are optional — runs work normally with zero persistence overhead when neither is set.
Test plan
- uv run pytest src/ -k "not integration" — 270 pass, 50 deselected, 0 failures.
- Tests cover run_id / scenario_id kwargs and contextvar precedence, nullable timing defaults, and plan-execute token accumulation across planner + arg-resolution + summarise calls.
- uv run {plan-execute,claude-agent,openai-agent,deep-agent} --help all render --run-id / --scenario-id.
- OTEL_TRACES_FILE=./traces/traces.jsonl AGENT_TRAJECTORY_DIR=./traces/trajectories uv run deep-agent --run-id bench-001 "..." produces both artifacts; jq recipes from the docs return the expected metadata / per-turn content.
- Saved traces replay into jaegertracing/all-in-one.
Counter/Histograminstruments, add them as a follow-up.src/tmp/cleanup — owner handling separately.