feat: save agent traces to disk in OTLP-JSON for benchmark evaluation #272
Merged
DhavalRepo18 merged 28 commits into main on Apr 27, 2026
Conversation
Adds opt-in OpenTelemetry tracing so each agent run emits a single correlated trace spanning the agent graph, every LLM call, and the LiteLLM proxy.
- src/observability/ package exposes init_tracing() + agent_run_span() helpers; no-op when OTEL env is not configured.
- All four runners (plan-execute, claude-agent, openai-agent, deep-agent) wrap run() in a root span with gen_ai.* semconv attributes and trajectory-derived usage totals.
- httpx auto-instrumentation propagates traceparent to the LiteLLM proxy so its spans nest under the agent trace.
- Tests use InMemorySpanExporter; no collector required.
Closes #270
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
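A minimal sketch of how a runner might wire these helpers. init_tracing and agent_run_span are named in this commit, but the signatures, the attribute values, and the _agent_loop helper below are illustrative assumptions, not the merged code.

```python
from observability import agent_run_span, init_tracing

init_tracing(service_name="plan-execute")  # no-op when no OTEL env var is configured

async def run(question: str):
    with agent_run_span(runner="plan-execute", model="litellm_proxy/...") as span:
        result = await _agent_loop(question)  # hypothetical inner loop
        # usage totals are derived from the collected trajectory, not re-counted
        span.set_attribute("gen_ai.usage.input_tokens", result.input_tokens)
        return result
```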
The three SDK runners (claude-agent, openai-agent, deep-agent) each
shipped byte-identical Trajectory/TurnRecord/ToolCall dataclasses and
their own _resolve_model / _LITELLM_PREFIX helpers. Consolidates into:
- src/agent/models.py: canonical ToolCall, TurnRecord, Trajectory
alongside the existing AgentResult.
- src/agent/_litellm.py: shared LITELLM_PREFIX + resolve_model().
- Removed src/agent/{claude,openai,deep}_agent/models.py.
- Collapsed six duplicated per-runner _resolve_model tests into one
parametrized suite at src/agent/tests/test_litellm.py.
Net -110 lines, no behaviour change.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
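For context, an illustrative shape of the consolidated models in src/agent/models.py; the class names come from this commit, while the exact fields shown are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    input: dict
    output: str | None = None

@dataclass
class TurnRecord:
    text: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    input_tokens: int = 0   # per-turn usage, recorded by the SDK runners
    output_tokens: int = 0

@dataclass
class Trajectory:
    turns: list[TurnRecord] = field(default_factory=list)
```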
- src/agent/_prompts.py: single AGENT_SYSTEM_PROMPT used by the three SDK runners (claude-agent, openai-agent, deep-agent). plan_execute keeps its own planning/summarisation prompts.
- src/agent/_cli_common.py: setup_logging, add_common_args, print_trajectory, print_answer, print_result. The three SDK CLIs now only encode their prog name, default model, epilog text, and runner-specific arg (--max-turns vs --recursion-limit). A sketch of the shared-args helper follows below.
- Extract _WATSONX_PREFIX constant in LiteLLMBackend.
Net -110 lines; each CLI shrinks from ~140 LoC to ~60 LoC.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
refactor: share system prompt and CLI boilerplate
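A minimal sketch of the shared CLI surface, assuming argparse. The flag set (--show-trajectory / --json / --verbose, plus --max-turns as a runner-specific example) is taken from this PR's docs changes; the positional argument name and help strings are assumptions.

```python
import argparse

def add_common_args(parser: argparse.ArgumentParser) -> None:
    # Flags shared by every runner CLI; runner-specific flags stay in each CLI.
    parser.add_argument("query", help="question to ask the agent")
    parser.add_argument("--show-trajectory", action="store_true")
    parser.add_argument("--json", action="store_true")
    parser.add_argument("--verbose", action="store_true")

# Each SDK CLI then only encodes its own identity:
parser = argparse.ArgumentParser(prog="claude-agent", epilog="…")
parser.add_argument("--max-turns", type=int, default=None)  # runner-specific flag
add_common_args(parser)
```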
- src/agent/_cli_common.py: new run_sdk_cli(service_name, build_parser, run_coro) that bundles dotenv → parse → logging → init_tracing → asyncio.run. The three SDK main() bodies shrink from 9 lines each to one. A sketch of the bundling follows below.
- DeepAgentRunner._chat_model is now a cached_property so _build_chat_model runs once per runner instance instead of once per run(). Matches the ClaudeAgentRunner / OpenAIAgentRunner pattern of pre-building per-instance config, with lazy init so constructor tests don't need env set.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
refactor: deduplicate models, LiteLLM helpers, prompts, and CLI across agent runners
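A sketch of run_sdk_cli's bundling under the names given in this commit; the observability import path, the setup_logging body, and the parameter handling are assumptions.

```python
import asyncio
import logging

from dotenv import load_dotenv  # python-dotenv

from observability import init_tracing  # import path assumed

def setup_logging(verbose: bool) -> None:
    # Stand-in for the shared helper defined earlier in _cli_common.
    logging.basicConfig(level=logging.DEBUG if verbose else logging.INFO)

def run_sdk_cli(service_name, build_parser, run_coro) -> None:
    load_dotenv()                       # 1. dotenv
    args = build_parser().parse_args()  # 2. parse
    setup_logging(args.verbose)         # 3. logging
    init_tracing(service_name)          # 4. tracing
    asyncio.run(run_coro(args))         # 5. run the SDK runner coroutine
```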
- AgentRunner.__init__ now resolves server_paths into self._server_paths (always a concrete dict). DEFAULT_SERVER_PATHS moves from plan_execute/executor.py to agent/runner.py so the base runner owns its own default; the three SDK runners drop their duplicated _resolved_server_paths attribute and the cross-package import.
- Replace openai_agent._managed_servers (a custom 20-line async context manager) with stdlib contextlib.AsyncExitStack. Enters each MCPServerStdio once and closes them in LIFO order on success or exception. Three test sites that mocked the removed class now run against the real stack with an empty server list.
Net -30 lines, no behavior change.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
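A sketch of the AsyncExitStack pattern this commit adopts, assuming MCPServerStdio is the openai-agents SDK's async context manager (import path and params shape assumed); the surrounding function is illustrative.

```python
from contextlib import AsyncExitStack

from agents.mcp import MCPServerStdio  # openai-agents SDK; path assumed

async def run_with_servers(server_configs, run_inner):
    async with AsyncExitStack() as stack:
        servers = []
        for cfg in server_configs:
            # Enter each MCPServerStdio once; the stack closes them in LIFO
            # order whether run_inner succeeds or raises.
            servers.append(await stack.enter_async_context(MCPServerStdio(params=cfg)))
        return await run_inner(servers)
```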
The observability layer was originally shaped around live tracing (ship spans to a collector, inspect in Jaeger). That's the wrong primary story for a benchmark — what we actually need is per-run records saved to disk in a standard format for later analysis and replay. This commit reorients without ripping anything out:
- agent_run_span() now accepts run_id / scenario_id kwargs and additionally reads from ambient contextvars set via the new set_run_context(). The CLI layer seeds the contextvars, so runners need no signature changes.
- All SDK CLIs and plan-execute gain --run-id (auto-UUID4 when omitted) and --scenario-id, both recorded as root-span attributes (agent.run_id, agent.scenario_id).
- otel-collector.yaml: drop-in Collector config that persists spans to ./traces/traces.jsonl in OTLP-JSON (the canonical, replayable format every OTEL backend can ingest).
- docs/observability.md: full workflow for save-then-replay, live Jaeger as a secondary option, jq recipes, and troubleshooting.
Net +149 LoC across Python + 2 new files (YAML + docs). Tests cover the new kwargs, contextvars, and precedence (kwarg > context).
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
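A minimal sketch of the contextvar-seeding pattern at this point in the PR (a later commit below drops the kwargs path and keeps only the contextvars). set_run_context and agent_run_span are named in this commit; the variable names and precedence wiring here are illustrative.

```python
from contextlib import contextmanager
from contextvars import ContextVar

from opentelemetry import trace

# Ambient run context; the CLI seeds these once before dispatching the runner.
_run_id: ContextVar[str | None] = ContextVar("run_id", default=None)
_scenario_id: ContextVar[str | None] = ContextVar("scenario_id", default=None)

def set_run_context(run_id: str | None = None, scenario_id: str | None = None) -> None:
    if run_id is not None:
        _run_id.set(run_id)
    if scenario_id is not None:
        _scenario_id.set(scenario_id)

@contextmanager
def agent_run_span(runner: str, run_id: str | None = None, scenario_id: str | None = None):
    tracer = trace.get_tracer("observability")
    with tracer.start_as_current_span("agent.run") as span:
        span.set_attribute("agent.runner", runner)
        rid = run_id or _run_id.get()          # explicit kwarg wins over context
        sid = scenario_id or _scenario_id.get()
        if rid:
            span.set_attribute("agent.run_id", rid)
        if sid:
            span.set_attribute("agent.scenario_id", sid)
        yield span
```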
Drop-in Collector config for trace persistence to disk (OTLP-JSON via the file exporter), plus full user-facing docs for enable/persist/replay/troubleshoot flows. These files were intended to land with the previous commit but were missed by "git add -u" since they were untracked.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Review feedback (see PR #272 discussion) surfaced several over-engineered helpers and one real bug.
Bug fix
- tracing.py: BatchSpanProcessor buffers spans, and the final agent run's root span could be dropped on CLI exit. Register atexit.register(provider.shutdown) so the exporter flushes.
Drops with no behavior change
- Custom _NoopTracer / _NoopSpan shims + try/except ImportError around the OTEL import. opentelemetry.trace.get_tracer() already returns a ProxyTracer with NonRecordingSpan no-ops when no provider is installed, and opentelemetry-api is a hard transitive dep.
- _safe_getattr_int / _set_error_status helpers (each used once, now inline).
- annotate_result helper: inlined into all four runners; each runner already knows its Trajectory shape.
- agent_run_span run_id / scenario_id kwargs: no production caller uses them; contextvar-sourced values are the only real path.
- _reset_for_tests public symbol: tests monkeypatch _initialized directly.
- src/observability/attributes.py module: every constant was used once inside runspan.py; inlined the six literal strings.
- Alias table in _system_from_model: speculative aws/gcp/bedrock/vertex_ai → anthropic mappings that were never emitted. Kept the aws alias that the repo actually uses.
Reuse / consistency
- plan-execute CLI now uses _cli_common (setup_logging, add_common_args, HR, run_sdk_cli). Drops its private _setup_logging, LOG_FORMAT, LOG_DATE_FORMAT, and duplicate main() body.
Perf
- ClaudeAgentRunner caches the mcp_servers dict in __init__ (matches the deep-agent cached_property pattern). openai-agent intentionally skipped: its MCPServerStdio instances are entered/exited per run and can't be safely cached.
Net -347 lines, 255 tests pass.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
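A sketch of the flush-on-exit fix, assuming init_tracing builds the provider roughly like this; the stand-in exporter is illustrative (the real code wires the file / OTLP exporters described in the next commit).

```python
import atexit

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

def init_tracing(service_name: str) -> None:
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))  # stand-in
    trace.set_tracer_provider(provider)
    # BatchSpanProcessor buffers spans; without this, the last root span can be
    # dropped when the CLI process exits before the batch flushes.
    atexit.register(provider.shutdown)
```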
Observability no longer requires running a Docker Collector. Set OTEL_TRACES_FILE=./traces/traces.jsonl and each span batch is appended to disk in canonical OTLP-JSON — the same format the OpenTelemetry Collector's file exporter produces, so saved traces remain replayable into any OTLP backend (Jaeger, Tempo, Honeycomb, …) via the Collector's otlpjsonfile receiver when network access is available.
Changes
- src/observability/file_exporter.py: OTLPJsonFileExporter — thread-safe append writer backed by the OTLP common trace encoder.
- tracing.py: init_tracing now wires the file exporter when OTEL_TRACES_FILE is set. HTTP and file exporters are independent; set either or both.
- Drop otel-collector.yaml — the Collector is now optional, not the primary story.
- docs/observability.md rewritten around the file-exporter workflow; live Jaeger demoted to a side note.
- Tests cover encoding, directory creation, append semantics, and the file-only enablement path.
259 tests pass.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
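A sketch of an append-only OTLP-JSON exporter in the spirit of this commit, assuming the encoder from opentelemetry-exporter-otlp-proto-common (the "OTLP common trace encoder" the commit mentions); the real OTLPJsonFileExporter may differ in details.

```python
import json
import threading
from pathlib import Path
from typing import Sequence

from google.protobuf.json_format import MessageToDict
from opentelemetry.exporter.otlp.proto.common.trace_encoder import encode_spans
from opentelemetry.sdk.trace import ReadableSpan
from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult

class OTLPJsonFileExporter(SpanExporter):
    def __init__(self, path: str) -> None:
        self._path = Path(path)
        self._path.parent.mkdir(parents=True, exist_ok=True)
        self._lock = threading.Lock()  # BatchSpanProcessor flushes from a worker thread

    def export(self, spans: Sequence[ReadableSpan]) -> SpanExportResult:
        # One OTLP ExportTraceServiceRequest per batch, one JSON object per line.
        payload = MessageToDict(encode_spans(spans))
        line = json.dumps(payload, separators=(",", ":"))
        with self._lock, self._path.open("a", encoding="utf-8") as fh:
            fh.write(line + "\n")
        return SpanExportResult.SUCCESS

    def shutdown(self) -> None:
        pass
```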
Clean separation between the two observability artifacts so each has a
single responsibility and neither duplicates information held by the
other.
* **Span** (traces.jsonl) carries *metadata*: runner, model, run_id,
scenario_id, question/answer lengths, timing. The previously
duplicated gen_ai.usage.input_tokens / gen_ai.usage.output_tokens /
agent.turns / agent.tool_calls attributes are removed — those values
are derived from the trajectory and adding them to the span was
redundant work + divergence risk.
* **Trajectory** (AGENT_TRAJECTORY_DIR/{run_id}.json) carries *content*:
per-turn text, tool call inputs/outputs, per-turn token usage. New
persistence module writes one file per run when the env var is set.
Joins to the trace by run_id.
Changes
- src/observability/persistence.py: persist_trajectory() — reads
run_id / scenario_id from the same contextvars used by agent_run_span,
so no public signature change on runners. Handles both SDK runners'
Trajectory dataclass and plan-execute's list[StepResult].
- All four runners call persist_trajectory() after building AgentResult
and drop the four derived span attributes.
- docs/observability.md rewritten around the two-artifact model; jq
examples split into metadata-via-trace and content-via-trajectory.
- .gitignore: traces/
Tests cover: disabled-without-env no-op, happy path, list trajectory
(plan-execute shape), missing run_id warn-and-skip, nested dir creation.
264 tests pass.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
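A sketch of the persist_trajectory() idea described above, assuming dataclass trajectories and a run_id contextvar shared with agent_run_span; the exact serialisation is illustrative. It mirrors the tested behaviours: no-op without the env var, warn-and-skip on missing run_id, nested directory creation, and both trajectory shapes.

```python
import dataclasses
import json
import logging
import os
from contextvars import ContextVar
from pathlib import Path

log = logging.getLogger(__name__)
_run_id: ContextVar[str | None] = ContextVar("run_id", default=None)  # shared with agent_run_span

def persist_trajectory(trajectory) -> None:
    out_dir = os.environ.get("AGENT_TRAJECTORY_DIR")
    if not out_dir:
        return  # disabled without the env var
    run_id = _run_id.get()
    if run_id is None:
        log.warning("AGENT_TRAJECTORY_DIR set but no run_id in context; skipping")
        return
    # SDK runners pass a Trajectory dataclass; plan-execute passes list[StepResult].
    if dataclasses.is_dataclass(trajectory):
        payload = dataclasses.asdict(trajectory)
    else:
        payload = [dataclasses.asdict(step) for step in trajectory]
    path = Path(out_dir) / f"{run_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
```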
Reversing the "pure metadata on span" framing from the prior commit. Token totals and turn/tool_call counts are aggregate metrics, not content, and belong on the span per OTEL GenAI semantic conventions. Removing them made saved traces opaque to Jaeger, Tempo, Langfuse, Grafana Cloud AI, Honeycomb, and every other OTEL-aware backend that expects gen_ai.usage.* to display cost / tokens.
The "avoid duplication" argument was weak — both the span attribute and the trajectory property are derived from the same Trajectory object in the same function call; they cannot diverge.
Better separation:
- Span: metadata + aggregates (runner, model, IDs, latency, token totals, turn and tool_call counts).
- Trajectory: per-turn content only (text, tool inputs / outputs, per-turn tokens).
No overlap: totals appear once on the span; per-turn numbers appear once in the trajectory file.
docs/observability.md updated accordingly; the jq example for token totals now reads the trace alone instead of iterating trajectory files.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Aggregates on the root span (queryable by any OTEL UI); per-unit durations on trajectory fields (nullable where the runner's SDK doesn't expose clean callbacks). Same invariant as tokens: no double-accounting.
Span attributes
- agent.duration_ms — all runners, wall-clock of run().
- agent.tool_time_ms — claude-agent only (via PreToolUse/PostToolUse hooks).
- agent.llm_time_ms — plan-execute only (planning + summarisation).
- agent.planning_time_ms, agent.summarization_time_ms — plan-execute only.
Trajectory fields
- Trajectory.started_at — ISO-8601 UTC timestamp (SDK runners).
- TurnRecord.duration_ms — wall-clock per turn (claude-agent only for now).
- ToolCall.duration_ms — wall-clock per tool (claude-agent only for now).
- StepResult.duration_ms — wall-clock per plan step (plan-execute).
Deferred: per-turn / per-tool timing for openai-agent and deep-agent — their SDKs don't expose clean callback surfaces at that granularity. Start with agent.duration_ms on the span; add finer hooks later when needed.
Tests
- 5 new tests verifying nullable defaults on the four duration fields, plus that plan-execute's executor always populates StepResult.duration_ms.
269 tests pass.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
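An illustrative sketch of the wall-clock measurement behind agent.duration_ms; agent_run_span comes from this PR, while the runner object and its _run_inner helper are assumptions.

```python
import time

from observability import agent_run_span  # import path assumed

async def run(runner, question: str):
    start = time.monotonic()
    with agent_run_span(runner=runner.name) as span:
        result = await runner._run_inner(question)  # hypothetical inner loop
        span.set_attribute("agent.duration_ms", (time.monotonic() - start) * 1000.0)
        return result
```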
Adding a PreToolUse hook alongside PostToolUse for per-tool timing broke subprocess launch on user-reported @anthropic-ai/claude-code CLI versions (the subprocess exits with code 1 during config parse). The Python claude_agent_sdk types PreToolUse as valid, but the installed CLI binary is a separate artifact shipped via npm and need not agree.
Revert: only PostToolUse is registered, matching pre-#272 behavior. Per-tool duration_ms and the derived agent.tool_time_ms span attribute are no longer captured for claude-agent — this matches openai-agent and deep-agent, which never had per-tool timing.
Turn-level timing (TurnRecord.duration_ms, measured from AssistantMessage arrival times) and run-level timing (agent.duration_ms) still work — they're computed in-process from observable events without adding subprocess hook registrations.
docs/observability.md: removed agent.tool_time_ms from the span attributes table, updated the trajectory coverage matrix, and noted the compatibility constraint in the SDK-specific section.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
…ctory shape
- Token counts and turn/tool-call totals are set by SDK runners only
(claude-agent, openai-agent, deep-agent), not plan-execute.
- plan-execute persists trajectory as a flat list of StepResult, not the
{started_at, turns} object; document both shapes explicitly.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
LLMBackend.generate() returns only text; tokens are absent from both span and trajectory for plan-execute — not recoverable from either artifact.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
LiteLLM's completion response already carries prompt/completion token
counts; LiteLLMBackend was discarding them. Surface them via a new
LLMBackend.generate_with_usage() returning LLMResult, and wrap the
runner's backend in a _TokenMeter that accumulates across planner,
per-step arg-resolution, and summarise calls. Totals land on
gen_ai.usage.{input,output}_tokens alongside the SDK runners.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
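A sketch of the _TokenMeter idea under stated assumptions: LLMResult and generate_with_usage are named in this commit, but their exact shapes and the wrapper's method surface here are illustrative.

```python
from dataclasses import dataclass

@dataclass
class LLMResult:
    text: str
    input_tokens: int = 0
    output_tokens: int = 0

class _TokenMeter:
    """Wraps an LLMBackend and accumulates usage across the planner,
    per-step arg-resolution, and summarise calls."""

    def __init__(self, backend) -> None:
        self._backend = backend
        self.input_tokens = 0
        self.output_tokens = 0

    def generate(self, prompt: str) -> str:
        result: LLMResult = self._backend.generate_with_usage(prompt)
        self.input_tokens += result.input_tokens
        self.output_tokens += result.output_tokens
        return result.text
```

When the root span closes, the accumulated totals land on gen_ai.usage.{input,output}_tokens, matching the SDK runners.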
Drop _TokenMeter / LLMResult / generate_with_usage implementation detail from the user-facing doc — describe only the observable contract (sum across plan / arg-resolution / summarise calls).
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
…bility.md
Brief pointer (two-artifact model + env vars + --run-id/--scenario-id) with a link to the full reference for span attrs, trajectory layout, and replay workflows.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
INSTRUCTIONS.md kept the full per-tool tables for all six servers (110 lines), which dominated the file and pushed agent / observability content below the fold.
- New docs/mcp-servers.md owns the per-server tool tables, env requirements, and direct-launch instructions.
- INSTRUCTIONS.md MCP section is now an at-a-glance overview table (server, tool count, backing service) plus a link.
- Quick Start step 4 reframed from "run servers manually" to "run an agent" — matches actual usage; servers are stdio-spawned on demand.
- Collapsed three duplicate LITELLM_API_KEY / LITELLM_BASE_URL blocks (one per SDK runner) into a single proxy block.
- Trimmed TOC: dropped per-server and per-agent sub-bullets that just duplicated section headers.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
The four runner boxes restated what each per-agent section above already covers (planner / executor / summariser stages, SDK loop descriptions, trajectory collection). Keep just the runner names plus the agent → MCP-server fan-out so the diagram shows topology, not duplicated prose.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Plan-Execute / Claude / OpenAI / Deep each had a near-identical ~50-line section: one-line description, "How it works" diagram, flags table, model-prefix table, examples. Most of that was duplicated across all four — same --show-trajectory / --json / --verbose flags, same litellm_proxy/ prefix table, same "uv run X $query" CLI shape.
Replace with one Agents section:
- Comparison table (runner, source, loop, default model).
- Shared flags and runner-specific flags split into two tables so each flag appears once.
- Single model-prefix table covering all runners.
- Plan-Execute's loop kept as the only diagram (the SDK runners delegate to upstream; their loops belong in the SDK docs, not here).
- Consolidated example block hits each runner's distinguishing flag.
INSTRUCTIONS.md: 481 → 313 lines.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
The flag exposed an internal customisation hook (override the MCP server registry per-run from the command line). No benchmark scenario or doc example uses it, and PlanExecuteRunner.__init__ already accepts server_paths programmatically for callers that actually need the override.
- src/agent/cli.py: remove the argparse entry, the _parse_servers helper, the unused Path import, and the pass-through into PlanExecuteRunner.
- INSTRUCTIONS.md: drop the row from the runner-specific flags table.
102 agent + observability tests pass; no test exercised the flag.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
The runner table's 'Loop' column already names the four loop styles; the ASCII diagram restated plan-execute's stages in prose without adding architectural information not already in src/agent/plan_execute/. Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
The provider-prefix → env-var mapping is already covered by the Environment Variables section above (WatsonX and LiteLLM proxy blocks); the third row was the only new info and is sufficiently implied by the runner table's default model column. Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
The previous version repeated the same 'uv run pytest src/ -v' command in two subsections (top-level + 'Integration tests'), enumerated six near-identical per-server commands (and missed vibration), and split work-order integration tests into their own subsection for no clear reason.
- Lead with the two commands actually worth memorising: unit-only (-k "not integration") and the full suite.
- Single table mapping suite → skip-unless condition (CouchDB up, WatsonX env, TSFM paths) — replaces the bullet list, the per-server command list, and the two integration subsections.
- Three narrowing examples: path, single file, -k pattern.
Verified -k "integration" actually collects 50/320 tests; dropped the -m requires_couchdb example that didn't work because those are skipif marks, not collection markers.
47 → 19 lines.
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
DhavalRepo18
approved these changes
Apr 27, 2026
Adds an evaluation-records persistence layer for AssetOpsBench: every agent run emits two artifacts joined by run_id — an OpenTelemetry root span (metadata + aggregate metrics) and a per-run trajectory JSON (per-turn content) — both written directly by the agent process. No Docker, no Collector, no live backend required. Closes #270.
Motivation
Per-run token usage and tool-call detail used to live only in the in-process AgentResult.trajectory object and vanish when the command exited. For a benchmark we need durable records that survive runs, are keyed to a scenario, and can be analysed offline or replayed into an observability backend later. Rather than invent a custom JSON schema, this PR uses OTLP — the format every OTEL backend already speaks — for the trace half, and a small companion JSON file for the per-turn content.
Two-artifact model
Spans and trajectories carry disjoint data:
- Span (OTEL_TRACES_FILE=…/traces.jsonl) — root span per run with metadata + aggregate metrics: runner, model, IDs, span duration, token totals, turn / tool-call counts, plus auto-instrumented HTTPX child spans per outbound LiteLLM request.
- Trajectory (AGENT_TRAJECTORY_DIR=…) — {run_id}.json per run with per-turn content: turn text, tool call inputs / outputs, per-turn token usage and timing.
Aggregate numbers live on the span; per-turn numbers live in the trajectory. Nothing is repeated.
What lands
Instrumentation (src/observability/, src/agent/*/runner.py)
- init_tracing(service_name) wires a global TracerProvider plus HTTPXClientInstrumentor (auto-propagates traceparent to the LiteLLM proxy). When OTEL_TRACES_FILE is set, attaches the in-process OTLPJsonFileExporter (canonical OTLP-JSON, replayable via the OTel Collector's otlpjsonfile receiver). When OTEL_EXPORTER_OTLP_ENDPOINT is set, attaches the OTLP/HTTP exporter. Either / both / neither — all valid.
- agent_run_span(...) wraps every runner's run() with a root span carrying GenAI semconv attributes (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens) plus agent.runner / agent.turns / agent.tool_calls / agent.question.length / agent.answer.length / agent.duration_ms / agent.run_id / agent.scenario_id. plan-execute additionally records agent.plan.steps / agent.planning_time_ms / agent.summarization_time_ms / agent.llm_time_ms.
- set_run_context(run_id=..., scenario_id=...) seeds contextvars so the runner signature stays clean.
- persist_trajectory(...) writes {AGENT_TRAJECTORY_DIR}/{run_id}.json per run with the runner's full Trajectory (SDK runners) or list[StepResult] (plan-execute).
- atexit.register(provider.shutdown) ensures the BatchSpanProcessor flushes the final root span on CLI exit.
Token usage on plan-execute — LLMBackend.generate_with_usage() returns an LLMResult (text + input/output tokens). LiteLLMBackend populates it from response.usage; mocks default to zero. A _TokenMeter wrapper inside the plan-execute runner accumulates across planning, per-step arg-resolution, and summarisation calls so plan-execute spans now report gen_ai.usage.* alongside the SDK runners.
Timing metrics
- agent.duration_ms (all), agent.planning_time_ms / agent.summarization_time_ms / agent.llm_time_ms (plan-execute). Trajectory.started_at (SDK runners), TurnRecord.duration_ms (claude-agent), StepResult.duration_ms (plan-execute).
- Per-tool timing intentionally not captured — adding the PreToolUse hook to claude-agent broke compatibility with several @anthropic-ai/claude-code CLI versions; openai-agent / deep-agent SDKs don't expose clean per-tool callback surfaces either.
CLI flags — --run-id (auto-UUID4 if omitted) and --scenario-id on every entry point, seeded into the run context before dispatch.
Docs (docs/observability.md) — full enable → persist → query → replay workflow, span attribute table with per-runner coverage, trajectory layout for both shapes, jq recipes, log rotation guidance, optional Jaeger / Collector replay sections.
Bundled refactor (merged in from the now-closed #273 / #274 / additional commits on this branch) — consolidates what were previously copies across runners:
- src/agent/models.py — canonical ToolCall / TurnRecord / Trajectory.
- src/agent/_litellm.py — LITELLM_PREFIX + resolve_model().
- src/agent/_prompts.py — shared AGENT_SYSTEM_PROMPT.
- src/agent/_cli_common.py — setup_logging / add_common_args / print_result / run_sdk_cli; each SDK CLI's main() is one line.
- src/agent/runner.py — DEFAULT_SERVER_PATHS lives on the base; AgentRunner.__init__ resolves server_paths into a concrete dict once.
- openai_agent — custom _managed_servers replaced with stdlib contextlib.AsyncExitStack.
- deep_agent — _chat_model is now a cached_property so the LangChain client is built once per runner instance.
- LiteLLMBackend — extracted _WATSONX_PREFIX constant.
- Six duplicated _resolve_model tests collapsed into one parametrized suite.
Dependencies
New optional group [dependency-groups.otel] keeps the base install lean. Enable with uv sync --group otel. Trajectories need no extra deps. Both env vars are optional — runs work normally with zero persistence overhead when neither is set.
Test plan
- uv run pytest src/ -k "not integration" — 270 pass, 50 deselected, 0 failures.
- Tests cover run_id / scenario_id kwargs and contextvar precedence, nullable timing defaults, and plan-execute token accumulation across planner + arg-resolution + summarise calls.
- uv run {plan-execute,claude-agent,openai-agent,deep-agent} --help all render --run-id / --scenario-id.
- OTEL_TRACES_FILE=./traces/traces.jsonl AGENT_TRAJECTORY_DIR=./traces/trajectories uv run deep-agent --run-id bench-001 "..." produces both artifacts; jq recipes from the docs return the expected metadata / per-turn content.
- Saved traces replay into jaegertracing/all-in-one.
Counter/Histograminstruments, add them as a follow-up.src/tmp/cleanup — owner handling separately.