
feat: save agent traces to disk in OTLP-JSON for benchmark evaluation#272

Merged
DhavalRepo18 merged 28 commits into main from
feat/otel-observability
Apr 27, 2026
Conversation

@ShuxinLin ShuxinLin commented Apr 23, 2026

Adds an evaluation-records persistence layer for AssetOpsBench: every agent run emits two artifacts joined by run_id — an OpenTelemetry root span (metadata + aggregate metrics) and a per-run trajectory JSON (per-turn content) — both written directly by the agent process. No Docker, no Collector, no live backend required.

Closes #270.

Motivation

Per-run token usage and tool-call detail used to live only in the in-process AgentResult.trajectory object and vanish when the command exited. For a benchmark we need durable records that survive runs, are keyed to a scenario, and can be analysed offline or replayed into an observability backend later. Rather than invent a custom JSON schema, this PR uses OTLP — the format every OTEL backend already speaks — for the trace half, and a small companion JSON file for the per-turn content.

Two-artifact model

Spans and trajectories carry disjoint data:

  • Trace (OTEL_TRACES_FILE=…/traces.jsonl) — root span per run with metadata + aggregate metrics: runner, model, IDs, span duration, token totals, turn / tool-call counts, plus auto-instrumented HTTPX child spans per outbound LiteLLM request.
  • Trajectory (AGENT_TRAJECTORY_DIR=…) — {run_id}.json per run with per-turn content: turn text, tool call inputs / outputs, per-turn token usage and timing.

Aggregate numbers live on the span; per-turn numbers live in the trajectory. Nothing is repeated.
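The run_id join between the two artifacts can be sketched in a few lines of Python. This is a hedged illustration, not code from the PR: it assumes the layout described above (newline-delimited OTLP-JSON batches in traces.jsonl carrying an agent.run_id string attribute, and one {run_id}.json per trajectory), and the function name is invented for the example.

```python
import json
from pathlib import Path

def index_runs(traces_file: Path, trajectory_dir: Path) -> dict:
    """Join root spans to trajectory files by run_id (sketch; assumes
    the OTLP-JSON layout and agent.run_id attribute described above)."""
    runs = {}
    for line in traces_file.read_text().splitlines():
        batch = json.loads(line)
        for rs in batch.get("resourceSpans", []):
            for ss in rs.get("scopeSpans", []):
                for span in ss.get("spans", []):
                    attrs = {a["key"]: a["value"].get("stringValue")
                             for a in span.get("attributes", [])}
                    run_id = attrs.get("agent.run_id")
                    if run_id:
                        runs[run_id] = {"span": span, "trajectory": None}
    # Trajectory filenames are the run_id, so the join key is the stem.
    for traj in trajectory_dir.glob("*.json"):
        if traj.stem in runs:
            runs[traj.stem]["trajectory"] = json.loads(traj.read_text())
    return runs
```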

What lands

Instrumentation (src/observability/, src/agent/*/runner.py)

  • init_tracing(service_name) wires a global TracerProvider plus HTTPXClientInstrumentor (auto-propagates traceparent to the LiteLLM proxy). When OTEL_TRACES_FILE is set, attaches the in-process OTLPJsonFileExporter (canonical OTLP-JSON, replayable via the OTel Collector's otlpjsonfile receiver). When OTEL_EXPORTER_OTLP_ENDPOINT is set, attaches the OTLP/HTTP exporter. Either / both / neither — all valid.
  • agent_run_span(...) wraps every runner's run() with a root span carrying GenAI semconv attributes (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens) plus agent.runner / agent.turns / agent.tool_calls / agent.question.length / agent.answer.length / agent.duration_ms / agent.run_id / agent.scenario_id. plan-execute additionally records agent.plan.steps / agent.planning_time_ms / agent.summarization_time_ms / agent.llm_time_ms.
  • set_run_context(run_id=..., scenario_id=...) seeds contextvars so the runner signature stays clean.
  • persist_trajectory(...) writes {AGENT_TRAJECTORY_DIR}/{run_id}.json per run with the runner's full Trajectory (SDK runners) or list[StepResult] (plan-execute).
  • atexit.register(provider.shutdown) ensures the BatchSpanProcessor flushes the final root span on CLI exit.
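The persistence contract above (disabled-without-env no-op, warn-and-skip on missing run_id, nested directory creation) can be sketched with the stdlib alone. This is an illustrative approximation, not the PR's implementation — the real signature and serialisation of Trajectory / list[StepResult] may differ:

```python
import json
import logging
import os
from dataclasses import asdict, is_dataclass
from pathlib import Path

log = logging.getLogger(__name__)

def persist_trajectory(trajectory, run_id):
    """Write {AGENT_TRAJECTORY_DIR}/{run_id}.json (sketch)."""
    out_dir = os.environ.get("AGENT_TRAJECTORY_DIR")
    if not out_dir:
        return None  # persistence disabled: zero-overhead no-op
    if not run_id:
        log.warning("no run_id in context; skipping trajectory persistence")
        return None
    payload = asdict(trajectory) if is_dataclass(trajectory) else trajectory
    path = Path(out_dir) / f"{run_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)  # nested dirs created on demand
    path.write_text(json.dumps(payload, indent=2, default=str))
    return path
```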

Token usage on plan-execute — LLMBackend.generate_with_usage() returns an LLMResult (text + input/output tokens). LiteLLMBackend populates it from response.usage; mocks default to zero. A _TokenMeter wrapper inside the plan-execute runner accumulates usage across planning, per-step arg-resolution, and summarisation calls, so plan-execute spans now report gen_ai.usage.* alongside the SDK runners.
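The metering idea reduces to a small wrapper. A hedged sketch of the _TokenMeter pattern — the real class, LLMResult fields, and backend signature in the PR may differ:

```python
from dataclasses import dataclass

@dataclass
class LLMResult:
    """Text plus usage, as returned by generate_with_usage() (assumed shape)."""
    text: str
    input_tokens: int = 0
    output_tokens: int = 0

class TokenMeter:
    """Wraps a backend and accumulates token totals across planning,
    per-step arg-resolution, and summarisation calls (sketch)."""
    def __init__(self, backend):
        self._backend = backend
        self.input_tokens = 0
        self.output_tokens = 0

    def generate(self, prompt):
        result = self._backend.generate_with_usage(prompt)
        self.input_tokens += result.input_tokens
        self.output_tokens += result.output_tokens
        return result.text
```

At span-close time the accumulated totals would land on gen_ai.usage.input_tokens / gen_ai.usage.output_tokens.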

Timing metrics

  • Span: agent.duration_ms (all), agent.planning_time_ms / agent.summarization_time_ms / agent.llm_time_ms (plan-execute).
  • Trajectory: Trajectory.started_at (SDK runners), TurnRecord.duration_ms (claude-agent), StepResult.duration_ms (plan-execute). Per-tool timing intentionally not captured — adding the PreToolUse hook to claude-agent broke compatibility with several @anthropic-ai/claude-code CLI versions; openai-agent / deep-agent SDKs don't expose clean per-tool callback surfaces either.

CLI flags — --run-id (auto-UUID4 if omitted) and --scenario-id on every entry point, seeded into the run context before dispatch.
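The flag contract can be sketched with argparse and uuid — helper names here are invented for illustration; only the flags and the auto-UUID4 default come from the PR:

```python
import argparse
import uuid

def add_run_identity_args(parser):
    """Attach the shared identity flags (sketch of the CLI contract)."""
    parser.add_argument("--run-id", default=None,
                        help="joins the trace span and trajectory file (auto-UUID4 if omitted)")
    parser.add_argument("--scenario-id", default=None,
                        help="benchmark scenario this run belongs to")

def resolve_run_id(args):
    """Use the given run_id, or mint a fresh UUID4 when omitted."""
    return args.run_id or str(uuid.uuid4())
```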

Docs (docs/observability.md) — full enable → persist → query → replay workflow, span attribute table with per-runner coverage, trajectory layout for both shapes, jq recipes, log rotation guidance, optional Jaeger / Collector replay sections.

Bundled refactor (merged in from the now-closed #273 / #274 / additional commits on this branch) — consolidates what were previously copies across runners:

  • src/agent/models.py — canonical ToolCall / TurnRecord / Trajectory.
  • src/agent/_litellm.py — LITELLM_PREFIX + resolve_model().
  • src/agent/_prompts.py — shared AGENT_SYSTEM_PROMPT.
  • src/agent/_cli_common.py — setup_logging / add_common_args / print_result / run_sdk_cli; each SDK CLI's main() is one line.
  • src/agent/runner.py — DEFAULT_SERVER_PATHS lives on the base; AgentRunner.__init__ resolves server_paths into a concrete dict once.
  • openai_agent — custom _managed_servers replaced with stdlib contextlib.AsyncExitStack.
  • deep_agent_chat_model is now a cached_property so the LangChain client is built once per runner instance.
  • LiteLLMBackend._WATSONX_PREFIX constant.
  • Six duplicated _resolve_model tests collapsed into one parametrized suite.

Dependencies

New optional group [dependency-groups.otel] keeps the base install lean:

opentelemetry-api
opentelemetry-sdk
opentelemetry-exporter-otlp-proto-http
opentelemetry-exporter-otlp-proto-common
opentelemetry-instrumentation-httpx

Enable with uv sync --group otel. Trajectories need no extra deps. Both env vars are optional — runs work normally with zero persistence overhead when neither is set.

Test plan

  • uv run pytest src/ -k "not integration" — 270 pass, 50 deselected, 0 failures.
  • New tests cover: file-exporter encoding / append semantics / directory creation, trajectory persistence (disabled-without-env no-op, happy path, list shape, missing run_id warn-and-skip, nested dir creation), run_id / scenario_id kwargs and contextvar precedence, nullable timing defaults, plan-execute token accumulation across planner + arg-resolution + summarise calls.
  • uv run {plan-execute,claude-agent,openai-agent,deep-agent} --help all render --run-id / --scenario-id.
  • End-to-end smoke: OTEL_TRACES_FILE=./traces/traces.jsonl AGENT_TRAJECTORY_DIR=./traces/trajectories uv run deep-agent --run-id bench-001 "..." produces both artifacts; jq recipes from the docs return the expected metadata / per-turn content.
  • Optional Jaeger replay path verified end-to-end with jaegertracing/all-in-one.

Not included / out of scope

  • OpenTelemetry metrics API — traces carry enough for aggregation (token totals, latency via span duration, counts). If dashboards later need proper Counter / Histogram instruments, add them as a follow-up.
  • Per-turn child spans — only the root span is emitted per run. Per-turn detail lives in the trajectory file.
  • Raw prompt / response text on spans — deliberately omitted (PII risk, attribute truncation). Trajectories carry the text on disk; if you want it in traces, add it explicitly.
  • Per-tool timing across SDK runners — see the PreToolUse compatibility note above; revisit when the SDKs grow stable callback surfaces.
  • src/tmp/ cleanup — owner handling separately.

Adds opt-in OpenTelemetry tracing so each agent run emits a single
correlated trace spanning the agent graph, every LLM call, and the
LiteLLM proxy.

- src/observability/ package exposes init_tracing() + agent_run_span()
  helpers; no-op when OTEL env not configured.
- All four runners (plan-execute, claude-agent, openai-agent, deep-agent)
  wrap run() in a root span with gen_ai.* semconv attributes and
  trajectory-derived usage totals.
- httpx auto-instrumentation propagates traceparent to the LiteLLM proxy
  so its spans nest under the agent trace.
- Tests use InMemorySpanExporter; no collector required.

Closes #270

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
@ShuxinLin ShuxinLin changed the base branch from feat/add-deep-agent-runner to main April 23, 2026 19:16
The three SDK runners (claude-agent, openai-agent, deep-agent) each
shipped byte-identical Trajectory/TurnRecord/ToolCall dataclasses and
their own _resolve_model / _LITELLM_PREFIX helpers.  Consolidates into:

- src/agent/models.py: canonical ToolCall, TurnRecord, Trajectory
  alongside the existing AgentResult.
- src/agent/_litellm.py: shared LITELLM_PREFIX + resolve_model().
- Removed src/agent/{claude,openai,deep}_agent/models.py.
- Collapsed six duplicated per-runner _resolve_model tests into one
  parametrized suite at src/agent/tests/test_litellm.py.

Net -110 lines, no behaviour change.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
- src/agent/_prompts.py: single AGENT_SYSTEM_PROMPT used by the three
  SDK runners (claude-agent, openai-agent, deep-agent).  plan_execute
  keeps its own planning/summarisation prompts.
- src/agent/_cli_common.py: setup_logging, add_common_args,
  print_trajectory, print_answer, print_result.  The three SDK CLIs
  now only encode their prog name, default model, epilog text, and
  runner-specific arg (--max-turns vs --recursion-limit).
- Extract _WATSONX_PREFIX constant in LiteLLMBackend.

Net -110 lines; each CLI shrinks from ~140 LoC to ~60 LoC.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
refactor: share system prompt and CLI boilerplate
- src/agent/_cli_common.py: new run_sdk_cli(service_name, build_parser, run_coro)
  that bundles dotenv → parse → logging → init_tracing → asyncio.run.
  The three SDK main() bodies shrink from 9 lines each to one.
- DeepAgentRunner._chat_model is now a cached_property so _build_chat_model
  runs once per runner instance instead of once per run().  Matches the
  ClaudeAgentRunner / OpenAIAgentRunner pattern of pre-building per-instance
  config, with lazy init so constructor tests don't need env set.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
refactor: deduplicate models, LiteLLM helpers, prompts, and CLI across agent runners
- AgentRunner.__init__ now resolves server_paths into self._server_paths
  (always a concrete dict). DEFAULT_SERVER_PATHS moves from
  plan_execute/executor.py to agent/runner.py so the base runner owns
  its own default; the three SDK runners drop their duplicated
  _resolved_server_paths attribute and the cross-package import.

- Replace openai_agent._managed_servers (custom 20-line async context
  manager) with stdlib contextlib.AsyncExitStack. Enters each
  MCPServerStdio once and closes them in LIFO order on success or
  exception. Three test sites that mocked the removed class now run
  against the real stack with an empty server list.
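The LIFO-close behaviour the commit relies on is plain stdlib. A minimal sketch with a hypothetical stand-in server class (the real code enters MCPServerStdio instances):

```python
import asyncio
import contextlib

class FakeServer:
    """Stand-in for MCPServerStdio; records close order to show LIFO."""
    closed_order = []

    def __init__(self, name):
        self.name = name

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        FakeServer.closed_order.append(self.name)

async def run_with_servers(servers):
    async with contextlib.AsyncExitStack() as stack:
        for server in servers:
            await stack.enter_async_context(server)
        # ... agent work happens here ...
    # the stack closes servers in LIFO order, on success or exception

asyncio.run(run_with_servers([FakeServer("a"), FakeServer("b")]))
```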

Net -30 lines, no behavior change.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
The observability layer was originally shaped around live tracing
(ship spans to a collector, inspect in Jaeger).  That's the wrong
primary story for a benchmark — what we actually need is per-run
records saved to disk in a standard format for later analysis and
replay.  This commit reorients without ripping anything out:

- agent_run_span() now accepts run_id / scenario_id kwargs and
  additionally reads from ambient contextvars set via new
  set_run_context().  The CLI layer seeds the contextvars, so
  runners need no signature changes.
- All SDK CLIs and plan-execute gain --run-id (auto-UUID4 when
  omitted) and --scenario-id, both recorded as root-span attributes
  (agent.run_id, agent.scenario_id).
- otel-collector.yaml: drop-in Collector config that persists spans
  to ./traces/traces.jsonl in OTLP-JSON (the canonical, replayable
  format every OTEL backend can ingest).
- docs/observability.md: full workflow for save-then-replay, live
  Jaeger as a secondary option, jq recipes, and troubleshooting.

Net +149 LoC across Python + 2 new files (YAML + docs).  Tests
cover the new kwargs, contextvars, and precedence (kwarg > context).
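The kwarg-over-contextvar precedence is easy to sketch with stdlib contextvars — an illustration of the mechanism as of this commit, not the PR's exact code (a later commit drops the production kwargs):

```python
import contextvars

_run_id_var = contextvars.ContextVar("run_id", default=None)
_scenario_id_var = contextvars.ContextVar("scenario_id", default=None)

def set_run_context(run_id=None, scenario_id=None):
    """Seed ambient run identity; runner signatures stay unchanged."""
    if run_id is not None:
        _run_id_var.set(run_id)
    if scenario_id is not None:
        _scenario_id_var.set(scenario_id)

def resolve_run_id(run_id=None):
    """Explicit kwarg wins; otherwise fall back to the ambient contextvar."""
    return run_id if run_id is not None else _run_id_var.get()
```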

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Drop-in Collector config for trace persistence to disk (OTLP-JSON via
the file exporter), plus full user-facing docs for enable/persist/replay/
troubleshoot flows.

These files were intended to land with the previous commit but were
missed by "git add -u" since they were untracked.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
@ShuxinLin ShuxinLin changed the title feat: OTEL tracing across agent runners and LiteLLM proxy feat: save agent traces to disk in OTLP-JSON for benchmark evaluation Apr 24, 2026
Review feedback (see PR #272 discussion) surfaced several over-engineered
helpers and one real bug.

Bug fix
- tracing.py: BatchSpanProcessor buffers spans and the final agent run's
  root span could be dropped on CLI exit.  Register
  atexit.register(provider.shutdown) so the exporter flushes.

Drops with no behavior change
- Custom _NoopTracer / _NoopSpan shims + try/except ImportError around
  the OTEL import.  opentelemetry.trace.get_tracer() already returns a
  ProxyTracer with NonRecordingSpan no-ops when no provider is installed,
  and opentelemetry-api is a hard transitive dep.
- _safe_getattr_int / _set_error_status helpers (each used once, now
  inline).
- annotate_result helper: inlined into all four runners; each runner
  already knows its Trajectory shape.
- agent_run_span run_id / scenario_id kwargs: no production caller uses
  them; contextvar-sourced values are the only real path.
- _reset_for_tests public symbol: tests monkeypatch _initialized directly.
- src/observability/attributes.py module: every constant was used once
  inside runspan.py; inlined the six literal strings.
- Alias table in _system_from_model: speculative aws/gcp/bedrock/
  vertex_ai → anthropic mappings that were never emitted.  Kept the aws
  alias that the repo actually uses.

Reuse / consistency
- plan-execute CLI now uses _cli_common (setup_logging, add_common_args,
  HR, run_sdk_cli).  Drops its private _setup_logging, LOG_FORMAT,
  LOG_DATE_FORMAT, and duplicate main() body.

Perf
- ClaudeAgentRunner caches mcp_servers dict in __init__ (matches the
  deep-agent cached_property pattern).  openai-agent intentionally
  skipped: its MCPServerStdio instances are entered/exited per run and
  can't be safely cached.

Net -347 lines, 255 tests pass.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Observability no longer requires running a Docker Collector.  Set
OTEL_TRACES_FILE=./traces/traces.jsonl and each span batch is appended
to disk in canonical OTLP-JSON — the same format the OpenTelemetry
Collector's file exporter produces, so saved traces remain replayable
into any OTLP backend (Jaeger, Tempo, Honeycomb, …) via the Collector's
otlpjsonfile receiver when network access is available.
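The append-writer shape can be sketched with the stdlib — a hedged illustration of the idea behind OTLPJsonFileExporter, with plain json.dumps standing in for the OTLP common trace encoder the real exporter uses:

```python
import json
import threading
from pathlib import Path

class JsonlAppendWriter:
    """Thread-safe JSONL append writer (sketch; the real exporter
    serialises span batches with the OTLP trace encoder)."""
    def __init__(self, path):
        self._path = Path(path)
        self._path.parent.mkdir(parents=True, exist_ok=True)
        self._lock = threading.Lock()

    def append(self, batch):
        line = json.dumps(batch, separators=(",", ":")) + "\n"
        with self._lock:  # one batch per line, never interleaved
            with self._path.open("a", encoding="utf-8") as fh:
                fh.write(line)
```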

Changes
- src/observability/file_exporter.py: OTLPJsonFileExporter — thread-safe
  append writer backed by the OTLP common trace encoder.
- tracing.py: init_tracing now wires the file exporter when
  OTEL_TRACES_FILE is set.  HTTP and file exporters are independent; set
  either or both.
- Drop otel-collector.yaml — the Collector is now optional, not the
  primary story.
- docs/observability.md rewritten around the file-exporter workflow;
  live Jaeger demoted to a side note.
- Tests cover encoding, directory creation, append semantics, and the
  file-only enablement path.

259 tests pass.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Clean separation between the two observability artifacts so each has a
single responsibility and neither duplicates information held by the
other.

* **Span** (traces.jsonl) carries *metadata*: runner, model, run_id,
  scenario_id, question/answer lengths, timing.  The previously
  duplicated gen_ai.usage.input_tokens / gen_ai.usage.output_tokens /
  agent.turns / agent.tool_calls attributes are removed — those values
  are derived from the trajectory and adding them to the span was
  redundant work + divergence risk.
* **Trajectory** (AGENT_TRAJECTORY_DIR/{run_id}.json) carries *content*:
  per-turn text, tool call inputs/outputs, per-turn token usage.  New
  persistence module writes one file per run when the env var is set.
  Joins to the trace by run_id.

Changes
- src/observability/persistence.py: persist_trajectory() — reads
  run_id / scenario_id from the same contextvars used by agent_run_span,
  so no public signature change on runners.  Handles both SDK runners'
  Trajectory dataclass and plan-execute's list[StepResult].
- All four runners call persist_trajectory() after building AgentResult
  and drop the four derived span attributes.
- docs/observability.md rewritten around the two-artifact model; jq
  examples split into metadata-via-trace and content-via-trajectory.
- .gitignore: traces/

Tests cover: disabled-without-env no-op, happy path, list trajectory
(plan-execute shape), missing run_id warn-and-skip, nested dir creation.

264 tests pass.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Reversing the "pure metadata on span" framing from the prior commit.

Token totals and turn/tool_call counts are aggregate metrics, not
content, and belong on the span per OTEL GenAI semantic conventions.
Removing them made saved traces opaque to Jaeger, Tempo, Langfuse,
Grafana Cloud AI, Honeycomb, and every other OTEL-aware backend that
expects gen_ai.usage.* to display cost / tokens.  The "avoid
duplication" argument was weak — both span attribute and trajectory
property are derived from the same Trajectory object in the same
function call; they cannot diverge.

Better separation:
- Span: metadata + aggregates (runner, model, IDs, latency, token
  totals, turn and tool_call counts).
- Trajectory: per-turn content only (text, tool inputs / outputs,
  per-turn tokens).

No overlap: totals appear once on the span; per-turn numbers appear
once in the trajectory file.

docs/observability.md updated accordingly; jq example for token
totals now reads the trace alone instead of iterating trajectory
files.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Aggregates on root span (queryable by any OTEL UI); per-unit durations
on trajectory fields (nullable where the runner's SDK doesn't expose
clean callbacks).  Same invariant as tokens: no double-accounting.

Span attributes
- agent.duration_ms — all runners, wall-clock of run().
- agent.tool_time_ms — claude-agent only (via PreToolUse/PostToolUse hooks).
- agent.llm_time_ms — plan-execute only (planning + summarisation).
- agent.planning_time_ms, agent.summarization_time_ms — plan-execute only.

Trajectory fields
- Trajectory.started_at — ISO-8601 UTC timestamp (SDK runners).
- TurnRecord.duration_ms — wall-clock per turn (claude-agent only for now).
- ToolCall.duration_ms — wall-clock per tool (claude-agent only for now).
- StepResult.duration_ms — wall-clock per plan step (plan-execute).

Deferred: per-turn / per-tool timing for openai-agent and deep-agent —
their SDKs don't expose clean callback surfaces at that granularity.
Start with agent.duration_ms on the span; add finer hooks later when
needed.

Tests
- 5 new tests verifying nullable defaults on the four duration fields
  plus that plan-execute's executor always populates StepResult.duration_ms.

269 tests pass.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Adding a PreToolUse hook alongside PostToolUse for per-tool timing
broke subprocess launch on user-reported @anthropic-ai/claude-code CLI
versions (subprocess exits with code 1 during config parse).  The
Python claude_agent_sdk types PreToolUse as valid, but the installed
CLI binary is a separate artifact shipped via npm and need not agree.

Revert: only PostToolUse is registered, matching pre-#272 behavior.
Per-tool duration_ms and the derived agent.tool_time_ms span attribute
are no longer captured for claude-agent — matches openai-agent and
deep-agent which never had per-tool timing.

Turn-level timing (TurnRecord.duration_ms, measured from
AssistantMessage arrival times) and run-level timing
(agent.duration_ms) still work — they're computed in-process from
observable events without adding subprocess hook registrations.

docs/observability.md: removed agent.tool_time_ms from the span
attributes table, updated the trajectory coverage matrix, and noted
the compatibility constraint in the SDK-specific section.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
…ctory shape

- Token counts and turn/tool-call totals are set by SDK runners only
  (claude-agent, openai-agent, deep-agent), not plan-execute.
- plan-execute persists trajectory as a flat list of StepResult, not the
  {started_at, turns} object; document both shapes explicitly.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
LLMBackend.generate() returns only text; tokens are absent from both span
and trajectory for plan-execute — not recoverable from either artifact.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
LiteLLM's completion response already carries prompt/completion token
counts; LiteLLMBackend was discarding them.  Surface them via a new
LLMBackend.generate_with_usage() returning LLMResult, and wrap the
runner's backend in a _TokenMeter that accumulates across planner,
per-step arg-resolution, and summarise calls.  Totals land on
gen_ai.usage.{input,output}_tokens alongside the SDK runners.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Drop _TokenMeter / LLMResult / generate_with_usage implementation detail
from the user-facing doc — describe only the observable contract (sum
across plan / arg-resolution / summarise calls).

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
@ShuxinLin ShuxinLin force-pushed the feat/otel-observability branch from 445ddbd to 0a2c551 Compare April 27, 2026 16:14
…bility.md

Brief pointer (two-artifact model + env vars + --run-id/--scenario-id)
with a link to the full reference for span attrs, trajectory layout,
and replay workflows.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
INSTRUCTIONS.md kept the full per-tool tables for all six servers
(110 lines), which dominated the file and pushed agent / observability
content below the fold.

- New docs/mcp-servers.md owns the per-server tool tables, env
  requirements, and direct-launch instructions.
- INSTRUCTIONS.md MCP section is now an at-a-glance overview table
  (server, tool count, backing service) plus a link.
- Quick Start step 4 reframed from "run servers manually" to "run an
  agent" — matches actual usage; servers are stdio-spawned on demand.
- Collapsed three duplicate LITELLM_API_KEY / LITELLM_BASE_URL blocks
  (one per SDK runner) into a single proxy block.
- Trimmed TOC: dropped per-server and per-agent sub-bullets that just
  duplicated section headers.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
The four runner boxes restated what each per-agent section above
already covers (planner / executor / summariser stages, SDK loop
descriptions, trajectory collection).  Keep just the runner names
plus the agent → MCP-server fan-out so the diagram shows topology,
not duplicated prose.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Plan-Execute / Claude / OpenAI / Deep each had a near-identical
~50-line section: one-line description, "How it works" diagram,
flags table, model-prefix table, examples.  Most of that was
duplicated across all four — same --show-trajectory / --json /
--verbose flags, same litellm_proxy/ prefix table, same "uv run X
\$query" CLI shape.

Replace with one Agents section:
- Comparison table (runner, source, loop, default model).
- Shared flags and runner-specific flags split into two tables so
  each flag appears once.
- Single model-prefix table covering all runners.
- Plan-Execute's loop kept as the only diagram (the SDK runners
  delegate to upstream; their loops belong in the SDK docs, not here).
- Consolidated example block hits each runner's distinguishing flag.

INSTRUCTIONS.md: 481 → 313 lines.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
The flag exposed an internal customisation hook (override the MCP
server registry per-run from the command line).  No benchmark
scenario or doc example uses it, and PlanExecuteRunner.__init__
already accepts server_paths programmatically for callers that
actually need the override.

- src/agent/cli.py: remove the argparse entry, _parse_servers helper,
  unused Path import, and pass-through into PlanExecuteRunner.
- INSTRUCTIONS.md: drop the row from the runner-specific flags table.

102 agent + observability tests pass; no test exercised the flag.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
The runner table's 'Loop' column already names the four loop styles;
the ASCII diagram restated plan-execute's stages in prose without
adding architectural information not already in src/agent/plan_execute/.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
The provider-prefix → env-var mapping is already covered by the
Environment Variables section above (WatsonX and LiteLLM proxy
blocks); the third row was the only new info and is sufficiently
implied by the runner table's default model column.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Previous version repeated the same 'uv run pytest src/ -v' command in
two subsections (top-level + 'Integration tests'), enumerated six
near-identical per-server commands (and missed vibration), and split
work-order integration tests into their own subsection for no clear
reason.

- Lead with the two commands actually worth memorising: unit-only
  (-k "not integration") and full suite.
- Single table mapping suite → skip-unless condition (CouchDB up,
  WatsonX env, TSFM paths) — replaces the bullet list, the per-server
  command list, and the two integration subsections.
- Three narrowing examples: path, single file, -k pattern.  Verified
  -k "integration" actually collects 50/320 tests; dropped the
  -m requires_couchdb example that didn't work because those are
  skipif marks, not collection markers.

47 → 19 lines.

Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
@ShuxinLin ShuxinLin requested a review from DhavalRepo18 April 27, 2026 16:36
@DhavalRepo18 DhavalRepo18 merged commit 2f1069c into main Apr 27, 2026
1 check passed
@ShuxinLin ShuxinLin deleted the feat/otel-observability branch April 27, 2026 17:11