
Add OpenTelemetry observability across agent runners and LiteLLM proxy #270

@ShuxinLin

Summary

Add end-to-end OpenTelemetry tracing so every agent run emits a single correlated trace that spans the agent graph, each LLM turn, and each MCP tool call — with the LiteLLM proxy contributing its own child span per call.

Motivation

Today, per-turn token usage is captured only in AgentResult.trajectory (see DeepAgentRunner._build_trajectory, runner.py:111-161), and LLM-call detail lives only in the proxy logs. Debugging questions like "which turn produced the 14k-token prompt?" require manually cross-referencing timestamps. OTEL gives us one trace per run with flamegraph-style drill-down, works across all four runners uniformly, and is vendor-neutral (any OTLP backend: Jaeger, Tempo, Honeycomb, Grafana Cloud).

Scope

  1. New package src/observability/ with a tracing.py module exposing:

    • init_tracing(service_name: str) -> None — configure TracerProvider, OTLP/HTTP exporter, and HTTPXClientInstrumentor (auto-injects traceparent into LiteLLM proxy requests).
    • get_tracer() -> Tracer — convenience accessor used by runners.
    • No-op when OTEL_SDK_DISABLED=true or the required environment variables are not set, so existing users are unaffected.
  2. Runner instrumentation using GenAI semantic conventions:

    • Root span agent.run with attrs agent.runner, gen_ai.request.model, gen_ai.system.
    • Child agent.turn spans with turn.index, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens.
    • Child tool.call spans with tool.name, tool.server.
    • All four runners: plan_execute, claude_agent, openai_agent, deep_agent.
  3. CLI wiring: each entry point (plan-execute, claude-agent, openai-agent, deep-agent) calls init_tracing() once at startup.

  4. LiteLLM proxy config docs: document callbacks: ["otel"] + OTEL_EXPORTER_OTLP_ENDPOINT so the proxy's spans nest under the agent trace via the propagated traceparent header.

  5. Tests: unit tests using InMemorySpanExporter to assert the expected span tree per runner.

Dependencies

Added as a new optional group [dependency-groups.otel] so OTEL is opt-in:

  • opentelemetry-api
  • opentelemetry-sdk
  • opentelemetry-exporter-otlp-proto-http
  • opentelemetry-instrumentation-httpx
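
As a rough illustration of how these four packages might fit together inside init_tracing while keeping OTEL opt-in, consider the sketch below. It rests on two assumptions that the issue does not spell out: that the "required env" is OTEL_EXPORTER_OTLP_ENDPOINT, and that the OTEL imports happen lazily so the module still loads when the otel group is not installed. `tracing_enabled` is a hypothetical helper introduced only for this example.

```python
import os
from typing import Mapping


def tracing_enabled(env: Mapping[str, str]) -> bool:
    """Hypothetical gate: True only if the SDK is not disabled and an endpoint is set."""
    if env.get("OTEL_SDK_DISABLED", "").strip().lower() == "true":
        return False
    # Assumption: the OTLP endpoint is the only required variable.
    return bool(env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip())


def init_tracing(service_name: str) -> None:
    """Configure tracing once at startup; do nothing when tracing is not enabled."""
    if not tracing_enabled(os.environ):
        return  # no-op path: existing users are unaffected

    # Lazy imports: only reached when tracing is enabled and the otel group is installed.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    resource = Resource.create({"service.name": service_name})
    provider = TracerProvider(resource=resource)
    # With no arguments, the exporter reads its endpoint from the environment.
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)

    # Instrument httpx so outgoing requests carry a traceparent header.
    HTTPXClientInstrumentor().instrument()
```

With a gate like this, a CLI entry point can call `init_tracing("deep-agent")` unconditionally; the call returns immediately unless an endpoint is configured.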

Out of scope

  • Metrics (counters/histograms) — tracing first, metrics in a follow-up.
  • Replacing the existing Trajectory serialization — trajectories stay as the in-process result object.
  • LangSmith or Langfuse integration — OTEL-only for now; these can be added as additional exporters later if desired.
