Add OpenTelemetry observability across agent runners and LiteLLM proxy

## Summary

Add end-to-end OpenTelemetry tracing so every agent run emits a single correlated trace that spans the agent graph, each LLM turn, and each MCP tool call — with the LiteLLM proxy contributing its own child span per call.

## Motivation

Today, per-turn token usage is captured only in `AgentResult.trajectory` (see `DeepAgentRunner._build_trajectory`, `runner.py:111-161`), and LLM-call detail lives only in the proxy logs. Debugging questions like *"which turn produced the 14k-token prompt?"* require manually cross-referencing timestamps. OTEL gives us one trace per run with flamegraph-style drill-down, works across all four runners uniformly, and is vendor-neutral (any OTLP backend: Jaeger, Tempo, Honeycomb, Grafana Cloud).

## Scope

1. **New package `src/observability/`** with a `tracing.py` module exposing:
   - `init_tracing(service_name: str) -> None` — configure `TracerProvider`, OTLP/HTTP exporter, and `HTTPXClientInstrumentor` (auto-injects `traceparent` into LiteLLM proxy requests).
   - `get_tracer() -> Tracer` — convenience accessor used by runners.
   - `init_tracing` is a no-op when `OTEL_SDK_DISABLED=true` or when the required environment variables are not set, so existing users are unaffected.
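A minimal sketch of what `tracing.py` could look like, under stated assumptions: `OTEL_EXPORTER_OTLP_ENDPOINT` is treated as the only required variable, `tracing_enabled` is a hypothetical helper, and the tracer name `"agent"` is a placeholder. The OpenTelemetry imports are deferred so the module still loads when the optional `otel` group is not installed.

```python
import os

# Assumption: the OTLP endpoint is the only "required env" for tracing.
_REQUIRED_ENV = ("OTEL_EXPORTER_OTLP_ENDPOINT",)
_TRACER_NAME = "agent"  # hypothetical instrumentation scope name


def tracing_enabled(env=None) -> bool:
    """Return False whenever init_tracing should be a no-op."""
    env = os.environ if env is None else env
    if env.get("OTEL_SDK_DISABLED", "").strip().lower() == "true":
        return False
    return all(env.get(name) for name in _REQUIRED_ENV)


def init_tracing(service_name: str) -> None:
    """Configure the provider, OTLP/HTTP exporter, and httpx instrumentation."""
    if not tracing_enabled():
        return
    # Imported lazily so users without the otel group are unaffected.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import (
        OTLPSpanExporter,
    )
    from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    resource = Resource.create({"service.name": service_name})
    provider = TracerProvider(resource=resource)
    # OTLPSpanExporter reads OTEL_EXPORTER_OTLP_ENDPOINT from the environment.
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)
    # Injects `traceparent` into outgoing httpx requests.
    HTTPXClientInstrumentor().instrument()


def get_tracer():
    """Return a tracer; requires opentelemetry-api to be installed."""
    from opentelemetry import trace

    return trace.get_tracer(_TRACER_NAME)
```

If `init_tracing` returned early, `get_tracer()` still works: the OpenTelemetry API hands back a no-op tracer when no provider has been configured.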

2. **Runner instrumentation** using GenAI semantic conventions:
   - Root span `agent.run` with attributes `agent.runner`, `gen_ai.request.model`, `gen_ai.system`.
   - Child `agent.turn` spans with `turn.index`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`.
   - Child `tool.call` spans with `tool.name`, `tool.server`.
   - All four runners: `plan_execute`, `claude_agent`, `openai_agent`, `deep_agent`.
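The span hierarchy above could be produced roughly as follows. This is a sketch, not the runners' actual code: `trace_run` is a hypothetical helper, and the shape of `turns` (dicts with `input_tokens`, `output_tokens`, `tools`) is assumed for illustration. A real runner would wrap live work inside each span rather than replay a finished run.

```python
from typing import Any, Iterable, Mapping


def trace_run(
    tracer: Any,
    runner: str,
    model: str,
    system: str,
    turns: Iterable[Mapping[str, Any]],
) -> None:
    """Emit agent.run -> agent.turn -> tool.call spans for one run.

    `tracer` is any object exposing OpenTelemetry's
    `start_as_current_span(name)` context-manager API.
    """
    with tracer.start_as_current_span("agent.run") as run_span:
        run_span.set_attribute("agent.runner", runner)
        run_span.set_attribute("gen_ai.request.model", model)
        run_span.set_attribute("gen_ai.system", system)
        for index, turn in enumerate(turns):
            # Nested `with` blocks make each turn a child of agent.run.
            with tracer.start_as_current_span("agent.turn") as turn_span:
                turn_span.set_attribute("turn.index", index)
                turn_span.set_attribute(
                    "gen_ai.usage.input_tokens", turn["input_tokens"]
                )
                turn_span.set_attribute(
                    "gen_ai.usage.output_tokens", turn["output_tokens"]
                )
                for tool in turn.get("tools", ()):
                    with tracer.start_as_current_span("tool.call") as tool_span:
                        tool_span.set_attribute("tool.name", tool["name"])
                        tool_span.set_attribute("tool.server", tool["server"])
```

Passing the tracer in as an argument keeps the helper independent of global state, which also makes it easy to exercise in unit tests.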

3. **CLI wiring**: each entry point (`plan-execute`, `claude-agent`, `openai-agent`, `deep-agent`) calls `init_tracing(service_name)` once at startup.

4. **LiteLLM proxy config docs**: document `callbacks: ["otel"]` + `OTEL_EXPORTER_OTLP_ENDPOINT` so the proxy's spans nest under the agent trace via the propagated `traceparent` header.
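For reference, the propagated `traceparent` header follows the W3C Trace Context format `version-traceid-parentid-flags`. The sketch below only illustrates that shape; `parse_traceparent` is a hypothetical helper, and in practice the header is injected by `HTTPXClientInstrumentor` and parsed by the proxy's own OTEL integration.

```python
import re

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError(f"invalid traceparent: {header!r}")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        # Bit 0 of the flags byte is the "sampled" flag.
        "sampled": bool(int(flags, 16) & 0x01),
    }
```

The proxy's span nests under the agent trace because it reuses `trace_id` and takes `parent_id` as its parent span.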

5. **Tests**: unit tests using `InMemorySpanExporter` to assert the expected span tree per runner.
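One way the span-tree assertion could be written is with a small helper that maps each parent span name to its children. `span_tree` is a hypothetical name; it relies only on the `name`, `context.span_id`, and `parent` attributes that finished SDK spans expose. In a real test the spans would come from `InMemorySpanExporter.get_finished_spans()`, with the exporter attached through a `SimpleSpanProcessor`.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

ROOT = "<root>"  # placeholder key for spans without a local parent


def span_tree(spans: Iterable[Any]) -> Dict[str, List[str]]:
    """Map each parent span name to the sorted names of its children."""
    spans = list(spans)
    # Finished spans arrive in end order (children first), so index by ID
    # before resolving parents instead of relying on list order.
    names_by_id = {span.context.span_id: span.name for span in spans}
    tree: Dict[str, List[str]] = defaultdict(list)
    for span in spans:
        parent = span.parent
        parent_name = None
        if parent is not None:
            # A parent missing from `spans` (e.g. remote) is treated as root.
            parent_name = names_by_id.get(parent.span_id)
        tree[parent_name if parent_name is not None else ROOT].append(span.name)
    return {key: sorted(children) for key, children in tree.items()}
```

A per-runner test can then compare the result against a literal dict such as `{"<root>": ["agent.run"], "agent.run": ["agent.turn"], "agent.turn": ["tool.call"]}`.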

## Dependencies

Added as a new optional `otel` group under `[dependency-groups]` so OTEL is opt-in:

- `opentelemetry-api`
- `opentelemetry-sdk`
- `opentelemetry-exporter-otlp-proto-http`
- `opentelemetry-instrumentation-httpx`

## Out of scope

- Metrics (counters/histograms) — tracing first, metrics in a follow-up.
- Replacing the existing `Trajectory` serialization — trajectories stay as the in-process result object.
- LangSmith or Langfuse integration — OTEL-only for now; these can be added as additional exporters later if desired.
