Context
#298–#306 shipped the OTel foundation:
- OtelTraceExporter + OtlpJsonFileExporter + Langfuse/Braintrust/Confident presets
- OtelStreamingObserver emits real-time agentv.eval / chat / execute_tool spans during eval execution
- W3C traceparent propagation, GenAI per-span tokens, turn-level span grouping
The remaining gap: all of this is reachable only through the eval orchestrator. A user who just wants to watch an agent run (no test case, no graders) has no surface today. That blocks AgentV from being a live observability product.
A full architecture + 3-MVP plan is in the wiki synthesis:
tsoyang-org/ai-research-wiki#11 → concepts/agentv-realtime-observability.md
This issue tracks MVP-2 (agentv watch) from that plan. MVP-3 (in-flight scoring) gets a separate followup once MVP-2 ships.
Proposal: agentv watch <target>
Reuse the existing OtelStreamingObserver around a free-form agent invocation:
agentv watch claude-sdk -i prompt.md
agentv watch claude-sdk -i prompt.md --collect # also write .agentv/traces/watch-<ts>.otlp.json
echo "do the thing" | agentv watch codex --otel-endpoint http://localhost:4317
Behavior:
- No test case, no evaluate(), no graders.
- Same provider adapters as agentv run; same streaming observer; same gen_ai.* span attributes.
- Subagent trace propagation: set TRACEPARENT + OTEL_RESOURCE_ATTRIBUTES in the child process env so anything reading W3C tracecontext (Copilot CLI, claude-agent-sdk, nested agentv watch) becomes a child span automatically.
- Output: stdout (agent response) + OTLP stream to whatever OTEL_EXPORTER_OTLP_ENDPOINT says. Optional --collect writes OTLP JSON to .agentv/traces/ for offline replay.
Implementation surface: apps/cli/src/commands/watch/ consuming packages/core/src/observability/ directly. No new wire format. No new exporters.
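For illustration, the subagent propagation step could look roughly like the sketch below. This is not AgentV's actual code: SpanContext, formatTraceparent, parseTraceparent, and childEnv are hypothetical names. Only the traceparent layout (version-traceid-parentid-flags) and the comma-separated key=value form of OTEL_RESOURCE_ATTRIBUTES are taken from the W3C Trace Context and OpenTelemetry specifications.

```typescript
// Hypothetical shape; AgentV's real span-context type is not shown in this issue.
interface SpanContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// Format a W3C `traceparent` value: version-traceid-parentid-flags.
function formatTraceparent(ctx: SpanContext): string {
  const flags = ctx.sampled ? "01" : "00";
  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}

// Parse a version-00 `traceparent` value; returns undefined if it is malformed.
function parseTraceparent(value: string): SpanContext | undefined {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(value);
  if (!m) return undefined;
  const traceId = m[1];
  const spanId = m[2];
  const flags = m[3];
  // All-zero trace or span ids are invalid per the W3C spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Build the environment for a child agent process so that any
// tracecontext-aware tool it runs can attach to the parent span.
function childEnv(
  parentEnv: Record<string, string | undefined>,
  ctx: SpanContext,
  resourceAttrs: Record<string, string>,
): Record<string, string | undefined> {
  const attrs = Object.entries(resourceAttrs)
    // Percent-encode values so "," and "=" cannot break the list format.
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join(",");
  return {
    ...parentEnv,
    TRACEPARENT: formatTraceparent(ctx),
    OTEL_RESOURCE_ATTRIBUTES: attrs,
  };
}
```

The returned object would be passed as the env option when spawning the target process. A real implementation would probably merge with any OTEL_RESOURCE_ATTRIBUTES already present in the parent environment rather than overwrite it, as this sketch does.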
Why this slice, why now
- Smallest blast radius. The streaming observer and exporter already exist; this command wires them up outside the orchestrator. No new infrastructure.
- Unlocks MVP-3. Once agentv watch exists, in-flight scoring is just adding grader hooks at span-close in the watch command.
- Unlocks harness-quality eval. A --collect flag over real Claude/Codex/Copilot sessions is exactly the input shape needed for edit-survival and longitudinal-evaluation work (see VS Code Copilot's four_gram / no_revert precedent in [[vscode-copilot-agents]]).
- No coupling to a backend. Aspire Dashboard, Phoenix, Langfuse, Braintrust, Datadog — all consume OTLP. AgentV stays the emitter.
Acceptance criteria
- agentv watch claude-sdk -i prompt.md produces a complete OTel trace tree (root invoke_agent span + chat + execute_tool children) visible in any OTLP backend.
- agentv watch <target> --collect writes a valid OTLP-JSON file under .agentv/traces/ that any OTel backend (e.g. otel-cli) can ingest.
- Subagent spans nest correctly when AgentV is invoked from another AgentV-instrumented process (verify via a local Aspire Dashboard run: docker run mcr.microsoft.com/dotnet/aspire-dashboard).
- Docs page at apps/web/src/content/docs/observability/watch.md plus the 6-line Aspire docker run recipe for the "first 5 minutes" path.
- Unit tests for the watch command match the OtelStreamingObserver test patterns already in packages/core/test/observability/.
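One way the first two criteria might be checked in a test is sketched below. It is an assumption-laden sketch, not the project's test code: OtlpSpan, OtlpTraceFile, and checkTraceTree are hypothetical names, and the interfaces cover only the small subset of the OTLP/JSON trace shape (resourceSpans → scopeSpans → spans, with hex-encoded ids) that the check needs.

```typescript
// Minimal subset of the OTLP/JSON trace export shape.
interface OtlpSpan {
  traceId: string;
  spanId: string;
  parentSpanId?: string; // absent or "" on a root span
  name: string;
}

interface OtlpTraceFile {
  resourceSpans: { scopeSpans: { spans: OtlpSpan[] }[] }[];
}

// Collect every span in the file, regardless of resource or scope.
function flattenSpans(file: OtlpTraceFile): OtlpSpan[] {
  const out: OtlpSpan[] = [];
  for (const rs of file.resourceSpans) {
    for (const ss of rs.scopeSpans) {
      for (const span of ss.spans) out.push(span);
    }
  }
  return out;
}

// Return a list of problems; an empty list means the tree looks well-formed.
function checkTraceTree(file: OtlpTraceFile): string[] {
  const spans = flattenSpans(file);
  const ids = new Set(spans.map((s) => s.spanId));
  const problems: string[] = [];

  const roots = spans.filter((s) => !s.parentSpanId);
  if (roots.length !== 1) {
    problems.push(`expected 1 root span, found ${roots.length}`);
  } else if (!roots[0].name.startsWith("invoke_agent")) {
    problems.push(`root span is "${roots[0].name}", expected invoke_agent`);
  }

  // Every non-root span must point at a parent present in the same file.
  for (const s of spans) {
    if (s.parentSpanId && !ids.has(s.parentSpanId)) {
      problems.push(`span "${s.name}" has unknown parent ${s.parentSpanId}`);
    }
  }
  return problems;
}
```

This check fits a top-level run only. In the nested-subagent case the child's root span legitimately references a parent recorded by another process, so the "unknown parent" rule would need to be relaxed there.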
Non-goals (deferred)
- In-flight scoring — separate followup issue (MVP-3 in the synthesis); needs span-close hooks + sampling story.
- Hosted dashboard. Aspire / Phoenix / Langfuse already exist. The product-strategy in [[agentv-improvement-opportunities]] §"Out-of-scope" is explicit.
- New wire format. OTel + OTLP is the format.
References
- concepts/agentv-realtime-observability.md
- packages/core/src/observability/otel-exporter.ts, apps/cli/src/commands/eval/run-eval.ts:843