Affected area
Plugins, Observability or exporters, Third-party integration patches
Problem or opportunity
Summary
This is an OpenClaw hook/API request needed by the NeMo Flow OpenClaw observability plugin. The current plugin can produce useful Phoenix traces from existing public hooks, but accurate provider-level LLM tracing requires OpenClaw to expose a stable provider-call lifecycle event stream.
Current Approach
Today the OpenClaw plugin hooks expose partial observability signals across separate event streams: session/agent lifecycle events, tool execution events, message-write events, and model-call timing events. The NeMo Flow plugin combines those signals into Phoenix traces, but it has to infer LLM span boundaries and correlate request, response, usage, tool calls, timing, and final output after the fact.
That reconstruction is inherently lossy because the public hook surface does not provide one stable provider-call lifecycle object keyed by a shared callId. In multi-step agent loops, one run can contain several model calls and tool calls. Without a first-class provider-call start/delta/completion/failure contract, the integration has to rely on ordering and best-effort correlation to pair message snapshots with model timings and assistant outputs.
This is acceptable for general debugging, but not strong enough for source-of-truth optimization metrics. For optimization use cases, provider-native data needs to be attached to the exact LLM call that produced it: prompt/input tokens, completion/output tokens, cache read/write tokens, cost, latency, time-to-first-byte, finish reason, tool-call metadata, retry/fallback metadata, and normalized request/response content. Without that, token/cache/cost attribution can be ambiguous in LLM -> tool -> LLM -> tool -> LLM loops, and ACG/tool-policy optimization cannot safely use the trace as authoritative evidence.
Why the Patched Path Was More Accurate
The older patched integration produced more accurate traces because it instrumented OpenClaw’s provider execution path directly, where OpenClaw had the complete request, streamed/final response, provider usage object, cache counters, timing, and error/fallback state for a single model invocation. That gave NeMo Flow a natural one-to-one mapping between a provider call and an exported LLM span.
Desired State
The plugin implementation should not patch OpenClaw internals or depend on private runtime structure. It should stay on the supported public hook API. To reach the same trace fidelity through a plugin, OpenClaw should expose provider-call lifecycle hooks with a stable callId:
- provider-call started
- provider-call delta/stream event, if streaming is enabled
- provider-call completed
- provider-call failed
Each completed/failed event should carry the normalized provider request/response, provider usage, cache counters, latency, time-to-first-byte, finish reason, retry/fallback metadata, and sanitized raw payloads where appropriate.
That would let observability integrations export exact LLM spans without guessing from message order, timing candidates, or later tool events.
Proposed enhancement
Request/track OpenClaw support for stable provider-call lifecycle hooks with a shared callId:
- model_call_started
- model_call_delta
- model_call_completed
- model_call_failed
Each completed call should expose provider, model, normalized request/response, tool calls, provider usage, cache read/write counters, latency, time-to-first-byte, finish reason, retry/fallback metadata, and sanitized raw payloads where appropriate.
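To make the request concrete, the four events could be modeled as a discriminated union keyed on the event name. This is only a sketch: every interface and field name below (`ModelCallBase`, `ProviderUsage`, `timestampMs`, `costUsd`, and so on) is a hypothetical placeholder for discussion, not an existing OpenClaw type.

```typescript
// Hypothetical payload shapes for the proposed hooks.
// None of these names are confirmed OpenClaw API.
interface ProviderUsage {
  inputTokens?: number;
  outputTokens?: number;
  totalTokens?: number;
  cacheReadTokens?: number;
  cacheWriteTokens?: number;
  costUsd?: number; // only when the provider reports cost
}

interface RetryInfo {
  attempt: number;
  fallbackFrom?: string; // model that was tried before this one, if any
}

interface ModelCallBase {
  callId: string; // shared by every event of one provider invocation
  runId: string;
  sessionId: string;
  agentId: string;
  provider: string;
  model: string;
  timestampMs: number;
}

interface ModelCallStarted extends ModelCallBase {
  type: "model_call_started";
  request: unknown; // normalized, sanitized request
}

interface ModelCallDelta extends ModelCallBase {
  type: "model_call_delta";
  delta: unknown; // one streamed chunk
}

interface ModelCallCompleted extends ModelCallBase {
  type: "model_call_completed";
  response: unknown; // normalized, sanitized response
  usage: ProviderUsage;
  finishReason?: string;
  latencyMs: number;
  timeToFirstByteMs?: number;
  retry?: RetryInfo;
}

interface ModelCallFailed extends ModelCallBase {
  type: "model_call_failed";
  error: { type: string; status?: number };
  elapsedMs: number;
  retry?: RetryInfo;
}

type ModelCallEvent =
  | ModelCallStarted
  | ModelCallDelta
  | ModelCallCompleted
  | ModelCallFailed;

// True for the two events that end a provider call.
function isTerminal(
  e: ModelCallEvent
): e is ModelCallCompleted | ModelCallFailed {
  return e.type === "model_call_completed" || e.type === "model_call_failed";
}
```

A tagged union lets a plugin narrow on `type` and get compile-time checking that usage is only read from completed events.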
Why this matters
This is not blocking PR 67, but it is needed before treating plugin-only traces as authoritative evidence for:
- LLM -> tool -> LLM -> tool -> LLM replay fidelity
- token/cost attribution per LLM call
- provider cache evidence for ACG
- tool-policy optimization and cost comparisons
- debugging retries, fallbacks, and cache behavior without heuristic correlation
Runtime contract and binding impact
This would add new public OpenClaw plugin hook events. It should not require NeMo Flow to patch OpenClaw internals, and it should not change the existing agent/tool execution behavior.
Expected contract:
- Every provider/model invocation gets a stable callId.
- callId is shared across start, delta, completed, and failed events.
- Events include runId, sessionId, agentId, provider, model, timestamps, and request/response metadata.
- Completion events expose provider-native usage, including prompt/input tokens, completion/output tokens, total tokens, cache read/write tokens, cost where available, finish reason, latency, and time-to-first-byte.
- Failure events expose error type/status, retry/fallback metadata, and elapsed timing.
- Payloads should be sanitized consistently with OpenClaw’s existing privacy/redaction rules.
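The callId part of this contract can be stated as a small conformance check over an event stream: each call opens with a start, may emit deltas, and closes with exactly one completed or failed event. The sketch below is illustrative only; `Ev`, `Kind`, and `checkLifecycle` are hypothetical names, and real events would carry the additional fields listed above.

```typescript
// Hypothetical minimal event shape, reduced to what the lifecycle check needs.
type Kind = "started" | "delta" | "completed" | "failed";

interface Ev {
  callId: string;
  kind: Kind;
}

// Returns a list of contract violations; an empty list means the stream conforms.
function checkLifecycle(events: Ev[]): string[] {
  const state = new Map<string, "open" | "closed">();
  const problems: string[] = [];

  for (const e of events) {
    const current = state.get(e.callId);
    if (e.kind === "started") {
      if (current !== undefined) {
        problems.push(`${e.callId}: duplicate start`);
      }
      state.set(e.callId, "open");
    } else if (current !== "open") {
      // Delta or terminal event for a call that was never started or already ended.
      problems.push(`${e.callId}: ${e.kind} without an open call`);
    } else if (e.kind !== "delta") {
      // completed or failed closes the call.
      state.set(e.callId, "closed");
    }
  }

  state.forEach((s, id) => {
    if (s === "open") {
      problems.push(`${id}: never completed or failed`);
    }
  });
  return problems;
}
```

Because the check keys only on callId, interleaved calls within one run are handled without relying on event order across calls.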
Binding impact:
- No required change to existing plugin hooks if added as new events.
- Existing plugins can ignore these hooks.
- NeMo Flow would bind to the new events and map each provider call directly to one OpenInference LLM span.
- This would reduce or remove the current best-effort correlation logic in the NeMo Flow OpenClaw plugin.
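With such events, the plugin-side binding could shrink to a small assembler that keeps one pending record per callId and emits exactly one span when the terminal event arrives. The following is a rough sketch under assumed event shapes; `CallEvent`, `LlmSpan`, and `SpanAssembler` are hypothetical, and the resulting object is a plain record rather than an actual OpenInference or OpenTelemetry span.

```typescript
// Hypothetical, trimmed-down event shapes for illustration only.
type CallEvent =
  | { type: "model_call_started"; callId: string; model: string; timestampMs: number }
  | { type: "model_call_delta"; callId: string; timestampMs: number }
  | { type: "model_call_completed"; callId: string; timestampMs: number; inputTokens: number; outputTokens: number }
  | { type: "model_call_failed"; callId: string; timestampMs: number; errorType: string };

interface LlmSpan {
  callId: string;
  model: string;
  startMs: number;
  endMs: number;
  timeToFirstDeltaMs?: number;
  status: "ok" | "error";
  inputTokens?: number;
  outputTokens?: number;
  errorType?: string;
}

interface PendingCall {
  model: string;
  startMs: number;
  firstDeltaMs?: number;
}

class SpanAssembler {
  private readonly open = new Map<string, PendingCall>();

  // Returns a finished span on completed/failed, otherwise undefined.
  handle(e: CallEvent): LlmSpan | undefined {
    if (e.type === "model_call_started") {
      this.open.set(e.callId, { model: e.model, startMs: e.timestampMs });
      return undefined;
    }
    const pending = this.open.get(e.callId);
    if (pending === undefined) {
      return undefined; // unknown callId: ignore rather than guess
    }
    if (e.type === "model_call_delta") {
      if (pending.firstDeltaMs === undefined) {
        pending.firstDeltaMs = e.timestampMs;
      }
      return undefined;
    }
    // Terminal event: close the call and build one span for it.
    this.open.delete(e.callId);
    const span: LlmSpan = {
      callId: e.callId,
      model: pending.model,
      startMs: pending.startMs,
      endMs: e.timestampMs,
      status: e.type === "model_call_completed" ? "ok" : "error",
    };
    if (pending.firstDeltaMs !== undefined) {
      span.timeToFirstDeltaMs = pending.firstDeltaMs - pending.startMs;
    }
    if (e.type === "model_call_completed") {
      span.inputTokens = e.inputTokens;
      span.outputTokens = e.outputTokens;
    } else {
      span.errorType = e.errorType;
    }
    return span;
  }
}
```

Usage and error data are read only from the event that carries the same callId as the start, so no message-order or timing-candidate matching is involved.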
Alternatives considered
The alternatives are useful in narrower cases, but each has a limitation for authoritative optimization telemetry:
- Best-effort reconstruction from existing public hooks. PR 67 uses this approach today. It is the right short-term path because it avoids patching OpenClaw internals, but it is still lossy compared with provider-call instrumentation and should not be treated as the long-term source-of-truth contract for optimization telemetry.
- Continue patching OpenClaw internals. This produced more accurate traces in the earlier prototype because it instrumented the provider execution path directly, but it is not sustainable. It depends on private runtime structure and increases maintenance risk across OpenClaw releases.
- Infer provider-call identity from message order and timing. This can work for simple sessions, but it becomes ambiguous in multi-step loops, retries, fallbacks, streaming responses, or concurrent tool activity.
- Trace only final assistant messages and tool events. This is clean for debugging, but it loses provider-native token/cache/cost/latency evidence needed for ACG and tool-policy optimization.
Acceptance criteria
- OpenClaw exposes public plugin hooks for provider-call lifecycle events: started, delta/streaming when applicable, completed, and failed.
- All events for the same provider/model invocation share a stable callId.
- Events include runId, sessionId, agentId, provider, model, timestamps, and request/response metadata.
- Completion events expose provider-native usage: prompt/input tokens, completion/output tokens, total tokens, cache read/write tokens, cost when available, finish reason, latency, and time-to-first-byte.
- Failure events expose error type/status, elapsed timing, and retry/fallback metadata.
- Payloads follow OpenClaw’s existing privacy/redaction policy and do not expose secrets.
- Existing plugin hooks remain backward-compatible.
- A NeMo Flow plugin can map each provider call directly to one OpenInference LLM span without relying on message-order or timing-candidate heuristics.
- A multi-step agent loop can produce an accurate LLM -> tool -> LLM -> tool -> LLM trace with visible LLM input/output and correct token/cache/cost attribution per LLM span.
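The last criterion could be checked with a small helper that takes the ordered steps of one run and reports both the trace shape and the tokens attributed to each LLM call. This is a sketch only: `Step`, `CompletedCall`, and `summarizeRun` are hypothetical names, and the tool step shape is a stand-in for whatever the existing tool hooks provide.

```typescript
// Hypothetical shapes; usage is assumed to come from the completed event
// that carries the same callId.
interface CompletedCall {
  callId: string;
  inputTokens: number;
  outputTokens: number;
}

type Step =
  | { kind: "llm"; call: CompletedCall }
  | { kind: "tool"; name: string };

interface RunSummary {
  shape: string; // e.g. "LLM -> tool -> LLM"
  tokensByCall: Map<string, number>;
  totalTokens: number;
}

function summarizeRun(steps: Step[]): RunSummary {
  const shape = steps
    .map((s) => (s.kind === "llm" ? "LLM" : "tool"))
    .join(" -> ");
  const tokensByCall = new Map<string, number>();
  let totalTokens = 0;
  for (const s of steps) {
    if (s.kind !== "llm") {
      continue; // tool steps carry no provider usage
    }
    const tokens = s.call.inputTokens + s.call.outputTokens;
    tokensByCall.set(s.call.callId, tokens);
    totalTokens += tokens;
  }
  return { shape, tokensByCall, totalTokens };
}
```

If per-call usage is attached by callId, the per-span values sum to the run total by construction, which is the property heuristic correlation cannot guarantee.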