
[Enhancement]: Request OpenClaw provider-call hooks for more accurate LLM traces #75

@mnajafian-nv

Affected area

Plugins, Observability or exporters, Third-party integration patches

Problem or opportunity

Summary

This is an OpenClaw hook/API request needed by the NeMo Flow OpenClaw observability plugin. The current plugin can produce useful Phoenix traces from existing public hooks, but accurate provider-level LLM tracing requires OpenClaw to expose a stable provider-call lifecycle event stream.

Current Approach

Today the OpenClaw plugin hooks expose partial observability signals across separate event streams: session/agent lifecycle events, tool execution events, message-write events, and model-call timing events. The NeMo Flow plugin combines those signals into Phoenix traces, but it has to infer LLM span boundaries and correlate request, response, usage, tool calls, timing, and final output after the fact.

That reconstruction is inherently lossy because the public hook surface does not provide one stable provider-call lifecycle object keyed by a shared callId. In multi-step agent loops, one run can contain several model calls and tool calls. Without a first-class provider-call start/delta/completion/failure contract, the integration has to rely on ordering and best-effort correlation to pair message snapshots with model timings and assistant outputs.
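To make the ambiguity concrete, here is a minimal sketch of one possible ordering/timing heuristic. It is not the plugin's actual correlation logic; the record types (`ModelTiming`, `MessageWrite`) and the nearest-end-time rule are assumptions chosen only to illustrate how a late message write can be attributed to the wrong model call.

```python
from dataclasses import dataclass


@dataclass
class ModelTiming:
    # Hypothetical record derived from a model-call timing event.
    started_at: float
    ended_at: float


@dataclass
class MessageWrite:
    # Hypothetical record derived from a message-write event.
    written_at: float
    text: str


def pair_by_nearest_end(timings, messages):
    """Assign each message to the timing whose end is closest to the write time."""
    pairs = []
    for msg in messages:
        best = min(
            range(len(timings)),
            key=lambda i: abs(timings[i].ended_at - msg.written_at),
        )
        pairs.append((best, msg.text))
    return pairs


# Two model calls in one run. The first assistant message is written late,
# after the tool has run and the second model call has already started.
timings = [ModelTiming(0.0, 1.0), ModelTiming(1.6, 2.4)]
messages = [
    MessageWrite(written_at=1.9, text="call tool A"),   # produced by call 0
    MessageWrite(written_at=2.5, text="final answer"),  # produced by call 1
]

pairs = pair_by_nearest_end(timings, messages)
print(pairs)  # both messages land on call 1; call 0 gets no output
```

With a shared callId on every event, the pairing step would disappear entirely, since each message and usage record would already name the call it belongs to.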

This is acceptable for general debugging, but not strong enough for source-of-truth optimization metrics. For optimization use cases, provider-native data needs to be attached to the exact LLM call that produced it: prompt/input tokens, completion/output tokens, cache read/write tokens, cost, latency, time-to-first-byte, finish reason, tool-call metadata, retry/fallback metadata, and normalized request/response content. Without that, token/cache/cost attribution can be ambiguous in LLM -> tool -> LLM -> tool -> LLM loops, and ACG/tool-policy optimization cannot safely use the trace as authoritative evidence.

Why the Patched Path Was More Accurate

The older patched integration produced more accurate traces because it instrumented OpenClaw’s provider execution path directly, where OpenClaw had the complete request, streamed/final response, provider usage object, cache counters, timing, and error/fallback state for a single model invocation. That gave NeMo Flow a natural one-to-one mapping between a provider call and an exported LLM span.

Desired State

The plugin implementation should not patch OpenClaw internals or depend on private runtime structure. It should stay on the supported public hook API. To reach the same trace fidelity through a plugin, OpenClaw should expose provider-call lifecycle hooks with a stable callId:

  • provider-call started
  • provider-call delta/stream event, if streaming is enabled
  • provider-call completed
  • provider-call failed

Each completed/failed event should carry the normalized provider request/response, provider usage, cache counters, latency, time-to-first-byte, finish reason, retry/fallback metadata, and sanitized raw payloads where appropriate.

That would let observability integrations export exact LLM spans without guessing from message order, timing candidates, or later tool events.
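As a rough illustration of the lifecycle ordering described above, the sketch below checks that every callId sees a start, then zero or more deltas, then exactly one terminal event. The event kinds (`started`, `delta`, `completed`, `failed`) and the `(call_id, kind)` tuple shape are assumptions for this example, not an existing OpenClaw API.

```python
TERMINAL = {"completed", "failed"}


def validate_lifecycle(events):
    """Check per-call ordering: started, zero or more delta, one terminal event.

    `events` is an iterable of (call_id, kind) tuples in arrival order.
    Returns a dict of call_id -> first error found for that call.
    """
    state = {}   # call_id -> "open" or "closed"
    errors = {}
    for call_id, kind in events:
        current = state.get(call_id)
        if kind == "started":
            if current is None:
                state[call_id] = "open"
            else:
                errors.setdefault(call_id, "duplicate started")
        elif kind == "delta" or kind in TERMINAL:
            if current != "open":
                errors.setdefault(call_id, f"{kind} without open call")
            elif kind in TERMINAL:
                state[call_id] = "closed"
        else:
            errors.setdefault(call_id, f"unknown kind {kind!r}")
    # Any call still open at the end never reached a terminal event.
    for call_id, current in state.items():
        if current == "open":
            errors.setdefault(call_id, "never completed or failed")
    return errors


# Interleaved calls are fine as long as each callId follows the order.
good = [
    ("c1", "started"), ("c1", "delta"), ("c2", "started"),
    ("c1", "completed"), ("c2", "failed"),
]
bad = [("c3", "delta"), ("c4", "started")]

good_errors = validate_lifecycle(good)
bad_errors = validate_lifecycle(bad)
```

A check like this could run in an integration test to confirm the hook stream honors the contract before any spans are exported.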

Proposed enhancement

Request/track OpenClaw support for stable provider-call lifecycle hooks with a shared callId:

  • model_call_started
  • model_call_delta
  • model_call_completed
  • model_call_failed

Each completed call should expose provider, model, normalized request/response, tool calls, provider usage, cache read/write counters, latency, time-to-first-byte, finish reason, retry/fallback metadata, and sanitized raw payloads where appropriate.

Why this matters

This is not blocking PR 67, but it is needed before treating plugin-only traces as authoritative evidence for:

  • LLM -> tool -> LLM -> tool -> LLM replay fidelity
  • token/cost attribution per LLM call
  • provider cache evidence for ACG
  • tool-policy optimization and cost comparisons
  • debugging retries, fallbacks, and cache behavior without heuristic correlation

Runtime contract and binding impact

This would add new public OpenClaw plugin hook events. It should not require NeMo Flow to patch OpenClaw internals, and it should not change the existing agent/tool execution behavior.

Expected contract:

  • Every provider/model invocation gets a stable callId.
  • callId is shared across start, delta, completed, and failed events.
  • Events include runId, sessionId, agentId, provider, model, timestamps, and request/response metadata.
  • Completion events expose provider-native usage, including prompt/input tokens, completion/output tokens, total tokens, cache read/write tokens, cost where available, finish reason, latency, and time-to-first-byte.
  • Failure events expose error type/status, retry/fallback metadata, and elapsed timing.
  • Payloads should be sanitized consistently with OpenClaw’s existing privacy/redaction rules.
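One way the expected contract could be written down is sketched below as plain dataclasses. Every class and field name here (`CallContext`, `Usage`, `ModelCallCompleted`, `ModelCallFailed`, `fallback_model`, and so on) is hypothetical and exists only to show the shape of the data; the real payloads would be defined by OpenClaw.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallContext:
    # Identifiers shared by every event of one provider/model invocation.
    call_id: str
    run_id: str
    session_id: str
    agent_id: str
    provider: str
    model: str


@dataclass(frozen=True)
class Usage:
    # Provider-native counters, passed through as reported.
    input_tokens: int
    output_tokens: int
    total_tokens: int
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0
    cost_usd: Optional[float] = None  # only "where available"


@dataclass(frozen=True)
class ModelCallCompleted:
    ctx: CallContext
    started_at_ms: int
    completed_at_ms: int
    finish_reason: str
    usage: Usage
    first_byte_at_ms: Optional[int] = None

    @property
    def latency_ms(self) -> int:
        return self.completed_at_ms - self.started_at_ms

    @property
    def ttfb_ms(self) -> Optional[int]:
        if self.first_byte_at_ms is None:
            return None
        return self.first_byte_at_ms - self.started_at_ms


@dataclass(frozen=True)
class ModelCallFailed:
    ctx: CallContext
    started_at_ms: int
    failed_at_ms: int
    error_type: str
    status: Optional[int] = None
    retry_attempt: int = 0
    fallback_model: Optional[str] = None

    @property
    def elapsed_ms(self) -> int:
        return self.failed_at_ms - self.started_at_ms


ctx = CallContext("call-1", "run-1", "sess-1", "agent-1",
                  "example-provider", "example-model")
done = ModelCallCompleted(
    ctx=ctx, started_at_ms=1000, completed_at_ms=2200, finish_reason="stop",
    usage=Usage(input_tokens=120, output_tokens=30, total_tokens=150,
                cache_read_tokens=100),
    first_byte_at_ms=1250,
)
failed = ModelCallFailed(ctx=ctx, started_at_ms=1000, failed_at_ms=1800,
                         error_type="rate_limited", status=429, retry_attempt=1)
```

`total_tokens` is kept as a reported field rather than computed, because providers differ on whether cached tokens are counted inside the input total.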

Binding impact:

  • No required change to existing plugin hooks if added as new events.
  • Existing plugins can ignore these hooks.
  • NeMo Flow would bind to the new events and map each provider call directly to one OpenInference LLM span.
  • This would reduce or remove the current best-effort correlation logic in the NeMo Flow OpenClaw plugin.
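A minimal sketch of that binding, assuming the hooks deliver a callId on every event: each call opens one span on start and closes the same span on completion or failure. The handler names and arguments are hypothetical. The attribute keys follow my understanding of the OpenInference conventions (`openinference.span.kind`, `llm.model_name`, `llm.token_count.prompt`, `llm.token_count.completion`) and should be checked against the spec; `error.type` is an illustrative key only.

```python
class LlmSpanBuilder:
    """Builds exactly one span dict per callId from lifecycle events."""

    def __init__(self):
        self._open = {}      # call_id -> span under construction
        self.finished = []   # completed or failed spans, in finish order

    def on_started(self, call_id, model, started_at_ms):
        self._open[call_id] = {
            "span_id": call_id,
            "start_ms": started_at_ms,
            "attributes": {
                "openinference.span.kind": "LLM",
                "llm.model_name": model,
            },
        }

    def on_completed(self, call_id, input_tokens, output_tokens, ended_at_ms):
        span = self._open.pop(call_id)
        span["end_ms"] = ended_at_ms
        span["status"] = "ok"
        span["attributes"]["llm.token_count.prompt"] = input_tokens
        span["attributes"]["llm.token_count.completion"] = output_tokens
        self.finished.append(span)

    def on_failed(self, call_id, error_type, ended_at_ms):
        span = self._open.pop(call_id)
        span["end_ms"] = ended_at_ms
        span["status"] = "error"
        span["attributes"]["error.type"] = error_type
        self.finished.append(span)


builder = LlmSpanBuilder()
builder.on_started("call-1", "example-model", 0)
builder.on_started("call-2", "example-model", 50)  # overlapping call
builder.on_completed("call-2", input_tokens=200, output_tokens=40, ended_at_ms=400)
builder.on_failed("call-1", error_type="timeout", ended_at_ms=900)
```

Because the lookup is keyed on callId, overlapping or out-of-order completions still land on the right span, with no ordering or timing heuristics involved.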

Alternatives considered

The alternatives are useful in narrower cases, but each has a limitation for authoritative optimization telemetry:

  • Best-effort reconstruction from existing public hooks. PR 67 uses this approach today. It is the right short-term path because it avoids patching OpenClaw internals, but it is still lossy compared with provider-call instrumentation and should not be treated as the long-term source-of-truth contract for optimization telemetry.

  • Continue patching OpenClaw internals. This produced more accurate traces in the earlier prototype because it instrumented the provider execution path directly, but it is not sustainable. It depends on private runtime structure and increases maintenance risk across OpenClaw releases.

  • Infer provider-call identity from message order and timing. This can work for simple sessions, but it becomes ambiguous in multi-step loops, retries, fallbacks, streaming responses, or concurrent tool activity.

  • Trace only final assistant messages and tool events. This is clean for debugging, but it loses provider-native token/cache/cost/latency evidence needed for ACG and tool-policy optimization.

Acceptance criteria

  • OpenClaw exposes public plugin hooks for provider-call lifecycle events: started, delta/streaming when applicable, completed, and failed.
  • All events for the same provider/model invocation share a stable callId.
  • Events include runId, sessionId, agentId, provider, model, timestamps, and request/response metadata.
  • Completion events expose provider-native usage: prompt/input tokens, completion/output tokens, total tokens, cache read/write tokens, cost when available, finish reason, latency, and time-to-first-byte.
  • Failure events expose error type/status, elapsed timing, and retry/fallback metadata.
  • Payloads follow OpenClaw’s existing privacy/redaction policy and do not expose secrets.
  • Existing plugin hooks remain backward-compatible.
  • A NeMo Flow plugin can map each provider call directly to one OpenInference LLM span without relying on message-order or timing-candidate heuristics.
  • A multi-step agent loop can produce an accurate LLM -> tool -> LLM -> tool -> LLM trace with visible LLM input/output and correct token/cache/cost attribution per LLM span.
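The last criterion could be checked with something like the sketch below. The event dictionaries, the `tool_call` event type, and all field names are made up for illustration; only `model_call_completed` and `callId` come from the proposal above.

```python
def build_trace(events):
    """Order completed model calls and tool calls by start time into a flat trace."""
    trace = []
    for ev in sorted(events, key=lambda e: e["started_at_ms"]):
        if ev["type"] == "model_call_completed":
            usage = ev["usage"]
            trace.append({
                "kind": "LLM",
                "id": ev["callId"],
                "tokens": usage["input_tokens"] + usage["output_tokens"],
                "cache_read": usage["cache_read_tokens"],
            })
        elif ev["type"] == "tool_call":  # hypothetical tool event shape
            trace.append({"kind": "tool", "id": ev["name"]})
    return trace


def llm(call_id, started, inp, out, cache_read):
    # Helper that builds a hypothetical model_call_completed payload.
    return {
        "type": "model_call_completed", "callId": call_id,
        "started_at_ms": started,
        "usage": {"input_tokens": inp, "output_tokens": out,
                  "cache_read_tokens": cache_read},
    }


# Events arrive out of order on purpose; ordering comes from timestamps,
# attribution comes from the callId carried on each event.
events = [
    llm("c2", 300, 220, 40, 100),
    {"type": "tool_call", "name": "search", "started_at_ms": 200},
    llm("c1", 0, 100, 50, 0),
    llm("c3", 600, 300, 30, 220),
    {"type": "tool_call", "name": "fetch", "started_at_ms": 500},
]

trace = build_trace(events)
kinds = [step["kind"] for step in trace]
tokens_by_call = {s["id"]: s["tokens"] for s in trace if s["kind"] == "LLM"}
```

Here `kinds` gives the LLM -> tool -> LLM -> tool -> LLM shape, and `tokens_by_call` gives per-span attribution that can be compared against the provider-reported totals for the run.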
