diff --git a/docs/index.md b/docs/index.md
index a153fef8..4d7cc2a3 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -68,6 +68,7 @@ Use the reading path that matches your task:
 | Observe a local coding-agent CLI | [NeMo Relay CLI](nemo-relay-cli/about.md) |
 | Package reusable behavior | [Build Plugins](build-plugins/about.md) |
 | Export traces or trajectories | [Observability](plugins/observability/about.md) |
+| Debug trace incidents | [Trace Incident Runbook](troubleshooting/trace-incident-runbook.md) |
 | Tune performance with adaptive behavior | [Adaptive](plugins/adaptive/about.md) |
 | Look up symbols | [APIs](reference/api/index.md) |
 
@@ -270,6 +271,7 @@ reference/performance
 :maxdepth: 2
 
 Troubleshooting Guide
+Trace Incident Runbook
 ```
 
 ```{toctree}
diff --git a/docs/plugins/observability/about.md b/docs/plugins/observability/about.md
index 81b8741d..2b49f832 100644
--- a/docs/plugins/observability/about.md
+++ b/docs/plugins/observability/about.md
@@ -57,9 +57,13 @@ Choose the exporter based on the downstream system:
 | Generic OTLP traces | [OpenTelemetry](opentelemetry.md) |
 | OpenInference-oriented agent and LLM spans | [OpenInference](openinference.md) |
 
-Start with local event inspection before production export. Add sanitize
+Start with in-process event inspection before exporting externally. Add sanitize
 guardrails before exporters receive sensitive payloads.
 
+For trace incidents involving missing traces, wrong scope attachment, export
+failures, duplicate events, or sensitive telemetry, use the
+[Trace Incident Runbook](../../troubleshooting/trace-incident-runbook.md).
+
 ## Correlating Trajectories And Traces
 
 When ATIF and trace exporters observe the same NeMo Relay events, they share
diff --git a/docs/troubleshooting/trace-incident-runbook.md b/docs/troubleshooting/trace-incident-runbook.md
new file mode 100644
index 00000000..d8d31682
--- /dev/null
+++ b/docs/troubleshooting/trace-incident-runbook.md
@@ -0,0 +1,210 @@
+
+
+# Trace Incident Runbook
+
+Use this runbook when a NeMo Relay application has missing traces, partial
+traces, incorrect scope parentage, exporter failures, duplicate events, or
+sensitive data in telemetry. It assumes that the application already has a
+baseline scope and call instrumentation path.
+
+For first-time setup problems, start with the
+[Troubleshooting Guide](troubleshooting-guide.md). For conceptual grounding,
+refer to [Agent Runtime Primer](../getting-started/agent-runtime-primer.md),
+[Scopes](../about/concepts/scopes.md), [Events](../about/concepts/events.md),
+and [Subscribers](../about/concepts/subscribers.md).
+
+## Protect Sensitive Data First
+
+Do not collect raw prompts, model responses, authorization headers, tokens,
+customer records, tool arguments, or provider payloads while triaging an
+incident. Capture the smallest sanitized event sample that proves the failure.
+
+Before exporting incident artifacts outside the current trust boundary, verify
+that sanitize guardrails or exporter filters remove sensitive fields. Sanitize
+guardrails change emitted telemetry payloads only; they do not change the live
+request or response passed to the tool, model provider, or application. Refer to
+[Middleware](../about/concepts/middleware.md) and
+[Add Middleware](../instrument-applications/advanced-guide.md) for the
+guardrail boundary.
+
+## Triage By Symptom
+
+Use this table to choose the first check for the symptom you see.
+
+| Symptom | Likely Area | Start With |
+|---|---|---|
+| No traces | Missing instrumentation boundary or inactive exporter | [Confirm Instrumentation Boundary](#confirm-instrumentation-boundary) |
+| Partial traces | Unwrapped calls, dropped streams, or late subscriber registration | [Confirm Managed Calls](#confirm-managed-calls) |
+| Wrong parent or child scope | Scope propagation or shared scope stack issue | [Confirm Active Scope](#confirm-active-scope) |
+| Events appear in process but export fails elsewhere | Exporter config, endpoint, environment, or flush path | [Confirm Exporter Setup](#confirm-exporter-setup) |
+| Duplicate events | Duplicate subscribers, duplicate wrappers, or mixed manual and managed lifecycle calls | [Check For Duplicate Event Sources](#check-for-duplicate-event-sources) |
+| Sensitive data appears in telemetry | Missing sanitize guardrails before subscribers or exporters | [Confirm Sanitization Before Export](#confirm-sanitization-before-export) |
+
+## Run The Ordered Checks
+
+Run these checks in order before changing exporter or application code.
+
+1. Confirm the instrumentation boundary.
+2. Confirm the active scope and root scope ownership.
+3. Confirm managed tool and LLM calls.
+4. Confirm subscriber or exporter registration timing.
+5. Confirm exporter endpoint, environment, and flush behavior.
+6. Confirm sanitization before export.
+
+## Confirm Instrumentation Boundary
+
+Start with the code path that owns the real work.
+
+- If application code calls the tool or model provider directly, verify that the
+  call path uses [Instrument Applications](../instrument-applications/about.md)
+  guidance.
+- If a framework owns scheduling, retries, callbacks, or provider payloads,
+  verify that the integration uses
+  [Integrate into Frameworks](../integrate-frameworks/about.md) guidance.
+- If a plugin installs runtime behavior, verify that the plugin is activated
+  before the request path starts.
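
One way to check whether events exist at this boundary, before looking at any exporter, is to record a sanitized copy of each event in process. The sketch below is illustrative only: `EventRecorder` and its `on_event` callback are hypothetical names, not NeMo Relay APIs, and the `"scope"` and `"tool"` category values are assumed. The field names `uuid`, `parent_uuid`, `category`, `scope_category`, and `name` come from the escalation checklist in this runbook. Wire the callback to whatever subscriber registration your binding documents.

```python
from collections import Counter
from typing import Any, Mapping


class EventRecorder:
    """Hypothetical in-process subscriber that keeps only non-sensitive fields."""

    # Only identifiers and categories are kept; payloads are never stored.
    SAFE_KEYS = ("uuid", "parent_uuid", "category", "scope_category", "name")

    def __init__(self) -> None:
        self.events: list[dict[str, Any]] = []

    def on_event(self, event: Mapping[str, Any]) -> None:
        # Copy the allowlisted keys and drop everything else.
        self.events.append({key: event.get(key) for key in self.SAFE_KEYS})

    def counts_by_category(self) -> Counter:
        return Counter(event["category"] for event in self.events)


recorder = EventRecorder()
# Stand-ins for events delivered by the runtime. Real events would arrive
# through the subscriber callback, not from hand-written dicts.
recorder.on_event({"uuid": "a1", "parent_uuid": None, "category": "scope",
                   "name": "request", "input": "raw prompt text"})
recorder.on_event({"uuid": "b2", "parent_uuid": "a1", "category": "tool",
                   "name": "search", "arguments": {"q": "secret"}})

summary = recorder.counts_by_category()
```

If `summary` stays empty after a request that should have been traced, the gap is upstream of every exporter, which points back to the instrumentation boundary rather than to export configuration.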
+
+Do not debug an exporter first if no in-process subscriber sees events. Add or
+enable a sanitized in-process subscriber at the same boundary and confirm that
+scope, tool, or LLM events exist before investigating external export.
+
+## Confirm Active Scope
+
+Trace gaps and wrong parent-child relationships usually start with scope
+ownership. Verify these conditions:
+
+- Each request, agent run, or workflow starts under the intended top-level scope.
+- Detached tasks, worker threads, callbacks, and async jobs receive the intended
+  scope stack when they should remain part of the same logical run.
+- Independent requests receive fresh isolated scope stacks.
+- Scope-local middleware and subscribers are registered on the owning scope or
+  an ancestor scope.
+
+Use [Adding Scopes and Marks](../instrument-applications/adding-scopes-and-marks.md)
+and [Scopes](../about/concepts/scopes.md) to compare the intended root scope
+with the emitted event `uuid` and `parent_uuid` values.
+
+## Confirm Managed Calls
+
+Partial traces often mean some work bypasses the runtime helpers. Check these
+areas:
+
+- Tool calls that should emit tool start and end events use the managed tool
+  call path.
+- Model calls that should emit LLM start and end events use the managed LLM call
+  path or an integration wrapper that emits equivalent lifecycle events.
+- Manual lifecycle calls emit matched start and end events with the same
+  lifecycle UUID.
+- Streaming LLM responses are drained until completion so final events,
+  collectors, and subscribers can observe the completed output.
+
+Refer to [Instrument a Tool Call](../instrument-applications/instrument-tool-call.md),
+[Instrument an LLM Call](../instrument-applications/instrument-llm-call.md),
+[Wrap Tool Calls](../integrate-frameworks/wrap-tool-calls.md), and
+[Wrap LLM Calls](../integrate-frameworks/wrap-llm-calls.md).
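
The scope and lifecycle checks in the two sections above can be run mechanically over a sanitized event sample. The following is a minimal sketch, not a NeMo Relay utility: `check_lifecycle` is a hypothetical helper, and the `phase` key with values `"start"` and `"end"` is an assumed way of marking lifecycle events. Only `uuid` and `parent_uuid` are names taken from this runbook.

```python
from typing import Any, Iterable, Mapping


def check_lifecycle(events: Iterable[Mapping[str, Any]]) -> dict[str, list[str]]:
    """Report unmatched start/end events and parents that were never seen.

    Assumes each event has `uuid`, `parent_uuid`, and a hypothetical `phase`
    key whose value is "start" or "end".
    """
    started: set[str] = set()
    ended: set[str] = set()
    parents: set[str] = set()
    for event in events:
        if event["phase"] == "start":
            started.add(event["uuid"])
        elif event["phase"] == "end":
            ended.add(event["uuid"])
        if event.get("parent_uuid") is not None:
            parents.add(event["parent_uuid"])
    return {
        "start_without_end": sorted(started - ended),
        "end_without_start": sorted(ended - started),
        "unknown_parent": sorted(parents - started),
    }


sample = [
    {"uuid": "root", "parent_uuid": None, "phase": "start"},
    {"uuid": "tool-1", "parent_uuid": "root", "phase": "start"},
    {"uuid": "tool-1", "parent_uuid": "root", "phase": "end"},
    {"uuid": "llm-1", "parent_uuid": "ghost", "phase": "start"},
    {"uuid": "root", "parent_uuid": None, "phase": "end"},
]
report = check_lifecycle(sample)
```

For this sample, `llm-1` is reported under `start_without_end` and `ghost` under `unknown_parent`. A start without an end on an LLM call may indicate a stream that was not drained. An unknown parent may indicate a scope propagation problem or a subscriber that registered after the parent event was emitted.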
+
+## Confirm Subscriber And Exporter Registration
+
+Events are not buffered for subscribers that register after the event has
+already been emitted. Verify these conditions:
+
+- Plugin-managed observability components are loaded before the request path.
+- Manual subscribers are registered before the scope, tool, or LLM events they
+  need to observe.
+- Scope-local subscribers are registered on a scope that is active for the work
+  they should observe.
+- Exporter filters match the intended root scope or event category.
+- Shutdown, teardown, or request completion calls flush owned exporters before
+  the process exits or the container stops.
+
+Use [Observability](../plugins/observability/about.md),
+[Observability Configuration](../plugins/observability/configuration.md), and
+[Subscribers](../about/concepts/subscribers.md) to verify the registration
+lifecycle.
+
+## Confirm Exporter Setup
+
+If in-process event inspection works but export fails elsewhere, isolate
+exporter transport and configuration from runtime instrumentation.
+
+For file or trajectory export, confirm these settings:
+
+- Output paths are writable by the running process.
+- The application shuts down or clears the exporter in a path that flushes
+  partial output.
+- ATIF export is scoped to the intended agent root and does not mix concurrent
+  root scopes.
+
+For OpenTelemetry or OpenInference export, confirm these settings:
+
+- The OpenTelemetry Protocol (OTLP) endpoint, headers, credentials, and network
+  egress are available in the target environment.
+- The exporter is enabled in the active configuration file or plugin document.
+- The backend receives spans with `nemo_relay.uuid` and
+  `nemo_relay.parent_uuid` attributes.
+- The application flushes and shuts down the subscriber during graceful
+  termination.
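
Two of the exporter checks listed above lend themselves to a small preflight script: whether the output path is writable, and whether a received span carries the correlation attributes. The sketch below uses only the Python standard library. The helper names are hypothetical, and the span is represented as a plain attribute mapping rather than any specific SDK type.

```python
import os
import tempfile
from typing import Any, Mapping

# Attribute names taken from this runbook.
REQUIRED_SPAN_ATTRIBUTES = ("nemo_relay.uuid", "nemo_relay.parent_uuid")


def output_dir_is_writable(path: str) -> bool:
    """Return True if the current process can create a file in `path`."""
    try:
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        # Covers a missing directory as well as a permission failure.
        return False


def missing_span_attributes(attributes: Mapping[str, Any]) -> list[str]:
    """List required correlation attributes absent from one span."""
    return [key for key in REQUIRED_SPAN_ATTRIBUTES if key not in attributes]


# Example span attributes as they might be copied from a backend query.
span = {"nemo_relay.uuid": "b2", "http.method": "POST"}
missing = missing_span_attributes(span)
```

Run the writability check as the same user and inside the same container as the application, since permissions often differ between a developer shell and the deployed process. This runbook does not state whether root spans carry a parent attribute, so treat a missing `nemo_relay.parent_uuid` on a root span as something to confirm rather than as a failure.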
+
+Refer to [Agent Trajectory Observability Format (ATOF)](../plugins/observability/atof.md),
+[Agent Trajectory Interchange Format (ATIF)](../plugins/observability/atif.md),
+[OpenTelemetry](../plugins/observability/opentelemetry.md), and
+[OpenInference](../plugins/observability/openinference.md).
+
+## Check For Duplicate Event Sources
+
+Duplicate events usually mean the same boundary is instrumented more than once.
+Check these areas:
+
+- The application does not wrap a call that a framework integration already
+  wraps.
+- Manual lifecycle calls are not emitted around the same call that already uses
+  managed tool or LLM helpers.
+- Plugin-managed exporters and manually registered exporters are not both
+  active for the same output path or backend.
+- Retry logic belongs to the framework or application and is not being counted
+  as duplicate telemetry for the same real call.
+
+If duplicate events are expected because a retry or fallback actually executed
+more than once, preserve the events and add stable names or metadata that let
+the downstream backend distinguish attempts.
+
+## Confirm Sanitization Before Export
+
+Sensitive data in telemetry is an incident. Use this order:
+
+1. Stop or disable the affected exporter if sensitive data is leaving the
+   intended trust boundary.
+2. Keep the application path stable unless the live request itself is unsafe.
+3. Add or fix sanitize-request and sanitize-response guardrails before
+   subscribers and exporters receive events.
+4. Validate the sanitized event with ATOF JSONL or an in-process subscriber
+   before re-enabling external export.
+5. Re-enable one exporter at a time and confirm the downstream backend no
+   longer receives sensitive fields.
+
+Use a request intercept only when the real request to the tool or provider must
+change. Use a sanitize guardrail when only the recorded telemetry should change.
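
The validation in step 4 can be partly automated by scanning sanitized JSONL output for key names that should never appear. This is a sketch under stated assumptions: the denylist in `SENSITIVE_KEYS` is hypothetical and must be adapted to your payload schema, and the sample lines are invented for illustration rather than taken from real ATOF output.

```python
import json
from typing import Any, Iterable, Iterator

# Hypothetical denylist; extend it to match your own payload schema.
SENSITIVE_KEYS = {"authorization", "api_key", "token", "prompt", "arguments"}


def sensitive_paths(value: Any, prefix: str = "") -> Iterator[str]:
    """Yield dotted paths of keys in `value` that match the denylist."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            if str(key).lower() in SENSITIVE_KEYS:
                yield path
            yield from sensitive_paths(child, path)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from sensitive_paths(child, f"{prefix}[{index}]")


def scan_jsonl(lines: Iterable[str]) -> list[tuple[int, str]]:
    """Return (line_number, path) for every denylisted key found."""
    findings: list[tuple[int, str]] = []
    for number, line in enumerate(lines, start=1):
        if line.strip():
            record = json.loads(line)
            findings.extend((number, path) for path in sensitive_paths(record))
    return findings


sample = [
    '{"uuid": "a1", "category": "llm", "metadata": {"model": "m"}}',
    '{"uuid": "b2", "category": "tool", "data": {"headers": [{"Authorization": "x"}]}}',
]
findings = scan_jsonl(sample)
```

The scan matches key names only. It flags a key even when its value has already been redacted, and it misses secrets embedded in free text, so use it as a gate before re-enabling an exporter rather than as proof that the output is clean.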
+
+## Escalation Capture Checklist
+
+Collect this information before escalating an incident:
+
+- NeMo Relay version and binding package version.
+- Language binding and runtime version.
+- Whether instrumentation is direct application code, a framework integration,
+  or plugin-managed behavior.
+- Exporter type, configuration source, and activation path.
+- Sanitized event sample that shows `uuid`, `parent_uuid`, `category`,
+  `scope_category`, name, and redacted metadata.
+- Runtime shape, such as single process, worker pool, async tasks, sidecar, job
+  queue, or container orchestration.
+- Reproduction scope, including whether the failure occurs for one request, one
+  tenant, one service, or all requests.
+- Recent changes to instrumentation, plugin configuration, exporter endpoints,
+  runtime environment, or tracing backend configuration.
+
+Do not attach raw prompts, model responses, credentials, customer records,
+authorization headers, or unredacted tool arguments to escalation artifacts.
diff --git a/docs/troubleshooting/troubleshooting-guide.md b/docs/troubleshooting/troubleshooting-guide.md
index c54b21bc..5ddd4f3b 100644
--- a/docs/troubleshooting/troubleshooting-guide.md
+++ b/docs/troubleshooting/troubleshooting-guide.md
@@ -7,6 +7,10 @@ SPDX-License-Identifier: Apache-2.0
 
 Use this page when a NeMo Relay setup, build, or runtime workflow does not behave as expected.
 
+For trace incidents involving missing traces, wrong scope attachment, export
+failures, duplicate events, or sensitive telemetry, start with the
+[Trace Incident Runbook](trace-incident-runbook.md).
+
 ## Package Or Build Setup Fails
 
 Confirm that your environment matches [Prerequisites](../getting-started/prerequisites.md), then rerun the binding-specific setup command from [Installation](../getting-started/installation.md).