From 512d5dac5a2f1ae57ae30061720224bbe8dec2ac Mon Sep 17 00:00:00 2001
From: mnajafian-nv
Date: Thu, 21 May 2026 17:45:48 -0700
Subject: [PATCH 1/2] docs: add production incident runbook

Signed-off-by: mnajafian-nv
---
 docs/index.md | 2 +
 docs/plugins/observability/about.md | 4 +
 .../production-incident-runbook.md | 210 ++++++++++++++++++
 docs/troubleshooting/troubleshooting-guide.md | 4 +
 4 files changed, 220 insertions(+)
 create mode 100644 docs/troubleshooting/production-incident-runbook.md

diff --git a/docs/index.md b/docs/index.md
index a153fef8..05dad9b8 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -68,6 +68,7 @@ Use the reading path that matches your task:
 | Observe a local coding-agent CLI | [NeMo Relay CLI](nemo-relay-cli/about.md) |
 | Package reusable behavior | [Build Plugins](build-plugins/about.md) |
 | Export traces or trajectories | [Observability](plugins/observability/about.md) |
+| Debug production trace incidents | [Production Incident Runbook](troubleshooting/production-incident-runbook.md) |
 | Tune performance with adaptive behavior | [Adaptive](plugins/adaptive/about.md) |
 | Look up symbols | [APIs](reference/api/index.md) |

@@ -270,6 +271,7 @@ reference/performance
 :maxdepth: 2

 Troubleshooting Guide
+Production Incident Runbook
 ```

 ```{toctree}
diff --git a/docs/plugins/observability/about.md b/docs/plugins/observability/about.md
index 81b8741d..d251a5b5 100644
--- a/docs/plugins/observability/about.md
+++ b/docs/plugins/observability/about.md
@@ -60,6 +60,10 @@ Choose the exporter based on the downstream system:
 Start with local event inspection before production export. Add sanitize
 guardrails before exporters receive sensitive payloads.

+For production incidents involving missing traces, wrong scope attachment,
+export failures, duplicate events, or sensitive telemetry, use the
+[Production Incident Runbook](../../troubleshooting/production-incident-runbook.md).
+
 ## Correlating Trajectories And Traces

 When ATIF and trace exporters observe the same NeMo Relay events, they share
diff --git a/docs/troubleshooting/production-incident-runbook.md b/docs/troubleshooting/production-incident-runbook.md
new file mode 100644
index 00000000..c4d04134
--- /dev/null
+++ b/docs/troubleshooting/production-incident-runbook.md
@@ -0,0 +1,210 @@
+
+
+# Production Incident Runbook
+
+Use this runbook when a production NeMo Relay deployment has missing traces,
+partial traces, incorrect scope parentage, exporter failures, duplicate events,
+or sensitive data in telemetry. It assumes that the application already has a
+baseline scope and call instrumentation path.
+
+For first-time setup problems, start with the
+[Troubleshooting Guide](troubleshooting-guide.md). For conceptual grounding,
+refer to [Agent Runtime Primer](../getting-started/agent-runtime-primer.md),
+[Scopes](../about/concepts/scopes.md), [Events](../about/concepts/events.md),
+and [Subscribers](../about/concepts/subscribers.md).
+
+## Protect Production Data First
+
+Do not collect raw prompts, model responses, authorization headers, tokens,
+customer records, tool arguments, or provider payloads while triaging an
+incident. Capture the smallest sanitized event sample that proves the failure.
+
+Before exporting incident artifacts outside the production environment, verify
+that sanitize guardrails or exporter filters remove sensitive fields. Sanitize
+guardrails change emitted telemetry payloads only; they do not change the live
+request or response passed to the tool, model provider, or application. Refer to
+[Middleware](../about/concepts/middleware.md) and
+[Add Middleware](../instrument-applications/advanced-guide.md) for the
+guardrail boundary.
+
+## Triage By Symptom
+
+Use this table to choose the first check for the symptom you see.
+
+| Symptom | Likely Area | Start With |
+|---|---|---|
+| No traces | Missing instrumentation boundary or inactive exporter | [Confirm Instrumentation Boundary](#confirm-instrumentation-boundary) |
+| Partial traces | Unwrapped calls, dropped streams, or late subscriber registration | [Confirm Managed Calls](#confirm-managed-calls) |
+| Wrong parent or child scope | Scope propagation or shared scope stack issue | [Confirm Active Scope](#confirm-active-scope) |
+| Export works locally but not in production | Exporter config, endpoint, environment, or flush path | [Confirm Exporter Setup](#confirm-exporter-setup) |
+| Duplicate events | Duplicate subscribers, duplicate wrappers, or mixed manual and managed lifecycle calls | [Check For Duplicate Event Sources](#check-for-duplicate-event-sources) |
+| Sensitive data appears in telemetry | Missing sanitize guardrails before subscribers or exporters | [Confirm Sanitization Before Export](#confirm-sanitization-before-export) |
+
+## Run The Ordered Checks
+
+Run these checks in order before changing exporter or application code.
+
+1. Confirm the instrumentation boundary.
+2. Confirm the active scope and root scope ownership.
+3. Confirm managed tool and LLM calls.
+4. Confirm subscriber or exporter registration timing.
+5. Confirm exporter endpoint, environment, and flush behavior.
+6. Confirm sanitization before export.
+
+## Confirm Instrumentation Boundary
+
+Start with the code path that owns the real work.
+
+- If application code calls the tool or model provider directly, verify that the
+  call path uses [Instrument Applications](../instrument-applications/about.md)
+  guidance.
+- If a framework owns scheduling, retries, callbacks, or provider payloads,
+  verify that the integration uses
+  [Integrate into Frameworks](../integrate-frameworks/about.md) guidance.
+- If a plugin installs runtime behavior, verify that the plugin is activated
+  before the request path starts.
+
+Do not debug an exporter first if no local subscriber sees events. Add or enable
+a local, sanitized subscriber at the same boundary and confirm that scope, tool,
+or LLM events exist before investigating production export.
+
+## Confirm Active Scope
+
+Trace gaps and wrong parent-child relationships usually start with scope
+ownership. Verify these conditions:
+
+- Each request, agent run, or workflow starts under the intended top-level scope.
+- Detached tasks, worker threads, callbacks, and async jobs receive the intended
+  scope stack when they should remain part of the same logical run.
+- Independent requests receive fresh isolated scope stacks.
+- Scope-local middleware and subscribers are registered on the owning scope or
+  an ancestor scope.
+
+Use [Adding Scopes and Marks](../instrument-applications/adding-scopes-and-marks.md)
+and [Scopes](../about/concepts/scopes.md) to compare the intended root scope
+with the emitted event `uuid` and `parent_uuid` values.
+
+## Confirm Managed Calls
+
+Partial traces often mean some work bypasses the runtime helpers. Check these
+areas:
+
+- Tool calls that should emit tool start and end events use the managed tool
+  call path.
+- Model calls that should emit LLM start and end events use the managed LLM call
+  path or an integration wrapper that emits equivalent lifecycle events.
+- Manual lifecycle calls emit matched start and end events with the same
+  lifecycle UUID.
+- Streaming LLM responses are drained until completion so final events,
+  collectors, and subscribers can observe the completed output.
+
+Refer to [Instrument a Tool Call](../instrument-applications/instrument-tool-call.md),
+[Instrument an LLM Call](../instrument-applications/instrument-llm-call.md),
+[Wrap Tool Calls](../integrate-frameworks/wrap-tool-calls.md), and
+[Wrap LLM Calls](../integrate-frameworks/wrap-llm-calls.md).
+
+## Confirm Subscriber And Exporter Registration
+
+Events are not buffered for subscribers that register after the event has
+already been emitted. Verify these conditions:
+
+- Plugin-managed observability components are loaded before the request path.
+- Manual subscribers are registered before the scope, tool, or LLM events they
+  need to observe.
+- Scope-local subscribers are registered on a scope that is active for the work
+  they should observe.
+- Exporter filters match the intended root scope or event category.
+- Shutdown, teardown, or request completion calls flush owned exporters before
+  the process exits or the container stops.
+
+Use [Observability](../plugins/observability/about.md),
+[Observability Configuration](../plugins/observability/configuration.md), and
+[Subscribers](../about/concepts/subscribers.md) to verify the registration
+lifecycle.
+
+## Confirm Exporter Setup
+
+If local event inspection works but production export fails, isolate exporter
+transport and configuration from runtime instrumentation.
+
+For file or trajectory export, confirm these settings:
+
+- Output paths are writable by the production process.
+- The application shuts down or clears the exporter in a path that flushes
+  partial output.
+- ATIF export is scoped to the intended agent root and does not mix concurrent
+  root scopes.
+
+For OpenTelemetry or OpenInference export, confirm these settings:
+
+- The OpenTelemetry Protocol (OTLP) endpoint, headers, credentials, and network
+  egress are available in the production environment.
+- The exporter is enabled in the active configuration file or plugin document.
+- The backend receives spans with `nemo_relay.uuid` and
+  `nemo_relay.parent_uuid` attributes.
+- The application flushes and shuts down the subscriber during graceful
+  termination.
+
+Refer to [Agent Trajectory Observability Format (ATOF)](../plugins/observability/atof.md),
+[Agent Trajectory Interchange Format (ATIF)](../plugins/observability/atif.md),
+[OpenTelemetry](../plugins/observability/opentelemetry.md), and
+[OpenInference](../plugins/observability/openinference.md).
+
+## Check For Duplicate Event Sources
+
+Duplicate events usually mean the same boundary is instrumented more than once.
+Check these areas:
+
+- The application does not wrap a call that a framework integration already
+  wraps.
+- Manual lifecycle calls are not emitted around the same call that already uses
+  managed tool or LLM helpers.
+- Plugin-managed exporters and manually registered exporters are not both
+  active for the same output path or backend.
+- Retry logic belongs to the framework or application and is not being counted
+  as duplicate telemetry for the same real call.
+
+If duplicate events are expected because a retry or fallback actually executed
+more than once, preserve the events and add stable names or metadata that let
+the downstream backend distinguish attempts.
+
+## Confirm Sanitization Before Export
+
+Sensitive data in telemetry is a production incident. Use this order:
+
+1. Stop or disable the affected exporter if sensitive data is leaving the
+   production trust boundary.
+2. Keep the application path stable unless the live request itself is unsafe.
+3. Add or fix sanitize-request and sanitize-response guardrails before
+   production subscribers and exporters receive events.
+4. Validate the sanitized event locally with ATOF JSONL or an in-process
+   subscriber before re-enabling external export.
+5. Re-enable one exporter at a time and confirm the downstream backend no
+   longer receives sensitive fields.
+
+Use a request intercept only when the real request to the tool or provider must
+change. Use a sanitize guardrail when only the recorded telemetry should change.
+
+## Escalation Capture Checklist
+
+Collect this information before escalating an incident:
+
+- NeMo Relay version and binding package version.
+- Language binding and runtime version.
+- Whether instrumentation is direct application code, a framework integration,
+  or plugin-managed behavior.
+- Exporter type, configuration source, and activation path.
+- Sanitized event sample that shows `uuid`, `parent_uuid`, `category`,
+  `scope_category`, name, and redacted metadata.
+- Deployment shape, such as single process, worker pool, async tasks, sidecar,
+  job queue, or container orchestration.
+- Reproduction scope, including whether the failure occurs for one request, one
+  tenant, one service, or all production traffic.
+- Recent changes to instrumentation, plugin configuration, exporter endpoints,
+  deployment environment, or tracing backend configuration.
+
+Do not attach raw prompts, model responses, credentials, customer records,
+authorization headers, or unredacted tool arguments to escalation artifacts.
diff --git a/docs/troubleshooting/troubleshooting-guide.md b/docs/troubleshooting/troubleshooting-guide.md
index c54b21bc..c84c1ccc 100644
--- a/docs/troubleshooting/troubleshooting-guide.md
+++ b/docs/troubleshooting/troubleshooting-guide.md
@@ -7,6 +7,10 @@ SPDX-License-Identifier: Apache-2.0

 Use this page when a NeMo Relay setup, build, or runtime workflow does not behave as expected.

+For production incidents involving missing traces, wrong scope attachment,
+export failures, duplicate events, or sensitive telemetry, start with the
+[Production Incident Runbook](production-incident-runbook.md).
+
 ## Package Or Build Setup Fails

 Confirm that your environment matches [Prerequisites](../getting-started/prerequisites.md), then rerun the binding-specific setup command from [Installation](../getting-started/installation.md).
From bcf1d44426142d445be3ef728b9a088bc61f4194 Mon Sep 17 00:00:00 2001
From: mnajafian-nv
Date: Thu, 21 May 2026 18:16:54 -0700
Subject: [PATCH 2/2] docs: generalize trace incident runbook

Signed-off-by: mnajafian-nv
---
 docs/index.md | 4 +-
 docs/plugins/observability/about.md | 8 ++--
 ...t-runbook.md => trace-incident-runbook.md} | 46 +++++++++----------
 docs/troubleshooting/troubleshooting-guide.md | 6 +--
 4 files changed, 32 insertions(+), 32 deletions(-)
 rename docs/troubleshooting/{production-incident-runbook.md => trace-incident-runbook.md} (85%)

diff --git a/docs/index.md b/docs/index.md
index 05dad9b8..4d7cc2a3 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -68,7 +68,7 @@ Use the reading path that matches your task:
 | Observe a local coding-agent CLI | [NeMo Relay CLI](nemo-relay-cli/about.md) |
 | Package reusable behavior | [Build Plugins](build-plugins/about.md) |
 | Export traces or trajectories | [Observability](plugins/observability/about.md) |
-| Debug production trace incidents | [Production Incident Runbook](troubleshooting/production-incident-runbook.md) |
+| Debug trace incidents | [Trace Incident Runbook](troubleshooting/trace-incident-runbook.md) |
 | Tune performance with adaptive behavior | [Adaptive](plugins/adaptive/about.md) |
 | Look up symbols | [APIs](reference/api/index.md) |

@@ -271,7 +271,7 @@ reference/performance
 :maxdepth: 2

 Troubleshooting Guide
-Production Incident Runbook
+Trace Incident Runbook
 ```

 ```{toctree}
diff --git a/docs/plugins/observability/about.md b/docs/plugins/observability/about.md
index d251a5b5..2b49f832 100644
--- a/docs/plugins/observability/about.md
+++ b/docs/plugins/observability/about.md
@@ -57,12 +57,12 @@ Choose the exporter based on the downstream system:
 | Generic OTLP traces | [OpenTelemetry](opentelemetry.md) |
 | OpenInference-oriented agent and LLM spans | [OpenInference](openinference.md) |

-Start with local event inspection before production export. Add sanitize
+Start with in-process event inspection before exporting externally. Add sanitize
 guardrails before exporters receive sensitive payloads.

-For production incidents involving missing traces, wrong scope attachment,
-export failures, duplicate events, or sensitive telemetry, use the
-[Production Incident Runbook](../../troubleshooting/production-incident-runbook.md).
+For trace incidents involving missing traces, wrong scope attachment, export
+failures, duplicate events, or sensitive telemetry, use the
+[Trace Incident Runbook](../../troubleshooting/trace-incident-runbook.md).

 ## Correlating Trajectories And Traces

diff --git a/docs/troubleshooting/production-incident-runbook.md b/docs/troubleshooting/trace-incident-runbook.md
similarity index 85%
rename from docs/troubleshooting/production-incident-runbook.md
rename to docs/troubleshooting/trace-incident-runbook.md
index c4d04134..d8d31682 100644
--- a/docs/troubleshooting/production-incident-runbook.md
+++ b/docs/troubleshooting/trace-incident-runbook.md
@@ -3,11 +3,11 @@ SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All
 SPDX-License-Identifier: Apache-2.0
 -->

-# Production Incident Runbook
+# Trace Incident Runbook

-Use this runbook when a production NeMo Relay deployment has missing traces,
-partial traces, incorrect scope parentage, exporter failures, duplicate events,
-or sensitive data in telemetry. It assumes that the application already has a
+Use this runbook when a NeMo Relay application has missing traces, partial
+traces, incorrect scope parentage, exporter failures, duplicate events, or
+sensitive data in telemetry. It assumes that the application already has a
 baseline scope and call instrumentation path.

 For first-time setup problems, start with the
@@ -16,13 +16,13 @@ refer to [Agent Runtime Primer](../getting-started/agent-runtime-primer.md),
 [Scopes](../about/concepts/scopes.md), [Events](../about/concepts/events.md),
 and [Subscribers](../about/concepts/subscribers.md).

-## Protect Production Data First
+## Protect Sensitive Data First

 Do not collect raw prompts, model responses, authorization headers, tokens,
 customer records, tool arguments, or provider payloads while triaging an
 incident. Capture the smallest sanitized event sample that proves the failure.

-Before exporting incident artifacts outside the production environment, verify
+Before exporting incident artifacts outside the current trust boundary, verify
 that sanitize guardrails or exporter filters remove sensitive fields. Sanitize
 guardrails change emitted telemetry payloads only; they do not change the live
 request or response passed to the tool, model provider, or application. Refer to
@@ -39,7 +39,7 @@ Use this table to choose the first check for the symptom you see.
 | No traces | Missing instrumentation boundary or inactive exporter | [Confirm Instrumentation Boundary](#confirm-instrumentation-boundary) |
 | Partial traces | Unwrapped calls, dropped streams, or late subscriber registration | [Confirm Managed Calls](#confirm-managed-calls) |
 | Wrong parent or child scope | Scope propagation or shared scope stack issue | [Confirm Active Scope](#confirm-active-scope) |
-| Export works locally but not in production | Exporter config, endpoint, environment, or flush path | [Confirm Exporter Setup](#confirm-exporter-setup) |
+| Events appear in process but export fails elsewhere | Exporter config, endpoint, environment, or flush path | [Confirm Exporter Setup](#confirm-exporter-setup) |
 | Duplicate events | Duplicate subscribers, duplicate wrappers, or mixed manual and managed lifecycle calls | [Check For Duplicate Event Sources](#check-for-duplicate-event-sources) |
 | Sensitive data appears in telemetry | Missing sanitize guardrails before subscribers or exporters | [Confirm Sanitization Before Export](#confirm-sanitization-before-export) |

@@ -67,9 +67,9 @@ Start with the code path that owns the real work.
 - If a plugin installs runtime behavior, verify that the plugin is activated
   before the request path starts.

-Do not debug an exporter first if no local subscriber sees events. Add or enable
-a local, sanitized subscriber at the same boundary and confirm that scope, tool,
-or LLM events exist before investigating production export.
+Do not debug an exporter first if no in-process subscriber sees events. Add or
+enable a sanitized in-process subscriber at the same boundary and confirm that
+scope, tool, or LLM events exist before investigating external export.

 ## Confirm Active Scope

@@ -127,12 +127,12 @@ lifecycle.

 ## Confirm Exporter Setup

-If local event inspection works but production export fails, isolate exporter
-transport and configuration from runtime instrumentation.
+If in-process event inspection works but export fails elsewhere, isolate
+exporter transport and configuration from runtime instrumentation.

 For file or trajectory export, confirm these settings:

-- Output paths are writable by the production process.
+- Output paths are writable by the running process.
 - The application shuts down or clears the exporter in a path that flushes
   partial output.
 - ATIF export is scoped to the intended agent root and does not mix concurrent
@@ -141,7 +141,7 @@ For file or trajectory export, confirm these settings:
 For OpenTelemetry or OpenInference export, confirm these settings:

 - The OpenTelemetry Protocol (OTLP) endpoint, headers, credentials, and network
-  egress are available in the production environment.
+  egress are available in the target environment.
 - The exporter is enabled in the active configuration file or plugin document.
 - The backend receives spans with `nemo_relay.uuid` and
   `nemo_relay.parent_uuid` attributes.
@@ -173,15 +173,15 @@ the downstream backend distinguish attempts.

 ## Confirm Sanitization Before Export

-Sensitive data in telemetry is a production incident. Use this order:
+Sensitive data in telemetry is an incident. Use this order:

 1. Stop or disable the affected exporter if sensitive data is leaving the
-   production trust boundary.
+   intended trust boundary.
 2. Keep the application path stable unless the live request itself is unsafe.
 3. Add or fix sanitize-request and sanitize-response guardrails before
-   production subscribers and exporters receive events.
-4. Validate the sanitized event locally with ATOF JSONL or an in-process
-   subscriber before re-enabling external export.
+   subscribers and exporters receive events.
+4. Validate the sanitized event with ATOF JSONL or an in-process subscriber
+   before re-enabling external export.
 5. Re-enable one exporter at a time and confirm the downstream backend no
    longer receives sensitive fields.

@@ -199,12 +199,12 @@ Collect this information before escalating an incident:
 - Exporter type, configuration source, and activation path.
 - Sanitized event sample that shows `uuid`, `parent_uuid`, `category`,
   `scope_category`, name, and redacted metadata.
-- Deployment shape, such as single process, worker pool, async tasks, sidecar,
-  job queue, or container orchestration.
+- Runtime shape, such as single process, worker pool, async tasks, sidecar, job
+  queue, or container orchestration.
 - Reproduction scope, including whether the failure occurs for one request, one
-  tenant, one service, or all production traffic.
+  tenant, one service, or all requests.
 - Recent changes to instrumentation, plugin configuration, exporter endpoints,
-  deployment environment, or tracing backend configuration.
+  runtime environment, or tracing backend configuration.

 Do not attach raw prompts, model responses, credentials, customer records,
 authorization headers, or unredacted tool arguments to escalation artifacts.
diff --git a/docs/troubleshooting/troubleshooting-guide.md b/docs/troubleshooting/troubleshooting-guide.md
index c84c1ccc..5ddd4f3b 100644
--- a/docs/troubleshooting/troubleshooting-guide.md
+++ b/docs/troubleshooting/troubleshooting-guide.md
@@ -7,9 +7,9 @@ SPDX-License-Identifier: Apache-2.0

 Use this page when a NeMo Relay setup, build, or runtime workflow does not behave as expected.

-For production incidents involving missing traces, wrong scope attachment,
-export failures, duplicate events, or sensitive telemetry, start with the
-[Production Incident Runbook](production-incident-runbook.md).
+For trace incidents involving missing traces, wrong scope attachment, export
+failures, duplicate events, or sensitive telemetry, start with the
+[Trace Incident Runbook](trace-incident-runbook.md).

 ## Package Or Build Setup Fails