2 changes: 2 additions & 0 deletions docs/index.md
@@ -68,6 +68,7 @@ Use the reading path that matches your task:
| Observe a local coding-agent CLI | [NeMo Relay CLI](nemo-relay-cli/about.md) |
| Package reusable behavior | [Build Plugins](build-plugins/about.md) |
| Export traces or trajectories | [Observability](plugins/observability/about.md) |
| Debug trace incidents | [Trace Incident Runbook](troubleshooting/trace-incident-runbook.md) |
| Tune performance with adaptive behavior | [Adaptive](plugins/adaptive/about.md) |
| Look up symbols | [APIs](reference/api/index.md) |

@@ -270,6 +271,7 @@ reference/performance
:maxdepth: 2

Troubleshooting Guide <troubleshooting/troubleshooting-guide>
Trace Incident Runbook <troubleshooting/trace-incident-runbook>
```

```{toctree}
6 changes: 5 additions & 1 deletion docs/plugins/observability/about.md
@@ -57,9 +57,13 @@ Choose the exporter based on the downstream system:
| Generic OTLP traces | [OpenTelemetry](opentelemetry.md) |
| OpenInference-oriented agent and LLM spans | [OpenInference](openinference.md) |

Start with local event inspection before production export. Add sanitize
Start with in-process event inspection before exporting externally. Add sanitize
guardrails before exporters receive sensitive payloads.

For trace incidents involving missing traces, wrong scope attachment, export
failures, duplicate events, or sensitive telemetry, use the
[Trace Incident Runbook](../../troubleshooting/trace-incident-runbook.md).

## Correlating Trajectories And Traces

When ATIF and trace exporters observe the same NeMo Relay events, they share
210 changes: 210 additions & 0 deletions docs/troubleshooting/trace-incident-runbook.md
@@ -0,0 +1,210 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Trace Incident Runbook

Use this runbook when a NeMo Relay application has missing traces, partial
traces, incorrect scope parentage, exporter failures, duplicate events, or
sensitive data in telemetry. It assumes that the application already has a
baseline scope and call instrumentation path.

For first-time setup problems, start with the
[Troubleshooting Guide](troubleshooting-guide.md). For conceptual grounding,
refer to [Agent Runtime Primer](../getting-started/agent-runtime-primer.md),
[Scopes](../about/concepts/scopes.md), [Events](../about/concepts/events.md),
and [Subscribers](../about/concepts/subscribers.md).

## Protect Sensitive Data First

Do not collect raw prompts, model responses, authorization headers, tokens,
customer records, tool arguments, or provider payloads while triaging an
incident. Capture the smallest sanitized event sample that proves the failure.

Before exporting incident artifacts outside the current trust boundary, verify
that sanitize guardrails or exporter filters remove sensitive fields. Sanitize
guardrails change emitted telemetry payloads only; they do not change the live
request or response passed to the tool, model provider, or application. Refer to
[Middleware](../about/concepts/middleware.md) and
[Add Middleware](../instrument-applications/advanced-guide.md) for the
guardrail boundary.

## Triage By Symptom

Use this table to choose the first check for the symptom you see.

| Symptom | Likely Area | Start With |
|---|---|---|
| No traces | Missing instrumentation boundary or inactive exporter | [Confirm Instrumentation Boundary](#confirm-instrumentation-boundary) |
| Partial traces | Unwrapped calls, dropped streams, or late subscriber registration | [Confirm Managed Calls](#confirm-managed-calls) |
| Wrong parent or child scope | Scope propagation or shared scope stack issue | [Confirm Active Scope](#confirm-active-scope) |
| Events appear in process but export fails elsewhere | Exporter config, endpoint, environment, or flush path | [Confirm Exporter Setup](#confirm-exporter-setup) |
| Duplicate events | Duplicate subscribers, duplicate wrappers, or mixed manual and managed lifecycle calls | [Check For Duplicate Event Sources](#check-for-duplicate-event-sources) |
| Sensitive data appears in telemetry | Missing sanitize guardrails before subscribers or exporters | [Confirm Sanitization Before Export](#confirm-sanitization-before-export) |

## Run The Ordered Checks

Run these checks in order before changing exporter or application code.

1. Confirm the instrumentation boundary.
2. Confirm the active scope and root scope ownership.
3. Confirm managed tool and LLM calls.
4. Confirm subscriber or exporter registration timing.
5. Confirm exporter endpoint, environment, and flush behavior.
6. Confirm sanitization before export.

## Confirm Instrumentation Boundary

Start with the code path that owns the real work.

- If application code calls the tool or model provider directly, verify that the
call path uses [Instrument Applications](../instrument-applications/about.md)
guidance.
- If a framework owns scheduling, retries, callbacks, or provider payloads,
verify that the integration uses
[Integrate into Frameworks](../integrate-frameworks/about.md) guidance.
- If a plugin installs runtime behavior, verify that the plugin is activated
before the request path starts.

Do not debug an exporter first if no in-process subscriber sees events. Add or
enable a sanitized in-process subscriber at the same boundary and confirm that
scope, tool, or LLM events exist before investigating external export.
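
As a minimal sketch of that in-process check, the callable below counts events
by category and never stores payloads. It assumes events can be read as
dictionaries with a `category` key; the `CategoryCounter` name and the category
values are hypothetical, and registration is left to the documented subscriber
API for your binding.

```python
from collections import Counter


class CategoryCounter:
    """Hypothetical in-process subscriber that records categories only."""

    def __init__(self):
        self.counts = Counter()

    def __call__(self, event):
        # Read one non-sensitive field; ignore inputs, outputs, and metadata.
        self.counts[event.get("category", "unknown")] += 1


# Stand-in events (assumed shape) fed directly to the callable.
counter = CategoryCounter()
for event in [{"category": "scope"}, {"category": "tool"}, {"category": "tool"}]:
    counter(event)

print(dict(counter.counts))
```

If every count stays at zero during a real request, the gap is upstream of any
exporter.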

## Confirm Active Scope

Trace gaps and wrong parent-child relationships usually start with scope
ownership. Verify these conditions:

- Each request, agent run, or workflow starts under the intended top-level scope.
- Detached tasks, worker threads, callbacks, and async jobs receive the intended
scope stack when they should remain part of the same logical run.
- Independent requests receive fresh isolated scope stacks.
- Scope-local middleware and subscribers are registered on the owning scope or
an ancestor scope.

Use [Adding Scopes and Marks](../instrument-applications/adding-scopes-and-marks.md)
and [Scopes](../about/concepts/scopes.md) to compare the intended root scope
with the emitted event `uuid` and `parent_uuid` values.
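
The comparison can be sketched over a sanitized event sample. This is an
illustration only: it assumes each event has been loaded as a dictionary with
`uuid` and `parent_uuid` keys, and the helper names and sample values are
hypothetical.

```python
def find_roots(events):
    """Return uuids of events that have no parent."""
    return [e["uuid"] for e in events if e.get("parent_uuid") is None]


def find_orphans(events):
    """Return uuids of events whose parent_uuid is absent from the sample."""
    known = {e["uuid"] for e in events}
    return [
        e["uuid"]
        for e in events
        if e.get("parent_uuid") is not None and e["parent_uuid"] not in known
    ]


# Hypothetical sanitized sample: one root, one attached child, one orphan.
events = [
    {"uuid": "root", "parent_uuid": None, "category": "scope"},
    {"uuid": "tool-1", "parent_uuid": "root", "category": "tool"},
    {"uuid": "llm-1", "parent_uuid": "missing", "category": "llm"},
]

print("roots:", find_roots(events))
print("orphans:", find_orphans(events))
```

More than one root in a single logical run, or an orphan in a complete sample,
points at scope propagation. In a partial sample, an orphan may only mean the
parent event was not captured.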

## Confirm Managed Calls

Partial traces often mean some work bypasses the runtime helpers. Check these
areas:

- Tool calls that should emit tool start and end events use the managed tool
call path.
- Model calls that should emit LLM start and end events use the managed LLM call
path or an integration wrapper that emits equivalent lifecycle events.
- Manual lifecycle calls emit matched start and end events with the same
lifecycle UUID.
- Streaming LLM responses are drained until completion so final events,
collectors, and subscribers can observe the completed output.

Refer to [Instrument a Tool Call](../instrument-applications/instrument-tool-call.md),
[Instrument an LLM Call](../instrument-applications/instrument-llm-call.md),
[Wrap Tool Calls](../integrate-frameworks/wrap-tool-calls.md), and
[Wrap LLM Calls](../integrate-frameworks/wrap-llm-calls.md).
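
The matched start and end check can be sketched against a sanitized sample.
The sketch assumes each event is a dictionary with a `uuid` key and a
hypothetical `phase` key set to `"start"` or `"end"`; map these to the fields
your event format actually uses.

```python
from collections import defaultdict


def unmatched_lifecycles(events):
    """Return {uuid: phases} for lifecycles without exactly one start and one end."""
    phases = defaultdict(list)
    for event in events:
        phases[event["uuid"]].append(event["phase"])
    return {
        uuid: seen
        for uuid, seen in phases.items()
        if sorted(seen) != ["end", "start"]
    }


# Hypothetical sample: "a" is complete, "b" never ended.
events = [
    {"uuid": "a", "phase": "start"},
    {"uuid": "a", "phase": "end"},
    {"uuid": "b", "phase": "start"},
]

print(unmatched_lifecycles(events))
```

A lifecycle with a start and no end often corresponds to a stream that was not
drained or a manual end call that was skipped on an error path.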

## Confirm Subscriber And Exporter Registration

Events are not buffered for subscribers that register after the event has
already been emitted. Verify these conditions:

- Plugin-managed observability components are loaded before the request path.
- Manual subscribers are registered before the scope, tool, or LLM events they
need to observe.
- Scope-local subscribers are registered on a scope that is active for the work
they should observe.
- Exporter filters match the intended root scope or event category.
- Shutdown, teardown, or request completion calls flush owned exporters before
the process exits or the container stops.

Use [Observability](../plugins/observability/about.md),
[Observability Configuration](../plugins/observability/configuration.md), and
[Subscribers](../about/concepts/subscribers.md) to verify the registration
lifecycle.

## Confirm Exporter Setup

If in-process event inspection works but export fails elsewhere, isolate
exporter transport and configuration from runtime instrumentation.

For file or trajectory export, confirm these settings:

- Output paths are writable by the running process.
- The application shuts down or clears the exporter in a path that flushes
partial output.
- ATIF export is scoped to the intended agent root and does not mix concurrent
root scopes.
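
A quick way to test the first item is to create and remove a temporary file in
the output directory from the same process identity that runs the application.
This is a standalone sketch using only the Python standard library; the
`can_write_to` helper is hypothetical and is not part of NeMo Relay.

```python
import os
import tempfile


def can_write_to(directory):
    """Return True if this process can create a file in the directory."""
    try:
        # The temporary file is deleted when the context manager exits.
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers a missing directory, denied permission, or a read-only mount.
        return False


missing = os.path.join(tempfile.gettempdir(), "no-such-dir-for-check", "nested")
print(can_write_to(tempfile.gettempdir()))
print(can_write_to(missing))
```

Run the check inside the container or service account that hosts the exporter,
because a path that is writable on a developer machine may be read-only there.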

For OpenTelemetry or OpenInference export, confirm these settings:

- The OpenTelemetry Protocol (OTLP) endpoint, headers, credentials, and network
egress are available in the target environment.
- The exporter is enabled in the active configuration file or plugin document.
- The backend receives spans with `nemo_relay.uuid` and
`nemo_relay.parent_uuid` attributes.
- The application flushes and shuts down the subscriber during graceful
termination.

Refer to [Agent Trajectory Observability Format (ATOF)](../plugins/observability/atof.md),
[Agent Trajectory Interchange Format (ATIF)](../plugins/observability/atif.md),
[OpenTelemetry](../plugins/observability/opentelemetry.md), and
[OpenInference](../plugins/observability/openinference.md).

## Check For Duplicate Event Sources

Duplicate events usually mean the same boundary is instrumented more than once.
Check these areas:

- The application does not wrap a call that a framework integration already
wraps.
- Manual lifecycle calls are not emitted around the same call that already uses
managed tool or LLM helpers.
- Plugin-managed exporters and manually registered exporters are not both
active for the same output path or backend.
- Retries issued by the framework or application are real repeated calls;
confirm that they are not being mistaken for duplicate telemetry of one call.

If duplicate events are expected because a retry or fallback actually executed
more than once, preserve the events and add stable names or metadata that let
the downstream backend distinguish attempts.
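
To tell these cases apart in a sanitized sample, it can help to separate the
same event delivered twice from distinct calls that share a name. The sketch
below assumes dictionary events with `uuid`, `parent_uuid`, and `name` keys
plus a hypothetical `phase` key; the helper names are illustrative.

```python
from collections import Counter, defaultdict


def repeated_deliveries(events):
    """(uuid, phase) pairs seen more than once: one event delivered twice."""
    counts = Counter((e["uuid"], e["phase"]) for e in events)
    return sorted(key for key, n in counts.items() if n > 1)


def repeated_names(events):
    """(parent_uuid, name) pairs started under more than one uuid."""
    uuids = defaultdict(set)
    for e in events:
        if e["phase"] == "start":
            uuids[(e["parent_uuid"], e["name"])].add(e["uuid"])
    return {key: sorted(found) for key, found in uuids.items() if len(found) > 1}


# Hypothetical sample: t1 is delivered twice, t2 is a separate call.
events = [
    {"uuid": "t1", "parent_uuid": "r", "name": "search", "phase": "start"},
    {"uuid": "t1", "parent_uuid": "r", "name": "search", "phase": "start"},
    {"uuid": "t2", "parent_uuid": "r", "name": "search", "phase": "start"},
]

print(repeated_deliveries(events))
print(repeated_names(events))
```

A repeated delivery suggests duplicate subscribers or exporters. Distinct
UUIDs under one parent suggest either duplicate wrappers or a real retry, which
the application's own retry records can confirm.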

## Confirm Sanitization Before Export

Sensitive data in telemetry is an incident. Use this order:

1. Stop or disable the affected exporter if sensitive data is leaving the
intended trust boundary.
2. Keep the application path stable unless the live request itself is unsafe.
3. Add or fix sanitize-request and sanitize-response guardrails before
subscribers and exporters receive events.
4. Validate the sanitized event with ATOF JSONL or an in-process subscriber
before re-enabling external export.
5. Re-enable one exporter at a time and confirm the downstream backend no
longer receives sensitive fields.

Use a request intercept only when the real request to the tool or provider must
change. Use a sanitize guardrail when only the recorded telemetry should change.
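
Step 4 can be partly automated by scanning an exported JSONL sample for keys
that should never carry raw values. This is a sketch under stated assumptions:
the deny list, the `[REDACTED]` placeholder, and the helper names are
hypothetical, and the scan only checks key names, so it complements a manual
review rather than replacing it.

```python
import json

# Hypothetical deny list and placeholder; align both with your guardrails.
DENY_KEYS = {"authorization", "api_key", "token", "prompt", "response"}
PLACEHOLDER = "[REDACTED]"


def sensitive_paths(value, path="$"):
    """Return paths of denied keys whose values are not the placeholder."""
    found = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}"
            if key.lower() in DENY_KEYS and child != PLACEHOLDER:
                found.append(child_path)
            else:
                found.extend(sensitive_paths(child, child_path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            found.extend(sensitive_paths(child, f"{path}[{index}]"))
    return found


def scan_jsonl(lines):
    """Map 1-based line numbers to the sensitive paths found on that line."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if line.strip():
            paths = sensitive_paths(json.loads(line))
            if paths:
                report[number] = paths
    return report


lines = [
    '{"uuid": "a", "metadata": {"authorization": "Bearer abc"}}',
    '{"uuid": "b", "metadata": {"authorization": "[REDACTED]"}}',
]
print(scan_jsonl(lines))
```

An empty report is a precondition for re-enabling an exporter, not proof that
the sample is clean, because sensitive values can also appear under keys that
are not on the deny list.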

## Escalation Capture Checklist

Collect this information before escalating an incident:

- NeMo Relay version and binding package version.
- Language binding and runtime version.
- Whether instrumentation is direct application code, a framework integration,
or plugin-managed behavior.
- Exporter type, configuration source, and activation path.
- Sanitized event sample that shows `uuid`, `parent_uuid`, `category`,
`scope_category`, name, and redacted metadata.
- Runtime shape, such as single process, worker pool, async tasks, sidecar, job
queue, or container orchestration.
- Reproduction scope, including whether the failure occurs for one request, one
tenant, one service, or all requests.
- Recent changes to instrumentation, plugin configuration, exporter endpoints,
runtime environment, or tracing backend configuration.

Do not attach raw prompts, model responses, credentials, customer records,
authorization headers, or unredacted tool arguments to escalation artifacts.
4 changes: 4 additions & 0 deletions docs/troubleshooting/troubleshooting-guide.md
@@ -7,6 +7,10 @@ SPDX-License-Identifier: Apache-2.0

Use this page when a NeMo Relay setup, build, or runtime workflow does not behave as expected.

For trace incidents involving missing traces, wrong scope attachment, export
failures, duplicate events, or sensitive telemetry, start with the
[Trace Incident Runbook](trace-incident-runbook.md).

## Package Or Build Setup Fails

Confirm that your environment matches [Prerequisites](../getting-started/prerequisites.md), then rerun the binding-specific setup command from [Installation](../getting-started/installation.md).