fix(preflight): warn before gateway start when nvidia.com/gpu CDI spec is missing by laitingsheng · Pull Request #3159 · NVIDIA/NemoClaw

laitingsheng · 2026-05-07T04:40:39Z

Summary

On NVIDIA Linux hosts with nvidia-container-toolkit installed (e.g. DGX Spark / DGX Station with default Ubuntu 24.04 onboarding), Docker advertises CDISpecDirs in docker info and OpenShell's gateway start --gpu auto-selects CDI mode — which then fails minutes into onboarding when no nvidia.com/gpu CDI spec has been generated. Surface the sudo nvidia-ctk cdi generate remediation in preflight, before the gateway image download and health-wait window.

Related Issue

Closes #3152

Changes

parseDockerCdiSpecDirs(info) parses the CDISpecDirs array from docker info --format '{{json .}}' output.
hasNvidiaCdiSpec(specDirs, …) scans each configured spec directory for any YAML or JSON file whose kind is exactly nvidia.com/gpu. Prefix matches like nvidia.com/gpu-extra, comment-only mentions, and unrelated specs do not register.
assessHost() returns the new dockerCdiSpecDirs and cdiNvidiaGpuSpecMissing fields. The check fires only when platform is linux, an NVIDIA GPU is present, CDISpecDirs is non-empty, and no matching spec is found.
planHostRemediation() emits a blocking generate_nvidia_cdi_spec action with the sudo nvidia-ctk cdi generate --output=<dir>/nvidia.yaml command sequence.
assertCdiNvidiaGpuSpecPresent(host, optedOutGpuPassthrough) is invoked from both preflight() and the --resume branch in onboard() so a user who first hit the failure does not walk into the same CDI error on retry.
assessHost() accepts an optional readdirImpl injection that mirrors the existing readFileImpl / gpuProbeImpl test seams.
docs/reference/troubleshooting.md adds an entry next to "GPU passthrough on Spark" so users searching for the literal unresolvable CDI devices nvidia.com/gpu=all error find the remediation, including outside the preflight guard (older NemoClaw versions, manual openshell gateway start --gpu invocations).

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
make docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai tinsonl@nvidia.com

Summary by CodeRabbit

New Features
- Onboarding now detects missing NVIDIA CDI GPU specs and blocks with guided remediation; explicit opt-out for GPU passthrough is supported and honored across resume flows.
Tests
- Expanded unit tests for CDI spec directory parsing, NVIDIA CDI spec detection, and remediation triggering.
Documentation
- Added troubleshooting steps to generate/list NVIDIA CDI specs or rerun onboarding with --no-gpu.

…c is missing Recent nvidia-container-toolkit releases configure the Docker daemon to advertise CDI spec directories (`"CDISpecDirs": ["/etc/cdi", "/var/run/cdi"]` in `docker info`) and OpenShell's `gateway start --gpu` then auto-selects CDI mode. On hosts where `nvidia-ctk cdi generate` has not been run yet — notably DGX Spark / DGX Station with default Ubuntu 24.04 onboarding — no `nvidia.com/gpu` CDI spec exists, and the gateway start fails minutes into onboarding with: Docker responded with status code 500: CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all Detect the gap during preflight and surface the `sudo nvidia-ctk cdi generate` remediation immediately, before the gateway image download and the 1-minute health-check window. The check fires only on Linux + NVIDIA host + non-empty CDISpecDirs + no spec containing `kind: nvidia.com/gpu` under those directories. --no-gpu opts out of GPU passthrough entirely and therefore also opts out of this block, so the legacy runtime path is unaffected. parseDockerCdiSpecDirs() and the new YAML/JSON spec scanner are both injectable for tests; assessHost accepts an optional readdirImpl that mirrors the existing readFileImpl injection pattern. Tests cover: missing spec, present YAML/JSON spec, non-NVIDIA host, absent CDISpecDirs, macOS scoping, comment-only false-positive guard, and the planHostRemediation action shape. Closes #3152 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

coderabbitai · 2026-05-07T04:40:50Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds Docker CDI spec-dir parsing and NVIDIA CDI-spec scanning; records dockerCdiSpecDirs and cdiNvidiaGpuSpecMissing in HostAssessment; blocks onboarding with a remediation action when CDI dirs are configured but no nvidia.com/gpu spec is present; updates onboarding guard, tests, and troubleshooting docs.

Changes

CDI GPU Spec Detection and Preflight Early Exit

Layer / File(s)	Summary
Data Shape `src/lib/preflight.ts`	`HostAssessment` gains `dockerCdiSpecDirs: string[]` and `cdiNvidiaGpuSpecMissing: boolean`; `AssessHostOpts` adds optional `readdirImpl` for DI.
Parse Helper `src/lib/preflight.ts`	`parseDockerCdiSpecDirs` exported: extracts `CDISpecDirs` from `docker info` JSON output.
CDI Spec Scanner `src/lib/preflight.ts`	Internal `hasNvidiaCdiSpec` scans CDI spec dirs for files with `kind: nvidia.com/gpu`, tolerating readdir/read failures and supporting YAML/JSON forms.
Host Assessment Wiring `src/lib/preflight.ts`	`assessHost` wires `readdirImpl` default, computes `dockerCdiSpecDirs`, and sets `cdiNvidiaGpuSpecMissing` when Linux + NVIDIA GPU + CDI dirs advertised but no NVIDIA spec found.
Remediation Action `src/lib/preflight.ts`	`planHostRemediation` emits blocking `generate_nvidia_cdi_spec` action with `nvidia-ctk cdi generate --output=<first-dir>/nvidia.yaml`, `nvidia-ctk cdi list`, and retry instructions (or `--no-gpu`).
Onboard Preflight Guard `src/lib/onboard.ts`	`preflight(preflightOpts)` signature added and `assertCdiNvidiaGpuSpecPresent` invoked after assessment; `onboard()` passes `{ optedOutGpuPassthrough: opts.noGpu === true }` and re-runs the guard on resume.
Tests / Docs `src/lib/preflight.test.ts`, `docs/reference/troubleshooting.md`	Tests added for `parseDockerCdiSpecDirs`, expanded `assessHost` CDI cases, updated `planHostRemediation` fixtures and blocking-action test; docs add DGX Spark troubleshooting steps and remediation commands.

Sequence Diagram

sequenceDiagram
    actor User
    participant onboard as onboard()
    participant preflight as preflight()
    participant docker as DockerDaemon
    participant fs as Filesystem
    participant exit as EarlyExit

    User->>onboard: onboard(opts)
    onboard->>preflight: preflight({optedOutGpuPassthrough})
    
    rect rgba(100,150,200,0.5)
    Note over preflight,fs: CDI Spec Detection
    preflight->>docker: docker info --format {{json .}}
    docker-->>preflight: JSON with CDISpecDirs
    preflight->>preflight: parseDockerCdiSpecDirs()
    
    alt CDI dirs configured
        preflight->>fs: readdir(CDISpecDir)
        fs-->>preflight: [spec files]
        preflight->>fs: read nvidia.yaml/nvidia.json
        fs-->>preflight: spec content
        preflight->>preflight: hasNvidiaCdiSpec()
    end
    end
    
    alt NVIDIA GPU spec missing AND CDI dirs present AND GPU not opted out
        preflight->>exit: Exit with error
        exit-->>User: "CDI specs not found on host"
    else GPU passthrough opted out OR specs present
        preflight-->>onboard: Continue onboarding
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A rabbit spies the config trail,
Hops through dirs where specs might fail,
Sniffs out missing nvidia signs,
Nibbles at nvidia-ctk lines,
Onboarding springs, no longer frail.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: detecting missing nvidia.com/gpu CDI spec during preflight and warning before gateway start.
Linked Issues check	✅ Passed	All coding objectives from `#3152` are met: CDI spec detection (parseDockerCdiSpecDirs, hasNvidiaCdiSpec), assessHost augmentation, remediation planning, and preflight assertions prevent gateway start failure.
Out of Scope Changes check	✅ Passed	All changes directly support the core objective of detecting missing CDI specs and preventing gateway start failures; no unrelated modifications detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/issue-3152-cdi-spec-missing-preflight

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/lib/onboard.ts (1)

9411-9416: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Re-run the CDI guard when preflight is skipped on --resume.

Line 9411 skips preflight() whenever preflight is already marked complete, so the new CDI check added at Line 3268 never runs on resume. In that case onboarding can still fail later at gateway start --gpu with the same CDI error this PR is meant to catch earlier.

Proposed fix

   if (resumePreflight) {
     skippedStepMessage("preflight", "cached");
     gpu = nim.detectGpu();
+    const host = assessHost();
+    if (host.cdiNvidiaGpuSpecMissing && opts.noGpu !== true) {
+      console.error(
+        "  Docker is configured for CDI device injection (CDISpecDirs is set), but no",
+      );
+      console.error(
+        "  nvidia.com/gpu CDI spec was found on the host. OpenShell's gateway start will",
+      );
+      console.error(
+        "  fail with `unresolvable CDI devices nvidia.com/gpu=all` (issue `#3152`).",
+      );
+      printRemediationActions(planHostRemediation(host));
+      process.exit(1);
+    }
   } else {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 9411 - 9416, When resuming (branch that
calls skippedStepMessage("preflight", "cached") and nim.detectGpu()), also
invoke the same CDI guard logic that preflight() runs so the CDI validation is
not skipped; either call the internal CDI-check helper used by preflight or
extract the CDI guard into a small function and call it from both preflight()
and the skipped-preflight branch (keep
onboardSession.markStepComplete("preflight") behavior unchanged).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/preflight.ts`:
- Around line 242-246: The YAML regex kindRe is too permissive and matches
prefixes like "nvidia.com/gpu-extra"; update kindRe to require an exact scalar
match by asserting the value boundary after the optional closing quote (e.g.
change the pattern used in src/lib/preflight.ts for kindRe so it matches
optional opening quote, "nvidia\.com\/gpu", optional closing quote, then a
lookahead for end-of-line, whitespace followed by a comment marker, or other
YAML value terminators like comma — ensure jsonRe remains for JSON but verify it
already requires the full quoted value), then add a regression test that ensures
a CDI spec named "nvidia.com/gpu-extra" does not satisfy the check (test the
function that uses kindRe to confirm the warning is not suppressed).

---

Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 9411-9416: When resuming (branch that calls
skippedStepMessage("preflight", "cached") and nim.detectGpu()), also invoke the
same CDI guard logic that preflight() runs so the CDI validation is not skipped;
either call the internal CDI-check helper used by preflight or extract the CDI
guard into a small function and call it from both preflight() and the
skipped-preflight branch (keep onboardSession.markStepComplete("preflight")
behavior unchanged).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1d025314-75e6-47ba-a337-e324dd6c078f

📥 Commits

Reviewing files that changed from the base of the PR and between 3351fbd and 709586c.

📒 Files selected for processing (3)

src/lib/onboard.ts
src/lib/preflight.test.ts
src/lib/preflight.ts

Two findings on the original commit: 1. The YAML matcher in `hasNvidiaCdiSpec` accepted prefix matches like `kind: nvidia.com/gpu-extra`, so a sibling CDI spec on the host could silently satisfy the check and suppress the new preflight warning — the exact failure mode this PR is meant to catch. Anchor the pattern end-of-line, allow only an optional trailing comment, and require the *whole* scalar value to be `nvidia.com/gpu`. Drop the case-insensitive flag from the JSON matcher to match strictly. Regression test asserts `nvidia.com/gpu-extra` does not clear the spec-missing flag. 2. The `--resume` branch in `onboard()` skips `preflight()` entirely when the cached session marks preflight complete, so the new CDI guard never ran on the very path most likely to hit the failure (a user whose first onboard aborted at gateway-start with the CDI error and immediately reruns with `--resume`). Extract the guard into `assertCdiNvidiaGpuSpecPresent` and invoke it from both `preflight()` and the resume branch right after `nim.detectGpu()`. `assessHost()` is stateless and cheap to re-run, so the resume path stays fast; `markStepComplete("preflight")` semantics are unchanged. Also add a troubleshooting docs entry under the existing "GPU passthrough on Spark" section so users searching for the literal `unresolvable CDI devices nvidia.com/gpu=all` error message land on the remediation, and so the fix is discoverable for hosts that hit the same failure outside the new preflight guard (e.g. older NemoClaw versions). Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

src/lib/preflight.test.ts (1)

633-652: ⚡ Quick win

Add a JSON counterpart for the false-positive guard test.

The nvidia.com/gpu-extra spec test only exercises the YAML parsing path. The PR description explicitly states the matcher was tightened for both YAML and JSON formats; a parallel JSON fixture (e.g., {"kind":"nvidia.com/gpu-extra"}) would close that coverage gap.

🧪 Suggested additional test case

  it("does not accept a sibling device class such as nvidia.com/gpu-extra as a satisfying spec", () => {
    // … existing YAML test …
  });

+ it("does not accept nvidia.com/gpu-extra in a JSON CDI spec either", () => {
+   const result = assessHost({
+     platform: "linux",
+     env: {},
+     release: "6.8.0-58-generic",
+     readFileImpl: (filePath: string) =>
+       filePath.endsWith("nvidia-extra.json")
+         ? '{"cdiVersion":"0.5.0","kind":"nvidia.com/gpu-extra","devices":[]}'
+         : "Linux version 6.8.0-58-generic",
+     readdirImpl: (dir: string) => (dir === "/etc/cdi" ? ["nvidia-extra.json"] : []),
+     dockerInfoOutput: JSON.stringify({
+       ServerVersion: "27.0",
+       CDISpecDirs: ["/etc/cdi"],
+     }),
+     commandExistsImpl: (name: string) => name === "docker",
+     gpuProbeImpl: () => true,
+   });
+
+   expect(result.cdiNvidiaGpuSpecMissing).toBe(true);
+ });

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/preflight.test.ts` around lines 633 - 652, Add a parallel test case
that exercises the JSON parsing path for assessHost (same intent as the existing
"nvidia.com/gpu-extra" YAML test) by creating a new it(...) that calls
assessHost with readdirImpl returning a file name like "nvidia-extra.json" and
readFileImpl returning a JSON string such as
'{"kind":"nvidia.com/gpu-extra","devices":[]}', keeping the same
dockerInfoOutput CDISpecDirs, commandExistsImpl, and gpuProbeImpl behavior;
assert result.cdiNvidiaGpuSpecMissing is true to ensure the matcher rejects the
sibling device class in JSON as well.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/reference/troubleshooting.md`:
- Line 930: Replace the third-party GitHub link on the line mentioning
"Container Device Interface (CDI)" that OpenShell's `gateway start --gpu`
auto-selects: either remove the hyperlink so the text reads plain "Container
Device Interface (CDI)" or swap the link for an approved official documentation
source (e.g., the project's official docs or NVIDIA's official guide) per the
markdown policy; update the sentence to use plain text or the approved URL and
ensure the command `gateway start --gpu` remains unchanged.

In `@src/lib/onboard.ts`:
- Around line 9423-9432: The resume CDI guard currently only checks opts.noGpu
=== true and thus ignores any previously persisted "no-GPU" intent; change the
guard to compute an effective no-gpu flag that honors persisted intent (e.g.
persisted gpuPassthrough === false) as well as the current CLI flag, then pass
that effective flag into assertCdiNvidiaGpuSpecPresent(assessHost(),
effectiveNoGpu). Locate the surrounding code near assertCdiNvidiaGpuSpecPresent
and onboardSession (e.g. where onboardSession.markStepComplete("preflight") is
used) and read the persisted intent from the session/state (for example
onboardSession.persistedOptions.gpuPassthrough or a getter like
onboardSession.getPersistedOptions()) to derive effectiveNoGpu = (opts.noGpu ===
true) || (persistedOptions.gpuPassthrough === false) and use that value in the
assertion.

In `@src/lib/preflight.test.ts`:
- Around line 840-877: Update the test for the "generate_nvidia_cdi_spec" action
to assert all three emitted commands instead of only commands[0]; locate the
action lookup for id "generate_nvidia_cdi_spec" in src/lib/preflight.test.ts and
add expectations that action.commands[1] equals the verifier command (e.g.,
"sudo nvidia-ctk cdi list") and action.commands[2] equals the remediation retry
guidance (e.g., "sudo nemoclaw onboard" or the exact onboarding command used),
mirroring the pattern used in the docker_group_permission assertions so the full
remediation sequence is validated.

---

Nitpick comments:
In `@src/lib/preflight.test.ts`:
- Around line 633-652: Add a parallel test case that exercises the JSON parsing
path for assessHost (same intent as the existing "nvidia.com/gpu-extra" YAML
test) by creating a new it(...) that calls assessHost with readdirImpl returning
a file name like "nvidia-extra.json" and readFileImpl returning a JSON string
such as '{"kind":"nvidia.com/gpu-extra","devices":[]}', keeping the same
dockerInfoOutput CDISpecDirs, commandExistsImpl, and gpuProbeImpl behavior;
assert result.cdiNvidiaGpuSpecMissing is true to ensure the matcher rejects the
sibling device class in JSON as well.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4272118b-6aa7-4c3d-aa7f-73464b7f0bd0

📥 Commits

Reviewing files that changed from the base of the PR and between 709586c and 576f296.

📒 Files selected for processing (4)

docs/reference/troubleshooting.md
src/lib/onboard.ts
src/lib/preflight.test.ts
src/lib/preflight.ts

🚧 Files skipped from review as they are similar to previous changes (1)

src/lib/preflight.ts

Three follow-up findings on the previous fix-up commit: 1. Drop the `https://github.com/cncf-tags/container-device-interface` link from the new troubleshooting docs entry. The repo's markdown policy (CLAUDE.md "No External Project Links") forbids third-party GitHub links even for CNCF projects. 2. The resume CDI guard in `onboard()` only honoured `opts.noGpu === true`, which would falsely block runs that previously persisted `gpuPassthrough: false` on the session and were resumed without re-passing `--no-gpu`. Mirror the same persisted-intent resolution that the regular `gpuPassthrough` block performs a few lines below (current invocation flag wins; otherwise honour persisted session intent). 3. The unit test for the new `generate_nvidia_cdi_spec` action only asserted `commands[0]`. Tighten it to cover all three remediation commands, matching the rigour of the existing `docker_group_permission` assertions and pinning the entire user-facing remediation sequence against future drift. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

coderabbitai

🧹 Nitpick comments (1)

docs/reference/troubleshooting.md (1)
931-932: ⚡ Quick win

Rewrite to active voice and apply inline code formatting.

Line 931 uses passive voice: "has been generated."
Rewrite actively, for example: "If the host does not have a nvidia.com/gpu CDI spec yet, gateway start fails with..."

Line 932: CDISpecDirs is a Docker configuration field name and must use inline code formatting per the formatting rules.

As per coding guidelines, "Active voice required. Flag passive constructions" and "CLI commands, file paths, flags, parameter names, and values must use inline code formatting."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/reference/troubleshooting.md` around lines 931 - 932, Rewrite the two
sentences to use active voice and inline code formatting: change "has been
generated" to an active phrasing such as "the host does not have a
`nvidia.com/gpu` CDI spec yet" and ensure `CDISpecDirs` is wrapped in backticks
(e.g., `CDISpecDirs`); keep the rest of the meaning intact and preserve existing
references like `nvidia.com/gpu` and the `nemoclaw onboard` command.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@docs/reference/troubleshooting.md`:
- Around line 931-932: Rewrite the two sentences to use active voice and inline
code formatting: change "has been generated" to an active phrasing such as "the
host does not have a `nvidia.com/gpu` CDI spec yet" and ensure `CDISpecDirs` is
wrapped in backticks (e.g., `CDISpecDirs`); keep the rest of the meaning intact
and preserve existing references like `nvidia.com/gpu` and the `nemoclaw
onboard` command.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2fdc8b44-665b-40ab-a46d-dd10fe387494

📥 Commits

Reviewing files that changed from the base of the PR and between 576f296 and bc89d8b.

📒 Files selected for processing (3)

docs/reference/troubleshooting.md
src/lib/onboard.ts
src/lib/preflight.test.ts

🚧 Files skipped from review as they are similar to previous changes (1)

src/lib/preflight.test.ts

Address the CodeRabbit nitpick on #3159: the YAML-side regression test exists, but the JSON-side regex change (dropped case-insensitive flag) had no test of its own. Add a parallel case asserting that a CDI spec declaring `"kind":"nvidia.com/gpu-extra"` does not satisfy the nvidia.com/gpu check, so a future tweak that re-loosens the JSON matcher will fail loud. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

coderabbitai

🧹 Nitpick comments (1)

src/lib/preflight.test.ts (1)

503-520: ⚡ Quick win

Add a malformed-JSON regression for parseDockerCdiSpecDirs.

Current coverage checks empty input and missing/empty CDISpecDirs, but not malformed non-empty input. Add one case to lock in graceful fallback ([]) on invalid JSON.

Suggested test addition

 describe("parseDockerCdiSpecDirs", () => {
@@
   it("returns an empty array on empty input", () => {
     expect(parseDockerCdiSpecDirs("")).toEqual([]);
   });
+
+  it("returns an empty array on malformed JSON input", () => {
+    expect(parseDockerCdiSpecDirs("not-json")).toEqual([]);
+  });
 });

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/preflight.test.ts` around lines 503 - 520, Add a unit test to the
existing parseDockerCdiSpecDirs suite to ensure malformed JSON returns an empty
array: add an it block (e.g., "returns an empty array on malformed JSON") that
calls parseDockerCdiSpecDirs with invalid JSON (for example "{" or "not a json")
and asserts it equals []; this locks in the graceful fallback behavior for
parseDockerCdiSpecDirs.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/preflight.test.ts`:
- Around line 503-520: Add a unit test to the existing parseDockerCdiSpecDirs
suite to ensure malformed JSON returns an empty array: add an it block (e.g.,
"returns an empty array on malformed JSON") that calls parseDockerCdiSpecDirs
with invalid JSON (for example "{" or "not a json") and asserts it equals [];
this locks in the graceful fallback behavior for parseDockerCdiSpecDirs.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 21f00f54-2240-417a-af82-a6df67c38c01

📥 Commits

Reviewing files that changed from the base of the PR and between bc89d8b and bd7fd9c.

📒 Files selected for processing (1)

src/lib/preflight.test.ts

coderabbitai

🧹 Nitpick comments (1)

src/lib/onboard.ts (1)
9423-9437: Recommend running the onboarding-focused E2E suite for this change.

Given this touches core src/lib/onboard.ts preflight/resume control flow, run the suggested E2E jobs (especially cloud-e2e and sandbox-operations-e2e) before merge.

As per coding guidelines: src/lib/onboard.ts: “This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow.” plus the listed E2E test recommendation set.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 9423 - 9437, This change touches core
onboarding control flow in src/lib/onboard.ts (notably the
resumeOptedOutGpuPassthrough logic, assessHost(),
assertCdiNvidiaGpuSpecPresent(), and preflight() call), so before merging run
the onboarding-focused E2E suite including the cloud-e2e and
sandbox-operations-e2e jobs (and any sandbox onboarding E2E pipelines) to
validate resume vs fresh preflight behavior and GPU passthrough handling; if any
E2E failures surface, iterate on the resumeOptedOutGpuPassthrough/preflight
handling and re-run the same suites until they pass.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 9423-9437: This change touches core onboarding control flow in
src/lib/onboard.ts (notably the resumeOptedOutGpuPassthrough logic,
assessHost(), assertCdiNvidiaGpuSpecPresent(), and preflight() call), so before
merging run the onboarding-focused E2E suite including the cloud-e2e and
sandbox-operations-e2e jobs (and any sandbox onboarding E2E pipelines) to
validate resume vs fresh preflight behavior and GPU passthrough handling; if any
E2E failures surface, iterate on the resumeOptedOutGpuPassthrough/preflight
handling and re-run the same suites until they pass.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 43fc0a8b-2ccb-4f37-ac63-793646083691

📥 Commits

Reviewing files that changed from the base of the PR and between bd7fd9c and a6e87db.

📒 Files selected for processing (1)

src/lib/onboard.ts

jyaunches

Approving per review in this session — the architectural concerns are tracked in #3213. Test coverage is thorough, the resume-path re-check is correct, CI is green on an up-to-date branch.

## Summary - Bump the docs release metadata to `0.0.37`. - Document release-prep updates for messaging policy presets, sandbox runtime utilities, and the GPU CDI troubleshooting path. - Refresh generated `nemoclaw-user-*` skills from the updated docs. ## Source summary - #3159 -> `docs/reference/troubleshooting.md`: Documents the GPU CDI preflight warning and remediation for `nvidia.com/gpu=all` gateway start failures. - #2415 -> `docs/reference/network-policies.md`, `docs/manage-sandboxes/messaging-channels.md`, `docs/network-policy/customize-network-policy.md`: Clarifies that Telegram, Discord, and Slack egress comes from opt-in messaging presets, not the baseline policy. - #3091 -> `docs/deployment/sandbox-hardening.md`, `docs/network-policy/customize-network-policy.md`: Documents the retained sandbox utilities `vi`, `jq`, and `dos2unix` while keeping host-side policy files as the durable source of truth. ## Test plan - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user` - `make docs` - `npm run build:cli` - `npm run typecheck:cli` - Commit and pre-push hooks: markdownlint, docs-to-skills verification, gitleaks, commitlint, CLI typecheck ## Skipped - #3193 and #3191 matched `docs/.docs-skip` entries for experimental shields/config paths. - #3200 and #3183 were test-only fixes. - #3189 and #3163 were internal documentation/refactor changes with no public docs impact. Made with [Cursor](https://cursor.com)  ## Summary by CodeRabbit * **Documentation** * Clarified which utilities remain in the sandbox runtime for lightweight inspection and cleanup * Noted that messaging endpoints (Discord, Slack, Telegram) are not in the baseline policy and that channel presets are applied during onboarding * Added GPU passthrough troubleshooting for gateway startup * Updated release/version bump and release-prep workflow guidance, including Discord preset description updates  --------- Co-authored-by: Cursor <cursoragent@cursor.com>

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

Comment thread src/lib/preflight.ts Outdated

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

Comment thread docs/reference/troubleshooting.md Outdated

Comment thread src/lib/onboard.ts

Comment thread src/lib/preflight.test.ts

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

laitingsheng added 2 commits May 7, 2026 15:15

Merge branch 'main' into fix/issue-3152-cdi-spec-missing-preflight

3336ff1

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

laitingsheng added the v0.0.37 Release target label May 7, 2026

jyaunches mentioned this pull request May 7, 2026

refactor(cli): group inference and onboard support modules #3195

Merged

12 tasks

jyaunches self-assigned this May 7, 2026

jyaunches mentioned this pull request May 7, 2026

Epic: unify warnings, advisories, and fatal exits under a single registry #3213

Open

21 tasks

Merge branch 'main' into fix/issue-3152-cdi-spec-missing-preflight

a6e87db

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

jyaunches approved these changes May 7, 2026

View reviewed changes

jyaunches merged commit a4a9eb0 into main May 7, 2026
21 checks passed

miyoungc mentioned this pull request May 7, 2026

docs: prepare 0.0.37 release docs #3215

Merged

wscurran mentioned this pull request May 8, 2026

[DGX Spark][Install] Onboard blocked by missing CDI device spec on fresh Spark with Skip OTA #3252

Closed

Conversation

laitingsheng commented May 7, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Type of Change

Verification

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

jyaunches left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

laitingsheng commented May 7, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 7, 2026 •

edited

Loading