Skip to content

fix(preflight): warn before gateway start when nvidia.com/gpu CDI spec is missing#3159

Merged
jyaunches merged 6 commits into
mainfrom
fix/issue-3152-cdi-spec-missing-preflight
May 7, 2026
Merged

fix(preflight): warn before gateway start when nvidia.com/gpu CDI spec is missing#3159
jyaunches merged 6 commits into
mainfrom
fix/issue-3152-cdi-spec-missing-preflight

Conversation

@laitingsheng
Copy link
Copy Markdown
Contributor

@laitingsheng laitingsheng commented May 7, 2026

Summary

On NVIDIA Linux hosts with nvidia-container-toolkit installed (e.g. DGX Spark / DGX Station with default Ubuntu 24.04 onboarding), Docker advertises CDISpecDirs in docker info and OpenShell's gateway start --gpu auto-selects CDI mode — which then fails minutes into onboarding when no nvidia.com/gpu CDI spec has been generated. Surface the sudo nvidia-ctk cdi generate remediation in preflight, before the gateway image download and health-wait window.

Related Issue

Closes #3152

Changes

  • parseDockerCdiSpecDirs(info) parses the CDISpecDirs array from docker info --format '{{json .}}' output.
  • hasNvidiaCdiSpec(specDirs, …) scans each configured spec directory for any YAML or JSON file whose kind is exactly nvidia.com/gpu. Prefix matches like nvidia.com/gpu-extra, comment-only mentions, and unrelated specs do not register.
  • assessHost() returns the new dockerCdiSpecDirs and cdiNvidiaGpuSpecMissing fields. The check fires only when platform is linux, an NVIDIA GPU is present, CDISpecDirs is non-empty, and no matching spec is found.
  • planHostRemediation() emits a blocking generate_nvidia_cdi_spec action with the sudo nvidia-ctk cdi generate --output=<dir>/nvidia.yaml command sequence.
  • assertCdiNvidiaGpuSpecPresent(host, optedOutGpuPassthrough) is invoked from both preflight() and the --resume branch in onboard() so a user who first hit the failure does not walk into the same CDI error on retry.
  • assessHost() accepts an optional readdirImpl injection that mirrors the existing readFileImpl / gpuProbeImpl test seams.
  • docs/reference/troubleshooting.md adds an entry next to "GPU passthrough on Spark" so users searching for the literal unresolvable CDI devices nvidia.com/gpu=all error find the remediation, including outside the preflight guard (older NemoClaw versions, manual openshell gateway start --gpu invocations).

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai tinsonl@nvidia.com

Summary by CodeRabbit

  • New Features

    • Onboarding now detects missing NVIDIA CDI GPU specs and blocks with guided remediation; explicit opt-out for GPU passthrough is supported and honored across resume flows.
  • Tests

    • Expanded unit tests for CDI spec directory parsing, NVIDIA CDI spec detection, and remediation triggering.
  • Documentation

    • Added troubleshooting steps to generate/list NVIDIA CDI specs or rerun onboarding with --no-gpu.

…c is missing

Recent nvidia-container-toolkit releases configure the Docker daemon to
advertise CDI spec directories (`"CDISpecDirs": ["/etc/cdi", "/var/run/cdi"]`
in `docker info`) and OpenShell's `gateway start --gpu` then auto-selects
CDI mode. On hosts where `nvidia-ctk cdi generate` has not been run yet —
notably DGX Spark / DGX Station with default Ubuntu 24.04 onboarding — no
`nvidia.com/gpu` CDI spec exists, and the gateway start fails minutes into
onboarding with:

  Docker responded with status code 500: CDI device injection failed:
  unresolvable CDI devices nvidia.com/gpu=all

Detect the gap during preflight and surface the `sudo nvidia-ctk cdi generate`
remediation immediately, before the gateway image download and the
1-minute health-check window. The check fires only on Linux + NVIDIA host +
non-empty CDISpecDirs + no spec containing `kind: nvidia.com/gpu` under
those directories. --no-gpu opts out of GPU passthrough entirely and
therefore also opts out of this block, so the legacy runtime path is
unaffected.

parseDockerCdiSpecDirs() and the new YAML/JSON spec scanner are both
injectable for tests; assessHost accepts an optional readdirImpl that
mirrors the existing readFileImpl injection pattern.

Tests cover: missing spec, present YAML/JSON spec, non-NVIDIA host,
absent CDISpecDirs, macOS scoping, comment-only false-positive guard,
and the planHostRemediation action shape.

Closes #3152

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 7, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds Docker CDI spec-dir parsing and NVIDIA CDI-spec scanning; records dockerCdiSpecDirs and cdiNvidiaGpuSpecMissing in HostAssessment; blocks onboarding with a remediation action when CDI dirs are configured but no nvidia.com/gpu spec is present; updates onboarding guard, tests, and troubleshooting docs.

Changes

CDI GPU Spec Detection and Preflight Early Exit

Layer / File(s) Summary
Data Shape
src/lib/preflight.ts
HostAssessment gains dockerCdiSpecDirs: string[] and cdiNvidiaGpuSpecMissing: boolean; AssessHostOpts adds optional readdirImpl for DI.
Parse Helper
src/lib/preflight.ts
parseDockerCdiSpecDirs exported: extracts CDISpecDirs from docker info JSON output.
CDI Spec Scanner
src/lib/preflight.ts
Internal hasNvidiaCdiSpec scans CDI spec dirs for files with kind: nvidia.com/gpu, tolerating readdir/read failures and supporting YAML/JSON forms.
Host Assessment Wiring
src/lib/preflight.ts
assessHost wires readdirImpl default, computes dockerCdiSpecDirs, and sets cdiNvidiaGpuSpecMissing when Linux + NVIDIA GPU + CDI dirs advertised but no NVIDIA spec found.
Remediation Action
src/lib/preflight.ts
planHostRemediation emits blocking generate_nvidia_cdi_spec action with nvidia-ctk cdi generate --output=<first-dir>/nvidia.yaml, nvidia-ctk cdi list, and retry instructions (or --no-gpu).
Onboard Preflight Guard
src/lib/onboard.ts
preflight(preflightOpts) signature added and assertCdiNvidiaGpuSpecPresent invoked after assessment; onboard() passes { optedOutGpuPassthrough: opts.noGpu === true } and re-runs the guard on resume.
Tests / Docs
src/lib/preflight.test.ts, docs/reference/troubleshooting.md
Tests added for parseDockerCdiSpecDirs, expanded assessHost CDI cases, updated planHostRemediation fixtures and blocking-action test; docs add DGX Spark troubleshooting steps and remediation commands.

Sequence Diagram

sequenceDiagram
    actor User
    participant onboard as onboard()
    participant preflight as preflight()
    participant docker as DockerDaemon
    participant fs as Filesystem
    participant exit as EarlyExit

    User->>onboard: onboard(opts)
    onboard->>preflight: preflight({optedOutGpuPassthrough})
    
    rect rgba(100,150,200,0.5)
    Note over preflight,fs: CDI Spec Detection
    preflight->>docker: docker info --format {{json .}}
    docker-->>preflight: JSON with CDISpecDirs
    preflight->>preflight: parseDockerCdiSpecDirs()
    
    alt CDI dirs configured
        preflight->>fs: readdir(CDISpecDir)
        fs-->>preflight: [spec files]
        preflight->>fs: read nvidia.yaml/nvidia.json
        fs-->>preflight: spec content
        preflight->>preflight: hasNvidiaCdiSpec()
    end
    end
    
    alt NVIDIA GPU spec missing AND CDI dirs present AND GPU not opted out
        preflight->>exit: Exit with error
        exit-->>User: "CDI specs not found on host"
    else GPU passthrough opted out OR specs present
        preflight-->>onboard: Continue onboarding
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A rabbit spies the config trail,
Hops through dirs where specs might fail,
Sniffs out missing nvidia signs,
Nibbles at nvidia-ctk lines,
Onboarding springs, no longer frail.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: detecting missing nvidia.com/gpu CDI spec during preflight and warning before gateway start.
Linked Issues check ✅ Passed All coding objectives from #3152 are met: CDI spec detection (parseDockerCdiSpecDirs, hasNvidiaCdiSpec), assessHost augmentation, remediation planning, and preflight assertions prevent gateway start failure.
Out of Scope Changes check ✅ Passed All changes directly support the core objective of detecting missing CDI specs and preventing gateway start failures; no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/issue-3152-cdi-spec-missing-preflight

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)

9411-9416: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Re-run the CDI guard when preflight is skipped on --resume.

Line 9411 skips preflight() whenever preflight is already marked complete, so the new CDI check added at Line 3268 never runs on resume. In that case onboarding can still fail later at gateway start --gpu with the same CDI error this PR is meant to catch earlier.

Proposed fix
   if (resumePreflight) {
     skippedStepMessage("preflight", "cached");
     gpu = nim.detectGpu();
+    const host = assessHost();
+    if (host.cdiNvidiaGpuSpecMissing && opts.noGpu !== true) {
+      console.error(
+        "  Docker is configured for CDI device injection (CDISpecDirs is set), but no",
+      );
+      console.error(
+        "  nvidia.com/gpu CDI spec was found on the host. OpenShell's gateway start will",
+      );
+      console.error(
+        "  fail with `unresolvable CDI devices nvidia.com/gpu=all` (issue `#3152`).",
+      );
+      printRemediationActions(planHostRemediation(host));
+      process.exit(1);
+    }
   } else {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 9411 - 9416, When resuming (branch that
calls skippedStepMessage("preflight", "cached") and nim.detectGpu()), also
invoke the same CDI guard logic that preflight() runs so the CDI validation is
not skipped; either call the internal CDI-check helper used by preflight or
extract the CDI guard into a small function and call it from both preflight()
and the skipped-preflight branch (keep
onboardSession.markStepComplete("preflight") behavior unchanged).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/preflight.ts`:
- Around line 242-246: The YAML regex kindRe is too permissive and matches
prefixes like "nvidia.com/gpu-extra"; update kindRe to require an exact scalar
match by asserting the value boundary after the optional closing quote (e.g.
change the pattern used in src/lib/preflight.ts for kindRe so it matches
optional opening quote, "nvidia\.com\/gpu", optional closing quote, then a
lookahead for end-of-line, whitespace followed by a comment marker, or other
YAML value terminators like comma — ensure jsonRe remains for JSON but verify it
already requires the full quoted value), then add a regression test that ensures
a CDI spec named "nvidia.com/gpu-extra" does not satisfy the check (test the
function that uses kindRe to confirm the warning is not suppressed).

---

Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 9411-9416: When resuming (branch that calls
skippedStepMessage("preflight", "cached") and nim.detectGpu()), also invoke the
same CDI guard logic that preflight() runs so the CDI validation is not skipped;
either call the internal CDI-check helper used by preflight or extract the CDI
guard into a small function and call it from both preflight() and the
skipped-preflight branch (keep onboardSession.markStepComplete("preflight")
behavior unchanged).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1d025314-75e6-47ba-a337-e324dd6c078f

📥 Commits

Reviewing files that changed from the base of the PR and between 3351fbd and 709586c.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/preflight.test.ts
  • src/lib/preflight.ts

Comment thread src/lib/preflight.ts Outdated
Two findings on the original commit:

1. The YAML matcher in `hasNvidiaCdiSpec` accepted prefix matches like
   `kind: nvidia.com/gpu-extra`, so a sibling CDI spec on the host could
   silently satisfy the check and suppress the new preflight warning —
   the exact failure mode this PR is meant to catch. Anchor the pattern
   end-of-line, allow only an optional trailing comment, and require the
   *whole* scalar value to be `nvidia.com/gpu`. Drop the case-insensitive
   flag from the JSON matcher to match strictly. Regression test asserts
   `nvidia.com/gpu-extra` does not clear the spec-missing flag.

2. The `--resume` branch in `onboard()` skips `preflight()` entirely when
   the cached session marks preflight complete, so the new CDI guard
   never ran on the very path most likely to hit the failure (a user
   whose first onboard aborted at gateway-start with the CDI error and
   immediately reruns with `--resume`). Extract the guard into
   `assertCdiNvidiaGpuSpecPresent` and invoke it from both
   `preflight()` and the resume branch right after `nim.detectGpu()`.
   `assessHost()` is stateless and cheap to re-run, so the resume path
   stays fast; `markStepComplete("preflight")` semantics are unchanged.

Also add a troubleshooting docs entry under the existing "GPU passthrough
on Spark" section so users searching for the literal `unresolvable CDI
devices nvidia.com/gpu=all` error message land on the remediation, and
so the fix is discoverable for hosts that hit the same failure outside
the new preflight guard (e.g. older NemoClaw versions).

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
src/lib/preflight.test.ts (1)

633-652: ⚡ Quick win

Add a JSON counterpart for the false-positive guard test.

The nvidia.com/gpu-extra spec test only exercises the YAML parsing path. The PR description explicitly states the matcher was tightened for both YAML and JSON formats; a parallel JSON fixture (e.g., {"kind":"nvidia.com/gpu-extra"}) would close that coverage gap.

🧪 Suggested additional test case
  it("does not accept a sibling device class such as nvidia.com/gpu-extra as a satisfying spec", () => {
    // … existing YAML test …
  });

+ it("does not accept nvidia.com/gpu-extra in a JSON CDI spec either", () => {
+   const result = assessHost({
+     platform: "linux",
+     env: {},
+     release: "6.8.0-58-generic",
+     readFileImpl: (filePath: string) =>
+       filePath.endsWith("nvidia-extra.json")
+         ? '{"cdiVersion":"0.5.0","kind":"nvidia.com/gpu-extra","devices":[]}'
+         : "Linux version 6.8.0-58-generic",
+     readdirImpl: (dir: string) => (dir === "/etc/cdi" ? ["nvidia-extra.json"] : []),
+     dockerInfoOutput: JSON.stringify({
+       ServerVersion: "27.0",
+       CDISpecDirs: ["/etc/cdi"],
+     }),
+     commandExistsImpl: (name: string) => name === "docker",
+     gpuProbeImpl: () => true,
+   });
+
+   expect(result.cdiNvidiaGpuSpecMissing).toBe(true);
+ });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/preflight.test.ts` around lines 633 - 652, Add a parallel test case
that exercises the JSON parsing path for assessHost (same intent as the existing
"nvidia.com/gpu-extra" YAML test) by creating a new it(...) that calls
assessHost with readdirImpl returning a file name like "nvidia-extra.json" and
readFileImpl returning a JSON string such as
'{"kind":"nvidia.com/gpu-extra","devices":[]}', keeping the same
dockerInfoOutput CDISpecDirs, commandExistsImpl, and gpuProbeImpl behavior;
assert result.cdiNvidiaGpuSpecMissing is true to ensure the matcher rejects the
sibling device class in JSON as well.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/reference/troubleshooting.md`:
- Line 930: Replace the third-party GitHub link on the line mentioning
"Container Device Interface (CDI)" that OpenShell's `gateway start --gpu`
auto-selects: either remove the hyperlink so the text reads plain "Container
Device Interface (CDI)" or swap the link for an approved official documentation
source (e.g., the project's official docs or NVIDIA's official guide) per the
markdown policy; update the sentence to use plain text or the approved URL and
ensure the command `gateway start --gpu` remains unchanged.

In `@src/lib/onboard.ts`:
- Around line 9423-9432: The resume CDI guard currently only checks opts.noGpu
=== true and thus ignores any previously persisted "no-GPU" intent; change the
guard to compute an effective no-gpu flag that honors persisted intent (e.g.
persisted gpuPassthrough === false) as well as the current CLI flag, then pass
that effective flag into assertCdiNvidiaGpuSpecPresent(assessHost(),
effectiveNoGpu). Locate the surrounding code near assertCdiNvidiaGpuSpecPresent
and onboardSession (e.g. where onboardSession.markStepComplete("preflight") is
used) and read the persisted intent from the session/state (for example
onboardSession.persistedOptions.gpuPassthrough or a getter like
onboardSession.getPersistedOptions()) to derive effectiveNoGpu = (opts.noGpu ===
true) || (persistedOptions.gpuPassthrough === false) and use that value in the
assertion.

In `@src/lib/preflight.test.ts`:
- Around line 840-877: Update the test for the "generate_nvidia_cdi_spec" action
to assert all three emitted commands instead of only commands[0]; locate the
action lookup for id "generate_nvidia_cdi_spec" in src/lib/preflight.test.ts and
add expectations that action.commands[1] equals the verifier command (e.g.,
"sudo nvidia-ctk cdi list") and action.commands[2] equals the remediation retry
guidance (e.g., "sudo nemoclaw onboard" or the exact onboarding command used),
mirroring the pattern used in the docker_group_permission assertions so the full
remediation sequence is validated.

---

Nitpick comments:
In `@src/lib/preflight.test.ts`:
- Around line 633-652: Add a parallel test case that exercises the JSON parsing
path for assessHost (same intent as the existing "nvidia.com/gpu-extra" YAML
test) by creating a new it(...) that calls assessHost with readdirImpl returning
a file name like "nvidia-extra.json" and readFileImpl returning a JSON string
such as '{"kind":"nvidia.com/gpu-extra","devices":[]}', keeping the same
dockerInfoOutput CDISpecDirs, commandExistsImpl, and gpuProbeImpl behavior;
assert result.cdiNvidiaGpuSpecMissing is true to ensure the matcher rejects the
sibling device class in JSON as well.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4272118b-6aa7-4c3d-aa7f-73464b7f0bd0

📥 Commits

Reviewing files that changed from the base of the PR and between 709586c and 576f296.

📒 Files selected for processing (4)
  • docs/reference/troubleshooting.md
  • src/lib/onboard.ts
  • src/lib/preflight.test.ts
  • src/lib/preflight.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/preflight.ts

Comment thread docs/reference/troubleshooting.md Outdated
Comment thread src/lib/onboard.ts
Comment thread src/lib/preflight.test.ts
Three follow-up findings on the previous fix-up commit:

1. Drop the `https://github.com/cncf-tags/container-device-interface`
   link from the new troubleshooting docs entry. The repo's markdown
   policy (CLAUDE.md "No External Project Links") forbids third-party
   GitHub links even for CNCF projects.

2. The resume CDI guard in `onboard()` only honoured `opts.noGpu === true`,
   which would falsely block runs that previously persisted
   `gpuPassthrough: false` on the session and were resumed without
   re-passing `--no-gpu`. Mirror the same persisted-intent resolution
   that the regular `gpuPassthrough` block performs a few lines below
   (current invocation flag wins; otherwise honour persisted session
   intent).

3. The unit test for the new `generate_nvidia_cdi_spec` action only
   asserted `commands[0]`. Tighten it to cover all three remediation
   commands, matching the rigour of the existing
   `docker_group_permission` assertions and pinning the entire
   user-facing remediation sequence against future drift.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
docs/reference/troubleshooting.md (1)

931-932: ⚡ Quick win

Rewrite to active voice and apply inline code formatting.

Line 931 uses passive voice: "has been generated."
Rewrite actively, for example: "If the host does not have a nvidia.com/gpu CDI spec yet, gateway start fails with..."

Line 932: CDISpecDirs is a Docker configuration field name and must use inline code formatting per the formatting rules.

As per coding guidelines, "Active voice required. Flag passive constructions" and "CLI commands, file paths, flags, parameter names, and values must use inline code formatting."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/reference/troubleshooting.md` around lines 931 - 932, Rewrite the two
sentences to use active voice and inline code formatting: change "has been
generated" to an active phrasing such as "the host does not have a
`nvidia.com/gpu` CDI spec yet" and ensure `CDISpecDirs` is wrapped in backticks
(e.g., `CDISpecDirs`); keep the rest of the meaning intact and preserve existing
references like `nvidia.com/gpu` and the `nemoclaw onboard` command.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@docs/reference/troubleshooting.md`:
- Around line 931-932: Rewrite the two sentences to use active voice and inline
code formatting: change "has been generated" to an active phrasing such as "the
host does not have a `nvidia.com/gpu` CDI spec yet" and ensure `CDISpecDirs` is
wrapped in backticks (e.g., `CDISpecDirs`); keep the rest of the meaning intact
and preserve existing references like `nvidia.com/gpu` and the `nemoclaw
onboard` command.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2fdc8b44-665b-40ab-a46d-dd10fe387494

📥 Commits

Reviewing files that changed from the base of the PR and between 576f296 and bc89d8b.

📒 Files selected for processing (3)
  • docs/reference/troubleshooting.md
  • src/lib/onboard.ts
  • src/lib/preflight.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/preflight.test.ts

Address the CodeRabbit nitpick on #3159: the YAML-side regression test
exists, but the JSON-side regex change (dropped case-insensitive flag)
had no test of its own. Add a parallel case asserting that a CDI spec
declaring `"kind":"nvidia.com/gpu-extra"` does not satisfy the
nvidia.com/gpu check, so a future tweak that re-loosens the JSON
matcher will fail loud.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/lib/preflight.test.ts (1)

503-520: ⚡ Quick win

Add a malformed-JSON regression for parseDockerCdiSpecDirs.

Current coverage checks empty input and missing/empty CDISpecDirs, but not malformed non-empty input. Add one case to lock in graceful fallback ([]) on invalid JSON.

Suggested test addition
 describe("parseDockerCdiSpecDirs", () => {
@@
   it("returns an empty array on empty input", () => {
     expect(parseDockerCdiSpecDirs("")).toEqual([]);
   });
+
+  it("returns an empty array on malformed JSON input", () => {
+    expect(parseDockerCdiSpecDirs("not-json")).toEqual([]);
+  });
 });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/preflight.test.ts` around lines 503 - 520, Add a unit test to the
existing parseDockerCdiSpecDirs suite to ensure malformed JSON returns an empty
array: add an it block (e.g., "returns an empty array on malformed JSON") that
calls parseDockerCdiSpecDirs with invalid JSON (for example "{" or "not a json")
and asserts it equals []; this locks in the graceful fallback behavior for
parseDockerCdiSpecDirs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/preflight.test.ts`:
- Around line 503-520: Add a unit test to the existing parseDockerCdiSpecDirs
suite to ensure malformed JSON returns an empty array: add an it block (e.g.,
"returns an empty array on malformed JSON") that calls parseDockerCdiSpecDirs
with invalid JSON (for example "{" or "not a json") and asserts it equals [];
this locks in the graceful fallback behavior for parseDockerCdiSpecDirs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 21f00f54-2240-417a-af82-a6df67c38c01

📥 Commits

Reviewing files that changed from the base of the PR and between bc89d8b and bd7fd9c.

📒 Files selected for processing (1)
  • src/lib/preflight.test.ts

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

9423-9437: Recommend running the onboarding-focused E2E suite for this change.

Given this touches core src/lib/onboard.ts preflight/resume control flow, run the suggested E2E jobs (especially cloud-e2e and sandbox-operations-e2e) before merge.

As per coding guidelines: src/lib/onboard.ts: “This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow.” plus the listed E2E test recommendation set.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 9423 - 9437, This change touches core
onboarding control flow in src/lib/onboard.ts (notably the
resumeOptedOutGpuPassthrough logic, assessHost(),
assertCdiNvidiaGpuSpecPresent(), and preflight() call), so before merging run
the onboarding-focused E2E suite including the cloud-e2e and
sandbox-operations-e2e jobs (and any sandbox onboarding E2E pipelines) to
validate resume vs fresh preflight behavior and GPU passthrough handling; if any
E2E failures surface, iterate on the resumeOptedOutGpuPassthrough/preflight
handling and re-run the same suites until they pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 9423-9437: This change touches core onboarding control flow in
src/lib/onboard.ts (notably the resumeOptedOutGpuPassthrough logic,
assessHost(), assertCdiNvidiaGpuSpecPresent(), and preflight() call), so before
merging run the onboarding-focused E2E suite including the cloud-e2e and
sandbox-operations-e2e jobs (and any sandbox onboarding E2E pipelines) to
validate resume vs fresh preflight behavior and GPU passthrough handling; if any
E2E failures surface, iterate on the resumeOptedOutGpuPassthrough/preflight
handling and re-run the same suites until they pass.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 43fc0a8b-2ccb-4f37-ac63-793646083691

📥 Commits

Reviewing files that changed from the base of the PR and between bd7fd9c and a6e87db.

📒 Files selected for processing (1)
  • src/lib/onboard.ts

Copy link
Copy Markdown
Contributor

@jyaunches jyaunches left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving per review in this session — the architectural concerns are tracked in #3213. Test coverage is thorough, the resume-path re-check is correct, CI is green on an up-to-date branch.

@jyaunches jyaunches merged commit a4a9eb0 into main May 7, 2026
21 checks passed
jyaunches pushed a commit that referenced this pull request May 8, 2026
## Summary
- Bump the docs release metadata to `0.0.37`.
- Document release-prep updates for messaging policy presets, sandbox
runtime utilities, and the GPU CDI troubleshooting path.
- Refresh generated `nemoclaw-user-*` skills from the updated docs.

## Source summary
- #3159 -> `docs/reference/troubleshooting.md`: Documents the GPU CDI
preflight warning and remediation for `nvidia.com/gpu=all` gateway start
failures.
- #2415 -> `docs/reference/network-policies.md`,
`docs/manage-sandboxes/messaging-channels.md`,
`docs/network-policy/customize-network-policy.md`: Clarifies that
Telegram, Discord, and Slack egress comes from opt-in messaging presets,
not the baseline policy.
- #3091 -> `docs/deployment/sandbox-hardening.md`,
`docs/network-policy/customize-network-policy.md`: Documents the
retained sandbox utilities `vi`, `jq`, and `dos2unix` while keeping
host-side policy files as the durable source of truth.

## Test plan
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user`
- `make docs`
- `npm run build:cli`
- `npm run typecheck:cli`
- Commit and pre-push hooks: markdownlint, docs-to-skills verification,
gitleaks, commitlint, CLI typecheck

## Skipped
- #3193 and #3191 matched `docs/.docs-skip` entries for experimental
shields/config paths.
- #3200 and #3183 were test-only fixes.
- #3189 and #3163 were internal documentation/refactor changes with no
public docs impact.

Made with [Cursor](https://cursor.com)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Clarified which utilities remain in the sandbox runtime for
lightweight inspection and cleanup
* Noted that messaging endpoints (Discord, Slack, Telegram) are not in
the baseline policy and that channel presets are applied during
onboarding
  * Added GPU passthrough troubleshooting for gateway startup
* Updated release/version bump and release-prep workflow guidance,
including Discord preset description updates
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

v0.0.37 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Spark and DGX station][Install] nemoclaw curl install fails during openclaw gateway start

3 participants