fix(preflight): warn before gateway start when nvidia.com/gpu CDI spec is missing#3159
Conversation
…c is missing Recent nvidia-container-toolkit releases configure the Docker daemon to advertise CDI spec directories (`"CDISpecDirs": ["/etc/cdi", "/var/run/cdi"]` in `docker info`) and OpenShell's `gateway start --gpu` then auto-selects CDI mode. On hosts where `nvidia-ctk cdi generate` has not been run yet — notably DGX Spark / DGX Station with default Ubuntu 24.04 onboarding — no `nvidia.com/gpu` CDI spec exists, and the gateway start fails minutes into onboarding with: Docker responded with status code 500: CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all Detect the gap during preflight and surface the `sudo nvidia-ctk cdi generate` remediation immediately, before the gateway image download and the 1-minute health-check window. The check fires only on Linux + NVIDIA host + non-empty CDISpecDirs + no spec containing `kind: nvidia.com/gpu` under those directories. --no-gpu opts out of GPU passthrough entirely and therefore also opts out of this block, so the legacy runtime path is unaffected. parseDockerCdiSpecDirs() and the new YAML/JSON spec scanner are both injectable for tests; assessHost accepts an optional readdirImpl that mirrors the existing readFileImpl injection pattern. Tests cover: missing spec, present YAML/JSON spec, non-NVIDIA host, absent CDISpecDirs, macOS scoping, comment-only false-positive guard, and the planHostRemediation action shape. Closes #3152 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds Docker CDI spec-dir parsing and NVIDIA CDI-spec scanning; records ChangesCDI GPU Spec Detection and Preflight Early Exit
Sequence DiagramsequenceDiagram
actor User
participant onboard as onboard()
participant preflight as preflight()
participant docker as DockerDaemon
participant fs as Filesystem
participant exit as EarlyExit
User->>onboard: onboard(opts)
onboard->>preflight: preflight({optedOutGpuPassthrough})
rect rgba(100,150,200,0.5)
Note over preflight,fs: CDI Spec Detection
preflight->>docker: docker info --format {{json .}}
docker-->>preflight: JSON with CDISpecDirs
preflight->>preflight: parseDockerCdiSpecDirs()
alt CDI dirs configured
preflight->>fs: readdir(CDISpecDir)
fs-->>preflight: [spec files]
preflight->>fs: read nvidia.yaml/nvidia.json
fs-->>preflight: spec content
preflight->>preflight: hasNvidiaCdiSpec()
end
end
alt NVIDIA GPU spec missing AND CDI dirs present AND GPU not opted out
preflight->>exit: Exit with error
exit-->>User: "CDI specs not found on host"
else GPU passthrough opted out OR specs present
preflight-->>onboard: Continue onboarding
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)
9411-9416:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRe-run the CDI guard when preflight is skipped on
--resume.Line 9411 skips
preflight()whenever preflight is already marked complete, so the new CDI check added at Line 3268 never runs on resume. In that case onboarding can still fail later atgateway start --gpuwith the same CDI error this PR is meant to catch earlier.Proposed fix
if (resumePreflight) { skippedStepMessage("preflight", "cached"); gpu = nim.detectGpu(); + const host = assessHost(); + if (host.cdiNvidiaGpuSpecMissing && opts.noGpu !== true) { + console.error( + " Docker is configured for CDI device injection (CDISpecDirs is set), but no", + ); + console.error( + " nvidia.com/gpu CDI spec was found on the host. OpenShell's gateway start will", + ); + console.error( + " fail with `unresolvable CDI devices nvidia.com/gpu=all` (issue `#3152`).", + ); + printRemediationActions(planHostRemediation(host)); + process.exit(1); + } } else {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 9411 - 9416, When resuming (branch that calls skippedStepMessage("preflight", "cached") and nim.detectGpu()), also invoke the same CDI guard logic that preflight() runs so the CDI validation is not skipped; either call the internal CDI-check helper used by preflight or extract the CDI guard into a small function and call it from both preflight() and the skipped-preflight branch (keep onboardSession.markStepComplete("preflight") behavior unchanged).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/preflight.ts`:
- Around line 242-246: The YAML regex kindRe is too permissive and matches
prefixes like "nvidia.com/gpu-extra"; update kindRe to require an exact scalar
match by asserting the value boundary after the optional closing quote (e.g.
change the pattern used in src/lib/preflight.ts for kindRe so it matches
optional opening quote, "nvidia\.com\/gpu", optional closing quote, then a
lookahead for end-of-line, whitespace followed by a comment marker, or other
YAML value terminators like comma — ensure jsonRe remains for JSON but verify it
already requires the full quoted value), then add a regression test that ensures
a CDI spec named "nvidia.com/gpu-extra" does not satisfy the check (test the
function that uses kindRe to confirm the warning is not suppressed).
---
Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 9411-9416: When resuming (branch that calls
skippedStepMessage("preflight", "cached") and nim.detectGpu()), also invoke the
same CDI guard logic that preflight() runs so the CDI validation is not skipped;
either call the internal CDI-check helper used by preflight or extract the CDI
guard into a small function and call it from both preflight() and the
skipped-preflight branch (keep onboardSession.markStepComplete("preflight")
behavior unchanged).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1d025314-75e6-47ba-a337-e324dd6c078f
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/preflight.test.tssrc/lib/preflight.ts
Two findings on the original commit:
1. The YAML matcher in `hasNvidiaCdiSpec` accepted prefix matches like
`kind: nvidia.com/gpu-extra`, so a sibling CDI spec on the host could
silently satisfy the check and suppress the new preflight warning —
the exact failure mode this PR is meant to catch. Anchor the pattern
end-of-line, allow only an optional trailing comment, and require the
*whole* scalar value to be `nvidia.com/gpu`. Drop the case-insensitive
flag from the JSON matcher to match strictly. Regression test asserts
`nvidia.com/gpu-extra` does not clear the spec-missing flag.
2. The `--resume` branch in `onboard()` skips `preflight()` entirely when
the cached session marks preflight complete, so the new CDI guard
never ran on the very path most likely to hit the failure (a user
whose first onboard aborted at gateway-start with the CDI error and
immediately reruns with `--resume`). Extract the guard into
`assertCdiNvidiaGpuSpecPresent` and invoke it from both
`preflight()` and the resume branch right after `nim.detectGpu()`.
`assessHost()` is stateless and cheap to re-run, so the resume path
stays fast; `markStepComplete("preflight")` semantics are unchanged.
Also add a troubleshooting docs entry under the existing "GPU passthrough
on Spark" section so users searching for the literal `unresolvable CDI
devices nvidia.com/gpu=all` error message land on the remediation, and
so the fix is discoverable for hosts that hit the same failure outside
the new preflight guard (e.g. older NemoClaw versions).
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/lib/preflight.test.ts (1)
633-652: ⚡ Quick winAdd a JSON counterpart for the false-positive guard test.
The
nvidia.com/gpu-extraspec test only exercises the YAML parsing path. The PR description explicitly states the matcher was tightened for both YAML and JSON formats; a parallel JSON fixture (e.g.,{"kind":"nvidia.com/gpu-extra"}) would close that coverage gap.🧪 Suggested additional test case
it("does not accept a sibling device class such as nvidia.com/gpu-extra as a satisfying spec", () => { // … existing YAML test … }); + it("does not accept nvidia.com/gpu-extra in a JSON CDI spec either", () => { + const result = assessHost({ + platform: "linux", + env: {}, + release: "6.8.0-58-generic", + readFileImpl: (filePath: string) => + filePath.endsWith("nvidia-extra.json") + ? '{"cdiVersion":"0.5.0","kind":"nvidia.com/gpu-extra","devices":[]}' + : "Linux version 6.8.0-58-generic", + readdirImpl: (dir: string) => (dir === "/etc/cdi" ? ["nvidia-extra.json"] : []), + dockerInfoOutput: JSON.stringify({ + ServerVersion: "27.0", + CDISpecDirs: ["/etc/cdi"], + }), + commandExistsImpl: (name: string) => name === "docker", + gpuProbeImpl: () => true, + }); + + expect(result.cdiNvidiaGpuSpecMissing).toBe(true); + });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/preflight.test.ts` around lines 633 - 652, Add a parallel test case that exercises the JSON parsing path for assessHost (same intent as the existing "nvidia.com/gpu-extra" YAML test) by creating a new it(...) that calls assessHost with readdirImpl returning a file name like "nvidia-extra.json" and readFileImpl returning a JSON string such as '{"kind":"nvidia.com/gpu-extra","devices":[]}', keeping the same dockerInfoOutput CDISpecDirs, commandExistsImpl, and gpuProbeImpl behavior; assert result.cdiNvidiaGpuSpecMissing is true to ensure the matcher rejects the sibling device class in JSON as well.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/reference/troubleshooting.md`:
- Line 930: Replace the third-party GitHub link on the line mentioning
"Container Device Interface (CDI)" that OpenShell's `gateway start --gpu`
auto-selects: either remove the hyperlink so the text reads plain "Container
Device Interface (CDI)" or swap the link for an approved official documentation
source (e.g., the project's official docs or NVIDIA's official guide) per the
markdown policy; update the sentence to use plain text or the approved URL and
ensure the command `gateway start --gpu` remains unchanged.
In `@src/lib/onboard.ts`:
- Around line 9423-9432: The resume CDI guard currently only checks opts.noGpu
=== true and thus ignores any previously persisted "no-GPU" intent; change the
guard to compute an effective no-gpu flag that honors persisted intent (e.g.
persisted gpuPassthrough === false) as well as the current CLI flag, then pass
that effective flag into assertCdiNvidiaGpuSpecPresent(assessHost(),
effectiveNoGpu). Locate the surrounding code near assertCdiNvidiaGpuSpecPresent
and onboardSession (e.g. where onboardSession.markStepComplete("preflight") is
used) and read the persisted intent from the session/state (for example
onboardSession.persistedOptions.gpuPassthrough or a getter like
onboardSession.getPersistedOptions()) to derive effectiveNoGpu = (opts.noGpu ===
true) || (persistedOptions.gpuPassthrough === false) and use that value in the
assertion.
In `@src/lib/preflight.test.ts`:
- Around line 840-877: Update the test for the "generate_nvidia_cdi_spec" action
to assert all three emitted commands instead of only commands[0]; locate the
action lookup for id "generate_nvidia_cdi_spec" in src/lib/preflight.test.ts and
add expectations that action.commands[1] equals the verifier command (e.g.,
"sudo nvidia-ctk cdi list") and action.commands[2] equals the remediation retry
guidance (e.g., "sudo nemoclaw onboard" or the exact onboarding command used),
mirroring the pattern used in the docker_group_permission assertions so the full
remediation sequence is validated.
---
Nitpick comments:
In `@src/lib/preflight.test.ts`:
- Around line 633-652: Add a parallel test case that exercises the JSON parsing
path for assessHost (same intent as the existing "nvidia.com/gpu-extra" YAML
test) by creating a new it(...) that calls assessHost with readdirImpl returning
a file name like "nvidia-extra.json" and readFileImpl returning a JSON string
such as '{"kind":"nvidia.com/gpu-extra","devices":[]}', keeping the same
dockerInfoOutput CDISpecDirs, commandExistsImpl, and gpuProbeImpl behavior;
assert result.cdiNvidiaGpuSpecMissing is true to ensure the matcher rejects the
sibling device class in JSON as well.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4272118b-6aa7-4c3d-aa7f-73464b7f0bd0
📒 Files selected for processing (4)
docs/reference/troubleshooting.mdsrc/lib/onboard.tssrc/lib/preflight.test.tssrc/lib/preflight.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/preflight.ts
Three follow-up findings on the previous fix-up commit: 1. Drop the `https://github.com/cncf-tags/container-device-interface` link from the new troubleshooting docs entry. The repo's markdown policy (CLAUDE.md "No External Project Links") forbids third-party GitHub links even for CNCF projects. 2. The resume CDI guard in `onboard()` only honoured `opts.noGpu === true`, which would falsely block runs that previously persisted `gpuPassthrough: false` on the session and were resumed without re-passing `--no-gpu`. Mirror the same persisted-intent resolution that the regular `gpuPassthrough` block performs a few lines below (current invocation flag wins; otherwise honour persisted session intent). 3. The unit test for the new `generate_nvidia_cdi_spec` action only asserted `commands[0]`. Tighten it to cover all three remediation commands, matching the rigour of the existing `docker_group_permission` assertions and pinning the entire user-facing remediation sequence against future drift. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
docs/reference/troubleshooting.md (1)
931-932: ⚡ Quick winRewrite to active voice and apply inline code formatting.
Line 931 uses passive voice: "has been generated."
Rewrite actively, for example: "If the host does not have anvidia.com/gpuCDI spec yet, gateway start fails with..."Line 932:
CDISpecDirsis a Docker configuration field name and must use inline code formatting per the formatting rules.As per coding guidelines, "Active voice required. Flag passive constructions" and "CLI commands, file paths, flags, parameter names, and values must use inline
codeformatting."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/reference/troubleshooting.md` around lines 931 - 932, Rewrite the two sentences to use active voice and inline code formatting: change "has been generated" to an active phrasing such as "the host does not have a `nvidia.com/gpu` CDI spec yet" and ensure `CDISpecDirs` is wrapped in backticks (e.g., `CDISpecDirs`); keep the rest of the meaning intact and preserve existing references like `nvidia.com/gpu` and the `nemoclaw onboard` command.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@docs/reference/troubleshooting.md`:
- Around line 931-932: Rewrite the two sentences to use active voice and inline
code formatting: change "has been generated" to an active phrasing such as "the
host does not have a `nvidia.com/gpu` CDI spec yet" and ensure `CDISpecDirs` is
wrapped in backticks (e.g., `CDISpecDirs`); keep the rest of the meaning intact
and preserve existing references like `nvidia.com/gpu` and the `nemoclaw
onboard` command.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2fdc8b44-665b-40ab-a46d-dd10fe387494
📒 Files selected for processing (3)
docs/reference/troubleshooting.mdsrc/lib/onboard.tssrc/lib/preflight.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/preflight.test.ts
Address the CodeRabbit nitpick on #3159: the YAML-side regression test exists, but the JSON-side regex change (dropped case-insensitive flag) had no test of its own. Add a parallel case asserting that a CDI spec declaring `"kind":"nvidia.com/gpu-extra"` does not satisfy the nvidia.com/gpu check, so a future tweak that re-loosens the JSON matcher will fail loud. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/preflight.test.ts (1)
503-520: ⚡ Quick winAdd a malformed-JSON regression for
parseDockerCdiSpecDirs.Current coverage checks empty input and missing/empty
CDISpecDirs, but not malformed non-empty input. Add one case to lock in graceful fallback ([]) on invalid JSON.Suggested test addition
describe("parseDockerCdiSpecDirs", () => { @@ it("returns an empty array on empty input", () => { expect(parseDockerCdiSpecDirs("")).toEqual([]); }); + + it("returns an empty array on malformed JSON input", () => { + expect(parseDockerCdiSpecDirs("not-json")).toEqual([]); + }); });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/preflight.test.ts` around lines 503 - 520, Add a unit test to the existing parseDockerCdiSpecDirs suite to ensure malformed JSON returns an empty array: add an it block (e.g., "returns an empty array on malformed JSON") that calls parseDockerCdiSpecDirs with invalid JSON (for example "{" or "not a json") and asserts it equals []; this locks in the graceful fallback behavior for parseDockerCdiSpecDirs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/preflight.test.ts`:
- Around line 503-520: Add a unit test to the existing parseDockerCdiSpecDirs
suite to ensure malformed JSON returns an empty array: add an it block (e.g.,
"returns an empty array on malformed JSON") that calls parseDockerCdiSpecDirs
with invalid JSON (for example "{" or "not a json") and asserts it equals [];
this locks in the graceful fallback behavior for parseDockerCdiSpecDirs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 21f00f54-2240-417a-af82-a6df67c38c01
📒 Files selected for processing (1)
src/lib/preflight.test.ts
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
9423-9437: Recommend running the onboarding-focused E2E suite for this change.Given this touches core
src/lib/onboard.tspreflight/resume control flow, run the suggested E2E jobs (especiallycloud-e2eandsandbox-operations-e2e) before merge.As per coding guidelines:
src/lib/onboard.ts: “This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow.” plus the listed E2E test recommendation set.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 9423 - 9437, This change touches core onboarding control flow in src/lib/onboard.ts (notably the resumeOptedOutGpuPassthrough logic, assessHost(), assertCdiNvidiaGpuSpecPresent(), and preflight() call), so before merging run the onboarding-focused E2E suite including the cloud-e2e and sandbox-operations-e2e jobs (and any sandbox onboarding E2E pipelines) to validate resume vs fresh preflight behavior and GPU passthrough handling; if any E2E failures surface, iterate on the resumeOptedOutGpuPassthrough/preflight handling and re-run the same suites until they pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 9423-9437: This change touches core onboarding control flow in
src/lib/onboard.ts (notably the resumeOptedOutGpuPassthrough logic,
assessHost(), assertCdiNvidiaGpuSpecPresent(), and preflight() call), so before
merging run the onboarding-focused E2E suite including the cloud-e2e and
sandbox-operations-e2e jobs (and any sandbox onboarding E2E pipelines) to
validate resume vs fresh preflight behavior and GPU passthrough handling; if any
E2E failures surface, iterate on the resumeOptedOutGpuPassthrough/preflight
handling and re-run the same suites until they pass.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 43fc0a8b-2ccb-4f37-ac63-793646083691
📒 Files selected for processing (1)
src/lib/onboard.ts
## Summary - Bump the docs release metadata to `0.0.37`. - Document release-prep updates for messaging policy presets, sandbox runtime utilities, and the GPU CDI troubleshooting path. - Refresh generated `nemoclaw-user-*` skills from the updated docs. ## Source summary - #3159 -> `docs/reference/troubleshooting.md`: Documents the GPU CDI preflight warning and remediation for `nvidia.com/gpu=all` gateway start failures. - #2415 -> `docs/reference/network-policies.md`, `docs/manage-sandboxes/messaging-channels.md`, `docs/network-policy/customize-network-policy.md`: Clarifies that Telegram, Discord, and Slack egress comes from opt-in messaging presets, not the baseline policy. - #3091 -> `docs/deployment/sandbox-hardening.md`, `docs/network-policy/customize-network-policy.md`: Documents the retained sandbox utilities `vi`, `jq`, and `dos2unix` while keeping host-side policy files as the durable source of truth. ## Test plan - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user` - `make docs` - `npm run build:cli` - `npm run typecheck:cli` - Commit and pre-push hooks: markdownlint, docs-to-skills verification, gitleaks, commitlint, CLI typecheck ## Skipped - #3193 and #3191 matched `docs/.docs-skip` entries for experimental shields/config paths. - #3200 and #3183 were test-only fixes. - #3189 and #3163 were internal documentation/refactor changes with no public docs impact. Made with [Cursor](https://cursor.com) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Clarified which utilities remain in the sandbox runtime for lightweight inspection and cleanup * Noted that messaging endpoints (Discord, Slack, Telegram) are not in the baseline policy and that channel presets are applied during onboarding * Added GPU passthrough troubleshooting for gateway startup * Updated release/version bump and release-prep workflow guidance, including Discord preset description updates <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
On NVIDIA Linux hosts with
nvidia-container-toolkitinstalled (e.g. DGX Spark / DGX Station with default Ubuntu 24.04 onboarding), Docker advertisesCDISpecDirsindocker infoand OpenShell'sgateway start --gpuauto-selects CDI mode — which then fails minutes into onboarding when nonvidia.com/gpuCDI spec has been generated. Surface thesudo nvidia-ctk cdi generateremediation in preflight, before the gateway image download and health-wait window.Related Issue
Closes #3152
Changes
parseDockerCdiSpecDirs(info)parses theCDISpecDirsarray fromdocker info --format '{{json .}}'output.hasNvidiaCdiSpec(specDirs, …)scans each configured spec directory for any YAML or JSON file whosekindis exactlynvidia.com/gpu. Prefix matches likenvidia.com/gpu-extra, comment-only mentions, and unrelated specs do not register.assessHost()returns the newdockerCdiSpecDirsandcdiNvidiaGpuSpecMissingfields. The check fires only when platform islinux, an NVIDIA GPU is present,CDISpecDirsis non-empty, and no matching spec is found.planHostRemediation()emits a blockinggenerate_nvidia_cdi_specaction with thesudo nvidia-ctk cdi generate --output=<dir>/nvidia.yamlcommand sequence.assertCdiNvidiaGpuSpecPresent(host, optedOutGpuPassthrough)is invoked from bothpreflight()and the--resumebranch inonboard()so a user who first hit the failure does not walk into the same CDI error on retry.assessHost()accepts an optionalreaddirImplinjection that mirrors the existingreadFileImpl/gpuProbeImpltest seams.docs/reference/troubleshooting.mdadds an entry next to "GPU passthrough on Spark" so users searching for the literalunresolvable CDI devices nvidia.com/gpu=allerror find the remediation, including outside the preflight guard (older NemoClaw versions, manualopenshell gateway start --gpuinvocations).Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
New Features
Tests
Documentation