fix(install): auto-generate missing NVIDIA CDI device spec during installer preflight#3428

Merged
cv merged 5 commits into main from fix/cdi-autorepair
May 14, 2026
Conversation

@zyang-dev zyang-dev commented May 12, 2026

Summary

Auto-repair missing NVIDIA CDI device specs during the standard NemoClaw installer flow so fresh DGX Spark installs do not stop before onboarding. Direct nemoclaw onboard still reports the manual remediation if users bypass the installer.

Related Issue

Fixes #3252

Changes

  • Detect missing nvidia.com/gpu CDI specs during installer host preflight.
  • Generate the missing NVIDIA CDI spec with nvidia-ctk cdi generate when available.
  • Suppress noisy successful nvidia-ctk output while preserving failure diagnostics.
  • Keep direct onboard behavior as manual remediation.
  • Update troubleshooting docs and installer regression coverage.
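The "suppress noisy output, keep failure diagnostics" item can be illustrated with a small pure function. This is only a sketch under stated assumptions: `CommandResult` and `describeResult` are hypothetical names, and the installer itself implements this behavior in `scripts/install.sh`, not in TypeScript.

```typescript
// Hypothetical shape of a finished command; not taken from the PR.
interface CommandResult {
  status: number; // process exit code
  stdout: string;
  stderr: string;
}

// Return the lines an installer would print for one command.
// Success yields a single short line; failure yields every captured line.
function describeResult(label: string, result: CommandResult): string[] {
  if (result.status === 0) {
    return [`${label}: ok`];
  }
  const captured = (result.stdout + "\n" + result.stderr)
    .split("\n")
    .map((line) => line.trimEnd())
    .filter((line) => line.length > 0);
  return [`${label}: failed with exit code ${result.status}`, ...captured];
}
```

Capturing output first and deciding what to print afterwards means a successful `nvidia-ctk cdi generate` run stays quiet, while a failed run still shows everything the tool wrote.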

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>

Summary by CodeRabbit

  • Documentation

    • Clarified troubleshooting for unresolvable NVIDIA CDI GPU devices and updated remediation console steps to ensure the CDI directory is created before generating and listing the GPU CDI spec.
  • Improvements

    • Installer preflight now detects missing NVIDIA GPU CDI specs, attempts automatic repair (including enabling a CDI refresh service), creates required directories, and reliably generates/verifies the GPU CDI spec so onboarding can proceed.
  • Tests

    • Added runtime tests covering both refresh-service repair and fallback direct CDI spec generation.

…taller preflight

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@zyang-dev zyang-dev self-assigned this May 12, 2026
coderabbitai Bot commented May 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4c0cacd0-bad7-4a13-8c5b-adae0aca13dc

📥 Commits

Reviewing files that changed from the base of the PR and between b2655cf and 3b8181f.

📒 Files selected for processing (1)
  • docs/reference/troubleshooting.md
✅ Files skipped from review due to trivial changes (1)
  • docs/reference/troubleshooting.md

📝 Walkthrough

Installer preflight now auto-detects missing NVIDIA CDI device specs, computes the expected spec path, and ensures the target directory exists. It first attempts a systemd refresh unit and, if that does not produce a spec, runs nvidia-ctk cdi generate --output=<specPath>. It then verifies that nvidia.com/gpu is listed and proceeds to onboarding. Tests and docs are updated to match.

Changes

CDI Device Spec Auto-Generation

| Layer | File(s) | Summary |
| --- | --- | --- |
| CDI spec path computation | src/lib/onboard/preflight.ts, src/lib/onboard/preflight.test.ts | Added getNvidiaCdiSpecPath() which normalizes the first configured CDI dir (default /etc/cdi) and returns the deterministic nvidia.yaml path; unit test added. |
| Remediation planning integration | src/lib/onboard/preflight.ts, src/lib/onboard/preflight.test.ts | planHostRemediation() now uses getNvidiaCdiSpecPath() and emits sudo mkdir -p <specDir> then sudo nvidia-ctk cdi generate --output=<specPath> with subsequent verification via nvidia-ctk cdi list; tests updated to expect the mkdir step and adjusted command order. |
| Installer CDI repair function & wiring | scripts/install.sh | Added repair_installer_nvidia_cdi_spec(preflight_module) which calls the preflight assess, computes/normalizes the target spec path, attempts a systemctl enable --now nvidia-cdi-refresh* refresh path, falls back to sudo nvidia-ctk cdi generate --output=<specPath> if needed, verifies via nvidia-ctk cdi list, and is invoked from run_installer_host_preflight() before the Node assessment/plan flow. |
| Integration test coverage | test/install-preflight.test.ts | Added runNvidiaCdiInstallerRepairTest helper and two Vitest cases that stub preflight.js, sudo, systemctl, nvidia-ctk, and id to validate both refresh-success and refresh-failure flows, expected messaging, sudo -v usage, and generated-spec output path. |
| Troubleshooting documentation | docs/reference/troubleshooting.md | Updated DGX Spark troubleshooting to document the refresh-unit remediation sequence, the direct-generation fallback, and the need to sudo mkdir -p /etc/cdi before nvidia-ctk cdi generate when following manual steps. |
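To make the path computation and the planned command order concrete, here is a minimal sketch. It is not the PR's code: `nvidiaCdiSpecPath` and `planManualCdiRemediation` are hypothetical stand-ins for `getNvidiaCdiSpecPath()` and the CDI part of `planHostRemediation()`, and the exact normalization rules (trimming whitespace, skipping empty entries, stripping trailing slashes) are assumptions.

```typescript
const DEFAULT_CDI_DIR = "/etc/cdi";

// Pick the first non-empty configured directory, fall back to /etc/cdi,
// and strip trailing slashes so the joined path has exactly one separator.
function nvidiaCdiSpecPath(cdiSpecDirs: string[] = []): string {
  const first = cdiSpecDirs
    .map((dir) => dir.trim())
    .find((dir) => dir.length > 0);
  const dir = (first ?? DEFAULT_CDI_DIR).replace(/\/+$/, "") || "/";
  return dir === "/" ? "/nvidia.yaml" : `${dir}/nvidia.yaml`;
}

// Build the manual remediation steps in the order the walkthrough describes:
// create the directory, generate the spec, then list devices to verify.
function planManualCdiRemediation(cdiSpecDirs: string[] = []): string[] {
  const specPath = nvidiaCdiSpecPath(cdiSpecDirs);
  const specDir = specPath.slice(0, specPath.lastIndexOf("/")) || "/";
  return [
    `sudo mkdir -p ${specDir}`,
    `sudo nvidia-ctk cdi generate --output=${specPath}`,
    "nvidia-ctk cdi list",
  ];
}
```

With no configured directories the sketch resolves to `/etc/cdi/nvidia.yaml`, which matches the default named in the table. Deriving the directory from the spec path keeps the `mkdir` target and the `--output` target from drifting apart.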

Sequence Diagram

sequenceDiagram
    participant Installer as Installer Shell
    participant NodePreflight as Node preflight module
    participant System as systemd (systemctl)
    participant CTK as nvidia-ctk
    participant FS as Host filesystem (/etc/cdi)

    Installer->>NodePreflight: assessHost()
    NodePreflight-->>Installer: cdiNvidiaGpuSpecMissing + dockerCdiSpecDirs
    Installer->>System: systemctl enable --now nvidia-cdi-refresh*
    System-->>Installer: success/failure
    alt refresh success
        System->>CTK: refresh unit triggers generation
        CTK->>FS: write /etc/cdi/nvidia.yaml
        CTK-->>Installer: cdi list shows nvidia.com/gpu
    else refresh failed
        Installer->>Installer: sudo -v (if non-root)
        Installer->>Installer: sudo mkdir -p <specDir>
        Installer->>CTK: sudo nvidia-ctk cdi generate --output=<specPath>
        CTK->>FS: write <specPath>
        CTK-->>Installer: nvidia-ctk cdi list -> contains nvidia.com/gpu
    end
    Installer->>Installer: continue onboarding
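
The branching in the diagram above can be sketched as a function with an injected command runner, so the control flow is visible without touching a real host. All names here (`Runner`, `RepairOutcome`, `repairNvidiaCdiSpec`) are hypothetical, the two unit names are taken from the E2E advisor comment and should be treated as an assumption, and the `sudo -v` credential step is omitted. The real logic lives in `repair_installer_nvidia_cdi_spec` in `scripts/install.sh`.

```typescript
// A runner executes one command and reports whether it exited successfully.
// Injecting it keeps this sketch free of real system calls.
type Runner = (argv: string[]) => boolean;

type RepairOutcome = "refresh" | "generate" | "failed";

function repairNvidiaCdiSpec(
  run: Runner,
  hasGpuDevice: () => boolean, // e.g. backed by `nvidia-ctk cdi list`
  specPath: string,
): RepairOutcome {
  // First choice: let the refresh units generate the spec.
  const units = ["nvidia-cdi-refresh.path", "nvidia-cdi-refresh.service"];
  const refreshed = run(["systemctl", "enable", "--now", ...units]);
  if (refreshed && hasGpuDevice()) {
    return "refresh";
  }
  // Fallback: create the directory and generate the spec directly.
  const specDir = specPath.slice(0, specPath.lastIndexOf("/")) || "/";
  const generated =
    run(["sudo", "mkdir", "-p", specDir]) &&
    run(["sudo", "nvidia-ctk", "cdi", "generate", `--output=${specPath}`]);
  return generated && hasGpuDevice() ? "generate" : "failed";
}
```

Both branches end with the same verification, so a refresh unit that starts but never writes a spec still falls through to direct generation.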

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Getting Started

Poem

🐰 I dug a dir where yaml sleeps,
I nudged the refresh till CTK peeps.
If unit fails, I make the nest,
Generate the spec, and off we zest.
Hops of code — onboard's blessed rest.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: auto-generating missing NVIDIA CDI device specs during installer preflight, which is the primary objective of this PR. |
| Linked Issues check | ✅ Passed | The PR addresses all coding requirements from #3252: detects missing nvidia.com/gpu CDI specs, auto-generates them via nvidia-ctk during installer preflight, handles both systemd service and fallback paths, and preserves direct onboard manual remediation. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to implementing the auto-repair mechanism: installer preflight logic, CDI spec path helpers, troubleshooting documentation updates, and corresponding test coverage. No unrelated or extraneous changes detected. |

@zyang-dev zyang-dev added v0.0.41 Release target and removed v0.0.41 Release target labels May 13, 2026
@zyang-dev zyang-dev marked this pull request as draft May 13, 2026 16:37
copy-pr-bot Bot commented May 13, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@wscurran wscurran added the Platform: DGX Spark, NemoClaw CLI, and fix labels May 13, 2026
@wscurran wscurran removed the Platform: DGX Spark label May 13, 2026
…eneration

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@zyang-dev zyang-dev marked this pull request as ready for review May 13, 2026 19:15
github-actions Bot commented May 13, 2026

E2E Advisor Recommendation

Required E2E: cloud-onboard-e2e, gpu-e2e
Optional E2E: onboard-repair-e2e, gpu-double-onboard-e2e

Dispatch hint: cloud-onboard-e2e,gpu-e2e

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • cloud-onboard-e2e (medium): Validates the installer-to-onboard path still succeeds on a clean Ubuntu cloud flow after changing installer host preflight sequencing and blocking remediation behavior.
  • gpu-e2e (high): Exercises the GPU/local Ollama install and onboarding path on a real NVIDIA GPU runner, the closest existing coverage for Docker CDI/NVIDIA Toolkit interactions and sandbox GPU passthrough affected by this PR.

Optional E2E

  • onboard-repair-e2e (medium): Adjacent confidence for onboarding recovery/session behavior after preflight changes, but it does not specifically cover NVIDIA CDI repair and is not merge-blocking for this PR.
  • gpu-double-onboard-e2e (high): Useful additional GPU/onboarding confidence for re-onboard with local Ollama after installer/preflight changes, but the primary risk is first-install CDI repair so gpu-e2e is the required GPU check.

New E2E recommendations

  • gpu-cdi-runtime (high): Existing GPU E2E validates a normal GPU runner but does not explicitly force the regression shape: Docker advertises CDISpecDirs, no kind: nvidia.com/gpu CDI spec exists, installer first tries nvidia-cdi-refresh.path/service, then falls back to nvidia-ctk cdi generate, and onboarding proceeds without surfacing a blocking preflight action.
    • Suggested test: Add an E2E or regression job that provisions/isolates a GPU Docker-CDI host with the NVIDIA CDI spec removed or hidden, runs install.sh --non-interactive with NEMOCLAW_PROVIDER=ollama, asserts the refresh/fallback repair path, verifies nvidia-ctk cdi list contains nvidia.com/gpu, and confirms sandbox creation with GPU passthrough succeeds.
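
The "regression shape" described above (spec directories advertised, but no spec with `kind: nvidia.com/gpu`) can be expressed as a small check over spec file contents. This is a hedged sketch only: `SpecFiles` and `nvidiaGpuSpecMissing` are hypothetical names, the check handles YAML specs with a top-level `kind:` line and ignores JSON specs, and the PR's actual detection may work differently.

```typescript
// Map of spec file path -> file contents, e.g. read from each CDISpecDirs entry.
type SpecFiles = Record<string, string>;

// True when no spec file declares a top-level `kind: nvidia.com/gpu`.
function nvidiaGpuSpecMissing(specFiles: SpecFiles): boolean {
  // Multiline match so `kind:` must start a line; optional quotes around the value.
  const kindLine = /^kind:\s*["']?nvidia\.com\/gpu["']?\s*$/m;
  return !Object.keys(specFiles).some((path) =>
    kindLine.test(specFiles[path] ?? ""),
  );
}
```

A regression job could feed this an empty map to represent a host with the spec removed, then assert that the result flips to `false` once the installer has run.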

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: cloud-onboard-e2e,gpu-e2e

@zyang-dev zyang-dev added v0.0.42 Release target v0.0.41 Release target and removed v0.0.42 Release target labels May 13, 2026
@cv cv added v0.0.42 Release target and removed v0.0.41 Release target labels May 14, 2026
@cv cv added v0.0.43 Release target and removed v0.0.42 Release target labels May 14, 2026
@cv cv enabled auto-merge (squash) May 14, 2026 21:50
@cv cv merged commit d089166 into main May 14, 2026
28 checks passed
@zyang-dev zyang-dev deleted the fix/cdi-autorepair branch May 14, 2026 23:06
@miyoungc miyoungc mentioned this pull request May 15, 2026
12 tasks
miyoungc added a commit that referenced this pull request May 15, 2026
## Summary
Refreshes the NemoClaw docs for the 0.0.43 release window, covering GPU
onboarding fixes, installer CDI repair behavior, and Linux uninstall
cleanup. Updates the docs version metadata and regenerates the user
skill references from the source docs.

## Related Issue
None.

## Changes
- #3428 -> `docs/reference/troubleshooting.md`: Documents the installer
path that repairs missing NVIDIA CDI device specs before onboarding.
- #3515 and #3543 -> `docs/about/release-notes.md` and
`docs/reference/troubleshooting.md`: Documents the Linux Docker-driver
GPU proof permission fix for `/proc/<pid>/task/<tid>/comm` writes.
- #3536 -> `docs/reference/commands.md`: Documents that `nemoclaw
uninstall` removes Linux gateway state under `~/.local/state/nemoclaw`.
- Refreshes generated `nemoclaw-user-*` skill references from the
updated source docs.
- Bumps `docs/project.json` and `docs/versions1.json` to 0.0.43.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved GPU onboarding on Linux Docker-driver with automatic CDI spec
repair and fallback mechanisms.
* Fixed permission issues affecting GPU proof writes during Linux
onboarding.
* Enhanced uninstall to properly clean up gateway state and auth proxy
processes on Linux.

* **Documentation**
* Updated release notes, command references, and troubleshooting guides
for v0.0.43.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
Labels

fix, NemoClaw CLI, v0.0.43 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DGX Spark][Install] Onboard blocked by missing CDI device spec on fresh Spark with Skip OTA

4 participants