Skip to content

fix(preflight): bootstrap nvidia-container-toolkit when nvidia-ctk is missing#3507

Merged
cv merged 2 commits into
mainfrom
fix/3506-nvidia-ctk-missing-hint
May 15, 2026
Merged

fix(preflight): bootstrap nvidia-container-toolkit when nvidia-ctk is missing#3507
cv merged 2 commits into
mainfrom
fix/3506-nvidia-ctk-missing-hint

Conversation

@laitingsheng
Copy link
Copy Markdown
Contributor

@laitingsheng laitingsheng commented May 14, 2026

Summary

Host preflight currently emits a nvidia-ctk cdi generate ... hint whenever Docker advertises CDI dirs and no nvidia.com/gpu spec is present, but it never checks whether nvidia-ctk itself is on PATH. On a fresh Ubuntu 24.04 host that has the NVIDIA driver but not the Container Toolkit, the hint fails with nvidia-ctk: command not found and onboarding can't proceed.

This PR adds a nvidiaContainerToolkitInstalled probe to the host assessment and branches the remediation action: when the toolkit is missing, preflight emits an install_nvidia_container_toolkit action whose commands prepend the apt (or dnf/yum) bootstrap before the existing nvidia-ctk cdi generate step. On unknown package managers, it points at NVIDIA's official install guide.

Related Issue

Fixes #3506

Changes

  • src/lib/onboard/preflight.ts: new nvidiaContainerToolkitInstalled field on HostAssessment, probed via the existing commandExists("nvidia-ctk", ...) helper. Remediation for cdiNvidiaGpuSpecMissing now branches on it: toolkit-present keeps the current generate_nvidia_cdi_spec action verbatim; toolkit-missing emits install_nvidia_container_toolkit with bootstrap commands from a new exported helper buildContainerToolkitBootstrapCommands(packageManager, generateCommands) (apt / dnf / yum / docs-pointer fallback).
  • src/lib/onboard/preflight.test.ts: existing CDI fixtures updated with nvidiaContainerToolkitInstalled: true to keep their behaviour. Two new cases cover the toolkit-missing branch on apt and on the docs-pointer fallback (unknown package manager).

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai tinsonl@nvidia.com

Summary by CodeRabbit

  • New Features

    • Host assessments now detect NVIDIA Container Toolkit presence.
    • Automatic onboarding flows will install the toolkit (apt and other package manager guidance) when missing.
    • CDI specification generation now runs with stronger prerequisite checks and verification steps.
  • Tests

    • Added coverage validating toolkit detection and remediation flows for missing NVIDIA tooling.

Review Change Stack

… missing (#3506)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@laitingsheng laitingsheng added NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI). fix labels May 14, 2026
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 14, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ab3c2810-2676-4b7d-8267-34198da7061d

📥 Commits

Reviewing files that changed from the base of the PR and between 32bfc9f and 156c2db.

📒 Files selected for processing (2)
  • src/lib/onboard/preflight.test.ts
  • src/lib/onboard/preflight.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lib/onboard/preflight.ts
  • src/lib/onboard/preflight.test.ts

📝 Walkthrough

Walkthrough

Adds detection of NVIDIA Container Toolkit to host assessment and updates remediation planning: if nvidia-ctk is missing, planHostRemediation emits an install action (package-manager-aware) that bootstraps nvidia-container-toolkit and then runs CDI generation; tests updated and two new cases added.

Changes

NVIDIA Container Toolkit Detection and Remediation

Layer / File(s) Summary
Host assessment contract & probe
src/lib/onboard/preflight.ts
HostAssessment gains nvidiaContainerToolkitInstalled: boolean. assessHost probes for nvidia-ctk (uses commandExistsImpl when provided) and stores the result in the assessment.
Bootstrap command builder
src/lib/onboard/preflight.ts
New exported buildContainerToolkitBootstrapCommands(packageManager, generateCommands) returns package-manager-specific install sequences (apt, dnf/yum, unknown) and appends the caller-provided CDI generation commands.
Remediation planning for missing CDI spec
src/lib/onboard/preflight.ts
planHostRemediation precomputes a generateCommands sequence. If nvidiaContainerToolkitInstalled is true, it emits only the CDI generation action; if false, it emits a blocking install_nvidia_container_toolkit action that uses buildContainerToolkitBootstrapCommands to install the toolkit then run generation/verification.
Tests: fixtures and new cases
src/lib/onboard/preflight.test.ts
Updated existing planHostRemediation fixtures to include nvidiaContainerToolkitInstalled: true. Added two tests asserting that when the toolkit is missing: (1) on apt-based hosts the remediation includes apt keyring + nvidia-container-toolkit install and subsequent CDI generation (no direct generate_nvidia_cdi_spec action); (2) on unknown package managers the remediation includes a docs-pointer install action plus CDI generation commands.

Sequence Diagram

sequenceDiagram
  participant Client
  participant assessHost
  participant HostAssessment
  participant planHostRemediation
  participant buildContainerToolkitBootstrapCommands
  participant RemediationPlan

  Client->>assessHost: run host assessment
  assessHost->>HostAssessment: probe for nvidia-ctk, set nvidiaContainerToolkitInstalled
  Client-->>planHostRemediation: pass HostAssessment

  planHostRemediation->>planHostRemediation: prepare generateCommands
  alt nvidiaContainerToolkitInstalled == true
    planHostRemediation->>RemediationPlan: emit generate_nvidia_cdi_spec action
  else
    planHostRemediation->>buildContainerToolkitBootstrapCommands: request install sequence + generateCommands
    buildContainerToolkitBootstrapCommands-->>planHostRemediation: return install + generate sequence
    planHostRemediation->>RemediationPlan: emit install_nvidia_container_toolkit action (blocking)
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3428: Modifies the same NVIDIA CDI remediation flow and related spec generation/install sequencing.

Poem

🐰 I sniffed for ctk in the night,
No command found, so I made it right.
Repo, key, install—then generate with glee,
CDI blooms for GPUs, onboarding set free. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding detection and remediation for missing nvidia-ctk, which is the core fix for the problem stated in the linked issue.
Linked Issues check ✅ Passed The implementation fully addresses the objectives: detects missing nvidia-ctk, provides remediation via buildContainerToolkitBootstrapCommands for apt/dnf/yum with package-manager-specific bootstrap, and branches remediation to avoid assuming toolkit presence.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the issue requirements: adding nvidiaContainerToolkitInstalled detection, implementing bootstrap command builder, updating remediation logic, and adding comprehensive test coverage.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/3506-nvidia-ctk-missing-hint

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

E2E Advisor Recommendation

Required E2E: cloud-onboard-e2e, gpu-e2e
Optional E2E: gpu-double-onboard-e2e, onboard-repair-e2e

Dispatch hint: cloud-onboard-e2e,gpu-e2e

Auto-dispatched E2E: cloud-onboard-e2e, gpu-e2e via nightly-e2e.yaml at 156c2db0eaba09acc26e7a8736c0abd29822c0bbnightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • cloud-onboard-e2e (high): Preflight is part of the real onboarding path. Run the cloud onboard flow to catch regressions in non-GPU install/onboard behavior caused by the new host assessment field and nvidia-ctk probe.
  • gpu-e2e (high): This is the closest existing coverage for the changed GPU/CDI onboarding path. It validates a real NVIDIA GPU local Ollama onboard flow, GPU passthrough, sandbox creation, and inference on a GPU runner.

Optional E2E

  • gpu-double-onboard-e2e (high): Useful adjacent confidence for repeated GPU onboarding after preflight changes, but the PR does not directly modify Ollama proxy token persistence or re-onboard-specific logic.
  • onboard-repair-e2e (high): Adjacent coverage for onboarding repair/resume behavior if maintainers want extra confidence that preflight/remediation changes do not alter recovery flows.

New E2E recommendations

  • gpu-cdi-remediation (high): Existing GPU E2E validates the successful GPU onboard path, but it likely runs on a correctly provisioned GPU host and does not force nvidia-ctk missing while Docker advertises CDI dirs with no nvidia.com/gpu spec. Add targeted E2E or scenario coverage for the new blocking install_nvidia_container_toolkit remediation branch and command ordering.
    • Suggested test: Add a preflight remediation E2E/scenario that simulates or provisions a Linux GPU/CDI host with Docker CDI dirs present, missing nvidia.com/gpu CDI spec, and no nvidia-ctk, then verifies nemoclaw onboard blocks with install_nvidia_container_toolkit guidance before sandbox creation.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: cloud-onboard-e2e,gpu-e2e

@laitingsheng laitingsheng marked this pull request as ready for review May 14, 2026 08:18
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/lib/onboard/preflight.test.ts (1)

937-1029: ⚡ Quick win

Add a RHEL-family package-manager test for the new toolkit bootstrap branch.

Current coverage validates apt and unknown, but the new helper also has a dedicated dnf/yum path. A small parameterized case would lock that branch against regressions.

Suggested test shape
+  it.each(["dnf", "yum"] as const)(
+    "emits install_nvidia_container_toolkit with %s bootstrap when nvidia-ctk is missing",
+    (packageManager) => {
+      const actions = planHostRemediation({
+        platform: "linux",
+        isWsl: false,
+        runtime: "docker",
+        packageManager,
+        systemctlAvailable: true,
+        dockerServiceActive: true,
+        dockerServiceEnabled: true,
+        dockerInstalled: true,
+        dockerRunning: true,
+        dockerReachable: true,
+        nodeInstalled: true,
+        openshellInstalled: true,
+        dockerCgroupVersion: "v2",
+        dockerDefaultCgroupnsMode: "unknown",
+        isContainerRuntimeUnderProvisioned: false,
+        hasNestedOverlayConflict: false,
+        requiresHostCgroupnsFix: false,
+        isUnsupportedRuntime: false,
+        isHeadlessLikely: false,
+        hasNvidiaGpu: true,
+        dockerCdiSpecDirs: ["/etc/cdi"],
+        cdiNvidiaGpuSpecMissing: true,
+        nvidiaContainerToolkitInstalled: false,
+        notes: [],
+      });
+
+      const action = actions.find((entry) => entry.id === "install_nvidia_container_toolkit");
+      expect(action).toBeTruthy();
+      expect(action?.commands[1]).toBe(`sudo ${packageManager} install -y nvidia-container-toolkit`);
+    },
+  );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/preflight.test.ts` around lines 937 - 1029, Tests cover apt
and unknown packageManager branches for install_nvidia_container_toolkit but
miss the RHEL-family (dnf/yum) bootstrap branch; add a parameterized test
calling planHostRemediation with packageManager set to "dnf" (and/or "yum") and
the same input flags (hasNvidiaGpu: true, cdiNvidiaGpuSpecMissing: true,
nvidiaContainerToolkitInstalled: false) then assert the returned
install_nvidia_container_toolkit action exists and that its commands include the
RHEL-specific install command (e.g., "sudo dnf install -y
nvidia-container-toolkit" or the appropriate yum equivalent) and still include
the cdi generate command; place this alongside the existing apt/unknown tests
and reuse the same expectation checks (action?.kind, .blocking, .title, .reason,
cdi generate presence, and ordering relative to install command).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/onboard/preflight.test.ts`:
- Around line 937-1029: Tests cover apt and unknown packageManager branches for
install_nvidia_container_toolkit but miss the RHEL-family (dnf/yum) bootstrap
branch; add a parameterized test calling planHostRemediation with packageManager
set to "dnf" (and/or "yum") and the same input flags (hasNvidiaGpu: true,
cdiNvidiaGpuSpecMissing: true, nvidiaContainerToolkitInstalled: false) then
assert the returned install_nvidia_container_toolkit action exists and that its
commands include the RHEL-specific install command (e.g., "sudo dnf install -y
nvidia-container-toolkit" or the appropriate yum equivalent) and still include
the cdi generate command; place this alongside the existing apt/unknown tests
and reuse the same expectation checks (action?.kind, .blocking, .title, .reason,
cdi generate presence, and ordering relative to install command).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8f83134f-2b4d-40d7-9e43-5692e5c7937f

📥 Commits

Reviewing files that changed from the base of the PR and between 25a9ee9 and 32bfc9f.

📒 Files selected for processing (2)
  • src/lib/onboard/preflight.test.ts
  • src/lib/onboard/preflight.ts

@laitingsheng laitingsheng added the v0.0.42 Release target label May 14, 2026
@laitingsheng laitingsheng changed the title fix(preflight): bootstrap nvidia-container-toolkit when nvidia-ctk is missing (#3506) fix(preflight): bootstrap nvidia-container-toolkit when nvidia-ctk is missing May 14, 2026
@cv cv added v0.0.43 Release target and removed v0.0.42 Release target labels May 14, 2026
…issing-hint

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

# Conflicts:
#	src/lib/onboard/preflight.ts
@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25896582408
Target ref: 156c2db0eaba09acc26e7a8736c0abd29822c0bb
Workflow ref: main
Requested jobs: cloud-onboard-e2e,gpu-e2e
Summary: 1 passed, 0 failed, 1 skipped

Job Result
cloud-onboard-e2e ✅ success
gpu-e2e ⏭️ skipped

@cv cv added v0.0.44 Release target and removed v0.0.43 Release target labels May 15, 2026
@cv cv merged commit 5b887e1 into main May 15, 2026
27 checks passed
@miyoungc miyoungc mentioned this pull request May 16, 2026
12 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fix NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI). v0.0.44 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Ubuntu 24.04][Onboard] preflight tells user nvidia-ctk command not found

3 participants