Skip to content

fix(onboard): allow proc writes for Docker GPU patch#3543

Merged
cv merged 3 commits into
mainfrom
fix/spark-docker-gpu-proc-policy
May 14, 2026
Merged

fix(onboard): allow proc writes for Docker GPU patch#3543
cv merged 3 commits into
mainfrom
fix/spark-docker-gpu-proc-policy

Conversation

@cv
Copy link
Copy Markdown
Collaborator

@cv cv commented May 14, 2026

Summary

Allow the Linux Docker GPU patch path to mirror OpenShell's /proc GPU policy enrichment when NemoClaw recreates a sandbox container with GPU flags. This fixes DGX Spark installs where nvidia-smi succeeds but the direct GPU proof fails on /proc/<pid>/task/<tid>/comm with Permission denied.

Changes

  • Thread the Docker GPU patch decision into initial sandbox policy generation.
  • Add /proc to filesystem_policy.read_write only for the Docker GPU patch path, while preserving the native OpenShell --gpu policy behavior.
  • Add regression coverage for the Docker GPU patch policy shape.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Carlos Villela cvillela@nvidia.com

Summary by CodeRabbit

  • New Features

    • GPU patch detection is now resolved during sandbox creation and emits conditional GPU-related log messages.
    • Initial sandbox policy generation can optionally allow /proc as read-write for certain GPU patch scenarios.
  • Improvements

    • Supervisor exec readiness now suppresses execution output during the GPU patch probe.
  • Tests

    • Added/updated tests for /proc read-write policy behavior and GPU patch decision logic.

Review Change Stack

@cv cv self-assigned this May 14, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8f68f3ee-efca-41c6-b8fb-5e561d13c3f0

📥 Commits

Reviewing files that changed from the base of the PR and between 91a63ad and b9ed55a.

📒 Files selected for processing (2)
  • src/lib/onboard/docker-gpu-patch.test.ts
  • src/lib/onboard/docker-gpu-patch.ts
✅ Files skipped from review due to trivial changes (1)
  • src/lib/onboard/docker-gpu-patch.test.ts

📝 Walkthrough

Walkthrough

createSandbox resolves the Docker GPU sandbox-create plan before policy creation and conditionally logs its message; separately, the OpenShell readiness exec call now passes suppressOutput: true (alongside ignoreError: true and the existing timeout).

Changes

Docker GPU create plan and exec suppression

Layer / File(s) Summary
OpenShell exec suppression
src/lib/onboard/docker-gpu-patch.ts
waitForOpenShellSandboxExec now calls deps.runOpenshell(..., { ignoreError: true, suppressOutput: true, timeout: DOCKER_GPU_PATCH_TIMEOUT_MS }) to suppress subprocess output during the GPU patch readiness probe.
Sandbox create plan and policy wiring
src/lib/onboard.ts, src/lib/onboard/initial-policy.ts, src/lib/onboard/docker-gpu-sandbox-create.ts
createSandbox() resolves a Docker GPU sandbox-create plan early and passes { directGpu: <sandboxGpuEnabled>, dockerGpuPatch: <useDockerGpuPatch> } into prepareInitialSandboxCreatePolicy(...); GPU logging is now conditional on the resolved message.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Possibly related PRs

Suggested labels

fix, Docker, NemoClaw CLI, enhancement: policy

Suggested reviewers

  • ericksoa
  • jyaunches

Poem

🐰 I hopped through logs and quieted the noise,
Added hush to execs with tidy, swift poise,
Plans resolved early, flags set with care,
Policies ready — the sandbox breathes air,
A tiny rabbit clap for changes made fair.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(onboard): allow proc writes for Docker GPU patch' clearly and specifically summarizes the main change: enabling /proc write permissions for the Docker GPU patch path in the onboarding process.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/spark-docker-gpu-proc-policy

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

E2E Advisor Recommendation

Required E2E: gpu-e2e, cloud-onboard-e2e
Optional E2E: network-policy-e2e, gpu-double-onboard-e2e

Dispatch hint: gpu-e2e,cloud-onboard-e2e

Auto-dispatched E2E: gpu-e2e, cloud-onboard-e2e via nightly-e2e.yaml at b9ed55afcad4430cc430ccdd27806d403279cbeanightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • gpu-e2e (high): Required because this PR changes the GPU onboarding path, Docker GPU patch selection, supervisor reconnect behavior, and /proc policy needed for real GPU sandbox boot and inference.
  • cloud-onboard-e2e (high): Required as a real onboarding regression check for createSandbox and initial sandbox policy generation; it also verifies Landlock/security behavior in a non-GPU cloud onboarding flow.

Optional E2E

  • network-policy-e2e (high): Useful adjacent coverage because prepareInitialSandboxCreatePolicy was changed and this job stresses policy presets, sandbox policy application, and live policy behavior, though the diff is primarily filesystem/GPU policy rather than network policy.
  • gpu-double-onboard-e2e (high): Useful adjacent confidence for repeated GPU onboarding and sandbox recreation behavior after the Docker GPU patch changes, but not merge-blocking unless the PR is intended to change re-onboard semantics.

New E2E recommendations

  • Docker GPU patch policy enrichment (medium): Existing gpu-e2e validates the real GPU onboarding path, but there is no narrowly targeted E2E that asserts the Linux Docker-driver GPU patch path boots with the patched initial /proc policy and explicitly proves /proc//task//comm writes plus cuInit after container recreation.
    • Suggested test: Add a GPU E2E scenario or nightly job that forces the Docker GPU patch path, captures the initial sandbox policy, verifies /proc is read-write only for the patched path, and runs the direct GPU proof commands after supervisor reconnect.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: gpu-e2e,cloud-onboard-e2e

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)

5636-5661: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Move this orchestration back out of src/lib/onboard.ts to clear the entrypoint budget.

The added useDockerGpuPatch branching and log selection here is what trips the blocking CI / Onboard Entrypoint Budget job (src/lib/onboard.ts grew by 7 lines). Please hide this behind a helper in the GPU onboarding modules so this file stays orchestration-only.

As per coding guidelines src/lib/onboard.ts: This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 5636 - 5661, Extract the GPU-specific
branching and log-selection into a helper inside the GPU onboarding module
(e.g., add a function in dockerGpuSandboxCreate or a new gpuOnboard helper) that
accepts the GPU config and environment flags and returns { useDockerGpuPatch,
logMessage } or at least useDockerGpuPatch plus a descriptive message; replace
the local calls to dockerGpuSandboxCreate.shouldUseDockerGpuPatchForCreate and
the conditional console.log selection in src/lib/onboard.ts with a single call
to that helper and a single console.log of the returned message, leaving
prepareInitialSandboxCreatePolicy, initialSandboxPolicy handling
(cleanup/appliedPresets) and effectiveSandboxGpuConfig usage unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 5636-5661: Extract the GPU-specific branching and log-selection
into a helper inside the GPU onboarding module (e.g., add a function in
dockerGpuSandboxCreate or a new gpuOnboard helper) that accepts the GPU config
and environment flags and returns { useDockerGpuPatch, logMessage } or at least
useDockerGpuPatch plus a descriptive message; replace the local calls to
dockerGpuSandboxCreate.shouldUseDockerGpuPatchForCreate and the conditional
console.log selection in src/lib/onboard.ts with a single call to that helper
and a single console.log of the returned message, leaving
prepareInitialSandboxCreatePolicy, initialSandboxPolicy handling
(cleanup/appliedPresets) and effectiveSandboxGpuConfig usage unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: dd99aa85-d1e5-4c12-b190-55caa60af8eb

📥 Commits

Reviewing files that changed from the base of the PR and between f42b9d9 and a989131.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/onboard/initial-policy.ts
  • test/onboard.test.ts

@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25889896109
Target ref: a989131e58d306d3e6b5fdd1291952cbbe01612c
Workflow ref: main
Requested jobs: gpu-e2e,cloud-onboard-e2e
Summary: 1 passed, 0 failed, 1 skipped

Job Result
cloud-onboard-e2e ✅ success
gpu-e2e ⏭️ skipped

@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25890428310
Target ref: 91a63ad626f6926dddfeff0c9408535eb939e029
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

@cv cv added the v0.0.43 Release target label May 14, 2026
@cv cv enabled auto-merge (squash) May 14, 2026 23:02
@cv cv disabled auto-merge May 14, 2026 23:08
@cv cv enabled auto-merge (squash) May 14, 2026 23:15
@cv cv merged commit 84d847b into main May 14, 2026
24 checks passed
@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25891068954
Target ref: b9ed55afcad4430cc430ccdd27806d403279cbea
Workflow ref: main
Requested jobs: gpu-e2e,cloud-onboard-e2e
Summary: 1 passed, 0 failed, 1 skipped

Job Result
cloud-onboard-e2e ✅ success
gpu-e2e ⏭️ skipped

@miyoungc miyoungc mentioned this pull request May 15, 2026
12 tasks
miyoungc added a commit that referenced this pull request May 15, 2026
## Summary
Refreshes the NemoClaw docs for the 0.0.43 release window, covering GPU
onboarding fixes, installer CDI repair behavior, and Linux uninstall
cleanup. Updates the docs version metadata and regenerates the user
skill references from the source docs.

## Related Issue
None.

## Changes
- #3428 -> `docs/reference/troubleshooting.md`: Documents the installer
path that repairs missing NVIDIA CDI device specs before onboarding.
- #3515 and #3543 -> `docs/about/release-notes.md` and
`docs/reference/troubleshooting.md`: Documents the Linux Docker-driver
GPU proof permission fix for `/proc/<pid>/task/<tid>/comm` writes.
- #3536 -> `docs/reference/commands.md`: Documents that `nemoclaw
uninstall` removes Linux gateway state under `~/.local/state/nemoclaw`.
- Refreshes generated `nemoclaw-user-*` skill references from the
updated source docs.
- Bumps `docs/project.json` and `docs/versions1.json` to 0.0.43.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved GPU onboarding on Linux Docker-driver with automatic CDI spec
repair and fallback mechanisms.
* Fixed permission issues affecting GPU proof writes during Linux
onboarding.
* Enhanced uninstall to properly clean up gateway state and auth proxy
processes on Linux.

* **Documentation**
* Updated release notes, command references, and troubleshooting guides
for v0.0.43.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3613)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
@coderabbitai coderabbitai Bot mentioned this pull request May 16, 2026
15 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

v0.0.43 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants