fix(onboard): ensure GPU recreate has SYS_PTRACE + apparmor=unconfined for proc/comm write#3515

Merged: cv merged 3 commits into main from fix/3511-gpu-recreate-proc-comm-flags on May 14, 2026

Conversation

@laitingsheng
Contributor

@laitingsheng laitingsheng commented May 14, 2026

Summary

The Docker-driver GPU patch (src/lib/onboard/docker-gpu-patch.ts) recreates the OpenShell-managed sandbox container with --gpus all plus a reconstruction of the baseline container's flags, then runs a three-step GPU proof in src/lib/onboard/initial-policy.ts — including PROC_COMM_WRITE_PROBE, which writes to /proc/<pid>/task/<tid>/comm.

On DGX Spark hosts (#3511) and other Docker-driver Linux hosts where the OpenShell-created baseline container's CapAdd does not include SYS_PTRACE and its SecurityOpt does not include apparmor=unconfined, the recreate inherits a flag set the kernel/LSM stack rejects for that proc write. The proof aborts onboarding:

✗ GPU proof failed: /proc/<pid>/task/<tid>/comm write
Error: GPU proof failed: /proc/<pid>/task/<tid>/comm write (status 2):
  sh: 1: cannot create /proc/<pid>/task/<tid>/comm: Permission denied

The reporter confirmed a bare docker run --rm --gpus all ubuntu sh -c "echo test > /proc/1/task/1/comm" succeeds on the same host, so the kernel itself allows the write under the right Docker flags — the problem is which flags reach the recreated container.

This PR makes the GPU recreate self-sufficient for the operations the GPU proof checks, regardless of the non-GPU baseline:

  • Always inject --cap-add SYS_PTRACE into the clone, deduped via a Set so baselines that already have it stay flat.
  • Inject --security-opt apparmor=unconfined only when the baseline did not pin a specific apparmor profile. Docker rejects duplicate --security-opt apparmor=… entries, so a baseline that explicitly chose apparmor=docker-default (or similar) is respected — this stays scoped to the GPU recreate path and does not override deliberate operator choices.
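
The two rules above can be sketched as a small merge step. This is a minimal illustration, not the code in docker-gpu-patch.ts: the `BaselineFlags` shape and the `mergeGpuRecreateFlags` name are hypothetical, and the sketch assumes AppArmor entries use the `apparmor=` form quoted in this description.

```typescript
// Hypothetical shape for the flags read from the baseline container;
// the real types in docker-gpu-patch.ts may differ.
interface BaselineFlags {
  capAdd: string[];
  securityOpt: string[];
}

// Merge the baseline flags with what the GPU proof needs and return
// the resulting `docker run` arguments.
function mergeGpuRecreateFlags(baseline: BaselineFlags): string[] {
  // A Set keeps insertion order and silently drops duplicates, so a
  // baseline that already has SYS_PTRACE does not get a second entry.
  const caps = new Set<string>(baseline.capAdd);
  caps.add("SYS_PTRACE");

  // Only add apparmor=unconfined when the baseline pinned no profile.
  const securityOpts = new Set<string>(baseline.securityOpt);
  const pinsAppArmor = baseline.securityOpt.some((opt) =>
    opt.startsWith("apparmor="),
  );
  if (!pinsAppArmor) {
    securityOpts.add("apparmor=unconfined");
  }

  const args: string[] = [];
  caps.forEach((cap) => args.push("--cap-add", cap));
  securityOpts.forEach((opt) => args.push("--security-opt", opt));
  return args;
}

// A baseline that already has SYS_PTRACE and pins a profile stays as it was.
const pinnedArgs = mergeGpuRecreateFlags({
  capAdd: ["NET_ADMIN", "SYS_PTRACE"],
  securityOpt: ["apparmor=docker-default"],
});
console.log(pinnedArgs.join(" "));
// prints: --cap-add NET_ADMIN --cap-add SYS_PTRACE --security-opt apparmor=docker-default
```

With an empty baseline the same function would return `--cap-add SYS_PTRACE --security-opt apparmor=unconfined`, which is the case the DGX Spark report hit.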

Related Issue

Resolves #3511

Related context:

Changes

  • src/lib/onboard/docker-gpu-patch.ts: dedup CapAdd / SecurityOpt via Set, always inject SYS_PTRACE, and inject apparmor=unconfined only when no apparmor profile is pinned by the baseline.
  • src/lib/onboard/docker-gpu-patch.test.ts: four new cases — SYS_PTRACE always added; SYS_PTRACE deduped when baseline already has it; apparmor=unconfined injected on empty baselines; baseline-pinned apparmor profile preserved.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

Summary by CodeRabbit

  • Bug Fixes

    • GPU-enabled containers are now configured with the necessary kernel security capabilities during clone/recreate operations.
    • Security profile handling was fixed to preserve existing security options and avoid unintended overrides or duplicate entries.
  • Tests

    • Added tests covering GPU container clone behavior across various capability and security-profile scenarios to prevent regressions.

…d for proc/comm write (#3511)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@laitingsheng laitingsheng added the NemoClaw CLI, fix, and v0.0.42 Release target labels May 14, 2026
@copy-pr-bot

copy-pr-bot Bot commented May 14, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8072a2c5-57c3-40bd-b102-dbe1529afe00

📥 Commits

Reviewing files that changed from the base of the PR and between 86f25ac and 12f5cc8.

📒 Files selected for processing (1)
  • src/lib/onboard/docker-gpu-patch.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard/docker-gpu-patch.test.ts

Walkthrough

Adds logic and tests so GPU-recreated sandbox containers include the SYS_PTRACE capability and conditionally add apparmor=unconfined only when the baseline has no AppArmor entry, preserving other baseline capability and security options.

Changes

GPU Container Recreation Permission Fix

| Layer / File(s) | Summary |
| --- | --- |
| Permission-aware GPU recreate args: `src/lib/onboard/docker-gpu-patch.ts` | Builds capability and security-option sets from the baseline, ensures SYS_PTRACE via --cap-add, preserves --cap-drop, and injects --security-opt apparmor=unconfined only if no existing apparmor entry is present. |
| Test coverage for capability and AppArmor handling: `src/lib/onboard/docker-gpu-patch.test.ts` | Adds four tests validating that SYS_PTRACE is added when missing and not duplicated when present, that apparmor=unconfined is injected when absent, and that existing AppArmor/securityOpt entries are preserved. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3434: Related edits to docker-gpu-patch and buildDockerGpuCloneRunArgs around SYS_PTRACE and AppArmor handling.

Suggested labels

Docker, v0.0.41

Suggested reviewers

  • ericksoa
  • prekshivyas
  • jyaunches

Poem

🐰 I nudge the args with careful paws,
Adding SYS_PTRACE without a cause,
If AppArmor's missing, I unbind,
Else I keep what the host defined,
Hopping off—GPU proofs applaud!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: ensuring GPU recreate has SYS_PTRACE and apparmor=unconfined capabilities for /proc/comm write operations, which directly addresses the core issue. |
| Linked Issues check | ✅ Passed | The code changes meet the primary objective from issue #3511: the implementation adds SYS_PTRACE capability and conditional apparmor=unconfined to the GPU recreate flow, directly fixing the /proc/comm write permission denial. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing the GPU recreate Docker argument construction and adding corresponding test coverage; no unrelated modifications are present. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.



@laitingsheng laitingsheng removed the v0.0.42 Release target label May 14, 2026
@github-actions
Contributor

github-actions Bot commented May 14, 2026

E2E Advisor Recommendation

Required E2E: gpu-e2e
Optional E2E: gpu-double-onboard-e2e

Dispatch hint: gpu-e2e

Auto-dispatched E2E: gpu-e2e via nightly-e2e.yaml at f07f682a9c1d079d9700116db8a2cee61ff6bd79 (nightly run)

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • gpu-e2e (high): Directly exercises the real local Ollama GPU onboarding flow, including Docker GPU patch activation, sandbox GPU proof, OpenShell reconnect path, and inference through the recreated GPU-capable sandbox. This is the required regression check for changes to Docker GPU clone/security flags.

Optional E2E

  • gpu-double-onboard-e2e (high): Useful adjacent coverage for running GPU onboarding twice with Ollama and verifying the sandbox/proxy still works after re-onboard. It also exercises the GPU sandbox lifecycle path, but the changed code is primarily covered by the single GPU onboarding E2E.

New E2E recommendations

  • docker-gpu-security-options (medium): Existing GPU E2E validates the Docker GPU patch indirectly through proof/inference, but it does not appear to explicitly assert that the recreated sandbox container received SYS_PTRACE and the intended AppArmor behavior. A targeted assertion would catch future regressions in the exact security flags changed here.
    • Suggested test: Extend the GPU E2E proof to inspect the recreated sandbox container and verify HostConfig.CapAdd contains SYS_PTRACE and that AppArmor handling matches the expected Docker GPU patch mode without duplicating or overriding pinned profiles.
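
The suggested assertion could look roughly like the sketch below. It is an illustration under stated assumptions, not part of the existing E2E suite: `checkGpuRecreateFlags` and `InspectedContainer` are hypothetical names, the input is assumed to be one parsed entry from `docker inspect` output, and the capability normalization is a defensive guess in case a name is reported with a `CAP_` prefix.

```typescript
// Subset of one `docker inspect` entry used by this check. CapAdd and
// SecurityOpt are null when the container was started without such flags.
interface InspectedContainer {
  HostConfig: {
    CapAdd: string[] | null;
    SecurityOpt: string[] | null;
  };
}

// Returns a list of human-readable problems; an empty list means the
// recreated container carries the flags the GPU proof relies on.
function checkGpuRecreateFlags(
  recreated: InspectedContainer,
  baselineSecurityOpt: string[],
): string[] {
  const problems: string[] = [];
  const capAdd = recreated.HostConfig.CapAdd ?? [];
  const securityOpt = recreated.HostConfig.SecurityOpt ?? [];

  // Accept both "SYS_PTRACE" and "CAP_SYS_PTRACE" (assumption, see lead-in).
  const normalize = (cap: string) => cap.toUpperCase().replace(/^CAP_/, "");
  if (!capAdd.some((cap) => normalize(cap) === "SYS_PTRACE")) {
    problems.push("CapAdd is missing SYS_PTRACE");
  }

  const isAppArmor = (opt: string) => opt.startsWith("apparmor=");
  const appArmor = securityOpt.filter(isAppArmor);
  // A pinned baseline profile must survive; otherwise expect unconfined.
  const expected = baselineSecurityOpt.find(isAppArmor) ?? "apparmor=unconfined";

  if (appArmor.length === 0) {
    problems.push(`expected ${expected}, found no AppArmor entry`);
  } else if (appArmor.length > 1) {
    problems.push(`duplicate AppArmor entries: ${appArmor.join(", ")}`);
  } else if (appArmor[0] !== expected) {
    problems.push(`expected ${expected}, found ${appArmor[0]}`);
  }
  return problems;
}
```

In an E2E job the first argument would come from parsing the JSON that `docker inspect` prints for the recreated sandbox container, and the second from the baseline container's recorded `SecurityOpt`.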

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: gpu-e2e

@laitingsheng laitingsheng marked this pull request as ready for review May 14, 2026 13:33
@github-actions
Contributor

Brev E2E (full): FAILED on branch fix/3511-gpu-recreate-proc-comm-flags (see logs)

@github-actions
Contributor

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25863084990
Target ref: 86f25ac28e5b47261c387a5520cc3e33d9ce9e1a
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

@github-actions
Contributor

Brev E2E (full): FAILED on branch fix/3511-gpu-recreate-proc-comm-flags (see logs)

@github-actions
Contributor

Brev E2E (full): FAILED on branch fix/3511-gpu-recreate-proc-comm-flags (see logs)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions
Contributor

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25867131940
Target ref: 12f5cc80851d0ebcd3b9a03ba1b29b291788ddb5
Workflow ref: main
Requested jobs: gpu-e2e,gpu-double-onboard-e2e
Summary: 0 passed, 0 failed, 2 skipped

Job Result
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped

@wscurran wscurran added the v0.0.42 Release target label May 14, 2026
@wscurran wscurran requested review from cv and jyaunches May 14, 2026 16:13
@wscurran
Contributor

@cv @jyaunches adding for review, this resolves a QA blocker (#3511).

@cv cv added v0.0.43 Release target and removed v0.0.42 Release target labels May 14, 2026
@github-actions
Contributor

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25887406164
Target ref: f07f682a9c1d079d9700116db8a2cee61ff6bd79
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

@cv cv merged commit f42b9d9 into main May 14, 2026
24 checks passed
@miyoungc miyoungc mentioned this pull request May 15, 2026
12 tasks
miyoungc added a commit that referenced this pull request May 15, 2026
## Summary
Refreshes the NemoClaw docs for the 0.0.43 release window, covering GPU
onboarding fixes, installer CDI repair behavior, and Linux uninstall
cleanup. Updates the docs version metadata and regenerates the user
skill references from the source docs.

## Related Issue
None.

## Changes
- #3428 -> `docs/reference/troubleshooting.md`: Documents the installer
path that repairs missing NVIDIA CDI device specs before onboarding.
- #3515 and #3543 -> `docs/about/release-notes.md` and
`docs/reference/troubleshooting.md`: Documents the Linux Docker-driver
GPU proof permission fix for `/proc/<pid>/task/<tid>/comm` writes.
- #3536 -> `docs/reference/commands.md`: Documents that `nemoclaw
uninstall` removes Linux gateway state under `~/.local/state/nemoclaw`.
- Refreshes generated `nemoclaw-user-*` skill references from the
updated source docs.
- Bumps `docs/project.json` and `docs/versions1.json` to 0.0.43.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved GPU onboarding on Linux Docker-driver with automatic CDI spec
repair and fallback mechanisms.
* Fixed permission issues affecting GPU proof writes during Linux
onboarding.
* Enhanced uninstall to properly clean up gateway state and auth proxy
processes on Linux.

* **Documentation**
* Updated release notes, command references, and troubleshooting guides
for v0.0.43.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
Labels

fix, NemoClaw CLI, v0.0.43 Release target

Development

Successfully merging this pull request may close these issues.

[DGX Spark/Station][Install] Docker-driver GPU patch fails /proc/comm write — sandbox creation aborts with exit 1

3 participants