
fix(snapshot): use gateway metadata for VM-driver health checks #3784

Merged
jyaunches merged 3 commits into NVIDIA:main from se7en-agent:fix/3567-snapshot-openshell-state
May 20, 2026

Conversation

@se7en-agent
Contributor

@se7en-agent se7en-agent commented May 19, 2026

Summary

Fix nemoclaw <sandbox> snapshot create on macOS Apple Silicon VM-driver sandboxes by checking live OpenShell gateway health through gateway metadata instead of the legacy cluster container. This prevents healthy VM-driver sandboxes from being rejected before snapshot creation starts.

Related Issue

Fixes #3567

Changes

  • Treat openshellDriver: "vm" sandboxes like Docker-driver gateway paths for snapshot gateway health checks.
  • Use OpenShell status / gateway info metadata via the shared gateway health classifier instead of inspecting openshell-cluster-nemoclaw for VM-driver sandboxes.
  • Preserve the existing legacy gateway guard for non-Docker/VM paths so stopped legacy cluster gateways still fail closed.
  • Add a CLI-level regression test for VM-driver snapshot creation with healthy gateway metadata and no legacy cluster container.
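
The driver-based selection described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in src/lib/actions/sandbox/snapshot.ts: the `OpenshellDriver` type, the `GatewayProbes` interface, and the functions `usesGatewayMetadata` and `gatewayIsRunning` are hypothetical names, the two-argument classifier is a stand-in whose signature may differ from the real `isGatewayHealthy`, and the `"placeholder-ok"` strings are invented test values rather than real OpenShell output.

```typescript
// Illustrative sketch only: every name and signature here is an assumption,
// not the implementation merged in this PR.
type OpenshellDriver = "docker" | "vm" | undefined;

interface GatewayProbes {
  // Assumed to return the raw output of the OpenShell status and gateway info queries.
  readGatewayMetadata: () => { status: string; gatewayInfo: string };
  // Stand-in for the shared gateway health classifier.
  isGatewayHealthy: (status: string, gatewayInfo: string) => boolean;
  // Assumed to report State.Running for the legacy cluster container.
  legacyContainerRunning: () => boolean;
}

// Docker- and VM-driver sandboxes are probed through gateway metadata.
function usesGatewayMetadata(driver: OpenshellDriver): boolean {
  return driver === "docker" || driver === "vm";
}

function gatewayIsRunning(driver: OpenshellDriver, probes: GatewayProbes): boolean {
  if (usesGatewayMetadata(driver)) {
    const { status, gatewayInfo } = probes.readGatewayMetadata();
    return probes.isGatewayHealthy(status, gatewayInfo);
  }
  // Any other driver keeps the legacy guard, so a stopped gateway still fails closed.
  return probes.legacyContainerRunning();
}

// Usage with fake probes: healthy metadata, no legacy cluster container running.
const fakeProbes: GatewayProbes = {
  readGatewayMetadata: () => ({ status: "placeholder-ok", gatewayInfo: "placeholder-ok" }),
  isGatewayHealthy: (status, gatewayInfo) =>
    status === "placeholder-ok" && gatewayInfo === "placeholder-ok",
  legacyContainerRunning: () => false,
};

console.log(gatewayIsRunning("vm", fakeProbes)); // true: metadata path
console.log(gatewayIsRunning(undefined, fakeProbes)); // false: legacy path fails closed
```

Passing the probes in as an object keeps the sketch free of real process calls, which is also roughly how the regression test isolates the CLI from a live gateway by stubbing executables.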

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Verified snapshot creation on my macOS Apple Silicon device:

 node bin/nemoclaw.js my-assistant snapshot create --name my-snapshot
  Creating snapshot of 'my-assistant' (--name my-snapshot)...
  ✓ Snapshot v1 name=my-snapshot created (12 directories, 0 files)
    /Users/sevenc/.nemoclaw/rebuild-backups/my-assistant/2026-05-19T02-42-04-301Z

Signed-off-by: Se7en-Agent <se7en.agent.ai@gmail.com>

Summary by CodeRabbit

  • Bug Fixes

    • Improved gateway running-state detection for snapshot commands, making snapshots more reliable for Docker- and VM-driven sandboxes by using metadata-based gateway health probing.
  • Tests

    • Added test helpers and suites that validate snapshot creation against healthy VM-driven gateways and guard against false "failed to query" errors.


@copy-pr-bot

copy-pr-bot Bot commented May 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7caa0d9f-429a-40a1-abb5-aad1414d7468

📥 Commits

Reviewing files that changed from the base of the PR and between 6be0028 and b54cf60.

📒 Files selected for processing (1)
  • src/lib/actions/sandbox/snapshot.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/actions/sandbox/snapshot.ts

📝 Walkthrough

Walkthrough

Snapshot runtime now detects gateway readiness for Docker/VM-driven sandboxes via OpenShell metadata (status + gateway info) evaluated with isGatewayHealthy; other drivers still use container State.Running. Tests add helpers and a healthy VM-driver environment to validate snapshot creation.

Changes

Gateway metadata probing and snapshot tests

| Layer / File(s) | Summary |
| --- | --- |
| Gateway metadata probing logic (src/lib/actions/sandbox/snapshot.ts) | Removes stripAnsi, imports isGatewayHealthy, and changes gateway-running detection: queries OpenShell status and gateway info for named and active gateways and evaluates readiness via isGatewayHealthy for docker/vm; falls back to Docker container State.Running for other drivers. |
| Test helpers and stopped-gateway refactor (test/snapshot-gateway-guard.test.ts) | Adds writeExecutable() and writeSandboxRegistry() helpers and refactors the stopped-gateway environment builder to use them while preserving stale openshell sandbox list and a non-running docker inspect. |
| Healthy VM-driver gateway environment (test/snapshot-gateway-guard.test.ts) | Introduces makeHealthyVmGatewayEnv() that writes a sandbox registry with openshellDriver: "vm" and stubs openshell (gateway info, sandbox list, ssh-config, status), ssh, and docker inspect to simulate a healthy VM-driver gateway. |
| VM gateway snapshot creation test (test/snapshot-gateway-guard.test.ts) | New test suite verifies alpha snapshot create --name baseline succeeds against the healthy VM-driver environment and does not output the "Failed to query live sandbox state from OpenShell" error. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3454: Changes onboarding to record sandbox openshellDriver as "docker", which affects driver-based probing logic used here.

Suggested labels

Platform: ARM64, Docker

Suggested reviewers

  • ericksoa

Poem

🐰 I hop where registries gleam in the night,

I whisper to OpenShell, "Are you all right?"
Metadata answers, health lights the track,
VM and Docker nod — snapshots no longer lack.
A rabbit cheers softly: tests pass, whee!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: using gateway metadata for VM-driver health checks in snapshot operations, which directly addresses the core objective of fixing snapshot creation on VM-driver sandboxes. |
| Linked Issues check | ✅ Passed | The code changes directly address issue #3567 by switching snapshot gateway health checks from legacy cluster container probing to OpenShell gateway metadata for VM-driver sandboxes, enabling snapshot creation to succeed on Apple Silicon with live Docker-driver/OpenShell gateways. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing snapshot gateway health checks for VM-driver sandboxes; modifications to snapshot.ts update health detection logic and test file adds VM-driver gateway environment builder and regression test aligned with issue #3567 requirements. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@se7en-agent force-pushed the fix/3567-snapshot-openshell-state branch from 8e87dfb to 4523318 on May 19, 2026 02:46
@cr7258 requested a review from cv on May 19, 2026 02:49
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
test/snapshot-gateway-guard.test.ts (1)

84-141: ⚡ Quick win

Consider refactoring to use the new test helpers.

The makeStoppedGatewayEnv() function manually creates the sandbox registry and executable stubs, duplicating logic now available in writeSandboxRegistry() and writeExecutable(). Refactoring to use these helpers would improve consistency and maintainability.

♻️ Example refactor
 function makeStoppedGatewayEnv(prefix: string): Record<string, string> {
   const home = fs.mkdtempSync(path.join(os.tmpdir(), prefix));
   const localBin = path.join(home, "bin");
   fs.mkdirSync(localBin, { recursive: true });
+  writeSandboxRegistry(home, "alpha");

-  const registryDir = path.join(home, ".nemoclaw");
-  fs.mkdirSync(registryDir, { recursive: true });
-  fs.writeFileSync(
-    path.join(registryDir, "sandboxes.json"),
-    JSON.stringify({
-      sandboxes: {
-        alpha: {
-          name: "alpha",
-          model: "test-model",
-          provider: "nvidia-prod",
-          gpuEnabled: false,
-          policies: [],
-        },
-      },
-      defaultSandbox: "alpha",
-    }),
-    { mode: 0o600 },
-  );

   // openshell lies: sandbox list exits 0 and lists alpha as Ready even though
   // the gateway container is down (reads stale local registry/cache).
-  fs.writeFileSync(
-    path.join(localBin, "openshell"),
+  writeExecutable(path.join(localBin, "openshell"), [
-    [
-      "#!/bin/sh",
       'if [ "$1" = "sandbox" ] && [ "$2" = "list" ]; then',
       '  printf "NAME STATUS\\nalpha Ready\\n"',
       "  exit 0",
       "fi",
       "exit 0",
-    ].join("\n"),
-    { mode: 0o755 },
-  );
+  ]);

   // docker inspect: returns "false" for State.Running (gateway stopped).
-  fs.writeFileSync(
-    path.join(localBin, "docker"),
+  writeExecutable(path.join(localBin, "docker"), [
-    [
-      "#!/bin/sh",
       'if [ "$1" = "inspect" ]; then',
       '  echo "false"',
       "  exit 0",
       "fi",
       "exit 0",
-    ].join("\n"),
-    { mode: 0o755 },
-  );
+  ]);
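
For context, the two helpers used in the refactor above might look something like the sketch below. It is inferred from the removed lines in that diff (shebang line, `.join("\n")`, modes 0o755 and 0o600) and is not copied from test/snapshot-gateway-guard.test.ts. The optional `extra` parameter on `writeSandboxRegistry` is an assumption added here to show how an `openshellDriver: "vm"` entry could be written; the real helper's signature may differ.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: writes a /bin/sh stub, prepending the shebang to the body lines.
function writeExecutable(filePath: string, bodyLines: string[]): void {
  fs.writeFileSync(filePath, ["#!/bin/sh", ...bodyLines].join("\n"), { mode: 0o755 });
}

// Hypothetical helper: writes <home>/.nemoclaw/sandboxes.json with one default sandbox.
// `extra` is an assumed extension point for fields such as openshellDriver.
function writeSandboxRegistry(
  home: string,
  name: string,
  extra: Record<string, unknown> = {},
): void {
  const registryDir = path.join(home, ".nemoclaw");
  fs.mkdirSync(registryDir, { recursive: true });
  fs.writeFileSync(
    path.join(registryDir, "sandboxes.json"),
    JSON.stringify({
      sandboxes: {
        [name]: {
          name,
          model: "test-model",
          provider: "nvidia-prod",
          gpuEnabled: false,
          policies: [],
          ...extra,
        },
      },
      defaultSandbox: name,
    }),
    { mode: 0o600 },
  );
}

// Usage: build a throwaway HOME with a VM-driver registry and a docker stub.
const home = fs.mkdtempSync(path.join(os.tmpdir(), "snapshot-guard-"));
const localBin = path.join(home, "bin");
fs.mkdirSync(localBin, { recursive: true });
writeSandboxRegistry(home, "alpha", { openshellDriver: "vm" });
writeExecutable(path.join(localBin, "docker"), [
  'if [ "$1" = "inspect" ]; then',
  '  echo "false"',
  "  exit 0",
  "fi",
  "exit 0",
]);
```

Note that the `mode` option to `fs.writeFileSync` is filtered through the process umask, so the resulting permissions can be narrower than requested.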
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/snapshot-gateway-guard.test.ts` around lines 84 - 141, Replace the
manual file/dir setup in makeStoppedGatewayEnv with the shared test helpers:
call writeSandboxRegistry to create the .nemoclaw/sandboxes.json (passing the
same registry object with alpha/defaultSandbox and file mode 0o600) and use
writeExecutable to create the two executables ("openshell" and "docker") in the
temp bin directory with the same script bodies and mode 0o755; preserve
returning the same env object (HOME and PATH including the created local bin)
and keep the script behavior (openshell returns "alpha Ready" on `sandbox list`,
docker returns "false" for inspect) so tests remain functionally identical while
removing duplicated fs logic in makeStoppedGatewayEnv.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/actions/sandbox/snapshot.ts`:
- Around line 204-221: The function probeDockerDriverGatewayRunning() is
misnamed because it handles both Docker and VM drivers; rename it to
probeGatewayMetadataHealth (or probeGatewayRunningViaMetadata) and update all
call sites (including the usage that currently invokes
probeDockerDriverGatewayRunning at the sandbox snapshot logic) to the new name;
ensure you update any exports, imports, and tests that reference
probeDockerDriverGatewayRunning and keep the function body (calls to
captureOpenshell and isGatewayHealthy) unchanged.

In `@test/snapshot-gateway-guard.test.ts`:
- Line 189: The test suite name string in the describe block is
misleading—rename the describe title from "snapshot Docker-driver gateway guard"
to match the actual VM-driver scenario (e.g., "snapshot VM-driver gateway guard"
or "snapshot Docker/VM-driver gateway guard") so it reflects the test using
makeHealthyVmGatewayEnv() (and any other VM-specific setup); update the
describe() call's first argument accordingly to keep test semantics unchanged.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4c2f0d44-e21a-43cf-bbfc-c4f883227b42

📥 Commits

Reviewing files that changed from the base of the PR and between 5a03166 and 4523318.

📒 Files selected for processing (2)
  • src/lib/actions/sandbox/snapshot.ts
  • test/snapshot-gateway-guard.test.ts

Comment thread src/lib/actions/sandbox/snapshot.ts
Comment thread test/snapshot-gateway-guard.test.ts Outdated
@wscurran added the fix, NemoClaw CLI, OpenShell, and Sandbox labels May 19, 2026
@wscurran
Contributor

✨ Thanks for submitting this detailed PR about fixing the snapshot creation issue on macOS Apple Silicon VM-driver sandboxes. This proposes a code change to use OpenShell gateway metadata for health checks instead of the legacy cluster container, which should resolve the issue reported in #3567.


Related open issues:


@wscurran added the Platform: macOS label May 19, 2026
@jyaunches jyaunches merged commit 36491d2 into NVIDIA:main May 20, 2026
20 checks passed
ericksoa added a commit that referenced this pull request May 20, 2026
## Summary
- Reverts the squash commit from PR #3832 exactly:
b7deb55
- Restores dependency/runtime versions and OpenClaw remediation files to
the pre-#3832 state while preserving the later main commit
fix(snapshot): use gateway metadata for VM-driver health checks (#3784)

## Verification
- git revert --signoff --no-edit
b7deb55 applied cleanly from current
origin/main
- git diff --check HEAD^ HEAD

Note: This PR intentionally undoes the merged dependency upgrade. It has
not been merged.


## Summary by CodeRabbit

## Release Notes

* **Chores**
* Updated OpenClaw to version 2026.4.24, OpenShell to 0.0.39, and Hermes
to 2026.4.23.
  * Updated WeChat plugin dependency from 2.4.3 to 2.4.2.
* Streamlined WeChat account configuration logic and refined tool-call
handling in Kimi inference compatibility.
  * Updated internal test suites and validation scripts.


Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
miyoungc added a commit that referenced this pull request May 21, 2026
## Summary
Refreshes NemoClaw release notes for v0.0.47 and v0.0.48, then
regenerates the corresponding user-skill references so agent-facing docs
match the source pages.

Preview:
https://nvidia-preview-docs-release-notes-47-48.docs.buildwithfern.com/nemoclaw/about/release-notes

## Changes
- Adds explicit v0.0.47 and v0.0.48 sections to
`docs/about/release-notes.mdx`.
- Documents follow-up WSL Ollama, sandbox image, share mount, and
troubleshooting updates from recent release changes.
- Regenerates `nemoclaw-user-*` skill references from the Fern MDX
source docs.

## Source Summary
- #4003 -> `docs/about/release-notes.mdx`: Notes the messaging manifest
registry work as part of v0.0.48 release coverage.
- #3984 -> `docs/about/release-notes.mdx`: Captures Hermes messaging
policy scoping in the v0.0.48 release notes.
- #3963 -> `docs/about/release-notes.mdx`: Captures DGX Spark Hermes GPU
recreation startup recovery in the v0.0.48 release notes.
- #3961 -> `docs/about/release-notes.mdx`: Captures Discord loopback
proxy routing in the v0.0.48 release notes.
- #3940 -> `docs/about/release-notes.mdx`: Captures installer prompt
clarification and express-install behavior in the v0.0.48 release notes.
- #3946 -> `docs/about/release-notes.mdx`: Carries forward the Homebrew
preinstall clarification in release coverage.
- #3937 -> `docs/about/release-notes.mdx`: Carries forward the dashboard
URL command and post-install next steps coverage.
- #3921 -> `docs/about/release-notes.mdx`: Carries forward managed vLLM
default behavior for DGX Spark and DGX Station.
- #3931 -> `docs/about/release-notes.mdx`,
`docs/reference/architecture.mdx`: Documents the sandbox `python` to
`python3` compatibility symlink.
- #1485 -> `docs/about/release-notes.mdx`,
`docs/reference/architecture.mdx`: Documents the sandbox image Docker
health check.
- #3784 -> `docs/about/release-notes.mdx`: Captures VM-driver snapshot
health-check reliability in release notes.
- #3917 -> `docs/about/release-notes.mdx`: Captures package-based
workspace template resolution in release notes.
- #3170 -> `docs/about/release-notes.mdx`: Captures installer checksum
compatibility from preferring `sha256sum`.
- #3898 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage
for messaging provider scenario validation.
- #3897 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage
for baseline onboarding scenario validation.
- #3834 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage
for PR review advisor automation.
- #3838 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage
for CLI display registry refactoring.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

`make docs` was attempted but could not complete because `npx fern-api`
failed with `403 Forbidden` from `https://registry.npmjs.org/fern-api`
in this environment. Pre-commit and pre-push hooks passed after
refreshing the local CLI build output with `npm run build:cli`; no build
artifacts were committed.

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

## Summary by CodeRabbit

* **Documentation**
* Added WSL onboarding notes for Windows-host Ollama detection, restart
guidance, and PowerShell checks.
* Clarified express-install behavior (non-interactive, sudo prompts) and
default sandbox policy selection.
* Added Windows preparation guidance when installer tooling is missing
(winget/App Installer or Docker Desktop).
* Expanded sandbox docs with Docker health checks, Homebrew/python
compatibility helpers, share-mount path validation, Discord
troubleshooting, and new v0.0.48/v0.0.47 release notes.
* **Chores**
  * Improved docs preview workflow error handling.


Labels

fix, NemoClaw CLI, OpenShell, Platform: macOS, Sandbox

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[macOS][Sandbox] nemoclaw snapshot create fails on Apple Silicon: "Failed to query live sandbox state from OpenShell"

4 participants