fix(snapshot): use gateway metadata for VM-driver health checks#3784
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughSnapshot runtime now detects gateway readiness for Docker/VM-driven sandboxes via OpenShell metadata (status + gateway info) evaluated with isGatewayHealthy; other drivers still use container State.Running. Tests add helpers and a healthy VM-driver environment to validate snapshot creation. ChangesGateway metadata probing and snapshot tests
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
8e87dfb to
4523318
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
test/snapshot-gateway-guard.test.ts (1)
84-141: ⚡ Quick winConsider refactoring to use the new test helpers.
The
makeStoppedGatewayEnv()function manually creates the sandbox registry and executable stubs, duplicating logic now available inwriteSandboxRegistry()andwriteExecutable(). Refactoring to use these helpers would improve consistency and maintainability.♻️ Example refactor
function makeStoppedGatewayEnv(prefix: string): Record<string, string> { const home = fs.mkdtempSync(path.join(os.tmpdir(), prefix)); const localBin = path.join(home, "bin"); fs.mkdirSync(localBin, { recursive: true }); + writeSandboxRegistry(home, "alpha"); - const registryDir = path.join(home, ".nemoclaw"); - fs.mkdirSync(registryDir, { recursive: true }); - fs.writeFileSync( - path.join(registryDir, "sandboxes.json"), - JSON.stringify({ - sandboxes: { - alpha: { - name: "alpha", - model: "test-model", - provider: "nvidia-prod", - gpuEnabled: false, - policies: [], - }, - }, - defaultSandbox: "alpha", - }), - { mode: 0o600 }, - ); // openshell lies: sandbox list exits 0 and lists alpha as Ready even though // the gateway container is down (reads stale local registry/cache). - fs.writeFileSync( - path.join(localBin, "openshell"), + writeExecutable(path.join(localBin, "openshell"), [ - [ - "#!/bin/sh", 'if [ "$1" = "sandbox" ] && [ "$2" = "list" ]; then', ' printf "NAME STATUS\\nalpha Ready\\n"', " exit 0", "fi", "exit 0", - ].join("\n"), - { mode: 0o755 }, - ); + ]); // docker inspect: returns "false" for State.Running (gateway stopped). - fs.writeFileSync( - path.join(localBin, "docker"), + writeExecutable(path.join(localBin, "docker"), [ - [ - "#!/bin/sh", 'if [ "$1" = "inspect" ]; then', ' echo "false"', " exit 0", "fi", "exit 0", - ].join("\n"), - { mode: 0o755 }, - ); + ]);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/snapshot-gateway-guard.test.ts` around lines 84 - 141, Replace the manual file/dir setup in makeStoppedGatewayEnv with the shared test helpers: call writeSandboxRegistry to create the .nemoclaw/sandboxes.json (passing the same registry object with alpha/defaultSandbox and file mode 0o600) and use writeExecutable to create the two executables ("openshell" and "docker") in the temp bin directory with the same script bodies and mode 0o755; preserve returning the same env object (HOME and PATH including the created local bin) and keep the script behavior (openshell returns "alpha Ready" on `sandbox list`, docker returns "false" for inspect) so tests remain functionally identical while removing duplicated fs logic in makeStoppedGatewayEnv.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/actions/sandbox/snapshot.ts`:
- Around line 204-221: The function probeDockerDriverGatewayRunning() is
misnamed because it handles both Docker and VM drivers; rename it to
probeGatewayMetadataHealth (or probeGatewayRunningViaMetadata) and update all
call sites (including the usage that currently invokes
probeDockerDriverGatewayRunning at the sandbox snapshot logic) to the new name;
ensure you update any exports, imports, and tests that reference
probeDockerDriverGatewayRunning and keep the function body (calls to
captureOpenshell and isGatewayHealthy) unchanged.
In `@test/snapshot-gateway-guard.test.ts`:
- Line 189: The test suite name string in the describe block is
misleading—rename the describe title from "snapshot Docker-driver gateway guard"
to match the actual VM-driver scenario (e.g., "snapshot VM-driver gateway guard"
or "snapshot Docker/VM-driver gateway guard") so it reflects the test using
makeHealthyVmGatewayEnv() (and any other VM-specific setup); update the
describe() call's first argument accordingly to keep test semantics unchanged.
---
Nitpick comments:
In `@test/snapshot-gateway-guard.test.ts`:
- Around line 84-141: Replace the manual file/dir setup in makeStoppedGatewayEnv
with the shared test helpers: call writeSandboxRegistry to create the
.nemoclaw/sandboxes.json (passing the same registry object with
alpha/defaultSandbox and file mode 0o600) and use writeExecutable to create the
two executables ("openshell" and "docker") in the temp bin directory with the
same script bodies and mode 0o755; preserve returning the same env object (HOME
and PATH including the created local bin) and keep the script behavior
(openshell returns "alpha Ready" on `sandbox list`, docker returns "false" for
inspect) so tests remain functionally identical while removing duplicated fs
logic in makeStoppedGatewayEnv.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4c2f0d44-e21a-43cf-bbfc-c4f883227b42
📒 Files selected for processing (2)
src/lib/actions/sandbox/snapshot.tstest/snapshot-gateway-guard.test.ts
|
✨ Thanks for submitting this detailed PR about fixing the snapshot creation issue on macOS Apple Silicon VM-driver sandboxes. This proposes a code change to use OpenShell gateway metadata for health checks instead of the legacy cluster container, which should resolve the issue reported in #3567. Related open issues: |
1 similar comment
|
✨ Thanks for submitting this detailed PR about fixing the snapshot creation issue on macOS Apple Silicon VM-driver sandboxes. This proposes a code change to use OpenShell gateway metadata for health checks instead of the legacy cluster container, which should resolve the issue reported in #3567. Related open issues: |
## Summary - Reverts the squash commit from PR #3832 exactly: b7deb55 - Restores dependency/runtime versions and OpenClaw remediation files to the pre-#3832 state while preserving the later main commit fix(snapshot): use gateway metadata for VM-driver health checks (#3784) ## Verification - git revert --signoff --no-edit b7deb55 applied cleanly from current origin/main - git diff --check HEAD^ HEAD Note: This PR intentionally undoes the merged dependency upgrade. It has not been merged. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **Chores** * Updated OpenClaw to version 2026.4.24, OpenShell to 0.0.39, and Hermes to 2026.4.23. * Updated WeChat plugin dependency from 2.4.3 to 2.4.2. * Streamlined WeChat account configuration logic and refined tool-call handling in Kimi inference compatibility. * Updated internal test suites and validation scripts. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3924?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
## Summary Refreshes NemoClaw release notes for v0.0.47 and v0.0.48, then regenerates the corresponding user-skill references so agent-facing docs match the source pages. Preview: https://nvidia-preview-docs-release-notes-47-48.docs.buildwithfern.com/nemoclaw/about/release-notes ## Changes - Adds explicit v0.0.47 and v0.0.48 sections to `docs/about/release-notes.mdx`. - Documents follow-up WSL Ollama, sandbox image, share mount, and troubleshooting updates from recent release changes. - Regenerates `nemoclaw-user-*` skill references from the Fern MDX source docs. ## Source Summary - #4003 -> `docs/about/release-notes.mdx`: Notes the messaging manifest registry work as part of v0.0.48 release coverage. - #3984 -> `docs/about/release-notes.mdx`: Captures Hermes messaging policy scoping in the v0.0.48 release notes. - #3963 -> `docs/about/release-notes.mdx`: Captures DGX Spark Hermes GPU recreation startup recovery in the v0.0.48 release notes. - #3961 -> `docs/about/release-notes.mdx`: Captures Discord loopback proxy routing in the v0.0.48 release notes. - #3940 -> `docs/about/release-notes.mdx`: Captures installer prompt clarification and express-install behavior in the v0.0.48 release notes. - #3946 -> `docs/about/release-notes.mdx`: Carries forward the Homebrew preinstall clarification in release coverage. - #3937 -> `docs/about/release-notes.mdx`: Carries forward the dashboard URL command and post-install next steps coverage. - #3921 -> `docs/about/release-notes.mdx`: Carries forward managed vLLM default behavior for DGX Spark and DGX Station. - #3931 -> `docs/about/release-notes.mdx`, `docs/reference/architecture.mdx`: Documents the sandbox `python` to `python3` compatibility symlink. - #1485 -> `docs/about/release-notes.mdx`, `docs/reference/architecture.mdx`: Documents the sandbox image Docker health check. - #3784 -> `docs/about/release-notes.mdx`: Captures VM-driver snapshot health-check reliability in release notes. - #3917 -> `docs/about/release-notes.mdx`: Captures package-based workspace template resolution in release notes. - #3170 -> `docs/about/release-notes.mdx`: Captures installer checksum compatibility from preferring `sha256sum`. - #3898 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage for messaging provider scenario validation. - #3897 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage for baseline onboarding scenario validation. - #3834 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage for PR review advisor automation. - #3838 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage for CLI display registry refactoring. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) `make docs` was attempted but could not complete because `npx fern-api` failed with `403 Forbidden` from `https://registry.npmjs.org/fern-api` in this environment. Pre-commit and pre-push hooks passed after refreshing the local CLI build output with `npm run build:cli`; no build artifacts were committed. --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Added WSL onboarding notes for Windows-host Ollama detection, restart guidance, and PowerShell checks. * Clarified express-install behavior (non-interactive, sudo prompts) and default sandbox policy selection. * Added Windows preparation guidance when installer tooling is missing (winget/App Installer or Docker Desktop). * Expanded sandbox docs with Docker health checks, Homebrew/python compatibility helpers, share-mount path validation, Discord troubleshooting, and new v0.0.48/v0.0.47 release notes. * **Chores** * Improved docs preview workflow error handling. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/4007?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Fix
nemoclaw <sandbox> snapshot createon macOS Apple Silicon VM-driver sandboxes by checking live OpenShell gateway health through gateway metadata instead of the legacy cluster container. This prevents healthy VM-driver sandboxes from being rejected before snapshot creation starts.Related Issue
Fixes #3567
Changes
openshellDriver: "vm"sandboxes like Docker-driver gateway paths for snapshot gateway health checks.status/gateway infometadata via the shared gateway health classifier instead of inspectingopenshell-cluster-nemoclawfor VM-driver sandboxes.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Verify the snapshot creation on my MacOS apple Silicon device:
node bin/nemoclaw.js my-assistant snapshot create --name my-snapshot Creating snapshot of 'my-assistant' (--name my-snapshot)... ✓ Snapshot v1 name=my-snapshot created (12 directories, 0 files) /Users/sevenc/.nemoclaw/rebuild-backups/my-assistant/2026-05-19T02-42-04-301ZSigned-off-by: Se7en-Agent se7en.agent.ai@gmail.com
Summary by CodeRabbit
Bug Fixes
Tests