feat(vm): openshell-vm microVM gateway backend (full lifecycle, E2E 30/30) #1791
Conversation
OpenShell v0.0.26 ships openshell-vm, a standalone binary that boots a hardware-isolated microVM via libkrun. This commit lays the groundwork for NemoClaw to support both the existing Docker/k3s gateway and the new microVM gateway as selectable backends.

Phase 1 changes:
- Bump min_openshell_version from 0.0.24 to 0.0.26 across blueprint, install script, onboard preflight, CI scripts, E2E tests, and docs
- Add gateway_backends field to the blueprint.yaml schema (docker, vm)
- Add isOpenshellVmAvailable() and getInstalledOpenshellVmVersion() to openshell.ts for detecting the openshell-vm binary
- Add detectGatewayBackend() to platform.ts with a NEMOCLAW_GATEWAY_BACKEND env var override, auto-detection preferring VM when available and falling back to Docker, and mandatory Docker for GPU workloads
- Add gatewayBackend field to the onboard session schema for persisting the selected backend across resume cycles
- Add tests for all new functions

The VM backend requires no Docker daemon and provides faster boot, but has no NVIDIA GPU passthrough (libkrun lacks PCI/VFIO support), so the Docker backend remains mandatory for local inference on GPU workstations.

Refs: NVIDIA/OpenShell#611
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
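The selection order described above (env override first, then mandatory Docker for GPU, then prefer-VM auto-detection) can be sketched as follows. This is an illustrative version only: the function name matches the commit, but the option shape and null fallback here are assumptions, not the actual platform.ts code.

```typescript
// Illustrative sketch of the backend selection policy; the real
// detectGatewayBackend() in platform.ts may differ in option shape.
type Backend = "vm" | "docker";

interface DetectOpts {
  env?: Record<string, string | undefined>;
  vmAvailable?: boolean;     // result of isOpenshellVmAvailable()
  dockerAvailable?: boolean; // result of Docker host detection
  gpuRequested?: boolean;    // GPU workloads must stay on Docker
}

function detectGatewayBackend(opts: DetectOpts = {}): Backend | null {
  const env = opts.env ?? process.env;
  // 1. Explicit override always wins.
  const override = env.NEMOCLAW_GATEWAY_BACKEND;
  if (override === "vm" || override === "docker") return override;
  // 2. libkrun has no PCI/VFIO passthrough, so GPU forces Docker.
  if (opts.gpuRequested) return opts.dockerAvailable ? "docker" : null;
  // 3. Auto-detect: prefer VM when available, fall back to Docker.
  if (opts.vmAvailable) return "vm";
  if (opts.dockerAvailable) return "docker";
  return null;
}
```

A caller would pass the detection results in explicitly, e.g. `detectGatewayBackend({ vmAvailable: isOpenshellVmAvailable(), dockerAvailable: true })`.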
🚀 Docs preview ready!
Phase 4 of the dual-backend feature (PR #1791):
- New test/e2e/test-vm-backend-e2e.sh: full E2E journey for the VM backend — install openshell-vm from release assets, onboard with NEMOCLAW_GATEWAY_BACKEND=vm, verify sandbox creation, live inference through the microVM gateway, resume after openshell-vm kill, and reset to a clean slate.
- New vm-e2e job in nightly-e2e.yaml: runs on ubuntu-latest (has /dev/kvm), installs openshell-vm, executes the VM backend E2E test.
- New vm-backend test suite in brev-e2e.test.ts: allows running the VM backend E2E on ephemeral Brev instances via TEST_SUITE=vm-backend.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The openshell-vm binary is published on the vm-dev pre-release tag (not v0.0.26) with gnu libc (not musl). Also downloads the VM runtime (kernel + rootfs) needed by libkrun, and installs zstd for decompression.

Asset corrections:
- Tag: vm-dev (not v0.0.26)
- Binary: openshell-vm-*-unknown-linux-gnu.tar.gz (not musl)
- Checksums: vm-binary-checksums-sha256.txt
- Runtime: vm-runtime-linux-*.tar.zst → ~/.local/share/openshell-vm/

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
When detectGatewayBackend() returns "vm", the onboard and runtime recovery flows now use openshell-vm instead of Docker:
- startGatewayWithOptions() detects the backend and delegates to startVmGatewayProcess() for the VM path, which spawns openshell-vm as a detached background process with PID tracking
- VM lifecycle helpers: writeVmPidFile, readVmPid, isVmProcessAlive, stopVmGateway (SIGTERM→SIGKILL), isVmGatewayHealthy
- destroyGateway() checks session.gatewayBackend and stops the VM process instead of Docker volume cleanup when the backend is "vm"
- recoverGatewayRuntime() reads session.gatewayBackend to choose the VM vs Docker recovery path
- recoverNamedGatewayRuntime() in nemoclaw.ts skips Docker-specific gateway select commands for the VM backend
- cleanupGatewayAfterLastSandbox() stops the VM process instead of Docker cleanup when the backend is "vm"
- The gateway backend is saved to the onboard session on step completion so resume cycles know which backend to use
- The resume flow checks VM health via isVmGatewayHealthy() instead of Docker gateway state when the session records the VM backend

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
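The SIGTERM→SIGKILL escalation in stopVmGateway amounts to "ask nicely, wait out a grace period, then force-kill". A sketch is below; the kill and liveness callbacks are injected purely so the policy is testable, which is an assumption about structure — the real helper presumably calls process.kill against the tracked PID directly.

```typescript
// Sketch of graceful-then-forceful shutdown. `kill` and `isAlive` stand in
// for process.kill and a PID liveness probe; names are illustrative.
type Signal = "SIGTERM" | "SIGKILL";

async function stopVmGateway(
  pid: number,
  kill: (pid: number, sig: Signal) => void,
  isAlive: (pid: number) => boolean,
  graceMs = 5_000,
  pollMs = 100,
): Promise<Signal> {
  kill(pid, "SIGTERM"); // ask the VM process to shut down cleanly
  const deadline = Date.now() + graceMs;
  while (Date.now() < deadline) {
    if (!isAlive(pid)) return "SIGTERM"; // exited within the grace period
    await new Promise((r) => setTimeout(r, pollMs));
  }
  kill(pid, "SIGKILL"); // grace period expired: force-kill
  return "SIGKILL";
}
```

Returning which signal finally took effect lets callers log whether shutdown was clean.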
Reading the openshell-vm source (NVIDIA/OpenShell crates/openshell-vm)
revealed three bugs in the Phase 2 implementation:
1. openshell-vm was spawned with no arguments. It needs `--name nemoclaw`
so it extracts the rootfs to the correct instance directory and
registers gateway metadata under the right identity.
2. openshell-vm prefixes instance names: gateway_name("nemoclaw")
produces "openshell-vm-nemoclaw". All OPENSHELL_GATEWAY env vars
and `openshell gateway select` calls for the VM path now use
VM_GATEWAY_NAME ("openshell-vm-nemoclaw") instead of GATEWAY_NAME.
3. Health poll was too short (15 × 2s = 30s). The VM boots k3s inside
a microVM with its own 90s internal health check. Increased to
60 × 3s = 180s to avoid racing the inner bootstrap. Also log the
last 10 lines from openshell-vm.log on failure for diagnostics.
Additionally: the VM gateway listens on port 30051 (NodePort), not
8080. The openshell-vm binary handles the port mapping internally
(gvproxy host:30051 → VM:30051 → kube-proxy → pod:8080).
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
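The widened health poll above (60 attempts × 3s instead of 15 × 2s) is a budgeted retry loop. Sketched below with an injected probe; the real check against the gateway on port 30051 and the log tailing on failure live in the onboard code and are not reproduced here.

```typescript
// Sketch of the 180s health-poll budget; `probe` stands in for the real
// health check against the VM gateway on port 30051.
async function waitForVmGatewayHealthy(
  probe: () => Promise<boolean>,
  attempts = 60,      // was 15: too short for the nested k3s bootstrap
  intervalMs = 3_000, // was 2_000; total budget is now 60 x 3s = 180s
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    if (await probe()) return true;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return false; // caller logs the tail of openshell-vm.log on this path
}
```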
The GitHub-hosted ubuntu runner has /dev/kvm (crw-rw---- root:kvm) but the runner user is not in the kvm group. openshell-vm opens /dev/kvm directly via libkrun and fails with EACCES. Fix by chmod 666 /dev/kvm in the KVM verification step. Also add the user to the kvm group for completeness, though the chmod is sufficient for the current process. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…tics

Three changes to address the gvproxy crash on GitHub Actions:
1. Pass --mem 4096 to openshell-vm. The default 8GB is half the 16GB runner memory. With k3s pulling container images inside the VM, host processes (gvproxy, gvisor netstack) get starved. 4GB is enough for a lightweight gateway without GPU workloads.
2. Detect and use the E2E-downloaded VM runtime via OPENSHELL_VM_RUNTIME_DIR. The test script downloads gvproxy/libkrun/libkrunfw from the vm-dev release to ~/.local/share/openshell-vm/ but never tells openshell-vm to use it. The downloaded runtime may contain fixes not in the binary's embedded copy. When the downloaded runtime exists (has gvproxy), set OPENSHELL_VM_RUNTIME_DIR to prefer it.
3. Add a VM diagnostics step to CI: openshell-vm log, gvproxy log, dmesg OOM check, memory stats, and VM console log. This will show the actual root cause if gvproxy crashes again.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The openshell-vm kernel has CONFIG_POSIX_MQUEUE=y but the init script
(openshell-vm-init.sh) never mounts the mqueue filesystem. When k3s
creates a pod, runc tries to mount mqueue at /dev/mqueue inside the
container namespace and gets ENODEV ("no such device") because the
host mount point doesn't exist.
Fix: run `openshell-vm prepare-rootfs` to extract the rootfs, then
patch the init script to mkdir + mount mqueue alongside the existing
devpts/shm mounts. The patch is idempotent — skipped if the init
script already contains /dev/mqueue.
Root cause found by tracing the VM console log:
runc create failed: error mounting "mqueue" to rootfs at
"/dev/mqueue": no such device
This should be fixed upstream in the OpenShell init script.
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
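The idempotent patch described above is essentially a guarded string edit on the init script. A sketch of that transform follows; note the shm-mount anchor line is an assumption about the upstream script's contents, not taken from openshell-vm-init.sh itself.

```typescript
// Sketch of the idempotent mqueue patch. The anchor line is assumed;
// the real openshell-vm-init.sh may mount devpts/shm differently.
const MQUEUE_LINES = "mkdir -p /dev/mqueue\nmount -t mqueue mqueue /dev/mqueue";

function patchInitScriptForMqueue(initScript: string): string {
  // Skip if the script already handles /dev/mqueue (idempotency guard).
  if (initScript.includes("/dev/mqueue")) return initScript;
  const anchor = "mount -t tmpfs shm /dev/shm"; // assumed existing shm mount
  if (!initScript.includes(anchor)) return initScript; // unexpected layout: leave untouched
  return initScript.replace(anchor, anchor + "\n" + MQUEUE_LINES);
}
```

Leaving an unrecognized layout untouched keeps the patch from corrupting a future upstream init script that already restructured its mounts.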
The previous mqueue patch ran silently when prepare-rootfs failed or the init script wasn't found. Add verbose logging at each step so CI output shows exactly what happened: rootfs path, init script location, whether the string replacement matched, and any errors. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The vm-dev kernel (built 2026-04-09) predates the CONFIG_POSIX_MQUEUE addition to the kconfig (2026-04-10 d8cf7951). runc fails with ENODEV when trying to mount -t mqueue inside container namespaces. The previous approach (mounting mqueue in the init script) didn't work because the kernel itself lacks the filesystem type — mount returns ENODEV.

New approach: install a runc wrapper shim that strips mqueue mount entries from the OCI config.json before calling the real runc binary. The shim is only installed when the kernel actually lacks mqueue support (tested by attempting mount -t mqueue). When a future kernel rebuild includes CONFIG_POSIX_MQUEUE, the shim is not installed.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
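The shim's core transform is dropping mqueue entries from the OCI runtime config's mounts array before handing it to runc. The actual shim is a shell script operating on config.json; the TypeScript version below just illustrates the transform on the parsed structure.

```typescript
// Illustrative transform: remove mqueue mount entries from an OCI runtime
// config before it reaches runc. The real shim does this over config.json
// in shell; this only shows what happens to the mounts array.
interface OciMount { destination: string; type?: string; source?: string }
interface OciConfig { mounts?: OciMount[]; [key: string]: unknown }

function stripMqueueMounts(config: OciConfig): OciConfig {
  const mounts = (config.mounts ?? []).filter(
    (m) => m.type !== "mqueue" && m.destination !== "/dev/mqueue",
  );
  return { ...config, mounts }; // leave the input untouched
}
```

Filtering on both `type` and `destination` guards against configs that mount a tmpfs at /dev/mqueue instead of the mqueue filesystem type.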
…pt breakage

The previous approach embedded a heredoc in the init script patch, which broke the init script (the VM died before the exec agent started).

New approach: write runc-wrapper.sh as a separate file in /opt/nemoclaw/ of the rootfs, then patch the init script with a simple test-and-swap block. The wrapper is a standalone script that strips mqueue entries from config.json via sed before calling the real runc binary. The init script patch is minimal — just an if/fi block that copies the wrapper over runc when the kernel lacks mqueue.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…shim

The previous approach of wrapping /usr/bin/runc didn't work because k3s bundles its own runc in /var/lib/rancher/k3s/data/<hash>/bin/, extracted fresh at each startup. Our host-side wrapper was never called.

New approach: write /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl in the init script, which k3s reads before generating its containerd config. The template sets BinaryName to /opt/nemoclaw/runc-shim, which strips mqueue mount entries from config.json before calling the k3s-bundled runc. This routes ALL runc invocations through the shim, correctly handling the k3s data directory extraction timing.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
When the session's gatewayBackend is "vm", the standard `openshell sandbox create --from Dockerfile` path fails because the openshell CLI internally uses Docker put_archive to push the image into the gateway container — which doesn't exist for VM gateways.

Fix: detect the VM backend during sandbox creation and:
1. Build the image locally with Docker (same Dockerfile)
2. Export with docker save to the VM's rootfs via virtio-fs (the host can write directly to the rootfs/tmp/ directory)
3. Import into VM containerd via openshell-vm exec + ctr images import
4. Pass --from <image-ref> instead of --from <Dockerfile> so the openshell CLI treats it as a pre-existing image (skips the Docker push)

Falls back to the standard Dockerfile path if any step fails, so Docker backend behavior is unchanged.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…erlay) Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…g issues Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Two post-sandbox fixes for the VM backend:
1. DNS proxy: setup-dns-proxy.sh uses docker exec to reach kubectl inside the gateway container. The VM backend has no Docker container. Skip the DNS proxy — gvproxy provides NAT networking with working DNS through the gateway IP (192.168.127.1).
2. Sandbox readiness: the sandbox pod may briefly flip Ready→NotReady during init container restarts in the VM. Add a 30s wait loop in setupOpenclaw before attempting sandbox connect, preventing the FailedPrecondition "sandbox is not ready" error.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The vm-dev openshell-sandbox binary was built with cargo-zigbuild on a host with glibc 2.39, but sandbox containers use Ubuntu 22.04 (glibc 2.35). The binary crashes at startup:

GLIBC_2.38 not found (required by /opt/openshell/bin/openshell-sandbox)

Fix: extract the openshell-sandbox binary from the Docker gateway image (ghcr.io/nvidia/openshell/cluster:<version>), which was built with rust:1.88-slim (Debian bookworm, glibc 2.36 — compatible with Ubuntu 22.04's glibc 2.35). Replace the VM rootfs copy before boot.

The supervisor is side-loaded into sandbox pods via a hostPath volume from /opt/openshell/bin on the k3s node, so this fix propagates to all sandbox pods automatically.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
waitForSandboxReady() used `openshell doctor exec -- kubectl` which runs docker exec inside the gateway container. This is Docker-specific and fails silently for VM gateways (no Docker container exists). For VM backend, use `openshell sandbox list` + isSandboxReady() which goes through the gRPC API and works for both Docker and VM gateways. Docker backend behavior is unchanged. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…ateway

When openshell-vm is killed after a successful onboard, the session is marked complete (resumable=false). Running `nemoclaw onboard --resume` correctly reports "No resumable session found" — but the E2E test expects resume to restart the gateway and reconnect the sandbox.

Fix: when --resume finds a completed session with gatewayBackend=vm and the VM gateway is dead, restart openshell-vm and check if the sandbox is still alive. If the sandbox reconnects after the gateway restart, exit 0 (recovery successful). If not, mark the session resumable and fall through to the normal re-onboard path.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
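The resume decision above reduces to a small state check before any recovery work happens. A hypothetical sketch follows; the type and function names are illustrative, and the actual restart-and-probe steps in the CLI are elided.

```typescript
// Hypothetical decision table for `nemoclaw onboard --resume` with a VM
// backend. The real flow also restarts openshell-vm and probes the
// sandbox; only the branching is shown here.
type ResumePlan =
  | "normal-resume"       // session incomplete: continue onboarding
  | "restart-vm-gateway"  // completed VM session, gateway dead: try recovery
  | "nothing-to-resume";  // completed session, gateway healthy

function planResume(session: {
  complete: boolean;
  gatewayBackend: "vm" | "docker";
  vmGatewayAlive: boolean;
}): ResumePlan {
  if (!session.complete) return "normal-resume";
  if (session.gatewayBackend === "vm" && !session.vmGatewayAlive) {
    return "restart-vm-gateway";
  }
  return "nothing-to-resume";
}
```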
…ng resume Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
GPU inference is routed through inference.local (OpenShell L7 proxy → host inference server). The sandbox and gateway never need direct GPU access — the GPU is used by the host-side Ollama/vLLM/NIM process. The VM backend works for all scenarios including local GPU inference. Remove the gpuRequested parameter and the conditional that forced Docker when GPU was detected. VM is now preferred whenever openshell-vm is available, regardless of GPU presence. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Replace the two-container DinD architecture (dockerd sidecar + workspace) with a single container using openshell-vm.

Removes:
- docker:24-dind sidecar container (8Gi RAM, 2 CPU)
- Docker socket shared volume
- Docker storage volume
- Docker config init container (cgroup v2 workaround)
- docker.io apt package install
- Docker daemon wait loop

The workspace container now installs openshell-vm via the upstream install script and sets NEMOCLAW_GATEWAY_BACKEND=vm. Requires /dev/kvm on the k8s node (bare-metal or nested virt enabled).

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
What dropping Docker-in-Docker means in practice

The k8s manifest (…) discussion is collapsed; section headings: Security, Resource usage, Startup time, Operational simplicity, Requirements.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/brev-launchable-ci-cpu.sh (1)
31-31: ⚠️ Potential issue | 🟡 Minor

Stale comment: default version mismatch.

Line 31 documents the default as v0.0.20, but line 43 sets the actual default to v0.0.26. Update the comment to match:

-# OPENSHELL_VERSION — OpenShell CLI release tag (default: v0.0.20)
+# OPENSHELL_VERSION — OpenShell CLI release tag (default: v0.0.26)

Also applies to: 43-43
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/brev-launchable-ci-cpu.sh` at line 31, Update the stale comment that documents the default OpenShell CLI release tag so it matches the actual default; change the commented line referencing "OPENSHELL_VERSION — OpenShell CLI release tag (default: v0.0.20)" to reflect the current default v0.0.26 used where OPENSHELL_VERSION is set, ensuring the comment and the OPENSHELL_VERSION default value are consistent.
🧹 Nitpick comments (6)
test/e2e/test-vm-backend-e2e.sh (3)
34-34: Consider adding `set -e` for fail-fast behavior.

The script uses `set -uo pipefail` but omits `-e` (errexit). This means commands can fail silently without stopping the test. While the explicit pass/fail tracking handles some cases, intermediate commands (like `cd`, `tar`, `install`) could fail and the script would continue, potentially producing misleading results.

If intentional (to allow the test framework to track each failure), consider documenting why `-e` is omitted.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/test-vm-backend-e2e.sh` at line 34, Update the shell options line that currently reads "set -uo pipefail" to include "-e" so the script uses "set -euo pipefail" to enable fail-fast behavior (or, if omission is intentional, add an explicit comment above the "set -uo pipefail" line explaining why "-e" is omitted and how failures are tracked); modify the single-line "set -uo pipefail" occurrence in test/e2e/test-vm-backend-e2e.sh (the shell options line) accordingly to ensure intermediate command failures stop the script or are clearly documented.
291-295: `apt-get` usage assumes Debian/Ubuntu.

While the script targets GitHub ubuntu-latest runners, users running locally on other Linux distributions (Fedora, Arch) would see failures. Consider adding a fallback or clearer error message.

Suggested defensive check:
 # zstd may not be installed — install it if needed
 if ! command -v zstd >/dev/null 2>&1; then
   info "Installing zstd for runtime decompression..."
-  sudo apt-get update -qq && sudo apt-get install -y -qq zstd >/dev/null 2>&1
+  if command -v apt-get >/dev/null 2>&1; then
+    sudo apt-get update -qq && sudo apt-get install -y -qq zstd >/dev/null 2>&1
+  elif command -v dnf >/dev/null 2>&1; then
+    sudo dnf install -y -q zstd >/dev/null 2>&1
+  else
+    fail "zstd not found and no supported package manager available"
+    exit 1
+  fi
 fi

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/test-vm-backend-e2e.sh` around lines 291 - 295, The current zstd install block (the if ! command -v zstd conditional) assumes apt-get is available; replace it with a defensive install routine that first tries apt-get, then falls back to common package managers (dnf, yum, pacman) and finally prints a clear error/usage message if none are present; update the block around the "Installing zstd for runtime decompression..." message to attempt each package manager in order, and on failure exit with an informative message telling the user to install zstd manually.
488-495: The `pkill -f "openshell-vm"` pattern may be overly broad.

This pattern matches any process with "openshell-vm" anywhere in its command line, which could unintentionally kill unrelated processes in shared environments. Consider using the PID file that onboard.ts writes (~/.nemoclaw/openshell-vm.pid) for targeted termination.

Suggested alternative using PID file:

 info "Killing openshell-vm process to simulate crash..."
-# Find and kill any openshell-vm gateway process
-if pkill -f "openshell-vm" 2>/dev/null; then
+# Kill via PID file for targeted termination
+VM_PID_FILE="$HOME/.nemoclaw/openshell-vm.pid"
+if [ -f "$VM_PID_FILE" ] && kill -0 "$(cat "$VM_PID_FILE")" 2>/dev/null; then
+  kill "$(cat "$VM_PID_FILE")" 2>/dev/null
   pass "openshell-vm process killed"
   sleep 3
 else
   info "No openshell-vm process found to kill (may already be stopped)"
 fi

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/test-vm-backend-e2e.sh` around lines 488 - 495, Replace the broad pkill usage with a PID-file based shutdown: read the PID from ~/.nemoclaw/openshell-vm.pid if it exists, verify the PID corresponds to a running process whose command line contains "openshell-vm" (to avoid killing unrelated PIDs), send a targeted kill to that PID and remove the pidfile; if the pidfile is missing or verification fails, fall back to the existing pkill -f "openshell-vm" behavior as a last resort and log appropriate info messages. This change should be applied where pkill -f "openshell-vm" is used in the test/e2e/test-vm-backend-e2e.sh script.

src/lib/onboard.ts (3)
2141-2235: Complex rootfs patching logic is well-documented but could benefit from extraction.

The mqueue fix (lines 2141-2235) is a significant workaround with detailed comments explaining the kernel limitation. While the documentation is excellent, this ~95-line block could be extracted into a separate helper function (e.g., patchVmRootfsForMqueue()) to improve readability and testability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 2141 - 2235, Extract the mqueue workaround block into a new helper function named patchVmRootfsForMqueue(rootfsDir, vmEnv) and call it where the current block lives; move all logic that creates shimDir/shimPath, writes the runc-shim, patches the init script (referencing initScript, shimDir, shimPath, and the K3S_ARGS replacement) and the related fs operations and logging into that function so the main flow around spawnSync(prepResult) stays concise; ensure the helper receives rootfsDir (from prepOutput) and returns success/failure or throws so existing console logs (e.g., "Patched VM init..." / "Init script not found...") remain consistent and tests can target patchVmRootfsForMqueue independently.
2815-2849: Import script approach is clever but embeds shell code in TypeScript.

The import script (lines 2817-2833) written to the rootfs is a reasonable workaround for the exec agent limitations. However, the shell script embedded as a JavaScript array is hard to maintain and test.

Consider extracting this to a separate .sh file that gets copied to the rootfs, similar to how other scripts are handled in the scripts/ directory.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 2815 - 2849, The embedded shell script written to importScript makes maintenance hard; move the script text into a standalone file (e.g., scripts/import-image.sh) and have onboard.ts copy that file into the VM rootfs instead of constructing the array inline. Update the code around importScript, fs.writeFileSync, and fs.chmodSync to locate the new script (use path.join(__dirname, "..", "scripts", "import-image.sh") or equivalent), copy it into rootfs (fs.copyFileSync or streaming copy), then set executable bits and keep the existing run(...) and runCapture(...) calls that reference /opt/nemoclaw/import-image.sh; ensure any template values (if needed) are either left static or replaced before copying and adjust tests accordingly.
2245-2265: Two functions named getInstalledOpenshellVersion exist with different signatures — consider renaming for clarity.

Line 2246 correctly calls the imported version from openshell.ts with signature (binary: string, opts: CaptureOpenshellOptions). However, a local version at line 393 exists with signature (versionOutput = null) that serves a different purpose (parsing pre-captured output). While the code uses the correct function, having two identically-named functions with different signatures and purposes invites confusion during maintenance.
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 2245 - 2265, There are two different functions named getInstalledOpenshellVersion: one imported from "./openshell" (signature (binary: string, opts: CaptureOpenshellOptions)) and a local helper that parses pre-captured output (signature (versionOutput = null)); rename the local parser to something unambiguous like parseOpenshellVersionOutput (or parseCapturedOpenshellVersion), update all internal call sites to that new name, and leave the imported getInstalledOpenshellVersion unchanged so callers like the gatewayImage construction continue to use the correct function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/platform.ts`:
- Around line 123-135: The detectGatewayBackend function currently defaults
opts.vmAvailable to false which causes dual-capable hosts to prefer "docker"
silently; update detectGatewayBackend to either (A) perform VM detection when
opts.vmAvailable is undefined (mirror docker detection by calling an appropriate
VM detection helper, e.g., detectVMHost or similar, and set vmAvailable based on
its result) or (B) make opts.vmAvailable required and throw a clear error if
it's not provided; ensure you update the vmAvailable initialization (and any
callers) instead of hard-coding false and keep the dockerAvailable logic that
calls detectDockerHost unchanged.
In `@src/nemoclaw.ts`:
- Around line 523-543: The VM branch constructs vmGatewayName but calls
getNamedGatewayLifecycleState() with no name; update both lifecycle checks to
call getNamedGatewayLifecycleState(vmGatewayName) so the code inspects the VM
gateway (before and after), keep runOpenshell(["gateway","select",
vmGatewayName]) and setting process.env.OPENSHELL_GATEWAY as-is, and leave
startGatewayForRecovery() behavior unchanged unless it also needs a gateway name
in future.
In `@test/openshell-vm.test.ts`:
- Around line 129-164: The VM tests are reading the real PID file and process
table causing flakiness; update the helpers or tests so they don't touch real
system state: modify readVmPid to accept an optional pidFilePath (or add an
injectable function) and make isVmGatewayHealthy accept a liveness-check
function (or inject isVmProcessAlive), then in the tests call these helpers with
a temp/test PID file and a mocked liveness probe, or alternatively mock fs
(e.g., fs.readFileSync/existsSync) and process checks used by
isVmProcessAlive/isVmGatewayHealthy; adjust tests to create controlled PID
contents and to stub process liveness so assertions are deterministic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: accc1d5d-64cf-43eb-8220-b2e5190b92f1
📒 Files selected for processing (18)
- .agents/skills/nemoclaw-user-reference/references/troubleshooting.md
- .github/workflows/nightly-e2e.yaml
- docs/reference/troubleshooting.md
- k8s/nemoclaw-k8s.yaml
- nemoclaw-blueprint/blueprint.yaml
- schemas/blueprint.schema.json
- scripts/brev-launchable-ci-cpu.sh
- scripts/install-openshell.sh
- src/lib/onboard-session.ts
- src/lib/onboard.ts
- src/lib/openshell.ts
- src/lib/platform.ts
- src/nemoclaw.ts
- test/e2e/brev-e2e.test.ts
- test/e2e/test-sandbox-survival.sh
- test/e2e/test-vm-backend-e2e.sh
- test/openshell-vm.test.ts
- test/platform.test.ts
function detectGatewayBackend(opts = {}) {
  const env = opts.env ?? process.env;
  const override = env.NEMOCLAW_GATEWAY_BACKEND;
  if (override === "vm" || override === "docker") return override;

  const vmAvailable =
    typeof opts.vmAvailable === "boolean"
      ? opts.vmAvailable
      : false; // caller should pass actual detection result
  const dockerAvailable =
    typeof opts.dockerAvailable === "boolean"
      ? opts.dockerAvailable
      : detectDockerHost(opts) !== null;
Don't hard-code the default VM availability to false.
When callers omit opts.vmAvailable, this helper still resolves a dual-capable host to "docker", so the new default silently stops preferring VM. Either detect VM availability here or make vmAvailable a required input for this API.
💡 One way to keep the default preference correct
```diff
+const { isOpenshellVmAvailable } = require("./openshell");
+
 function detectGatewayBackend(opts = {}) {
   const env = opts.env ?? process.env;
   const override = env.NEMOCLAW_GATEWAY_BACKEND;
   if (override === "vm" || override === "docker") return override;
   const vmAvailable =
     typeof opts.vmAvailable === "boolean"
       ? opts.vmAvailable
-      : false; // caller should pass actual detection result
+      : isOpenshellVmAvailable();
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```ts
const { isOpenshellVmAvailable } = require("./openshell");

function detectGatewayBackend(opts = {}) {
  const env = opts.env ?? process.env;
  const override = env.NEMOCLAW_GATEWAY_BACKEND;
  if (override === "vm" || override === "docker") return override;
  const vmAvailable =
    typeof opts.vmAvailable === "boolean"
      ? opts.vmAvailable
      : isOpenshellVmAvailable();
  const dockerAvailable =
    typeof opts.dockerAvailable === "boolean"
      ? opts.dockerAvailable
      : detectDockerHost(opts) !== null;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/platform.ts` around lines 123 - 135, The detectGatewayBackend
function currently defaults opts.vmAvailable to false which causes dual-capable
hosts to prefer "docker" silently; update detectGatewayBackend to either (A)
perform VM detection when opts.vmAvailable is undefined (mirror docker detection
by calling an appropriate VM detection helper, e.g., detectVMHost or similar,
and set vmAvailable based on its result) or (B) make opts.vmAvailable required
and throw a clear error if it's not provided; ensure you update the vmAvailable
initialization (and any callers) instead of hard-coding false and keep the
dockerAvailable logic that calls detectDockerHost unchanged.
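Option (A) above — detecting VM availability when the caller omits it — can be sketched as a pure function with injectable probes, which also keeps unit tests hermetic. The probe names, defaults, and signature here are assumptions for illustration, not the repo's actual API:

```typescript
// Sketch only: probes are injectable so tests stay deterministic, while
// production callers fall back to real detection by default (assumed helpers).
type BackendOpts = {
  env?: Record<string, string | undefined>;
  vmAvailable?: boolean;     // explicit override, e.g. from tests
  dockerAvailable?: boolean; // explicit override, e.g. from tests
};

const defaultProbes = {
  vm: () => false,     // stand-in for "is the openshell-vm binary on PATH?"
  docker: () => false, // stand-in for "is the Docker socket reachable?"
};

function detectGatewayBackend(
  opts: BackendOpts = {},
  probes = defaultProbes,
): "vm" | "docker" {
  const env = opts.env ?? process.env;
  const override = env.NEMOCLAW_GATEWAY_BACKEND;
  if (override === "vm" || override === "docker") return override;
  const vmAvailable = opts.vmAvailable ?? probes.vm();
  const dockerAvailable = opts.dockerAvailable ?? probes.docker();
  if (vmAvailable) return "vm"; // prefer VM on dual-capable hosts
  if (dockerAvailable) return "docker";
  return "docker"; // final fallback mirrors the PR's Docker default
}

console.log(detectGatewayBackend({ env: { NEMOCLAW_GATEWAY_BACKEND: "docker" } })); // "docker"
console.log(detectGatewayBackend({ env: {}, vmAvailable: true, dockerAvailable: true }));  // "vm"
console.log(detectGatewayBackend({ env: {}, vmAvailable: false, dockerAvailable: true })); // "docker"
```

With this shape the dual-capable host keeps its VM preference by default, and tests pass explicit booleans instead of touching the real host.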
| if (session?.gatewayBackend === "vm") { | ||
| // openshell-vm registers as "openshell-vm-nemoclaw", not "nemoclaw" | ||
| const vmGatewayName = `openshell-vm-${NEMOCLAW_GATEWAY_NAME}`; | ||
| runOpenshell(["gateway", "select", vmGatewayName], { ignoreError: true }); | ||
| const before = getNamedGatewayLifecycleState(); | ||
| if (before.state === "healthy_named") { | ||
| process.env.OPENSHELL_GATEWAY = vmGatewayName; | ||
| return { recovered: true, before, after: before, attempted: false }; | ||
| } | ||
| try { | ||
| await startGatewayForRecovery(); | ||
| } catch { | ||
| /* fall through */ | ||
| } | ||
| runOpenshell(["gateway", "select", vmGatewayName], { ignoreError: true }); | ||
| const after = getNamedGatewayLifecycleState(); | ||
| if (after.state === "healthy_named") { | ||
| process.env.OPENSHELL_GATEWAY = vmGatewayName; | ||
| return { recovered: true, before, after, attempted: true, via: "start" }; | ||
| } | ||
| return { recovered: false, before, after, attempted: true }; |
Parameterize the lifecycle check with the VM gateway name.
This branch selects openshell-vm-nemoclaw, but getNamedGatewayLifecycleState() still queries and compares against "nemoclaw". As written, the VM path can never observe healthy_named, so recovery falls through as failed even when the VM gateway is already up.
🛠️ Suggested shape of the fix
```diff
-function getNamedGatewayLifecycleState() {
+function getNamedGatewayLifecycleState(gatewayName = NEMOCLAW_GATEWAY_NAME) {
   const status = captureOpenshell(["status"]);
-  const gatewayInfo = captureOpenshell(["gateway", "info", "-g", "nemoclaw"]);
+  const gatewayInfo = captureOpenshell(["gateway", "info", "-g", gatewayName]);
   const cleanStatus = stripAnsi(status.output);
   const activeGateway = getActiveGatewayName(status.output);
   const connected = /^\s*Status:\s*Connected\b/im.test(cleanStatus);
-  const named = hasNamedGateway(gatewayInfo.output);
+  const named = stripAnsi(gatewayInfo.output).includes(`Gateway: ${gatewayName}`);
   const refusing = /Connection refused|client error \(Connect\)|tcp connect error/i.test(
     cleanStatus,
   );
-  if (connected && activeGateway === "nemoclaw" && named) {
+  if (connected && activeGateway === gatewayName && named) {
     return { state: "healthy_named", status: status.output, gatewayInfo: gatewayInfo.output };
   }
   ...
 }
```

And in this branch:

```diff
-const before = getNamedGatewayLifecycleState();
+const before = getNamedGatewayLifecycleState(vmGatewayName);
 ...
-const after = getNamedGatewayLifecycleState();
+const after = getNamedGatewayLifecycleState(vmGatewayName);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/nemoclaw.ts` around lines 523 - 543, The VM branch constructs
vmGatewayName but calls getNamedGatewayLifecycleState() with no name; update
both lifecycle checks to call getNamedGatewayLifecycleState(vmGatewayName) so
the code inspects the VM gateway (before and after), keep
runOpenshell(["gateway","select", vmGatewayName]) and setting
process.env.OPENSHELL_GATEWAY as-is, and leave startGatewayForRecovery()
behavior unchanged unless it also needs a gateway name in future.
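The parameterized check is easy to exercise in isolation. A hedged sketch — the output formats and `stripAnsi` below are simplified stand-ins for the real helpers, not the repo's actual parsing:

```typescript
// Simplified sketch: health means "connected AND the active gateway matches the
// requested name AND the named gateway exists in `gateway info` output".
function stripAnsi(s: string): string {
  return s.replace(/\x1b\[[0-9;]*m/g, "");
}

function isHealthyNamed(
  statusOutput: string,
  gatewayInfoOutput: string,
  gatewayName: string,
): boolean {
  const clean = stripAnsi(statusOutput);
  const connected = /^\s*Status:\s*Connected\b/im.test(clean);
  const activeMatch = clean.match(/^\s*Gateway:\s*(\S+)/im);
  const activeGateway = activeMatch ? activeMatch[1] : null;
  const named = stripAnsi(gatewayInfoOutput).includes(`Gateway: ${gatewayName}`);
  return connected && activeGateway === gatewayName && named;
}

const status = "Status: Connected\nGateway: openshell-vm-nemoclaw";
const info = "Gateway: openshell-vm-nemoclaw";
console.log(isHealthyNamed(status, info, "openshell-vm-nemoclaw")); // true
console.log(isHealthyNamed(status, info, "nemoclaw"));              // false
```

This makes the bug observable: with the hard-coded `"nemoclaw"` the VM path can never report healthy, because the active gateway is `openshell-vm-nemoclaw`.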
```ts
describe("readVmPid", () => {
  it("returns null when PID file does not exist", () => {
    // readVmPid reads from the real path (~/.nemoclaw/openshell-vm.pid)
    // but we can test the isVmProcessAlive helper with known PIDs
    expect(readVmPid()).toSatisfy((v) => v === null || typeof v === "number");
  });
});

describe("isVmProcessAlive", () => {
  it("returns false for null PID", () => {
    expect(isVmProcessAlive(null)).toBe(false);
  });

  it("returns false for PID 0", () => {
    expect(isVmProcessAlive(0)).toBe(false);
  });

  it("returns false for negative PID", () => {
    expect(isVmProcessAlive(-1)).toBe(false);
  });

  it("returns true for current process PID", () => {
    expect(isVmProcessAlive(process.pid)).toBe(true);
  });

  it("returns false for non-existent PID", () => {
    // Use a very high PID that's unlikely to exist
    expect(isVmProcessAlive(4_000_000)).toBe(false);
  });
});

describe("isVmGatewayHealthy", () => {
  it("returns false when no VM process is running", () => {
    // With no PID file, there's no VM process to check
    expect(isVmGatewayHealthy()).toBe(false);
  });
```
Make these VM lifecycle tests hermetic.
readVmPid() and isVmGatewayHealthy() here read the real ~/.nemoclaw/openshell-vm.pid and process table, so the assertions change based on whatever is already running on the host. That makes this unit suite flaky on dev machines and shared CI workers. Inject the pid-file path / liveness probe into the helpers, or mock fs and the process checks for these cases instead. As per coding guidelines: test/**/*.test.{js,ts}: Mock external dependencies in unit tests; do not call real NVIDIA APIs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/openshell-vm.test.ts` around lines 129 - 164, The VM tests are reading
the real PID file and process table causing flakiness; update the helpers or
tests so they don't touch real system state: modify readVmPid to accept an
optional pidFilePath (or add an injectable function) and make isVmGatewayHealthy
accept a liveness-check function (or inject isVmProcessAlive), then in the tests
call these helpers with a temp/test PID file and a mocked liveness probe, or
alternatively mock fs (e.g., fs.readFileSync/existsSync) and process checks used
by isVmProcessAlive/isVmGatewayHealthy; adjust tests to create controlled PID
contents and to stub process liveness so assertions are deterministic.
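One way to make the suite hermetic, per the prompt above: thread the PID-file path and the liveness probe through as parameters, then drive tests from a temp directory and a stub. The signatures below are hypothetical injectable variants, not the current exports:

```typescript
import { mkdtempSync, writeFileSync, rmSync, readFileSync, existsSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical injectable version: the pid-file path is an argument,
// so tests never touch ~/.nemoclaw/openshell-vm.pid.
function readVmPid(pidFile: string): number | null {
  if (!existsSync(pidFile)) return null;
  const pid = Number.parseInt(readFileSync(pidFile, "utf8").trim(), 10);
  return Number.isFinite(pid) && pid > 0 ? pid : null;
}

// Hypothetical injectable version: the liveness probe is an argument,
// so tests never inspect the real process table.
function isVmGatewayHealthy(
  pidFile: string,
  isAlive: (pid: number | null) => boolean,
): boolean {
  return isAlive(readVmPid(pidFile));
}

// Deterministic exercise driven entirely by a temp dir and a stubbed probe:
const dir = mkdtempSync(join(tmpdir(), "vm-test-"));
const pidFile = join(dir, "openshell-vm.pid");

console.log(readVmPid(pidFile));                      // null (file absent)
writeFileSync(pidFile, "1234\n");
console.log(readVmPid(pidFile));                      // 1234
console.log(isVmGatewayHealthy(pidFile, () => true)); // true (probe stubbed)
rmSync(dir, { recursive: true, force: true });
```

In vitest the same effect can be had without changing signatures by mocking `fs` and the process check, but injection keeps the helpers trivially testable on any host.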
… version field Refactor test/e2e/brev-e2e.test.ts for issue #1390: - Extract named helpers from the ~350-line monolithic beforeAll: cleanupLeftoverInstance(), createBrevInstance(), refreshAndWaitForSsh(), bootstrapLaunchable(), pollForSandboxReady(), writeManualRegistry(). The beforeAll now reads as a high-level orchestration (~50 lines). - Deduplicate brev create + error recovery: the deploy-cli and launchable paths shared duplicated brev refresh + waitForSsh patterns, now consolidated into refreshAndWaitForSsh(). - Remove phantom version: 1 from the manual registry write. The SandboxRegistry interface in src/lib/registry.ts has no version field; registerSandbox() doesn't write one either. - bootstrapLaunchable() returns { remoteDir, needsOnboard } instead of mutating module-level state as a hidden side-effect. - instanceCreated is set at call sites in beforeAll, not hidden inside createBrevInstance(). - Remove dead sleep() helper (defined but never called). Pure refactoring — no behavior changes. // @ts-nocheck pragma preserved. Note: PR #1791 (openshell-vm microVM) also touches this file — whoever merges second will need a rebase.
…nstaller detection (#1888) ## Summary Addresses all three items from #1390, plus a bonus installer fix discovered while running the pre-push hooks. ### Issue #1390 — Brev E2E cleanup - **Extract helpers from the monolithic beforeAll**: The ~350-line `beforeAll` block is now ~50 lines of high-level orchestration calling named helpers: `cleanupLeftoverInstance()`, `createBrevInstance()`, `refreshAndWaitForSsh()`, `bootstrapLaunchable()`, `pollForSandboxReady()`, `writeManualRegistry()`. - **Deduplicate brev create + error recovery**: The deploy-cli and launchable paths shared duplicated `brev refresh` + `waitForSsh` patterns, now consolidated into `refreshAndWaitForSsh()`. - **Remove phantom `version: 1`** from the manual registry write. The `SandboxRegistry` interface in `src/lib/registry.ts` has no version field; `registerSandbox()` doesn't write one either. Additional cleanup: - `bootstrapLaunchable()` returns `{ remoteDir, needsOnboard }` instead of mutating module-level state as a hidden side-effect. - `instanceCreated` is set at call sites in `beforeAll`, not hidden inside `createBrevInstance()`. - Dead `sleep()` helper removed. Pure refactoring — no behavior changes. `// @ts-nocheck` pragma preserved. ### Installer worktree fix `scripts/install.sh` used `-d "${repo_root}/.git"` to detect source checkouts, but in a git worktree `.git` is a file, not a directory. This caused `is_source_checkout()` to return false, falling through to the GitHub clone path. Fixed by using `-e` (exists) instead of `-d` (is directory). This resolved 12 pre-existing test failures in `test/install-preflight.test.ts`. ## Related Issue Closes #1390 ## Note PR #1791 also touches `test/e2e/brev-e2e.test.ts` (appends a new `vm-backend` test case). Clean merge — whoever merges second rebases trivially. ## Type of Change - [x] Code change for a new feature, bug fix, or refactor. ## Testing - [x] `npx prek run --all-files` passes (all pre-commit and pre-push hooks green). 
- [x] `npx vitest run --project cli` — 1213 passed, 0 failed (was 12 failed before installer fix). <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Enhanced end-to-end test infrastructure and setup orchestration for improved test reliability. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Aaron Erickson <aerickson@nvidia.com>
Summary
Replaces the Docker-based gateway with openshell-vm (libkrun microVM) as the default gateway backend. Eliminates Docker-in-Docker from the k8s deployment and removes Docker as a runtime dependency for the gateway. Docker is still used on the host as a build tool for the sandbox image.
E2E status: 30/30 passing on GitHub Actions ubuntu-latest with KVM.
Why microVM instead of Docker
A single process (`openshell-vm`) replaces `dockerd` + k3s-in-Docker. The k8s manifest went from 122 lines (2 containers, 3 volumes, init container) to 56 lines (1 container, no volumes).

GPU inference works without Docker

GPU inference is routed through `inference.local` (OpenShell's L7 proxy → host-side Ollama/vLLM/NIM). The sandbox never touches the GPU directly — it makes HTTP requests through the gateway's inference route. The microVM backend works for all scenarios, including local GPU inference.

What's implemented

- `platform.ts`: `detectGatewayBackend()` prefers VM when available and falls back to Docker; `NEMOCLAW_GATEWAY_BACKEND` env override.
- `onboard.ts`: spawns `openshell-vm --name nemoclaw --mem 4096` as a detached process, with PID tracking, health polling via gRPC, and SIGTERM/SIGKILL cleanup.
- `gatewayBackend` field stored in the onboard session for resume/recovery.
- `ctr images import` via the exec agent → `openshell sandbox create --from <image-ref>`.
- `--resume` detects a dead VM gateway, restarts it, and waits for the sandbox + inference route.

Where Docker is still used
Docker is used on the host as a build tool — `docker build` to create the sandbox image and `docker save` to export it. This is not Docker-in-Docker; it's the same as running `docker build` in any CI pipeline. Future work could replace this with podman/buildah or building inside the VM's containerd.

Workarounds for current vm-dev release (discussion needed)

We're patching the openshell-vm rootfs at onboard time to work around issues in the current `vm-dev` release. Each patch is designed to be self-disabling — it detects whether the issue exists before applying, so when the issues are fixed upstream, our patches become no-ops.

Here's what each one does and why, in plain language:

1. Kernel missing mqueue support (runc shim)
1. Kernel missing mqueue support (runc shim)
What happens without the fix: Every container inside the VM fails to start. You get
error mounting "mqueue" to rootfs: no such devicein the kubelet logs. No pods run — not even CoreDNS — so the entire k3s cluster is dead.Why it happens: When a container starts, the container runtime (runc) tries to set up a special filesystem called
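The wrapper's one edit can be pictured as a pure transformation over the OCI runtime config's `mounts` array (field names follow the OCI runtime spec; this TypeScript shape is a sketch of the idea, not the shipped wrapper script):

```typescript
// Sketch: replace any mqueue mount in an OCI runtime config with tmpfs,
// leaving every other mount untouched — which is all the wrapper needs to do.
type OciMount = {
  destination: string;
  type: string;
  source: string;
  options?: string[];
};

function patchMqueueMounts(mounts: OciMount[]): OciMount[] {
  return mounts.map((m) =>
    m.type === "mqueue"
      ? { ...m, type: "tmpfs", source: "tmpfs", options: ["nosuid", "noexec", "nodev"] }
      : m,
  );
}

const patched = patchMqueueMounts([
  { destination: "/dev/mqueue", type: "mqueue", source: "mqueue" },
  { destination: "/proc", type: "proc", source: "proc" },
]);
console.log(patched[0].type); // "tmpfs"
console.log(patched[1].type); // "proc"
```

Because the destination (`/dev/mqueue`) is preserved, software that merely stats the path still works; only actual POSIX message-queue semantics are lost.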
mqueueinside the container. This requires the Linux kernel to have "POSIX message queue" support compiled in. Based on what we can see in the repo, the kernel config (CONFIG_POSIX_MQUEUE=y) appears to have been added to the kconfig file after the vm-dev release binary was built. The kernel in the current release appears to not have this feature compiled in.What we do: Before starting the VM, we write a containerd config file that tells k3s to use our wrapper script instead of the real runc. The wrapper does one thing: before creating a container, it edits the container's configuration to request a plain tmpfs filesystem instead of mqueue. The container starts fine because tmpfs always works. Almost no software actually uses mqueue — it's a POSIX inter-process communication mechanism that OpenClaw doesn't need.
How it self-disables: The init script tests
mount -t mqueueat boot. If the kernel supports mqueue, the test succeeds and the wrapper is never installed.Upstream fix: Rebuild the vm-dev release from current main (which has the kernel config fix at commit
d8cf7951).2. Supervisor binary glibc mismatch
What happens without the fix: The sandbox container starts but immediately crashes in a loop. The error is `GLIBC_2.38 not found (required by /opt/openshell/bin/openshell-sandbox)`.

Why it happens: Every sandbox pod runs a supervisor process (`openshell-sandbox`) that manages the agent inside the container. This binary isn't part of the sandbox image — OpenShell injects it from the gateway's filesystem into the pod via a Kubernetes volume mount. In the VM gateway, this binary appears to have been cross-compiled in a way that linked it against glibc 2.39 (a newer C library version than the sandbox container provides). But the sandbox container is based on Ubuntu 22.04, which has glibc 2.35. The binary can't run because it needs a newer C library than the container provides.

In the Docker gateway, the equivalent binary appears to be built against a compatible glibc version, which is why the Docker path works.
What we do: At onboard time, we pull the Docker gateway image (`ghcr.io/nvidia/openshell/cluster:0.0.26`), extract the compatible `openshell-sandbox` binary from it, and replace the VM rootfs copy. This takes ~7 seconds and happens once.

How it self-disables: We only replace the binary if the Docker gateway image is available (Docker is installed). If the VM release is rebuilt with a compatible binary, this step overwrites with an equivalent binary — harmless.

Upstream fix: Build `openshell-sandbox` for the VM rootfs targeting a glibc version compatible with the sandbox base image.

3. DNS proxy skip
docker execto reach kubectl inside the gateway container. With the microVM backend, there is no Docker container. We skip the DNS proxy — the VM uses gvproxy for networking, which provides NAT with working DNS out of the box. This isn't really a workaround — it's the correct behavior for the VM networking architecture.4. Sandbox readiness check via gRPC instead of docker exec
Why:
waitForSandboxReady()usedopenshell doctor exec -- kubectlwhich runsdocker execinternally. For the VM backend, we useopenshell sandbox listvia the gRPC API instead. This also isn't really a workaround — it's using the right API for the backend.Files changed (18 files, +1600/-100)
src/lib/onboard.tssrc/lib/platform.tsdetectGatewayBackend()(VM preferred, no GPU constraint)src/lib/openshell.tsisOpenshellVmAvailable(),getInstalledOpenshellVmVersion()src/lib/onboard-session.tsgatewayBackendsession fieldsrc/nemoclaw.tsnemoclaw-blueprint/blueprint.yamlgateway_backends: [docker, vm], min version bumpk8s/nemoclaw-k8s.yaml.github/workflows/nightly-e2e.yamltest/e2e/test-vm-backend-e2e.shtest/openshell-vm.test.tstest/platform.test.tsdetectGatewayBackend()E2E test phases (all passing)
Test plan
npm test— 1386 tests passed, 0 failuresKnown gaps
- `cloud-experimental-e2e`: fails on this branch AND on main — pre-existing, unrelated.
- `docker build` + `docker save` run on the host. Not DinD, but still a Docker dependency. Could be replaced with podman/buildah or in-VM builds.

Refs: NVIDIA/OpenShell#611