
feat(vm): openshell-vm microVM gateway backend (full lifecycle, E2E 30/30) #1791

Draft
ericksoa wants to merge 31 commits into main from feat/openshell-vm-backend

Conversation

@ericksoa (Contributor) commented Apr 11, 2026

Summary

Replaces the Docker-based gateway with openshell-vm (libkrun microVM) as the default gateway backend. Eliminates Docker-in-Docker from the k8s deployment and removes Docker as a runtime dependency for the gateway. Docker is still used on the host as a build tool for the sandbox image.

E2E status: 30/30 passing on GitHub Actions ubuntu-latest with KVM.

Why microVM instead of Docker

  • Security: Sandboxes run inside hardware-isolated microVMs (KVM), not Docker containers sharing the host kernel. No privileged Docker daemon, no shared Docker socket.
  • Simplicity: One process (openshell-vm) replaces dockerd + k3s-in-Docker. The k8s manifest went from 122 lines (2 containers, 3 volumes, init container) to 56 lines (1 container, no volumes).
  • Resources: ~8Gi less memory (no Docker daemon sidecar). VM boots in ~6s vs 60s Docker daemon startup.
  • No Docker-in-Docker: The old k8s deployment ran a Docker daemon inside a k8s pod. That's gone entirely.

GPU inference works without Docker

GPU inference is routed through inference.local (OpenShell's L7 proxy → host-side Ollama/vLLM/NIM). The sandbox never touches the GPU directly — it makes HTTP requests through the gateway's inference route. The microVM backend works for all scenarios including local GPU inference.

What's implemented

  • Backend detection (platform.ts): detectGatewayBackend() prefers VM when available, falls back to Docker. NEMOCLAW_GATEWAY_BACKEND env override.
  • Gateway lifecycle (onboard.ts): spawn openshell-vm --name nemoclaw --mem 4096 as detached process, PID tracking, health polling via gRPC, SIGTERM/SIGKILL cleanup
  • Session persistence: gatewayBackend field stored in onboard session for resume/recovery
  • Sandbox image import: Docker build → docker save → write to VM rootfs via virtio-fs → ctr images import via exec agent → openshell sandbox create --from <image-ref>
  • Resume/recovery: --resume detects dead VM gateway, restarts it, waits for sandbox + inference route
  • K8s manifest: Dropped DinD, single container with openshell-vm
  • CI workflow: KVM access, VM diagnostics on failure
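The detection flow in the first bullet can be sketched in shell. This is an illustrative reimplementation, not the actual TypeScript in platform.ts; the function name below is ours:

```shell
#!/bin/sh
# Illustrative sketch of the backend-selection rules described above.
# The real implementation is detectGatewayBackend() in src/lib/platform.ts.
detect_gateway_backend() {
  # Explicit override always wins.
  if [ -n "${NEMOCLAW_GATEWAY_BACKEND:-}" ]; then
    echo "$NEMOCLAW_GATEWAY_BACKEND"
    return 0
  fi
  # Prefer the microVM backend whenever openshell-vm is installed.
  if command -v openshell-vm >/dev/null 2>&1; then
    echo "vm"
  else
    echo "docker"
  fi
}

detect_gateway_backend
```

The override short-circuits before any probing, which is what makes the env var usable in CI to force a specific path.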

Where Docker is still used

Docker is used on the host as a build tool — docker build to create the sandbox image and docker save to export it. This is not Docker-in-Docker; it's the same as running docker build in any CI pipeline. Future work could replace this with podman/buildah or building inside the VM's containerd.
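The build-tool usage amounts to a two-command pipeline. A dry-run sketch (the image tag and output path are placeholders, not NemoClaw's actual values):

```shell
# Dry-run sketch of the host-side build/export step described above.
# DRY_RUN defaults to echoing the commands; unset it to execute for real.
DRY_RUN="${DRY_RUN:-echo}"

build_and_export_sandbox_image() {
  tag="$1"   # placeholder tag, e.g. nemoclaw-sandbox:dev
  out="$2"   # tarball later imported into the VM's containerd
  $DRY_RUN docker build -t "$tag" . &&
  $DRY_RUN docker save -o "$out" "$tag"
}

build_and_export_sandbox_image nemoclaw-sandbox:dev /tmp/sandbox-image.tar
```

Both commands run on the host daemon only, which is why this is ordinary CI-style Docker usage rather than DinD.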


Workarounds for current vm-dev release (discussion needed)

We're patching the openshell-vm rootfs at onboard time to work around issues in the current vm-dev release. Each patch is designed to be self-disabling — it detects whether the issue exists before applying, so when the issues are fixed upstream, our patches become no-ops.

Here's what each one does and why, in plain language:

1. Kernel missing mqueue support (runc shim)

What happens without the fix: Every container inside the VM fails to start; the kubelet logs show error mounting "mqueue" to rootfs: no such device. No pods run — not even CoreDNS — so the entire k3s cluster is dead.

Why it happens: When a container starts, the container runtime (runc) tries to set up a special filesystem called mqueue inside the container. This requires the Linux kernel to have "POSIX message queue" support compiled in. Based on what we can see in the repo, the kernel config (CONFIG_POSIX_MQUEUE=y) appears to have been added to the kconfig file after the vm-dev release binary was built, so the kernel in the current release appears not to have this feature compiled in.

What we do: Before starting the VM, we write a containerd config file that tells k3s to use our wrapper script instead of the real runc. The wrapper does one thing: before creating a container, it edits the container's configuration to request a plain tmpfs filesystem instead of mqueue. The container starts fine because tmpfs always works. Almost no software actually uses mqueue — it's a POSIX inter-process communication mechanism that OpenClaw doesn't need.
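The wrapper's edit can be illustrated with a few lines of sed over an OCI config.json. The function name and exact sed patterns are our sketch, not the shipped wrapper script:

```shell
# Sketch of the mqueue → tmpfs rewrite the wrapper performs before calling
# the real runc. An OCI bundle's config.json declares mounts such as:
#   {"destination": "/dev/mqueue", "type": "mqueue", "source": "mqueue"}
# Rewriting type and source to tmpfs yields a mount that always succeeds.
rewrite_mqueue_to_tmpfs() {
  config="$1"  # path to the bundle's config.json
  sed -i \
    -e 's/"type": *"mqueue"/"type": "tmpfs"/g' \
    -e 's/"source": *"mqueue"/"source": "tmpfs"/g' \
    "$config"
}
```

The destination stays /dev/mqueue, so software that merely expects the path to exist keeps working; only the backing filesystem type changes.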

How it self-disables: The init script tests mount -t mqueue at boot. If the kernel supports mqueue, the test succeeds and the wrapper is never installed.
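An equivalent, unprivileged way to probe kernel mqueue support (an illustration only; the init script's actual probe is the trial mount described above, which needs root):

```shell
# Check whether the running kernel registers the mqueue filesystem type.
# /proc/filesystems lists every type the kernel can mount, so this works
# without root, unlike a trial `mount -t mqueue`.
kernel_has_mqueue() {
  grep -qw mqueue /proc/filesystems
}

if kernel_has_mqueue; then
  echo "kernel has mqueue: wrapper would not be installed"
else
  echo "kernel lacks mqueue: wrapper would be installed"
fi
```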

Upstream fix: Rebuild the vm-dev release from current main (which has the kernel config fix at commit d8cf7951).

2. Supervisor binary glibc mismatch

What happens without the fix: The sandbox container starts but immediately crashes in a loop. The error is GLIBC_2.38 not found (required by /opt/openshell/bin/openshell-sandbox).

Why it happens: Every sandbox pod runs a supervisor process (openshell-sandbox) that manages the agent inside the container. This binary isn't part of the sandbox image; OpenShell injects it from the gateway's filesystem into the pod via a Kubernetes volume mount. In the VM gateway, the binary appears to have been cross-compiled in a way that linked it against glibc 2.39, but the sandbox container is based on Ubuntu 22.04, which ships glibc 2.35. The binary can't run because it needs a newer C library than the container provides.

In the Docker gateway, the equivalent binary appears to be built against a compatible glibc version, which is why the Docker path works.

What we do: At onboard time, we pull the Docker gateway image (ghcr.io/nvidia/openshell/cluster:0.0.26), extract the compatible openshell-sandbox binary from it, and replace the VM rootfs copy. This takes ~7 seconds and happens once.
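The extraction step can be sketched as a dry-run. The container name is a placeholder; the binary path matches the error message quoted above:

```shell
# Dry-run sketch of pulling the compatible supervisor binary out of the
# Docker gateway image and dropping it into the VM rootfs.
DRY_RUN="${DRY_RUN:-echo}"

replace_supervisor_binary() {
  image="$1"   # e.g. ghcr.io/nvidia/openshell/cluster:0.0.26
  dest="$2"    # openshell-sandbox location inside the VM rootfs
  $DRY_RUN docker create --name supervisor-src "$image" &&
  $DRY_RUN docker cp supervisor-src:/opt/openshell/bin/openshell-sandbox "$dest" &&
  $DRY_RUN docker rm supervisor-src
}
```

docker create avoids actually running the gateway image; the container exists only long enough to copy one file out.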

How it self-disables: We only replace the binary if the Docker gateway image is available (Docker is installed). If the VM release is rebuilt with a compatible binary, this step overwrites with an equivalent binary — harmless.

Upstream fix: Build openshell-sandbox for the VM rootfs targeting a glibc version compatible with the sandbox base image.

3. DNS proxy skip

Why: NemoClaw's DNS proxy setup script uses docker exec to reach kubectl inside the gateway container. With the microVM backend, there is no Docker container. We skip the DNS proxy — the VM uses gvproxy for networking, which provides NAT with working DNS out of the box. This isn't really a workaround — it's the correct behavior for the VM networking architecture.

4. Sandbox readiness check via gRPC instead of docker exec

Why: waitForSandboxReady() used openshell doctor exec -- kubectl which runs docker exec internally. For the VM backend, we use openshell sandbox list via the gRPC API instead. This also isn't really a workaround — it's using the right API for the backend.
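The readiness check reduces to a generic poll loop. A sketch with illustrative attempt/interval defaults (the real predicate is openshell sandbox list over the gRPC API):

```shell
# Generic poll-until-ready loop of the shape waitForSandboxReady() uses
# for the VM path. Any command can serve as the readiness predicate.
wait_until_ready() {
  cmd="$1"
  attempts="${2:-60}"   # illustrative defaults
  interval="${3:-3}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if sh -c "$cmd" >/dev/null 2>&1; then
      return 0          # predicate succeeded: ready
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1              # gave up after attempts × interval seconds
}
```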


Files changed (18 files, +1600/-100)

| File | Purpose |
| --- | --- |
| src/lib/onboard.ts | VM gateway lifecycle, image import, rootfs patches, resume recovery |
| src/lib/platform.ts | detectGatewayBackend() (VM preferred, no GPU constraint) |
| src/lib/openshell.ts | isOpenshellVmAvailable(), getInstalledOpenshellVmVersion() |
| src/lib/onboard-session.ts | gatewayBackend session field |
| src/nemoclaw.ts | VM-aware cleanup, recovery |
| nemoclaw-blueprint/blueprint.yaml | gateway_backends: [docker, vm], min version bump |
| k8s/nemoclaw-k8s.yaml | Replaced DinD with single-container openshell-vm |
| .github/workflows/nightly-e2e.yaml | vm-e2e job, KVM access, diagnostics |
| test/e2e/test-vm-backend-e2e.sh | Full E2E: install → onboard → inference → resume → reset |
| test/openshell-vm.test.ts | Unit tests for VM detection, session, lifecycle |
| test/platform.test.ts | Unit tests for detectGatewayBackend() |

E2E test phases (all passing)

| Phase | Description | Status |
| --- | --- | --- |
| 0 | Prerequisites (Linux, KVM, API key, network) | PASS |
| 1 | Install openshell-vm binary + runtime | PASS |
| 2 | Install NemoClaw with VM backend | PASS |
| 3 | Post-install verification (registry, list, status, no Docker container) | PASS |
| 4 | Live inference through VM sandbox (PONG test) | PASS |
| 5 | Resume after openshell-vm kill | PASS |
| 6 | Reset — destroy and clean-slate re-onboard | PASS |
| 7 | Final cleanup | PASS |

Test plan

  • npm test — 1386 tests passed, 0 failures
  • Pre-commit and pre-push hooks pass
  • vm-e2e: 30/30 on GitHub Actions ubuntu-latest
  • cloud-e2e: passes (Docker fallback path not regressed)
  • sandbox-survival-e2e: passes
  • hermes-e2e: passes
  • messaging-providers-e2e: passes
  • skip-permissions-e2e: passes
  • Squash commits before merge
  • Manual testing on Mac with openshell-vm (macOS + Hypervisor.framework)

Known gaps

  • Commit squashing: Branch has 30 iterative commits. Needs squash before merge.
  • cloud-experimental-e2e: Fails on this branch AND on main — pre-existing, unrelated.
  • Mac testing: E2E runs on Linux only. openshell-vm supports macOS ARM64 via Hypervisor.framework but hasn't been tested with this NemoClaw integration yet.
  • Host Docker still needed for sandbox image build: docker build + docker save run on the host. Not DinD, but still a Docker dependency. Could be replaced with podman/buildah or in-VM builds.

Refs: NVIDIA/OpenShell#611

Summary by CodeRabbit

  • New Features

    • Added microVM gateway backend support as an alternative to Docker, enabling CPU-only sandbox environments
    • Introduced automatic gateway backend detection for flexible deployment configuration
  • Documentation

    • Updated troubleshooting guides to reflect OpenShell version 0.0.26+ requirements
  • Tests

    • Added end-to-end tests validating microVM backend workflow including sandbox lifecycle operations

OpenShell v0.0.26 ships openshell-vm, a standalone binary that boots a
hardware-isolated microVM via libkrun. This commit lays the groundwork
for NemoClaw to support both the existing Docker/k3s gateway and the new
microVM gateway as selectable backends.

Phase 1 changes:
- Bump min_openshell_version from 0.0.24 to 0.0.26 across blueprint,
  install script, onboard preflight, CI scripts, E2E tests, and docs
- Add gateway_backends field to blueprint.yaml schema (docker, vm)
- Add isOpenshellVmAvailable() and getInstalledOpenshellVmVersion() to
  openshell.ts for detecting the openshell-vm binary
- Add detectGatewayBackend() to platform.ts with NEMOCLAW_GATEWAY_BACKEND
  env var override, auto-detection preferring VM when available and
  falling back to Docker, and mandatory Docker for GPU workloads
- Add gatewayBackend field to onboard session schema for persisting the
  selected backend across resume cycles
- Add tests for all new functions

The VM backend requires no Docker daemon and provides faster boot, but
has no NVIDIA GPU passthrough (libkrun lacks PCI/VFIO support), so the
Docker backend remains mandatory for local inference on GPU workstations.

Refs: NVIDIA/OpenShell#611
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
coderabbitai bot commented Apr 11, 2026

Review skipped: draft detected. To trigger a single review, invoke the @coderabbitai review command.

@github-actions

🚀 Docs preview ready!

https://NVIDIA.github.io/NemoClaw/pr-preview/pr-1791/

ericksoa added 27 commits April 11, 2026 14:15
Phase 4 of the dual-backend feature (PR #1791):

- New test/e2e/test-vm-backend-e2e.sh: full E2E journey for the VM
  backend — install openshell-vm from release assets, onboard with
  NEMOCLAW_GATEWAY_BACKEND=vm, verify sandbox creation, live inference
  through the microVM gateway, resume after openshell-vm kill, and
  reset to clean slate.

- New vm-e2e job in nightly-e2e.yaml: runs on ubuntu-latest (has
  /dev/kvm), installs openshell-vm, executes the VM backend E2E test.

- New vm-backend test suite in brev-e2e.test.ts: allows running the
  VM backend E2E on ephemeral Brev instances via TEST_SUITE=vm-backend.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The openshell-vm binary is published on the vm-dev pre-release tag
(not v0.0.26) with gnu libc (not musl). Also downloads the VM runtime
(kernel + rootfs) needed by libkrun, and installs zstd for
decompression.

Asset corrections:
- Tag: vm-dev (not v0.0.26)
- Binary: openshell-vm-*-unknown-linux-gnu.tar.gz (not musl)
- Checksums: vm-binary-checksums-sha256.txt
- Runtime: vm-runtime-linux-*.tar.zst → ~/.local/share/openshell-vm/

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
When detectGatewayBackend() returns "vm", the onboard and runtime
recovery flows now use openshell-vm instead of Docker:

- startGatewayWithOptions() detects the backend and delegates to
  startVmGatewayProcess() for the VM path, which spawns openshell-vm
  as a detached background process with PID tracking
- VM lifecycle helpers: writeVmPidFile, readVmPid, isVmProcessAlive,
  stopVmGateway (SIGTERM→SIGKILL), isVmGatewayHealthy
- destroyGateway() checks session.gatewayBackend and stops the VM
  process instead of Docker volume cleanup when backend is "vm"
- recoverGatewayRuntime() reads session.gatewayBackend to choose
  VM vs Docker recovery path
- recoverNamedGatewayRuntime() in nemoclaw.ts skips Docker-specific
  gateway select commands for VM backend
- cleanupGatewayAfterLastSandbox() stops VM process instead of
  Docker cleanup when backend is "vm"
- Gateway backend is saved to onboard session on step completion
  so resume cycles know which backend to use
- Resume flow checks VM health via isVmGatewayHealthy() instead of
  Docker gateway state when session records VM backend

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Reading the openshell-vm source (NVIDIA/OpenShell crates/openshell-vm)
revealed three bugs in the Phase 2 implementation:

1. openshell-vm was spawned with no arguments. It needs `--name nemoclaw`
   so it extracts the rootfs to the correct instance directory and
   registers gateway metadata under the right identity.

2. openshell-vm prefixes instance names: gateway_name("nemoclaw")
   produces "openshell-vm-nemoclaw". All OPENSHELL_GATEWAY env vars
   and `openshell gateway select` calls for the VM path now use
   VM_GATEWAY_NAME ("openshell-vm-nemoclaw") instead of GATEWAY_NAME.

3. Health poll was too short (15 × 2s = 30s). The VM boots k3s inside
   a microVM with its own 90s internal health check. Increased to
   60 × 3s = 180s to avoid racing the inner bootstrap. Also log the
   last 10 lines from openshell-vm.log on failure for diagnostics.

Additionally: the VM gateway listens on port 30051 (NodePort), not
8080. The openshell-vm binary handles the port mapping internally
(gvproxy host:30051 → VM:30051 → kube-proxy → pod:8080).

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The GitHub-hosted ubuntu runner has /dev/kvm (crw-rw---- root:kvm) but
the runner user is not in the kvm group. openshell-vm opens /dev/kvm
directly via libkrun and fails with EACCES.

Fix by chmod 666 /dev/kvm in the KVM verification step. Also add the
user to the kvm group for completeness, though the chmod is sufficient
for the current process.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…tics

Three changes to address gvproxy crash on GitHub Actions:

1. Pass --mem 4096 to openshell-vm. The default 8GB is half the 16GB
   runner memory. With k3s pulling container images inside the VM, host
   processes (gvproxy, gvisor netstack) get starved. 4GB is enough for
   a lightweight gateway without GPU workloads.

2. Detect and use the E2E-downloaded VM runtime via OPENSHELL_VM_RUNTIME_DIR.
   The test script downloads gvproxy/libkrun/libkrunfw from the vm-dev
   release to ~/.local/share/openshell-vm/ but never tells openshell-vm
   to use it. The downloaded runtime may contain fixes not in the binary's
   embedded copy. When the downloaded runtime exists (has gvproxy), set
   OPENSHELL_VM_RUNTIME_DIR to prefer it.

3. Add VM diagnostics step to CI: openshell-vm log, gvproxy log, dmesg
   OOM check, memory stats, and VM console log. This will show the
   actual root cause if gvproxy crashes again.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The openshell-vm kernel has CONFIG_POSIX_MQUEUE=y but the init script
(openshell-vm-init.sh) never mounts the mqueue filesystem. When k3s
creates a pod, runc tries to mount mqueue at /dev/mqueue inside the
container namespace and gets ENODEV ("no such device") because the
host mount point doesn't exist.

Fix: run `openshell-vm prepare-rootfs` to extract the rootfs, then
patch the init script to mkdir + mount mqueue alongside the existing
devpts/shm mounts. The patch is idempotent — skipped if the init
script already contains /dev/mqueue.

Root cause found by tracing the VM console log:
  runc create failed: error mounting "mqueue" to rootfs at
  "/dev/mqueue": no such device

This should be fixed upstream in the OpenShell init script.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The previous mqueue patch ran silently when prepare-rootfs failed or
the init script wasn't found. Add verbose logging at each step so CI
output shows exactly what happened: rootfs path, init script location,
whether the string replacement matched, and any errors.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The vm-dev kernel (built 2026-04-09) predates the CONFIG_POSIX_MQUEUE
addition to the kconfig (2026-04-10 d8cf7951). runc fails with ENODEV
when trying to mount -t mqueue inside container namespaces.

Previous approach (mounting mqueue in init script) didn't work because
the kernel itself lacks the filesystem type — mount returns ENODEV.

New approach: install a runc wrapper shim that strips mqueue mount
entries from the OCI config.json before calling the real runc binary.
The shim is only installed when the kernel actually lacks mqueue
support (tested by attempting mount -t mqueue). When a future kernel
rebuild includes CONFIG_POSIX_MQUEUE, the shim is not installed.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…pt breakage

The previous approach embedded a heredoc in the init script patch,
which broke the init script (VM died before exec agent started).

New approach: write runc-wrapper.sh as a separate file in
/opt/nemoclaw/ of the rootfs, then patch the init script with a
simple test-and-swap block. The wrapper is a standalone script that
strips mqueue entries from config.json via sed before calling the
real runc binary. The init script patch is minimal — just an if/fi
block that copies the wrapper over runc when the kernel lacks mqueue.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…shim

Previous approach of wrapping /usr/bin/runc didn't work because k3s
bundles its own runc in /var/lib/rancher/k3s/data/<hash>/bin/, extracted
fresh at each startup. Our host-side wrapper was never called.

New approach: write /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
in the init script, which k3s reads before generating its containerd config.
The template sets BinaryName to /opt/nemoclaw/runc-shim, which strips
mqueue mount entries from config.json before calling the k3s-bundled runc.

This routes ALL runc invocations through the shim, correctly handling the
k3s data directory extraction timing.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
When the session's gatewayBackend is "vm", the standard `openshell
sandbox create --from Dockerfile` path fails because the openshell CLI
internally uses Docker put_archive to push the image into the gateway
container — which doesn't exist for VM gateways.

Fix: detect VM backend during sandbox creation and:
1. Build the image locally with Docker (same Dockerfile)
2. Export with docker save to the VM's rootfs via virtio-fs
   (host can write directly to the rootfs/tmp/ directory)
3. Import into VM containerd via openshell-vm exec + ctr images import
4. Pass --from <image-ref> instead of --from <Dockerfile> so the
   openshell CLI treats it as a pre-existing image (skips Docker push)

Falls back to the standard Dockerfile path if any step fails, so Docker
backend behavior is unchanged.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…erlay)

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…g issues

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Two post-sandbox fixes for VM backend:

1. DNS proxy: setup-dns-proxy.sh uses docker exec to reach kubectl
   inside the gateway container. VM backend has no Docker container.
   Skip the DNS proxy — gvproxy provides NAT networking with working
   DNS through the gateway IP (192.168.127.1).

2. Sandbox readiness: the sandbox pod may briefly flip Ready→NotReady
   during init container restarts in the VM. Add a 30s wait loop in
   setupOpenclaw before attempting sandbox connect, preventing the
   FailedPrecondition "sandbox is not ready" error.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The vm-dev openshell-sandbox binary was built with cargo-zigbuild on
a host with glibc 2.39, but sandbox containers use Ubuntu 22.04
(glibc 2.35). The binary crashes at startup:
  GLIBC_2.38 not found (required by /opt/openshell/bin/openshell-sandbox)

Fix: extract the openshell-sandbox binary from the Docker gateway
image (ghcr.io/nvidia/openshell/cluster:<version>), which was built
with rust:1.88-slim (Debian bookworm, glibc 2.36 — compatible with
Ubuntu 22.04's glibc 2.35). Replace the VM rootfs copy before boot.

The supervisor is side-loaded into sandbox pods via hostPath volume
from /opt/openshell/bin on the k3s node, so this fix propagates to
all sandbox pods automatically.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
waitForSandboxReady() used `openshell doctor exec -- kubectl` which
runs docker exec inside the gateway container. This is Docker-specific
and fails silently for VM gateways (no Docker container exists).

For VM backend, use `openshell sandbox list` + isSandboxReady() which
goes through the gRPC API and works for both Docker and VM gateways.
Docker backend behavior is unchanged.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…ateway

When openshell-vm is killed after a successful onboard, the session is
marked complete (resumable=false). Running `nemoclaw onboard --resume`
correctly reports "No resumable session found" — but the E2E test
expects resume to restart the gateway and reconnect the sandbox.

Fix: when --resume finds a completed session with gatewayBackend=vm
and the VM gateway is dead, restart openshell-vm and check if the
sandbox is still alive. If the sandbox reconnects after gateway
restart, exit 0 (recovery successful). If not, mark the session
resumable and fall through to the normal re-onboard path.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…ng resume

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
GPU inference is routed through inference.local (OpenShell L7 proxy →
host inference server). The sandbox and gateway never need direct GPU
access — the GPU is used by the host-side Ollama/vLLM/NIM process.
The VM backend works for all scenarios including local GPU inference.

Remove the gpuRequested parameter and the conditional that forced
Docker when GPU was detected. VM is now preferred whenever openshell-vm
is available, regardless of GPU presence.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa ericksoa changed the title feat(vm): dual-backend foundation for openshell-vm microVM gateway feat(vm): openshell-vm microVM gateway backend (full lifecycle, E2E 30/30) Apr 12, 2026
Replace the two-container DinD architecture (dockerd sidecar + workspace)
with a single container using openshell-vm. Removes:

- docker:24-dind sidecar container (8Gi RAM, 2 CPU)
- Docker socket shared volume
- Docker storage volume
- Docker config init container (cgroup v2 workaround)
- docker.io apt package install
- Docker daemon wait loop

The workspace container now installs openshell-vm via the upstream
install script and sets NEMOCLAW_GATEWAY_BACKEND=vm. Requires /dev/kvm
on the k8s node (bare-metal or nested virt enabled).

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa (Contributor, Author) commented:
What dropping Docker-in-Docker means in practice

The k8s manifest (k8s/nemoclaw-k8s.yaml) went from a two-container DinD architecture to a single container using openshell-vm. Here's what changes:

Security

  • No more privileged Docker daemon. The DinD sidecar ran dockerd with privileged: true, which gives the container full root access to the host kernel — it can load kernel modules, access all devices, and escape the container. The VM backend still needs privileged: true for /dev/kvm, but KVM is a much narrower capability than a full Docker daemon. The attack surface shrinks from "arbitrary container execution engine" to "hardware virtualization API."
  • No Docker socket sharing. The old manifest shared /var/run/docker.sock between containers via an emptyDir volume. Any process with access to the Docker socket can run arbitrary containers on the host. That shared volume is gone.
  • Stronger isolation. Sandboxes run inside a hardware-isolated microVM (KVM + libkrun), not inside Docker containers sharing the host kernel. A sandbox escape would need to break out of the VM's hardware boundary, not just a Linux namespace.

Resource usage

  • ~8Gi less memory requested. The DinD sidecar requested 8Gi RAM + 2 CPU just for the Docker daemon. The VM backend uses 4Gi for the microVM (configurable via --mem) and doesn't need a separate daemon process. Total pod request drops from 12Gi to 8Gi.
  • One fewer container. No Docker daemon to start, health-check, or restart. Fewer moving parts means fewer failure modes.
  • No Docker image layer storage. The DinD sidecar used an emptyDir for /var/lib/docker which could grow unbounded as images were pulled and built. The VM uses a fixed-size state disk (32 GiB sparse file, allocated on demand).
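The "allocated on demand" behavior in the last bullet comes from sparse files; a quick demonstration (sizes are illustrative):

```shell
# A sparse file has a large apparent size but consumes blocks only as data
# is written, which is why the VM state disk doesn't grow unbounded the way
# the Docker layer store could.
disk=$(mktemp)
truncate -s 1G "$disk"                               # apparent size: 1 GiB
echo "apparent: $(du --apparent-size -k "$disk" | cut -f1) KiB"
echo "actual:   $(du -k "$disk" | cut -f1) KiB"      # near zero on disk
rm -f "$disk"
```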

Startup time

  • No Docker daemon wait. The old flow spent up to 60 seconds waiting for dockerd to become ready (docker info poll loop). The VM boots in ~6 seconds to first health check.
  • No Docker image pulls inside Docker. DinD needed to pull the OpenShell cluster image inside the Docker daemon, which was slow (no layer cache on first run). The VM has the k3s cluster embedded in its rootfs.

Operational simplicity

  • 122 lines → 56 lines. The manifest is less than half the size.
  • No init container. The cgroup v2 Docker daemon config hack (init-docker-config) is gone.
  • No shared volumes. Three emptyDir volumes (docker-storage, docker-socket, docker-config) are gone.
  • Easier debugging. One container, one process. kubectl logs nemoclaw gives you everything.

Requirements

  • KVM on the node. The k8s node must have /dev/kvm available. Bare-metal nodes and VMs with nested virtualization enabled have this. Standard cloud VMs on GKE/EKS may not — check your node pool configuration.
  • Docker still needed on the host for building the sandbox image (docker build + docker save). This is a build-time dependency, not a runtime one. Future work could eliminate this by building inside the VM's containerd or using podman.
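A minimal preflight for the KVM requirement above (a sketch; NemoClaw's actual checks live in its install and onboard scripts):

```shell
# Check that /dev/kvm exists and is usable by the current user — the two
# failure modes called out for cloud nodes and CI runners respectively.
check_kvm() {
  if [ ! -e /dev/kvm ]; then
    echo "no /dev/kvm: use bare metal or enable nested virtualization" >&2
    return 1
  fi
  if [ ! -r /dev/kvm ] || [ ! -w /dev/kvm ]; then
    echo "/dev/kvm present but inaccessible: add user to the kvm group" >&2
    return 1
  fi
  echo "KVM available"
}
```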

coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/brev-launchable-ci-cpu.sh (1)

31-31: ⚠️ Potential issue | 🟡 Minor

Stale comment: default version mismatch.

Line 31 documents the default as v0.0.20, but line 43 sets the actual default to v0.0.26. Update the comment to match:

-#   OPENSHELL_VERSION     — OpenShell CLI release tag (default: v0.0.20)
+#   OPENSHELL_VERSION     — OpenShell CLI release tag (default: v0.0.26)

Also applies to: 43-43

🧹 Nitpick comments (6)
test/e2e/test-vm-backend-e2e.sh (3)

34-34: Consider adding set -e for fail-fast behavior.

The script uses set -uo pipefail but omits -e (errexit). This means commands can fail silently without stopping the test. While the explicit pass/fail tracking handles some cases, intermediate commands (like cd, tar, install) could fail and the script would continue, potentially producing misleading results.

If intentional (to allow the test framework to track each failure), consider documenting why -e is omitted.


291-295: apt-get usage assumes Debian/Ubuntu.

While the script targets GitHub ubuntu-latest runners, users running locally on other Linux distributions (Fedora, Arch) would see failures. Consider adding a fallback or clearer error message.

Suggested defensive check
 # zstd may not be installed — install it if needed
 if ! command -v zstd >/dev/null 2>&1; then
   info "Installing zstd for runtime decompression..."
-  sudo apt-get update -qq && sudo apt-get install -y -qq zstd >/dev/null 2>&1
+  if command -v apt-get >/dev/null 2>&1; then
+    sudo apt-get update -qq && sudo apt-get install -y -qq zstd >/dev/null 2>&1
+  elif command -v dnf >/dev/null 2>&1; then
+    sudo dnf install -y -q zstd >/dev/null 2>&1
+  else
+    fail "zstd not found and no supported package manager available"
+    exit 1
+  fi
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-vm-backend-e2e.sh` around lines 291 - 295, The current zstd
install block (the if ! command -v zstd conditional) assumes apt-get is
available; replace it with a defensive install routine that first tries apt-get,
then falls back to common package managers (dnf, yum, pacman) and finally prints
a clear error/usage message if none are present; update the block around the
"Installing zstd for runtime decompression..." message to attempt each package
manager in order, and on failure exit with an informative message telling the
user to install zstd manually.

488-495: The pkill -f "openshell-vm" pattern may be overly broad.

This pattern matches any process with "openshell-vm" anywhere in its command line, which could unintentionally kill unrelated processes in shared environments. Consider using the PID file that onboard.ts writes (~/.nemoclaw/openshell-vm.pid) for targeted termination.

Suggested alternative using PID file
 info "Killing openshell-vm process to simulate crash..."
-# Find and kill any openshell-vm gateway process
-if pkill -f "openshell-vm" 2>/dev/null; then
+# Kill via PID file for targeted termination
+VM_PID_FILE="$HOME/.nemoclaw/openshell-vm.pid"
+if [ -f "$VM_PID_FILE" ] && kill -0 "$(cat "$VM_PID_FILE")" 2>/dev/null; then
+  kill "$(cat "$VM_PID_FILE")" 2>/dev/null
   pass "openshell-vm process killed"
   sleep 3
 else
   info "No openshell-vm process found to kill (may already be stopped)"
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-vm-backend-e2e.sh` around lines 488 - 495, Replace the broad
pkill usage with a PID-file based shutdown: read the PID from
~/.nemoclaw/openshell-vm.pid if it exists, verify the PID corresponds to a
running process whose command line contains "openshell-vm" (to avoid killing
unrelated PIDs), send a targeted kill to that PID and remove the pidfile; if the
pidfile is missing or verification fails, fall back to the existing pkill -f
"openshell-vm" behavior as a last resort and log appropriate info messages. This
change should be applied where pkill -f "openshell-vm" is used in the
test/e2e/test-vm-backend-e2e.sh script.
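The PID-file approach can be sketched as a small helper (hypothetical names; Linux-specific, since it reads /proc to verify the command line before killing):

```shell
# Hypothetical helper: kill the recorded PID only when it is alive and its
# command line matches the expected pattern, so unrelated PIDs are spared.
kill_by_pidfile() {
  pid_file="$1"; pattern="$2"
  [ -f "$pid_file" ] || { echo "no pidfile"; return 1; }
  pid="$(cat "$pid_file")"
  if kill -0 "$pid" 2>/dev/null \
     && tr '\0' ' ' <"/proc/$pid/cmdline" 2>/dev/null | grep -q "$pattern"; then
    kill "$pid" && rm -f "$pid_file"
    echo "killed $pid"
  else
    echo "skipped: $pid not running or cmdline mismatch"
    return 1
  fi
}
```

In the e2e script this would replace the pkill call, with the broad `pkill -f "openshell-vm"` kept only as a last-resort fallback.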
src/lib/onboard.ts (3)

2141-2235: Complex rootfs patching logic is well-documented but could benefit from extraction.

The mqueue fix (lines 2141-2235) is a significant workaround with detailed comments explaining the kernel limitation. While the documentation is excellent, this ~95-line block could be extracted into a separate helper function (e.g., patchVmRootfsForMqueue()) to improve readability and testability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2141 - 2235, Extract the mqueue workaround
block into a new helper function named patchVmRootfsForMqueue(rootfsDir, vmEnv)
and call it where the current block lives; move all logic that creates
shimDir/shimPath, writes the runc-shim, patches the init script (referencing
initScript, shimDir, shimPath, and the K3S_ARGS replacement) and the related fs
operations and logging into that function so the main flow around
spawnSync(prepResult) stays concise; ensure the helper receives rootfsDir (from
prepOutput) and returns success/failure or throws so existing console logs
(e.g., "Patched VM init..." / "Init script not found...") remain consistent and
tests can target patchVmRootfsForMqueue independently.

2815-2849: Import script approach is clever but embeds shell code in TypeScript.

The import script (lines 2817-2833) written to the rootfs is a reasonable workaround for the exec agent limitations. However, the shell script embedded as a JavaScript array is hard to maintain and test.

Consider extracting this to a separate .sh file that gets copied to the rootfs, similar to how other scripts are handled in the scripts/ directory.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2815 - 2849, The embedded shell script
written to importScript makes maintenance hard; move the script text into a
standalone file (e.g., scripts/import-image.sh) and have onboard.ts copy that
file into the VM rootfs instead of constructing the array inline. Update the
code around importScript, fs.writeFileSync, and fs.chmodSync to locate the new
script (use path.join(__dirname, "..", "scripts", "import-image.sh") or
equivalent), copy it into rootfs (fs.copyFileSync or streaming copy), then set
executable bits and keep the existing run(...) and runCapture(...) calls that
reference /opt/nemoclaw/import-image.sh; ensure any template values (if needed)
are either left static or replaced before copying and adjust tests accordingly.

2245-2265: Two functions named getInstalledOpenshellVersion exist with different signatures—consider renaming for clarity.

Line 2246 correctly calls the imported version from openshell.ts with signature (binary: string, opts: CaptureOpenshellOptions). However, a local version at line 393 exists with signature (versionOutput = null) that serves a different purpose (parsing pre-captured output). While the code uses the correct function, having two identically-named functions with different signatures and purposes invites confusion during maintenance.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2245 - 2265, There are two different
functions named getInstalledOpenshellVersion: one imported from "./openshell"
(signature (binary: string, opts: CaptureOpenshellOptions)) and a local helper
that parses pre-captured output (signature (versionOutput = null)); rename the
local parser to something unambiguous like parseOpenshellVersionOutput (or
parseCapturedOpenshellVersion), update all internal call sites to that new name,
and leave the imported getInstalledOpenshellVersion unchanged so callers like
the gatewayImage construction continue to use the correct function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.


Outside diff comments:
In `@scripts/brev-launchable-ci-cpu.sh`:
- Line 31: Update the stale comment that documents the default OpenShell CLI
release tag so it matches the actual default; change the commented line
referencing "OPENSHELL_VERSION — OpenShell CLI release tag (default: v0.0.20)"
to reflect the current default v0.0.26 used where OPENSHELL_VERSION is set,
ensuring the comment and the OPENSHELL_VERSION default value are consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: accc1d5d-64cf-43eb-8220-b2e5190b92f1

📥 Commits

Reviewing files that changed from the base of the PR and between fcb5fd6 and 5f2db18.

📒 Files selected for processing (18)
  • .agents/skills/nemoclaw-user-reference/references/troubleshooting.md
  • .github/workflows/nightly-e2e.yaml
  • docs/reference/troubleshooting.md
  • k8s/nemoclaw-k8s.yaml
  • nemoclaw-blueprint/blueprint.yaml
  • schemas/blueprint.schema.json
  • scripts/brev-launchable-ci-cpu.sh
  • scripts/install-openshell.sh
  • src/lib/onboard-session.ts
  • src/lib/onboard.ts
  • src/lib/openshell.ts
  • src/lib/platform.ts
  • src/nemoclaw.ts
  • test/e2e/brev-e2e.test.ts
  • test/e2e/test-sandbox-survival.sh
  • test/e2e/test-vm-backend-e2e.sh
  • test/openshell-vm.test.ts
  • test/platform.test.ts

Comment thread src/lib/platform.ts
Comment on lines +123 to +135
function detectGatewayBackend(opts = {}) {
  const env = opts.env ?? process.env;
  const override = env.NEMOCLAW_GATEWAY_BACKEND;
  if (override === "vm" || override === "docker") return override;

  const vmAvailable =
    typeof opts.vmAvailable === "boolean"
      ? opts.vmAvailable
      : false; // caller should pass actual detection result
  const dockerAvailable =
    typeof opts.dockerAvailable === "boolean"
      ? opts.dockerAvailable
      : detectDockerHost(opts) !== null;

⚠️ Potential issue | 🟠 Major

Don't hard-code the default VM availability to false.

When callers omit opts.vmAvailable, this helper still resolves a dual-capable host to "docker", so the new default silently stops preferring VM. Either detect VM availability here or make vmAvailable a required input for this API.

💡 One way to keep the default preference correct
+const { isOpenshellVmAvailable } = require("./openshell");
+
 function detectGatewayBackend(opts = {}) {
   const env = opts.env ?? process.env;
   const override = env.NEMOCLAW_GATEWAY_BACKEND;
   if (override === "vm" || override === "docker") return override;
 
   const vmAvailable =
     typeof opts.vmAvailable === "boolean"
       ? opts.vmAvailable
-      : false; // caller should pass actual detection result
+      : isOpenshellVmAvailable();
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
function detectGatewayBackend(opts = {}) {
  const env = opts.env ?? process.env;
  const override = env.NEMOCLAW_GATEWAY_BACKEND;
  if (override === "vm" || override === "docker") return override;
  const vmAvailable =
    typeof opts.vmAvailable === "boolean"
      ? opts.vmAvailable
      : false; // caller should pass actual detection result
  const dockerAvailable =
    typeof opts.dockerAvailable === "boolean"
      ? opts.dockerAvailable
      : detectDockerHost(opts) !== null;

const { isOpenshellVmAvailable } = require("./openshell");

function detectGatewayBackend(opts = {}) {
  const env = opts.env ?? process.env;
  const override = env.NEMOCLAW_GATEWAY_BACKEND;
  if (override === "vm" || override === "docker") return override;
  const vmAvailable =
    typeof opts.vmAvailable === "boolean"
      ? opts.vmAvailable
      : isOpenshellVmAvailable();
  const dockerAvailable =
    typeof opts.dockerAvailable === "boolean"
      ? opts.dockerAvailable
      : detectDockerHost(opts) !== null;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/platform.ts` around lines 123 - 135, The detectGatewayBackend
function currently defaults opts.vmAvailable to false which causes dual-capable
hosts to prefer "docker" silently; update detectGatewayBackend to either (A)
perform VM detection when opts.vmAvailable is undefined (mirror docker detection
by calling an appropriate VM detection helper, e.g., detectVMHost or similar,
and set vmAvailable based on its result) or (B) make opts.vmAvailable required
and throw a clear error if it's not provided; ensure you update the vmAvailable
initialization (and any callers) instead of hard-coding false and keep the
dockerAvailable logic that calls detectDockerHost unchanged.

Comment thread src/nemoclaw.ts
Comment on lines +523 to +543
if (session?.gatewayBackend === "vm") {
  // openshell-vm registers as "openshell-vm-nemoclaw", not "nemoclaw"
  const vmGatewayName = `openshell-vm-${NEMOCLAW_GATEWAY_NAME}`;
  runOpenshell(["gateway", "select", vmGatewayName], { ignoreError: true });
  const before = getNamedGatewayLifecycleState();
  if (before.state === "healthy_named") {
    process.env.OPENSHELL_GATEWAY = vmGatewayName;
    return { recovered: true, before, after: before, attempted: false };
  }
  try {
    await startGatewayForRecovery();
  } catch {
    /* fall through */
  }
  runOpenshell(["gateway", "select", vmGatewayName], { ignoreError: true });
  const after = getNamedGatewayLifecycleState();
  if (after.state === "healthy_named") {
    process.env.OPENSHELL_GATEWAY = vmGatewayName;
    return { recovered: true, before, after, attempted: true, via: "start" };
  }
  return { recovered: false, before, after, attempted: true };

⚠️ Potential issue | 🔴 Critical

Parameterize the lifecycle check with the VM gateway name.

This branch selects openshell-vm-nemoclaw, but getNamedGatewayLifecycleState() still queries and compares against "nemoclaw". As written, the VM path can never observe healthy_named, so recovery falls through as failed even when the VM gateway is already up.

🛠️ Suggested shape of the fix
-function getNamedGatewayLifecycleState() {
+function getNamedGatewayLifecycleState(gatewayName = NEMOCLAW_GATEWAY_NAME) {
   const status = captureOpenshell(["status"]);
-  const gatewayInfo = captureOpenshell(["gateway", "info", "-g", "nemoclaw"]);
+  const gatewayInfo = captureOpenshell(["gateway", "info", "-g", gatewayName]);
   const cleanStatus = stripAnsi(status.output);
   const activeGateway = getActiveGatewayName(status.output);
   const connected = /^\s*Status:\s*Connected\b/im.test(cleanStatus);
-  const named = hasNamedGateway(gatewayInfo.output);
+  const named = stripAnsi(gatewayInfo.output).includes(`Gateway: ${gatewayName}`);
   const refusing = /Connection refused|client error \(Connect\)|tcp connect error/i.test(
     cleanStatus,
   );
-  if (connected && activeGateway === "nemoclaw" && named) {
+  if (connected && activeGateway === gatewayName && named) {
     return { state: "healthy_named", status: status.output, gatewayInfo: gatewayInfo.output };
   }
   ...
 }

And in this branch:

-const before = getNamedGatewayLifecycleState();
+const before = getNamedGatewayLifecycleState(vmGatewayName);
...
-const after = getNamedGatewayLifecycleState();
+const after = getNamedGatewayLifecycleState(vmGatewayName);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 523 - 543, The VM branch constructs
vmGatewayName but calls getNamedGatewayLifecycleState() with no name; update
both lifecycle checks to call getNamedGatewayLifecycleState(vmGatewayName) so
the code inspects the VM gateway (before and after), keep
runOpenshell(["gateway","select", vmGatewayName]) and setting
process.env.OPENSHELL_GATEWAY as-is, and leave startGatewayForRecovery()
behavior unchanged unless it also needs a gateway name in future.

Comment thread test/openshell-vm.test.ts
Comment on lines +129 to +164
describe("readVmPid", () => {
  it("returns null when PID file does not exist", () => {
    // readVmPid reads from the real path (~/.nemoclaw/openshell-vm.pid)
    // but we can test the isVmProcessAlive helper with known PIDs
    expect(readVmPid()).toSatisfy((v) => v === null || typeof v === "number");
  });
});

describe("isVmProcessAlive", () => {
  it("returns false for null PID", () => {
    expect(isVmProcessAlive(null)).toBe(false);
  });

  it("returns false for PID 0", () => {
    expect(isVmProcessAlive(0)).toBe(false);
  });

  it("returns false for negative PID", () => {
    expect(isVmProcessAlive(-1)).toBe(false);
  });

  it("returns true for current process PID", () => {
    expect(isVmProcessAlive(process.pid)).toBe(true);
  });

  it("returns false for non-existent PID", () => {
    // Use a very high PID that's unlikely to exist
    expect(isVmProcessAlive(4_000_000)).toBe(false);
  });
});

describe("isVmGatewayHealthy", () => {
  it("returns false when no VM process is running", () => {
    // With no PID file, there's no VM process to check
    expect(isVmGatewayHealthy()).toBe(false);
  });

⚠️ Potential issue | 🟠 Major

Make these VM lifecycle tests hermetic.

readVmPid() and isVmGatewayHealthy() here read the real ~/.nemoclaw/openshell-vm.pid and process table, so the assertions change based on whatever is already running on the host. That makes this unit suite flaky on dev machines and shared CI workers. Inject the pid-file path / liveness probe into the helpers, or mock fs and the process checks for these cases instead. As per coding guidelines: test/**/*.test.{js,ts}: Mock external dependencies in unit tests; do not call real NVIDIA APIs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/openshell-vm.test.ts` around lines 129 - 164, The VM tests are reading
the real PID file and process table causing flakiness; update the helpers or
tests so they don't touch real system state: modify readVmPid to accept an
optional pidFilePath (or add an injectable function) and make isVmGatewayHealthy
accept a liveness-check function (or inject isVmProcessAlive), then in the tests
call these helpers with a temp/test PID file and a mocked liveness probe, or
alternatively mock fs (e.g., fs.readFileSync/existsSync) and process checks used
by isVmProcessAlive/isVmGatewayHealthy; adjust tests to create controlled PID
contents and to stub process liveness so assertions are deterministic.

@wscurran wscurran added enhancement New feature or request Docker Support for Docker containerization labels Apr 13, 2026
jyaunches pushed a commit that referenced this pull request Apr 14, 2026
… version field

Refactor test/e2e/brev-e2e.test.ts for issue #1390:

- Extract named helpers from the ~350-line monolithic beforeAll:
  cleanupLeftoverInstance(), createBrevInstance(), refreshAndWaitForSsh(),
  bootstrapLaunchable(), pollForSandboxReady(), writeManualRegistry().
  The beforeAll now reads as a high-level orchestration (~50 lines).

- Deduplicate brev create + error recovery: the deploy-cli and launchable
  paths shared duplicated brev refresh + waitForSsh patterns, now
  consolidated into refreshAndWaitForSsh().

- Remove phantom version: 1 from the manual registry write. The
  SandboxRegistry interface in src/lib/registry.ts has no version field;
  registerSandbox() doesn't write one either.

- bootstrapLaunchable() returns { remoteDir, needsOnboard } instead of
  mutating module-level state as a hidden side-effect.

- instanceCreated is set at call sites in beforeAll, not hidden inside
  createBrevInstance().

- Remove dead sleep() helper (defined but never called).

Pure refactoring — no behavior changes. // @ts-nocheck pragma preserved.

Note: PR #1791 (openshell-vm microVM) also touches this file — whoever
merges second will need a rebase.
ericksoa added a commit that referenced this pull request Apr 16, 2026
… version field
ericksoa added a commit that referenced this pull request Apr 16, 2026
…nstaller detection (#1888)

## Summary

Addresses all three items from #1390, plus a bonus installer fix
discovered while running the pre-push hooks.

### Issue #1390 — Brev E2E cleanup

- **Extract helpers from the monolithic beforeAll**: The ~350-line
`beforeAll` block is now ~50 lines of high-level orchestration calling
named helpers: `cleanupLeftoverInstance()`, `createBrevInstance()`,
`refreshAndWaitForSsh()`, `bootstrapLaunchable()`,
`pollForSandboxReady()`, `writeManualRegistry()`.

- **Deduplicate brev create + error recovery**: The deploy-cli and
launchable paths shared duplicated `brev refresh` + `waitForSsh`
patterns, now consolidated into `refreshAndWaitForSsh()`.

- **Remove phantom `version: 1`** from the manual registry write. The
`SandboxRegistry` interface in `src/lib/registry.ts` has no version
field; `registerSandbox()` doesn't write one either.

Additional cleanup:
- `bootstrapLaunchable()` returns `{ remoteDir, needsOnboard }` instead
of mutating module-level state as a hidden side-effect.
- `instanceCreated` is set at call sites in `beforeAll`, not hidden
inside `createBrevInstance()`.
- Dead `sleep()` helper removed.

Pure refactoring — no behavior changes. `// @ts-nocheck` pragma
preserved.

### Installer worktree fix

`scripts/install.sh` used `-d "${repo_root}/.git"` to detect source
checkouts, but in a git worktree `.git` is a file, not a directory. This
caused `is_source_checkout()` to return false, falling through to the
GitHub clone path. Fixed by using `-e` (exists) instead of `-d` (is
directory). This resolved 12 pre-existing test failures in
`test/install-preflight.test.ts`.

## Related Issue
Closes #1390

## Note
PR #1791 also touches `test/e2e/brev-e2e.test.ts` (appends a new
`vm-backend` test case). Clean merge — whoever merges second rebases
trivially.

## Type of Change
- [x] Code change for a new feature, bug fix, or refactor.

## Testing
- [x] `npx prek run --all-files` passes (all pre-commit and pre-push
hooks green).
- [x] `npx vitest run --project cli` — 1213 passed, 0 failed (was 12
failed before installer fix).


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Enhanced end-to-end test infrastructure and setup orchestration for
improved test reliability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Aaron Erickson <aerickson@nvidia.com>