
feat(vm): openshell-vm microVM gateway backend (full lifecycle, E2E 30/30) #1791

Draft
ericksoa wants to merge 31 commits into main from feat/openshell-vm-backend

Conversation

@ericksoa (Contributor) commented Apr 11, 2026

Summary

Replaces the Docker-based gateway with openshell-vm (libkrun microVM) as the default gateway backend. Eliminates Docker-in-Docker from the k8s deployment and removes Docker as a runtime dependency for the gateway. Docker is still used on the host as a build tool for the sandbox image.

E2E status: 30/30 passing on GitHub Actions ubuntu-latest with KVM.

Why microVM instead of Docker

  • Security: Sandboxes run inside hardware-isolated microVMs (KVM), not Docker containers sharing the host kernel. No privileged Docker daemon, no shared Docker socket.
  • Simplicity: One process (openshell-vm) replaces dockerd + k3s-in-Docker. The k8s manifest went from 122 lines (2 containers, 3 volumes, init container) to 56 lines (1 container, no volumes).
  • Resources: ~8Gi less memory (no Docker daemon sidecar). VM boots in ~6s vs 60s Docker daemon startup.
  • No Docker-in-Docker: The old k8s deployment ran a Docker daemon inside a k8s pod. That's gone entirely.

GPU inference works without Docker

GPU inference is routed through inference.local (OpenShell's L7 proxy → host-side Ollama/vLLM/NIM). The sandbox never touches the GPU directly — it makes HTTP requests through the gateway's inference route. The microVM backend works for all scenarios including local GPU inference.

What's implemented

  • Backend detection (platform.ts): detectGatewayBackend() prefers VM when available, falls back to Docker. NEMOCLAW_GATEWAY_BACKEND env override.
  • Gateway lifecycle (onboard.ts): spawn openshell-vm --name nemoclaw --mem 4096 as detached process, PID tracking, health polling via gRPC, SIGTERM/SIGKILL cleanup
  • Session persistence: gatewayBackend field stored in onboard session for resume/recovery
  • Sandbox image import: Docker build → docker save → write to VM rootfs via virtio-fs → ctr images import via exec agent → openshell sandbox create --from <image-ref>
  • Resume/recovery: --resume detects dead VM gateway, restarts it, waits for sandbox + inference route
  • K8s manifest: Dropped DinD, single container with openshell-vm
  • CI workflow: KVM access, VM diagnostics on failure
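The detection flow in the first bullet can be sketched in shell. This is an illustrative reimplementation, not the actual TypeScript in platform.ts; the function name below is ours:

```shell
#!/bin/sh
# Illustrative sketch of the backend-selection rules described above.
# The real implementation is detectGatewayBackend() in src/lib/platform.ts.
detect_gateway_backend() {
  # Explicit override always wins.
  if [ -n "${NEMOCLAW_GATEWAY_BACKEND:-}" ]; then
    echo "$NEMOCLAW_GATEWAY_BACKEND"
    return 0
  fi
  # Prefer the microVM backend whenever openshell-vm is installed.
  if command -v openshell-vm >/dev/null 2>&1; then
    echo "vm"
  else
    echo "docker"
  fi
}

detect_gateway_backend
```

The override short-circuits before any probing, which is what makes the env var usable in CI to force a specific path.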

Where Docker is still used

Docker is used on the host as a build tool — docker build to create the sandbox image and docker save to export it. This is not Docker-in-Docker; it's the same as running docker build in any CI pipeline. Future work could replace this with podman/buildah or building inside the VM's containerd.
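The build-tool usage amounts to a two-command pipeline. A dry-run sketch (the image tag and output path are placeholders, not NemoClaw's actual values):

```shell
# Dry-run sketch of the host-side build/export step described above.
# DRY_RUN defaults to echoing the commands; unset it to execute for real.
DRY_RUN="${DRY_RUN:-echo}"

build_and_export_sandbox_image() {
  tag="$1"   # placeholder tag, e.g. nemoclaw-sandbox:dev
  out="$2"   # tarball later imported into the VM's containerd
  $DRY_RUN docker build -t "$tag" . &&
  $DRY_RUN docker save -o "$out" "$tag"
}

build_and_export_sandbox_image nemoclaw-sandbox:dev /tmp/sandbox-image.tar
```

Both commands run on the host daemon only, which is why this is ordinary CI-style Docker usage rather than DinD.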


Workarounds for current vm-dev release (discussion needed)

We're patching the openshell-vm rootfs at onboard time to work around issues in the current vm-dev release. Each patch is designed to be self-disabling — it detects whether the issue exists before applying, so when the issues are fixed upstream, our patches become no-ops.

Here's what each one does and why, in plain language:

1. Kernel missing mqueue support (runc shim)

What happens without the fix: Every container inside the VM fails to start; the kubelet logs show error mounting "mqueue" to rootfs: no such device. No pods run — not even CoreDNS — so the entire k3s cluster is dead.

Why it happens: When a container starts, the container runtime (runc) tries to set up a special filesystem called mqueue inside the container. This requires the Linux kernel to have "POSIX message queue" support compiled in. Based on what we can see in the repo, the kernel config (CONFIG_POSIX_MQUEUE=y) appears to have been added to the kconfig file after the vm-dev release binary was built, so the kernel in the current release appears not to have this feature compiled in.

What we do: Before starting the VM, we write a containerd config file that tells k3s to use our wrapper script instead of the real runc. The wrapper does one thing: before creating a container, it edits the container's configuration to request a plain tmpfs filesystem instead of mqueue. The container starts fine because tmpfs always works. Almost no software actually uses mqueue — it's a POSIX inter-process communication mechanism that OpenClaw doesn't need.
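The wrapper's edit can be illustrated with a few lines of sed over an OCI config.json. The function name and exact sed patterns are our sketch, not the shipped wrapper script:

```shell
# Sketch of the mqueue → tmpfs rewrite the wrapper performs before calling
# the real runc. An OCI bundle's config.json declares mounts such as:
#   {"destination": "/dev/mqueue", "type": "mqueue", "source": "mqueue"}
# Rewriting type and source to tmpfs yields a mount that always succeeds.
rewrite_mqueue_to_tmpfs() {
  config="$1"  # path to the bundle's config.json
  sed -i \
    -e 's/"type": *"mqueue"/"type": "tmpfs"/g' \
    -e 's/"source": *"mqueue"/"source": "tmpfs"/g' \
    "$config"
}
```

The destination stays /dev/mqueue, so software that merely expects the path to exist keeps working; only the backing filesystem type changes.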

How it self-disables: The init script tests mount -t mqueue at boot. If the kernel supports mqueue, the test succeeds and the wrapper is never installed.
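An equivalent, unprivileged way to probe kernel mqueue support (an illustration only; the init script's actual probe is the trial mount described above, which needs root):

```shell
# Check whether the running kernel registers the mqueue filesystem type.
# /proc/filesystems lists every type the kernel can mount, so this works
# without root, unlike a trial `mount -t mqueue`.
kernel_has_mqueue() {
  grep -qw mqueue /proc/filesystems
}

if kernel_has_mqueue; then
  echo "kernel has mqueue: wrapper would not be installed"
else
  echo "kernel lacks mqueue: wrapper would be installed"
fi
```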

Upstream fix: Rebuild the vm-dev release from current main (which has the kernel config fix at commit d8cf7951).

2. Supervisor binary glibc mismatch

What happens without the fix: The sandbox container starts but immediately crashes in a loop. The error is GLIBC_2.38 not found (required by /opt/openshell/bin/openshell-sandbox).

Why it happens: Every sandbox pod runs a supervisor process (openshell-sandbox) that manages the agent inside the container. This binary isn't part of the sandbox image; OpenShell injects it from the gateway's filesystem into the pod via a Kubernetes volume mount. In the VM gateway, the binary appears to have been cross-compiled in a way that linked it against glibc 2.39, but the sandbox container is based on Ubuntu 22.04, which ships glibc 2.35. The binary can't run because it needs a newer C library than the container provides.

In the Docker gateway, the equivalent binary appears to be built against a compatible glibc version, which is why the Docker path works.

What we do: At onboard time, we pull the Docker gateway image (ghcr.io/nvidia/openshell/cluster:0.0.26), extract the compatible openshell-sandbox binary from it, and replace the VM rootfs copy. This takes ~7 seconds and happens once.
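The extraction step can be sketched as a dry-run. The container name is a placeholder; the binary path matches the error message quoted above:

```shell
# Dry-run sketch of pulling the compatible supervisor binary out of the
# Docker gateway image and dropping it into the VM rootfs.
DRY_RUN="${DRY_RUN:-echo}"

replace_supervisor_binary() {
  image="$1"   # e.g. ghcr.io/nvidia/openshell/cluster:0.0.26
  dest="$2"    # openshell-sandbox location inside the VM rootfs
  $DRY_RUN docker create --name supervisor-src "$image" &&
  $DRY_RUN docker cp supervisor-src:/opt/openshell/bin/openshell-sandbox "$dest" &&
  $DRY_RUN docker rm supervisor-src
}
```

docker create avoids actually running the gateway image; the container exists only long enough to copy one file out.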

How it self-disables: We only replace the binary if the Docker gateway image is available (Docker is installed). If the VM release is rebuilt with a compatible binary, this step overwrites with an equivalent binary — harmless.

Upstream fix: Build openshell-sandbox for the VM rootfs targeting a glibc version compatible with the sandbox base image.

3. DNS proxy skip

Why: NemoClaw's DNS proxy setup script uses docker exec to reach kubectl inside the gateway container. With the microVM backend, there is no Docker container. We skip the DNS proxy — the VM uses gvproxy for networking, which provides NAT with working DNS out of the box. This isn't really a workaround — it's the correct behavior for the VM networking architecture.

4. Sandbox readiness check via gRPC instead of docker exec

Why: waitForSandboxReady() used openshell doctor exec -- kubectl which runs docker exec internally. For the VM backend, we use openshell sandbox list via the gRPC API instead. This also isn't really a workaround — it's using the right API for the backend.
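The readiness check reduces to a generic poll loop. A sketch with illustrative attempt/interval defaults (the real predicate is openshell sandbox list over the gRPC API):

```shell
# Generic poll-until-ready loop of the shape waitForSandboxReady() uses
# for the VM path. Any command can serve as the readiness predicate.
wait_until_ready() {
  cmd="$1"
  attempts="${2:-60}"   # illustrative defaults
  interval="${3:-3}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if sh -c "$cmd" >/dev/null 2>&1; then
      return 0          # predicate succeeded: ready
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1              # gave up after attempts × interval seconds
}
```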


Files changed (18 files, +1600/-100)

| File | Purpose |
| --- | --- |
| src/lib/onboard.ts | VM gateway lifecycle, image import, rootfs patches, resume recovery |
| src/lib/platform.ts | detectGatewayBackend() (VM preferred, no GPU constraint) |
| src/lib/openshell.ts | isOpenshellVmAvailable(), getInstalledOpenshellVmVersion() |
| src/lib/onboard-session.ts | gatewayBackend session field |
| src/nemoclaw.ts | VM-aware cleanup, recovery |
| nemoclaw-blueprint/blueprint.yaml | gateway_backends: [docker, vm], min version bump |
| k8s/nemoclaw-k8s.yaml | Replaced DinD with single-container openshell-vm |
| .github/workflows/nightly-e2e.yaml | vm-e2e job, KVM access, diagnostics |
| test/e2e/test-vm-backend-e2e.sh | Full E2E: install → onboard → inference → resume → reset |
| test/openshell-vm.test.ts | Unit tests for VM detection, session, lifecycle |
| test/platform.test.ts | Unit tests for detectGatewayBackend() |

E2E test phases (all passing)

| Phase | Description | Status |
| --- | --- | --- |
| 0 | Prerequisites (Linux, KVM, API key, network) | PASS |
| 1 | Install openshell-vm binary + runtime | PASS |
| 2 | Install NemoClaw with VM backend | PASS |
| 3 | Post-install verification (registry, list, status, no Docker container) | PASS |
| 4 | Live inference through VM sandbox (PONG test) | PASS |
| 5 | Resume after openshell-vm kill | PASS |
| 6 | Reset — destroy and clean-slate re-onboard | PASS |
| 7 | Final cleanup | PASS |

Test plan

  • npm test — 1386 tests passed, 0 failures
  • Pre-commit and pre-push hooks pass
  • vm-e2e: 30/30 on GitHub Actions ubuntu-latest
  • cloud-e2e: passes (Docker fallback path not regressed)
  • sandbox-survival-e2e: passes
  • hermes-e2e: passes
  • messaging-providers-e2e: passes
  • skip-permissions-e2e: passes
  • Squash commits before merge
  • Manual testing on Mac with openshell-vm (macOS + Hypervisor.framework)

Known gaps

  • Commit squashing: Branch has 30 iterative commits. Needs squash before merge.
  • cloud-experimental-e2e: Fails on this branch AND on main — pre-existing, unrelated.
  • Mac testing: E2E runs on Linux only. openshell-vm supports macOS ARM64 via Hypervisor.framework but hasn't been tested with this NemoClaw integration yet.
  • Host Docker still needed for sandbox image build: docker build + docker save run on the host. Not DinD, but still a Docker dependency. Could be replaced with podman/buildah or in-VM builds.

Refs: NVIDIA/OpenShell#611

Summary by CodeRabbit

  • New Features

    • Added microVM gateway backend support as an alternative to Docker, enabling CPU-only sandbox environments
    • Introduced automatic gateway backend detection for flexible deployment configuration
  • Documentation

    • Updated troubleshooting guides to reflect OpenShell version 0.0.26+ requirements
  • Tests

    • Added end-to-end tests validating microVM backend workflow including sandbox lifecycle operations

OpenShell v0.0.26 ships openshell-vm, a standalone binary that boots a
hardware-isolated microVM via libkrun. This commit lays the groundwork
for NemoClaw to support both the existing Docker/k3s gateway and the new
microVM gateway as selectable backends.

Phase 1 changes:
- Bump min_openshell_version from 0.0.24 to 0.0.26 across blueprint,
  install script, onboard preflight, CI scripts, E2E tests, and docs
- Add gateway_backends field to blueprint.yaml schema (docker, vm)
- Add isOpenshellVmAvailable() and getInstalledOpenshellVmVersion() to
  openshell.ts for detecting the openshell-vm binary
- Add detectGatewayBackend() to platform.ts with NEMOCLAW_GATEWAY_BACKEND
  env var override, auto-detection preferring VM when available and
  falling back to Docker, and mandatory Docker for GPU workloads
- Add gatewayBackend field to onboard session schema for persisting the
  selected backend across resume cycles
- Add tests for all new functions

The VM backend requires no Docker daemon and provides faster boot, but
has no NVIDIA GPU passthrough (libkrun lacks PCI/VFIO support), so the
Docker backend remains mandatory for local inference on GPU workstations.

Refs: NVIDIA/OpenShell#611
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
coderabbitai bot commented Apr 11, 2026

Review skipped: draft detected. To trigger a single review, invoke the @coderabbitai review command.

@github-actions

🚀 Docs preview ready!

https://NVIDIA.github.io/NemoClaw/pr-preview/pr-1791/

ericksoa added 27 commits April 11, 2026 14:15
Phase 4 of the dual-backend feature (PR #1791):

- New test/e2e/test-vm-backend-e2e.sh: full E2E journey for the VM
  backend — install openshell-vm from release assets, onboard with
  NEMOCLAW_GATEWAY_BACKEND=vm, verify sandbox creation, live inference
  through the microVM gateway, resume after openshell-vm kill, and
  reset to clean slate.

- New vm-e2e job in nightly-e2e.yaml: runs on ubuntu-latest (has
  /dev/kvm), installs openshell-vm, executes the VM backend E2E test.

- New vm-backend test suite in brev-e2e.test.ts: allows running the
  VM backend E2E on ephemeral Brev instances via TEST_SUITE=vm-backend.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The openshell-vm binary is published on the vm-dev pre-release tag
(not v0.0.26) with gnu libc (not musl). Also downloads the VM runtime
(kernel + rootfs) needed by libkrun, and installs zstd for
decompression.

Asset corrections:
- Tag: vm-dev (not v0.0.26)
- Binary: openshell-vm-*-unknown-linux-gnu.tar.gz (not musl)
- Checksums: vm-binary-checksums-sha256.txt
- Runtime: vm-runtime-linux-*.tar.zst → ~/.local/share/openshell-vm/

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
When detectGatewayBackend() returns "vm", the onboard and runtime
recovery flows now use openshell-vm instead of Docker:

- startGatewayWithOptions() detects the backend and delegates to
  startVmGatewayProcess() for the VM path, which spawns openshell-vm
  as a detached background process with PID tracking
- VM lifecycle helpers: writeVmPidFile, readVmPid, isVmProcessAlive,
  stopVmGateway (SIGTERM→SIGKILL), isVmGatewayHealthy
- destroyGateway() checks session.gatewayBackend and stops the VM
  process instead of Docker volume cleanup when backend is "vm"
- recoverGatewayRuntime() reads session.gatewayBackend to choose
  VM vs Docker recovery path
- recoverNamedGatewayRuntime() in nemoclaw.ts skips Docker-specific
  gateway select commands for VM backend
- cleanupGatewayAfterLastSandbox() stops VM process instead of
  Docker cleanup when backend is "vm"
- Gateway backend is saved to onboard session on step completion
  so resume cycles know which backend to use
- Resume flow checks VM health via isVmGatewayHealthy() instead of
  Docker gateway state when session records VM backend

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Reading the openshell-vm source (NVIDIA/OpenShell crates/openshell-vm)
revealed three bugs in the Phase 2 implementation:

1. openshell-vm was spawned with no arguments. It needs `--name nemoclaw`
   so it extracts the rootfs to the correct instance directory and
   registers gateway metadata under the right identity.

2. openshell-vm prefixes instance names: gateway_name("nemoclaw")
   produces "openshell-vm-nemoclaw". All OPENSHELL_GATEWAY env vars
   and `openshell gateway select` calls for the VM path now use
   VM_GATEWAY_NAME ("openshell-vm-nemoclaw") instead of GATEWAY_NAME.

3. Health poll was too short (15 × 2s = 30s). The VM boots k3s inside
   a microVM with its own 90s internal health check. Increased to
   60 × 3s = 180s to avoid racing the inner bootstrap. Also log the
   last 10 lines from openshell-vm.log on failure for diagnostics.

Additionally: the VM gateway listens on port 30051 (NodePort), not
8080. The openshell-vm binary handles the port mapping internally
(gvproxy host:30051 → VM:30051 → kube-proxy → pod:8080).

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The GitHub-hosted ubuntu runner has /dev/kvm (crw-rw---- root:kvm) but
the runner user is not in the kvm group. openshell-vm opens /dev/kvm
directly via libkrun and fails with EACCES.

Fix by chmod 666 /dev/kvm in the KVM verification step. Also add the
user to the kvm group for completeness, though the chmod is sufficient
for the current process.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…tics

Three changes to address gvproxy crash on GitHub Actions:

1. Pass --mem 4096 to openshell-vm. The default 8GB is half the 16GB
   runner memory. With k3s pulling container images inside the VM, host
   processes (gvproxy, gvisor netstack) get starved. 4GB is enough for
   a lightweight gateway without GPU workloads.

2. Detect and use the E2E-downloaded VM runtime via OPENSHELL_VM_RUNTIME_DIR.
   The test script downloads gvproxy/libkrun/libkrunfw from the vm-dev
   release to ~/.local/share/openshell-vm/ but never tells openshell-vm
   to use it. The downloaded runtime may contain fixes not in the binary's
   embedded copy. When the downloaded runtime exists (has gvproxy), set
   OPENSHELL_VM_RUNTIME_DIR to prefer it.

3. Add VM diagnostics step to CI: openshell-vm log, gvproxy log, dmesg
   OOM check, memory stats, and VM console log. This will show the
   actual root cause if gvproxy crashes again.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The openshell-vm kernel has CONFIG_POSIX_MQUEUE=y but the init script
(openshell-vm-init.sh) never mounts the mqueue filesystem. When k3s
creates a pod, runc tries to mount mqueue at /dev/mqueue inside the
container namespace and gets ENODEV ("no such device") because the
host mount point doesn't exist.

Fix: run `openshell-vm prepare-rootfs` to extract the rootfs, then
patch the init script to mkdir + mount mqueue alongside the existing
devpts/shm mounts. The patch is idempotent — skipped if the init
script already contains /dev/mqueue.

Root cause found by tracing the VM console log:
  runc create failed: error mounting "mqueue" to rootfs at
  "/dev/mqueue": no such device

This should be fixed upstream in the OpenShell init script.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The previous mqueue patch ran silently when prepare-rootfs failed or
the init script wasn't found. Add verbose logging at each step so CI
output shows exactly what happened: rootfs path, init script location,
whether the string replacement matched, and any errors.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The vm-dev kernel (built 2026-04-09) predates the CONFIG_POSIX_MQUEUE
addition to the kconfig (2026-04-10 d8cf7951). runc fails with ENODEV
when trying to mount -t mqueue inside container namespaces.

Previous approach (mounting mqueue in init script) didn't work because
the kernel itself lacks the filesystem type — mount returns ENODEV.

New approach: install a runc wrapper shim that strips mqueue mount
entries from the OCI config.json before calling the real runc binary.
The shim is only installed when the kernel actually lacks mqueue
support (tested by attempting mount -t mqueue). When a future kernel
rebuild includes CONFIG_POSIX_MQUEUE, the shim is not installed.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…pt breakage

The previous approach embedded a heredoc in the init script patch,
which broke the init script (VM died before exec agent started).

New approach: write runc-wrapper.sh as a separate file in
/opt/nemoclaw/ of the rootfs, then patch the init script with a
simple test-and-swap block. The wrapper is a standalone script that
strips mqueue entries from config.json via sed before calling the
real runc binary. The init script patch is minimal — just an if/fi
block that copies the wrapper over runc when the kernel lacks mqueue.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…shim

Previous approach of wrapping /usr/bin/runc didn't work because k3s
bundles its own runc in /var/lib/rancher/k3s/data/<hash>/bin/, extracted
fresh at each startup. Our host-side wrapper was never called.

New approach: write /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
in the init script, which k3s reads before generating its containerd config.
The template sets BinaryName to /opt/nemoclaw/runc-shim, which strips
mqueue mount entries from config.json before calling the k3s-bundled runc.

This routes ALL runc invocations through the shim, correctly handling the
k3s data directory extraction timing.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
When the session's gatewayBackend is "vm", the standard `openshell
sandbox create --from Dockerfile` path fails because the openshell CLI
internally uses Docker put_archive to push the image into the gateway
container — which doesn't exist for VM gateways.

Fix: detect VM backend during sandbox creation and:
1. Build the image locally with Docker (same Dockerfile)
2. Export with docker save to the VM's rootfs via virtio-fs
   (host can write directly to the rootfs/tmp/ directory)
3. Import into VM containerd via openshell-vm exec + ctr images import
4. Pass --from <image-ref> instead of --from <Dockerfile> so the
   openshell CLI treats it as a pre-existing image (skips Docker push)

Falls back to the standard Dockerfile path if any step fails, so Docker
backend behavior is unchanged.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…erlay)

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…g issues

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Two post-sandbox fixes for VM backend:

1. DNS proxy: setup-dns-proxy.sh uses docker exec to reach kubectl
   inside the gateway container. VM backend has no Docker container.
   Skip the DNS proxy — gvproxy provides NAT networking with working
   DNS through the gateway IP (192.168.127.1).

2. Sandbox readiness: the sandbox pod may briefly flip Ready→NotReady
   during init container restarts in the VM. Add a 30s wait loop in
   setupOpenclaw before attempting sandbox connect, preventing the
   FailedPrecondition "sandbox is not ready" error.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The vm-dev openshell-sandbox binary was built with cargo-zigbuild on
a host with glibc 2.39, but sandbox containers use Ubuntu 22.04
(glibc 2.35). The binary crashes at startup:
  GLIBC_2.38 not found (required by /opt/openshell/bin/openshell-sandbox)

Fix: extract the openshell-sandbox binary from the Docker gateway
image (ghcr.io/nvidia/openshell/cluster:<version>), which was built
with rust:1.88-slim (Debian bookworm, glibc 2.36 — compatible with
Ubuntu 22.04's glibc 2.35). Replace the VM rootfs copy before boot.

The supervisor is side-loaded into sandbox pods via hostPath volume
from /opt/openshell/bin on the k3s node, so this fix propagates to
all sandbox pods automatically.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
waitForSandboxReady() used `openshell doctor exec -- kubectl` which
runs docker exec inside the gateway container. This is Docker-specific
and fails silently for VM gateways (no Docker container exists).

For VM backend, use `openshell sandbox list` + isSandboxReady() which
goes through the gRPC API and works for both Docker and VM gateways.
Docker backend behavior is unchanged.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…ateway

When openshell-vm is killed after a successful onboard, the session is
marked complete (resumable=false). Running `nemoclaw onboard --resume`
correctly reports "No resumable session found" — but the E2E test
expects resume to restart the gateway and reconnect the sandbox.

Fix: when --resume finds a completed session with gatewayBackend=vm
and the VM gateway is dead, restart openshell-vm and check if the
sandbox is still alive. If the sandbox reconnects after gateway
restart, exit 0 (recovery successful). If not, mark the session
resumable and fall through to the normal re-onboard path.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…ng resume

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
GPU inference is routed through inference.local (OpenShell L7 proxy →
host inference server). The sandbox and gateway never need direct GPU
access — the GPU is used by the host-side Ollama/vLLM/NIM process.
The VM backend works for all scenarios including local GPU inference.

Remove the gpuRequested parameter and the conditional that forced
Docker when GPU was detected. VM is now preferred whenever openshell-vm
is available, regardless of GPU presence.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa ericksoa changed the title feat(vm): dual-backend foundation for openshell-vm microVM gateway feat(vm): openshell-vm microVM gateway backend (full lifecycle, E2E 30/30) Apr 12, 2026
Replace the two-container DinD architecture (dockerd sidecar + workspace)
with a single container using openshell-vm. Removes:

- docker:24-dind sidecar container (8Gi RAM, 2 CPU)
- Docker socket shared volume
- Docker storage volume
- Docker config init container (cgroup v2 workaround)
- docker.io apt package install
- Docker daemon wait loop

The workspace container now installs openshell-vm via the upstream
install script and sets NEMOCLAW_GATEWAY_BACKEND=vm. Requires /dev/kvm
on the k8s node (bare-metal or nested virt enabled).

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa (Contributor, Author) commented:
What dropping Docker-in-Docker means in practice

The k8s manifest (k8s/nemoclaw-k8s.yaml) went from a two-container DinD architecture to a single container using openshell-vm. Here's what changes:

Security

  • No more privileged Docker daemon. The DinD sidecar ran dockerd with privileged: true, which gives the container full root access to the host kernel — it can load kernel modules, access all devices, and escape the container. The VM backend still needs privileged: true for /dev/kvm, but KVM is a much narrower capability than a full Docker daemon. The attack surface shrinks from "arbitrary container execution engine" to "hardware virtualization API."
  • No Docker socket sharing. The old manifest shared /var/run/docker.sock between containers via an emptyDir volume. Any process with access to the Docker socket can run arbitrary containers on the host. That shared volume is gone.
  • Stronger isolation. Sandboxes run inside a hardware-isolated microVM (KVM + libkrun), not inside Docker containers sharing the host kernel. A sandbox escape would need to break out of the VM's hardware boundary, not just a Linux namespace.

Resource usage

  • ~8Gi less memory requested. The DinD sidecar requested 8Gi RAM + 2 CPU just for the Docker daemon. The VM backend uses 4Gi for the microVM (configurable via --mem) and doesn't need a separate daemon process. Total pod request drops from 12Gi to 8Gi.
  • One fewer container. No Docker daemon to start, health-check, or restart. Fewer moving parts means fewer failure modes.
  • No Docker image layer storage. The DinD sidecar used an emptyDir for /var/lib/docker which could grow unbounded as images were pulled and built. The VM uses a fixed-size state disk (32 GiB sparse file, allocated on demand).
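The "allocated on demand" behavior in the last bullet comes from sparse files; a quick demonstration (sizes are illustrative):

```shell
# A sparse file has a large apparent size but consumes blocks only as data
# is written, which is why the VM state disk doesn't grow unbounded the way
# the Docker layer store could.
disk=$(mktemp)
truncate -s 1G "$disk"                               # apparent size: 1 GiB
echo "apparent: $(du --apparent-size -k "$disk" | cut -f1) KiB"
echo "actual:   $(du -k "$disk" | cut -f1) KiB"      # near zero on disk
rm -f "$disk"
```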

Startup time

  • No Docker daemon wait. The old flow spent up to 60 seconds waiting for dockerd to become ready (docker info poll loop). The VM boots in ~6 seconds to first health check.
  • No Docker image pulls inside Docker. DinD needed to pull the OpenShell cluster image inside the Docker daemon, which was slow (no layer cache on first run). The VM has the k3s cluster embedded in its rootfs.

Operational simplicity

  • 122 lines → 56 lines. The manifest is less than half the size.
  • No init container. The cgroup v2 Docker daemon config hack (init-docker-config) is gone.
  • No shared volumes. Three emptyDir volumes (docker-storage, docker-socket, docker-config) are gone.
  • Easier debugging. One container, one process. kubectl logs nemoclaw gives you everything.

Requirements

  • KVM on the node. The k8s node must have /dev/kvm available. Bare-metal nodes and VMs with nested virtualization enabled have this. Standard cloud VMs on GKE/EKS may not — check your node pool configuration.
  • Docker still needed on the host for building the sandbox image (docker build + docker save). This is a build-time dependency, not a runtime one. Future work could eliminate this by building inside the VM's containerd or using podman.
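A minimal preflight for the KVM requirement above (a sketch; NemoClaw's actual checks live in its install and onboard scripts):

```shell
# Check that /dev/kvm exists and is usable by the current user — the two
# failure modes called out for cloud nodes and CI runners respectively.
check_kvm() {
  if [ ! -e /dev/kvm ]; then
    echo "no /dev/kvm: use bare metal or enable nested virtualization" >&2
    return 1
  fi
  if [ ! -r /dev/kvm ] || [ ! -w /dev/kvm ]; then
    echo "/dev/kvm present but inaccessible: add user to the kvm group" >&2
    return 1
  fi
  echo "KVM available"
}
```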

coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/brev-launchable-ci-cpu.sh (1)

31-31: ⚠️ Potential issue | 🟡 Minor

Stale comment: default version mismatch.

Line 31 documents the default as v0.0.20, but line 43 sets the actual default to v0.0.26. Update the comment to match:

-#   OPENSHELL_VERSION     — OpenShell CLI release tag (default: v0.0.20)
+#   OPENSHELL_VERSION     — OpenShell CLI release tag (default: v0.0.26)

Also applies to: 43-43

🧹 Nitpick comments (6)
test/e2e/test-vm-backend-e2e.sh (3)

34-34: Consider adding set -e for fail-fast behavior.

The script uses set -uo pipefail but omits -e (errexit). This means commands can fail silently without stopping the test. While the explicit pass/fail tracking handles some cases, intermediate commands (like cd, tar, install) could fail and the script would continue, potentially producing misleading results.

If intentional (to allow the test framework to track each failure), consider documenting why -e is omitted.


291-295: apt-get usage assumes Debian/Ubuntu.

While the script targets GitHub ubuntu-latest runners, users running locally on other Linux distributions (Fedora, Arch) would see failures. Consider adding a fallback or clearer error message.

Suggested defensive check
 # zstd may not be installed — install it if needed
 if ! command -v zstd >/dev/null 2>&1; then
   info "Installing zstd for runtime decompression..."
-  sudo apt-get update -qq && sudo apt-get install -y -qq zstd >/dev/null 2>&1
+  if command -v apt-get >/dev/null 2>&1; then
+    sudo apt-get update -qq && sudo apt-get install -y -qq zstd >/dev/null 2>&1
+  elif command -v dnf >/dev/null 2>&1; then
+    sudo dnf install -y -q zstd >/dev/null 2>&1
+  else
+    fail "zstd not found and no supported package manager available"
+    exit 1
+  fi
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-vm-backend-e2e.sh` around lines 291 - 295, The current zstd
install block (the if ! command -v zstd conditional) assumes apt-get is
available; replace it with a defensive install routine that first tries apt-get,
then falls back to common package managers (dnf, yum, pacman) and finally prints
a clear error/usage message if none are present; update the block around the
"Installing zstd for runtime decompression..." message to attempt each package
manager in order, and on failure exit with an informative message telling the
user to install zstd manually.

488-495: The pkill -f "openshell-vm" pattern may be overly broad.

This pattern matches any process with "openshell-vm" anywhere in its command line, which could unintentionally kill unrelated processes in shared environments. Consider using the PID file that onboard.ts writes (~/.nemoclaw/openshell-vm.pid) for targeted termination.

Suggested alternative using PID file
 info "Killing openshell-vm process to simulate crash..."
-# Find and kill any openshell-vm gateway process
-if pkill -f "openshell-vm" 2>/dev/null; then
+# Kill via PID file for targeted termination
+VM_PID_FILE="$HOME/.nemoclaw/openshell-vm.pid"
+if [ -f "$VM_PID_FILE" ] && kill -0 "$(cat "$VM_PID_FILE")" 2>/dev/null; then
+  kill "$(cat "$VM_PID_FILE")" 2>/dev/null
   pass "openshell-vm process killed"
   sleep 3
 else
   info "No openshell-vm process found to kill (may already be stopped)"
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-vm-backend-e2e.sh` around lines 488 - 495, Replace the broad
pkill usage with a PID-file based shutdown: read the PID from
~/.nemoclaw/openshell-vm.pid if it exists, verify the PID corresponds to a
running process whose command line contains "openshell-vm" (to avoid killing
unrelated PIDs), send a targeted kill to that PID and remove the pidfile; if the
pidfile is missing or verification fails, fall back to the existing pkill -f
"openshell-vm" behavior as a last resort and log appropriate info messages. This
change should be applied where pkill -f "openshell-vm" is used in the
test/e2e/test-vm-backend-e2e.sh script.
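The PID-file approach can be sketched as a small helper (hypothetical names; Linux-specific, since it reads /proc to verify the command line before killing):

```shell
# Hypothetical helper: kill the recorded PID only when it is alive and its
# command line matches the expected pattern, so unrelated PIDs are spared.
kill_by_pidfile() {
  pid_file="$1"; pattern="$2"
  [ -f "$pid_file" ] || { echo "no pidfile"; return 1; }
  pid="$(cat "$pid_file")"
  if kill -0 "$pid" 2>/dev/null \
     && tr '\0' ' ' <"/proc/$pid/cmdline" 2>/dev/null | grep -q "$pattern"; then
    kill "$pid" && rm -f "$pid_file"
    echo "killed $pid"
  else
    echo "skipped: $pid not running or cmdline mismatch"
    return 1
  fi
}
```

In the e2e script this would replace the pkill call, with the broad `pkill -f "openshell-vm"` kept only as a last-resort fallback.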
src/lib/onboard.ts (3)

2141-2235: Complex rootfs patching logic is well-documented but could benefit from extraction.

The mqueue fix (lines 2141-2235) is a significant workaround with detailed comments explaining the kernel limitation. While the documentation is excellent, this ~95-line block could be extracted into a separate helper function (e.g., patchVmRootfsForMqueue()) to improve readability and testability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2141 - 2235, Extract the mqueue workaround
block into a new helper function named patchVmRootfsForMqueue(rootfsDir, vmEnv)
and call it where the current block lives; move all logic that creates
shimDir/shimPath, writes the runc-shim, patches the init script (referencing
initScript, shimDir, shimPath, and the K3S_ARGS replacement) and the related fs
operations and logging into that function so the main flow around
spawnSync(prepResult) stays concise; ensure the helper receives rootfsDir (from
prepOutput) and returns success/failure or throws so existing console logs
(e.g., "Patched VM init..." / "Init script not found...") remain consistent and
tests can target patchVmRootfsForMqueue independently.

2815-2849: Import script approach is clever but embeds shell code in TypeScript.

The import script (lines 2817-2833) written to the rootfs is a reasonable workaround for the exec agent limitations. However, the shell script embedded as a JavaScript array is hard to maintain and test.

Consider extracting this to a separate .sh file that gets copied to the rootfs, similar to how other scripts are handled in the scripts/ directory.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2815 - 2849, The embedded shell script
written to importScript makes maintenance hard; move the script text into a
standalone file (e.g., scripts/import-image.sh) and have onboard.ts copy that
file into the VM rootfs instead of constructing the array inline. Update the
code around importScript, fs.writeFileSync, and fs.chmodSync to locate the new
script (use path.join(__dirname, "..", "scripts", "import-image.sh") or
equivalent), copy it into rootfs (fs.copyFileSync or streaming copy), then set
executable bits and keep the existing run(...) and runCapture(...) calls that
reference /opt/nemoclaw/import-image.sh; ensure any template values (if needed)
are either left static or replaced before copying and adjust tests accordingly.

2245-2265: Two functions named getInstalledOpenshellVersion exist with different signatures—consider renaming for clarity.

Line 2246 correctly calls the imported version from openshell.ts with signature (binary: string, opts: CaptureOpenshellOptions). However, a local version at line 393 exists with signature (versionOutput = null) that serves a different purpose (parsing pre-captured output). While the code uses the correct function, having two identically-named functions with different signatures and purposes invites confusion during maintenance.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2245 - 2265, There are two different
functions named getInstalledOpenshellVersion: one imported from "./openshell"
(signature (binary: string, opts: CaptureOpenshellOptions)) and a local helper
that parses pre-captured output (signature (versionOutput = null)); rename the
local parser to something unambiguous like parseOpenshellVersionOutput (or
parseCapturedOpenshellVersion), update all internal call sites to that new name,
and leave the imported getInstalledOpenshellVersion unchanged so callers like
the gatewayImage construction continue to use the correct function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.


Outside diff comments:
In `@scripts/brev-launchable-ci-cpu.sh`:
- Line 31: Update the stale comment that documents the default OpenShell CLI
release tag so it matches the actual default; change the commented line
referencing "OPENSHELL_VERSION — OpenShell CLI release tag (default: v0.0.20)"
to reflect the current default v0.0.26 used where OPENSHELL_VERSION is set,
ensuring the comment and the OPENSHELL_VERSION default value are consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: accc1d5d-64cf-43eb-8220-b2e5190b92f1

📥 Commits

Reviewing files that changed from the base of the PR and between fcb5fd6 and 5f2db18.

📒 Files selected for processing (18)
  • .agents/skills/nemoclaw-user-reference/references/troubleshooting.md
  • .github/workflows/nightly-e2e.yaml
  • docs/reference/troubleshooting.md
  • k8s/nemoclaw-k8s.yaml
  • nemoclaw-blueprint/blueprint.yaml
  • schemas/blueprint.schema.json
  • scripts/brev-launchable-ci-cpu.sh
  • scripts/install-openshell.sh
  • src/lib/onboard-session.ts
  • src/lib/onboard.ts
  • src/lib/openshell.ts
  • src/lib/platform.ts
  • src/nemoclaw.ts
  • test/e2e/brev-e2e.test.ts
  • test/e2e/test-sandbox-survival.sh
  • test/e2e/test-vm-backend-e2e.sh
  • test/openshell-vm.test.ts
  • test/platform.test.ts

Comment thread src/lib/platform.ts
Comment on lines +123 to +135
function detectGatewayBackend(opts = {}) {
  const env = opts.env ?? process.env;
  const override = env.NEMOCLAW_GATEWAY_BACKEND;
  if (override === "vm" || override === "docker") return override;

  const vmAvailable =
    typeof opts.vmAvailable === "boolean"
      ? opts.vmAvailable
      : false; // caller should pass actual detection result
  const dockerAvailable =
    typeof opts.dockerAvailable === "boolean"
      ? opts.dockerAvailable
      : detectDockerHost(opts) !== null;

⚠️ Potential issue | 🟠 Major

Don't hard-code the default VM availability to false.

When callers omit opts.vmAvailable, this helper still resolves a dual-capable host to "docker", so the new default silently stops preferring VM. Either detect VM availability here or make vmAvailable a required input for this API.

💡 One way to keep the default preference correct
+const { isOpenshellVmAvailable } = require("./openshell");
+
 function detectGatewayBackend(opts = {}) {
   const env = opts.env ?? process.env;
   const override = env.NEMOCLAW_GATEWAY_BACKEND;
   if (override === "vm" || override === "docker") return override;
 
   const vmAvailable =
     typeof opts.vmAvailable === "boolean"
       ? opts.vmAvailable
-      : false; // caller should pass actual detection result
+      : isOpenshellVmAvailable();
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
function detectGatewayBackend(opts = {}) {
  const env = opts.env ?? process.env;
  const override = env.NEMOCLAW_GATEWAY_BACKEND;
  if (override === "vm" || override === "docker") return override;
  const vmAvailable =
    typeof opts.vmAvailable === "boolean"
      ? opts.vmAvailable
      : false; // caller should pass actual detection result
  const dockerAvailable =
    typeof opts.dockerAvailable === "boolean"
      ? opts.dockerAvailable
      : detectDockerHost(opts) !== null;

const { isOpenshellVmAvailable } = require("./openshell");

function detectGatewayBackend(opts = {}) {
  const env = opts.env ?? process.env;
  const override = env.NEMOCLAW_GATEWAY_BACKEND;
  if (override === "vm" || override === "docker") return override;
  const vmAvailable =
    typeof opts.vmAvailable === "boolean"
      ? opts.vmAvailable
      : isOpenshellVmAvailable();
  const dockerAvailable =
    typeof opts.dockerAvailable === "boolean"
      ? opts.dockerAvailable
      : detectDockerHost(opts) !== null;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/platform.ts` around lines 123 - 135, The detectGatewayBackend
function currently defaults opts.vmAvailable to false which causes dual-capable
hosts to prefer "docker" silently; update detectGatewayBackend to either (A)
perform VM detection when opts.vmAvailable is undefined (mirror docker detection
by calling an appropriate VM detection helper, e.g., detectVMHost or similar,
and set vmAvailable based on its result) or (B) make opts.vmAvailable required
and throw a clear error if it's not provided; ensure you update the vmAvailable
initialization (and any callers) instead of hard-coding false and keep the
dockerAvailable logic that calls detectDockerHost unchanged.

Comment thread src/nemoclaw.ts
Comment on lines +523 to +543
if (session?.gatewayBackend === "vm") {
  // openshell-vm registers as "openshell-vm-nemoclaw", not "nemoclaw"
  const vmGatewayName = `openshell-vm-${NEMOCLAW_GATEWAY_NAME}`;
  runOpenshell(["gateway", "select", vmGatewayName], { ignoreError: true });
  const before = getNamedGatewayLifecycleState();
  if (before.state === "healthy_named") {
    process.env.OPENSHELL_GATEWAY = vmGatewayName;
    return { recovered: true, before, after: before, attempted: false };
  }
  try {
    await startGatewayForRecovery();
  } catch {
    /* fall through */
  }
  runOpenshell(["gateway", "select", vmGatewayName], { ignoreError: true });
  const after = getNamedGatewayLifecycleState();
  if (after.state === "healthy_named") {
    process.env.OPENSHELL_GATEWAY = vmGatewayName;
    return { recovered: true, before, after, attempted: true, via: "start" };
  }
  return { recovered: false, before, after, attempted: true };

⚠️ Potential issue | 🔴 Critical

Parameterize the lifecycle check with the VM gateway name.

This branch selects openshell-vm-nemoclaw, but getNamedGatewayLifecycleState() still queries and compares against "nemoclaw". As written, the VM path can never observe healthy_named, so recovery falls through as failed even when the VM gateway is already up.

🛠️ Suggested shape of the fix
-function getNamedGatewayLifecycleState() {
+function getNamedGatewayLifecycleState(gatewayName = NEMOCLAW_GATEWAY_NAME) {
   const status = captureOpenshell(["status"]);
-  const gatewayInfo = captureOpenshell(["gateway", "info", "-g", "nemoclaw"]);
+  const gatewayInfo = captureOpenshell(["gateway", "info", "-g", gatewayName]);
   const cleanStatus = stripAnsi(status.output);
   const activeGateway = getActiveGatewayName(status.output);
   const connected = /^\s*Status:\s*Connected\b/im.test(cleanStatus);
-  const named = hasNamedGateway(gatewayInfo.output);
+  const named = stripAnsi(gatewayInfo.output).includes(`Gateway: ${gatewayName}`);
   const refusing = /Connection refused|client error \(Connect\)|tcp connect error/i.test(
     cleanStatus,
   );
-  if (connected && activeGateway === "nemoclaw" && named) {
+  if (connected && activeGateway === gatewayName && named) {
     return { state: "healthy_named", status: status.output, gatewayInfo: gatewayInfo.output };
   }
   ...
 }

And in this branch:

-const before = getNamedGatewayLifecycleState();
+const before = getNamedGatewayLifecycleState(vmGatewayName);
...
-const after = getNamedGatewayLifecycleState();
+const after = getNamedGatewayLifecycleState(vmGatewayName);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 523 - 543, The VM branch constructs
vmGatewayName but calls getNamedGatewayLifecycleState() with no name; update
both lifecycle checks to call getNamedGatewayLifecycleState(vmGatewayName) so
the code inspects the VM gateway (before and after), keep
runOpenshell(["gateway","select", vmGatewayName]) and setting
process.env.OPENSHELL_GATEWAY as-is, and leave startGatewayForRecovery()
behavior unchanged unless it also needs a gateway name in future.

Comment thread test/openshell-vm.test.ts
Comment on lines +129 to +164
describe("readVmPid", () => {
  it("returns null when PID file does not exist", () => {
    // readVmPid reads from the real path (~/.nemoclaw/openshell-vm.pid)
    // but we can test the isVmProcessAlive helper with known PIDs
    expect(readVmPid()).toSatisfy((v) => v === null || typeof v === "number");
  });
});

describe("isVmProcessAlive", () => {
  it("returns false for null PID", () => {
    expect(isVmProcessAlive(null)).toBe(false);
  });

  it("returns false for PID 0", () => {
    expect(isVmProcessAlive(0)).toBe(false);
  });

  it("returns false for negative PID", () => {
    expect(isVmProcessAlive(-1)).toBe(false);
  });

  it("returns true for current process PID", () => {
    expect(isVmProcessAlive(process.pid)).toBe(true);
  });

  it("returns false for non-existent PID", () => {
    // Use a very high PID that's unlikely to exist
    expect(isVmProcessAlive(4_000_000)).toBe(false);
  });
});

describe("isVmGatewayHealthy", () => {
  it("returns false when no VM process is running", () => {
    // With no PID file, there's no VM process to check
    expect(isVmGatewayHealthy()).toBe(false);
  });

⚠️ Potential issue | 🟠 Major

Make these VM lifecycle tests hermetic.

readVmPid() and isVmGatewayHealthy() here read the real ~/.nemoclaw/openshell-vm.pid and process table, so the assertions change based on whatever is already running on the host. That makes this unit suite flaky on dev machines and shared CI workers. Inject the pid-file path / liveness probe into the helpers, or mock fs and the process checks for these cases instead. As per coding guidelines: test/**/*.test.{js,ts}: Mock external dependencies in unit tests; do not call real NVIDIA APIs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/openshell-vm.test.ts` around lines 129 - 164, The VM tests are reading
the real PID file and process table causing flakiness; update the helpers or
tests so they don't touch real system state: modify readVmPid to accept an
optional pidFilePath (or add an injectable function) and make isVmGatewayHealthy
accept a liveness-check function (or inject isVmProcessAlive), then in the tests
call these helpers with a temp/test PID file and a mocked liveness probe, or
alternatively mock fs (e.g., fs.readFileSync/existsSync) and process checks used
by isVmProcessAlive/isVmGatewayHealthy; adjust tests to create controlled PID
contents and to stub process liveness so assertions are deterministic.

@wscurran wscurran added enhancement New feature or request Docker Support for Docker containerization labels Apr 13, 2026
jyaunches pushed a commit that referenced this pull request Apr 14, 2026
… version field

Refactor test/e2e/brev-e2e.test.ts for issue #1390:

- Extract named helpers from the ~350-line monolithic beforeAll:
  cleanupLeftoverInstance(), createBrevInstance(), refreshAndWaitForSsh(),
  bootstrapLaunchable(), pollForSandboxReady(), writeManualRegistry().
  The beforeAll now reads as a high-level orchestration (~50 lines).

- Deduplicate brev create + error recovery: the deploy-cli and launchable
  paths shared duplicated brev refresh + waitForSsh patterns, now
  consolidated into refreshAndWaitForSsh().

- Remove phantom version: 1 from the manual registry write. The
  SandboxRegistry interface in src/lib/registry.ts has no version field;
  registerSandbox() doesn't write one either.

- bootstrapLaunchable() returns { remoteDir, needsOnboard } instead of
  mutating module-level state as a hidden side-effect.

- instanceCreated is set at call sites in beforeAll, not hidden inside
  createBrevInstance().

- Remove dead sleep() helper (defined but never called).

Pure refactoring — no behavior changes. // @ts-nocheck pragma preserved.

Note: PR #1791 (openshell-vm microVM) also touches this file — whoever
merges second will need a rebase.
ericksoa added a commit that referenced this pull request Apr 16, 2026
… version field
ericksoa added a commit that referenced this pull request Apr 16, 2026
…nstaller detection (#1888)

## Summary

Addresses all three items from #1390, plus a bonus installer fix
discovered while running the pre-push hooks.

### Issue #1390 — Brev E2E cleanup

- **Extract helpers from the monolithic beforeAll**: The ~350-line
`beforeAll` block is now ~50 lines of high-level orchestration calling
named helpers: `cleanupLeftoverInstance()`, `createBrevInstance()`,
`refreshAndWaitForSsh()`, `bootstrapLaunchable()`,
`pollForSandboxReady()`, `writeManualRegistry()`.

- **Deduplicate brev create + error recovery**: The deploy-cli and
launchable paths shared duplicated `brev refresh` + `waitForSsh`
patterns, now consolidated into `refreshAndWaitForSsh()`.

- **Remove phantom `version: 1`** from the manual registry write. The
`SandboxRegistry` interface in `src/lib/registry.ts` has no version
field; `registerSandbox()` doesn't write one either.

Additional cleanup:
- `bootstrapLaunchable()` returns `{ remoteDir, needsOnboard }` instead
of mutating module-level state as a hidden side-effect.
- `instanceCreated` is set at call sites in `beforeAll`, not hidden
inside `createBrevInstance()`.
- Dead `sleep()` helper removed.

Pure refactoring — no behavior changes. `// @ts-nocheck` pragma
preserved.

### Installer worktree fix

`scripts/install.sh` used `-d "${repo_root}/.git"` to detect source
checkouts, but in a git worktree `.git` is a file, not a directory. This
caused `is_source_checkout()` to return false, falling through to the
GitHub clone path. Fixed by using `-e` (exists) instead of `-d` (is
directory). This resolved 12 pre-existing test failures in
`test/install-preflight.test.ts`.

## Related Issue
Closes #1390

## Note
PR #1791 also touches `test/e2e/brev-e2e.test.ts` (appends a new
`vm-backend` test case). Clean merge — whoever merges second rebases
trivially.

## Type of Change
- [x] Code change for a new feature, bug fix, or refactor.

## Testing
- [x] `npx prek run --all-files` passes (all pre-commit and pre-push
hooks green).
- [x] `npx vitest run --project cli` — 1213 passed, 0 failed (was 12
failed before installer fix).


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Enhanced end-to-end test infrastructure and setup orchestration for
improved test reliability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Aaron Erickson <aerickson@nvidia.com>