openshell-driver-vm 0.0.39 on macOS: sandbox init's chown -R over virtio-fs fails, VM exits status 1, sandbox never reaches Ready #1356

@sandl99

Agent Diagnostic

Loaded debug-openshell-cluster skill from NVIDIA/OpenShell.

Step 1 — CLI reachability:
$ openshell gateway info
Gateway: nemoclaw → http://127.0.0.1:8080
$ openshell status
Status: Connected Version: 0.0.39
Gateway is healthy. The failure is in sandbox creation, not gateway reachability.

Step 2 — Compute platform: VM driver (libkrun + Apple Virtualization.framework
on Apple Silicon). Confirmed via which openshell-driver-vm and
openshell-driver-vm --version → 0.0.39.

Step 6 — VM-backed gateway checks:
- openshell-driver-vm process: spawned per gateway log, launcher_pid=74518.
- Rootfs: extracted on host at
~/.local/state/nemoclaw/openshell-docker-gateway/vm-driver/sandboxes/
1fc9c060-3e28-4b1d-ab24-828c966edb96/rootfs/ — owned by host uid 502.
- Host virtualization: Apple M4, Virtualization.framework + libkrun.
- Supervisor callback: never reached — PID 1 exits at 0.079s before the
supervisor starts.

Sandbox console (rootfs-console.log) shows init aborting:

  [0.039s] detected eth0 (gvproxy networking)
  [0.047s] no DHCP client, using static config
  [0.079s] fixing /sandbox ownership (was uid=502, setting to sandbox=998:998)
  chown: changing ownership of '/sandbox/.bashrc': Permission denied
  chown: changing ownership of '/sandbox/.profile': Permission denied

Gateway log loops every ~60s until consumer timeout:

  openshell_server::compute: Sandbox failed to become ready
    sandbox_id=1fc9c060-3e28-4b1d-ab24-828c966edb96
    reason=ProcessExited VM process exited with status 1
  (repeated 8× at ~60s intervals)

Tried locally:
- openshell-driver-vm --help — no flag to skip the chown step
(only --vcpus / --mem-mib / --gpu / --bind-address / --state-dir / TLS).
- Located the offending step in source:
crates/openshell-driver-vm/scripts/openshell-vm-sandbox-init.sh:398-410

      # Fix /sandbox ownership. The host-side CLI extracts OCI layers as a
      # non-root user (e.g. UID 501 on macOS), so /sandbox may be owned by
      # the host UID.
      if [ -d /sandbox ]; then
          _sb_uid=$(id -u sandbox 2>/dev/null || true)
          _sb_gid=$(id -g sandbox 2>/dev/null || true)
          if [ -n "$_sb_uid" ] && [ -n "$_sb_gid" ]; then
              _cur_uid=$(stat -c '%u' /sandbox 2>/dev/null || true)
              if [ -n "$_cur_uid" ] && [ "$_cur_uid" != "$_sb_uid" ]; then
                  ts "fixing /sandbox ownership (was uid=${_cur_uid}, ...)"
                  chown -R "${_sb_uid}:${_sb_gid}" /sandbox
              fi
          fi
      fi

  Script shebang/header (lines 1 and 10): `#!/bin/bash` and `set -euo pipefail`.
  Under `set -e`, the chown's non-zero exit status aborts the script, which
  runs as PID 1, so the VM exits with status 1.

- Confirmed /sandbox is shared into the guest via virtio-fs:
  crates/openshell-driver-vm/src/runtime.rs:21 comment
  "gvproxy for libkrun, virtiofsd for QEMU".

- For comparison, the Linux Docker driver does NOT hit this path because
  the in-sandbox PID 1 there is the Rust `openshell-sandbox` supervisor,
  bind-mounted into the container, which handles host-uid extractions
  correctly. Only the VM driver uses the bash init script with `chown -R`.
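
The virtio-fs item in the list above rests on a source comment. It could also be checked from inside the guest by asking which filesystem backs /sandbox. The following is a minimal sketch, assuming the guest exposes a standard /proc/mounts and has awk available; `fs_type_for_path` is a hypothetical helper, not part of OpenShell.

```shell
#!/bin/bash
# Hypothetical helper (not part of OpenShell): print the filesystem type that
# backs path $1, using a /proc/mounts-style table read from $2.
# The longest mount point that is a prefix of the path wins, so a directory
# such as /sandbox resolves to the root mount when it has no mount of its own.
fs_type_for_path() {
    local path=$1 mounts=${2:-/proc/mounts}
    awk -v p="$path" '
        {
            mp = $2
            # A mount point matches if it is "/", the path itself, or a parent directory.
            if (mp == "/" || p == mp || index(p, mp "/") == 1) {
                # ">=" lets a later mount on the same mount point shadow an earlier one.
                if (length(mp) >= best_len) { best_len = length(mp); best = $3 }
            }
        }
        END { if (best != "") print best }
    ' "$mounts"
}
```

Running `fs_type_for_path /sandbox` in the guest would be expected to report `virtiofs` if the analysis in this issue is right. That output has not been captured here.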

Why my agent could not resolve it: the failing code is in the upstream init
script bundled with openshell-driver-vm. There is no env var or flag to
disable it, and no consumer-side patch (image Dockerfile, build args) can
prevent the chown from running — virtio-fs reflects host uids back into the
guest regardless of what the image's chown layers set.

Description

On macOS arm64 using openshell-driver-vm 0.0.39 (libkrun backend), the VM
sandbox-init script runs chown -R sandbox:sandbox /sandbox against a
virtio-fs-shared rootfs. virtio-fs returns EACCES for ownership changes to
a uid (998) that does not exist on the host, even when the in-guest caller
is root. Combined with set -euo pipefail at the top of the script, the
chown failure kills PID 1 at 0.079s and the VM exits with status 1.
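
The interaction between `set -euo pipefail` and a failing command can be reproduced on any machine with bash, without a VM. The sketch below is illustrative only: `false` stands in for the failing chown, and `run_strict` is a hypothetical helper name.

```shell
#!/bin/bash
# Run a snippet under the same shell options as the init script.
# `false` stands in for the failing chown; no VM or virtio-fs is involved.
run_strict() {
    bash -c "set -euo pipefail; $1"
}

# Unguarded: the script stops at the failing command and never prints "after".
unguarded_status=0
run_strict 'echo before; false; echo after' || unguarded_status=$?
echo "unguarded exit status: $unguarded_status"

# Guarded: the failure is handled, so the script continues to the end.
guarded_status=0
run_strict 'echo before; false || echo "warning: step failed"; echo after' || guarded_status=$?
echo "guarded exit status: $guarded_status"
```

The first run stops at the failing command and reports status 1. The second prints the warning, continues, and reports status 0.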

The gateway then reports "Sandbox failed to become ready
reason=ProcessExited VM process exited with status 1" every ~60s. The
sandbox never reaches Ready and the calling project (NemoClaw onboard)
times out and tears the sandbox down.

Expected: sandbox-init tolerates virtio-fs ownership semantics on macOS
(detect virtio-fs and skip the chown, or treat EACCES as a warning, or
use a uid-mapped virtio-fs mount option). The Rust openshell-sandbox
supervisor already handles host-uid extractions correctly on the Linux
Docker driver path — porting that strategy to the VM init script would
fix this.
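
Of the options above, "treat EACCES as a warning" is the smallest change. The following is a minimal sketch of that option, not the maintainers' fix: `chown_sandbox_tolerant` is a hypothetical function name, and plain `echo` is used where the real script calls its own `ts` logger.

```shell
#!/bin/bash
# Hypothetical replacement for the bare `chown -R` call. A failure is logged
# and swallowed, so `set -e` does not terminate PID 1.
chown_sandbox_tolerant() {
    local owner=$1 dir=$2
    # A command tested by `if` is exempt from errexit, so a non-zero chown
    # status reaches the else-branch instead of aborting the script.
    if chown -R "$owner" "$dir" 2>/dev/null; then
        echo "ownership of $dir set to $owner"
    else
        echo "warning: chown -R $owner $dir failed; continuing with existing ownership" >&2
    fi
}
```

In the init script the call would look like `chown_sandbox_tolerant "${_sb_uid}:${_sb_gid}" /sandbox`. This only keeps PID 1 alive. Whether the sandbox user can then work in a /sandbox still owned by the host uid is a separate question that this sketch does not answer.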

Reproduction Steps

  1. Apple Silicon macOS host (tested on M4, Darwin 24.1.0).
  2. Install OpenShell 0.0.39 (openshell, openshell-gateway, openshell-driver-vm).
  3. Start the standalone gateway (e.g. NemoClaw onboard does this; or
    openshell-gateway directly).
  4. Create any sandbox whose image runs chown root:root /sandbox/.bashrc /sandbox/.profile in a build layer (e.g. NemoClaw's Dockerfile.base):
    openshell sandbox create --image <image>
    or
    nemoclaw onboard --non-interactive
  5. Gateway calls CreateSandbox; openshell-driver-vm extracts rootfs to a
    host-side directory as the macOS user, then boots libkrun and shares
    the rootfs via virtio-fs.
  6. Inside the guest, /init runs and at 0.079s prints:
    fixing /sandbox ownership (was uid=502, setting to sandbox=998:998)
    chown: changing ownership of '/sandbox/.bashrc': Permission denied
    then exits status 1.
  7. Sandbox phase: Provisioning → Error. Subsequent restarts repeat the same
    failure. Sandbox never reaches Ready.
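
To tell this failure apart from other causes of "VM process exited with status 1", the console log can be checked for the two messages quoted in step 6. The following is a minimal sketch, assuming the path to rootfs-console.log is passed in by the caller; `console_log_shows_chown_failure` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the console log at $1 contains both the
# ownership-fix message and a chown "Permission denied" error.
# It does not check the order in which the two messages appear.
console_log_shows_chown_failure() {
    local log=$1
    grep -q 'fixing /sandbox ownership' "$log" &&
        grep -q 'chown: changing ownership of .*: Permission denied' "$log"
}
```

A caller might use it as `console_log_shows_chown_failure path/to/rootfs-console.log && echo "matches this issue"`.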

Environment

  • OS: macOS 15.x (Darwin 24.1.0)
  • Hardware: Apple Silicon M4, 24 GiB unified memory
  • openshell: 0.0.39 (~/.local/bin/openshell)
  • openshell-gateway: 0.0.39 (~/.local/bin/openshell-gateway)
  • openshell-driver-vm: 0.0.39 (~/.local/bin/openshell-driver-vm)
  • openshell-sandbox: not installed (Linux-only — expected on macOS)
  • Docker: Colima (used by the consumer project for image builds only; the VM
    runs via libkrun directly on the host, so Colima is not involved at sandbox runtime)
  • Consumer project (for repro): NVIDIA/NemoClaw main running
    nemoclaw onboard --non-interactive

Logs

# rootfs-console.log (in-guest PID 1 output)
  [0.039s] detected eth0 (gvproxy networking)
  [0.047s] no DHCP client, using static config
  [0.079s] fixing /sandbox ownership (was uid=502, setting to sandbox=998:998)
  chown: changing ownership of '/sandbox/.bashrc': Permission denied
  chown: changing ownership of '/sandbox/.profile': Permission denied

  # Gateway log excerpts (relevant frames only)
  2026-05-13T05:00:08.260877Z  INFO openshell_driver_vm::driver: vm driver: create_sandbox received
    sandbox_id=1fc9c060-3e28-4b1d-ab24-828c966edb96 sandbox_name=my-assistant
  2026-05-13T05:01:10.117964Z  INFO openshell_driver_vm::driver: vm driver: rootfs prepared
    image_identity=sha256:f6917eb1f12be758d3b74f01fe2192a6dee1a5e3444a1abf2f786aee1987bf36
  2026-05-13T05:01:10.118445Z  INFO openshell_driver_vm::driver: vm driver: spawning VM launcher
    launcher=/Users/sdang/.local/bin/openshell-driver-vm
  2026-05-13T05:01:15.914544Z  INFO openshell_server::compute: Sandbox phase changed
    old_phase=Provisioning new_phase=Error
  2026-05-13T05:01:15.914591Z  WARN openshell_server::compute: Sandbox failed to become ready
    sandbox_id=1fc9c060-3e28-4b1d-ab24-828c966edb96 reason=ProcessExited VM process exited with status 1
  # (the WARN above repeats 7 more times at ~60s intervals until the consumer tears the sandbox down)

Agent-First Checklist

  • I pointed my agent at the repo and had it investigate this issue
  • I loaded relevant skills (e.g., debug-openshell-cluster, debug-inference, openshell-cli)
  • My agent could not resolve this — the diagnostic above explains why
