Problem Statement
Current OpenShell compute drivers (Docker, Podman, MicroVM, Kubernetes) can be difficult to integrate into large-scale HPC environments (e.g., 20k+ node Slurm/LSF deployments) due to a few practical constraints:
- **Infrastructure Policies:** Enterprise IT guidelines frequently restrict the deployment of new daemons or orchestrators across compute nodes.
- **NFS Compatibility:** Rootless Podman relies on fuse-overlayfs and user-namespace UID mappings. These can conflict with standard NFS setups and often require global `/etc/subuid` provisioning across the cluster.
- **Runtime Overhead:** MicroVMs introduce virtualization overhead that may not be ideal for high-throughput, latency-sensitive batch jobs.
It would be highly beneficial to explore a daemonless compute driver that enables lightweight deployments while naturally aligning with existing NFS permissions and scheduler workflows.
Proposed Design
We propose a new daemonless compute driver that leverages HPC-native or lightweight sandboxing primitives.
Possible Underlying Technologies:
- **Apptainer:** As a widely adopted tool in HPC, it handles NFS and unprivileged execution well without requiring cluster-wide infrastructure changes. An Apptainer driver would need to explore how smoothly it integrates with OpenShell's Landlock/seccomp policy engine while minimizing container-image/OCI overhead.
- **Bubblewrap (bwrap):** A low-level primitive that could serve as a lightweight host-process wrapper. It avoids OCI overhead and allows direct policy hook-ins, though it requires building custom lifecycle management and offers a lighter isolation boundary.
Alternatives Considered
- **Rootless Podman:** Currently challenging to deploy in these specific environments without cluster-wide infrastructure modifications, largely due to NFS storage graph limitations and host-level `/etc/subuid` dependencies.
Agent Investigation
Investigation: Apptainer and bubblewrap as unprivileged-host backends
Our agents evaluated both runtimes named in the issue on a representative locked-down host (Rocky 8.9, NFSv3 `$HOME`, no `/etc/subuid`, cgroup v1, no sudo). This is the same host where rootless Podman v5.8.2 fails before launch (NFS xattrs, missing subuid entries, no cgroup limits on v1).
TL;DR: both runtimes are viable backends on this host class with the same network model — `unshare-net` plus a host-side forward-proxy UNIX socket bind-mounted in. Apptainer keeps the driver thin (built-in OCI pull + `instance` lifecycle + `--nv`); bubblewrap keeps the host footprint tiny, but the driver has to supply OCI/lifecycle/networking glue. Both share the "no multi-uid images without `/etc/subuid`" and "no cgroup limits on v1" limitations.
Comparison at a glance
| Capability | Apptainer | Bubblewrap |
| --- | --- | --- |
| OCI image pull | Built-in (`docker://`) | Driver-supplied (skopeo/umoci, or delegate to Apptainer) |
| Long-lived sandbox lifecycle | Built-in (`instance start` / `instance list`) | Driver-supplied (PID + state files, polling) |
| `unshare-net` + bind-mounted forward-proxy socket | Validated | Validated |
| Workspace + supervisor bind mounts | Built-in flags | Built-in flags |
| Single-uid mode on hosts without `/etc/subuid` | Default behavior | `--unshare-user --uid 0 --gid 0` |
| Capability dropping / seccomp BPF hooks | Limited surface | First-class (`--cap-drop`, seccomp BPF) |
| Resource limits (CPU/mem) | None directly; cgroup v2 best-effort | None directly; cgroup v2 best-effort via `systemd-run` |
| GPU passthrough | `--nv` (built-in) | Driver-supplied (`--dev-bind /dev/nvidia*` + lib mounts) |
| Host footprint | ~250 MB relocatable user install | Few hundred KB, often pre-installed |
| Driver surface area | Thin (translate spec → CLI) | Thick (rootfs cache, lifecycle, networking, limits) |
Network model (validated end-to-end under bwrap)
The sandbox is started with the netns unshared; a host-side OpenShell forward proxy is bind-mounted in over a UNIX socket; `HTTPS_PROXY` is injected so every outbound call flows through gateway policy. From the supervisor inside the bwrap sandbox:

```
[PASS] direct TCP to 1.1.1.1:443: blocked: [Errno 101] Network is unreachable
[PASS] CONNECT example.com:443 (expect 200): HTTP/1.1 200 OK
[PASS] CONNECT forbidden.example.net:443 (expect 403): HTTP/1.1 403 Forbidden
```
An Apptainer driver would use the same shape (`--network none` + bind-mounted socket): no bridge, no published ports, no admin policy needed.
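The `HTTPS_PROXY` injection above implies a small in-sandbox bridge from a loopback TCP port to the bind-mounted proxy UNIX socket, so ordinary proxy-aware clients need no UNIX-socket support. A minimal sketch of that bridge; the socket path and port here are illustrative assumptions, not the actual supervisor code:

```python
import socket
import threading


def _pump(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes one way until EOF, then close the write side."""
    try:
        while data := src.recv(65536):
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass


def serve_bridge(proxy_socket: str, port: int) -> None:
    """Accept loopback TCP clients (HTTPS_PROXY=http://127.0.0.1:<port>)
    and relay each connection to the bind-mounted proxy UNIX socket."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", port))
    listener.listen(64)
    while True:
        client, _ = listener.accept()
        upstream = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        upstream.connect(proxy_socket)  # e.g. /run/openshell/proxy.sock (assumed)
        threading.Thread(target=_pump, args=(client, upstream), daemon=True).start()
        threading.Thread(target=_pump, args=(upstream, client), daemon=True).start()
```

Because the netns is unshared, loopback is the only network the sandbox sees, so this port cannot be reached from outside the sandbox.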
Shared design that lets either runtime work here
- Connectivity: `unshare-net` + host-side forward proxy over a bind-mounted UNIX socket.
- Storage and supervisor delivery via bind mounts, replacing libpod named volumes / `image_volumes`.
- Lifecycle: `apptainer instance start` + polling (Apptainer), or PID + state-file tracking with inotify/polling (bwrap).
- CPU/memory limits as best-effort: only enforce on unified cgroup v2 (e.g. `systemd-run --user --scope`); otherwise document and skip.
- No multi-uid images without `/etc/subuid` — the same blocker as rootless Podman; both runtimes affected. Documented limitation.
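The PID + state-file lifecycle piece can be sketched as plain polling, using signal 0 as the liveness probe. Everything here (state directory, file layout, function names) is an assumption for illustration, not the OpenShell driver API:

```python
import json
import os
import signal
import subprocess
from pathlib import Path

# Assumed state location for the sketch; a real driver would use its own dir.
STATE_DIR = Path("/tmp/openshell-bwrap-state")


def create_sandbox(sandbox_id: str, argv: list[str]) -> int:
    """Launch the sandbox command and record its PID in a state file."""
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    proc = subprocess.Popen(argv)
    state = {"id": sandbox_id, "pid": proc.pid, "argv": argv}
    (STATE_DIR / f"{sandbox_id}.json").write_text(json.dumps(state))
    return proc.pid


def sandbox_alive(sandbox_id: str) -> bool:
    """Poll liveness: signal 0 checks that the PID exists without killing it."""
    path = STATE_DIR / f"{sandbox_id}.json"
    if not path.exists():
        return False
    pid = json.loads(path.read_text())["pid"]
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False


def stop_sandbox(sandbox_id: str) -> None:
    """SIGTERM the recorded PID and drop the state file."""
    path = STATE_DIR / f"{sandbox_id}.json"
    pid = json.loads(path.read_text())["pid"]
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        pass
    path.unlink()
```

A watch loop would poll `sandbox_alive` (or use inotify on the state dir) and emit events when a PID disappears; PID-reuse races would need a start-time check in a production driver.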
Suggested next steps
- Unblock end-to-end testing on the target host class: the v0.0.42 release binaries require glibc ≥ 2.39 / `GLIBCXX_3.4.31`, which excludes RHEL 8 / Rocky 8 / SLES 15 (glibc 2.28). An `x86_64-unknown-linux-musl` artifact matching the existing aarch64 musl one would unblock testing of any HPC driver.
- Apptainer driver spike: `apptainer instance start` with the real OpenShell supervisor image and the gateway's forward proxy bind-mounted as a UNIX socket, plus the small in-sandbox loopback→socket bridge for `HTTPS_PROXY=http://127.0.0.1:N` clients.
- Bubblewrap driver spike: promote the captured `bwrap` argv into a real spec builder (mirroring `openshell-driver-podman::build_container_spec`) and wire `CreateSandbox`/`StopSandbox`/`WatchSandboxes` against PID + state files.
- Re-run capability probes on representative hosts (HPC login node, workstation, admin-enabled cluster) before committing to driver API guarantees.
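As a sketch of the spec-builder step, here is a hypothetical translation from a minimal spec to the bwrap flags validated in the probes. The `SandboxSpec` fields and in-sandbox target paths are assumptions for illustration, not the real `openshell-driver-podman` types:

```python
from dataclasses import dataclass, field


@dataclass
class SandboxSpec:
    """Hypothetical spec shape; field names are illustrative."""
    rootfs: str               # staged OCI rootfs directory
    workspace: str            # host dir, mounted RW at /sandbox
    supervisor: str           # host binary, mounted RO
    proxy_socket: str         # host UNIX socket for the forward proxy
    env: dict[str, str] = field(default_factory=dict)


def build_bwrap_argv(spec: SandboxSpec, entrypoint: list[str]) -> list[str]:
    """Translate a spec into the bwrap invocation exercised in the spike."""
    argv = [
        "bwrap",
        # Single-uid map: works without /etc/subuid (same limit as the probes)
        "--unshare-user", "--uid", "0", "--gid", "0",
        "--unshare-pid", "--unshare-ipc", "--unshare-uts",
        "--unshare-net",  # netns off; the proxy socket is the only egress
        "--ro-bind", spec.rootfs, "/",
        "--proc", "/proc", "--dev", "/dev",
        "--tmpfs", "/tmp", "--tmpfs", "/run",
        "--bind", spec.workspace, "/sandbox",
        "--ro-bind", spec.supervisor, "/usr/local/bin/supervisor",
        "--ro-bind", spec.proxy_socket, "/run/openshell/proxy.sock",
        "--cap-drop", "ALL",
    ]
    for key, value in spec.env.items():
        argv += ["--setenv", key, value]
    return argv + entrypoint
```

A real builder would also take seccomp FDs, extra mounts, and the pre-created mount points into account; this only captures the flag shape from the probe tables.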
Per-runtime probe results (full tables)
Apptainer
| Area | Result |
| --- | --- |
| OCI `docker://` pull + exec | PASS |
| Bind RW → `/sandbox` workspace | PASS |
| Bind RO → supervisor binary path | PASS |
| `--env` injection (`OPENSHELL_*` parity) | PASS |
| Bind RO → `/run/secrets` | PASS |
| `--fakeroot` → uid 0 inside | PASS in root-mapped userns mode (single-uid map); subuid-backed fakeroot fails — no `/etc/subuid` entry |
| `apptainer instance start` (long-lived sandbox) | PASS |
| `--network none` | PASS |
| Bind RO of a host UNIX socket into sandbox | PASS |
| `--mount type=tmpfs` for `/run/netns` | FAIL — workaround: bind a host dir → `/run/netns` (PASS) |
| `--cpus` cgroup limit | FAIL — requires unified cgroup v2 |
| Bridge `--net` + reach default gateway | FAIL — not needed under the chosen network model |
Bubblewrap
| Area | Result |
| --- | --- |
| OCI rootfs staging (`apptainer build --sandbox` in the spike; skopeo+umoci also viable) | PASS — `python:3.11-slim` extracted in ~2.5 s |
| `--unshare-{user,pid,ipc,uts}` + single-uid map | PASS |
| Bind RW workspace, RO supervisor, `--proc /proc --dev /dev --tmpfs /tmp --tmpfs /run` | PASS |
| `--cap-drop ALL` + `--setenv OPENSHELL_*` | PASS |
| `--unshare-net` (direct TCP unreachable) | PASS |
| Bind RO host UNIX socket + allowlisted CONNECT (expect 200) | PASS |
| Bind RO host UNIX socket + denied CONNECT (expect 403) | PASS |
| Pre-existing mount points in rootfs before `--ro-bind <rootfs> /` | REQUIRED — driver must pre-create them |
| OCI image pull / lifecycle / event stream / port publish | N/A — driver supplies these |
| CPU/memory cgroup limits | FAIL on cgroup v1; on v2, wrap bwrap in `systemd-run --user --scope` |
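The "pre-create mount points" requirement is mechanical and can live in the driver's rootfs-staging step. A sketch, with an assumed (not authoritative) list of targets matching the bind mounts discussed above:

```python
from pathlib import Path

# Mount targets the sandbox invocation expects inside the read-only rootfs.
# The exact list depends on the driver's spec; these are illustrative.
DIR_POINTS = ["proc", "dev", "tmp", "run", "sandbox", "run/openshell"]
FILE_POINTS = ["usr/local/bin/supervisor", "run/openshell/proxy.sock"]


def prepare_rootfs(rootfs: str) -> None:
    """Pre-create mount-point dirs and files so the bind/tmpfs/proc/dev
    targets exist before the rootfs is mounted read-only."""
    for rel in DIR_POINTS:
        Path(rootfs, rel).mkdir(parents=True, exist_ok=True)
    for rel in FILE_POINTS:
        target = Path(rootfs, rel)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch(exist_ok=True)
```

Running this once per staged rootfs (e.g. right after OCI extraction, before caching) keeps the launch path free of first-run failures.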
Environment: Rocky 8.9, x86_64, kernel 6.1, glibc 2.28, `$HOME` on NFSv3, no `/etc/subuid` entry, cgroup v1, no sudo. Apptainer 1.4.5 via the unprivileged installer; bubblewrap 0.4.0 system package.
Full per-runtime detail, artifact paths, and the Podman-fails-here context: `experiments/apptainer-smoke/GITHUB-ISSUE-AGENT-INVESTIGATION.md`, with companion artifacts in `experiments/apptainer-smoke/` and `experiments/bubblewrap-spike/`.
Checklist