Summary
On macOS with the Homebrew-installed gateway, Docker-backed sandboxes can stay stuck in Provisioning even after the gateway is correctly forced to use Docker and sandbox JWT files are mounted. The root cause in this case was a stale locally cached Docker supervisor binary extracted from ghcr.io/nvidia/openshell/supervisor:dev.
The gateway was current enough to require gateway-minted sandbox JWT auth, but the cached supervisor binary was old enough that it still used the removed x-sandbox-secret mechanism. As a result, the token file was present and valid, but the supervisor never sent it as authorization: Bearer ....
Environment
- macOS Homebrew install
- Gateway: openshell-gateway 0.0.47-dev.12+g68d428055
- Docker Desktop daemon: 29.1.3, aarch64
- Docker context: docker-desktop
- Podman was also installed
- Initial /var/run/docker.sock pointed at Podman: /var/run/docker.sock -> /Users/tmutch/.local/share/containers/podman/machine/podman.sock
- Docker Desktop socket: unix:///Users/tmutch/.docker/run/docker.sock
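To check which runtime a socket path really leads to, something like the following sketch may help. The function names classify_socket_target and resolve_socket are made up for illustration and are not part of openshell or Docker. The classification is a heuristic on the path string only and does not talk to the daemon behind the socket.

```shell
#!/bin/sh
# Hypothetical helper: guess the runtime from a socket path by name only.
classify_socket_target() {
  case "$1" in
    *podman*) echo "podman" ;;
    *docker*) echo "docker" ;;
    *)        echo "unknown" ;;
  esac
}

# Hypothetical helper: follow one level of symlink, or return the path
# unchanged when it is not a symlink (or does not exist).
resolve_socket() {
  target="$(readlink "$1" 2>/dev/null)" || target="$1"
  printf '%s\n' "$target"
}

# The two socket paths from this report:
classify_socket_target "/Users/tmutch/.local/share/containers/podman/machine/podman.sock"
classify_socket_target "/Users/tmutch/.docker/run/docker.sock"
```

On an affected host, classify_socket_target "$(resolve_socket /var/run/docker.sock)" would be the call to try. Resolving the symlink first matters because the name /var/run/docker.sock alone says nothing about what it points to.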
How We Got Here
- The Homebrew gateway appeared to use Podman instead of Docker.
That matched the gateway auto-detection behavior: Kubernetes, then Podman, then Docker. Because podman was installed, the gateway selected Podman unless configured otherwise.
- We forced the gateway to Docker:
~/.config/openshell/gateway.toml:
[openshell.gateway]
compute_drivers = ["docker"]
~/.config/openshell/gateway.env:
OPENSHELL_DRIVERS=docker
DOCKER_HOST=unix:///Users/tmutch/.docker/run/docker.sock
The DOCKER_HOST part mattered because Docker's default socket path on this host pointed to Podman.
- After restart, sandboxes were created in Docker Desktop, not Podman, but still stayed in Provisioning.
The Docker container had:
OPENSHELL_SANDBOX_TOKEN_FILE=/etc/openshell/auth/sandbox.jwt
OPENSHELL_ENDPOINT=https://host.openshell.internal:17670/
and mounted:
~/.local/state/openshell/docker-sandbox-tokens/default/<sandbox-id>/sandbox.jwt
-> /etc/openshell/auth/sandbox.jwt
- Initially, the TLS directory copied for the Homebrew runtime did not contain the jwt/ subdirectory.
Gateway JWT keys existed under:
/opt/homebrew/var/openshell/tls/jwt/{signing.pem,public.pem,kid}
but the Homebrew runtime TLS copy under:
~/.local/state/openshell/homebrew/tls/
had CA/server/client TLS files only. Copying the jwt/ directory there and restarting enabled gateway JWT minting:
gateway-minted sandbox JWT enabled gateway_id=openshell ttl_secs=3600
minted sandbox JWT
- Even with a mounted JWT, the sandbox still failed.
Sandbox logs included:
Failed to fetch inference bundle, inference routing disabled
error: status: PermissionDenied, message: "GetInferenceBundle requires a sandbox principal"
NET:FAIL [LOW] host.openshell.internal:17670
The JWT itself decoded correctly:
{
"header": {
"alg": "EdDSA",
"kid": "45b4b366ae414387c0fa96717739ce35",
"typ": "JWT"
},
"claims": {
"aud": "openshell-gateway:openshell",
"iss": "openshell-gateway:openshell",
"sandbox_id": "<same sandbox id>",
"sub": "spiffe://openshell/sandbox/<same sandbox id>"
}
}
- Comparing against e2e/with-docker-gateway.sh showed why e2e worked.
The e2e wrapper writes a complete per-run Docker driver config and supplies a fresh matching supervisor binary via:
[openshell.drivers.docker]
supervisor_bin = "<freshly built openshell-sandbox>"
The Homebrew gateway instead used the default supervisor image path and extracted/cached a binary from:
ghcr.io/nvidia/openshell/supervisor:dev
The failing container bind-mounted:
~/.local/share/openshell/docker-supervisor/sha256-87103ad60110703cc8e29053acd5ce643058c2f28978ee8248d2ab694ee37114/openshell-sandbox
-> /opt/openshell/bin/openshell-sandbox
That cached binary reported:
openshell-sandbox 0.0.37-dev.160+g316c788ea
Source at that commit still used x-sandbox-secret in crates/openshell-sandbox/src/grpc_client.rs and did not contain the current OPENSHELL_SANDBOX_TOKEN_FILE / Bearer JWT auth path.
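For anyone repeating the JWT check above, a minimal sketch of decoding the claims segment in plain shell follows. The function names b64url_decode and jwt_claims are illustrative only. The sketch assumes a compact three-segment JWT and a base64 tool that accepts --decode (GNU coreutils and the macOS base64 both do). It does not verify the signature.

```shell
#!/bin/sh
# Hypothetical helper: decode one base64url segment (JWT header or claims).
b64url_decode() {
  # Map the base64url alphabet back to standard base64.
  seg="$(printf '%s' "$1" | tr '_-' '/+')"
  # Restore the '=' padding that compact JWTs strip.
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do
    seg="${seg}="
  done
  printf '%s' "$seg" | base64 --decode
}

# Hypothetical helper: print the claims (second dot-separated segment).
jwt_claims() {
  b64url_decode "$(printf '%s' "$1" | cut -d. -f2)"
}
```

Inside the sandbox container this could be run as jwt_claims "$(cat /etc/openshell/auth/sandbox.jwt)". A token that decodes cleanly, as it did here, only shows that the file is well formed. It says nothing about whether the supervisor actually sends it.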
Fix That Confirmed the Diagnosis
Pulling the current supervisor image and restarting the Homebrew gateway fixed provisioning:
docker pull ghcr.io/nvidia/openshell/supervisor:dev
brew services restart nvidia/openshell/openshell
After pull:
ghcr.io/nvidia/openshell/supervisor:dev
openshell-sandbox 0.0.47-dev.13+g57b71c68f
The gateway extracted a new cached supervisor:
~/.local/share/openshell/docker-supervisor/sha256-5742943b50ee5de76ed9da50f8383ce6805ca4d833a7271774b1bec8d8f365b9/openshell-sandbox
Fresh smoke test succeeded:
openshell sandbox create \
--name docker-smoke-after-pull-echo \
--from ghcr.io/nvidia/openshell-community/sandboxes/base:latest \
--no-keep --no-tty -- /bin/sh -lc 'echo supervisor-ok'
Output:
Created sandbox: docker-smoke-after-pull-echo
supervisor-ok
Deleted sandbox docker-smoke-after-pull-echo
Expected Behavior
The Homebrew-installed gateway should not silently use an incompatible stale supervisor binary for Docker sandboxes. If the gateway requires Bearer sandbox JWT auth, the selected supervisor binary should support that same auth protocol.
Possible Improvements
- Pin Homebrew's Docker supervisor image to an immutable tag/digest matching the gateway build instead of relying on floating dev.
- On gateway startup, log the selected supervisor image/digest and extracted supervisor binary version.
- Detect supervisor/gateway protocol mismatch before creating a sandbox, or fail with an explicit error instead of leaving the sandbox in Provisioning.
- Ensure the Homebrew wrapper copies the jwt/ directory along with TLS materials when it sets OPENSHELL_LOCAL_TLS_DIR.
- Consider making Docker driver resolution run docker pull for floating tags when appropriate, or document that users must refresh ghcr.io/nvidia/openshell/supervisor:dev after upgrading a dev Homebrew gateway.
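As a rough illustration of the mismatch detection idea, the sketch below compares the MAJOR.MINOR.PATCH core of the two version strings seen in this report. The helpers core_version and version_lt are hypothetical and not part of openshell. Version ordering is only a proxy for auth protocol compatibility, and the dev suffix and git hash are ignored, so this should be read as a starting point, not a real compatibility check.

```shell
#!/bin/sh
# Hypothetical helper: extract the first MAJOR.MINOR.PATCH from a version
# line such as "openshell-sandbox 0.0.37-dev.160+g316c788ea".
core_version() {
  printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Hypothetical helper: succeed if core version $1 is strictly older than $2.
version_lt() {
  old_ifs="$IFS"; IFS=.
  # Split both versions on '.', giving six fields: a1 a2 a3 b1 b2 b3.
  set -- $1 $2
  IFS="$old_ifs"
  if [ "$1" -ne "$4" ]; then [ "$1" -lt "$4" ]; return; fi
  if [ "$2" -ne "$5" ]; then [ "$2" -lt "$5" ]; return; fi
  [ "$3" -lt "$6" ]
}

# Version strings taken from this report.
gateway="openshell-gateway 0.0.47-dev.12+g68d428055"
supervisor="openshell-sandbox 0.0.37-dev.160+g316c788ea"

if version_lt "$(core_version "$supervisor")" "$(core_version "$gateway")"; then
  echo "supervisor is older than gateway"
fi
```

With the strings above, the supervisor core version 0.0.37 compares as older than the gateway's 0.0.47, which matches the stale cache found in this case. An explicit protocol or capability handshake would be more reliable than comparing version numbers.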
Related
This is adjacent to, but different from, #1519. In this case, after DOCKER_HOST was pinned to Docker Desktop, containers were created in Docker and the remaining failure was the stale supervisor binary/auth protocol mismatch.