Skip to content

feat(server): support out-of-tree compute drivers via --compute-driver-socket#1703

Open
st-gr wants to merge 5 commits into
NVIDIA:mainfrom
st-gr:feat/external-compute-driver-socket
Open

feat(server): support out-of-tree compute drivers via --compute-driver-socket#1703
st-gr wants to merge 5 commits into
NVIDIA:mainfrom
st-gr:feat/external-compute-driver-socket

Conversation

@st-gr
Copy link
Copy Markdown

@st-gr st-gr commented Jun 3, 2026

Summary

Adds an opt-in --compute-driver-socket=<path> flag (env OPENSHELL_COMPUTE_DRIVER_SOCKET) to the gateway. When set, the gateway dispatches sandbox lifecycle to an out-of-tree compute driver speaking the existing compute_driver.proto contract over a Unix domain socket, instead of one of the four built-in drivers (Kubernetes / Podman / Docker / VM).

Related Issue

Supersedes the closed draft #1604 (same author, same patch shape — re-opened as a non-draft after rebasing onto current main, with all commits DCO-signed and st-gr-attributed). Same direction as cheese-head's vouch in #1345 ("extend or provide new out-of-tree compute drivers to OpenShell").

No upstream tracking issue filed — happy to file one if reviewers prefer.

Changes

  • crates/openshell-core/src/config.rs: adds External(PathBuf) variant to ComputeDriverKind. Drops the Copy derive and const fn as_str(self) (both incompatible with PathBuf). FromStr accepts external:<path> (case-insensitive prefix) and rejects bare external with a message pointing at the CLI flag.
  • crates/openshell-server/src/cli.rs: adds --compute-driver-socket=<path> flag (env OPENSHELL_COMPUTE_DRIVER_SOCKET) on RunArgs. When set, pins ComputeDriverKind::External(<path>) and skips both --drivers and the auto-detection probe.
  • crates/openshell-server/src/config_file.rs: surfaces the new field through the config-file path.
  • crates/openshell-server/src/lib.rs / crates/openshell-server/src/compute/mod.rs: in build_compute_runtime, the External(<path>) arm connects a tonic::Channel to the UDS via hyper-util::TokioIo and wraps it in RemoteComputeDriver — the same proxy used by the in-tree VM driver.
  • architecture/compute-runtimes.md: adds an "External" row to the Runtime Summary table, describing the trust boundary (operator-owned UDS file permissions) and the activation flag.

Testing

  • mise run pre-commit passes — not run end-to-end (mise not on author's dev environment), but the equivalent rust pieces verified independently: cargo fmt --all -- --check clean; cargo clippy --no-deps -p openshell-core -p openshell-server --all-targets -- -D warnings clean.
  • Unit tests added/updated — 9 compute_driver_kind_* tests in crates/openshell-core/src/config.rs::tests cover the new variant: parses external:<path>, rejects bare external and empty path, displays as external:<path>, round-trips through FromStr+Display, and is case-insensitive on the prefix. Plus 3 CLI-arg tests for the new flag (presence, override-of-drivers, env-var binding).
  • E2E tests added/updated — none — running an External driver in CI would require shipping a stub driver binary; deferring until at least one in-tree consumer wants to test against it. The patch is exercised in production at SAP via the st-gr/openshell-driver-kyma reference driver running against a Gardener-managed Kyma cluster since 2026-05-28 (sandbox CR creation, openshell sandbox exec, openshell sandbox upload/download all work).

Checklist

  • Follows Conventional Commits (feat(core):, feat(server):, docs(core,server):, docs(arch):)
  • Commits are signed off (DCO)
  • Architecture docs updated — added "External" row to architecture/compute-runtimes.md::Runtime Summary. Per-runtime implementation-notes section is intentionally NOT extended for External because external drivers ship their own documentation; the reference implementation lives at st-gr/openshell-driver-kyma.

Notes for reviewers

Out of scope (intentional):

  • No new in-tree compute driver implementations. This PR only adds the protocol surface (External(PathBuf) + UDS dispatch). Operators bring their own driver process and point --compute-driver-socket at its socket path.
  • No compute_driver.proto changes. The existing protocol is reused verbatim — the same one already used by the in-tree VM driver to talk to its tonic peer.
  • No security model changes. UDS endpoint security is the operator's responsibility (filesystem permissions, sidecar isolation), matching the --drivers vm posture.
  • No new dependencies. hyper-util is already in the workspace Cargo.toml; the patch reuses it via the same TokioIo connector path the VM driver uses.

Operator context:

The forked gateway image at ghcr.io/st-gr/openshell-gateway (carrying these patches plus an in-tree Dockerfile.gateway) has been driving production-shaped Kubernetes clusters with the st-gr/openshell-driver-kyma compute driver since 2026-05-28. The flag has needed zero changes since it landed — the API shape is stable. Merging this PR lets the driver consume the unmodified upstream gateway image, retiring the fork.

@st-gr st-gr requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners June 3, 2026 05:27
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Jun 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 3, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@st-gr
Copy link
Copy Markdown
Author

st-gr commented Jun 3, 2026

I have read the DCO document and I hereby sign the DCO.

@st-gr
Copy link
Copy Markdown
Author

st-gr commented Jun 3, 2026

recheck

@drew drew self-assigned this Jun 3, 2026
@drew
Copy link
Copy Markdown
Collaborator

drew commented Jun 3, 2026

/ok to test 8689967

@st-gr st-gr force-pushed the feat/external-compute-driver-socket branch from 8689967 to 464018b Compare June 3, 2026 08:18
@st-gr
Copy link
Copy Markdown
Author

st-gr commented Jun 3, 2026

Pushed a doc_markdown fix (added backticks around compute_driver.proto and OPENSHELL_COMPUTE_DRIVER_SOCKET in the External variant's doc comment, folded into the originating feat(core): commit via autosquash).

/ok to test 464018b7 — superseded by a second force-push (redundant_pub_crate fix); see the follow-up comment below for the current SHA.

st-gr added a commit to st-gr/OpenShell that referenced this pull request Jun 3, 2026
NVIDIA/OpenShell's Cargo.toml has

    [workspace.lints.clippy]
    pedantic = "warn"
    nursery = "warn"
    all = "warn"

so `cargo clippy --workspace -- -D warnings` escalates pedantic lints
(including `doc_markdown`) to errors. Their CI uses this as part of
`mise run pre-commit`. The fork's existing `release-gateway.yml` only
runs `docker build` (which calls `cargo build --release` — no clippy),
so the same lints weren't enforced on the fork.

PR NVIDIA#1703 caught us with two `doc_markdown` violations
in `crates/openshell-core/src/config.rs:56` (`compute_driver.proto` and
`OPENSHELL_COMPUTE_DRIVER_SOCKET` missing backticks in the External
variant's doc comment). The fork's CI didn't catch it; NVIDIA's did.

This workflow runs on push to main + `feat/**` branches and on
workflow_dispatch. It pins to Rust 1.95.0 (matches upstream mise.toml)
and installs the same native deps `Dockerfile.gateway` needs
(libclang-dev + libz3-dev for transitive bindgen + z3-sys).

Workflow is light: ubuntu-latest runner, ~10-15 min cold, ~3-5 min
with the Swatinem/rust-cache hot. Path-filtered so doc-only changes
don't trigger it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
st-gr and others added 5 commits June 3, 2026 02:01
Carries the UDS path supplied by --compute-driver-socket. Drops Copy
from the enum derive (PathBuf is not Copy); existing callers use Clone
or owned values. FromStr accepts 'external:<path>' and rejects bare
'external' with a message pointing at the CLI flag.

Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
Adds an opt-in `--compute-driver-socket=PATH` flag (env
`OPENSHELL_COMPUTE_DRIVER_SOCKET`) on the gateway's `RunArgs`. When
set, the gateway pins `ComputeDriverKind::External(<path>)` and
skips both the `--drivers` list and the auto-detection probe. This
lets out-of-tree driver binaries (Kyma, custom backends) connect
to a stock gateway without a rebuild.

`effective_single_driver` and the `Config.compute_drivers` payload
both honour the new flag so pre-runtime checks and the runtime
factory dispatch agree on the configured driver. The companion
dispatch arm in `lib.rs::build_compute_runtime` is wired in the
follow-up commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
`build_compute_runtime` now connects a tonic `Channel` to the
configured Unix domain socket using `hyper-util`'s `TokioIo`
connector and wraps it in `RemoteComputeDriver` -- the same proxy
used by the VM driver. Replaces the placeholder
`External(_) => Err(...)` arm.

Adds two helpers in `compute/mod.rs`:

* `connect_external_compute_driver(socket_path)` -- a small
  tonic-Endpoint + tower::service_fn + UnixStream connector,
  parallel to the one in `compute::vm` but with no VM-specific
  logging or capability probing. Out-of-tree drivers manage their
  own readiness; the gateway just dials.
* `ComputeRuntime::new_remote_external(channel, ...)` -- mirrors
  `new_remote_vm` but takes no `ManagedDriverProcess`. The
  external driver's lifecycle is the operator's responsibility
  (systemd unit, sidecar container, etc.).

Smoke-tested: `--compute-driver-socket /tmp/nonexistent.sock`
now starts the gateway, logs "Connecting to external compute
driver", and fails with a clear "failed to connect to external
compute driver socket '<path>': transport error" message that
points at the new arm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
…uteDriverKind

Adds three short inline comments at the points where the
non-`Copy`-ness of `ComputeDriverKind` becomes visible to readers,
so reviewers don't have to ask "why was Copy removed?" or "why
clone() instead of *deref?":

- `crates/openshell-core/src/config.rs`: comment above the enum
  derive explaining the `Copy` and `const fn as_str(self)` revert
  was forced by the `External(PathBuf)` variant, and that all
  in-tree call sites already handle `Clone`.
- `crates/openshell-server/src/cli.rs`: comment at the
  `effective_single_driver` match arm noting why we clone instead
  of deref.
- `crates/openshell-server/src/lib.rs`: same comment shape at the
  `configured_compute_driver` match arm.

Squash into the corresponding earlier commit if preferred — split
out as its own commit to make the rationale grep-able.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
Adds a row to the Runtime Summary table describing the new
`--compute-driver-socket=<path>` flag and how the External arm relates
to the existing in-tree drivers (operator-managed driver process,
Unix domain socket, identical proto contract to the VM driver, same
trust boundary).

Keeps the rest of the document untouched — the per-runtime
implementation notes (e.g. which driver crate README to read) are
intentionally NOT extended for `External` because external drivers
ship their own documentation in their own repos. The reference
implementation lives at:

    https://github.com/st-gr/openshell-driver-kyma

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
@st-gr st-gr force-pushed the feat/external-compute-driver-socket branch from 464018b to 1c6663d Compare June 3, 2026 09:01
@st-gr
Copy link
Copy Markdown
Author

st-gr commented Jun 3, 2026

Force-pushed once more — caught a second pedantic clippy lint locally that the previous run didn't get to (redundant_pub_crate on connect_external_compute_driver). Folded into the originating feat(server): wire ... commit via autosquash so the PR stays at 5 commits.

New HEAD: 1c6663d. Verified clean with the full-workspace check NVIDIA's CI runs:

      cargo fmt --all -- --check
      cargo clippy --workspace --all-targets -- -D warnings

Could I get a re-vet with /ok to test 1c6663d3?

Apologies for the round-trip — I've added a rust-lint workflow to my fork to mirror your mise run pre-commit so future pushes catch these locally before they reach your runners.

Copy link
Copy Markdown
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking ahead to changes such as #1589 and support for multiple drivers in future, does it make sense to formulate this as a "named driver endpoint" where a user would supply a driver name and a socket path? Especially given that the in-tree vm driver is already an example of such a driver and we would only be extending the existing functionality.

In practice, the gateway would select a driver by name (instead of a kind enum) and one could use GetCapabilities.driver_name to validate the user-provided name if required.

@st-gr
Copy link
Copy Markdown
Author

st-gr commented Jun 3, 2026

@elezar
I read through #1589 / RFC 0005. The named-endpoint shape and GetCapabilities.driver_name validation both fit cleanly. Two ways to implement your suggestion in this PR:

A. Minimal add. Keep External(PathBuf) and add a paired --compute-driver-name=<name> flag. After the UDS connect, the gateway calls GetCapabilities, compares response.driver_name to the supplied name, and errors loudly on mismatch. The enum surface stays flat. Could serve as a stepping stone; when #1589's registry lands, it can replace the enum entirely without re-migrating this code.

B. Restructured External. Extend the variant to External { name: String, socket: PathBuf }. More visible signal of the multi-driver direction at the type level, but touches FromStr / Display / their tests.

Replacing ComputeDriverKind itself with a name-keyed registry (your second paragraph's "select a driver by name instead of a kind enum") reads to me as the actual #1589 implementation. I can work on that as a follow-up once the RFC has a concrete shape, but that's a much bigger refactor than this PR.

I'd lean (A) because it doesn't pre-commit the data model to a shape #1589 might want to rewrite, and it costs ~30 LOC. Either is fine, though. Let me know which you'd prefer to see here.

@elezar
Copy link
Copy Markdown
Member

elezar commented Jun 3, 2026

As a quick follow-up: What would it look like if we first migrate to implementing the vm driver as a specific-instance of a NamedDriver{ name: String, socket: PathBuf } where name == "vm" and then generalize from here?

Also note, that although #1589 is mentioned here as a motivator for restructuring how drivers are managed from a gateway perspective, moving to a name-keyed registry for managing drivers is not specifically related to the driver-specific config.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants