refactor(gateway): move driver options into config#1394
Merged
drew merged 1 commit intoMay 15, 2026
Conversation
Signed-off-by: Drew Newberry <anewberry@nvidia.com>
62302a7 to
6d49eb8
Compare
|
🌿 Preview your docs: https://nvidia-preview-pr-1394.docs.buildwithfern.com/openshell |
TaylorMutch
approved these changes
May 15, 2026
TaylorMutch
added a commit
that referenced
this pull request
May 15, 2026
…esses The gateway CLI flags moved into TOML config tables in 560550d (#1394), which made every existing e2e/with-{docker,podman}-gateway.sh invocation fail with "unexpected argument '--sandbox-namespace'" (and a long tail of similar driver-specific options) before the gateway could even bind. Replace the obsolete CLI flags with a synthesized `[openshell.drivers.<driver>]` table written to `${STATE_DIR}/gateway.toml` and passed via the new `--config` flag. Only the gateway-wide flags that survived 560550d (bind-address, port, drivers, db-url, tls-*, disable-tls, log-level, health-port) stay on the command line. Both scripts get a small `toml_string` helper to properly TOML-quote the values (the previous `%q` printf format produced bash-escape, not TOML-escape). The Docker harness also corrects two field names that diverged from the driver schema: `docker_network_name` → `network_name`, and the supervisor binary/image plumbing now reads through to `supervisor_bin` / `supervisor_image` in the same table. The Podman harness drops `--ssh-gateway-port` (deleted in 560550d — gRPC + SSH are multiplexed on the same port now) and substitutes `network_name` + `gateway_port` for the obsolete `--sandbox-namespace` (which the Podman driver never had as a typed field).
TaylorMutch
pushed a commit
that referenced
this pull request
May 15, 2026
Signed-off-by: Drew Newberry <anewberry@nvidia.com>
TaylorMutch
added a commit
that referenced
this pull request
May 15, 2026
…esses The gateway CLI flags moved into TOML config tables in 560550d (#1394), which made every existing e2e/with-{docker,podman}-gateway.sh invocation fail with "unexpected argument '--sandbox-namespace'" (and a long tail of similar driver-specific options) before the gateway could even bind. Replace the obsolete CLI flags with a synthesized `[openshell.drivers.<driver>]` table written to `${STATE_DIR}/gateway.toml` and passed via the new `--config` flag. Only the gateway-wide flags that survived 560550d (bind-address, port, drivers, db-url, tls-*, disable-tls, log-level, health-port) stay on the command line. Both scripts get a small `toml_string` helper to properly TOML-quote the values (the previous `%q` printf format produced bash-escape, not TOML-escape). The Docker harness also corrects two field names that diverged from the driver schema: `docker_network_name` → `network_name`, and the supervisor binary/image plumbing now reads through to `supervisor_bin` / `supervisor_image` in the same table. The Podman harness drops `--ssh-gateway-port` (deleted in 560550d — gRPC + SSH are multiplexed on the same port now) and substitutes `network_name` + `gateway_port` for the obsolete `--sandbox-namespace` (which the Podman driver never had as a typed field).
TaylorMutch
pushed a commit
that referenced
this pull request
May 15, 2026
Signed-off-by: Drew Newberry <anewberry@nvidia.com>
TaylorMutch
added a commit
that referenced
this pull request
May 15, 2026
…esses The gateway CLI flags moved into TOML config tables in 560550d (#1394), which made every existing e2e/with-{docker,podman}-gateway.sh invocation fail with "unexpected argument '--sandbox-namespace'" (and a long tail of similar driver-specific options) before the gateway could even bind. Replace the obsolete CLI flags with a synthesized `[openshell.drivers.<driver>]` table written to `${STATE_DIR}/gateway.toml` and passed via the new `--config` flag. Only the gateway-wide flags that survived 560550d (bind-address, port, drivers, db-url, tls-*, disable-tls, log-level, health-port) stay on the command line. Both scripts get a small `toml_string` helper to properly TOML-quote the values (the previous `%q` printf format produced bash-escape, not TOML-escape). The Docker harness also corrects two field names that diverged from the driver schema: `docker_network_name` → `network_name`, and the supervisor binary/image plumbing now reads through to `supervisor_bin` / `supervisor_image` in the same table. The Podman harness drops `--ssh-gateway-port` (deleted in 560550d — gRPC + SSH are multiplexed on the same port now) and substitutes `network_name` + `gateway_port` for the obsolete `--sandbox-namespace` (which the Podman driver never had as a typed field).
TaylorMutch
added a commit
that referenced
this pull request
May 15, 2026
* feat(gateway): add TOML configuration file (RFC 0003)
Introduces an opt-in --config / OPENSHELL_GATEWAY_CONFIG flag that loads a
TOML file with gateway-wide settings and per-driver tables. Source
precedence is CLI > env > file > built-in default, implemented via clap's
ValueSource so existing flags and env vars keep their priority.
Driver crates (kubernetes, docker, podman, vm) now derive Deserialize on
their config structs. SupervisorSideloadMethod gains Deserialize with
kebab-case rename. A per-driver inheritance allowlist on the loader side
overlays [openshell.gateway] shared defaults (default_image,
supervisor_image, image_pull_policy, guest_tls_*, ssh_handshake_skew_secs,
client_tls_secret_name, host_gateway_ip, enable_user_namespaces) onto
each [openshell.drivers.<name>] table before deserialization.
The Helm chart renders a new gateway-config ConfigMap and mounts it at
/etc/openshell/gateway.toml. The migrated OPENSHELL_* env entries are
dropped from the StatefulSet — only the Secret-backed
OPENSHELL_SSH_HANDSHAKE_SECRET remains. database_url stays on --db-url.
Adds examples/gateway/gateway.example.toml and updates architecture/gateway.md
with the source precedence and inheritance rules.
* docs(gateway): drop ssh_handshake_skew_secs and ssh_handshake_secret from examples
Both fields are scheduled for removal. Remove the example values and the
env-only note so the gateway.toml example and the architecture doc stop
recommending settings that will not exist much longer.
* docs(rfc): correct OPENSHELL_CONFIG to OPENSHELL_GATEWAY_CONFIG in RFC 0003
* docs(gateway): add per-driver TOML example configurations
Adds focused single-driver examples next to the comprehensive
gateway.example.toml: kubernetes, docker, podman, and microvm. Each one
demonstrates the realistic settings for that driver plus how shared
[openshell.gateway] defaults inherit into the driver table.
A new unit test (`checked_in_examples_parse`) loads every example through
the config_file loader so schema drift fails CI rather than silently
shipping a broken example.
* refactor(gateway): drop image_pull_policy from shared inheritance
Kubernetes and Podman use mutually-incompatible vocabularies for the same
TOML key:
- Kubernetes: `Always | IfNotPresent | Never` (free-form string passed
verbatim to the K8s API).
- Podman: `always | missing | never | newer` (strict lowercase enum
deserialised into `ImagePullPolicy`).
No value means the same thing in both drivers. Sharing the key at
`[openshell.gateway]` scope and inheriting it into every active driver's
table meant any value safe for one driver was either wrong or silently
dropped for the other (`IfNotPresent` → `ImagePullPolicy::Missing` after
`.unwrap_or_default()`). Operators run one driver per gateway, so the
"shared default" never pays for itself.
Make `image_pull_policy` driver-local:
- Remove the field from `GatewayFileSection` and from
`inheritable_keys()` for both Kubernetes and Podman.
- Drop the file→`RunArgs` merge for the gateway-scope key.
- Stop unconditionally clobbering the driver value with
`config.sandbox_image_pull_policy` in the runtime wiring — only apply
the CLI/env override when it was set (and, for Podman, only when it
parses into the lowercase enum).
- Move the key under `[openshell.drivers.kubernetes]` and
`[openshell.drivers.podman]` in every example, the RFC, the
architecture doc, and the Helm-rendered gateway ConfigMap.
The supervisor pull policy follows the same shape: it is K8s-only and
moves into `[openshell.drivers.kubernetes]` alongside `image_pull_policy`
in the Helm template.
* fix(gateway): address review feedback on TOML configuration
Resolves the P1 and P2 issues raised in PR #1317:
- Helm gateway ConfigMap moves `grpc_endpoint` under
`[openshell.drivers.kubernetes]` so the default install no longer fails
the gateway's `deny_unknown_fields` schema check.
- `kubernetes_config_from_file` and `podman_config_from_file` only let
the gateway-wide CLI/env `grpc_endpoint` overwrite the driver-table
value when it was actually supplied, preserving file-only configs.
- Kubernetes driver default `image_pull_policy` is now empty (was Podman
vocabulary "missing"), so default deployments let the Kubernetes API
apply its own policy instead of being rejected.
- New `disable_tls` gateway field plumbs `.Values.server.disableTls`
through the TOML ConfigMap instead of relying on env vars dropped from
the StatefulSet.
- StatefulSet pod template now carries a `checksum/gateway-config`
annotation so `helm upgrade` rolls pods when the ConfigMap changes.
- Auxiliary listener resolution preserves the full `SocketAddr` from
`health_bind_address` / `metrics_bind_address`, so a loopback-pinned
health port is not silently relocated onto the public bind address.
- `ssh_session_ttl_secs` from the file is now applied to `Config` (it
was previously accepted by the loader but never read).
New regression coverage: cli-level merge tests for the new fields plus
helm-unittest assertions for the ConfigMap shape, checksum annotation,
and `disable_tls` rendering.
* docs(gateway): consolidate gateway TOML examples into docs reference
Replaces the per-driver example files under examples/gateway/ with a
single published reference page at docs/reference/gateway-config.mdx
covering source precedence, layout, the full example, and the four
per-driver examples (Kubernetes, Docker, Podman, microVM). Drew flagged
during PR #1317 review that the examples belong with the user-facing
docs rather than in a sibling examples/ directory.
The cross-references in architecture/gateway.md and RFC 0003 are updated
to point at the new docs page; the round-trip test in config_file.rs is
removed (schema coverage stays on the inline parses_full_example test
and per-field merge tests — doc-snippet drift belongs in a separate
docs-lint, not in a cross-tree Rust unit test).
* refactor(core): move DEFAULT_K8S_NAMESPACE into K8s driver
The constant is Kubernetes-specific (used only by KubernetesComputeConfig's
Default impl) and does not belong in openshell-core. Relocate it to the
driver crate that owns the K8s vocabulary; openshell-core retains only
truly cross-cutting defaults.
* refactor(core): move Podman bridge default into Podman driver
DEFAULT_NETWORK_NAME is Podman vocabulary, consumed only by the Podman
driver. Also drops the unused DEFAULT_IMAGE_PULL_POLICY constant.
* docs(auth): scrub remaining SSH handshake secret references
Sweeps the trailing mentions left after the rebase: the gateway
config-file module doc, the Helm gateway-config ConfigMap header,
the gateway-config.mdx env-only note, and the RPM systemd unit
comment for init-gateway-env.sh.
* docs(gateway): clarify OPENSHELL_GRPC_ENDPOINT applies to all drivers
The previous comment implied the callback endpoint was Kubernetes-only,
but the value is propagated to every compute driver (Kubernetes, Docker,
Podman, VM) and must be reachable from wherever the sandbox runs.
* refactor(gateway): move driver options into config (#1394)
Signed-off-by: Drew Newberry <anewberry@nvidia.com>
* fix(e2e): regenerate gateway config via TOML for docker + podman harnesses
The gateway CLI flags moved into TOML config tables in 560550d (#1394),
which made every existing e2e/with-{docker,podman}-gateway.sh invocation
fail with "unexpected argument '--sandbox-namespace'" (and a long tail
of similar driver-specific options) before the gateway could even bind.
Replace the obsolete CLI flags with a synthesized
`[openshell.drivers.<driver>]` table written to `${STATE_DIR}/gateway.toml`
and passed via the new `--config` flag. Only the gateway-wide flags
that survived 560550d (bind-address, port, drivers, db-url, tls-*,
disable-tls, log-level, health-port) stay on the command line.
Both scripts get a small `toml_string` helper to properly TOML-quote
the values (the previous `%q` printf format produced bash-escape, not
TOML-escape). The Docker harness also corrects two field names that
diverged from the driver schema: `docker_network_name` →
`network_name`, and the supervisor binary/image plumbing now reads
through to `supervisor_bin` / `supervisor_image` in the same table.
The Podman harness drops `--ssh-gateway-port` (deleted in 560550d —
gRPC + SSH are multiplexed on the same port now) and substitutes
`network_name` + `gateway_port` for the obsolete `--sandbox-namespace`
(which the Podman driver never had as a typed field).
* fix(core): swap bind-only 0.0.0.0 SSH gateway host for cluster URL host
CLI's resolve_ssh_gateway treated 0.0.0.0 as a loopback "keep as-is"
when the cluster URL was also loopback, so the SSH proxy connected to
0.0.0.0:port. The unspecified address is never a valid connect target
and is not present in any TLS cert SAN, which produced BadCertificate
TLS handshake failures during `openshell sandbox create -- ...` in
docker/podman e2e (e.g. bypass_detection).
Resolution: when the server returns 0.0.0.0 or :: as the gateway host
and both endpoints are loopback, fall back to the cluster URL's host
(which the CLI is already using to reach the gateway, so it must
resolve and match the cert).
* fix(e2e): repair podman harness on macOS
Podman 5.x with the applehv/libkrun provider no longer creates the legacy
~/.local/share/containers/podman/machine/podman.sock symlink, and
`podman system service` is a Linux-only subcommand — the macOS client
delegates the API service to the VM. Both assumptions in the harness
were stale, so the script tried to start a temporary service that podman
rejected with "unknown flag: --time".
- Discover the macOS socket via `podman machine inspect` instead of the
hardcoded path.
- On Darwin, fail fast with a "start podman machine" message rather than
attempting the Linux-only `podman system service` fallback.
- Write socket_path into [openshell.drivers.podman] so the in-process
driver picks up the discovered socket; the driver reads TOML only
after the config refactor (560550d), so OPENSHELL_PODMAN_SOCKET alone
was no longer enough.
* fix(server): use clone_from for TLS client CA assignment
clippy 1.95.0 rejects assigning the result of `Clone::clone()` to an
existing variable under `-D warnings` (`assigning_clones`). Switch to
`clone_from(&...)` to satisfy the lint and avoid the redundant
allocation.
---------
Signed-off-by: Drew Newberry <anewberry@nvidia.com>
Co-authored-by: Drew Newberry <anewberry@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Removes gateway-owned driver and SSH endpoint CLI/env flags, moving driver implementation settings into TOML driver tables. Gateway CLI keeps process and gateway-level args while SSH relay endpoint data is derived from the gateway listener.
Related Issue
N/A
Changes
Testing
mise run pre-commitpassesChecklist