Skip to content

refactor(gateway): move driver options into config#1394

Merged
drew merged 1 commit into
tmutch/gateway-config-implfrom
remove-driver-cli-args/anewberry
May 15, 2026
Merged

refactor(gateway): move driver options into config#1394
drew merged 1 commit into
tmutch/gateway-config-implfrom
remove-driver-cli-args/anewberry

Conversation

@drew
Copy link
Copy Markdown
Collaborator

@drew drew commented May 15, 2026

Summary

Removes gateway-owned driver and SSH endpoint CLI/env flags, moving driver implementation settings into TOML driver tables. Gateway CLI keeps process and gateway-level args while SSH relay endpoint data is derived from the gateway listener.

Related Issue

N/A

Changes

  • Removed gateway driver flags and shared Config fields for sandbox image, callback endpoint, driver dirs, TLS mounts, and SSH endpoint ports
  • Added driver TOML config inheritance and updated Docker, Podman, VM, and Kubernetes defaults
  • Updated packaging wrappers, Helm, snap, RPM/deb/Homebrew generation, scripts, docs, and tests for TOML-based driver settings

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)

@drew drew requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners May 15, 2026 00:13
Signed-off-by: Drew Newberry <anewberry@nvidia.com>
@drew drew force-pushed the remove-driver-cli-args/anewberry branch from 62302a7 to 6d49eb8 Compare May 15, 2026 00:17
@github-actions
Copy link
Copy Markdown

@drew drew merged commit 560550d into tmutch/gateway-config-impl May 15, 2026
29 checks passed
@drew drew deleted the remove-driver-cli-args/anewberry branch May 15, 2026 00:20
TaylorMutch added a commit that referenced this pull request May 15, 2026
…esses

The gateway CLI flags moved into TOML config tables in 560550d (#1394),
which made every existing e2e/with-{docker,podman}-gateway.sh invocation
fail with "unexpected argument '--sandbox-namespace'" (and a long tail
of similar driver-specific options) before the gateway could even bind.

Replace the obsolete CLI flags with a synthesized
`[openshell.drivers.<driver>]` table written to `${STATE_DIR}/gateway.toml`
and passed via the new `--config` flag. Only the gateway-wide flags
that survived 560550d (bind-address, port, drivers, db-url, tls-*,
disable-tls, log-level, health-port) stay on the command line.

Both scripts get a small `toml_string` helper to properly TOML-quote
the values (the previous `%q` printf format produced bash-escape, not
TOML-escape). The Docker harness also corrects two field names that
diverged from the driver schema: `docker_network_name` →
`network_name`, and the supervisor binary/image plumbing now reads
through to `supervisor_bin` / `supervisor_image` in the same table.

The Podman harness drops `--ssh-gateway-port` (deleted in 560550d —
gRPC + SSH are multiplexed on the same port now) and substitutes
`network_name` + `gateway_port` for the obsolete `--sandbox-namespace`
(which the Podman driver never had as a typed field).
TaylorMutch pushed a commit that referenced this pull request May 15, 2026
Signed-off-by: Drew Newberry <anewberry@nvidia.com>
TaylorMutch added a commit that referenced this pull request May 15, 2026
…esses

The gateway CLI flags moved into TOML config tables in 560550d (#1394),
which made every existing e2e/with-{docker,podman}-gateway.sh invocation
fail with "unexpected argument '--sandbox-namespace'" (and a long tail
of similar driver-specific options) before the gateway could even bind.

Replace the obsolete CLI flags with a synthesized
`[openshell.drivers.<driver>]` table written to `${STATE_DIR}/gateway.toml`
and passed via the new `--config` flag. Only the gateway-wide flags
that survived 560550d (bind-address, port, drivers, db-url, tls-*,
disable-tls, log-level, health-port) stay on the command line.

Both scripts get a small `toml_string` helper to properly TOML-quote
the values (the previous `%q` printf format produced bash-escape, not
TOML-escape). The Docker harness also corrects two field names that
diverged from the driver schema: `docker_network_name` →
`network_name`, and the supervisor binary/image plumbing now reads
through to `supervisor_bin` / `supervisor_image` in the same table.

The Podman harness drops `--ssh-gateway-port` (deleted in 560550d —
gRPC + SSH are multiplexed on the same port now) and substitutes
`network_name` + `gateway_port` for the obsolete `--sandbox-namespace`
(which the Podman driver never had as a typed field).
TaylorMutch pushed a commit that referenced this pull request May 15, 2026
Signed-off-by: Drew Newberry <anewberry@nvidia.com>
TaylorMutch added a commit that referenced this pull request May 15, 2026
…esses

The gateway CLI flags moved into TOML config tables in 560550d (#1394),
which made every existing e2e/with-{docker,podman}-gateway.sh invocation
fail with "unexpected argument '--sandbox-namespace'" (and a long tail
of similar driver-specific options) before the gateway could even bind.

Replace the obsolete CLI flags with a synthesized
`[openshell.drivers.<driver>]` table written to `${STATE_DIR}/gateway.toml`
and passed via the new `--config` flag. Only the gateway-wide flags
that survived 560550d (bind-address, port, drivers, db-url, tls-*,
disable-tls, log-level, health-port) stay on the command line.

Both scripts get a small `toml_string` helper to properly TOML-quote
the values (the previous `%q` printf format produced bash-escape, not
TOML-escape). The Docker harness also corrects two field names that
diverged from the driver schema: `docker_network_name` →
`network_name`, and the supervisor binary/image plumbing now reads
through to `supervisor_bin` / `supervisor_image` in the same table.

The Podman harness drops `--ssh-gateway-port` (deleted in 560550d —
gRPC + SSH are multiplexed on the same port now) and substitutes
`network_name` + `gateway_port` for the obsolete `--sandbox-namespace`
(which the Podman driver never had as a typed field).
TaylorMutch added a commit that referenced this pull request May 15, 2026
* feat(gateway): add TOML configuration file (RFC 0003)

Introduces an opt-in --config / OPENSHELL_GATEWAY_CONFIG flag that loads a
TOML file with gateway-wide settings and per-driver tables. Source
precedence is CLI > env > file > built-in default, implemented via clap's
ValueSource so existing flags and env vars keep their priority.

Driver crates (kubernetes, docker, podman, vm) now derive Deserialize on
their config structs. SupervisorSideloadMethod gains Deserialize with
kebab-case rename. A per-driver inheritance allowlist on the loader side
overlays [openshell.gateway] shared defaults (default_image,
supervisor_image, image_pull_policy, guest_tls_*, ssh_handshake_skew_secs,
client_tls_secret_name, host_gateway_ip, enable_user_namespaces) onto
each [openshell.drivers.<name>] table before deserialization.

The Helm chart renders a new gateway-config ConfigMap and mounts it at
/etc/openshell/gateway.toml. The migrated OPENSHELL_* env entries are
dropped from the StatefulSet — only the Secret-backed
OPENSHELL_SSH_HANDSHAKE_SECRET remains. database_url stays on --db-url.

Adds examples/gateway/gateway.example.toml and updates architecture/gateway.md
with the source precedence and inheritance rules.

* docs(gateway): drop ssh_handshake_skew_secs and ssh_handshake_secret from examples

Both fields are scheduled for removal. Remove the example values and the
env-only note so the gateway.toml example and the architecture doc stop
recommending settings that will not exist much longer.

* docs(rfc): correct OPENSHELL_CONFIG to OPENSHELL_GATEWAY_CONFIG in RFC 0003

* docs(gateway): add per-driver TOML example configurations

Adds focused single-driver examples next to the comprehensive
gateway.example.toml: kubernetes, docker, podman, and microvm. Each one
demonstrates the realistic settings for that driver plus how shared
[openshell.gateway] defaults inherit into the driver table.

A new unit test (`checked_in_examples_parse`) loads every example through
the config_file loader so schema drift fails CI rather than silently
shipping a broken example.

* refactor(gateway): drop image_pull_policy from shared inheritance

Kubernetes and Podman use mutually-incompatible vocabularies for the same
TOML key:

  - Kubernetes: `Always | IfNotPresent | Never` (free-form string passed
    verbatim to the K8s API).
  - Podman: `always | missing | never | newer` (strict lowercase enum
    deserialised into `ImagePullPolicy`).

No value means the same thing in both drivers. Sharing the key at
`[openshell.gateway]` scope and inheriting it into every active driver's
table meant any value safe for one driver was either wrong or silently
dropped for the other (`IfNotPresent` → `ImagePullPolicy::Missing` after
`.unwrap_or_default()`). Operators run one driver per gateway, so the
"shared default" never pays for itself.

Make `image_pull_policy` driver-local:

  - Remove the field from `GatewayFileSection` and from
    `inheritable_keys()` for both Kubernetes and Podman.
  - Drop the file→`RunArgs` merge for the gateway-scope key.
  - Stop unconditionally clobbering the driver value with
    `config.sandbox_image_pull_policy` in the runtime wiring — only apply
    the CLI/env override when it was set (and, for Podman, only when it
    parses into the lowercase enum).
  - Move the key under `[openshell.drivers.kubernetes]` and
    `[openshell.drivers.podman]` in every example, the RFC, the
    architecture doc, and the Helm-rendered gateway ConfigMap.

The supervisor pull policy follows the same shape: it is K8s-only and
moves into `[openshell.drivers.kubernetes]` alongside `image_pull_policy`
in the Helm template.

* fix(gateway): address review feedback on TOML configuration

Resolves the P1 and P2 issues raised in PR #1317:

- Helm gateway ConfigMap moves `grpc_endpoint` under
  `[openshell.drivers.kubernetes]` so the default install no longer fails
  the gateway's `deny_unknown_fields` schema check.
- `kubernetes_config_from_file` and `podman_config_from_file` only let
  the gateway-wide CLI/env `grpc_endpoint` overwrite the driver-table
  value when it was actually supplied, preserving file-only configs.
- Kubernetes driver default `image_pull_policy` is now empty (was Podman
  vocabulary "missing"), so default deployments let the Kubernetes API
  apply its own policy instead of being rejected.
- New `disable_tls` gateway field plumbs `.Values.server.disableTls`
  through the TOML ConfigMap instead of relying on env vars dropped from
  the StatefulSet.
- StatefulSet pod template now carries a `checksum/gateway-config`
  annotation so `helm upgrade` rolls pods when the ConfigMap changes.
- Auxiliary listener resolution preserves the full `SocketAddr` from
  `health_bind_address` / `metrics_bind_address`, so a loopback-pinned
  health port is not silently relocated onto the public bind address.
- `ssh_session_ttl_secs` from the file is now applied to `Config` (it
  was previously accepted by the loader but never read).

New regression coverage: cli-level merge tests for the new fields plus
helm-unittest assertions for the ConfigMap shape, checksum annotation,
and `disable_tls` rendering.

* docs(gateway): consolidate gateway TOML examples into docs reference

Replaces the per-driver example files under examples/gateway/ with a
single published reference page at docs/reference/gateway-config.mdx
covering source precedence, layout, the full example, and the four
per-driver examples (Kubernetes, Docker, Podman, microVM). Drew flagged
during PR #1317 review that the examples belong with the user-facing
docs rather than in a sibling examples/ directory.

The cross-references in architecture/gateway.md and RFC 0003 are updated
to point at the new docs page; the round-trip test in config_file.rs is
removed (schema coverage stays on the inline parses_full_example test
and per-field merge tests — doc-snippet drift belongs in a separate
docs-lint, not in a cross-tree Rust unit test).

* refactor(core): move DEFAULT_K8S_NAMESPACE into K8s driver

The constant is Kubernetes-specific (used only by KubernetesComputeConfig's
Default impl) and does not belong in openshell-core. Relocate it to the
driver crate that owns the K8s vocabulary; openshell-core retains only
truly cross-cutting defaults.

* refactor(core): move Podman bridge default into Podman driver

DEFAULT_NETWORK_NAME is Podman vocabulary, consumed only by the Podman
driver. Also drops the unused DEFAULT_IMAGE_PULL_POLICY constant.

* docs(auth): scrub remaining SSH handshake secret references

Sweeps the trailing mentions left after the rebase: the gateway
config-file module doc, the Helm gateway-config ConfigMap header,
the gateway-config.mdx env-only note, and the RPM systemd unit
comment for init-gateway-env.sh.

* docs(gateway): clarify OPENSHELL_GRPC_ENDPOINT applies to all drivers

The previous comment implied the callback endpoint was Kubernetes-only,
but the value is propagated to every compute driver (Kubernetes, Docker,
Podman, VM) and must be reachable from wherever the sandbox runs.

* refactor(gateway): move driver options into config (#1394)

Signed-off-by: Drew Newberry <anewberry@nvidia.com>

* fix(e2e): regenerate gateway config via TOML for docker + podman harnesses

The gateway CLI flags moved into TOML config tables in 560550d (#1394),
which made every existing e2e/with-{docker,podman}-gateway.sh invocation
fail with "unexpected argument '--sandbox-namespace'" (and a long tail
of similar driver-specific options) before the gateway could even bind.

Replace the obsolete CLI flags with a synthesized
`[openshell.drivers.<driver>]` table written to `${STATE_DIR}/gateway.toml`
and passed via the new `--config` flag. Only the gateway-wide flags
that survived 560550d (bind-address, port, drivers, db-url, tls-*,
disable-tls, log-level, health-port) stay on the command line.

Both scripts get a small `toml_string` helper to properly TOML-quote
the values (the previous `%q` printf format produced bash-escape, not
TOML-escape). The Docker harness also corrects two field names that
diverged from the driver schema: `docker_network_name` →
`network_name`, and the supervisor binary/image plumbing now reads
through to `supervisor_bin` / `supervisor_image` in the same table.

The Podman harness drops `--ssh-gateway-port` (deleted in 560550d —
gRPC + SSH are multiplexed on the same port now) and substitutes
`network_name` + `gateway_port` for the obsolete `--sandbox-namespace`
(which the Podman driver never had as a typed field).

* fix(core): swap bind-only 0.0.0.0 SSH gateway host for cluster URL host

CLI's resolve_ssh_gateway treated 0.0.0.0 as a loopback "keep as-is"
when the cluster URL was also loopback, so the SSH proxy connected to
0.0.0.0:port. The unspecified address is never a valid connect target
and is not present in any TLS cert SAN, which produced BadCertificate
TLS handshake failures during `openshell sandbox create -- ...` in
docker/podman e2e (e.g. bypass_detection).

Resolution: when the server returns 0.0.0.0 or :: as the gateway host
and both endpoints are loopback, fall back to the cluster URL's host
(which the CLI is already using to reach the gateway, so it must
resolve and match the cert).

* fix(e2e): repair podman harness on macOS

Podman 5.x with the applehv/libkrun provider no longer creates the legacy
~/.local/share/containers/podman/machine/podman.sock symlink, and
`podman system service` is a Linux-only subcommand — the macOS client
delegates the API service to the VM. Both assumptions in the harness
were stale, so the script tried to start a temporary service that podman
rejected with "unknown flag: --time".

- Discover the macOS socket via `podman machine inspect` instead of the
  hardcoded path.
- On Darwin, fail fast with a "start podman machine" message rather than
  attempting the Linux-only `podman system service` fallback.
- Write socket_path into [openshell.drivers.podman] so the in-process
  driver picks up the discovered socket; the driver reads TOML only
  after the config refactor (560550d), so OPENSHELL_PODMAN_SOCKET alone
  was no longer enough.

* fix(server): use clone_from for TLS client CA assignment

clippy 1.95.0 rejects assigning the result of `Clone::clone()` to an
existing variable under `-D warnings` (`assigning_clones`). Switch to
`clone_from(&...)` to satisfy the lint and avoid the redundant
allocation.

---------

Signed-off-by: Drew Newberry <anewberry@nvidia.com>
Co-authored-by: Drew Newberry <anewberry@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants