
bug: cluster image bakes envoy-gateway-openshell.yaml into k3s static manifests, gateway start times out #1193

@raktes


Agent Diagnostic

  • Loaded debug-openshell-cluster skill. openshell gateway start against ghcr.io/nvidia/openshell/cluster:dev (current main, f17806c) aborts after ~2 min with × K8s namespace not ready.
  • Cluster container k3s log repeatedly emits ApplyManifestFailed for /var/lib/rancher/k3s/server/manifests/envoy-gateway-openshell.yaml: the server could not find the requested resource. The manifest defines a gateway.networking.k8s.io/v1 GatewayClass, but the bundled cluster image installs no Gateway API CRDs.
  • Dockerfile.images:204 wildcard-copies deploy/kube/manifests/*.yaml into /opt/openshell/manifests/; cluster-entrypoint.sh:354-359 then unconditionally copies that directory into k3s's static-manifest dir. As a result, the opt-in file leaks into the bundled k3s deploy path.
  • envoy-gateway-openshell.yaml:4-7 documents itself as opt-in: "Apply after a successful Skaffold deploy when gateway routing is enabled: mise run helm:gateway:apply". tasks/helm.toml:59-61 runs kubectl apply against the same path — that path expects Envoy Gateway CRDs already present.
  • The repeated retries on the unknown CRD slow the deploy controller enough that the bundled openshell HelmChart does not finish installing inside wait_for_namespace's 60-attempt / ~2 min budget (crates/openshell-bootstrap/src/lib.rs:1184), so the openshell namespace never appears.
  • git log confirms the regression landed in 5116cc2 (PR feat(helm): add kubernetes local-dev environment #1158, merged 2026-05-05). Pin 4483c860 does not contain the file and reproduces cleanly.
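For context on why the retries matter: the bootstrap's namespace wait is a fixed-attempt polling loop, so anything that delays the HelmChart install past the budget surfaces as "namespace not ready". A rough shell sketch of that loop follows; the attempt count and sleep interval are assumptions read off the ~2 min / 60-attempt figure above, not the actual Rust code in crates/openshell-bootstrap, and check_namespace stands in for the real kubectl probe.

```shell
# Illustrative only: approximates wait_for_namespace's fixed budget.
# check_namespace is a stand-in probe the caller must provide
# (the real implementation queries the apiserver).
wait_for_namespace() {
  attempts=60   # assumed from the 60-attempt figure above
  interval=2    # seconds between probes (assumed; 60 x 2s ~= 2 min)
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if check_namespace "$1"; then
      echo "namespace '$1' ready after $((i + 1)) attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out waiting for namespace '$1' to exist" >&2
  return 1
}
```

If every probe inside the budget fails, the loop gives up with exactly the "timed out waiting for namespace" shape seen in the error output.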

Caller-side configuration cannot fix this — the file is baked into the cluster image. Fix has to land in the cluster image build.

Description

Actual: openshell gateway start against a clean cluster:dev times out because k3s cannot apply envoy-gateway-openshell.yaml. The openshell namespace is never created within the bootstrap budget.

Expected: gateway start completes; the bundled k3s applies only the manifests intended for the bundled deploy path (agent-sandbox.yaml, openshell-helmchart.yaml).

Reproduction Steps

Repro'd locally on macOS 15.x + Docker Desktop with openshell 0.0.36 (stable, PyPI) against the :dev cluster image. Same signature observed on Ubuntu 24.04 in CI.

  1. uv tool install -U openshell (or any 0.0.36+ install). Ensure no prior gateway is running (openshell gateway destroy if needed).

  2. Confirm the file is in the cluster image:

    docker run --rm --entrypoint cat ghcr.io/nvidia/openshell/cluster:dev \
      /opt/openshell/manifests/envoy-gateway-openshell.yaml

    This prints the eg GatewayClass YAML. Image digest at time of repro: sha256:20e0190c1cacbe13036afce465db93ea6fae9f763ec65bf9c3301d4d093ec75a.

  3. Start a gateway against that image:

    OPENSHELL_CLUSTER_IMAGE=ghcr.io/nvidia/openshell/cluster:dev \
      openshell gateway start --port 18080 --name os-bugrepro

    After ~2 min, the command exits with × K8s namespace not ready ╰─▶ timed out waiting for namespace 'openshell' to exist. The container-logs section of the error includes repeated ApplyManifestFailed entries for envoy-gateway-openshell.yaml.

Environment

  • Reporter (verified): macOS 15.x (Darwin 25.2.0) + Docker Desktop, Docker Engine 28.5.1, openshell 0.0.36 from PyPI.
  • Also seen (CI): Ubuntu 24.04 (GitHub Actions ubuntu-24.04), Docker Engine 27.x.
  • Cluster image: ghcr.io/nvidia/openshell/cluster:dev at main (f17806c).
  • Last good upstream pin: 4483c860. Regression: 5116cc2 (PR feat(helm): add kubernetes local-dev environment #1158).

Logs

Captured from the local repro above:

× Gateway failed: os-bugrepro
Error:   × K8s namespace not ready
  ╰─▶ timed out waiting for namespace 'openshell' to exist: Error from server
      (NotFound): namespaces "openshell" not found

      container logs:
        ...
        time="2026-05-06T13:24:08Z" level=error msg="Failed to process config:
        failed to process /var/lib/rancher/k3s/server/manifests/envoy-gateway-
        openshell.yaml: the server could not find the requested resource"
        object="kube-system/envoy-gateway-openshell" kind="Addon"
        apiVersion="k3s.cattle.io/v1" type="Warning" reason="ApplyManifestFailed"
        message="Applying manifest at \"/var/lib/rancher/k3s/server/manifests/
        envoy-gateway-openshell.yaml\" failed: the server could not find the
        requested resource"
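The signature above can be machine-checked: if a captured gateway start log contains both the ApplyManifestFailed reason and the opt-in manifest's name, it is this bug rather than a generic bootstrap timeout. A minimal sketch, where check_logs is a hypothetical helper (not part of the openshell CLI):

```shell
# check_logs FILE: scan a captured 'openshell gateway start' error
# output for this issue's signature. Hypothetical helper for triage.
check_logs() {
  if grep -q 'ApplyManifestFailed' "$1" \
     && grep -q 'envoy-gateway-openshell.yaml' "$1"; then
    echo "matches issue #1193 signature"
  else
    echo "different failure"
  fi
}
```

For example, check_logs gateway-start.log, where gateway-start.log is wherever you captured the failing command's output.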

Suggested Fix

The file is documented as opt-in for Skaffold/Helm dev clusters that already have Envoy Gateway CRDs installed; it should not ship in the bundled-cluster path. Two minimally invasive options:

  1. Move it out of deploy/kube/manifests/ (e.g. deploy/kube/gateway-extras/) and update tasks/helm.toml (helm:gateway:apply) to the new path.
  2. Or narrow Dockerfile.images:204 COPY to an explicit allowlist (agent-sandbox.yaml openshell-helmchart.yaml) instead of *.yaml.
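For option 2, the narrowed COPY might look roughly like the following. This is a sketch against an assumed layout of Dockerfile.images, not a tested patch; only the line number (204) and the two allowlisted filenames come from the diagnostic above.

```dockerfile
# Before (Dockerfile.images:204): the wildcard pulls in every manifest,
# including the opt-in envoy-gateway-openshell.yaml.
#   COPY deploy/kube/manifests/*.yaml /opt/openshell/manifests/

# After: explicit allowlist of the manifests the bundled k3s should apply.
COPY deploy/kube/manifests/agent-sandbox.yaml \
     deploy/kube/manifests/openshell-helmchart.yaml \
     /opt/openshell/manifests/
```

The allowlist trades a little maintenance (new bundled manifests must be added here) for a guarantee that opt-in files can never leak into the static-manifest path again.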

Agent-First Checklist

  • I pointed my agent at the repo and had it investigate this issue
  • I loaded relevant skills (debug-openshell-cluster)
  • My agent could not resolve this — the diagnostic above explains why (root cause is in the cluster image build; no caller-side configuration reaches it)

Metadata

Labels: state:triage-needed (Opened without agent diagnostics and needs triage)
