Skip to content

feat(ci,bundler): validate deployer OCI bundles matrix in KWOK#956

Merged
mchmarny merged 2 commits into
mainfrom
feat/kwok-deployer-matrix-843
May 19, 2026
Merged

feat(ci,bundler): validate deployer OCI bundles matrix in KWOK#956
mchmarny merged 2 commits into
mainfrom
feat/kwok-deployer-matrix-843

Conversation

@mchmarny
Copy link
Copy Markdown
Member

@mchmarny mchmarny commented May 18, 2026

Summary

Adds end-to-end validation of argocd, argocd-helm, and flux bundles
in the KWOK CI matrix. Each lane bundles the recipe, pushes the artifact
to an in-cluster OCI registry, lets the corresponding GitOps controller
pull and reconcile (or for flux-oci, apply), and asserts the bundle's
GitOps shape was produced. Closes the long-standing gap where Argo CD
and Flux bundle template regressions could ship to main undetected.

Three latent bundler bugs that this round-trip surfaced are fixed in
the same PR. Two further gaps are deferred to tracked follow-ups
(#963, #964).

Fixes: #843, #960, #961
Related: #962 (migrate KWOK argocd-* sync gate to Chainsaw),
#963 (Gitea-in-cluster for filesystem-bundle round-trip),
#964 (flux local-chart HelmReleases unreachable under OCI)

Deployer × output-format CI coverage

Before / after this PR, by deployer:

Deployer FS bundle OCI image
helm ✅ Chainsaw (offline, cuj1-training) + KWOK live install ➖ Intentionally unscoped — bundle pushes work generically but no downstream consumer of a helm deployer's OCI artifact exists
argocd ✅ Chainsaw bundle-variants (app-of-apps shape) New — KWOK lane (bundle → push → Argo CD pull → reconcile → pods Running)
argocd-helm ✅ Chainsaw bundle-dynamic (Chart.yaml + templates/) New — KWOK lane (Helm-OCI mediatype push + helm install round-trip)
flux ✅ Chainsaw bundle-flux New — KWOK lane (OCI pull → Kustomization apply → HelmRelease CRs created). Reconciliation deferred per #964.
helmfile ✅ Chainsaw bundle-helmfile (incl. real helmfile build) N/Ahelmfile.yaml's oci: true flags chart repos, not the bundle artifact

The KWOK matrix goes from 1 deployer (helm) to 4 (helm, argocd-oci,
argocd-helm-oci, flux-oci).

For filesystem (Git-source) bundle round-trip coverage of argocd and
flux, see follow-up #963.

What "pass" means per lane

The KWOK lanes validate bundle-format correctness, not chart
reconciliation completeness. Each lane stops at the strongest signal
its controller can produce without fighting KWOK fidelity gaps.

Lane Pull artifact Apply manifests Reconcile to Ready
helm n/a (filesystem) ✅ (helm install) ✅ pods scheduled
argocd-oci ✅ (repo-server OCI pull) ✅ (Argo CD sync) ✅ terminal pass set (below)
argocd-helm-oci ✅ (helm pull OCI) ✅ (wrapper chart install) ✅ terminal pass set (below)
flux-oci ✅ (source-controller OCI pull) ✅ (kustomize-controller apply) deferred — see #964

The argocd-* lanes admit four reality-shaped terminal-pass outcomes to
handle KWOK pod-readiness simulation gaps and operator-mutation drift:

  • Synced + Healthy — canonical pass
  • Synced + Progressing — KWOK pod-readiness simulation gap (Prometheus StatefulSet, DaemonSet probes); not a bundle defect
  • OutOfSync + Healthy + operationState.phase=Succeeded — operator-mutation drift (gpu-operator's ClusterPolicy, nvidia-dra-driver-gpu's DeviceClass); production users handle this with ignoreDifferences, not a bundle defect
  • Synced + Degraded + operationState.phase=Succeeded — chart applied but health controller marks at least one managed resource Degraded (typically a Pod whose Ready signal KWOK can't produce, e.g. kai-scheduler leader election); same phase=Succeeded guard as arm 3 rejects genuinely-broken syncs

The matrix does not assert that every chart eventually reaches
Healthy under KWOK — that conflates bundle shape with chart-author
contracts and KWOK simulation fidelity.

flux-oci intentionally stops one step earlier ("bundle applied" =
OCIRepository pulled + Kustomization applied + HelmRelease CRs created)
because the bundler emits local-chart HelmReleases pointing at a
placeholder Git URL (https://github.com/YOUR_ORG/YOUR_REPO.git).
Under OCI consumption these can't reconcile, and Flux's dependsOn
cascade stalls every downstream HelmRelease. Flux's HelmRelease v2 API
blocks the obvious fix (spec.chart.spec.sourceRef.kind rejects
OCIRepository; spec.chartRef has no path support into a multi-chart
artifact), so the proper fix needs an ADR-level design decision.
Tracked in #964.

Issues this PR exposed

The round-trip lanes immediately surfaced latent bundler bugs. Each
was invisible to existing CI because no lane previously exercised a
GitOps controller pulling an AICR bundle from a registry.

# Symptom Root cause
#960 argocd-oci failed with unsupported scheme "oci" from Argo CD repo-server application.yaml.tmpl rendered multi-source $values ref against an OCI URL; $values is Git-only in Argo CD
#961 argocd-helm-oci failed with unable to locate any tags in provided repository from helm pull Bundle pushed with generic AICR mediatype; helm v3 filters tags by helm.config.v1+json config mediatype + semver tag
(no issue, fixed in this PR) argocd-helm-oci failed component "gpu-operator-pre" not found in registry on recipes with preManifestFiles (e.g. gke-cos-training) argocdhelm.processFolders parent-resolution stripped -post suffix but not -pre
#964 (deferred) flux-oci local-chart HelmReleases stall under OCI; dependsOn cascade stalls the rest of the graph Bundler emits placeholder GitRepository URL for inner local charts; Flux's HelmRelease API rejects OCIRepository as a chart sourceRef and chartRef has no path support, so a wrapper OCIRepository can't address sub-paths. Proper fix needs ADR-level choice (render at bundle time vs. publish per-chart).

What's in this PR

  • Argo CD $values multi-source ref is Git-only; blocks argocd deployer over OCI #960argocd.Generator gains an InlineUpstreamValues flag that, when set (auto-enabled by the bundler for OCI RepoURLs), inlines per-component values as helm.valuesObject under a single-source Application. The Git-only $values multi-source ref is no longer rendered for OCI targets.

  • argocd-helm-oci: helm pull from in-cluster OCI registry returns 'unable to locate any tags' #961 — New oci.PackageAndPushHelmChart flow: builds a chart-root-prefixed chart.tgz, pushes with application/vnd.cncf.helm.config.v1+json config + application/vnd.cncf.helm.chart.content.v1.tar+gzip layer, and rewrites Chart.yaml.version to the OCI tag so helm install --version <tag> resolves. The CLI rejects non-semver tags at bundle time with an actionable remediation hint.

  • argocdhelm -pre resolution — Parent-component lookup in processFolders now symmetrically strips both -pre and -post suffixes. Recipes with preManifestFiles (gke-cos OS overlay, etc.) bundle correctly.

  • CLI contract clarification--deployer argocd-helm no longer auto-fills repoURL from the OCI reference; that bundle is URL-portable by design and repoURL is supplied at helm install time.

  • KWOK matrix scaffoldingdeployer dimension added to kwok-recipes.yaml (Tier 1 + Tier 3). Per-cell: Kind cluster + in-cluster registry:2 + the GitOps controller(s) the selected deployer needs (Argo CD chart 9.5.14 or Flux 2 install manifest v2.8.7), repo-creds template Secret (Argo CD lanes only), aicr bundle --output oci://..., then either kubectl apply (argocd-oci, flux-oci) or helm install (argocd-helm-oci), then sync-gate, then verify_pods.

  • flux-oci lane — Installs Flux 2 (source-controller / kustomize-controller / helm-controller) via the pinned install.yaml. Applies an OCIRepository (with spec.layerSelector.mediaType: application/vnd.oci.image.layer.v1.tar+gzip for the AICR generic OCI mediatype, plus spec.insecure: true for the in-cluster HTTP registry) and a Kustomization pointing at it. Sync gate validates pull + apply + HelmRelease-CRs-created; does NOT wait for HelmRelease reconciliation — see #964.

  • Sync gate (argocd) — Four-state terminal-pass predicate, non-root Application fallback for argocd-helm-oci wrappers whose status.resources hasn't populated yet, and set -e-safe kubectl probe.

  • 3-strike rulerun-all-recipes.sh bails the whole job with exit 50 after three consecutive GitOps sync timeouts (argocd-* or flux-oci) so a wedged cluster doesn't burn the full matrix budget.

  • Cleanup hardening — Three call sites switched from kubectl delete ns --wait=true --timeout=120s to async delete + force-finalize. KWOK can't run real controller finalizers, so --wait=true blocked the full 2 min per namespace; with ~10 namespaces/recipe this exhausted the 15 min job budget on cleanup after the actual validation passed. Force-finalize bypasses the protocol by PATCHing empty finalizers — safe in the ephemeral KWOK cluster. Cleanup drops from ~20 min/recipe to ~5–10 s.

  • Retry hardeningKWOK_COMMAND_RETRIES default bumped from 3 to 5 (cumulative sleep 15s → 75s) so transient GitHub releases CDN 502s on kwok-stage-fast chart fetch don't cancel CI runs.

  • SIGPIPE safetylist_bundle_entries now traps SIGPIPE in its subshell so a downstream head -N closing the pipe early can't propagate exit 141 to the whole script.

  • macOS port-5500 defaultkwok/kind-config.yaml exposes the in-cluster registry on host port 5500 (ControlCenter holds 5000); Linux CI runners are unaffected.

  • Test hermeticityTestWrite_VendorCharts_KustomizeFallthrough sets GIT_TERMINAL_PROMPT=0 so kustomize's git fetcher fails fast instead of prompting Username for 'https://github.com': mid-make qualify on macOS.

Testing

make qualify passes locally on the final commit (test + lint + e2e + scan).

KWOK acceptance against eks-training on macOS:

  • helm: ✅ PASS — baseline byte-identical to main
  • argocd-oci: ✅ PASS — 15/15 Applications terminal, 44/44 pods scheduled
  • argocd-helm-oci: ✅ PASS — 15/15 Applications terminal, 44/44 pods scheduled
  • flux-oci: ✅ PASS — OCIRepository pulled, Kustomization applied, 14 HelmReleases created (reconciliation deferred per bundler(flux): local-chart HelmReleases unreachable under OCI consumption #964)

CI matrix covers 16 recipes × 4 deployers; gke-cos-* and other recipes with COS-OS preManifestFiles specifically exercise the -pre fix.

Risk

helm and non-OCI deployer paths are byte-identical to main. Changes
that affect OCI users:

  • argocd + OCI: per-component Applications now render single-source with helm.valuesObject inlined. Per-component values.yaml files remain on disk for operator readability but are no longer referenced via $values. Git-backed argocd is unchanged.
  • argocd-helm + OCI: bundle is now published as a Helm OCI chart (helm.config.v1+json config + helm.chart.content.v1.tar+gzip layer). Chart.yaml.version is rewritten to match the OCI tag so helm install --version <tag> resolves correctly.
  • argocd-helm + OCI now rejects non-semver tags at bundle time with an actionable error message ("Wrap as a semver pre-release, e.g. 0.0.0-<tag>"), instead of producing an artifact that helm pull silently refuses to find.
  • flux + OCI: No bundler changes. The bundler-side investigation found Flux's HelmRelease v2 API blocks the natural fix (spec.chart.spec.sourceRef.kind rejects OCIRepository; spec.chartRef has no path support). The flux-oci lane validates the bundle-apply step only and tracks the proper fix in #964.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (golangci-lint 0 issues)
  • No skipped/disabled tests
  • New tests added (argocdhelm -pre resolution; argocd inline values; oci helm mediatypes + tag semver; oci helm tgz reproducibility)
  • Docs updated (docs/contributor/kwok-testing.md, docs/design/008-kwok-deployer-matrix.md)
  • Commits cryptographically signed (git commit -S)

@github-actions
Copy link
Copy Markdown
Contributor

@mchmarny mchmarny self-assigned this May 18, 2026
@coderabbitai

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@github-actions

This comment was marked as outdated.

@mchmarny mchmarny marked this pull request as draft May 18, 2026 23:40
mchmarny added a commit that referenced this pull request May 19, 2026
Ran the deployer-matrix acceptance tests locally; surfaced multiple
wiring issues that the design spike missed because it tested with a
hand-rolled minimal flow rather than the actual validate-scheduling
path. Also resolves two CodeRabbit findings on PR #956.

helm-path is byte-identical (regression baseline still passes).

Wiring fixes (validate-scheduling.sh):
- For argocd-* deployers, run `aicr bundle --output oci://...` from
  inside ${WORK_DIR}. When --output is an OCI ref the bundler writes
  the local bundle to ./bundle (CWD-relative); without the cd, the
  rendered app-of-apps.yaml never lands at ${WORK_DIR}/bundle/.
- Pass --repo `oci://registry.aicr-registry.svc.cluster.local:5000/<repo>`
  explicitly. The auto-derive in pkg/cli/bundle.go uses the same URL
  for push and pull, but Argo CD's repo-server runs in-cluster and
  must use the in-cluster Service DNS (not localhost:5000 which is the
  runner-side host port mapping).
- argocd-helm-oci helm install: use `oci://host/repo --version <tag>`,
  not the `oci://host/repo:<tag>` shape (which Helm OCI rejects as
  "unable to locate any tags"). The tag is split from OCI_REF locally.

Repository secret (install-infra.sh):
- Switch from `secret-type: repository` (exact URL match in Argo CD —
  util/db/repository_secrets.go::getRepositorySecret) to
  `secret-type: repo-creds` (longest-prefix match —
  getRepositoryCredentialIndex). One template at the …/aicr prefix
  covers every recipe pushed by validate-scheduling.sh; Repository
  type cannot because each recipe has a distinct repoURL. The
  repo-creds enrichment still propagates insecureOCIForceHttp=true to
  the synthetic Repository for each Application.

Cluster bootstrap (kind-config.yaml):
- Re-enable CoreDNS. The historical skip was correct for KWOK-only
  scheduling validation; with Argo CD pods in the cluster, they need
  in-cluster DNS to reach `argocd-redis` and to resolve the OCI
  registry's Service DNS. Comment updated to call out the dependency.

CodeRabbit findings:
- run-all-recipes.sh: reset consecutive_sync_timeouts on non-timeout
  failures so the 3-strike rule fires only on genuinely consecutive
  timeouts (a pass-fail-pass-timeout-timeout-timeout sequence is 3
  consecutive; pass-timeout-fail-timeout-timeout is 2).
- validate-scheduling.sh::wait_for_argocd_sync: move `apps_rc=$?`
  out of the `if \! kubectl get ...; then` block. The shell's negation
  rewrites the captured status, so the diagnostic was logging rc=0
  on failure. Now logs the real kubectl rc.
@mchmarny

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

mchmarny added a commit that referenced this pull request May 19, 2026
Closes the two outstanding blockers that prevented PR #956's KWOK
deployer matrix from going green on the argocd-* lanes.

#960 — argocd-oci: Argo CD's $values multi-source ref is Git-only.
The repo-server attempts a `git ListRefs` against an oci:// URL and
fails with `unsupported scheme "oci"`. Adds InlineUpstreamValues to
argocd.Generator: when set, KindUpstreamHelm Applications render
single-source with helm.valuesObject inlined from ComponentValues
(no second `ref: values` source). Bundler auto-enables it when the
configured RepoURL starts with `oci://`; argocdhelm's inner Generator
leaves it off because its convertToSingleSourceWithValues transformer
needs the multi-source shape to emit its dynamic-merge Helm template.

#961 — argocd-helm-oci: helm pull rejects the AICR artifactType.
Helm v3 reads manifest.config.mediaType to detect a chart artifact
during pull; with our generic application/vnd.nvidia.aicr.artifact
config + application/vnd.oci.image.layer.v1.tar+gzip layer it
reports `unable to locate any tags in provided repository` even when
the tag exists in /v2/.../tags/list. Adds oci.PackageAndPushHelmChart:
- builds a chart-root-prefixed chart.tgz from the bundle dir
  (matches `helm package` output exactly so helm's loader resolves
  templates/ at the right level)
- pushes with helm.config.v1+json config + helm.chart.content.v1.
  tar+gzip layer via an OCI 1.0 manifest (v1.0 keeps the config-
  blob's mediaType AS the artifactType, which older registry:2
  builds — including the one the kwok lane uses — require)
- rewrites Chart.yaml.version to Reference.Tag before packaging,
  because helm OCI tags ARE chart versions; `helm install --version
  <tag>` validates the chart manifest's version against <tag>.
CLI branches on --deployer argocd-helm to route through the new
flow; all other deployers continue to use the generic AICR OCI push.

Tests: TestGenerate_InlineUpstreamValues asserts the rendered
Application has spec.source (no sources), valuesObject populated,
no $values references; TestRenderInlineValuesYAML covers the
empty-map {} fallback and nested-map indentation.
TestStageHelmOCIManifest_HasHelmMediaTypes asserts both mediaTypes
appear in the staged OCI manifest; TestBuildHelmChartTGZ_
HasChartRootPrefix asserts every tar entry is `<chart-name>/...`
prefixed; TestBuildHelmChartTGZ_Reproducible guards against
time.Now() leaking into the chart.tgz digest;
TestPackageAndPushHelmChart_RewritesChartYAMLOnDisk pins the
version-rewrite side effect.

The PR's helm lane is byte-identical to before. Local repro for the
two newly-unblocked lanes is in the PR description.

Refs: #843
coderabbitai[bot]

This comment was marked as resolved.

@mchmarny mchmarny changed the title feat(ci): KWOK deployer matrix — helm + argocd-oci + argocd-helm-oci feat(ci,bundler): validate argocd / argocd-helm OCI bundles in KWOK May 19, 2026
@mchmarny mchmarny marked this pull request as ready for review May 19, 2026 02:49
@mchmarny mchmarny changed the title feat(ci,bundler): validate argocd / argocd-helm OCI bundles in KWOK feat(ci,bundler): validate deployer OCI bundles matrix in KWOK May 19, 2026
@github-actions

This comment was marked as outdated.

mchmarny added a commit that referenced this pull request May 19, 2026
Ran the deployer-matrix acceptance tests locally; surfaced multiple
wiring issues that the design spike missed because it tested with a
hand-rolled minimal flow rather than the actual validate-scheduling
path. Also resolves two CodeRabbit findings on PR #956.

helm-path is byte-identical (regression baseline still passes).

Wiring fixes (validate-scheduling.sh):
- For argocd-* deployers, run `aicr bundle --output oci://...` from
  inside ${WORK_DIR}. When --output is an OCI ref the bundler writes
  the local bundle to ./bundle (CWD-relative); without the cd, the
  rendered app-of-apps.yaml never lands at ${WORK_DIR}/bundle/.
- Pass --repo `oci://registry.aicr-registry.svc.cluster.local:5000/<repo>`
  explicitly. The auto-derive in pkg/cli/bundle.go uses the same URL
  for push and pull, but Argo CD's repo-server runs in-cluster and
  must use the in-cluster Service DNS (not localhost:5000 which is the
  runner-side host port mapping).
- argocd-helm-oci helm install: use `oci://host/repo --version <tag>`,
  not the `oci://host/repo:<tag>` shape (which Helm OCI rejects as
  "unable to locate any tags"). The tag is split from OCI_REF locally.

Repository secret (install-infra.sh):
- Switch from `secret-type: repository` (exact URL match in Argo CD —
  util/db/repository_secrets.go::getRepositorySecret) to
  `secret-type: repo-creds` (longest-prefix match —
  getRepositoryCredentialIndex). One template at the …/aicr prefix
  covers every recipe pushed by validate-scheduling.sh; Repository
  type cannot because each recipe has a distinct repoURL. The
  repo-creds enrichment still propagates insecureOCIForceHttp=true to
  the synthetic Repository for each Application.

Cluster bootstrap (kind-config.yaml):
- Re-enable CoreDNS. The historical skip was correct for KWOK-only
  scheduling validation; with Argo CD pods in the cluster, they need
  in-cluster DNS to reach `argocd-redis` and to resolve the OCI
  registry's Service DNS. Comment updated to call out the dependency.

CodeRabbit findings:
- run-all-recipes.sh: reset consecutive_sync_timeouts on non-timeout
  failures so the 3-strike rule fires only on genuinely consecutive
  timeouts (a pass-fail-pass-timeout-timeout-timeout sequence is 3
  consecutive; pass-timeout-fail-timeout-timeout is 2).
- validate-scheduling.sh::wait_for_argocd_sync: move `apps_rc=$?`
  out of the `if \! kubectl get ...; then` block. The shell's negation
  rewrites the captured status, so the diagnostic was logging rc=0
  on failure. Now logs the real kubectl rc.
mchmarny added a commit that referenced this pull request May 19, 2026
Closes the two outstanding blockers that prevented PR #956's KWOK
deployer matrix from going green on the argocd-* lanes.

#960 — argocd-oci: Argo CD's $values multi-source ref is Git-only.
The repo-server attempts a `git ListRefs` against an oci:// URL and
fails with `unsupported scheme "oci"`. Adds InlineUpstreamValues to
argocd.Generator: when set, KindUpstreamHelm Applications render
single-source with helm.valuesObject inlined from ComponentValues
(no second `ref: values` source). Bundler auto-enables it when the
configured RepoURL starts with `oci://`; argocdhelm's inner Generator
leaves it off because its convertToSingleSourceWithValues transformer
needs the multi-source shape to emit its dynamic-merge Helm template.

#961 — argocd-helm-oci: helm pull rejects the AICR artifactType.
Helm v3 reads manifest.config.mediaType to detect a chart artifact
during pull; with our generic application/vnd.nvidia.aicr.artifact
config + application/vnd.oci.image.layer.v1.tar+gzip layer it
reports `unable to locate any tags in provided repository` even when
the tag exists in /v2/.../tags/list. Adds oci.PackageAndPushHelmChart:
- builds a chart-root-prefixed chart.tgz from the bundle dir
  (matches `helm package` output exactly so helm's loader resolves
  templates/ at the right level)
- pushes with helm.config.v1+json config + helm.chart.content.v1.
  tar+gzip layer via an OCI 1.0 manifest (v1.0 keeps the config-
  blob's mediaType AS the artifactType, which older registry:2
  builds — including the one the kwok lane uses — require)
- rewrites Chart.yaml.version to Reference.Tag before packaging,
  because helm OCI tags ARE chart versions; `helm install --version
  <tag>` validates the chart manifest's version against <tag>.
CLI branches on --deployer argocd-helm to route through the new
flow; all other deployers continue to use the generic AICR OCI push.

Tests: TestGenerate_InlineUpstreamValues asserts the rendered
Application has spec.source (no sources), valuesObject populated,
no $values references; TestRenderInlineValuesYAML covers the
empty-map {} fallback and nested-map indentation.
TestStageHelmOCIManifest_HasHelmMediaTypes asserts both mediaTypes
appear in the staged OCI manifest; TestBuildHelmChartTGZ_
HasChartRootPrefix asserts every tar entry is `<chart-name>/...`
prefixed; TestBuildHelmChartTGZ_Reproducible guards against
time.Now() leaking into the chart.tgz digest;
TestPackageAndPushHelmChart_RewritesChartYAMLOnDisk pins the
version-rewrite side effect.

The PR's helm lane is byte-identical to before. Local repro for the
two newly-unblocked lanes is in the PR description.

Refs: #843
@mchmarny mchmarny force-pushed the feat/kwok-deployer-matrix-843 branch from aa04c97 to b443028 Compare May 19, 2026 10:27
@mchmarny mchmarny marked this pull request as draft May 19, 2026 11:10
@mchmarny mchmarny force-pushed the feat/kwok-deployer-matrix-843 branch 5 times, most recently from 6dbe344 to aea1e35 Compare May 19, 2026 12:29
@mchmarny mchmarny marked this pull request as ready for review May 19, 2026 12:43
@mchmarny mchmarny enabled auto-merge (squash) May 19, 2026 12:51
Copy link
Copy Markdown
Member

@njhensley njhensley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a substantial and well-scoped contribution — the round-trip lanes immediately surfaced three latent bundler bugs, which is exactly the case for the lanes existing. The OCI tag/version handling in pkg/oci/helm.go and pkg/bundler/deployer/argocd/argocd.go is carefully thought through (inline comments preempt several reasonable misreadings). PR description honestly tracks the deferred items (#963, #964) rather than glossing over them.

I have one blocker and a handful of follow-up suggestions. None of the follow-ups need to block merge if you'd rather chase them as a stacked PR.

Blocker: the new force-finalize namespace cleanup path runs against whatever the current kubectl context happens to be, with no positive identity check that it's the KWOK Kind cluster. If KUBECONFIG or kubectl config current-context points anywhere else when someone runs validate-scheduling.sh directly (it's independently invokable per the script header) or via make from a stale shell, the EXIT trap and cleanup_old_tests will strip finalizers from every non-allowlisted namespace on the live cluster. The pre-existing --wait=true --timeout=120s path was slow but did not bypass the finalizer protocol. This expansion to PATCH spec.finalizers=[] materially widens blast radius and needs a guard before merge. Concrete fix in the inline comment.

Themes worth a follow-up PR (not blocking):

  • System-namespace allowlist drift. Same regex is copy-pasted in three places (run-all-recipes.sh:184, validate-scheduling.sh:290, :497). Any one of them gaining flux-system or aicr-registry and the others not will silently strip finalizers from the infra namespaces. Hoist to one sourced constant.
  • Sync-gate complexity. The four-arm Argo CD sync predicate at validate-scheduling.sh:944-1056 collapses to roughly (sync == Synced || phase == Succeeded) && health != Missing. ~100 lines → ~15. The per-arm prose belongs in ADR-008, not the wait loop.
  • Mirror wait-loops. wait_for_argocd_* and wait_for_flux_* are ~250 lines of structural duplication. At minimum, fold wait_for_*_root_app into one helper.

Pre-merge checklist:

  • Add KWOK-context guard in front of cleanup_old_tests and the EXIT trap (blocker below)
  • Rebase onto origin/main (PR has the needs-rebase label)
  • Optional but recommended: address the in-place Chart.yaml rewrite and the HelmRelease finalizer-patch shape in the same push since the fixes are tiny

Nice work overall.

resolve_flux_root_names "$recipe"

# Set up cleanup trap
trap cleanup EXIT
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker. This force-finalize path runs before verify_kwok_nodes (step 3), so the only thing standing between a misconfigured KUBECONFIG and a finalizer wipe on whatever cluster the user is authenticated to is the static system-namespace allowlist. Since validate-scheduling.sh is independently invokable (per the header ./validate-scheduling.sh <recipe>) and the EXIT trap at line 1779 also fires unconditionally, I'd add a positive identity gate up front — before either path can run.

Something like:

ensure_kwok_context() {
  local ctx; ctx=$(kubectl config current-context 2>/dev/null || true)
  if [[ "$ctx" != "kind-${CLUSTER_NAME:-aicr-kwok}" ]]; then
    log_error "refusing to run: current kubectl context is '$ctx', not a KWOK Kind cluster"
    exit 1
  fi
  kubectl get node -l type=kwok -o name | grep -q . \
    || { log_error "refusing to run: no kwok-typed nodes in current context"; exit 1; }
}

Call it before trap cleanup EXIT at line 1779 and before cleanup_old_tests at line 1790. The same guard should fire from run-all-recipes.sh:cleanup_between_tests (line 157) for the same reason.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in f76ee55. Added ensure_kwok_context_loose (context-name only) and ensure_kwok_context (context + kwok-node-label) in kwok/scripts/lib/cleanup.sh. The strict variant is called from validate-scheduling.sh::main() before trap cleanup EXIT (apply-nodes.sh has populated the cluster by then). The loose variant is called from run-all-recipes.sh::main() after ensure_cluster returns — split because at that point no recipe has applied nodes yet, so the strict label check would false-positive on cold start.

Comment thread pkg/oci/helm.go
if err != nil {
return nil, apperrors.Wrap(apperrors.ErrCodeInternal, "failed to marshal Chart.yaml", err)
}
if err := os.WriteFile(chartPath, rewritten, 0o600); err != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loadAndRewriteChartYAML rewrites Chart.yaml in place inside the caller-supplied SourceDir. The push completes cleanly, but the source dir is now mutated — a subsequent push to a different tag, or a code path that re-derives version from Chart.yaml between push attempts, reads stale tag data. It's also non-atomic; a crash mid-write leaves a corrupt Chart.yaml.

Consider copying SourceDir to a temp dir at the top of PackageAndPushHelmChart, rewriting and packaging from the temp copy, and leaving the caller's source dir untouched. The same approach also fixes the atomicity concern for free.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in f76ee55. PackageAndPushHelmChart now copies SourceDir to <OutputDir>/helm-chart-work and runs loadAndRewriteChartYAML + buildHelmChartTGZ against the copy. The caller's source tree stays byte-identical; the atomicity concern is gone too (crash mid-write leaves a corrupt file in the disposable work dir, not the caller's tree). Reused the existing copyDir helper from push.go.

Comment thread kwok/scripts/validate-scheduling.sh Outdated
--ignore-not-found --wait=false 2>/dev/null || true
# Strip finalizers if helm-controller's reconciler is stuck.
kubectl patch helmrelease "$name" -n "$ns" \
--type=json -p='[{"op":"remove","path":"/metadata/finalizers"}]' \
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RFC 6902 remove errors when the target path is absent; the || true then silently swallows it, which means this guard is a no-op on exactly the case it exists for — a HelmRelease whose helm-controller crashed before attaching its finalizer.

Switch to a merge patch:

kubectl patch helmrelease "$hr" -n "$ns" \
  --type=merge -p '{"metadata":{"finalizers":null}}' >/dev/null 2>&1 || true

Idempotent whether finalizers are present or absent.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in f76ee55. Switched the HelmRelease finalizer strip to --type=merge -p '{"metadata":{"finalizers":null}}'. Idempotent whether finalizers are present or absent — the previous --type=json remove would error when the path was absent and the || true then silently swallowed the exact case the guard was meant for.

Comment thread kwok/scripts/run-all-recipes.sh Outdated
# cluster (no real workloads to leak; cluster destroyed at job end).
local system_ns="default|kube-node-lease|kube-public|kube-system|kwok-system|local-path-storage|argocd|aicr-registry|flux-system"
local terminating
terminating=$(kubectl get ns -o jsonpath='{range .items[?(@.status.phase=="Terminating")]}{.metadata.name}{"\n"}{end}' 2>/dev/null | grep -vE "^(${system_ns})$" || true)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only iterates namespaces already in status.phase=="Terminating". A namespace that's still Active with a pending deletionTimestamp is skipped, and the next recipe collides on the existing namespace name. The full cleanup_old_tests in validate-scheduling.sh handles this with a two-phase pattern (async delete --wait=false, then force-finalize the stuck remainder).

Either mirror the same two-phase logic here, or factor both into a shared helper sourced by both scripts. Otherwise the between-test cleanup silently drifts away from the in-recipe cleanup over time.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in f76ee55. cleanup_between_tests now mirrors validate-scheduling.sh::cleanup_old_tests's two-phase shape: phase 1 issues async deletes (--wait=false) against every non-system namespace, phase 2 sleeps briefly and force-finalizes anything still present. Also factored the system-ns allowlist into kwok/scripts/lib/cleanup.sh::SYSTEM_NS_PATTERN (your other comment) so both cleanup paths now consume the same source.

Comment thread kwok/scripts/run-all-recipes.sh Outdated
# while doing zero useful work. Force-finalize bypasses the protocol
# by PATCHing spec.finalizers to []; safe in the ephemeral KWOK
# cluster (no real workloads to leak; cluster destroyed at job end).
local system_ns="default|kube-node-lease|kube-public|kube-system|kwok-system|local-path-storage|argocd|aicr-registry|flux-system"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regex is the only thing preventing the force-finalize sweep from hitting the install-infra-owned namespaces (argocd, flux-system, aicr-registry). It's now duplicated in three locations — here, validate-scheduling.sh:290, and validate-scheduling.sh:497. Any drift (typo, missing pipe, one site updated and not the other two) silently strips finalizers from those namespaces and breaks the next recipe.

Hoist to a single sourced constant — either a tiny kwok/scripts/lib/cleanup.sh or a readonly SYSTEM_NS_PATTERN=… exported from one of the existing scripts.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in f76ee55. Created kwok/scripts/lib/cleanup.sh with readonly SYSTEM_NS_PATTERN=... and sourced from both scripts. All three sites (run-all-recipes.sh:cleanup_between_tests, validate-scheduling.sh:cleanup, validate-scheduling.sh:cleanup_old_tests) now use ${SYSTEM_NS_PATTERN} with a # see SYSTEM_NS_PATTERN in lib/cleanup.sh annotation pointing back at the canonical definition.

Comment thread pkg/oci/helm.go
return nil, err
}

configBlob, err := json.Marshal(chartMeta)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth a one-line comment pinning the contract: this struct is intentionally the OCI config blob (registry-visible metadata), not the full Chart.yaml. Full Chart.yaml content still ships in the tar layer, so a Helm chart key not modeled here is not lost. Without the comment, a future reader is likely to "fix" this by widening the struct or marshaling the raw map.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in f76ee55. Expanded the godoc on chartYAML to pin the contract: it's intentionally the OCI config blob (registry-visible metadata), NOT a model of full Chart.yaml; full Chart.yaml content still ships in the tar layer; widening this struct is the wrong fix.

// in registry` at bundle time. Caught in the KWOK gke-cos-training
// CI lane after the pre-fix; before this commit the bug only
// surfaced for recipes that combined Helm + pre-manifest overlays.
func TestGenerate_PreManifestParentResolution(t *testing.T) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-pre resolution covered here, -post covered elsewhere, but no test exercises a component with both a -pre and a -post sibling — which is the actual gke-cos OS overlay shape described in the production-code comment. The loop for _, suffix := range []string{"-pre", "-post"} runs twice across two distinct folder iterations in that case, and the override-key lookup for each must resolve back to the same parent.

A new row (or a sibling subtest) staging 001-gpu-operator-pre/ and 003-gpu-operator-post/ against a single registered gpu-operator and asserting both wrapper templates point at the same parent override-key would cover the case for ~20 lines.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in f76ee55. New TestGenerate_PreAndPostManifestParentResolution stages a single registered gpu-operator with both ComponentPreManifests AND ComponentPostManifests populated, then asserts both 001-gpu-operator-pre/ and 003-gpu-operator-post/ folders exist AND both templates/gpu-operator-pre.yaml and templates/gpu-operator-post.yaml wrapper templates were produced. Catches the asymmetric-suffix-strip regression the production code comment warns about.

Comment thread pkg/oci/helm_test.go
// dedups the blob. A regression (e.g., embedding time.Now() in the
// gzip header or in tar mtimes) would silently make every push a
// new blob, defeating registry-level caching.
func TestBuildHelmChartTGZ_Reproducible(t *testing.T) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same-process determinism doesn't prove cross-run reproducibility — a regression that caches time.Now() once at package init would still pass this test. Consider walking the produced tar headers and asserting absolute invariants:

for _, hdr := range hdrs {
    if !hdr.ModTime.IsZero() { /* fail */ }
    if hdr.Uid != 0 || hdr.Gid != 0 { /* fail */ }
    if hdr.Uname != "" || hdr.Gname != "" { /* fail */ }
}
// plus gzip header MTIME zero check

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in f76ee55. TestBuildHelmChartTGZ_Reproducible now walks the produced tar and asserts per-entry ModTime / AccessTime / ChangeTime are zero (or Unix epoch — the tar wire format normalizes the Go zero time to epoch), Uid/Gid are 0, Uname/Gname are empty, AND the outer gzip.Header.ModTime is zero. A time.Now() cached at package-init that passed the same-process byte-equality would now fail these absolute-value assertions.

Comment thread pkg/oci/helm_test.go
// satisfy oras's existence check), and manifest PUTs. Anything else
// returns 404 — the goal isn't a conformant registry, just the
// endpoints PackageAndPushHelmChart's push path actually touches.
func newFakeOCIRegistry(t *testing.T) *httptest.Server {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fake registry accepts manifest PUTs without inspecting manifest.config.mediaType, which is the exact field whose handling motivated the #961 fix. The full round-trip (TestPackageAndPushHelmChart_PushesAgainstFakeRegistry) currently only asserts the outer manifest envelope mediaType.

Either have the fake assert manifest.Config.MediaType == helmConfigMediaType on the manifest PUT handler, or GET the manifest back at the end of the round-trip test and decode it to verify the Helm config mediaType survived.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in f76ee55. TestPackageAndPushHelmChart_PushesAgainstFakeRegistry now round-trips: after the push, it GETs the manifest back from the fake registry via HTTP, decodes the JSON, and asserts manifest.Config.MediaType == helmConfigMediaType AND manifest.Layers[0].MediaType == helmLayerMediaType. The exact field whose handling motivated #961 is now under test.

log_info "Running deploy.sh --no-wait..."

local deploy_output
if ! deploy_output=$("${bundle_dir}/deploy.sh" --no-wait 2>&1); then
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Optional, not blocking.) The four arms here collapse to roughly (sync == "Synced" || operationState.phase == "Succeeded") && health != "Missing". Arms 1–2 reduce to Synced (any non-Missing health); arms 3–4 to phase == Succeeded. The per-arm prose belongs in ADR-008 §"Terminal Pass Predicate"; the wait loop just needs the predicate. Happy to take this as a follow-up.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracking as a follow-up — not addressing in this PR per your note. Filed a mental note to revisit when the ADR-008 §"Terminal Pass Predicate" section gets the predicate-collapse rewrite.

@mchmarny mchmarny force-pushed the feat/kwok-deployer-matrix-843 branch from aea1e35 to f76ee55 Compare May 19, 2026 17:01
Adds end-to-end KWOK validation for argocd, argocd-helm, and flux
bundles published to OCI. Each lane bundles, pushes to an in-cluster
OCI registry, lets the GitOps controller pull and reconcile (or for
flux-oci, apply), and asserts the expected GitOps shape.

Closes the long-standing gap where Argo CD / Flux bundle template
regressions could ship to main undetected. Three latent bundler bugs
surfaced and were fixed during the round-trip.

Coverage delta:
  helm           helm filesystem (baseline, unchanged)
  argocd-oci     NEW - bundle -> OCI push -> Argo CD pull -> reconcile
  argocd-helm-oci NEW - helm OCI mediatype push -> helm install
  flux-oci       NEW - bundle -> OCI pull -> Kustomization apply
                 (HelmRelease reconciliation deferred; see #964)

Latent bundler bugs fixed:
  - #960 argocd-oci `$values` Git-only ref under OCI -> inline values
  - #961 helm OCI mediatype + semver-tag filter -> proper Helm OCI flow
  - argocdhelm `-pre` parent-resolution stripped only `-post` suffix

Infrastructure hardening (from KWOK live testing):
  - Async delete + force-finalize cleanup (KWOK can't run finalizers)
  - 4-arm Argo CD sync gate (Synced+Healthy, Synced+Progressing,
    OutOfSync+Healthy+op.Succeeded, Synced+Degraded+op.Succeeded)
  - Retry max_attempts 3 -> 5 for transient CDN 502s
  - list_bundle_entries SIGPIPE-safe under pipefail
  - macOS port-5500 default (ControlCenter holds 5000)
  - TestWrite_VendorCharts_KustomizeFallthrough hermeticity on macOS

Fixes: #843, #960, #961
Related: #962, #963, #964
@mchmarny mchmarny force-pushed the feat/kwok-deployer-matrix-843 branch from f76ee55 to 45f17ca Compare May 19, 2026 17:16
Copy link
Copy Markdown
Member

@njhensley njhensley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough response — all 12 actionable items addressed cleanly in 45f17cad, and the explicit deferral on the sync-gate predicate collapse (with ADR-008 reference) is acceptable.

A few highlights worth calling out:

  • The os.MkdirTemp staging approach with TestPackageAndPushHelmChart_SourceEqualsOutputDir as a regression guard is exactly the right shape. The self-discovered same-dir recursion bug being captured as a test is good engineering.
  • The widened defaults.MaxChartYAMLBytes + validateChartName (called from both loadAndRewriteChartYAML AND the exported buildHelmChartTGZ) gives the OCI push path solid defense-in-depth.
  • The strike-gate widening to include flux-oci (rather than just argocd-*) is a sharper read than the original review — wait_for_flux_sync also returns 50, and gating on both deployers is the right call.
  • The two-phase cleanup_between_tests mirroring cleanup_old_tests, with SYSTEM_NS_PATTERN centralized in lib/cleanup.sh, eliminates the drift class entirely.

One minor follow-up surfaced from reading the new safety guard — see the inline comment. Not blocking; consider for a stacked PR.

Approving.

local ctx
ctx=$(kubectl config current-context 2>/dev/null || true)
case "$ctx" in
kind-aicr-kwok|kind-aicr-kwok-test)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Minor, non-blocking follow-up.) This hardcodes acceptance of kind-aicr-kwok or kind-aicr-kwok-test, but KWOK_CLUSTER is a publicly documented env var (see .github/actions/kwok-test/action.yml:115 and run-all-recipes.sh:64's CLUSTER_NAME="${KWOK_CLUSTER:-aicr-kwok-test}"). A developer running locally with KWOK_CLUSTER=my-dev-cluster legitimately creates kind-my-dev-cluster and is then hard-blocked by this safety check.

The default CI path is unaffected, but the env var exists specifically to permit non-default names, and the symptom (hard-fail with a generic error) is bad UX for the only population who'd encounter it.

Suggested fix — parameterize against the caller's expected context name:

ensure_kwok_context_loose() {
    local ctx expected="${1:-${CONTEXT:-}}"
    ctx=$(kubectl config current-context 2>/dev/null || true)
    case "$ctx" in
        kind-*)
            if [[ -n "$expected" && "$ctx" != "$expected" ]]; then
                echo "[ERROR] ensure_kwok_context: context '$ctx' does not match expected '$expected'" >&2
                exit 1
            fi
            return 0
            ;;
        *)
            echo "[ERROR] ensure_kwok_context: context '${ctx:-<none>}' is not a kind- cluster — refusing" >&2
            exit 1
            ;;
    esac
}

run-all-recipes.sh already computes CONTEXT="kind-${CLUSTER_NAME}" at line 65, so callers can pass it through or let the function pick it up from the environment. Preserves the safety invariant (context must be a kind- cluster) while honoring the existing customization knob.

@mchmarny mchmarny disabled auto-merge May 19, 2026 17:32
@mchmarny mchmarny merged commit 291439a into main May 19, 2026
106 checks passed
@mchmarny mchmarny deleted the feat/kwok-deployer-matrix-843 branch May 19, 2026 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CI does not validate deployer-specific bundle output (Argo CD bundles can ship broken)

2 participants