feat(ci,bundler): validate deployer OCI bundles matrix in KWOK#956
Conversation
|
🌿 Preview your docs: https://nvidia-preview-feat-kwok-deployer-matrix-843.docs.buildwithfern.com/aicr |
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as outdated.
This comment was marked as outdated.
Ran the deployer-matrix acceptance tests locally; surfaced multiple wiring issues that the design spike missed because it tested with a hand-rolled minimal flow rather than the actual validate-scheduling path. Also resolves two CodeRabbit findings on PR #956. helm-path is byte-identical (regression baseline still passes). Wiring fixes (validate-scheduling.sh): - For argocd-* deployers, run `aicr bundle --output oci://...` from inside ${WORK_DIR}. When --output is an OCI ref the bundler writes the local bundle to ./bundle (CWD-relative); without the cd, the rendered app-of-apps.yaml never lands at ${WORK_DIR}/bundle/. - Pass --repo `oci://registry.aicr-registry.svc.cluster.local:5000/<repo>` explicitly. The auto-derive in pkg/cli/bundle.go uses the same URL for push and pull, but Argo CD's repo-server runs in-cluster and must use the in-cluster Service DNS (not localhost:5000 which is the runner-side host port mapping). - argocd-helm-oci helm install: use `oci://host/repo --version <tag>`, not the `oci://host/repo:<tag>` shape (which Helm OCI rejects as "unable to locate any tags"). The tag is split from OCI_REF locally. Repository secret (install-infra.sh): - Switch from `secret-type: repository` (exact URL match in Argo CD — util/db/repository_secrets.go::getRepositorySecret) to `secret-type: repo-creds` (longest-prefix match — getRepositoryCredentialIndex). One template at the …/aicr prefix covers every recipe pushed by validate-scheduling.sh; Repository type cannot because each recipe has a distinct repoURL. The repo-creds enrichment still propagates insecureOCIForceHttp=true to the synthetic Repository for each Application. Cluster bootstrap (kind-config.yaml): - Re-enable CoreDNS. The historical skip was correct for KWOK-only scheduling validation; with Argo CD pods in the cluster, they need in-cluster DNS to reach `argocd-redis` and to resolve the OCI registry's Service DNS. Comment updated to call out the dependency. CodeRabbit findings: - run-all-recipes.sh: reset consecutive_sync_timeouts on non-timeout failures so the 3-strike rule fires only on genuinely consecutive timeouts (a pass-fail-pass-timeout-timeout-timeout sequence is 3 consecutive; pass-timeout-fail-timeout-timeout is 2). - validate-scheduling.sh::wait_for_argocd_sync: move `apps_rc=$?` out of the `if \! kubectl get ...; then` block. The shell's negation rewrites the captured status, so the diagnostic was logging rc=0 on failure. Now logs the real kubectl rc.
This comment was marked as resolved.
This comment was marked as resolved.
Closes the two outstanding blockers that prevented PR #956's KWOK deployer matrix from going green on the argocd-* lanes. #960 — argocd-oci: Argo CD's $values multi-source ref is Git-only. The repo-server attempts a `git ListRefs` against an oci:// URL and fails with `unsupported scheme "oci"`. Adds InlineUpstreamValues to argocd.Generator: when set, KindUpstreamHelm Applications render single-source with helm.valuesObject inlined from ComponentValues (no second `ref: values` source). Bundler auto-enables it when the configured RepoURL starts with `oci://`; argocdhelm's inner Generator leaves it off because its convertToSingleSourceWithValues transformer needs the multi-source shape to emit its dynamic-merge Helm template. #961 — argocd-helm-oci: helm pull rejects the AICR artifactType. Helm v3 reads manifest.config.mediaType to detect a chart artifact during pull; with our generic application/vnd.nvidia.aicr.artifact config + application/vnd.oci.image.layer.v1.tar+gzip layer it reports `unable to locate any tags in provided repository` even when the tag exists in /v2/.../tags/list. Adds oci.PackageAndPushHelmChart: - builds a chart-root-prefixed chart.tgz from the bundle dir (matches `helm package` output exactly so helm's loader resolves templates/ at the right level) - pushes with helm.config.v1+json config + helm.chart.content.v1. tar+gzip layer via an OCI 1.0 manifest (v1.0 keeps the config- blob's mediaType AS the artifactType, which older registry:2 builds — including the one the kwok lane uses — require) - rewrites Chart.yaml.version to Reference.Tag before packaging, because helm OCI tags ARE chart versions; `helm install --version <tag>` validates the chart manifest's version against <tag>. CLI branches on --deployer argocd-helm to route through the new flow; all other deployers continue to use the generic AICR OCI push. Tests: TestGenerate_InlineUpstreamValues asserts the rendered Application has spec.source (no sources), valuesObject populated, no $values references; TestRenderInlineValuesYAML covers the empty-map {} fallback and nested-map indentation. TestStageHelmOCIManifest_HasHelmMediaTypes asserts both mediaTypes appear in the staged OCI manifest; TestBuildHelmChartTGZ_ HasChartRootPrefix asserts every tar entry is `<chart-name>/...` prefixed; TestBuildHelmChartTGZ_Reproducible guards against time.Now() leaking into the chart.tgz digest; TestPackageAndPushHelmChart_RewritesChartYAMLOnDisk pins the version-rewrite side effect. The PR's helm lane is byte-identical to before. Local repro for the two newly-unblocked lanes is in the PR description. Refs: #843
This comment was marked as outdated.
This comment was marked as outdated.
Ran the deployer-matrix acceptance tests locally; surfaced multiple wiring issues that the design spike missed because it tested with a hand-rolled minimal flow rather than the actual validate-scheduling path. Also resolves two CodeRabbit findings on PR #956. helm-path is byte-identical (regression baseline still passes). Wiring fixes (validate-scheduling.sh): - For argocd-* deployers, run `aicr bundle --output oci://...` from inside ${WORK_DIR}. When --output is an OCI ref the bundler writes the local bundle to ./bundle (CWD-relative); without the cd, the rendered app-of-apps.yaml never lands at ${WORK_DIR}/bundle/. - Pass --repo `oci://registry.aicr-registry.svc.cluster.local:5000/<repo>` explicitly. The auto-derive in pkg/cli/bundle.go uses the same URL for push and pull, but Argo CD's repo-server runs in-cluster and must use the in-cluster Service DNS (not localhost:5000 which is the runner-side host port mapping). - argocd-helm-oci helm install: use `oci://host/repo --version <tag>`, not the `oci://host/repo:<tag>` shape (which Helm OCI rejects as "unable to locate any tags"). The tag is split from OCI_REF locally. Repository secret (install-infra.sh): - Switch from `secret-type: repository` (exact URL match in Argo CD — util/db/repository_secrets.go::getRepositorySecret) to `secret-type: repo-creds` (longest-prefix match — getRepositoryCredentialIndex). One template at the …/aicr prefix covers every recipe pushed by validate-scheduling.sh; Repository type cannot because each recipe has a distinct repoURL. The repo-creds enrichment still propagates insecureOCIForceHttp=true to the synthetic Repository for each Application. Cluster bootstrap (kind-config.yaml): - Re-enable CoreDNS. The historical skip was correct for KWOK-only scheduling validation; with Argo CD pods in the cluster, they need in-cluster DNS to reach `argocd-redis` and to resolve the OCI registry's Service DNS. Comment updated to call out the dependency. CodeRabbit findings: - run-all-recipes.sh: reset consecutive_sync_timeouts on non-timeout failures so the 3-strike rule fires only on genuinely consecutive timeouts (a pass-fail-pass-timeout-timeout-timeout sequence is 3 consecutive; pass-timeout-fail-timeout-timeout is 2). - validate-scheduling.sh::wait_for_argocd_sync: move `apps_rc=$?` out of the `if \! kubectl get ...; then` block. The shell's negation rewrites the captured status, so the diagnostic was logging rc=0 on failure. Now logs the real kubectl rc.
Closes the two outstanding blockers that prevented PR #956's KWOK deployer matrix from going green on the argocd-* lanes. #960 — argocd-oci: Argo CD's $values multi-source ref is Git-only. The repo-server attempts a `git ListRefs` against an oci:// URL and fails with `unsupported scheme "oci"`. Adds InlineUpstreamValues to argocd.Generator: when set, KindUpstreamHelm Applications render single-source with helm.valuesObject inlined from ComponentValues (no second `ref: values` source). Bundler auto-enables it when the configured RepoURL starts with `oci://`; argocdhelm's inner Generator leaves it off because its convertToSingleSourceWithValues transformer needs the multi-source shape to emit its dynamic-merge Helm template. #961 — argocd-helm-oci: helm pull rejects the AICR artifactType. Helm v3 reads manifest.config.mediaType to detect a chart artifact during pull; with our generic application/vnd.nvidia.aicr.artifact config + application/vnd.oci.image.layer.v1.tar+gzip layer it reports `unable to locate any tags in provided repository` even when the tag exists in /v2/.../tags/list. Adds oci.PackageAndPushHelmChart: - builds a chart-root-prefixed chart.tgz from the bundle dir (matches `helm package` output exactly so helm's loader resolves templates/ at the right level) - pushes with helm.config.v1+json config + helm.chart.content.v1. tar+gzip layer via an OCI 1.0 manifest (v1.0 keeps the config- blob's mediaType AS the artifactType, which older registry:2 builds — including the one the kwok lane uses — require) - rewrites Chart.yaml.version to Reference.Tag before packaging, because helm OCI tags ARE chart versions; `helm install --version <tag>` validates the chart manifest's version against <tag>. CLI branches on --deployer argocd-helm to route through the new flow; all other deployers continue to use the generic AICR OCI push. Tests: TestGenerate_InlineUpstreamValues asserts the rendered Application has spec.source (no sources), valuesObject populated, no $values references; TestRenderInlineValuesYAML covers the empty-map {} fallback and nested-map indentation. TestStageHelmOCIManifest_HasHelmMediaTypes asserts both mediaTypes appear in the staged OCI manifest; TestBuildHelmChartTGZ_ HasChartRootPrefix asserts every tar entry is `<chart-name>/...` prefixed; TestBuildHelmChartTGZ_Reproducible guards against time.Now() leaking into the chart.tgz digest; TestPackageAndPushHelmChart_RewritesChartYAMLOnDisk pins the version-rewrite side effect. The PR's helm lane is byte-identical to before. Local repro for the two newly-unblocked lanes is in the PR description. Refs: #843
aa04c97 to
b443028
Compare
6dbe344 to
aea1e35
Compare
njhensley
left a comment
There was a problem hiding this comment.
This is a substantial and well-scoped contribution — the round-trip lanes immediately surfaced three latent bundler bugs, which is exactly the case for the lanes existing. The OCI tag/version handling in pkg/oci/helm.go and pkg/bundler/deployer/argocd/argocd.go is carefully thought through (inline comments preempt several reasonable misreadings). PR description honestly tracks the deferred items (#963, #964) rather than glossing over them.
I have one blocker and a handful of follow-up suggestions. None of the follow-ups need to block merge if you'd rather chase them as a stacked PR.
Blocker: the new force-finalize namespace cleanup path runs against whatever the current kubectl context happens to be, with no positive identity check that it's the KWOK Kind cluster. If KUBECONFIG or kubectl config current-context points anywhere else when someone runs validate-scheduling.sh directly (it's independently invokable per the script header) or via make from a stale shell, the EXIT trap and cleanup_old_tests will strip finalizers from every non-allowlisted namespace on the live cluster. The pre-existing --wait=true --timeout=120s path was slow but did not bypass the finalizer protocol. This expansion to PATCH spec.finalizers=[] materially widens blast radius and needs a guard before merge. Concrete fix in the inline comment.
Themes worth a follow-up PR (not blocking):
- System-namespace allowlist drift. Same regex is copy-pasted in three places (
run-all-recipes.sh:184,validate-scheduling.sh:290,:497). Any one of them gainingflux-systemoraicr-registryand the others not will silently strip finalizers from the infra namespaces. Hoist to one sourced constant. - Sync-gate complexity. The four-arm Argo CD sync predicate at
validate-scheduling.sh:944-1056collapses to roughly(sync == Synced || phase == Succeeded) && health != Missing. ~100 lines → ~15. The per-arm prose belongs in ADR-008, not the wait loop. - Mirror wait-loops.
wait_for_argocd_*andwait_for_flux_*are ~250 lines of structural duplication. At minimum, foldwait_for_*_root_appinto one helper.
Pre-merge checklist:
- Add KWOK-context guard in front of
cleanup_old_testsand the EXIT trap (blocker below) - Rebase onto
origin/main(PR has theneeds-rebaselabel) - Optional but recommended: address the in-place
Chart.yamlrewrite and the HelmRelease finalizer-patch shape in the same push since the fixes are tiny
Nice work overall.
| resolve_flux_root_names "$recipe" | ||
|
|
||
| # Set up cleanup trap | ||
| trap cleanup EXIT |
There was a problem hiding this comment.
Blocker. This force-finalize path runs before verify_kwok_nodes (step 3), so the only thing standing between a misconfigured KUBECONFIG and a finalizer wipe on whatever cluster the user is authenticated to is the static system-namespace allowlist. Since validate-scheduling.sh is independently invokable (per the header ./validate-scheduling.sh <recipe>) and the EXIT trap at line 1779 also fires unconditionally, I'd add a positive identity gate up front — before either path can run.
Something like:
ensure_kwok_context() {
local ctx; ctx=$(kubectl config current-context 2>/dev/null || true)
if [[ "$ctx" != "kind-${CLUSTER_NAME:-aicr-kwok}" ]]; then
log_error "refusing to run: current kubectl context is '$ctx', not a KWOK Kind cluster"
exit 1
fi
kubectl get node -l type=kwok -o name | grep -q . \
|| { log_error "refusing to run: no kwok-typed nodes in current context"; exit 1; }
}Call it before trap cleanup EXIT at line 1779 and before cleanup_old_tests at line 1790. The same guard should fire from run-all-recipes.sh:cleanup_between_tests (line 157) for the same reason.
There was a problem hiding this comment.
Addressed in f76ee55. Added ensure_kwok_context_loose (context-name only) and ensure_kwok_context (context + kwok-node-label) in kwok/scripts/lib/cleanup.sh. The strict variant is called from validate-scheduling.sh::main() before trap cleanup EXIT (apply-nodes.sh has populated the cluster by then). The loose variant is called from run-all-recipes.sh::main() after ensure_cluster returns — split because at that point no recipe has applied nodes yet, so the strict label check would false-positive on cold start.
| if err != nil { | ||
| return nil, apperrors.Wrap(apperrors.ErrCodeInternal, "failed to marshal Chart.yaml", err) | ||
| } | ||
| if err := os.WriteFile(chartPath, rewritten, 0o600); err != nil { |
There was a problem hiding this comment.
loadAndRewriteChartYAML rewrites Chart.yaml in place inside the caller-supplied SourceDir. The push completes cleanly, but the source dir is now mutated — a subsequent push to a different tag, or a code path that re-derives version from Chart.yaml between push attempts, reads stale tag data. It's also non-atomic; a crash mid-write leaves a corrupt Chart.yaml.
Consider copying SourceDir to a temp dir at the top of PackageAndPushHelmChart, rewriting and packaging from the temp copy, and leaving the caller's source dir untouched. The same approach also fixes the atomicity concern for free.
There was a problem hiding this comment.
Addressed in f76ee55. PackageAndPushHelmChart now copies SourceDir to <OutputDir>/helm-chart-work and runs loadAndRewriteChartYAML + buildHelmChartTGZ against the copy. The caller's source tree stays byte-identical; the atomicity concern is gone too (crash mid-write leaves a corrupt file in the disposable work dir, not the caller's tree). Reused the existing copyDir helper from push.go.
| --ignore-not-found --wait=false 2>/dev/null || true | ||
| # Strip finalizers if helm-controller's reconciler is stuck. | ||
| kubectl patch helmrelease "$name" -n "$ns" \ | ||
| --type=json -p='[{"op":"remove","path":"/metadata/finalizers"}]' \ |
There was a problem hiding this comment.
RFC 6902 remove errors when the target path is absent; the || true then silently swallows it, which means this guard is a no-op on exactly the case it exists for — a HelmRelease whose helm-controller crashed before attaching its finalizer.
Switch to a merge patch:
kubectl patch helmrelease "$hr" -n "$ns" \
--type=merge -p '{"metadata":{"finalizers":null}}' >/dev/null 2>&1 || trueIdempotent whether finalizers are present or absent.
There was a problem hiding this comment.
Addressed in f76ee55. Switched the HelmRelease finalizer strip to --type=merge -p '{"metadata":{"finalizers":null}}'. Idempotent whether finalizers are present or absent — the previous --type=json remove would error when the path was absent and the || true then silently swallowed the exact case the guard was meant for.
| # cluster (no real workloads to leak; cluster destroyed at job end). | ||
| local system_ns="default|kube-node-lease|kube-public|kube-system|kwok-system|local-path-storage|argocd|aicr-registry|flux-system" | ||
| local terminating | ||
| terminating=$(kubectl get ns -o jsonpath='{range .items[?(@.status.phase=="Terminating")]}{.metadata.name}{"\n"}{end}' 2>/dev/null | grep -vE "^(${system_ns})$" || true) |
There was a problem hiding this comment.
This only iterates namespaces already in status.phase=="Terminating". A namespace that's still Active with a pending deletionTimestamp is skipped, and the next recipe collides on the existing namespace name. The full cleanup_old_tests in validate-scheduling.sh handles this with a two-phase pattern (async delete --wait=false, then force-finalize the stuck remainder).
Either mirror the same two-phase logic here, or factor both into a shared helper sourced by both scripts. Otherwise the between-test cleanup silently drifts away from the in-recipe cleanup over time.
There was a problem hiding this comment.
Addressed in f76ee55. cleanup_between_tests now mirrors validate-scheduling.sh::cleanup_old_tests's two-phase shape: phase 1 issues async deletes (--wait=false) against every non-system namespace, phase 2 sleeps briefly and force-finalizes anything still present. Also factored the system-ns allowlist into kwok/scripts/lib/cleanup.sh::SYSTEM_NS_PATTERN (your other comment) so both cleanup paths now consume the same source.
| # while doing zero useful work. Force-finalize bypasses the protocol | ||
| # by PATCHing spec.finalizers to []; safe in the ephemeral KWOK | ||
| # cluster (no real workloads to leak; cluster destroyed at job end). | ||
| local system_ns="default|kube-node-lease|kube-public|kube-system|kwok-system|local-path-storage|argocd|aicr-registry|flux-system" |
There was a problem hiding this comment.
This regex is the only thing preventing the force-finalize sweep from hitting the install-infra-owned namespaces (argocd, flux-system, aicr-registry). It's now duplicated in three locations — here, validate-scheduling.sh:290, and validate-scheduling.sh:497. Any drift (typo, missing pipe, one site updated and not the other two) silently strips finalizers from those namespaces and breaks the next recipe.
Hoist to a single sourced constant — either a tiny kwok/scripts/lib/cleanup.sh or a readonly SYSTEM_NS_PATTERN=… exported from one of the existing scripts.
There was a problem hiding this comment.
Addressed in f76ee55. Created kwok/scripts/lib/cleanup.sh with readonly SYSTEM_NS_PATTERN=... and sourced from both scripts. All three sites (run-all-recipes.sh:cleanup_between_tests, validate-scheduling.sh:cleanup, validate-scheduling.sh:cleanup_old_tests) now use ${SYSTEM_NS_PATTERN} with a # see SYSTEM_NS_PATTERN in lib/cleanup.sh annotation pointing back at the canonical definition.
| return nil, err | ||
| } | ||
|
|
||
| configBlob, err := json.Marshal(chartMeta) |
There was a problem hiding this comment.
Worth a one-line comment pinning the contract: this struct is intentionally the OCI config blob (registry-visible metadata), not the full Chart.yaml. Full Chart.yaml content still ships in the tar layer, so a Helm chart key not modeled here is not lost. Without the comment, a future reader is likely to "fix" this by widening the struct or marshaling the raw map.
There was a problem hiding this comment.
Addressed in f76ee55. Expanded the godoc on chartYAML to pin the contract: it's intentionally the OCI config blob (registry-visible metadata), NOT a model of full Chart.yaml; full Chart.yaml content still ships in the tar layer; widening this struct is the wrong fix.
| // in registry` at bundle time. Caught in the KWOK gke-cos-training | ||
| // CI lane after the pre-fix; before this commit the bug only | ||
| // surfaced for recipes that combined Helm + pre-manifest overlays. | ||
| func TestGenerate_PreManifestParentResolution(t *testing.T) { |
There was a problem hiding this comment.
-pre resolution covered here, -post covered elsewhere, but no test exercises a component with both a -pre and a -post sibling — which is the actual gke-cos OS overlay shape described in the production-code comment. The loop for _, suffix := range []string{"-pre", "-post"} runs twice across two distinct folder iterations in that case, and the override-key lookup for each must resolve back to the same parent.
A new row (or a sibling subtest) staging 001-gpu-operator-pre/ and 003-gpu-operator-post/ against a single registered gpu-operator and asserting both wrapper templates point at the same parent override-key would cover the case for ~20 lines.
There was a problem hiding this comment.
Addressed in f76ee55. New TestGenerate_PreAndPostManifestParentResolution stages a single registered gpu-operator with both ComponentPreManifests AND ComponentPostManifests populated, then asserts both 001-gpu-operator-pre/ and 003-gpu-operator-post/ folders exist AND both templates/gpu-operator-pre.yaml and templates/gpu-operator-post.yaml wrapper templates were produced. Catches the asymmetric-suffix-strip regression the production code comment warns about.
| // dedups the blob. A regression (e.g., embedding time.Now() in the | ||
| // gzip header or in tar mtimes) would silently make every push a | ||
| // new blob, defeating registry-level caching. | ||
| func TestBuildHelmChartTGZ_Reproducible(t *testing.T) { |
There was a problem hiding this comment.
Same-process determinism doesn't prove cross-run reproducibility — a regression that caches time.Now() once at package init would still pass this test. Consider walking the produced tar headers and asserting absolute invariants:
for _, hdr := range hdrs {
if !hdr.ModTime.IsZero() { /* fail */ }
if hdr.Uid != 0 || hdr.Gid != 0 { /* fail */ }
if hdr.Uname != "" || hdr.Gname != "" { /* fail */ }
}
// plus gzip header MTIME zero checkThere was a problem hiding this comment.
Addressed in f76ee55. TestBuildHelmChartTGZ_Reproducible now walks the produced tar and asserts per-entry ModTime / AccessTime / ChangeTime are zero (or Unix epoch — the tar wire format normalizes the Go zero time to epoch), Uid/Gid are 0, Uname/Gname are empty, AND the outer gzip.Header.ModTime is zero. A time.Now() cached at package-init that passed the same-process byte-equality would now fail these absolute-value assertions.
| // satisfy oras's existence check), and manifest PUTs. Anything else | ||
| // returns 404 — the goal isn't a conformant registry, just the | ||
| // endpoints PackageAndPushHelmChart's push path actually touches. | ||
| func newFakeOCIRegistry(t *testing.T) *httptest.Server { |
There was a problem hiding this comment.
The fake registry accepts manifest PUTs without inspecting manifest.config.mediaType, which is the exact field whose handling motivated the #961 fix. The full round-trip (TestPackageAndPushHelmChart_PushesAgainstFakeRegistry) currently only asserts the outer manifest envelope mediaType.
Either have the fake assert manifest.Config.MediaType == helmConfigMediaType on the manifest PUT handler, or GET the manifest back at the end of the round-trip test and decode it to verify the Helm config mediaType survived.
There was a problem hiding this comment.
Addressed in f76ee55. TestPackageAndPushHelmChart_PushesAgainstFakeRegistry now round-trips: after the push, it GETs the manifest back from the fake registry via HTTP, decodes the JSON, and asserts manifest.Config.MediaType == helmConfigMediaType AND manifest.Layers[0].MediaType == helmLayerMediaType. The exact field whose handling motivated #961 is now under test.
| log_info "Running deploy.sh --no-wait..." | ||
|
|
||
| local deploy_output | ||
| if ! deploy_output=$("${bundle_dir}/deploy.sh" --no-wait 2>&1); then |
There was a problem hiding this comment.
(Optional, not blocking.) The four arms here collapse to roughly (sync == "Synced" || operationState.phase == "Succeeded") && health != "Missing". Arms 1–2 reduce to Synced (any non-Missing health); arms 3–4 to phase == Succeeded. The per-arm prose belongs in ADR-008 §"Terminal Pass Predicate"; the wait loop just needs the predicate. Happy to take this as a follow-up.
There was a problem hiding this comment.
Tracking as a follow-up — not addressing in this PR per your note. Filed a mental note to revisit when the ADR-008 §"Terminal Pass Predicate" section gets the predicate-collapse rewrite.
aea1e35 to
f76ee55
Compare
Adds end-to-end KWOK validation for argocd, argocd-helm, and flux
bundles published to OCI. Each lane bundles, pushes to an in-cluster
OCI registry, lets the GitOps controller pull and reconcile (or for
flux-oci, apply), and asserts the expected GitOps shape.
Closes the long-standing gap where Argo CD / Flux bundle template
regressions could ship to main undetected. Three latent bundler bugs
surfaced and were fixed during the round-trip.
Coverage delta:
helm helm filesystem (baseline, unchanged)
argocd-oci NEW - bundle -> OCI push -> Argo CD pull -> reconcile
argocd-helm-oci NEW - helm OCI mediatype push -> helm install
flux-oci NEW - bundle -> OCI pull -> Kustomization apply
(HelmRelease reconciliation deferred; see #964)
Latent bundler bugs fixed:
- #960 argocd-oci `$values` Git-only ref under OCI -> inline values
- #961 helm OCI mediatype + semver-tag filter -> proper Helm OCI flow
- argocdhelm `-pre` parent-resolution stripped only `-post` suffix
Infrastructure hardening (from KWOK live testing):
- Async delete + force-finalize cleanup (KWOK can't run finalizers)
- 4-arm Argo CD sync gate (Synced+Healthy, Synced+Progressing,
OutOfSync+Healthy+op.Succeeded, Synced+Degraded+op.Succeeded)
- Retry max_attempts 3 -> 5 for transient CDN 502s
- list_bundle_entries SIGPIPE-safe under pipefail
- macOS port-5500 default (ControlCenter holds 5000)
- TestWrite_VendorCharts_KustomizeFallthrough hermeticity on macOS
Fixes: #843, #960, #961
Related: #962, #963, #964
f76ee55 to
45f17ca
Compare
njhensley
left a comment
There was a problem hiding this comment.
Thanks for the thorough response — all 12 actionable items addressed cleanly in 45f17cad, and the explicit deferral on the sync-gate predicate collapse (with ADR-008 reference) is acceptable.
A few highlights worth calling out:
- The
os.MkdirTempstaging approach withTestPackageAndPushHelmChart_SourceEqualsOutputDiras a regression guard is exactly the right shape. The self-discovered same-dir recursion bug being captured as a test is good engineering. - The widened
defaults.MaxChartYAMLBytes+validateChartName(called from bothloadAndRewriteChartYAMLAND the exportedbuildHelmChartTGZ) gives the OCI push path solid defense-in-depth. - The strike-gate widening to include
flux-oci(rather than just argocd-*) is a sharper read than the original review —wait_for_flux_syncalso returns 50, and gating on both deployers is the right call. - The two-phase
cleanup_between_testsmirroringcleanup_old_tests, withSYSTEM_NS_PATTERNcentralized inlib/cleanup.sh, eliminates the drift class entirely.
One minor follow-up surfaced from reading the new safety guard — see the inline comment. Not blocking; consider for a stacked PR.
Approving.
| local ctx | ||
| ctx=$(kubectl config current-context 2>/dev/null || true) | ||
| case "$ctx" in | ||
| kind-aicr-kwok|kind-aicr-kwok-test) |
There was a problem hiding this comment.
(Minor, non-blocking follow-up.) This hardcodes acceptance of kind-aicr-kwok or kind-aicr-kwok-test, but KWOK_CLUSTER is a publicly documented env var (see .github/actions/kwok-test/action.yml:115 and run-all-recipes.sh:64's CLUSTER_NAME="${KWOK_CLUSTER:-aicr-kwok-test}"). A developer running locally with KWOK_CLUSTER=my-dev-cluster legitimately creates kind-my-dev-cluster and is then hard-blocked by this safety check.
The default CI path is unaffected, but the env var exists specifically to permit non-default names, and the symptom (hard-fail with a generic error) is bad UX for the only population who'd encounter it.
Suggested fix — parameterize against the caller's expected context name:
ensure_kwok_context_loose() {
local ctx expected="${1:-${CONTEXT:-}}"
ctx=$(kubectl config current-context 2>/dev/null || true)
case "$ctx" in
kind-*)
if [[ -n "$expected" && "$ctx" != "$expected" ]]; then
echo "[ERROR] ensure_kwok_context: context '$ctx' does not match expected '$expected'" >&2
exit 1
fi
return 0
;;
*)
echo "[ERROR] ensure_kwok_context: context '${ctx:-<none>}' is not a kind- cluster — refusing" >&2
exit 1
;;
esac
}run-all-recipes.sh already computes CONTEXT="kind-${CLUSTER_NAME}" at line 65, so callers can pass it through or let the function pick it up from the environment. Preserves the safety invariant (context must be a kind- cluster) while honoring the existing customization knob.
Summary
Adds end-to-end validation of
argocd,argocd-helm, andfluxbundlesin the KWOK CI matrix. Each lane bundles the recipe, pushes the artifact
to an in-cluster OCI registry, lets the corresponding GitOps controller
pull and reconcile (or for
flux-oci, apply), and asserts the bundle'sGitOps shape was produced. Closes the long-standing gap where Argo CD
and Flux bundle template regressions could ship to
mainundetected.Three latent bundler bugs that this round-trip surfaced are fixed in
the same PR. Two further gaps are deferred to tracked follow-ups
(#963, #964).
Fixes: #843, #960, #961
Related: #962 (migrate KWOK argocd-* sync gate to Chainsaw),
#963 (Gitea-in-cluster for filesystem-bundle round-trip),
#964 (flux local-chart HelmReleases unreachable under OCI)
Deployer × output-format CI coverage
Before / after this PR, by deployer:
helmhelmdeployer's OCI artifact existsargocdbundle-variants(app-of-apps shape)argocd-helmbundle-dynamic(Chart.yaml + templates/)helm installround-trip)fluxbundle-fluxhelmfilebundle-helmfile(incl. realhelmfile build)helmfile.yaml'soci: trueflags chart repos, not the bundle artifactThe KWOK matrix goes from 1 deployer (helm) to 4 (helm, argocd-oci,
argocd-helm-oci, flux-oci).
For filesystem (Git-source) bundle round-trip coverage of
argocdandflux, see follow-up #963.What "pass" means per lane
The KWOK lanes validate bundle-format correctness, not chart
reconciliation completeness. Each lane stops at the strongest signal
its controller can produce without fighting KWOK fidelity gaps.
helmhelm install)argocd-ociargocd-helm-ocihelm pullOCI)flux-ociThe argocd-* lanes admit four reality-shaped terminal-pass outcomes to
handle KWOK pod-readiness simulation gaps and operator-mutation drift:
Synced + Healthy— canonical passSynced + Progressing— KWOK pod-readiness simulation gap (Prometheus StatefulSet, DaemonSet probes); not a bundle defectOutOfSync + Healthy + operationState.phase=Succeeded— operator-mutation drift (gpu-operator's ClusterPolicy, nvidia-dra-driver-gpu's DeviceClass); production users handle this withignoreDifferences, not a bundle defectSynced + Degraded + operationState.phase=Succeeded— chart applied but health controller marks at least one managed resource Degraded (typically a Pod whose Ready signal KWOK can't produce, e.g. kai-scheduler leader election); samephase=Succeededguard as arm 3 rejects genuinely-broken syncsThe matrix does not assert that every chart eventually reaches
Healthy under KWOK — that conflates bundle shape with chart-author
contracts and KWOK simulation fidelity.
flux-ociintentionally stops one step earlier ("bundle applied" =OCIRepository pulled + Kustomization applied + HelmRelease CRs created)
because the bundler emits local-chart HelmReleases pointing at a
placeholder Git URL (
https://github.com/YOUR_ORG/YOUR_REPO.git).Under OCI consumption these can't reconcile, and Flux's
dependsOncascade stalls every downstream HelmRelease. Flux's HelmRelease v2 API
blocks the obvious fix (
spec.chart.spec.sourceRef.kindrejectsOCIRepository;spec.chartRefhas no path support into a multi-chartartifact), so the proper fix needs an ADR-level design decision.
Tracked in #964.
Issues this PR exposed
The round-trip lanes immediately surfaced latent bundler bugs. Each
was invisible to existing CI because no lane previously exercised a
GitOps controller pulling an AICR bundle from a registry.
argocd-ocifailed withunsupported scheme "oci"from Argo CD repo-serverapplication.yaml.tmplrendered multi-source$valuesref against an OCI URL;$valuesis Git-only in Argo CDargocd-helm-ocifailed withunable to locate any tags in provided repositoryfromhelm pullhelm.config.v1+jsonconfig mediatype + semver tagargocd-helm-ocifailedcomponent "gpu-operator-pre" not found in registryon recipes withpreManifestFiles(e.g.gke-cos-training)argocdhelm.processFoldersparent-resolution stripped-postsuffix but not-preflux-ocilocal-chart HelmReleases stall under OCI;dependsOncascade stalls the rest of the graphGitRepositoryURL for inner local charts; Flux's HelmRelease API rejectsOCIRepositoryas a chart sourceRef andchartRefhas no path support, so a wrapper OCIRepository can't address sub-paths. Proper fix needs ADR-level choice (render at bundle time vs. publish per-chart).What's in this PR
Argo CD $values multi-source ref is Git-only; blocks argocd deployer over OCI #960 —
argocd.Generatorgains anInlineUpstreamValuesflag that, when set (auto-enabled by the bundler for OCI RepoURLs), inlines per-component values ashelm.valuesObjectunder a single-source Application. The Git-only$valuesmulti-source ref is no longer rendered for OCI targets.argocd-helm-oci: helm pull from in-cluster OCI registry returns 'unable to locate any tags' #961 — New
oci.PackageAndPushHelmChartflow: builds a chart-root-prefixedchart.tgz, pushes withapplication/vnd.cncf.helm.config.v1+jsonconfig +application/vnd.cncf.helm.chart.content.v1.tar+gziplayer, and rewritesChart.yaml.versionto the OCI tag sohelm install --version <tag>resolves. The CLI rejects non-semver tags at bundle time with an actionable remediation hint.argocdhelm
-preresolution — Parent-component lookup inprocessFoldersnow symmetrically strips both-preand-postsuffixes. Recipes withpreManifestFiles(gke-cos OS overlay, etc.) bundle correctly.CLI contract clarification —
--deployer argocd-helmno longer auto-fillsrepoURLfrom the OCI reference; that bundle is URL-portable by design andrepoURLis supplied athelm installtime.KWOK matrix scaffolding —
deployerdimension added tokwok-recipes.yaml(Tier 1 + Tier 3). Per-cell: Kind cluster + in-clusterregistry:2+ the GitOps controller(s) the selected deployer needs (Argo CD chart9.5.14or Flux 2 install manifestv2.8.7), repo-creds template Secret (Argo CD lanes only),aicr bundle --output oci://..., then eitherkubectl apply(argocd-oci, flux-oci) orhelm install(argocd-helm-oci), then sync-gate, thenverify_pods.flux-ocilane — Installs Flux 2 (source-controller / kustomize-controller / helm-controller) via the pinnedinstall.yaml. Applies anOCIRepository(withspec.layerSelector.mediaType: application/vnd.oci.image.layer.v1.tar+gzipfor the AICR generic OCI mediatype, plusspec.insecure: truefor the in-cluster HTTP registry) and aKustomizationpointing at it. Sync gate validates pull + apply + HelmRelease-CRs-created; does NOT wait for HelmRelease reconciliation — see #964.Sync gate (argocd) — Four-state terminal-pass predicate, non-root Application fallback for argocd-helm-oci wrappers whose
status.resourceshasn't populated yet, andset -e-safekubectlprobe.3-strike rule —
run-all-recipes.shbails the whole job with exit 50 after three consecutive GitOps sync timeouts (argocd-* or flux-oci) so a wedged cluster doesn't burn the full matrix budget.Cleanup hardening — Three call sites switched from
kubectl delete ns --wait=true --timeout=120sto async delete + force-finalize. KWOK can't run real controller finalizers, so--wait=trueblocked the full 2 min per namespace; with ~10 namespaces/recipe this exhausted the 15 min job budget on cleanup after the actual validation passed. Force-finalize bypasses the protocol by PATCHing empty finalizers — safe in the ephemeral KWOK cluster. Cleanup drops from ~20 min/recipe to ~5–10 s.Retry hardening —
KWOK_COMMAND_RETRIESdefault bumped from 3 to 5 (cumulative sleep 15s → 75s) so transient GitHub releases CDN 502s onkwok-stage-fastchart fetch don't cancel CI runs.SIGPIPE safety —
list_bundle_entriesnow traps SIGPIPE in its subshell so a downstreamhead -Nclosing the pipe early can't propagate exit 141 to the whole script.macOS port-5500 default —
kwok/kind-config.yamlexposes the in-cluster registry on host port 5500 (ControlCenter holds 5000); Linux CI runners are unaffected.Test hermeticity —
TestWrite_VendorCharts_KustomizeFallthroughsetsGIT_TERMINAL_PROMPT=0so kustomize's git fetcher fails fast instead of promptingUsername for 'https://github.com':mid-make qualifyon macOS.Testing
make qualifypasses locally on the final commit (test + lint + e2e + scan).KWOK acceptance against
eks-trainingon macOS:helm: ✅ PASS — baseline byte-identical tomainargocd-oci: ✅ PASS — 15/15 Applications terminal, 44/44 pods scheduledargocd-helm-oci: ✅ PASS — 15/15 Applications terminal, 44/44 pods scheduledflux-oci: ✅ PASS — OCIRepository pulled, Kustomization applied, 14 HelmReleases created (reconciliation deferred per bundler(flux): local-chart HelmReleases unreachable under OCI consumption #964)CI matrix covers 16 recipes × 4 deployers; gke-cos-* and other recipes with COS-OS
preManifestFilesspecifically exercise the-prefix.Risk
helmand non-OCI deployer paths are byte-identical tomain. Changesthat affect OCI users:
argocd+ OCI: per-component Applications now render single-source withhelm.valuesObjectinlined. Per-componentvalues.yamlfiles remain on disk for operator readability but are no longer referenced via$values. Git-backedargocdis unchanged.argocd-helm+ OCI: bundle is now published as a Helm OCI chart (helm.config.v1+jsonconfig +helm.chart.content.v1.tar+gziplayer).Chart.yaml.versionis rewritten to match the OCI tag sohelm install --version <tag>resolves correctly.argocd-helm+ OCI now rejects non-semver tags at bundle time with an actionable error message ("Wrap as a semver pre-release, e.g.0.0.0-<tag>"), instead of producing an artifact thathelm pullsilently refuses to find.flux+ OCI: No bundler changes. The bundler-side investigation found Flux's HelmRelease v2 API blocks the natural fix (spec.chart.spec.sourceRef.kindrejectsOCIRepository;spec.chartRefhas no path support). The flux-oci lane validates the bundle-apply step only and tracks the proper fix in #964.Checklist
make testwith-race)golangci-lint0 issues)-preresolution; argocd inline values; oci helm mediatypes + tag semver; oci helm tgz reproducibility)docs/contributor/kwok-testing.md,docs/design/008-kwok-deployer-matrix.md)git commit -S)