Summary

Introduce a convention-based pattern for deploy-time readiness and undeploy-time CR cleanup: each component can ship readiness.yaml and/or cleanup.yaml alongside its existing values.yaml and manifests/*.yaml, rendered through the same template pipeline. AICR generates manifests with the right annotations for each supported deployer; deploy tools execute them via their own native primitives.
Problem
Two related gaps currently live in generated bash:
Deploy-time readiness. deploy.sh prints "Deployment complete." when installs finish, but for GPU bundles that's not "ready for workloads." Recent CUJ2 run: 7 min install → 26 min actually ready.
Undeploy finalizer race. undeploy.sh.tmpl has grown ~700 lines across 7 PRs (#253, #282, #364, #416, #477, #561, #602) chasing a race: today's pre-flight scans for stuck finalizers after helm has removed the controller, which is too late, since nothing is left to release them. Hence the escalation to force_clear_namespace_finalizers.
A pre-uninstall bridge Job (or argocd sync-wave ordering; see below) runs while the controller is still alive, so finalizers resolve normally and the race disappears. Related: #602, #607, #516.
Proposed Convention

Extend the existing recipes/components/<name>/ layout:
| File | Purpose | Status |
| --- | --- | --- |
| values.yaml, values-*.yaml | Helm values + overlays | existing |
| manifests/*.yaml | Raw manifests | existing |
| readiness.yaml | Post-install bridge Job | new |
| cleanup.yaml | Pre-uninstall bridge Job | new |
Bundler picks up the new files by presence, renders through the existing pkg/manifest template pipeline (full access to .Values, .Release.Namespace, overlay values), and annotates for each target deployer. Components that don't need a bridge simply don't ship the file.
Example — recipes/components/gpu-operator/readiness.yaml
Waits for ClusterPolicy.status.state == ready. Reuses the upstream chart's gpu-operator SA — no new RBAC.
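A sketch of what such a readiness.yaml could render to, assuming the template pipeline exposes .Release.Namespace as described; the image choice, Job name, and timeout are illustrative, not AICR's actual schema:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-operator-readiness
  namespace: {{ .Release.Namespace }}
  annotations:
    # helm --wait blocks until this Job completes
    "helm.sh/hook": post-install
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  backoffLimit: 4
  template:
    spec:
      serviceAccountName: gpu-operator   # reuse the upstream chart's SA; no new RBAC
      restartPolicy: OnFailure
      containers:
        - name: wait-for-clusterpolicy
          image: bitnami/kubectl:latest  # illustrative image choice
          command:
            - kubectl
            - wait
            - clusterpolicy/cluster-policy
            - --for=jsonpath={.status.state}=ready
            - --timeout=30m
```

kubectl wait --for=jsonpath covers the simple status-field case; anything richer would need a shell step inside the Job (see Open Questions).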
cleanup.yaml follows the same shape with helm.sh/hook: pre-delete, deletes CR instances via kubectl delete --wait=true, and includes a last-resort force-remove fallback for stuck finalizers.
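Following the same shape, a hedged cleanup.yaml sketch; the CR kind, timeout, and fallback loop are illustrative placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-operator-cleanup
  namespace: {{ .Release.Namespace }}
  annotations:
    # runs before helm removes the controller, while it can still release finalizers
    "helm.sh/hook": pre-delete
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      serviceAccountName: gpu-operator
      restartPolicy: OnFailure
      containers:
        - name: delete-crs
          image: bitnami/kubectl:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              # normal path: controller is alive, finalizers resolve on their own
              kubectl delete clusterpolicies --all --wait=true --timeout=10m ||
              # last-resort force-remove: strip finalizers from anything stuck
              kubectl get clusterpolicies -o name | while read -r cr; do
                kubectl patch "$cr" --type=merge -p '{"metadata":{"finalizers":null}}'
              done
```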
For components without a controller SA (e.g., skyhook-customizations), readiness.yaml / cleanup.yaml ships a minimal inline SA + ClusterRole + ClusterRoleBinding scoped to the CR group. Decision is per-component, not global.
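For the no-controller-SA case, the inline RBAC might look like this minimal sketch; the group and resource names are placeholders for the component's actual CR group:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: skyhook-cleanup
  namespace: {{ .Release.Namespace }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: skyhook-cleanup
rules:
  - apiGroups: ["skyhook.example.com"]   # placeholder CR group
    resources: ["customizations"]
    verbs: ["get", "list", "delete", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: skyhook-cleanup
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: skyhook-cleanup
subjects:
  - kind: ServiceAccount
    name: skyhook-cleanup
    namespace: {{ .Release.Namespace }}
```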
Deployer coverage

| Deployer | Readiness mechanism | Cleanup mechanism |
| --- | --- | --- |
| helm | helm.sh/hook: post-install (helm --wait blocks on Job) | helm.sh/hook: pre-delete (helm waits for Job before removing controller) |
| argocdhelm | Same as helm: the outer chart's helm lifecycle drives the inner hooks | Same: helm pre-delete fires during app lifecycle |
| argocd | argocd.argoproj.io/hook: PostSync + sync-wave | No bridge Job needed. ArgoCD's existing automated.prune: true (already set in application.yaml.tmpl), sync-wave ordering on CR resources (higher wave than controller), and resources-finalizer.argocd.argoproj.io/foreground on the Application handle it natively: CRs prune first, controller prunes last, finalizers resolve cleanly |
The argocd path uses its own native primitives — the cleanup.yaml Job would only emit for the helm and argocdhelm deployers; the argocd deployer emits sync-wave annotations on the right CR resources instead. Same outcome, two rendering paths.
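A sketch of the sync-wave annotations the argocd deployer could emit; the wave numbers are arbitrary, only their relative order matters, and ArgoCD prunes in reverse wave order on cascade delete:

```yaml
# Controller resources: lower wave, synced first and pruned last
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-operator
  annotations:
    argocd.argoproj.io/sync-wave: "0"
---
# CR instance: higher wave, so it syncs after and prunes before the controller
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
  annotations:
    argocd.argoproj.io/sync-wave: "1"
```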
Open Questions

- Finalizer force-remove fallback: always-on (loud events) or opt-in? Leaning toward always-on to match today's force_clear_namespace_finalizers.
- Readiness beyond jsonpath: complex cases (DaemonSet rollout + CR status combined) may need a shell step inside the Job. YAGNI until needed.
- Sync-wave emission for the argocd deployer: does AICR's argocd deployer need extension to set sync-waves on rendered CR resources, or can this come from the chart's own templates?
Acceptance Criteria

- Bundler picks up readiness.yaml and cleanup.yaml by presence
- Files render through the same pipeline as manifests/*.yaml
- Output includes hook annotations appropriate to the target deployer (helm/argocdhelm: helm hooks; argocd: sync-wave + finalizers)
- undeploy.sh.tmpl reduces to a minimal uninstall loop
- deploy.sh needs no new flags
What drops out of undeploy.sh.tmpl

- extra_crds_for_release() hardcoded case (replaced by cleanup.yaml)
- skip_preflight_for_release() hardcoded case
- check_release_for_stuck_crds / check_crd_for_stuck_resources
- force_clear_namespace_finalizers at script level (absorbed into cleanup.yaml)
- delete_release_cluster_resources orphan loop
- capture_kubectl_json + 5× ERROR blocks

With these gone, undeploy.sh.tmpl collapses to a minimal uninstall loop. deploy.sh needs no new flags: the existing helm --wait already blocks on the readiness Job.

Hook injection for upstream charts
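A hypothetical sketch of what that minimal loop could look like; the release names, namespace, and helper name are placeholders, not AICR's actual template variables:

```shell
#!/usr/bin/env bash
# Sketch of the minimal loop undeploy.sh.tmpl could collapse to once
# cleanup.yaml pre-delete hooks own CR deletion and finalizer handling.
set -euo pipefail

undeploy_all() {
  local helm_cmd="$1" namespace="$2"; shift 2
  local release
  for release in "$@"; do
    # helm runs the component's pre-delete cleanup Job while the
    # controller is still alive, so no stuck-finalizer pre-flight here.
    $helm_cmd uninstall "$release" --namespace "$namespace" --wait
  done
}

# Dry-run demonstration: substitute `echo helm` for the real binary.
undeploy_all "echo helm" ai-stack gpu-operator network-operator
```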
Helm hooks must live inside a chart's templates/ dir; there is no external override. For upstream charts AICR doesn't control, inject the Jobs through the chart's passthrough values (e.g., extraManifests, extraDeploy), a free lunch when upstream offers them. readiness.yaml / cleanup.yaml land in whichever chart AICR authors by the same convention.
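For instance, assuming an upstream chart that exposes a Bitnami-style extraDeploy list, the rendered cleanup Job could ride along in a values overlay; the Job body and CR resource name here are placeholders:

```yaml
# values overlay passed to the upstream chart
extraDeploy:
  - apiVersion: batch/v1
    kind: Job
    metadata:
      name: component-cleanup
      annotations:
        "helm.sh/hook": pre-delete
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: delete-crs
              image: bitnami/kubectl:latest
              command: ["kubectl", "delete", "mycrs", "--all", "--wait=true"]
```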
Non-Goals