
Bridge Jobs for deploy/undeploy lifecycle flow control #610

@lockwobr

Description


Summary

Introduce a convention-based pattern for deploy-time readiness and undeploy-time CR cleanup: each component can ship readiness.yaml and/or cleanup.yaml alongside its existing values.yaml and manifests/*.yaml, rendered through the same template pipeline. AICR generates manifests with the right annotations for each supported deployer; deploy tools execute them via their own native primitives.

Problem

Two related gaps currently live in generated bash:

Deploy-time readiness. deploy.sh prints Deployment complete. when installs finish, but for GPU bundles that's not "ready for workloads." Recent CUJ2 run: 7 min install → 26 min actually ready.

Undeploy finalizer race. undeploy.sh.tmpl has grown ~700 lines across 7 PRs (#253, #282, #364, #416, #477, #561, #602) chasing a race: today's pre-flight scans for stuck finalizers after helm has removed the controller — too late, nothing left to release them. Hence escalation to force_clear_namespace_finalizers.

A pre-uninstall bridge Job (or argocd sync-wave ordering — see below) runs while the controller is alive, so finalizers resolve normally and the race disappears.

Related: #602, #607, #516.

Proposed Convention

Extend the existing recipes/components/<name>/ layout:

| File | Purpose | Status |
|---|---|---|
| values.yaml, values-*.yaml | Helm values + overlays | existing |
| manifests/*.yaml | Raw manifests | existing |
| readiness.yaml | Post-install bridge Job | new |
| cleanup.yaml | Pre-uninstall bridge Job | new |

Bundler picks up the new files by presence, renders through the existing pkg/manifest template pipeline (full access to .Values, .Release.Namespace, overlay values), and annotates for each target deployer. Components that don't need a bridge simply don't ship the file.

Example — recipes/components/gpu-operator/readiness.yaml

Waits for ClusterPolicy.status.state == ready. Reuses the upstream chart's gpu-operator SA — no new RBAC.

{{- $gpuOp := index .Values "gpu-operator" | default dict }}
{{- $readiness := $gpuOp.readiness | default dict }}
{{- /* Enabled unless readiness.enabled is explicitly false; avoids the
       sprig `| default true` pitfall, which treats false as empty. */}}
{{- if ne (toString $readiness.enabled) "false" }}
---
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-operator-readiness
  namespace: {{ .Release.Namespace }}
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-weight": "20"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
    "argocd.argoproj.io/hook": PostSync
    "argocd.argoproj.io/sync-wave": "10"
spec:
  activeDeadlineSeconds: {{ $readiness.timeoutSeconds | default 1800 }}
  template:
    spec:
      serviceAccountName: gpu-operator
      restartPolicy: Never
      containers:
        - name: wait
          image: {{ $readiness.image | default "bitnami/kubectl:1.30" }}
          # Block sequence: --for=jsonpath={...} contains flow indicators
          # ({, }, ,) that are invalid in an unquoted flow-style list.
          command:
            - kubectl
            - wait
            - clusterpolicy
            - --all
            - --for=jsonpath={.status.state}=ready
            - --timeout={{ $readiness.timeout | default "30m" }}
{{- end }}

cleanup.yaml follows the same shape with helm.sh/hook: pre-delete, deletes CR instances via kubectl delete --wait=true, and includes a last-resort force-remove fallback for stuck finalizers.
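A cleanup.yaml sketch along those lines, again for the gpu-operator component (value keys and the fallback loop are illustrative, not taken from an existing component):

```yaml
{{- $gpuOp := index .Values "gpu-operator" | default dict }}
{{- $cleanup := $gpuOp.cleanup | default dict }}
{{- if ne (toString $cleanup.enabled) "false" }}
---
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-operator-cleanup
  namespace: {{ .Release.Namespace }}
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-weight": "-10"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  activeDeadlineSeconds: {{ $cleanup.timeoutSeconds | default 600 }}
  template:
    spec:
      serviceAccountName: gpu-operator
      restartPolicy: Never
      containers:
        - name: cleanup
          image: {{ $cleanup.image | default "bitnami/kubectl:1.30" }}
          command:
            - /bin/sh
            - -c
            - |
              # Delete CR instances while the controller is still alive,
              # so finalizers resolve normally.
              if ! kubectl delete clusterpolicy --all --wait=true \
                   --timeout={{ $cleanup.timeout | default "5m" }}; then
                # Last resort: strip finalizers so uninstall can proceed.
                kubectl get clusterpolicy -o name | while read -r cr; do
                  kubectl patch "$cr" --type=merge \
                    -p '{"metadata":{"finalizers":null}}'
                done
              fi
{{- end }}
```

Because the hook is pre-delete, helm runs this Job to completion before removing the controller release contents, which is exactly the ordering the pre-flight scan in undeploy.sh.tmpl cannot achieve today.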

For components without a controller SA (e.g., skyhook-customizations), readiness.yaml / cleanup.yaml ships a minimal inline SA + ClusterRole + ClusterRoleBinding scoped to the CR group. Decision is per-component, not global.
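For the inline-RBAC case, a minimal sketch might look like the following (all names and the CR group are illustrative assumptions, not taken from skyhook-customizations; the lower hook weight makes helm create the RBAC before the weight -10 cleanup Job):

```yaml
# Illustrative inline RBAC for a component with no controller SA.
# The skyhook.nvidia.com group is an assumption for the example.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: skyhook-cleanup
  namespace: {{ .Release.Namespace }}
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-weight": "-20"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: skyhook-cleanup
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-weight": "-20"
rules:
  - apiGroups: ["skyhook.nvidia.com"]   # scoped to the CR group only
    resources: ["*"]
    verbs: ["get", "list", "delete", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: skyhook-cleanup
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-weight": "-20"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: skyhook-cleanup
subjects:
  - kind: ServiceAccount
    name: skyhook-cleanup
    namespace: {{ .Release.Namespace }}
```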

Deployer coverage

| Deployer | Readiness mechanism | Cleanup mechanism |
|---|---|---|
| helm | helm.sh/hook: post-install (helm --wait blocks on the Job) | helm.sh/hook: pre-delete (helm waits for the Job before removing the controller) |
| argocdhelm | Same as helm: the outer chart's helm lifecycle drives the inner hooks | Same: the helm pre-delete hook fires during the app lifecycle |
| argocd | argocd.argoproj.io/hook: PostSync + sync-wave | No bridge Job needed. ArgoCD's existing automated.prune: true (already set in application.yaml.tmpl), sync-wave ordering on CR resources (higher wave than the controller), and resources-finalizer.argocd.argoproj.io/foreground on the Application handle it natively: CRs prune first, the controller prunes last, finalizers resolve cleanly |

The argocd path uses its own native primitives: the cleanup.yaml Job is emitted only for the helm and argocdhelm deployers, while the argocd deployer instead emits sync-wave annotations on the relevant CR resources. Same outcome, two rendering paths.
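A sketch of that sync-wave ordering on the rendered resources (wave numbers illustrative; relies on ArgoCD pruning in reverse wave order, as the table above describes):

```yaml
# Controller Deployment: lower wave, so it syncs first and prunes last.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-operator
  annotations:
    argocd.argoproj.io/sync-wave: "0"
---
# CR instance: higher wave, so it syncs after the controller is up
# and is pruned first, while the controller can still run finalizers.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
  annotations:
    argocd.argoproj.io/sync-wave: "10"
```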

What drops out of undeploy.sh.tmpl

| Today | Under this pattern |
|---|---|
| extra_crds_for_release() hardcoded case | component's own cleanup.yaml |
| skip_preflight_for_release() hardcoded case | deleted |
| check_release_for_stuck_crds / check_crd_for_stuck_resources | deleted or thin post-flight observability |
| force_clear_namespace_finalizers at script level | bounded fallback inside each cleanup.yaml |
| delete_release_cluster_resources orphan loop | deleted |
| capture_kubectl_json + 5× ERROR blocks | deleted with pre-flight |

undeploy.sh.tmpl collapses to roughly:

for (( i = ${#releases[@]} - 1; i >= 0; i-- )); do
  helm uninstall "${releases[$i]}" -n "$ns" --timeout "${HELM_TIMEOUT}s"
done

deploy.sh needs no new flags — existing helm --wait already blocks on the readiness Job.

Hook injection for upstream charts

Helm hooks must live inside a chart's templates/ dir — no external override. For upstream charts AICR doesn't control:

  • Wrapper/envelope chart — natural fit for [Feature]: Generic Helm-based bundle format with vendored charts #516 (every component becomes a local chart)
  • Companion cleanup/readiness release — works with today's bundle format; two invocations per component in a specific order
  • Chart-provided extension points (extraManifests, extraDeploy) — free lunch when upstream offers them

readiness.yaml / cleanup.yaml land in whichever chart AICR authors by the same convention.
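As an illustration of the third option, a values overlay for a chart that exposes a Bitnami-style extraDeploy list could carry the bridge Job directly (the key name varies per upstream chart, and the readiness condition shown is a placeholder):

```yaml
# values overlay for an upstream chart with a Bitnami-style extraDeploy
# extension point; key name and wait condition are assumptions.
extraDeploy:
  - apiVersion: batch/v1
    kind: Job
    metadata:
      name: component-readiness
      annotations:
        "helm.sh/hook": post-install,post-upgrade
        "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: wait
              image: bitnami/kubectl:1.30
              command: ["kubectl", "wait", "--for=condition=Available",
                        "deployment", "--all", "--timeout=30m"]
```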

Open Questions

  1. kubectl image pinning — match cluster version, minimalist image, or AICR-shipped?
  2. Hook-injection default — wrapper (needs #516) vs companion release (works today). Probably ship companion first; migrate when #516 lands.
  3. Finalizer force-remove fallback — always-on (loud events) or opt-in? Leaning toward always-on to match today's force_clear_namespace_finalizers.
  4. Readiness beyond jsonpath — complex cases (DaemonSet rollout + CR status combined) may need a shell step inside the Job. YAGNI until needed.
  5. Sync-wave emission for argocd deployer — does AICR's argocd deployer need extension to set sync-waves on rendered CR resources, or can this come from the chart's own templates?

Acceptance Criteria

  • Bundler picks up readiness.yaml and cleanup.yaml by presence
  • Files render through the same pipeline as manifests/*.yaml
  • Output includes hook annotations appropriate to the target deployer (helm/argocdhelm: helm hooks; argocd: sync-wave + finalizers)
  • undeploy.sh.tmpl reduces to a minimal uninstall loop
  • deploy.sh needs no new flags
  • Tests cover: readiness-only, cleanup-only, both, neither; controller-SA reuse and inline-RBAC cases
  • Install + uninstall round-trip passes against Kind/KWOK for at least one recipe covering each lifecycle type

Non-Goals

Metadata

Labels

P2 (minor defects; minor implications, no SLA commitment), area/bundler, enhancement (new feature or request)
