
Bridge Jobs for deploy/undeploy lifecycle flow control #610

@lockwobr

Description


Summary

Introduce a convention-based pattern for deploy-time readiness and undeploy-time CR cleanup: each component can ship readiness.yaml and/or cleanup.yaml alongside its existing values.yaml and manifests/*.yaml, rendered through the same template pipeline. AICR generates manifests with the right annotations for each supported deployer; deploy tools execute them via their own native primitives.

Problem

Two related gaps currently live in generated bash:

Deploy-time readiness. deploy.sh prints Deployment complete. when installs finish, but for GPU bundles that's not "ready for workloads." Recent CUJ2 run: 7 min install → 26 min actually ready.

Undeploy finalizer race. undeploy.sh.tmpl has grown ~700 lines across 7 PRs (#253, #282, #364, #416, #477, #561, #602) chasing a race: today's pre-flight scans for stuck finalizers after helm has removed the controller — too late, nothing left to release them. Hence escalation to force_clear_namespace_finalizers.

A pre-uninstall bridge Job (or argocd sync-wave ordering — see below) runs while the controller is alive, so finalizers resolve normally and the race disappears.

Related: #602, #607, #516.

Proposed Convention

Extend the existing recipes/components/<name>/ layout:

| File | Purpose | Status |
|---|---|---|
| values.yaml, values-*.yaml | Helm values + overlays | existing |
| manifests/*.yaml | Raw manifests | existing |
| readiness.yaml | Post-install bridge Job | new |
| cleanup.yaml | Pre-uninstall bridge Job | new |

Bundler picks up the new files by presence, renders through the existing pkg/manifest template pipeline (full access to .Values, .Release.Namespace, overlay values), and annotates for each target deployer. Components that don't need a bridge simply don't ship the file.

Example — recipes/components/gpu-operator/readiness.yaml

Waits for ClusterPolicy.status.state == ready. Reuses the upstream chart's gpu-operator SA — no new RBAC.

{{- $gpuOp := index .Values "gpu-operator" | default dict }}
{{- $readiness := $gpuOp.readiness | default dict }}
{{- /* Enabled unless readiness.enabled is explicitly false; avoids the
       sprig `| default true` pitfall, which treats false as empty. */}}
{{- if ne (toString $readiness.enabled) "false" }}
---
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-operator-readiness
  namespace: {{ .Release.Namespace }}
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-weight": "20"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
    "argocd.argoproj.io/hook": PostSync
    "argocd.argoproj.io/sync-wave": "10"
spec:
  activeDeadlineSeconds: {{ $readiness.timeoutSeconds | default 1800 }}
  template:
    spec:
      serviceAccountName: gpu-operator
      restartPolicy: Never
      containers:
        - name: wait
          image: {{ $readiness.image | default "bitnami/kubectl:1.30" }}
          # Block sequence: --for=jsonpath={...} contains flow indicators
          # ({, }, ,) that are invalid in an unquoted flow-style list.
          command:
            - kubectl
            - wait
            - clusterpolicy
            - --all
            - --for=jsonpath={.status.state}=ready
            - --timeout={{ $readiness.timeout | default "30m" }}
{{- end }}

cleanup.yaml follows the same shape with helm.sh/hook: pre-delete, deletes CR instances via kubectl delete --wait=true, and includes a last-resort force-remove fallback for stuck finalizers.
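A cleanup.yaml sketch along those lines, again for the gpu-operator component (value keys and the fallback loop are illustrative, not taken from an existing component):

```yaml
{{- $gpuOp := index .Values "gpu-operator" | default dict }}
{{- $cleanup := $gpuOp.cleanup | default dict }}
{{- if ne (toString $cleanup.enabled) "false" }}
---
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-operator-cleanup
  namespace: {{ .Release.Namespace }}
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-weight": "-10"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  activeDeadlineSeconds: {{ $cleanup.timeoutSeconds | default 600 }}
  template:
    spec:
      serviceAccountName: gpu-operator
      restartPolicy: Never
      containers:
        - name: cleanup
          image: {{ $cleanup.image | default "bitnami/kubectl:1.30" }}
          command:
            - /bin/sh
            - -c
            - |
              # Delete CR instances while the controller is still alive,
              # so finalizers resolve normally.
              if ! kubectl delete clusterpolicy --all --wait=true \
                   --timeout={{ $cleanup.timeout | default "5m" }}; then
                # Last resort: strip finalizers so uninstall can proceed.
                kubectl get clusterpolicy -o name | while read -r cr; do
                  kubectl patch "$cr" --type=merge \
                    -p '{"metadata":{"finalizers":null}}'
                done
              fi
{{- end }}
```

Because the hook is pre-delete, helm runs this Job to completion before removing the controller release contents, which is exactly the ordering the pre-flight scan in undeploy.sh.tmpl cannot achieve today.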

For components without a controller SA (e.g., skyhook-customizations), readiness.yaml / cleanup.yaml ships a minimal inline SA + ClusterRole + ClusterRoleBinding scoped to the CR group. Decision is per-component, not global.
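For the inline-RBAC case, a minimal sketch might look like the following (all names and the CR group are illustrative assumptions, not taken from skyhook-customizations; the lower hook weight makes helm create the RBAC before the weight -10 cleanup Job):

```yaml
# Illustrative inline RBAC for a component with no controller SA.
# The skyhook.nvidia.com group is an assumption for the example.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: skyhook-cleanup
  namespace: {{ .Release.Namespace }}
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-weight": "-20"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: skyhook-cleanup
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-weight": "-20"
rules:
  - apiGroups: ["skyhook.nvidia.com"]   # scoped to the CR group only
    resources: ["*"]
    verbs: ["get", "list", "delete", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: skyhook-cleanup
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-weight": "-20"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: skyhook-cleanup
subjects:
  - kind: ServiceAccount
    name: skyhook-cleanup
    namespace: {{ .Release.Namespace }}
```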

Deployer coverage

| Deployer | Readiness mechanism | Cleanup mechanism |
|---|---|---|
| helm | helm.sh/hook: post-install (helm --wait blocks on the Job) | helm.sh/hook: pre-delete (helm waits for the Job before removing the controller) |
| argocdhelm | Same as helm: the outer chart's helm lifecycle drives the inner hooks | Same: the helm pre-delete hook fires during the app lifecycle |
| argocd | argocd.argoproj.io/hook: PostSync + sync-wave | No bridge Job needed. ArgoCD's existing automated.prune: true (already set in application.yaml.tmpl), sync-wave ordering on CR resources (higher wave than the controller), and resources-finalizer.argocd.argoproj.io/foreground on the Application handle it natively: CRs prune first, the controller prunes last, finalizers resolve cleanly |

The argocd path uses its own native primitives: the cleanup.yaml Job is emitted only for the helm and argocdhelm deployers, while the argocd deployer instead emits sync-wave annotations on the relevant CR resources. Same outcome, two rendering paths.
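A sketch of that sync-wave ordering on the rendered resources (wave numbers illustrative; relies on ArgoCD pruning in reverse wave order, as the table above describes):

```yaml
# Controller Deployment: lower wave, so it syncs first and prunes last.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-operator
  annotations:
    argocd.argoproj.io/sync-wave: "0"
---
# CR instance: higher wave, so it syncs after the controller is up
# and is pruned first, while the controller can still run finalizers.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
  annotations:
    argocd.argoproj.io/sync-wave: "10"
```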

What drops out of undeploy.sh.tmpl

| Today | Under this pattern |
|---|---|
| extra_crds_for_release() hardcoded case | component's own cleanup.yaml |
| skip_preflight_for_release() hardcoded case | deleted |
| check_release_for_stuck_crds / check_crd_for_stuck_resources | deleted or thin post-flight observability |
| force_clear_namespace_finalizers at script level | bounded fallback inside each cleanup.yaml |
| delete_release_cluster_resources orphan loop | deleted |
| capture_kubectl_json + 5× ERROR blocks | deleted with pre-flight |

undeploy.sh.tmpl collapses to roughly:

for (( i = ${#releases[@]} - 1; i >= 0; i-- )); do
  helm uninstall "${releases[$i]}" -n "$ns" --timeout "${HELM_TIMEOUT}s"
done

deploy.sh needs no new flags — existing helm --wait already blocks on the readiness Job.

Hook injection for upstream charts

Helm hooks must live inside a chart's templates/ dir — no external override. For upstream charts AICR doesn't control:

  • Wrapper/envelope chart — natural fit for [Feature]: Generic Helm-based bundle format with vendored charts #516 (every component becomes a local chart)
  • Companion cleanup/readiness release — works with today's bundle format; two invocations per component in a specific order
  • Chart-provided extension points (extraManifests, extraDeploy) — free lunch when upstream offers them

readiness.yaml / cleanup.yaml land in whichever chart AICR authors by the same convention.
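As an illustration of the third option, a values overlay for a chart that exposes a Bitnami-style extraDeploy list could carry the bridge Job directly (the key name varies per upstream chart, and the readiness condition shown is a placeholder):

```yaml
# values overlay for an upstream chart with a Bitnami-style extraDeploy
# extension point; key name and wait condition are assumptions.
extraDeploy:
  - apiVersion: batch/v1
    kind: Job
    metadata:
      name: component-readiness
      annotations:
        "helm.sh/hook": post-install,post-upgrade
        "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: wait
              image: bitnami/kubectl:1.30
              command: ["kubectl", "wait", "--for=condition=Available",
                        "deployment", "--all", "--timeout=30m"]
```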

Open Questions

  1. kubectl image pinning — match cluster version, minimalist image, or AICR-shipped?
  2. Hook-injection default — wrapper (needs #516) vs companion release (works today). Probably ship companion first; migrate when #516 lands.
  3. Finalizer force-remove fallback — always-on (loud events) or opt-in? Leaning toward always-on to match today's force_clear_namespace_finalizers.
  4. Readiness beyond jsonpath — complex cases (DaemonSet rollout + CR status combined) may need a shell step inside the Job. YAGNI until needed.
  5. Sync-wave emission for argocd deployer — does AICR's argocd deployer need extension to set sync-waves on rendered CR resources, or can this come from the chart's own templates?

Acceptance Criteria

  • Bundler picks up readiness.yaml and cleanup.yaml by presence
  • Files render through the same pipeline as manifests/*.yaml
  • Output includes hook annotations appropriate to the target deployer (helm/argocdhelm: helm hooks; argocd: sync-wave + finalizers)
  • undeploy.sh.tmpl reduces to a minimal uninstall loop
  • deploy.sh needs no new flags
  • Tests cover: readiness-only, cleanup-only, both, neither; controller-SA reuse and inline-RBAC cases
  • Install + uninstall round-trip passes against Kind/KWOK for at least one recipe covering each lifecycle type

Non-Goals

Metadata

Labels

P2 (minor defects; minor implications, no SLA commitment), area/bundler, enhancement (new feature or request)
