feat(gpu-operator-post): add Helm post-install hook to re-roll DRA kubelet plugin across all deployers

## Summary

Add a Helm `post-install,post-upgrade` hook to the `gpu-operator-post` chart that waits for `ClusterPolicy.status.state == ready` and then re-rolls the `nvidia-dra-driver-gpu` kubelet-plugin DaemonSet. This extends the same protection PR #965 added to the Helm-deployer `deploy.sh` flow to **every deployer** that consumes the rendered Helm releases — Helm direct, helmfile, Flux, Argo CD, argocd-helm.

## Motivation

PR #965 closed two regressions introduced by the `gpu-operator v25.10.1 → v26.3.1` bump:

1. **Stale NVML after driver migration.** The `nvidia-dra-driver-gpu` kubelet-plugin's NVML handle goes stale when gpu-operator's `k8s-driver-manager` reloads kernel modules. Mitigated by a `aicr.nvidia.com/gpu-operator-chart-version` annotation on the DRA pod template that forces a re-roll on chart bumps (every deployer renders the template, so every deployer re-rolls).
2. **Timing race within a deployer run.** Even after the annotation fires correctly, `k8s-driver-manager` runs the per-node module reload **asynchronously** after `helm upgrade gpu-operator` returns. On multi-GPU-node clusters, the DRA pod can re-roll *before* the slowest node's migration finishes — leaving the freshly-rolled pod pinned to the pre-migration driver state, which produces:
   ```
   invalid CDI Spec: failed add device "<uuid>-gpu-N": empty device edits
   ```
   Reproduced on a GB200 EKS cluster (`yljtrxpmzu`, p6e-gb200.36xlarge × 2 nodes) during PR #965 validation. PR #965 closes this with a `kubectl wait` + `kubectl rollout restart` block in `pkg/bundler/deployer/helm/templates/deploy.sh.tmpl` — but **only the default Helm deployer runs `deploy.sh`**. Users who consume the same bundle via `helmfile apply`, `flux reconcile`, or Argo CD's sync get the rendered manifests but skip the post-deploy script. They still hit the race.

This issue tracks the **deployer-agnostic** durable fix.

## Proposed mechanism: Helm post-install hook on `gpu-operator-post`

Add a Job manifest to `recipes/components/gpu-operator-post/templates/` (or wherever the chart templates live) annotated as a Helm hook:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: dra-kubelet-plugin-rollout-after-driver-migration
  namespace: gpu-operator
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-weight": "10"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  ttlSecondsAfterFinished: 600
  template:
    spec:
      serviceAccountName: dra-kubelet-plugin-rollout
      restartPolicy: OnFailure
      containers:
        - name: rollout
          image: <kubectl-capable image, e.g. bitnami/kubectl:1.34 or AICR's own>
          command:
            - /bin/sh
            - -c
            - |
              set -e
              echo "Waiting for ClusterPolicy.status.state=ready..."
              kubectl wait --for=jsonpath='{.status.state}=ready' \
                clusterpolicy/cluster-policy --timeout=20m
              echo "Restarting nvidia-dra-driver-gpu kubelet plugin..."
              kubectl rollout restart daemonset \
                -n nvidia-dra-driver nvidia-dra-driver-gpu-kubelet-plugin
              kubectl rollout status daemonset \
                -n nvidia-dra-driver nvidia-dra-driver-gpu-kubelet-plugin --timeout=5m
```

The Job runs for **every deployer** that delivers Helm releases. Helm/helmfile/Flux/Argo CD all execute Helm-defined hooks as part of the release lifecycle.

## Coordination with PR #965's existing mechanisms

This hook makes the existing `deploy.sh` `kubectl wait + rollout restart` block (in `pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`) **redundant** for Helm-deployer users — the hook runs at the same point in the release lifecycle and does the same work, but inside the cluster instead of in deploy.sh. Two paths:

1. **Keep both** (defense in depth). The hook is the cross-deployer protection; deploy.sh's block remains as a Helm-deployer-side safety net. Idempotent — extra rollout-restart on an already-rolled DaemonSet is a no-op.
2. **Remove deploy.sh's block when the hook ships**. Simpler. Hook becomes single point of truth.

Lean toward (1) initially and tighten to (2) in a follow-up once the hook has shipped in a release and proven reliable.

This hook is **orthogonal** to the chart-version annotation mechanism on `recipes/components/nvidia-dra-driver-gpu/values.yaml` (also in PR #965, durability tracked in #973). The annotation triggers the re-roll **during** the `helm upgrade nvidia-dra-driver-gpu` call; the hook triggers the re-roll **after** the full `gpu-operator` migration settles. They protect against different timing edges:

| Protection layer | Triggers on | Fixes |
|---|---|---|
| Pod-template annotation | every chart-version bump | DRA pod never re-rolling at all (the original aicr2 bug) |
| `gpu-operator-post` Helm hook (this issue) | every install/upgrade of `gpu-operator-post` | DRA pod re-rolling *before* the per-node driver migration finishes |
| `deploy.sh` `kubectl wait + restart` (PR #965) | every `./deploy.sh` invocation | Same as the hook, but only for the default Helm deployer |

## Considerations

- **RBAC**. The hook Job needs a ServiceAccount with:
  - `get`/`list`/`watch` on `clusterpolicies.nvidia.com` (gpu-operator's CR)
  - `get`/`patch` on `daemonsets.apps` in the `nvidia-dra-driver` namespace
  - `get`/`list`/`watch` on Pods in `nvidia-dra-driver` for `rollout status`
  Cross-namespace RBAC: the SA lives in `gpu-operator`, accesses `nvidia-dra-driver`. Use a Role+RoleBinding in `nvidia-dra-driver` granting the gpu-operator-namespaced SA. ClusterRole+ClusterRoleBinding for the ClusterPolicy CR.

- **Idempotency**. If the hook is interrupted mid-flight (cluster restart, pod evicted), `helm.sh/hook-delete-policy: before-hook-creation` ensures the next install/upgrade re-creates the Job cleanly. An extra rollout-restart on an already-rolled DRA pod is a no-op (helm's pod-template-hash already matches the desired state).

- **Image choice**. AICR's existing Helm-hook Jobs (if any) should be reused for consistency. If not, `bitnami/kubectl:1.34` is the standard non-AICR option. If preferred, AICR could ship its own `kubectl`-equipped image; the existing `gpu-operator-post` chart already runs some kubectl actions, worth seeing what container image it uses.

- **Timeout / failure semantics.** 20-min wait on ClusterPolicy ready is conservative — `k8s-driver-manager` on a 16-node cluster with `maxParallelUpgrades=5` could realistically take 12-15 min worst case (4 sequential rotations × 3-min per node). If the wait times out, Job's `restartPolicy: OnFailure` will retry; if it persists, Helm marks the release `failed` and the user investigates. Don't fail silently.

- **What about no-DRA-driver clusters?** If a cluster doesn't deploy `nvidia-dra-driver-gpu` (e.g. a minimal deployment), the hook's `kubectl rollout restart` would fail. Two options:
  - Gate the hook on a values flag (`postInstallHooks.dra.enabled`, default `true`) — explicit opt-out
  - `kubectl rollout restart` with `|| true` after a `get daemonset` check — soft-fail
  Pick one in implementation; the latter is more ergonomic, the former more explicit.

## Acceptance criteria

- [ ] Hook Job manifest added to the `gpu-operator-post` chart, gated as `helm.sh/hook: post-install,post-upgrade`.
- [ ] RBAC (ServiceAccount, Role/RoleBinding in nvidia-dra-driver, ClusterRole/ClusterRoleBinding for ClusterPolicy) ships with the same chart.
- [ ] **Generated-artifact parity test**: render the bundle with the Helm deployer AND a GitOps deployer (helmfile or flux or argocd); assert both rendered outputs include the hook Job manifest with the correct annotations. Catches the cross-deployer parity that motivated this issue.
- [ ] **Cluster verification**: on a cluster with an existing `gpu-operator v25.10.1` deployment, applying the new bundle via a GitOps deployer (NOT `./deploy.sh`) results in the DRA kubelet-plugin pods being re-rolled after the per-node driver migration completes, and `aicr validate --phase performance` passes without manual intervention.
- [ ] Once shipped + verified, follow-up to evaluate removing the redundant `kubectl wait + rollout restart` block from `pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`.

## Out of scope

- Bundler-derived annotation (the v4 design in #973) — orthogonal to this hook. Both can coexist; both can be implemented independently.
- Upstream patch to `nvidia-dra-driver-gpu` to detect NVML staleness in-process — that's the truly correct long-term fix, but lives upstream and depends on NVIDIA's release cadence. Worth filing as an enhancement request to `NVIDIA/k8s-dra-driver-gpu` separately.

## Related

- #894 — gpu-operator chart bump that first surfaced this class of bug
- #965 — PR that ships the v26.3.1 bump and (a) the chart-version annotation, (b) the Helm-deployer-side `kubectl wait + rollout restart`. This issue's hook extends (b) to all deployers.
- #973 — separate durability concern about the chart-version annotation's manual maintenance (orthogonal to this issue, may merge into the same chart change at implementation time)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(gpu-operator-post): add Helm post-install hook to re-roll DRA kubelet plugin across all deployers #980

Summary

Motivation

Proposed mechanism: Helm post-install hook on `gpu-operator-post`

Coordination with PR #965's existing mechanisms

Considerations

Acceptance criteria

Out of scope

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Protection layer	Triggers on	Fixes
Pod-template annotation	every chart-version bump	DRA pod never re-rolling at all (the original aicr2 bug)
`gpu-operator-post` Helm hook (this issue)	every install/upgrade of `gpu-operator-post`	DRA pod re-rolling before the per-node driver migration finishes
`deploy.sh` `kubectl wait + restart` (PR #965)	every `./deploy.sh` invocation	Same as the hook, but only for the default Helm deployer

feat(gpu-operator-post): add Helm post-install hook to re-roll DRA kubelet plugin across all deployers #980

Description

Summary

Motivation

Proposed mechanism: Helm post-install hook on gpu-operator-post

Coordination with PR #965's existing mechanisms

Considerations

Acceptance criteria

Out of scope

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Proposed mechanism: Helm post-install hook on `gpu-operator-post`