Summary
Add a Helm post-install,post-upgrade hook to the gpu-operator-post chart that waits for ClusterPolicy.status.state == ready and then re-rolls the nvidia-dra-driver-gpu kubelet-plugin DaemonSet. This extends the protection that PR #965 added to the Helm deployer's deploy.sh flow to every deployer that consumes the rendered Helm releases: Helm direct, helmfile, Flux, Argo CD, argocd-helm.
Motivation
PR #965 closed two regressions introduced by the gpu-operator v25.10.1 → v26.3.1 bump:
Stale NVML after driver migration. The nvidia-dra-driver-gpu kubelet-plugin's NVML handle goes stale when gpu-operator's k8s-driver-manager reloads kernel modules. Mitigated by an aicr.nvidia.com/gpu-operator-chart-version annotation on the DRA pod template that forces a re-roll on chart bumps (every deployer renders the template, so every deployer re-rolls).
Timing race within a deployer run. Even after the annotation fires correctly, k8s-driver-manager runs the per-node module reload asynchronously after helm upgrade gpu-operator returns. On multi-GPU-node clusters, the DRA pod can re-roll before the slowest node's migration finishes — leaving the freshly-rolled pod pinned to the pre-migration driver state, which produces:
Observed (yljtrxpmzu, p6e-gb200.36xlarge × 2 nodes) during validation of PR #965 (chore(recipes): bump gpu-operator chart to v26.3.1 and driver to 580.126.20). PR #965 closes this with a kubectl wait + kubectl rollout restart block in pkg/bundler/deployer/helm/templates/deploy.sh.tmpl, but only the default Helm deployer runs deploy.sh. Users who consume the same bundle via helmfile apply, flux reconcile, or Argo CD's sync get the rendered manifests but skip the post-deploy script. They still hit the race.
This issue tracks the deployer-agnostic durable fix.
Proposed mechanism: Helm post-install hook on gpu-operator-post
Add a Job manifest to recipes/components/gpu-operator-post/templates/ (or wherever the chart templates live), annotated as a Helm hook.
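The Job runs for every deployer that delivers Helm releases. Helm, helmfile, and Flux execute Helm-defined hooks as part of the release lifecycle; Argo CD renders the chart itself and maps the Helm hook annotations onto its own sync hooks (post-install and post-upgrade both become PostSync).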
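A minimal sketch of what such a hook Job could look like. The Job name, ServiceAccount name, image, ClusterPolicy name (cluster-policy), DaemonSet name, hook weight, and timeouts are all assumptions for illustration, not values taken from the chart:

```yaml
# Hypothetical template in the gpu-operator-post chart.
# Every name and number below is an illustrative assumption.
apiVersion: batch/v1
kind: Job
metadata:
  name: dra-reroll-after-clusterpolicy-ready   # hypothetical name
  namespace: gpu-operator
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-weight": "10"
    # Delete any previous hook Job before creating a new one.
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  backoffLimit: 2
  template:
    spec:
      serviceAccountName: dra-reroll-hook      # hypothetical; needs the RBAC listed under Considerations
      restartPolicy: OnFailure
      containers:
        - name: reroll
          image: bitnami/kubectl:1.34          # assumes the image provides /bin/sh
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -eu
              # Block until gpu-operator reports ClusterPolicy as ready.
              # "cluster-policy" is assumed to be the ClusterPolicy name.
              kubectl wait clusterpolicy/cluster-policy \
                --for=jsonpath='{.status.state}'=ready --timeout=20m
              # Re-roll the DRA kubelet-plugin pods (DaemonSet name is an assumption).
              kubectl -n nvidia-dra-driver rollout restart \
                daemonset/nvidia-dra-driver-gpu-kubelet-plugin
              # Fail the hook if the new pods never become ready.
              kubectl -n nvidia-dra-driver rollout status \
                daemonset/nvidia-dra-driver-gpu-kubelet-plugin --timeout=10m
```

With set -eu, a timed-out wait or a failed restart exits non-zero, so the Job fails visibly instead of silently skipping the re-roll.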
Coordination with PR #965's existing mechanisms
This hook makes the existing deploy.sh kubectl wait + rollout restart block (in pkg/bundler/deployer/helm/templates/deploy.sh.tmpl) redundant for Helm-deployer users: the hook runs at the same point in the release lifecycle and does the same work, but inside the cluster instead of in deploy.sh. Two paths:
1. Keep both (defense in depth). The hook is the cross-deployer protection; deploy.sh's block remains as a Helm-deployer-side safety net. Safe to repeat: an extra rollout restart on an already-rolled DaemonSet only rolls the pods one more time and converges on the same state.
2. Remove deploy.sh's block when the hook ships. Simpler. The hook becomes the single point of truth.
Lean toward (1) initially and tighten to (2) in a follow-up once the hook has shipped in a release and proven reliable.
This hook is orthogonal to the chart-version annotation mechanism on recipes/components/nvidia-dra-driver-gpu/values.yaml (also in PR #965, durability tracked in #973). The annotation triggers the re-roll during the helm upgrade nvidia-dra-driver-gpu call; the hook triggers the re-roll after the full gpu-operator migration settles. They protect against different timing edges:
| Protection layer | Triggers on | Fixes |
|---|---|---|
| Pod-template annotation | every chart-version bump | DRA pod never re-rolling at all (the original aicr2 bug) |
| gpu-operator-post Helm hook (this issue) | every install/upgrade of gpu-operator-post | DRA pod re-rolling before the per-node driver migration finishes |
| deploy.sh kubectl wait + restart (PR #965) | every ./deploy.sh invocation | Same as the hook, but only for the default Helm deployer |
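For reference, the pod-template annotation layer could look roughly like the values fragment below. The kubeletPlugin.podAnnotations key is an assumption about the nvidia-dra-driver-gpu chart's values schema, and the version string is only an example:

```yaml
# recipes/components/nvidia-dra-driver-gpu/values.yaml (sketch)
kubeletPlugin:
  podAnnotations:   # assumed values key; check the chart's values schema
    # Changing this value changes the DaemonSet pod template,
    # which makes Kubernetes roll the kubelet-plugin pods.
    aicr.nvidia.com/gpu-operator-chart-version: "v26.3.1"
```

Because the annotation only changes when the chart version changes, it fires during the helm upgrade of nvidia-dra-driver-gpu and knows nothing about whether the per-node driver migration has finished. That is the gap the hook covers.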
Considerations
RBAC. The hook Job needs a ServiceAccount with:
- get/list/watch on clusterpolicies.nvidia.com (gpu-operator's CR)
- get/patch on daemonsets.apps in the nvidia-dra-driver namespace
- get/list/watch on Pods in nvidia-dra-driver for rollout status
Cross-namespace RBAC: the SA lives in gpu-operator but accesses nvidia-dra-driver. Use a Role + RoleBinding in nvidia-dra-driver whose subject is the SA from the gpu-operator namespace, plus a ClusterRole + ClusterRoleBinding for the cluster-scoped ClusterPolicy CR.
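The RBAC layout above could be sketched as follows. All object names are hypothetical. The list/watch verbs on daemonsets go slightly beyond the list above; they are added on the assumption that kubectl rollout status watches the DaemonSet object itself:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dra-reroll-hook                 # hypothetical name
  namespace: gpu-operator
---
# Cluster-scoped read access to the ClusterPolicy CR.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dra-reroll-hook-clusterpolicy
rules:
  - apiGroups: ["nvidia.com"]
    resources: ["clusterpolicies"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dra-reroll-hook-clusterpolicy
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dra-reroll-hook-clusterpolicy
subjects:
  - kind: ServiceAccount
    name: dra-reroll-hook
    namespace: gpu-operator
---
# Namespaced access in nvidia-dra-driver for the restart and status check.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dra-reroll-hook
  namespace: nvidia-dra-driver
rules:
  - apiGroups: ["apps"]
    resources: ["daemonsets"]
    verbs: ["get", "list", "watch", "patch"]   # list/watch are an assumption, see above
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
# The binding lives in nvidia-dra-driver but points at the SA in gpu-operator.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dra-reroll-hook
  namespace: nvidia-dra-driver
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: dra-reroll-hook
subjects:
  - kind: ServiceAccount
    name: dra-reroll-hook
    namespace: gpu-operator
```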
Idempotency. If the hook is interrupted mid-flight (cluster restart, pod evicted), helm.sh/hook-delete-policy: before-hook-creation ensures the next install/upgrade re-creates the Job cleanly. An extra rollout restart on an already-rolled DRA DaemonSet is harmless but not a strict no-op: kubectl rollout restart stamps a fresh kubectl.kubernetes.io/restartedAt annotation on the pod template, so the pods roll once more and converge on the same desired state.
Image choice. If AICR's existing Helm-hook Jobs (if any) already use a kubectl-capable image, reuse it for consistency. If not, bitnami/kubectl:1.34 is a common non-AICR option. Alternatively, AICR could ship its own kubectl-equipped image; the existing gpu-operator-post chart already runs some kubectl actions, so it is worth checking which container image it uses.
Timeout / failure semantics. A 20-min wait on ClusterPolicy ready is conservative: k8s-driver-manager on a 16-node cluster with maxParallelUpgrades=5 could realistically take 12-15 min worst case (ceil(16/5) = 4 sequential rotations × 3 min per rotation, about 12 min). If the wait times out, the Job's restartPolicy: OnFailure retries it; if the failure persists until the Job's retries are exhausted, Helm marks the release failed and the user investigates. Don't fail silently.
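One way to make these semantics explicit is to bound the Job itself. The fragment below is a sketch with illustrative numbers, not tuned values. Note also that helm install/upgrade has its own --timeout (default 5m0s) that covers waiting on hook Jobs, so a 20-minute wait inside the hook presumably needs a correspondingly larger release timeout in each deployer:

```yaml
# Fragment of the hook Job spec; all numbers are illustrative assumptions.
spec:
  # Budget: ceil(16 nodes / 5 in parallel) = 4 rotations x ~3 min = ~12 min,
  # so a 20 min wait leaves roughly 8 min of headroom.
  backoffLimit: 2               # failed attempts tolerated before the Job is marked Failed
  activeDeadlineSeconds: 2400   # hard cap of 40 min across all attempts
  template:
    spec:
      restartPolicy: OnFailure  # retry a failed attempt rather than giving up at once
```

Once backoffLimit or activeDeadlineSeconds is hit, the Job is marked Failed, which is what lets Helm surface the release as failed instead of hanging.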
What about no-DRA-driver clusters? If a cluster doesn't deploy nvidia-dra-driver-gpu (e.g. a minimal deployment), the hook's kubectl rollout restart would fail. Two options:
- Gate the hook on a values flag (postInstallHooks.dra.enabled, default true): explicit opt-out
- kubectl rollout restart with || true after a get daemonset check: soft-fail
Pick one in implementation; the latter is more ergonomic, the former more explicit.
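Both options could be sketched roughly as below. The postInstallHooks.dra.enabled flag comes from the proposal above; the DaemonSet name is an assumption. The soft-fail variant uses an if-guard rather than a bare || true, so that a failing restart on an existing DaemonSet is still reported:

```yaml
# Option 1, values.yaml: explicit opt-out flag (default true).
postInstallHooks:
  dra:
    enabled: true
---
# Option 1, hook template: render the Job only when the flag is set.
{{- if .Values.postInstallHooks.dra.enabled }}
# ... hook Job manifest goes here ...
{{- end }}
---
# Option 2, fragment of the Job's container: skip the re-roll when the
# DaemonSet is absent. The DaemonSet name is an assumption.
args:
  - |
    set -eu
    DS=daemonset/nvidia-dra-driver-gpu-kubelet-plugin
    if kubectl -n nvidia-dra-driver get "$DS" >/dev/null 2>&1; then
      kubectl -n nvidia-dra-driver rollout restart "$DS"
    else
      echo "DRA DaemonSet not found; skipping re-roll"
    fi
```

One caveat with the soft-fail path: a get that fails for another reason (for example, missing RBAC) is also treated as "not deployed", which is part of why the flag is the more explicit choice.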
Acceptance criteria
- Hook Job manifest added to the gpu-operator-post chart, annotated with helm.sh/hook: post-install,post-upgrade.
- RBAC (ServiceAccount, Role/RoleBinding in nvidia-dra-driver, ClusterRole/ClusterRoleBinding for ClusterPolicy) ships with the same chart.
- Generated-artifact parity test: render the bundle with the Helm deployer AND a GitOps deployer (helmfile, flux, or argocd); assert both rendered outputs include the hook Job manifest with the correct annotations. This catches the cross-deployer parity gap that motivated this issue.
- Cluster verification: on a cluster with an existing gpu-operator v25.10.1 deployment, applying the new bundle via a GitOps deployer (NOT ./deploy.sh) results in the DRA kubelet-plugin pods being re-rolled after the per-node driver migration completes, and aicr validate --phase performance passes without manual intervention.
- Once shipped and verified, a follow-up evaluates removing the redundant kubectl wait + rollout restart block from pkg/bundler/deployer/helm/templates/deploy.sh.tmpl.
Out of scope
- Upstream patch to nvidia-dra-driver-gpu to detect NVML staleness in-process. That is the truly correct long-term fix, but it lives upstream and depends on NVIDIA's release cadence. Worth filing separately as an enhancement request to NVIDIA/k8s-dra-driver-gpu.
Related
- … kubectl wait + rollout restart. This issue's hook extends (b) to all deployers.