Skip to content

feat(gpu-operator-post): add Helm post-install hook to re-roll DRA kubelet plugin across all deployers #980

@yuanchen8911

Description

@yuanchen8911

Summary

Add a Helm post-install,post-upgrade hook to the gpu-operator-post chart that waits for ClusterPolicy.status.state == ready and then re-rolls the nvidia-dra-driver-gpu kubelet-plugin DaemonSet. This extends the same protection PR #965 added to the Helm-deployer deploy.sh flow to every deployer that consumes the rendered Helm releases — Helm direct, helmfile, Flux, Argo CD, argocd-helm.

Motivation

PR #965 closed two regressions introduced by the gpu-operator v25.10.1 → v26.3.1 bump:

  1. Stale NVML after driver migration. The nvidia-dra-driver-gpu kubelet-plugin's NVML handle goes stale when gpu-operator's k8s-driver-manager reloads kernel modules. Mitigated by a aicr.nvidia.com/gpu-operator-chart-version annotation on the DRA pod template that forces a re-roll on chart bumps (every deployer renders the template, so every deployer re-rolls).
  2. Timing race within a deployer run. Even after the annotation fires correctly, k8s-driver-manager runs the per-node module reload asynchronously after helm upgrade gpu-operator returns. On multi-GPU-node clusters, the DRA pod can re-roll before the slowest node's migration finishes — leaving the freshly-rolled pod pinned to the pre-migration driver state, which produces:
    invalid CDI Spec: failed add device "<uuid>-gpu-N": empty device edits
    
    Reproduced on a GB200 EKS cluster (yljtrxpmzu, p6e-gb200.36xlarge × 2 nodes) during PR chore(recipes): bump gpu-operator chart to v26.3.1 and driver to 580.126.20 #965 validation. PR chore(recipes): bump gpu-operator chart to v26.3.1 and driver to 580.126.20 #965 closes this with a kubectl wait + kubectl rollout restart block in pkg/bundler/deployer/helm/templates/deploy.sh.tmpl — but only the default Helm deployer runs deploy.sh. Users who consume the same bundle via helmfile apply, flux reconcile, or Argo CD's sync get the rendered manifests but skip the post-deploy script. They still hit the race.

This issue tracks the deployer-agnostic durable fix.

Proposed mechanism: Helm post-install hook on gpu-operator-post

Add a Job manifest to recipes/components/gpu-operator-post/templates/ (or wherever the chart templates live) annotated as a Helm hook:

apiVersion: batch/v1
kind: Job
metadata:
  name: dra-kubelet-plugin-rollout-after-driver-migration
  namespace: gpu-operator
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-weight": "10"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  ttlSecondsAfterFinished: 600
  template:
    spec:
      serviceAccountName: dra-kubelet-plugin-rollout
      restartPolicy: OnFailure
      containers:
        - name: rollout
          image: <kubectl-capable image, e.g. bitnami/kubectl:1.34 or AICR's own>
          command:
            - /bin/sh
            - -c
            - |
              set -e
              echo "Waiting for ClusterPolicy.status.state=ready..."
              kubectl wait --for=jsonpath='{.status.state}=ready' \
                clusterpolicy/cluster-policy --timeout=20m
              echo "Restarting nvidia-dra-driver-gpu kubelet plugin..."
              kubectl rollout restart daemonset \
                -n nvidia-dra-driver nvidia-dra-driver-gpu-kubelet-plugin
              kubectl rollout status daemonset \
                -n nvidia-dra-driver nvidia-dra-driver-gpu-kubelet-plugin --timeout=5m

The Job runs for every deployer that delivers Helm releases. Helm/helmfile/Flux/Argo CD all execute Helm-defined hooks as part of the release lifecycle.

Coordination with PR #965's existing mechanisms

This hook makes the existing deploy.sh kubectl wait + rollout restart block (in pkg/bundler/deployer/helm/templates/deploy.sh.tmpl) redundant for Helm-deployer users — the hook runs at the same point in the release lifecycle and does the same work, but inside the cluster instead of in deploy.sh. Two paths:

  1. Keep both (defense in depth). The hook is the cross-deployer protection; deploy.sh's block remains as a Helm-deployer-side safety net. Idempotent — extra rollout-restart on an already-rolled DaemonSet is a no-op.
  2. Remove deploy.sh's block when the hook ships. Simpler. Hook becomes single point of truth.

Lean toward (1) initially and tighten to (2) in a follow-up once the hook has shipped in a release and proven reliable.

This hook is orthogonal to the chart-version annotation mechanism on recipes/components/nvidia-dra-driver-gpu/values.yaml (also in PR #965, durability tracked in #973). The annotation triggers the re-roll during the helm upgrade nvidia-dra-driver-gpu call; the hook triggers the re-roll after the full gpu-operator migration settles. They protect against different timing edges:

Protection layer Triggers on Fixes
Pod-template annotation every chart-version bump DRA pod never re-rolling at all (the original aicr2 bug)
gpu-operator-post Helm hook (this issue) every install/upgrade of gpu-operator-post DRA pod re-rolling before the per-node driver migration finishes
deploy.sh kubectl wait + restart (PR #965) every ./deploy.sh invocation Same as the hook, but only for the default Helm deployer

Considerations

  • RBAC. The hook Job needs a ServiceAccount with:

    • get/list/watch on clusterpolicies.nvidia.com (gpu-operator's CR)
    • get/patch on daemonsets.apps in the nvidia-dra-driver namespace
    • get/list/watch on Pods in nvidia-dra-driver for rollout status
      Cross-namespace RBAC: the SA lives in gpu-operator, accesses nvidia-dra-driver. Use a Role+RoleBinding in nvidia-dra-driver granting the gpu-operator-namespaced SA. ClusterRole+ClusterRoleBinding for the ClusterPolicy CR.
  • Idempotency. If the hook is interrupted mid-flight (cluster restart, pod evicted), helm.sh/hook-delete-policy: before-hook-creation ensures the next install/upgrade re-creates the Job cleanly. An extra rollout-restart on an already-rolled DRA pod is a no-op (helm's pod-template-hash already matches the desired state).

  • Image choice. AICR's existing Helm-hook Jobs (if any) should be reused for consistency. If not, bitnami/kubectl:1.34 is the standard non-AICR option. If preferred, AICR could ship its own kubectl-equipped image; the existing gpu-operator-post chart already runs some kubectl actions, worth seeing what container image it uses.

  • Timeout / failure semantics. 20-min wait on ClusterPolicy ready is conservative — k8s-driver-manager on a 16-node cluster with maxParallelUpgrades=5 could realistically take 12-15 min worst case (4 sequential rotations × 3-min per node). If the wait times out, Job's restartPolicy: OnFailure will retry; if it persists, Helm marks the release failed and the user investigates. Don't fail silently.

  • What about no-DRA-driver clusters? If a cluster doesn't deploy nvidia-dra-driver-gpu (e.g. a minimal deployment), the hook's kubectl rollout restart would fail. Two options:

    • Gate the hook on a values flag (postInstallHooks.dra.enabled, default true) — explicit opt-out
    • kubectl rollout restart with || true after a get daemonset check — soft-fail
      Pick one in implementation; the latter is more ergonomic, the former more explicit.

Acceptance criteria

  • Hook Job manifest added to the gpu-operator-post chart, gated as helm.sh/hook: post-install,post-upgrade.
  • RBAC (ServiceAccount, Role/RoleBinding in nvidia-dra-driver, ClusterRole/ClusterRoleBinding for ClusterPolicy) ships with the same chart.
  • Generated-artifact parity test: render the bundle with the Helm deployer AND a GitOps deployer (helmfile or flux or argocd); assert both rendered outputs include the hook Job manifest with the correct annotations. Catches the cross-deployer parity that motivated this issue.
  • Cluster verification: on a cluster with an existing gpu-operator v25.10.1 deployment, applying the new bundle via a GitOps deployer (NOT ./deploy.sh) results in the DRA kubelet-plugin pods being re-rolled after the per-node driver migration completes, and aicr validate --phase performance passes without manual intervention.
  • Once shipped + verified, follow-up to evaluate removing the redundant kubectl wait + rollout restart block from pkg/bundler/deployer/helm/templates/deploy.sh.tmpl.

Out of scope

  • Bundler-derived annotation (the v4 design in chore(recipes): make DRA-rollout trigger durable (don't rely on manual annotation bump) #973) — orthogonal to this hook. Both can coexist; both can be implemented independently.
  • Upstream patch to nvidia-dra-driver-gpu to detect NVML staleness in-process — that's the truly correct long-term fix, but lives upstream and depends on NVIDIA's release cadence. Worth filing as an enhancement request to NVIDIA/k8s-dra-driver-gpu separately.

Related

Metadata

Metadata

Assignees

No fields configured for Enhancement.

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions