Skip to content

WIP: feat(gpu-operator-post): re-roll DRA kubelet plugin across all deployers (#980)#1030

Closed
yuanchen8911 wants to merge 1 commit into
NVIDIA:mainfrom
yuanchen8911:feat/980-dra-rollout-hook
Closed

WIP: feat(gpu-operator-post): re-roll DRA kubelet plugin across all deployers (#980)#1030
yuanchen8911 wants to merge 1 commit into
NVIDIA:mainfrom
yuanchen8911:feat/980-dra-rollout-hook

Conversation

@yuanchen8911
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 commented May 26, 2026

Summary

Add a gpu-operator-post Job that waits for the gpu-operator driver
migration to settle and then re-rolls the nvidia-dra-driver-gpu
kubelet plugin DaemonSet. Ships via the bundler-synthesized
gpu-operator-post chart so the same protection that
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl provides for the
default Helm deployer (./deploy.sh) now applies across every deployer
the bundler emits — Helm, helmfile, Flux, Argo CD, argocd-helm,
including their --vendor-charts variants.

Motivation / Context

k8s-driver-manager reloads NVIDIA kernel modules asynchronously
after helm upgrade gpu-operator returns. If the DRA kubelet plugin
re-rolls before the slowest node finishes, the freshly-rolled pod pins
its NVML handle to the pre-migration driver state and CDI spec
generation fails with invalid CDI Spec: empty device edits, leaving
GPU workloads stuck in ContainerCreating. PR #965 closed this for
./deploy.sh users, but helmfile, flux, argocd, and
argocd-helm consumers never execute that script and still hit the
race.

Fixes: N/A (WIP — flip to Closes: #980 before promoting to ready-for-review)
Related: #980, #965, #973, #894

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Recipe engine / data (pkg/recipe) — recipes/overlays/base.yaml wiring
  • Bundlers (pkg/bundler, pkg/component/*) — new manifest packed into the synthesized gpu-operator-post chart
  • Docs/examples (docs/) — docs/user/container-images.md regenerated via make bom-docs; docs/user/component-catalog.md upgrade note updated
  • Other: recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml

Implementation Notes

Dual re-fire mechanism across bundler paths.
The bundler emits the same manifest through two paths with different
hook handling, so the Job carries explicit annotations sized to survive
both:

  1. Non-vendored localformat (default for Helm / helmfile /
    Argo CD / argocd-helm).
    pkg/bundler/deployer/localformat/local_helm.go:165 unconditionally
    calls stripHelmHooks, which removes sync-phase helm.sh/hook
    annotations because Argo CD's syncPolicy.automated would
    otherwise treat the resource as a PostSync hook that never fires
    under path-based sources. After stripping, the Job ships as a
    regular chart resource and re-fire on upgrade is achieved by keying
    the Job metadata.name on the parent gpu-operator chart version,
    which the bundler injects into the synthesized chart's values map
    at gpu-operator._aicrParentChartVersion (see
    DefaultBundler.injectDRAParentChartVersionValue in
    pkg/bundler/bundler.go). The values-map path is used in
    preference to .Chart.Version because (a) the synthesized chart's
    Chart.yaml hardcodes 0.1.0 so .Chart.Version would never
    change across gpu-operator bumps, and (b) under flux-oci,
    source-controller suffixes .Chart.Version with +<artifact-sha>
    — the + is rejected by K8s label-value validation. Every
    gpu-operator chart bump produces a uniquely-named Job manifest
    that every deployer applies.

  2. Vendored localformat (--vendor-charts).
    pkg/bundler/deployer/localformat/vendor_folder.go:279 folds mixed
    recipe-side manifests into the vendored wrapper and calls
    injectPostInstallHooks, which auto-tags every document that
    does NOT already declare helm.sh/hook with post-install
    only and an auto-assigned helm.sh/hook-weight starting at
    postInstallHookWeightBase = 100
    (pkg/bundler/deployer/localformat/vendor_wrapper.go:124,
    vendor_wrapper.go:223). Two problems for this manifest:

    • (a) auto-injection is post-install only, so the Job
      runs on first install and is silently skipped on every
      subsequent helm upgrade — exactly when the stale-NVML race
      occurs.
    • (b) if only the Job author-declares a hook, the
      ServiceAccount / ClusterRole / ClusterRoleBinding still get
      auto-injected at weights 100/101/102 while the Job's
      author-declared hook takes Helm's default weight 0. Helm
      orders hooks by ascending weight, so the Job would run
      before its ServiceAccount/RBAC exist and fail with
      ServiceAccount "aicr-dra-rollout-hook" not found.

    Both are addressed by author-declaring
    helm.sh/hook: post-install,post-upgrade plus
    helm.sh/hook-delete-policy: before-hook-creation on all four
    resources
    with explicit ascending hook-weights:

    ServiceAccount        helm.sh/hook-weight: "1"
    ClusterRole           helm.sh/hook-weight: "2"
    ClusterRoleBinding    helm.sh/hook-weight: "3"
    Job                   helm.sh/hook-weight: "10"

    injectHookOnDoc honors author-declared hooks, the weights
    guarantee the Job's prerequisites exist before it runs, and
    post-upgrade refreshes both the Job and its RBAC on every
    helm upgrade. stripHelmHooks on the non-vendored path strips
    ALL of these annotations (every phase here is in
    syncPhaseHooks at
    pkg/bundler/deployer/localformat/hooks.go:109-120), so this
    declaration is inert there and Helm's default kind-based
    ordering (SA → ClusterRole → ClusterRoleBinding → Job) plus the
    chart-version-keyed Job name from path 1 deliver the same
    install order and re-fire mechanism without hooks.

helm.sh/resource-policy: keep applies to path 1 and prevents Helm's
three-way merge from pruning prior-version Jobs mid-rollout. It is a
no-op for the vendored hook path (Helm does not apply resource-policy
to hook resources); retention there is governed by
helm.sh/hook-delete-policy, and the chart-version-keyed Job name
means before-hook-creation only deletes a prior Job on same-version
re-deploys (Argo CD self-heal, Helm re-apply), where the Job is
intentionally re-run anyway. No ttlSecondsAfterFinished: the Job is
part of GitOps desired state, so a Kubernetes TTL delete would make
the resource missing relative to the rendered manifest and trigger an
immediate recreate on the next reconcile.

Fail-closed shell-script semantics (POSIX /bin/sh).
The Job is a correctness mitigation for stale NVML. Exiting 0 after a
classification flake or wait timeout would tell Helm/GitOps
"mitigation succeeded" when it may not have run, which is strictly
worse than a failed release because operators lose the actionable
signal. The script therefore:

  • Exits 0 only when state is positively determined as a no-op:
    DRA DaemonSet absent (fresh install before
    nvidia-dra-driver-gpu has applied), nvidia-driver-daemonset
    absent (host-managed driver mode — GKE COS, OKE, Kind, Talos),
    or zero managed nodes labeled nvidia.com/gpu.deploy.driver=true.
  • Exits 1 on every other failure: any classification kubectl get returns non-zero, kubectl wait for the per-node
    gpu-driver-upgrade-state=upgrade-done label times out at 15m,
    kubectl rollout restart fails, or kubectl rollout status
    times out at 5m.
  • Uses a capture-then-check pattern
    KUBECTL_OUT=$(kubectl ... 2>&1) || { echo "ERROR: ..." >&2; exit 1; } — to distinguish "API call failed" from "result is empty".
    alpine/k8s's BusyBox /bin/sh lacks set -o pipefail, so the
    prior kubectl ... | grep ... | head patterns silently masked
    kubectl exit codes and let a failed API call exit 0 after no
    rollout occurred.
  • backoffLimit: 2 on the Job is load-bearing under fail-closed
    semantics: a transient apiserver hiccup auto-retries twice before
    surfacing as a failed release.

Script mirrors deploy.sh.tmpl line 347:

  • Detects host-managed-driver overlays (GKE COS, OKE, Kind, Talos) by
    absence of nvidia-driver-daemonset and skips the migration wait.
  • Waits up to 15m for every node labeled
    nvidia.com/gpu.deploy.driver=true to reach
    nvidia.com/gpu-driver-upgrade-state=upgrade-done.
  • Discovers the DRA kubelet-plugin DaemonSet at runtime by
    app.kubernetes.io/name=nvidia-dra-driver-gpu label across all
    namespaces, so os-talos / privileged-* overlays that move the DRA
    driver out of its default namespace work without any recipe change.
  • Issues kubectl rollout restart and waits up to 5m for
    kubectl rollout status.

RBAC (ClusterRole + ClusterRoleBinding only):

  • daemonsets.appsget/list/watch/patch (cluster-wide; watch
    required by kubectl rollout status).
  • nodesget/list/watch (for the kubectl wait on the
    gpu-driver-upgrade-state label).
  • podsget/list/watch (rollout-status pod readiness).

No namespaced Role: gpu-operator-post installs before
nvidia-dra-driver-gpu's chart, so a Role inside nvidia-dra-driver
would fail on a fresh cluster because the namespace doesn't exist yet.

Image: docker.io/alpine/k8s:1.34.8 pinned by multi-arch manifest
list digest @sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be,
matching the repo's digest-pinning standard. alpine/k8s ships kubectl
plus BusyBox sh; the script targets POSIX /bin/sh (no [[…]], no
pipefail). bitnami/kubectl was rejected: Bitnami's catalog
reshuffle deleted all non-latest tags from the free namespace, and
bitnamilegacy/kubectl tops out at 1.33.4.

deploy.sh block kept as defense-in-depth. Per issue #980's stated
preference. Extra rollout-restart on an already-rolled DaemonSet is a
no-op. Follow-up will evaluate removing the script-side block once the
Job has shipped in a release and proven reliable.

Wiring lives on the gpu-operator componentRef in
recipes/overlays/base.yaml
so it cascades to every service overlay
(EKS, GKE, AKS, OKE, Kind) via the additive manifestFiles merge.

Cross-Deployer Install-Time Gating

The Job acts as a per-deployer install-time gate via:

  • helm ./deploy.shgpu-operator-post case in
    pkg/bundler/deployer/helm/templates/deploy.sh.tmpl sets
    --wait-for-jobs + --timeout 25m + COMPONENT_FORCE_WAIT=true
    (overrides --no-wait). gpu-operator vendored-charts case lifted
    to 25m for the path where the Job ships as a hook on the primary
    chart.
  • helmfilereleaseOverrides in
    pkg/bundler/deployer/helmfile/releases.go sets
    gpu-operator-post: {timeout: 25m, waitForJobs: true} and
    gpu-operator: {timeout: 25m} for the vendored case.
  • FluxreleaseTimeoutOverrides in
    pkg/bundler/deployer/flux/helm.go sets gpu-operator-post: 25m;
    templates emit spec.timeout conditionally. helm-controller
    already waits for hook Jobs by default.

Known limitations

These are reliable for the common case but NOT airtight in two
scenarios, tracked as follow-ups (intentionally NOT addressed in
this PR to keep scope contained):

  1. Argo CD app-of-apps gating. Sync-waves on child Applications
    give ordering but not gating. Argo CD 1.8 removed the default
    Health assessment for
    argoproj.io/Application
    ,
    so the parent app-of-apps treats each child as immediately
    Healthy on creation and does not wait for the Job's
    status.succeeded before advancing to the next sync-wave (here,
    nvidia-dra-driver-gpu). To restore gating, operators must add
    a resource.customizations Lua block on argocd-cm that
    derives Application Health from status.health.status. Without
    that customization, Argo CD users fall back to the Job's own
    fail-closed semantics — exit 1 on wait/restart/status failure
    surfaces the gpu-operator-post Application as unHealthy, but
    nvidia-dra-driver-gpu may still sync in parallel.

  2. Post-bundle driver.version edits. The Job-name slug is
    rendered at bundle time
    (pkg/bundler/deployer/localformat/local_helm.go:155 calls
    manifest.Render which bakes .Values into the static output),
    so a driver.version change applied AFTER aicr bundle does
    NOT update the Job name — same-chart upgrades skip the
    migration mitigation. Common workflow aicr bundle --set gpuoperator:driver.version=… -o new-bundle/ works as expected
    (re-bundling produces a uniquely-named Job). Operators editing
    the generated values.yaml, helm install --set, or AICR
    --dynamic paths on driver.version must re-run aicr bundle
    to refresh the Job name. Closing this fully needs deferring the
    dra-rollout-hook manifest's render to helm install time — a
    structural change to the localformat render pipeline.

The transient race window is also bounded and self-correcting
in the residual cases above (worst-case ~15-20m before the Job's
restart fires, then backoffLimit: 2 on real failure), not a
permanent stuck state. The Job exits 1 on any wait/restart/status
failure, so a release that didn't successfully re-roll the
DaemonSet shows up as a failed Helm release / failing GitOps
Application — operators get the actionable signal that the prior
warning-only design suppressed.

Testing

yamllint -c .yamllint.yaml \
  recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml \
  recipes/overlays/base.yaml   # clean

GOFLAGS="-mod=vendor" go test \
  ./pkg/manifest/... \
  ./pkg/bundler/deployer/localformat/... \
  ./pkg/bundler/deployer/helmfile/... \
  ./pkg/bundler/deployer/helm/... \
  ./pkg/recipe/... \
  ./pkg/bundler/...
# all PASS

make qualify   # PASSED (unit + lint + chainsaw e2e + vuln scan)
make bom-docs  # no diff (alpine/k8s image pin unchanged)

Cross-review by Claude Code + Codex + CodeRabbit surfaced four
findings across two iterations, all addressed in this revision:

  • [major, blocking] Vendored bundles bypassed re-fire. With
    --vendor-charts, injectPostInstallHooks auto-tagged the Job
    post-install only — Helm skipped it on upgrade and the feat(gpu-operator-post): add Helm post-install hook to re-roll DRA kubelet plugin across all deployers #980
    mitigation was silently absent for vendored Helm / helmfile /
    Argo CD / argocd-helm. Fix: author-declare
    helm.sh/hook: post-install,post-upgrade on the Job.
  • [P1, blocking] Vendored hook ordering broke install (caught
    by Codex on the follow-up review). The first revision of the
    fix above declared a hook only on the Job, leaving
    ServiceAccount / ClusterRole / ClusterRoleBinding to be
    auto-tagged at weights 100/101/102 while the Job's
    author-declared hook took Helm's default weight 0 — running
    the Job before its ServiceAccount existed. Fix: author-declare
    hooks on all four resources with ascending weights
    (SA=1, CR=2, CRB=3, Job=10).
  • [major] Warning-only timeouts and silent API errors. The prior
    kubectl wait and kubectl rollout status timeouts only printed
    WARNING: and the Job exited 0; 2>/dev/null | ... || true
    pipelines masked classification API failures. Fix: switch to
    fail-closed exits + capture-then-check pattern as described above.
  • [minor] Stale upgrade-note in docs/user/component-catalog.md.
    The Helmfile sub-bullet claimed the new bundle no longer
    references gpu-operator-post, but this PR reintroduces it for
    the DRA rollout hook (different content from the old
    dcgm-exporter ConfigMap). Updated to acknowledge the
    reintroduction.

TBD before flipping to ready-for-review (why this is still draft):

  • Generated-artifact parity test: bundle a representative recipe
    with the Helm deployer AND a GitOps deployer (AND with
    --vendor-charts); assert the rendered output contains the
    aicr-dra-rollout-hook-<version> Job + ClusterRole +
    ClusterRoleBinding in every variant, and that the vendored
    variant carries the author-declared helm.sh/hook
    annotations.
  • Cluster verification: on a cluster running an older
    gpu-operator, deploy the new bundle via a GitOps deployer
    (NOT ./deploy.sh) AND with --vendor-charts; assert the
    DRA kubelet plugin re-rolls after per-node driver migration
    completes and aicr validate --phase performance passes
    without manual intervention.

Follow-Up Work (Out of Scope for This PR)

  • Argo CD argocd-cm Application Health customization. Restore
    gating on the app-of-apps path by emitting a documented
    resource.customizations Lua block (or guidance to apply one)
    that derives argoproj.io/Application Health from
    status.health.status. See Known limitations.
  • Post-bundle driver.version render deferral. Defer
    recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml
    render to helm install time so install-time --set / --dynamic
    / values.yaml edits on driver.version re-key the Job. Structural
    change to the localformat render pipeline
    (pkg/bundler/deployer/localformat/local_helm.go:155). See
    Known limitations.
  • Old-version Job cleanup mechanism (CronJob or sweep) if Succeeded
    Jobs accumulating one-per-past-chart-bump becomes operationally
    noisy.
  • Evaluate removing the redundant kubectl wait + rollout restart
    block from pkg/bundler/deployer/helm/templates/deploy.sh.tmpl once
    this Job has shipped in a release and proven reliable.

Risk Assessment

  • Medium — touches the gpu-operator install path for every
    recipe in the registry (wired at base.yaml). Failure modes
    are now fail-closed: any classification flake, migration
    wait timeout, restart failure, or rollout-status timeout
    surfaces as a failed Helm release / failing GitOps
    Application — operators see it. Job failure does not break
    gpu-operator itself (the Job runs after the chart resources
    apply). The helm.sh/resource-policy: keep on prior Jobs
    keeps a small footprint (one Job per past chart version,
    manually GC-able by label).

Rollout notes: First effective run is the next gpu-operator chart
bump on a cluster running an older gpu-operator; the Job fires
automatically as part of the gpu-operator-post chart install/upgrade.
No flag, no opt-in. Reverting is a one-line revert of base.yaml plus
the manifest delete (plus make bom-docs to refresh
container-images.md).

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — DEFERRED (parity test, see TBD above)
  • I updated docs if user-facing behavior changed — docs/user/container-images.md (regenerated), docs/user/component-catalog.md (upgrade-note revised)
  • Changes follow existing patterns in the codebase — mirrors kernel-module-params.yaml Helm-template + deploy.sh.tmpl script
  • Commits are cryptographically signed (git commit -S)

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a Helm post-install/post-upgrade Job manifest (chart-version–keyed) and cluster RBAC that detects GPU driver migration state, optionally waits for node upgrade labels (≤15m), discovers the DRA kubelet-plugin DaemonSet, issues a rollout restart and waits (≤5m). The Job runs with a pinned image, restrictive securityContext, restart/backoff settings, and is registered in the base recipe overlay; the BOM/docs add the new container image. Flux bundler is updated to accept a parent chartVersion, normalize it, and pre-render manifests before writing Chart.yaml and manifests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

  • NVIDIA/aicr#1017: Related changes to manifest rendering and Flux bundler used alongside chart-templating updates.

Suggested labels

area/bundler

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding a gpu-operator-post Job that re-rolls the DRA kubelet plugin across all deployers, directly matching the changeset's primary objective.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description comprehensively explains the changeset, including motivation, implementation details, cross-deployer gating, known limitations, and testing status.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml`:
- Around line 178-182: The daemonset lookup uses a broad substring match
'/kubelet-plugin/' which can yield false positives or miss the intended DRA
daemonset; update the command that sets DRA_DS (the variable computed with
kubectl get daemonset and awk) to match a more specific/anchored name pattern
such as the expected full daemonset name (e.g.,
'nvidia-dra-driver-gpu-kubelet-plugin') or a stricter regex that anchors the
suffix/prefix, so that DRA_DS reliably selects only the intended DaemonSet in
the DRA_NS namespace before the emptiness check and early exit.
- Line 149: Replace the invalid image tag "docker.io/bitnami/kubectl:1.34" in
the dra-rollout hook manifest (the line starting with "image:
docker.io/bitnami/kubectl:1.34") with a valid bitnami/kubectl tag; verify the
correct tag on Docker Hub (or use a known working tag like the latest stable
bitnami/kubectl) and update the image field accordingly so the hook Job can pull
the image successfully.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d14cf4b7-4226-4113-9930-526ca9b649a7

📥 Commits

Reviewing files that changed from the base of the PR and between c210874 and b92ce59.

📒 Files selected for processing (2)
  • recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml
  • recipes/overlays/base.yaml

Comment thread recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml Outdated
Comment thread recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml Outdated
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 26, 2026
Six findings from Codex review on PR NVIDIA#1030's initial commit (all
verified against source):

P1 — Use only supported manifest template functions
  pkg/manifest/render.go:98 (helmFuncMap) registers toYaml, nindent,
  toString, default, replace, trunc, trimSuffix. The Sprig "quote"
  function is not in the map, so v1's "{{ $draNamespace | quote }}"
  fails template parsing during aicr bundle. Replaced with a literal
  string in the script (the namespace is no longer hardcoded — see
  P2.1 below — so the inline "..." is gone too) and replaced the two
  label values that used "{{ .Chart.Version | quote }}" with
  YAML-syntactic quotes around the template expression.

P1 — Ensure the DRA namespace exists before RBAC
  On a fresh cluster, gpu-operator-post installs immediately after
  gpu-operator-pre/gpu-operator but BEFORE nvidia-dra-driver-gpu has
  applied its chart with --create-namespace. v1's Role and
  RoleBinding in $draNamespace would therefore fail because the
  nvidia-dra-driver namespace does not exist yet. Collapsed the
  cross-namespace Role+RoleBinding into a single ClusterRole+
  ClusterRoleBinding that grants daemonsets get/list/watch/patch
  cluster-wide; the cluster-scoped binding has no per-namespace
  dependency. Trade-off is slightly broader RBAC, but the SA is
  short-lived and scoped to this one Job.

P1 — Preserve hook semantics through manifest bundling
  pkg/bundler/deployer/localformat/local_helm.go:165 unconditionally
  calls stripHelmHooks on every manifest rendered into a local-helm
  chart. stripHelmHooks removes helm.sh/hook* annotations for
  post-install / post-upgrade / pre-install / pre-upgrade so that
  Argo CD's syncPolicy.automated can apply the resource at all
  (PostSync hooks never fire on path-based sources under automated
  sync). v1 relied on those annotations and would have shipped as a
  regular Job that fires only once at first install, never on
  upgrades. Re-firing is now achieved by keying the Job name on
  .Chart.Version (the synthesized gpu-operator-post chart inherits
  the parent gpu-operator componentRef's version at
  pkg/bundler/deployer/localformat/writer.go:121), so each chart
  bump produces a new Job manifest that every deployer applies.
  helm.sh/resource-policy: keep prevents Helm's three-way merge
  from pruning prior-version Jobs mid-rollout;
  ttlSecondsAfterFinished: 600 lets Kubernetes self-clean succeeded
  Jobs after 10m so the namespace doesn't accumulate one per past
  chart bump.

P2 — Derive the DRA namespace instead of hardcoding it
  The os-talos mixin places nvidia-dra-driver-gpu in
  privileged-nvidia-dra-driver-gpu, and any recipe can override the
  component namespace. v1's hardcoded "nvidia-dra-driver" both
  bound the RBAC to the wrong namespace and pointed DRA_NS at the
  wrong place. With RBAC collapsed to ClusterRole (P1.2 above), the
  remaining script-side concern is locating the DaemonSet at
  runtime. The script now discovers the DRA kubelet-plugin
  DaemonSet via kubectl get daemonset -A
  -l app.kubernetes.io/name=nvidia-dra-driver-gpu, filtering for
  the kubelet-plugin/dra-driver DaemonSet (not the controller
  Deployment), with a fallback that takes the first matched
  DaemonSet for chart variants that don't set a component label.

P2 — Grant daemonsets/watch for kubectl rollout status
  kubectl rollout status daemonset/... uses Watch on the DaemonSet
  resource, not just Get on Pods. v1's Role granted only
  get/list/patch on daemonsets, so every hook run would get
  Forbidden on the watch and skip the readiness gate. Added
  "watch" to the daemonsets ClusterRole rule. The pods rule
  retains get/list/watch.

P2 — Pin the kubectl image by digest
  Repo standard is digest-pinned images (validator catalog,
  on-tag.yaml, EKS images, renovate runner all use @sha256:…).
  Separately, Bitnami's catalog reshuffle removed all non-latest
  tags from bitnami/kubectl (only "latest" remains in the free
  namespace today; bitnamilegacy/kubectl tops out at 1.33.4),
  so the v1 reference "bitnami/kubectl:1.34" no longer resolves.
  Switched to docker.io/alpine/k8s:1.34.8 pinned by the multi-arch
  manifest list digest sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be.
  alpine/k8s ships kubectl plus a BusyBox shell, so the embedded
  script was re-targeted to POSIX /bin/sh (no [[…]], no pipefail).

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
@yuanchen8911 yuanchen8911 changed the title WIP: feat(gpu-operator-post): re-roll DRA kubelet plugin via Helm hook (#980) WIP: feat(gpu-operator-post): re-roll DRA kubelet plugin across all deployers (#980) May 26, 2026
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 26, 2026
Two Codex re-review findings on PR NVIDIA#1030:

P1 — Drop ttlSecondsAfterFinished on the chart-version-keyed Job
  v2 set ttlSecondsAfterFinished: 600 so prior-version Jobs would
  self-clean. With the Job as a regular (non-hook) chart resource,
  it is now part of the GitOps desired state: Argo CD's self-heal
  and Helm's three-way merge both treat it as desired. A Kubernetes
  TTL delete makes the resource missing relative to the rendered
  manifest, so the next reconcile recreates it and re-runs the DRA
  rollout-restart on every cycle. Removed the TTL. Old-version Jobs
  persist as Succeeded records (with helm.sh/resource-policy: keep
  preserving them across upgrades); operators can garbage-collect
  by label or via a follow-up cleanup mechanism if they become
  operationally noisy.

P2 — Regenerate docs/user/container-images.md to include alpine/k8s
  CLAUDE.md requires `make bom-docs` after any change that adds a
  new image reference. The new dra-rollout-hook manifest pins
  docker.io/alpine/k8s:1.34.8@sha256:4d352ba…, and the BOM tool
  picks it up from manifests. gpu-operator's per-component image
  count goes from 14 → 15, with the new image listed under the
  gpu-operator section.

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
@github-actions
Copy link
Copy Markdown
Contributor

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml`:
- Around line 119-126: Update the earlier header comment in
dra-rollout-hook.yaml (the lines that currently state TTL cleans up old Jobs) to
match the later rationale: explicitly note that ttlSecondsAfterFinished is
intentionally omitted because Argo CD/Helm treat chart-version-keyed Jobs as
desired state and a TTL-based deletion would cause GitOps to immediately
recreate the Job on reconcile; reference ttlSecondsAfterFinished, Job, Argo CD
and Helm in the revised header so both comments are consistent and avoid
contradictory operator guidance.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 4ab151dc-817c-49ac-8ed0-5d1bb4d9020b

📥 Commits

Reviewing files that changed from the base of the PR and between e67e3ad and 59b88a9.

📒 Files selected for processing (2)
  • docs/user/container-images.md
  • recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml

Comment thread recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml Outdated
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 26, 2026
Codex P3 re-review on PR NVIDIA#1030 noted two comments that hadn't kept up
with the design pivot:

dra-rollout-hook.yaml — the header still claimed
ttlSecondsAfterFinished lets Kubernetes self-clean prior Jobs, but
the spec intentionally omits TTL (per the prior fixup) and the
inline annotation comment already explains why. Rewrote the header
paragraph to match: TTL is deliberately absent because the Job is
GitOps desired state and a TTL delete would trigger a recreate on
every reconcile.

base.yaml — the manifestFiles comment described this entry as a
"Helm post-install,post-upgrade hook," but it ships as a regular
chart-version-keyed Job because the local-helm bundler strips
helm.sh/hook* for Argo CD compatibility. Rewrote to describe the
actual mechanism (versioned Job name → re-fires on every chart bump)
and note the hook-strip rationale.

Comment-only change. No spec, RBAC, or script behavior diff.

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 26, 2026
… parent

PR NVIDIA#1030's Tier 1 KWOK flux-oci tests failed across every recipe with:

  Job.batch "aicr-dra-rollout-hook-0-1-0-537018ad0a4c" is invalid:
  metadata.labels: Invalid value: "0.1.0+537018ad0a4c": a valid label
  must be an empty string or consist of alphanumeric characters,
  '-', '_' or '.', ...

The flux deployer was writing manifest templates raw into the
synthesized -pre / -post chart and leaving substitution to Helm. The
synthesized chart's Chart.yaml hardcoded version: 0.1.0, so Helm
resolved .Chart.Version to a value unrelated to the parent component
— and under flux-oci, source-controller appended a +<artifact-sha>
suffix that the K8s API server rejected as a label value.

The chart-version-keyed Job name from NVIDIA#980 only worked under
localformat (helm / helmfile / argocd-helm) because that deployer
already calls pkg/manifest.Render at bundle time with the parent
componentRef's version. Flux didn't, so the same template silently
behaved differently across deployers.

Fix: thread the parent's normalized chart version into
generateManifestHelmChart and (a) pre-render every manifest via
pkg/manifest.Render with ChartVersion=<parent>, (b) write Chart.yaml
with version: <parent> too so any external reader of the synthesized
chart's metadata sees the real identity.

After the fix, rendered output is identical across deployers — the
dra-rollout-hook Job name becomes aicr-dra-rollout-hook-26-3-1 (the
gpu-operator chart pin) under every deployer, and the
aicr.nvidia.com/gpu-operator-chart-version label carries "26.3.1"
(valid label value, no '+').

Drive-by: refresh the dra-rollout-hook.yaml header comment that
incorrectly attributed the .Chart.Version substitution to the
synthesized chart's Chart.yaml — the substitution now happens at
manifest.Render time, not at Helm install time.

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/bundler/deployer/flux/helm.go`:
- Around line 233-247: The current call to manifest.Render(...) is consuming
deploy-time Helm values (Values: g.ComponentValues[compName]) which hard-codes
.Values into generated templates; remove/stop passing in component values to
manifest.Render so only chart metadata (ComponentName, Namespace, ChartName,
ChartVersion) are pre-rendered and leave .Values lookups intact for
runtime/Helm. Update the manifest.Render invocation in
pkg/bundler/deployer/flux/helm.go (the manifest.Render and manifest.RenderInput
usage) to omit or pass an empty map for Values, and if any manifests require
DynamicValues, explicitly detect and fail with a clear error until partial
rendering is implemented; ensure downstream HelmRelease spec.values/valuesFrom
wiring logic remains unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 8e1e978f-d5f0-4dcb-ac6b-c8c82378c87b

📥 Commits

Reviewing files that changed from the base of the PR and between 85b5463 and aa478a2.

📒 Files selected for processing (3)
  • pkg/bundler/deployer/flux/flux.go
  • pkg/bundler/deployer/flux/helm.go
  • recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml

Comment thread pkg/bundler/deployer/flux/helm.go Outdated
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 26, 2026
…ifest pre-render

Codex re-review on PR NVIDIA#1030 caught a contract violation in the
flux/helm.go change from commit aa478a2: calling manifest.Render at
bundle time substitutes every .Values.* expression in the manifest
template with the static values, before splitDynamicPaths separates
dynamic paths into the Flux ConfigMap. Operators editing
configmap-values.yaml to override a dynamic path then have no effect
on the rendered manifest — the static default has already been baked
in. The cited test (TestGenerate_WithDynamicValues_ManifestComponent)
only verifies ConfigMap wiring, not that .Values.* references survive
into the rendered chart, so it passed despite the regression.

Fix has two parts:

1. Revert the flux/helm.go manifest.Render call. Manifests are
   written raw again so Helm substitutes .Values.* at install time
   from the merged static + dynamic values, restoring the documented
   Flux dynamic-values behavior. Keep the chartVersion threading
   that sets Chart.yaml's version field to the parent's normalized
   value — that's cosmetic metadata only (no template uses
   .Chart.Version on this path anymore).

2. Plumb the parent gpu-operator chart version through the values
   map instead. DefaultBundler.injectDRAParentChartVersionValue
   writes gpu-operator componentValues[_aicrParentChartVersion]
   = NormalizeVersionWithDefault(ref.Version) at bundle generation,
   gated on both gpu-operator and nvidia-dra-driver-gpu being
   enabled in the filtered recipe. The dra-rollout-hook template
   reads it via `{{ index .Values "gpu-operator" "_aicrParentChartVersion" }}`
   and uses that for both the Job name slug and the
   aicr.nvidia.com/gpu-operator-chart-version label.

Why the values-map path beats .Chart.Version on every deployer:

 - localformat / argocd-helm / helmfile already pre-render manifest
   templates via pkg/manifest.Render with the parent's wrapped
   values, so the injected key resolves at bundle time to the real
   gpu-operator version.
 - flux writes the manifest raw and Helm resolves at install time
   from the HelmRelease's spec.values (which carries the injected
   value verbatim, no source-controller '+<artifact-sha>' suffix
   that would otherwise have broken K8s label validation).
 - The synthesized chart's Chart.yaml-hardcoded `version: 0.1.0`
   no longer matters for the rollout-hook's identity.

Tests added: bundler_dra_parent_chart_version_test.go covers the
helper end-to-end — positive case, both gating branches (DRA
disabled, gpu-operator disabled), user --set override is overridden
by the bundler-derived value, nil-tolerant inputs.

End-to-end verification: `helm template` on locally-bundled flux and
helmfile outputs both produce Job name `aicr-dra-rollout-hook-26-3-1`
and label value `"26.3.1"` — identical across deployers, no '+',
valid against K8s label validation.

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
@github-actions github-actions Bot added size/XL and removed size/L labels May 26, 2026
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 26, 2026
Second Codex pass on PR NVIDIA#1030 flagged a gating mismatch:
injectDRAParentChartVersionValue required BOTH gpu-operator and
nvidia-dra-driver-gpu enabled, but the dra-rollout-hook manifest is
wired into base.yaml's gpu-operator componentRef's manifestFiles and
gets emitted whenever gpu-operator is enabled — independent of DRA.

Repro per Codex:

  aicr bundle --recipe recipes/overlays/h100-eks-ubuntu-training.yaml \
              --set nvidia-dra-driver-gpu:enabled=false

  ⇒ gpu-operator-post emits a Job with:
      metadata.name: aicr-dra-rollout-hook-<nil>
      aicr.nvidia.com/gpu-operator-chart-version: "<nil>"
      helm.sh/chart: gpu-operator-<nil>

  The K8s API server rejects the Job because <nil> isn't valid
  DNS-1123.

Fix: drop the \!draEnabled half of the gate. The injection now runs
whenever gpu-operator itself is enabled (which mirrors the
manifest's emission condition). When DRA is in fact disabled at
runtime, the hook script's existing "no DRA DaemonSet found → exit
0" branch keeps the Job a no-op — over-injecting the value is
harmless. The unit test that previously asserted "no injection when
DRA disabled" now asserts the inverse, matching the documented
behavior.

Verified locally: `aicr bundle --deployer flux --set
nvidia-dra-driver-gpu:enabled=false` followed by `helm template`
on the synthesized gpu-operator-post chart produces
`aicr-dra-rollout-hook-26-3-1` Job name with label `"26.3.1"`,
identical to the DRA-enabled rendering.

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 26, 2026
Fourth Codex pass on PR NVIDIA#1030 flagged that the Job's pod spec has
no tolerations, so on tainted CI clusters (KWOK nodes carry
accelerator/system taints by default in the Tier-1 KWOK suite) the
hook pod stays Pending with FailedScheduling. Flux's HelmRelease
waits for the synthesized gpu-operator-post chart to become Ready;
the un-schedulable Job pod keeps it in flight, every downstream
release stalls on DependencyNotReady, and the cross-deployer
rollout protection NVIDIA#980 was designed to provide silently doesn't.

Fix: add `tolerations: [{operator: Exists}]` to the Job's pod
template. The Job is short-lived (a few kubectl calls + an optional
15m wait for driver migration on managed-driver overlays) and
exclusively talks to the kube-apiserver — it does NOT touch the
GPU, so landing on a GPU-tainted node is harmless and a system
node works equally well. Tolerating every taint lets the scheduler
place it on whichever node has capacity first.

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 26, 2026
The argocd-helm-oci wrapper script was passing the FULL bundle URL to
`helm install --set repoURL=…` (including the per-recipe chart name
at the end). That matched the pre-PR-NVIDIA#1032 contract where the parent
Application's `source.chart` was hardcoded to `aicr-bundle`.

PR NVIDIA#1032 (and NVIDIA#1035's reinforcement) changed the parent App template
to expect the parent-namespace-only repoURL and to append .Chart.Name
itself via the separate `source.chart` field. The wrapper script
wasn't updated to match. Result on every PR with argocd-helm-oci
Tier-1 KWOK coverage: the parent App resolves to
`oci://registry.aicr-registry.svc.cluster.local:5000/aicr/<recipe>/<recipe>:<tag>`,
the OCI artifact lookup 404s, gpu-operator-post's Application can
never sync, and the whole stack times out on
`GitOps sync timeout strike 1/3`.

The failure was masked on `main` because the most-recent KWOK
Cluster Validation run on `main` (#26469449378 at 0d3e62d, success)
ran *before* PR NVIDIA#1035 merged. After NVIDIA#1035 / NVIDIA#1036 / NVIDIA#1038 all landed
on main, no fresh KWOK run has triggered on `main` yet — but the
next one will fail the same way every open PR's argocd-helm-oci
Tier-1 jobs are currently failing.

Fix is a one-line drop of the per-recipe suffix from
OCI_IN_CLUSTER_REF in the argocd-helm-oci branch of generate_bundle.
The flux branch keeps the per-recipe suffix because flux's
OCIRepository CR consumes the FULL artifact URL (recipe segment
included). Updated the surrounding comment to point at the
post-NVIDIA#1032 contract so the next reader understands the asymmetry.

End-to-end check (verified from PR NVIDIA#1030's debug artifact at
b3f2296): repo-server log shows
`registry.aicr-registry.svc.cluster.local:5000/aicr/<recipe>/<recipe>:<tag>: not found`,
caused by the same double-append. With the recipe suffix dropped,
Argo's resolution `<repoURL>/<chart>:<tag>` aligns with the pushed
artifact at `oci://…/aicr/<recipe>:<tag>`.

Refs PR NVIDIA#1030 (where this surfaced), PR NVIDIA#1032 (contract change),
PR NVIDIA#1035 (parent App template enforcement).
@yuanchen8911
Copy link
Copy Markdown
Contributor Author

Tracked down the Tier-1 argocd-helm-oci CI failures — not a #980 bug, but a pre-existing wrapper-script issue surfaced by my PR's CI re-run.

Root cause: the kwok wrapper script kwok/scripts/validate-scheduling.sh was passing the FULL bundle URL via helm install --set repoURL=oci://…/aicr/<recipe> (the pre-PR-#1032 contract). PR #1032 changed the parent App template to expect the parent-namespace-only repoURL, with .Chart.Name appended via the separate source.chart field. The wrapper wasn't updated to match. Result on every PR with argocd-helm-oci coverage: parent App resolves to oci://…/aicr/<recipe>/<recipe>:<tag>, the OCI lookup 404s, and gpu-operator-post Application never syncs (debug-artifact evidence: failed to resolve revision: …/aicr/aks/aks:0.0.0-…-argocd-helm-oci: not found).

The failure was masked on main because the most-recent KWOK Cluster Validation run on main (#26469449378 at 0d3e62df) predates PR #1035. After #1035 / #1036 / #1038 all landed on main today, no new main KWOK run has fired yet — but it would fail the same way every open PR's argocd-helm-oci jobs are failing.

Fix in commit 5620000 — one-line drop of the per-recipe suffix from OCI_IN_CLUSTER_REF in the argocd-helm-oci branch of generate_bundle. The flux branch keeps the full URL because flux's OCIRepository consumes the full artifact reference. CI will re-run and the Tier-1 argocd-helm-oci jobs should go green alongside the already-passing flux-oci ones.

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 26, 2026
…d runs)

Codex P3 follow-up on the previous trim hardening: two passes of
`trimSuffix . | _ | -` still leave the label invalid when a SemVer
pre-release lands a three-separator run like `.--` at the truncation
boundary. Repro Codex provided:

  parent = 1.2.3-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.--b
  → aicr.nvidia.com/gpu-operator-chart-version: "…aaaaaa."   (rejected)

pkg/manifest's helmFuncMap doesn't register a regex helper, so a
chain of `trimSuffix` is the only suffix-stripping primitive
available across both the bundler-time render path (limited
funcMap) and the Helm install-time path. Going from 2 cycles to 5
cycles handles up to 15 trailing-separator chars of any combo, and
tightening the trunc from 63 → 60 leaves 3 chars of headroom for the
strip to consume without losing meaningful version digits. Applied
the same treatment to $chartLabel which has the identical hazard
(plus the `_` from upstream `replace "+" "_"`).

Verified with Codex's three-separator repro:
  …aaaa.--b (trunc-boundary `.--` tail)
  → "1.2.3-aaaaaa…aaaa"   (60 chars, ends alphanumeric)

Normal 26.3.1 / 26.3.1+build.1 paths still render correctly.

This is not formally bulletproof — a 16-separator tail would still
slip through. A true fix needs either a regex helper added to
pkg/manifest's helmFuncMap or a Go-side label sanitizer pre-injected
via the values map (similar to _aicrParentChartVersion). Filed as
a follow-up; the current chain handles every realistic SemVer input.

Refs PR NVIDIA#1030.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
Six findings from Codex review on PR NVIDIA#1030's initial commit (all
verified against source):

P1 — Use only supported manifest template functions
  pkg/manifest/render.go:98 (helmFuncMap) registers toYaml, nindent,
  toString, default, replace, trunc, trimSuffix. The Sprig "quote"
  function is not in the map, so v1's "{{ $draNamespace | quote }}"
  fails template parsing during aicr bundle. Replaced with a literal
  string in the script (the namespace is no longer hardcoded — see
  P2.1 below — so the inline "..." is gone too) and replaced the two
  label values that used "{{ .Chart.Version | quote }}" with
  YAML-syntactic quotes around the template expression.

P1 — Ensure the DRA namespace exists before RBAC
  On a fresh cluster, gpu-operator-post installs immediately after
  gpu-operator-pre/gpu-operator but BEFORE nvidia-dra-driver-gpu has
  applied its chart with --create-namespace. v1's Role and
  RoleBinding in $draNamespace would therefore fail because the
  nvidia-dra-driver namespace does not exist yet. Collapsed the
  cross-namespace Role+RoleBinding into a single ClusterRole+
  ClusterRoleBinding that grants daemonsets get/list/watch/patch
  cluster-wide; the cluster-scoped binding has no per-namespace
  dependency. Trade-off is slightly broader RBAC, but the SA is
  short-lived and scoped to this one Job.

P1 — Preserve hook semantics through manifest bundling
  pkg/bundler/deployer/localformat/local_helm.go:165 unconditionally
  calls stripHelmHooks on every manifest rendered into a local-helm
  chart. stripHelmHooks removes helm.sh/hook* annotations for
  post-install / post-upgrade / pre-install / pre-upgrade so that
  Argo CD's syncPolicy.automated can apply the resource at all
  (PostSync hooks never fire on path-based sources under automated
  sync). v1 relied on those annotations and would have shipped as a
  regular Job that fires only once at first install, never on
  upgrades. Re-firing is now achieved by keying the Job name on
  .Chart.Version (the synthesized gpu-operator-post chart inherits
  the parent gpu-operator componentRef's version at
  pkg/bundler/deployer/localformat/writer.go:121), so each chart
  bump produces a new Job manifest that every deployer applies.
  helm.sh/resource-policy: keep prevents Helm's three-way merge
  from pruning prior-version Jobs mid-rollout;
  ttlSecondsAfterFinished: 600 lets Kubernetes self-clean succeeded
  Jobs after 10m so the namespace doesn't accumulate one per past
  chart bump.

P2 — Derive the DRA namespace instead of hardcoding it
  The os-talos mixin places nvidia-dra-driver-gpu in
  privileged-nvidia-dra-driver-gpu, and any recipe can override the
  component namespace. v1's hardcoded "nvidia-dra-driver" both
  bound the RBAC to the wrong namespace and pointed DRA_NS at the
  wrong place. With RBAC collapsed to ClusterRole (P1.2 above), the
  remaining script-side concern is locating the DaemonSet at
  runtime. The script now discovers the DRA kubelet-plugin
  DaemonSet via kubectl get daemonset -A
  -l app.kubernetes.io/name=nvidia-dra-driver-gpu, filtering for
  the kubelet-plugin/dra-driver DaemonSet (not the controller
  Deployment), with a fallback that takes the first matched
  DaemonSet for chart variants that don't set a component label.

P2 — Grant daemonsets/watch for kubectl rollout status
  kubectl rollout status daemonset/... uses Watch on the DaemonSet
  resource, not just Get on Pods. v1's Role granted only
  get/list/patch on daemonsets, so every hook run would get
  Forbidden on the watch and skip the readiness gate. Added
  "watch" to the daemonsets ClusterRole rule. The pods rule
  retains get/list/watch.

P2 — Pin the kubectl image by digest
  Repo standard is digest-pinned images (validator catalog,
  on-tag.yaml, EKS images, renovate runner all use @sha256:…).
  Separately, Bitnami's catalog reshuffle removed all non-latest
  tags from bitnami/kubectl (only "latest" remains in the free
  namespace today; bitnamilegacy/kubectl tops out at 1.33.4),
  so the v1 reference "bitnami/kubectl:1.34" no longer resolves.
  Switched to docker.io/alpine/k8s:1.34.8 pinned by the multi-arch
  manifest list digest sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be.
  alpine/k8s ships kubectl plus a BusyBox shell, so the embedded
  script was re-targeted to POSIX /bin/sh (no [[…]], no pipefail).

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
@yuanchen8911 yuanchen8911 force-pushed the feat/980-dra-rollout-hook branch from 0a2f7cb to 7436b4f Compare May 27, 2026 00:18
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
Two Codex re-review findings on PR NVIDIA#1030:

P1 — Drop ttlSecondsAfterFinished on the chart-version-keyed Job
  v2 set ttlSecondsAfterFinished: 600 so prior-version Jobs would
  self-clean. With the Job as a regular (non-hook) chart resource,
  it is now part of the GitOps desired state: Argo CD's self-heal
  and Helm's three-way merge both treat it as desired. A Kubernetes
  TTL delete makes the resource missing relative to the rendered
  manifest, so the next reconcile recreates it and re-runs the DRA
  rollout-restart on every cycle. Removed the TTL. Old-version Jobs
  persist as Succeeded records (with helm.sh/resource-policy: keep
  preserving them across upgrades); operators can garbage-collect
  by label or via a follow-up cleanup mechanism if they become
  operationally noisy.

P2 — Regenerate docs/user/container-images.md to include alpine/k8s
  CLAUDE.md requires `make bom-docs` after any change that adds a
  new image reference. The new dra-rollout-hook manifest pins
  docker.io/alpine/k8s:1.34.8@sha256:4d352ba…, and the BOM tool
  picks it up from manifests. gpu-operator's per-component image
  count goes from 14 → 15, with the new image listed under the
  gpu-operator section.

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
Codex P3 re-review on PR NVIDIA#1030 noted two comments that hadn't kept up
with the design pivot:

dra-rollout-hook.yaml — the header still claimed
ttlSecondsAfterFinished lets Kubernetes self-clean prior Jobs, but
the spec intentionally omits TTL (per the prior fixup) and the
inline annotation comment already explains why. Rewrote the header
paragraph to match: TTL is deliberately absent because the Job is
GitOps desired state and a TTL delete would trigger a recreate on
every reconcile.

base.yaml — the manifestFiles comment described this entry as a
"Helm post-install,post-upgrade hook," but it ships as a regular
chart-version-keyed Job because the local-helm bundler strips
helm.sh/hook* for Argo CD compatibility. Rewrote to describe the
actual mechanism (versioned Job name → re-fires on every chart bump)
and note the hook-strip rationale.

Comment-only change. No spec, RBAC, or script behavior diff.

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
… parent

PR NVIDIA#1030's Tier 1 KWOK flux-oci tests failed across every recipe with:

  Job.batch "aicr-dra-rollout-hook-0-1-0-537018ad0a4c" is invalid:
  metadata.labels: Invalid value: "0.1.0+537018ad0a4c": a valid label
  must be an empty string or consist of alphanumeric characters,
  '-', '_' or '.', ...

The flux deployer was writing manifest templates raw into the
synthesized -pre / -post chart and leaving substitution to Helm. The
synthesized chart's Chart.yaml hardcoded version: 0.1.0, so Helm
resolved .Chart.Version to a value unrelated to the parent component
— and under flux-oci, source-controller appended a +<artifact-sha>
suffix that the K8s API server rejected as a label value.

The chart-version-keyed Job name from NVIDIA#980 only worked under
localformat (helm / helmfile / argocd-helm) because that deployer
already calls pkg/manifest.Render at bundle time with the parent
componentRef's version. Flux didn't, so the same template silently
behaved differently across deployers.

Fix: thread the parent's normalized chart version into
generateManifestHelmChart and (a) pre-render every manifest via
pkg/manifest.Render with ChartVersion=<parent>, (b) write Chart.yaml
with version: <parent> too so any external reader of the synthesized
chart's metadata sees the real identity.

After the fix, rendered output is identical across deployers — the
dra-rollout-hook Job name becomes aicr-dra-rollout-hook-26-3-1 (the
gpu-operator chart pin) under every deployer, and the
aicr.nvidia.com/gpu-operator-chart-version label carries "26.3.1"
(valid label value, no '+').

Drive-by: refresh the dra-rollout-hook.yaml header comment that
incorrectly attributed the .Chart.Version substitution to the
synthesized chart's Chart.yaml — the substitution now happens at
manifest.Render time, not at Helm install time.

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
…ifest pre-render

Codex re-review on PR NVIDIA#1030 caught a contract violation in the
flux/helm.go change from commit aa478a2: calling manifest.Render at
bundle time substitutes every .Values.* expression in the manifest
template with the static values, before splitDynamicPaths separates
dynamic paths into the Flux ConfigMap. Operators editing
configmap-values.yaml to override a dynamic path then have no effect
on the rendered manifest — the static default has already been baked
in. The cited test (TestGenerate_WithDynamicValues_ManifestComponent)
only verifies ConfigMap wiring, not that .Values.* references survive
into the rendered chart, so it passed despite the regression.

Fix has two parts:

1. Revert the flux/helm.go manifest.Render call. Manifests are
   written raw again so Helm substitutes .Values.* at install time
   from the merged static + dynamic values, restoring the documented
   Flux dynamic-values behavior. Keep the chartVersion threading
   that sets Chart.yaml's version field to the parent's normalized
   value — that's cosmetic metadata only (no template uses
   .Chart.Version on this path anymore).

2. Plumb the parent gpu-operator chart version through the values
   map instead. DefaultBundler.injectDRAParentChartVersionValue
   writes gpu-operator componentValues[_aicrParentChartVersion]
   = NormalizeVersionWithDefault(ref.Version) at bundle generation,
   gated on both gpu-operator and nvidia-dra-driver-gpu being
   enabled in the filtered recipe. The dra-rollout-hook template
   reads it via `{{ index .Values "gpu-operator" "_aicrParentChartVersion" }}`
   and uses that for both the Job name slug and the
   aicr.nvidia.com/gpu-operator-chart-version label.

Why the values-map path beats .Chart.Version on every deployer:

 - localformat / argocd-helm / helmfile already pre-render manifest
   templates via pkg/manifest.Render with the parent's wrapped
   values, so the injected key resolves at bundle time to the real
   gpu-operator version.
 - flux writes the manifest raw and Helm resolves at install time
   from the HelmRelease's spec.values (which carries the injected
   value verbatim, no source-controller '+<artifact-sha>' suffix
   that would otherwise have broken K8s label validation).
 - The synthesized chart's Chart.yaml-hardcoded `version: 0.1.0`
   no longer matters for the rollout-hook's identity.

Tests added: bundler_dra_parent_chart_version_test.go covers the
helper end-to-end — positive case, both gating branches (DRA
disabled, gpu-operator disabled), user --set override is overridden
by the bundler-derived value, nil-tolerant inputs.

End-to-end verification: `helm template` on locally-bundled flux and
helmfile outputs both produce Job name `aicr-dra-rollout-hook-26-3-1`
and label value `"26.3.1"` — identical across deployers, no '+',
valid against K8s label validation.

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
Second Codex pass on PR NVIDIA#1030 flagged a gating mismatch:
injectDRAParentChartVersionValue required BOTH gpu-operator and
nvidia-dra-driver-gpu enabled, but the dra-rollout-hook manifest is
wired into base.yaml's gpu-operator componentRef's manifestFiles and
gets emitted whenever gpu-operator is enabled — independent of DRA.

Repro per Codex:

  aicr bundle --recipe recipes/overlays/h100-eks-ubuntu-training.yaml \
              --set nvidia-dra-driver-gpu:enabled=false

  ⇒ gpu-operator-post emits a Job with:
      metadata.name: aicr-dra-rollout-hook-<nil>
      aicr.nvidia.com/gpu-operator-chart-version: "<nil>"
      helm.sh/chart: gpu-operator-<nil>

  The K8s API server rejects the Job because <nil> isn't valid
  DNS-1123.

Fix: drop the \!draEnabled half of the gate. The injection now runs
whenever gpu-operator itself is enabled (which mirrors the
manifest's emission condition). When DRA is in fact disabled at
runtime, the hook script's existing "no DRA DaemonSet found → exit
0" branch keeps the Job a no-op — over-injecting the value is
harmless. The unit test that previously asserted "no injection when
DRA disabled" now asserts the inverse, matching the documented
behavior.

Verified locally: `aicr bundle --deployer flux --set
nvidia-dra-driver-gpu:enabled=false` followed by `helm template`
on the synthesized gpu-operator-post chart produces
`aicr-dra-rollout-hook-26-3-1` Job name with label `"26.3.1"`,
identical to the DRA-enabled rendering.

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
Fourth Codex pass on PR NVIDIA#1030 flagged that the Job's pod spec has
no tolerations, so on tainted CI clusters (KWOK nodes carry
accelerator/system taints by default in the Tier-1 KWOK suite) the
hook pod stays Pending with FailedScheduling. Flux's HelmRelease
waits for the synthesized gpu-operator-post chart to become Ready;
the un-schedulable Job pod keeps it in flight, every downstream
release stalls on DependencyNotReady, and the cross-deployer
rollout protection NVIDIA#980 was designed to provide silently doesn't.

Fix: add `tolerations: [{operator: Exists}]` to the Job's pod
template. The Job is short-lived (a few kubectl calls + an optional
15m wait for driver migration on managed-driver overlays) and
exclusively talks to the kube-apiserver — it does NOT touch the
GPU, so landing on a GPU-tainted node is harmless and a system
node works equally well. Tolerating every taint lets the scheduler
place it on whichever node has capacity first.

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
…afe version)

Two of three new P2 findings on PR NVIDIA#1030 addressed; the third is
filed as a follow-up note in the PR conversation.

P2 — Check for DRA before waiting on driver migration

  When nvidia-dra-driver-gpu was disabled or absent but gpu-operator
  managed drivers, the script entered the 15-minute
  gpu-driver-upgrade-state wait BEFORE the DRA-DaemonSet absence
  check at the end. That made every DRA-disabled deployment or
  upgrade block for up to 15 minutes on a no-op path.

  Restructured the script so the DRA discovery step runs FIRST; if
  no DRA kubelet plugin DaemonSet is found we exit 0 immediately,
  skipping the migration wait. The driver-daemonset check and the
  per-node upgrade-state wait now run only when there is a DRA
  DS to subsequently restart.

P2 — Sanitize the chart-version label

  $parentVersion can legally carry semver build metadata
  (e.g. 26.3.1+build.1). The Job's name slug and the helm.sh/chart
  label already replace '+' with '-', but the
  aicr.nvidia.com/gpu-operator-chart-version label was emitting the
  raw value, which Kubernetes label validation would reject. Added
  a sibling $versionLabel that strips '+' (and trims to 63 chars
  / strips a trailing '-') and use it for both the metadata.labels
  and the spec.template.metadata.labels sites. Dots remain — they
  are valid in label values.

Out of scope (filed as follow-up):

  P2 — Preserve the post hook on vendored upgrades

  When --vendor-charts is used with the helm / helmfile deployers,
  localformat folds manifestFiles into the vendored gpu-operator
  wrapper instead of emitting a separate gpu-operator-post chart,
  and injectPostInstallHooks annotates the result with only
  helm.sh/hook: post-install. The version-keyed Job would still
  re-fire on upgrades via its name (per the existing design), but
  the absence of helm.sh/hook: post-upgrade in the vendored path
  means Argo CD / Flux paths through that codepath miss the
  upgrade-time re-trigger. Tracking this as a follow-up issue —
  the fix is in pkg/bundler/deployer/localformat (or wherever
  injectPostInstallHooks lives), separate from this manifest.

Refs PR NVIDIA#1030.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
Codex P1 on PR NVIDIA#1030 at head 0375989: the `$versionLabel`
Go-template block comment (`{{- /* … */}}`) breaks raw-YAML parsing
of the manifest before Helm rendering. `recipes/manifest_images_test.go`
(TestComponentManifestImagesAreFullyQualified /
TestComponentManifestImagesAreDigestPinned) decodes every
`components/*/manifests/*.yaml` file as YAML directly — the
continuation lines inside the `/* */` block are not valid YAML
content, so the test fails with `decode yaml: yaml: line 65: did
not find expected node content`.

Replace the block comment with ordinary YAML `#` comments at column
0 (before the `{{- $saName }}` definitions block) so it survives
raw-YAML decode while still being preserved verbatim through Helm's
whitespace-trim rules at render time.

Verified locally: `go test ./recipes -run TestComponentManifestImagesAre`
passes.

Refs PR NVIDIA#1030.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
Codex P2 on PR NVIDIA#1030 at head 162d1df: the Job's name is keyed only
on $versionSlug (the resolved gpu-operator chart version). Job
.spec.template is immutable in Kubernetes — any change to the image
digest, inline shell script, tolerations, resources, or
securityContext on this Job will, under a future AICR release that
ships the SAME gpu-operator chart pin as the current bundle,
produce an `Forbidden: field is immutable` rejection when Helm/Argo
CD/Flux try to update the existing Job. helm.sh/resource-policy:
keep prevents Helm-side prune but doesn't make the resource
updatable.

Fix: introduce a hand-bumped $hookRev counter that's appended to the
Job name (e.g. aicr-dra-rollout-hook-26-3-1-r1). Maintainers bump
this constant (1 → 2 → 3 → …) in lockstep with any material edit to
the Job spec below; the bumped name makes the next install create a
fresh Job that the deployer can apply cleanly, while prior Jobs
persist as Succeeded records (already protected by
helm.sh/resource-policy: keep). The constant lives in the same file
as the spec it tracks, so the dependency is local; a future PR
adding a content-derived hash via sha256sum on a named template can
replace it without changing the Job-name shape.

Documented the contract inline at the $hookRev definition and the
intent of the trailing "-r<rev>" suffix is now self-explanatory in
the rendered Job name.

Refs PR NVIDIA#1030.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
… cap

Codex P2 on PR NVIDIA#1030 at head 9c5f52c: $versionSlug was truncated at
50 chars, then the assembled
"aicr-dra-rollout-hook-{slug}-r{rev}" name was truncated at 63
chars. For a long parent-version string (e.g.
"1.2.3-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"), the outer trunc
drops the entire "-r<rev>" tail, so a future $hookRev bump won't
change the Job name and the immutable-pod-template failure mode
returns.

Budget the slug's truncation to reserve space for the suffix:

  "aicr-dra-rollout-hook-" (22) + <slug> + "-r<rev>" (≤ 6) ≤ 63
  ⇒  <slug> ≤ 35

Reserve 6 chars for "-r<rev>" (covers up to -r9999, far more
revisions than this hook will ever see). Trunc the slug at 35 and
keep the outer trunc 63 as a defense-in-depth cap.

Verified with the long-version repro Codex cited:
  parent = 1.2.3-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
  → name = aicr-dra-rollout-hook-1-2-3-aaaaaaaaaaaaaaaaaaaaaaaaaaaaa-r1
  (60 chars, suffix intact)

The normal 26.3.1 / 26.3.1+build.1 paths still render correctly.

Refs PR NVIDIA#1030.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
Codex P2 on PR NVIDIA#1030 at head 1a06ad6: $versionLabel only trimmed
a trailing '-' after the 63-char trunc, but a SemVer pre-release
with a '.' at exactly position 63 leaves the rendered label value
ending in '.', which Kubernetes label validation rejects (label
values must start and end with [A-Za-z0-9]). $chartLabel has the
same hazard with '_' because the upstream `replace "+" "_"` can
leave the underscore at the truncation boundary.

Repro Codex cited:

  parent = 1.2.3-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.b
  → aicr.nvidia.com/gpu-operator-chart-version:
       "1.2.3-aaaaaaa…aaaa."   (rejected: ends in '.')

Fix: chain `trimSuffix` calls for each of '.', '_', '-' on both
label sites, doubled-up to handle the rare case of two consecutive
separators landing at the truncation boundary. The chained shape
is intentionally verbose (no regex available in pkg/manifest's
helmFuncMap, only the seven funcs registered there — `trimSuffix`
is one of them and runs cleanly through both the localformat
bundler-time render path and the Helm install-time path).

Verified with the long-pre-release repro and the normal
26.3.1 / 26.3.1+build.1 paths: both labels now end alphanumeric.

Refs PR NVIDIA#1030.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
…d runs)

Codex P3 follow-up on the previous trim hardening: two passes of
`trimSuffix . | _ | -` still leave the label invalid when a SemVer
pre-release lands a three-separator run like `.--` at the truncation
boundary. Repro Codex provided:

  parent = 1.2.3-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.--b
  → aicr.nvidia.com/gpu-operator-chart-version: "…aaaaaa."   (rejected)

pkg/manifest's helmFuncMap doesn't register a regex helper, so a
chain of `trimSuffix` is the only suffix-stripping primitive
available across both the bundler-time render path (limited
funcMap) and the Helm install-time path. Going from 2 cycles to 5
cycles handles up to 15 trailing-separator chars of any combo, and
tightening the trunc from 63 → 60 leaves 3 chars of headroom for the
strip to consume without losing meaningful version digits. Applied
the same treatment to $chartLabel which has the identical hazard
(plus the `_` from upstream `replace "+" "_"`).

Verified with Codex's three-separator repro:
  …aaaa.--b (trunc-boundary `.--` tail)
  → "1.2.3-aaaaaa…aaaa"   (60 chars, ends alphanumeric)

Normal 26.3.1 / 26.3.1+build.1 paths still render correctly.

This is not formally bulletproof — a 16-separator tail would still
slip through. A true fix needs either a regex helper added to
pkg/manifest's helmFuncMap or a Go-side label sanitizer pre-injected
via the values map (similar to _aicrParentChartVersion). Filed as
a follow-up; the current chain handles every realistic SemVer input.

Refs PR NVIDIA#1030.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
Six findings from Codex review on PR NVIDIA#1030's initial commit (all
verified against source):

P1 — Use only supported manifest template functions
  pkg/manifest/render.go:98 (helmFuncMap) registers toYaml, nindent,
  toString, default, replace, trunc, trimSuffix. The Sprig "quote"
  function is not in the map, so v1's "{{ $draNamespace | quote }}"
  fails template parsing during aicr bundle. Replaced with a literal
  string in the script (the namespace is no longer hardcoded — see
  P2.1 below — so the inline "..." is gone too) and replaced the two
  label values that used "{{ .Chart.Version | quote }}" with
  YAML-syntactic quotes around the template expression.

P1 — Ensure the DRA namespace exists before RBAC
  On a fresh cluster, gpu-operator-post installs immediately after
  gpu-operator-pre/gpu-operator but BEFORE nvidia-dra-driver-gpu has
  applied its chart with --create-namespace. v1's Role and
  RoleBinding in $draNamespace would therefore fail because the
  nvidia-dra-driver namespace does not exist yet. Collapsed the
  cross-namespace Role+RoleBinding into a single ClusterRole+
  ClusterRoleBinding that grants daemonsets get/list/watch/patch
  cluster-wide; the cluster-scoped binding has no per-namespace
  dependency. Trade-off is slightly broader RBAC, but the SA is
  short-lived and scoped to this one Job.

P1 — Preserve hook semantics through manifest bundling
  pkg/bundler/deployer/localformat/local_helm.go:165 unconditionally
  calls stripHelmHooks on every manifest rendered into a local-helm
  chart. stripHelmHooks removes helm.sh/hook* annotations for
  post-install / post-upgrade / pre-install / pre-upgrade so that
  Argo CD's syncPolicy.automated can apply the resource at all
  (PostSync hooks never fire on path-based sources under automated
  sync). v1 relied on those annotations and would have shipped as a
  regular Job that fires only once at first install, never on
  upgrades. Re-firing is now achieved by keying the Job name on
  .Chart.Version (the synthesized gpu-operator-post chart inherits
  the parent gpu-operator componentRef's version at
  pkg/bundler/deployer/localformat/writer.go:121), so each chart
  bump produces a new Job manifest that every deployer applies.
  helm.sh/resource-policy: keep prevents Helm's three-way merge
  from pruning prior-version Jobs mid-rollout;
  ttlSecondsAfterFinished: 600 lets Kubernetes self-clean succeeded
  Jobs after 10m so the namespace doesn't accumulate one per past
  chart bump.

P2 — Derive the DRA namespace instead of hardcoding it
  The os-talos mixin places nvidia-dra-driver-gpu in
  privileged-nvidia-dra-driver-gpu, and any recipe can override the
  component namespace. v1's hardcoded "nvidia-dra-driver" both
  bound the RBAC to the wrong namespace and pointed DRA_NS at the
  wrong place. With RBAC collapsed to ClusterRole (P1.2 above), the
  remaining script-side concern is locating the DaemonSet at
  runtime. The script now discovers the DRA kubelet-plugin
  DaemonSet via kubectl get daemonset -A
  -l app.kubernetes.io/name=nvidia-dra-driver-gpu, filtering for
  the kubelet-plugin/dra-driver DaemonSet (not the controller
  Deployment), with a fallback that takes the first matched
  DaemonSet for chart variants that don't set a component label.

P2 — Grant daemonsets/watch for kubectl rollout status
  kubectl rollout status daemonset/... uses Watch on the DaemonSet
  resource, not just Get on Pods. v1's Role granted only
  get/list/patch on daemonsets, so every hook run would get
  Forbidden on the watch and skip the readiness gate. Added
  "watch" to the daemonsets ClusterRole rule. The pods rule
  retains get/list/watch.

P2 — Pin the kubectl image by digest
  Repo standard is digest-pinned images (validator catalog,
  on-tag.yaml, EKS images, renovate runner all use @sha256:…).
  Separately, Bitnami's catalog reshuffle removed all non-latest
  tags from bitnami/kubectl (only "latest" remains in the free
  namespace today; bitnamilegacy/kubectl tops out at 1.33.4),
  so the v1 reference "bitnami/kubectl:1.34" no longer resolves.
  Switched to docker.io/alpine/k8s:1.34.8 pinned by the multi-arch
  manifest list digest sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be.
  alpine/k8s ships kubectl plus a BusyBox shell, so the embedded
  script was re-targeted to POSIX /bin/sh (no [[…]], no pipefail).

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
@yuanchen8911 yuanchen8911 force-pushed the feat/980-dra-rollout-hook branch from 7436b4f to fe41152 Compare May 27, 2026 20:03
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
Two Codex re-review findings on PR NVIDIA#1030:

P1 — Drop ttlSecondsAfterFinished on the chart-version-keyed Job
  v2 set ttlSecondsAfterFinished: 600 so prior-version Jobs would
  self-clean. With the Job as a regular (non-hook) chart resource,
  it is now part of the GitOps desired state: Argo CD's self-heal
  and Helm's three-way merge both treat it as desired. A Kubernetes
  TTL delete makes the resource missing relative to the rendered
  manifest, so the next reconcile recreates it and re-runs the DRA
  rollout-restart on every cycle. Removed the TTL. Old-version Jobs
  persist as Succeeded records (with helm.sh/resource-policy: keep
  preserving them across upgrades); operators can garbage-collect
  by label or via a follow-up cleanup mechanism if they become
  operationally noisy.

P2 — Regenerate docs/user/container-images.md to include alpine/k8s
  CLAUDE.md requires `make bom-docs` after any change that adds a
  new image reference. The new dra-rollout-hook manifest pins
  docker.io/alpine/k8s:1.34.8@sha256:4d352ba…, and the BOM tool
  picks it up from manifests. gpu-operator's per-component image
  count goes from 14 → 15, with the new image listed under the
  gpu-operator section.

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
Codex P3 re-review on PR NVIDIA#1030 noted two comments that hadn't kept up
with the design pivot:

dra-rollout-hook.yaml — the header still claimed
ttlSecondsAfterFinished lets Kubernetes self-clean prior Jobs, but
the spec intentionally omits TTL (per the prior fixup) and the
inline annotation comment already explains why. Rewrote the header
paragraph to match: TTL is deliberately absent because the Job is
GitOps desired state and a TTL delete would trigger a recreate on
every reconcile.

base.yaml — the manifestFiles comment described this entry as a
"Helm post-install,post-upgrade hook," but it ships as a regular
chart-version-keyed Job because the local-helm bundler strips
helm.sh/hook* for Argo CD compatibility. Rewrote to describe the
actual mechanism (versioned Job name → re-fires on every chart bump)
and note the hook-strip rationale.

Comment-only change. No spec, RBAC, or script behavior diff.

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
… parent

PR NVIDIA#1030's Tier 1 KWOK flux-oci tests failed across every recipe with:

  Job.batch "aicr-dra-rollout-hook-0-1-0-537018ad0a4c" is invalid:
  metadata.labels: Invalid value: "0.1.0+537018ad0a4c": a valid label
  must be an empty string or consist of alphanumeric characters,
  '-', '_' or '.', ...

The flux deployer was writing manifest templates raw into the
synthesized -pre / -post chart and leaving substitution to Helm. The
synthesized chart's Chart.yaml hardcoded version: 0.1.0, so Helm
resolved .Chart.Version to a value unrelated to the parent component
— and under flux-oci, source-controller appended a +<artifact-sha>
suffix that the K8s API server rejected as a label value.

The chart-version-keyed Job name from NVIDIA#980 only worked under
localformat (helm / helmfile / argocd-helm) because that deployer
already calls pkg/manifest.Render at bundle time with the parent
componentRef's version. Flux didn't, so the same template silently
behaved differently across deployers.

Fix: thread the parent's normalized chart version into
generateManifestHelmChart and (a) pre-render every manifest via
pkg/manifest.Render with ChartVersion=<parent>, (b) write Chart.yaml
with version: <parent> too so any external reader of the synthesized
chart's metadata sees the real identity.

After the fix, rendered output is identical across deployers — the
dra-rollout-hook Job name becomes aicr-dra-rollout-hook-26-3-1 (the
gpu-operator chart pin) under every deployer, and the
aicr.nvidia.com/gpu-operator-chart-version label carries "26.3.1"
(valid label value, no '+').

Drive-by: refresh the dra-rollout-hook.yaml header comment that
incorrectly attributed the .Chart.Version substitution to the
synthesized chart's Chart.yaml — the substitution now happens at
manifest.Render time, not at Helm install time.

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
…ifest pre-render

Codex re-review on PR NVIDIA#1030 caught a contract violation in the
flux/helm.go change from commit aa478a2: calling manifest.Render at
bundle time substitutes every .Values.* expression in the manifest
template with the static values, before splitDynamicPaths separates
dynamic paths into the Flux ConfigMap. Operators editing
configmap-values.yaml to override a dynamic path then have no effect
on the rendered manifest — the static default has already been baked
in. The cited test (TestGenerate_WithDynamicValues_ManifestComponent)
only verifies ConfigMap wiring, not that .Values.* references survive
into the rendered chart, so it passed despite the regression.

Fix has two parts:

1. Revert the flux/helm.go manifest.Render call. Manifests are
   written raw again so Helm substitutes .Values.* at install time
   from the merged static + dynamic values, restoring the documented
   Flux dynamic-values behavior. Keep the chartVersion threading
   that sets Chart.yaml's version field to the parent's normalized
   value — that's cosmetic metadata only (no template uses
   .Chart.Version on this path anymore).

2. Plumb the parent gpu-operator chart version through the values
   map instead. DefaultBundler.injectDRAParentChartVersionValue
   writes gpu-operator componentValues[_aicrParentChartVersion]
   = NormalizeVersionWithDefault(ref.Version) at bundle generation,
   gated on both gpu-operator and nvidia-dra-driver-gpu being
   enabled in the filtered recipe. The dra-rollout-hook template
   reads it via `{{ index .Values "gpu-operator" "_aicrParentChartVersion" }}`
   and uses that for both the Job name slug and the
   aicr.nvidia.com/gpu-operator-chart-version label.

Why the values-map path beats .Chart.Version on every deployer:

 - localformat / argocd-helm / helmfile already pre-render manifest
   templates via pkg/manifest.Render with the parent's wrapped
   values, so the injected key resolves at bundle time to the real
   gpu-operator version.
 - flux writes the manifest raw and Helm resolves at install time
   from the HelmRelease's spec.values (which carries the injected
   value verbatim, no source-controller '+<artifact-sha>' suffix
   that would otherwise have broken K8s label validation).
 - The synthesized chart's Chart.yaml-hardcoded `version: 0.1.0`
   no longer matters for the rollout-hook's identity.

Tests added: bundler_dra_parent_chart_version_test.go covers the
helper end-to-end — positive case, both gating branches (DRA
disabled, gpu-operator disabled), user --set override is overridden
by the bundler-derived value, nil-tolerant inputs.

End-to-end verification: `helm template` on locally-bundled flux and
helmfile outputs both produce Job name `aicr-dra-rollout-hook-26-3-1`
and label value `"26.3.1"` — identical across deployers, no '+',
valid against K8s label validation.

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
Second Codex pass on PR NVIDIA#1030 flagged a gating mismatch:
injectDRAParentChartVersionValue required BOTH gpu-operator and
nvidia-dra-driver-gpu enabled, but the dra-rollout-hook manifest is
wired into base.yaml's gpu-operator componentRef's manifestFiles and
gets emitted whenever gpu-operator is enabled — independent of DRA.

Repro per Codex:

  aicr bundle --recipe recipes/overlays/h100-eks-ubuntu-training.yaml \
              --set nvidia-dra-driver-gpu:enabled=false

  ⇒ gpu-operator-post emits a Job with:
      metadata.name: aicr-dra-rollout-hook-<nil>
      aicr.nvidia.com/gpu-operator-chart-version: "<nil>"
      helm.sh/chart: gpu-operator-<nil>

  The K8s API server rejects the Job because <nil> isn't valid
  DNS-1123.

Fix: drop the \!draEnabled half of the gate. The injection now runs
whenever gpu-operator itself is enabled (which mirrors the
manifest's emission condition). When DRA is in fact disabled at
runtime, the hook script's existing "no DRA DaemonSet found → exit
0" branch keeps the Job a no-op — over-injecting the value is
harmless. The unit test that previously asserted "no injection when
DRA disabled" now asserts the inverse, matching the documented
behavior.

Verified locally: `aicr bundle --deployer flux --set
nvidia-dra-driver-gpu:enabled=false` followed by `helm template`
on the synthesized gpu-operator-post chart produces
`aicr-dra-rollout-hook-26-3-1` Job name with label `"26.3.1"`,
identical to the DRA-enabled rendering.

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
Fourth Codex pass on PR NVIDIA#1030 flagged that the Job's pod spec has
no tolerations, so on tainted CI clusters (KWOK nodes carry
accelerator/system taints by default in the Tier-1 KWOK suite) the
hook pod stays Pending with FailedScheduling. Flux's HelmRelease
waits for the synthesized gpu-operator-post chart to become Ready;
the un-schedulable Job pod keeps it in flight, every downstream
release stalls on DependencyNotReady, and the cross-deployer
rollout protection NVIDIA#980 was designed to provide silently doesn't.

Fix: add `tolerations: [{operator: Exists}]` to the Job's pod
template. The Job is short-lived (a few kubectl calls + an optional
15m wait for driver migration on managed-driver overlays) and
exclusively talks to the kube-apiserver — it does NOT touch the
GPU, so landing on a GPU-tainted node is harmless and a system
node works equally well. Tolerating every taint lets the scheduler
place it on whichever node has capacity first.

Refs NVIDIA#980, PR NVIDIA#1030 (draft)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
…afe version)

Two of three new P2 findings on PR NVIDIA#1030 addressed; the third is
filed as a follow-up note in the PR conversation.

P2 — Check for DRA before waiting on driver migration

  When nvidia-dra-driver-gpu was disabled or absent but gpu-operator
  managed drivers, the script entered the 15-minute
  gpu-driver-upgrade-state wait BEFORE the DRA-DaemonSet absence
  check at the end. That made every DRA-disabled deployment or
  upgrade block for up to 15 minutes on a no-op path.

  Restructured the script so the DRA discovery step runs FIRST; if
  no DRA kubelet plugin DaemonSet is found we exit 0 immediately,
  skipping the migration wait. The driver-daemonset check and the
  per-node upgrade-state wait now run only when there is a DRA
  DS to subsequently restart.

P2 — Sanitize the chart-version label

  $parentVersion can legally carry semver build metadata
  (e.g. 26.3.1+build.1). The Job's name slug and the helm.sh/chart
  label already replace '+' with '-', but the
  aicr.nvidia.com/gpu-operator-chart-version label was emitting the
  raw value, which Kubernetes label validation would reject. Added
  a sibling $versionLabel that strips '+' (and trims to 63 chars
  / strips a trailing '-') and use it for both the metadata.labels
  and the spec.template.metadata.labels sites. Dots remain — they
  are valid in label values.

Out of scope (filed as follow-up):

  P2 — Preserve the post hook on vendored upgrades

  When --vendor-charts is used with the helm / helmfile deployers,
  localformat folds manifestFiles into the vendored gpu-operator
  wrapper instead of emitting a separate gpu-operator-post chart,
  and injectPostInstallHooks annotates the result with only
  helm.sh/hook: post-install. The version-keyed Job would still
  re-fire on upgrades via its name (per the existing design), but
  the absence of helm.sh/hook: post-upgrade in the vendored path
  means Argo CD / Flux paths through that codepath miss the
  upgrade-time re-trigger. Tracking this as a follow-up issue —
  the fix is in pkg/bundler/deployer/localformat (or wherever
  injectPostInstallHooks lives), separate from this manifest.

Refs PR NVIDIA#1030.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
Codex P1 on PR NVIDIA#1030 at head 0375989: the `$versionLabel`
Go-template block comment (`{{- /* … */}}`) breaks raw-YAML parsing
of the manifest before Helm rendering. `recipes/manifest_images_test.go`
(TestComponentManifestImagesAreFullyQualified /
TestComponentManifestImagesAreDigestPinned) decodes every
`components/*/manifests/*.yaml` file as YAML directly — the
continuation lines inside the `/* */` block are not valid YAML
content, so the test fails with `decode yaml: yaml: line 65: did
not find expected node content`.

Replace the block comment with ordinary YAML `#` comments at column
0 (before the `{{- $saName }}` definitions block) so it survives
raw-YAML decode while still being preserved verbatim through Helm's
whitespace-trim rules at render time.

Verified locally: `go test ./recipes -run TestComponentManifestImagesAre`
passes.

Refs PR NVIDIA#1030.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
Codex P2 on PR NVIDIA#1030 at head 162d1df: the Job's name is keyed only
on $versionSlug (the resolved gpu-operator chart version). Job
.spec.template is immutable in Kubernetes — any change to the image
digest, inline shell script, tolerations, resources, or
securityContext on this Job will, under a future AICR release that
ships the SAME gpu-operator chart pin as the current bundle,
produce an `Forbidden: field is immutable` rejection when Helm/Argo
CD/Flux try to update the existing Job. helm.sh/resource-policy:
keep prevents Helm-side prune but doesn't make the resource
updatable.

Fix: introduce a hand-bumped $hookRev counter that's appended to the
Job name (e.g. aicr-dra-rollout-hook-26-3-1-r1). Maintainers bump
this constant (1 → 2 → 3 → …) in lockstep with any material edit to
the Job spec below; the bumped name makes the next install create a
fresh Job that the deployer can apply cleanly, while prior Jobs
persist as Succeeded records (already protected by
helm.sh/resource-policy: keep). The constant lives in the same file
as the spec it tracks, so the dependency is local; a future PR
adding a content-derived hash via sha256sum on a named template can
replace it without changing the Job-name shape.

Documented the contract inline at the $hookRev definition and the
intent of the trailing "-r<rev>" suffix is now self-explanatory in
the rendered Job name.

Refs PR NVIDIA#1030.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
… cap

Codex P2 on PR NVIDIA#1030 at head 9c5f52c: $versionSlug was truncated at
50 chars, then the assembled
"aicr-dra-rollout-hook-{slug}-r{rev}" name was truncated at 63
chars. For a long parent-version string (e.g.
"1.2.3-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"), the outer trunc
drops the entire "-r<rev>" tail, so a future $hookRev bump won't
change the Job name and the immutable-pod-template failure mode
returns.

Budget the slug's truncation to reserve space for the suffix:

  "aicr-dra-rollout-hook-" (22) + <slug> + "-r<rev>" (≤ 6) ≤ 63
  ⇒  <slug> ≤ 35

Reserve 6 chars for "-r<rev>" (covers up to -r9999, far more
revisions than this hook will ever see). Trunc the slug at 35 and
keep the outer trunc 63 as a defense-in-depth cap.

Verified with the long-version repro Codex cited:
  parent = 1.2.3-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
  → name = aicr-dra-rollout-hook-1-2-3-aaaaaaaaaaaaaaaaaaaaaaaaaaaaa-r1
  (60 chars, suffix intact)

The normal 26.3.1 / 26.3.1+build.1 paths still render correctly.

Refs PR NVIDIA#1030.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
Codex P2 on PR NVIDIA#1030 at head 1a06ad6: $versionLabel only trimmed
a trailing '-' after the 63-char trunc, but a SemVer pre-release
with a '.' at exactly position 63 leaves the rendered label value
ending in '.', which Kubernetes label validation rejects (label
values must start and end with [A-Za-z0-9]). $chartLabel has the
same hazard with '_' because the upstream `replace "+" "_"` can
leave the underscore at the truncation boundary.

Repro Codex cited:

  parent = 1.2.3-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.b
  → aicr.nvidia.com/gpu-operator-chart-version:
       "1.2.3-aaaaaaa…aaaa."   (rejected: ends in '.')

Fix: chain `trimSuffix` calls for each of '.', '_', '-' on both
label sites, doubled-up to handle the rare case of two consecutive
separators landing at the truncation boundary. The chained shape
is intentionally verbose (no regex available in pkg/manifest's
helmFuncMap, only the seven funcs registered there — `trimSuffix`
is one of them and runs cleanly through both the localformat
bundler-time render path and the Helm install-time path).

Verified with the long-pre-release repro and the normal
26.3.1 / 26.3.1+build.1 paths: both labels now end alphanumeric.

Refs PR NVIDIA#1030.
…ers (NVIDIA#980)

Add a Job to the bundler-synthesized gpu-operator-post chart that waits
for gpu-operator's k8s-driver-manager to finish migrating drivers on
every managed GPU node, then restarts the nvidia-dra-driver-gpu kubelet
plugin DaemonSet so its NVML handle rebinds to the post-migration
driver state. Extends the kubectl wait + rollout-restart block already
in pkg/bundler/deployer/helm/templates/deploy.sh.tmpl to every deployer
the bundler emits (Helm, helmfile, Flux, Argo CD, argocd-helm), not
just the default Helm deployer that runs ./deploy.sh.

Re-fire mechanism — dual approach across bundler paths:

  1. Non-vendored localformat (default for Helm/helmfile/Argo CD/
     argocd-helm). local_helm.stripHelmHooks unconditionally removes
     sync-phase helm.sh/hook annotations because Argo CD's
     syncPolicy.automated would otherwise treat the resource as a
     PostSync hook that never fires under path-based sources. After
     stripping, the Job ships as a regular chart resource; re-fire on
     upgrade is achieved by keying the Job name on the parent
     gpu-operator chart version, which the bundler injects into the
     synthesized chart's values map at
     gpu-operator._aicrParentChartVersion (see
     DefaultBundler.injectDRAParentChartVersionValue). The values-map
     path is used in preference to .Chart.Version because the
     synthesized chart's Chart.yaml hardcodes 0.1.0 (which would
     defeat re-fire) and because flux-oci's source-controller
     suffixes .Chart.Version with "+<artifact-sha>" (invalid in K8s
     label values).

  2. Vendored localformat (--vendor-charts). vendor_folder.go calls
     injectPostInstallHooks on each mixed recipe-side manifest, which
     auto-tags resources with helm.sh/hook: post-install only UNLESS
     the document already declares helm.sh/hook (vendor_wrapper.go
     injectHookOnDoc:223), and auto-assigns sequential hook-weights
     starting at postInstallHookWeightBase=100. Two consequences for
     this manifest:
       (a) auto-injected post-install hooks fire only on first install
           and are silently skipped on every helm upgrade — exactly
           when the stale-NVML race occurs.
       (b) if only the Job author-declares a hook, the SA / Cluster
           Role / ClusterRoleBinding still get auto-injected weights
           100/101/102 while the Job takes Helm's default weight 0,
           putting the Job BEFORE its ServiceAccount/RBAC in install
           order and failing with "ServiceAccount not found".
     Both are addressed by author-declaring helm.sh/hook:
     post-install,post-upgrade plus before-hook-creation on ALL FOUR
     resources with ascending hook-weights:
         ServiceAccount=1, ClusterRole=2, ClusterRoleBinding=3, Job=10
     vendor_wrapper honors the author-declared hooks; the explicit
     weights guarantee the Job's prerequisites exist before it runs;
     post-upgrade re-fires the Job (and refreshes the RBAC) on every
     helm upgrade. stripHelmHooks on the non-vendored path strips ALL
     of these annotations (every phase here is in syncPhaseHooks at
     hooks.go:109-120), so this declaration is inert there and Helm's
     default kind-based ordering (SA → ClusterRole → ClusterRole
     Binding → Job) plus the chart-version-keyed Job name (path 1)
     provide the same install order and re-fire mechanism without
     hooks.

helm.sh/resource-policy: keep applies to path 1 only and prevents
Helm's three-way merge from pruning prior-version Jobs mid-rollout.
It is a no-op for the vendored hook path (Helm does not apply
resource-policy to hook resources); retention there is governed by
helm.sh/hook-delete-policy=before-hook-creation, which deletes a hook
resource ONLY when a new hook with the same name is created. SA / CR /
CRB names are stable so they cycle on every install/upgrade (idempotent
delete-and-recreate); Job names are chart-version-keyed so prior-
version Jobs survive cross-version upgrades as Succeeded records.
ttlSecondsAfterFinished is intentionally NOT set: the Job is part of
GitOps desired state, so a Kubernetes TTL delete would make the
resource missing relative to the rendered manifest and trigger an
immediate recreate on the next reconcile.

Shell script is fail-closed (POSIX sh, alpine/k8s busybox):

  - Exits 0 only when state is positively determined as a no-op:
    DRA DaemonSet absent (fresh install), driver DaemonSet absent
    (host-managed driver mode), or zero managed nodes labeled
    nvidia.com/gpu.deploy.driver=true.
  - Exits 1 on every other failure: any classification kubectl-get
    fails, kubectl-wait for upgrade-done times out, rollout-restart
    fails, or rollout-status times out within 5m.
  - Uses a capture-then-check pattern (KUBECTL_OUT=$(kubectl ... 2>&1)
    || { echo "..."; exit 1; }) to distinguish "API call failed" from
    "result is empty" — POSIX sh lacks `set -o pipefail`, so the
    previous `kubectl ... | grep ... | head` patterns silently masked
    kubectl exit codes and the warning-only timeouts could exit 0
    after an incomplete mitigation.
  - backoffLimit: 2 on the Job is load-bearing: a transient apiserver
    hiccup auto-retries twice before surfacing as a failed release.

RBAC (ClusterRole + ClusterRoleBinding only):

  - daemonsets.apps — get/list/watch/patch (cluster-wide; watch
    required by kubectl rollout status).
  - nodes — get/list/watch (for the upgrade-done label wait).
  - pods  — get/list/watch (rollout-status pod readiness).

A namespaced Role was rejected because gpu-operator-post installs
before nvidia-dra-driver-gpu's chart, so a Role inside the
nvidia-dra-driver namespace would fail on a fresh cluster (namespace
doesn't exist yet).

Image: docker.io/alpine/k8s:1.34.8 pinned by multi-arch manifest list
digest sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be,
matching the repo's digest-pinning standard. alpine/k8s ships kubectl
plus BusyBox sh; the script targets POSIX /bin/sh (no [[…]], no
pipefail).

The existing deploy.sh.tmpl block is intentionally kept as
defense-in-depth for Helm-deployer users; the extra rollout-restart on
an already-rolled DaemonSet is a no-op. A follow-up will evaluate
removing the script-side block once this Job has shipped in a release
and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml
so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via
the additive manifestFiles merge.

Refs NVIDIA#980
Related: NVIDIA#965 (deploy.sh block this Job generalizes), NVIDIA#973 (orthogonal
durability for the chart-version annotation), NVIDIA#894 (chart bump that
first surfaced the stale-NVML class of bug).

Cross-review iteration 3 — cross-deployer gating + values-only re-fire:

  - [P1] Cross-deployer install-time gating wired across every
    bundler path. The prior revision relied on the Job's eventual
    rollout-restart to self-correct the transient race; without
    install-time gating, downstream releases (notably
    nvidia-dra-driver-gpu) could install against stale-NVML driver
    state for the 15-20m window between gpu-operator-post applying
    its chart resources and the Job's kubectl wait + rollout
    restart returning. Fix:
      - pkg/bundler/deployer/helm/templates/deploy.sh.tmpl: per-name
        case for gpu-operator-post sets COMPONENT_HELM_TIMEOUT=25m
        + COMPONENT_EXTRA_WAIT_ARGS=--wait-for-jobs +
        COMPONENT_FORCE_WAIT=true (overrides --no-wait — the flag
        is a readiness-convenience knob, not a correctness-bypass
        knob; components that exist specifically to close a known
        race must always wait). Vendored gpu-operator case lifted
        to 25m for the --vendor-charts path where the Job ships as
        a hook on the primary chart.
      - pkg/bundler/deployer/helmfile/releases.go: new
        releaseOverrides map keyed by synthesized folder name;
        gpu-operator-post -> {timeout: 25m, waitForJobs: true};
        gpu-operator (vendored coverage) -> {timeout: 25m}. Mirrors
        the deploy.sh case switch.
      - pkg/bundler/deployer/flux/helm.go: new Timeout field on
        HelmReleaseData / ChartRefHelmReleaseData +
        releaseTimeoutOverrides map (gpu-operator-post -> 25m).
        helmrelease.yaml.tmpl and helmrelease-chartref.yaml.tmpl
        emit spec.timeout conditionally. helm-controller already
        waits for hook Jobs by default, so no waitForJobs knob is
        needed on this path — only the outer timeout.
      - Argo CD path is naturally gated by sync-wave + Job-Healthy
        default: app-of-apps waits for each child Application to
        reach Healthy, which requires the Job's status.succeeded.
        No explicit wiring needed.

  - [medium] Driver-version-only overrides now re-fire the Job on
    every deployer. The prior key was _aicrParentChartVersion (=
    gpu-operator chart version) only. On non-hook deployers
    (Helmfile / Argo CD / argocd-helm where stripHelmHooks removes
    the upgrade-hook annotations), the Job is a regular chart
    resource — kubectl apply on the same name is a no-op, so
    --set gpuoperator:driver.version=<X> did NOT re-run the
    migration mitigation when the gpu-operator chart pin was
    unchanged. Fix: new draDriverVersionValueKey = _aicrDriverVersion
    is injected alongside the parent-chart version (reading the
    merged componentValues at driver.version, so any source —
    registry default / overlay / user --set — flows through).
    The Job-name slug becomes "<chartSlug>-d<driverSlug>" when
    driver.version is set; host-managed-driver overlays
    (driver.enabled=false, version unset) keep the original
    1-component shape. The chart-version label
    (aicr.nvidia.com/gpu-operator-chart-version) stays semantically
    tied to the chart pin only.

  - [low] Empty gpu-operator.Version no longer produces an
    unbundleable manifest. The prior helper early-returned on
    empty Version, but the hook ships unconditionally on the
    gpu-operator componentRef, so the template would render
    "<nil>" into the Job name and chart-version label — invalid
    DNS-1123 / label values. Fix: fall through with
    NormalizeVersionWithDefault's "0.1.0" default + slog.Warn so
    operators get a debuggable signal of the underlying recipe
    gap.

  - User-facing doc: --no-wait correctness-gate exception
    documented in docs/user/cli-reference.md (deploy-script
    section) and pkg/bundler/deployer/helm/templates/README.md.tmpl
    so operators reading the doc know the flag does not bypass
    gpu-operator-post.

  - Tests: TestInjectDRAParentChartVersionValue_DriverVersionKey,
    _DriverVersionAbsent, and _EmptyVersionFallback added to pin
    the new behavior. All existing bundler / manifest / recipes
    tests pass with -race; golangci-lint clean.

End-to-end render verified on a real EKS H100 training recipe:
gpu-operator componentValues carries
  _aicrParentChartVersion: 26.3.1
  _aicrDriverVersion: 580.126.20
and the rendered Job name is
  aicr-dra-rollout-hook-26-3-1-d580-126-20-r1
— a driver.version-only override now produces a uniquely-named
Job and triggers re-fire on every deployer.

Document known limits of the cross-deployer gate:

  - Argo CD app-of-apps. Sync-waves give ordering but not gating
    after Argo CD 1.8 removed default Health assessment for
    argoproj.io/Application
    (https://argo-cd.readthedocs.io/en/stable/operator-manual/upgrading/1.7-1.8/#health-assessment-of-argoprojioapplication-crd-has-been-removed).
    Restoring gating needs a resource.customizations Lua block on
    argocd-cm — out of scope for the bundle (cluster-wide config).
    Without it, Argo CD users fall back to the Job's fail-closed
    semantics: exit 1 surfaces gpu-operator-post Application as
    unHealthy, but nvidia-dra-driver-gpu may still sync in
    parallel.

  - Post-bundle driver.version edits. The Job-name slug is baked
    at bundle time by pkg/manifest.Render via
    pkg/bundler/deployer/localformat/local_helm.go:155, so a
    driver.version change applied after `aicr bundle` does not
    update the Job name (re-bundle picks it up; install-time
    `--set` / `--dynamic` does not). Closing this fully needs
    deferring this manifest's render to helm install time — a
    structural change to the localformat render pipeline tracked
    as a follow-up.

PR description updated to reflect both limitations and the
already-wired install-time gating across helm / helmfile / Flux.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant