WIP: feat(gpu-operator-post): re-roll DRA kubelet plugin across all deployers (#980) by yuanchen8911 · Pull Request #1030 · NVIDIA/aicr

yuanchen8911 · 2026-05-26T16:18:25Z

Summary

Add a gpu-operator-post Job that waits for the gpu-operator driver
migration to settle and then re-rolls the nvidia-dra-driver-gpu
kubelet plugin DaemonSet. Ships via the bundler-synthesized
gpu-operator-post chart so the same protection that
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl provides for the
default Helm deployer (./deploy.sh) now applies across every deployer
the bundler emits — Helm, helmfile, Flux, Argo CD, argocd-helm,
including their --vendor-charts variants.

Motivation / Context

k8s-driver-manager reloads NVIDIA kernel modules asynchronously
after helm upgrade gpu-operator returns. If the DRA kubelet plugin
re-rolls before the slowest node finishes, the freshly-rolled pod pins
its NVML handle to the pre-migration driver state and CDI spec
generation fails with invalid CDI Spec: empty device edits, leaving
GPU workloads stuck in ContainerCreating. PR #965 closed this for
./deploy.sh users, but helmfile, flux, argocd, and
argocd-helm consumers never execute that script and still hit the
race.

Fixes: N/A (WIP — flip to Closes: #980 before promoting to ready-for-review)
Related: #980, #965, #973, #894

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)

Component(s) Affected

Recipe engine / data (pkg/recipe) — recipes/overlays/base.yaml wiring
Bundlers (pkg/bundler, pkg/component/*) — new manifest packed into the synthesized gpu-operator-post chart
Docs/examples (docs/) — docs/user/container-images.md regenerated via make bom-docs; docs/user/component-catalog.md upgrade note updated
Other: recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml

Implementation Notes

Dual re-fire mechanism across bundler paths.
The bundler emits the same manifest through two paths with different
hook handling, so the Job carries explicit annotations sized to survive
both:

Non-vendored localformat (default for Helm / helmfile /
Argo CD / argocd-helm).
pkg/bundler/deployer/localformat/local_helm.go:165 unconditionally
calls stripHelmHooks, which removes sync-phase helm.sh/hook
annotations because Argo CD's syncPolicy.automated would
otherwise treat the resource as a PostSync hook that never fires
under path-based sources. After stripping, the Job ships as a
regular chart resource and re-fire on upgrade is achieved by keying
the Job metadata.name on the parent gpu-operator chart version,
which the bundler injects into the synthesized chart's values map
at gpu-operator._aicrParentChartVersion (see
DefaultBundler.injectDRAParentChartVersionValue in
pkg/bundler/bundler.go). The values-map path is used in
preference to .Chart.Version because (a) the synthesized chart's
Chart.yaml hardcodes 0.1.0 so .Chart.Version would never
change across gpu-operator bumps, and (b) under flux-oci,
source-controller suffixes .Chart.Version with +<artifact-sha>
— the + is rejected by K8s label-value validation. Every
gpu-operator chart bump produces a uniquely-named Job manifest
that every deployer applies.
Vendored localformat (--vendor-charts).
pkg/bundler/deployer/localformat/vendor_folder.go:279 folds mixed
recipe-side manifests into the vendored wrapper and calls
injectPostInstallHooks, which auto-tags every document that
does NOT already declare helm.sh/hook with post-install
only and an auto-assigned helm.sh/hook-weight starting at
postInstallHookWeightBase = 100
(pkg/bundler/deployer/localformat/vendor_wrapper.go:124,
vendor_wrapper.go:223). Two problems for this manifest:
- (a) auto-injection is post-install only, so the Job
  runs on first install and is silently skipped on every
  subsequent helm upgrade — exactly when the stale-NVML race
  occurs.
- (b) if only the Job author-declares a hook, the
  ServiceAccount / ClusterRole / ClusterRoleBinding still get
  auto-injected at weights 100/101/102 while the Job's
  author-declared hook takes Helm's default weight 0. Helm
  orders hooks by ascending weight, so the Job would run
  before its ServiceAccount/RBAC exist and fail with
  ServiceAccount "aicr-dra-rollout-hook" not found.
Both are addressed by author-declaring
helm.sh/hook: post-install,post-upgrade plus
helm.sh/hook-delete-policy: before-hook-creation on all four
resources with explicit ascending hook-weights:
```
ServiceAccount        helm.sh/hook-weight: "1"
ClusterRole           helm.sh/hook-weight: "2"
ClusterRoleBinding    helm.sh/hook-weight: "3"
Job                   helm.sh/hook-weight: "10"
```
injectHookOnDoc honors author-declared hooks, the weights
guarantee the Job's prerequisites exist before it runs, and
post-upgrade refreshes both the Job and its RBAC on every
helm upgrade. stripHelmHooks on the non-vendored path strips
ALL of these annotations (every phase here is in
syncPhaseHooks at
pkg/bundler/deployer/localformat/hooks.go:109-120), so this
declaration is inert there and Helm's default kind-based
ordering (SA → ClusterRole → ClusterRoleBinding → Job) plus the
chart-version-keyed Job name from path 1 deliver the same
install order and re-fire mechanism without hooks.

helm.sh/resource-policy: keep applies to path 1 and prevents Helm's
three-way merge from pruning prior-version Jobs mid-rollout. It is a
no-op for the vendored hook path (Helm does not apply resource-policy
to hook resources); retention there is governed by
helm.sh/hook-delete-policy, and the chart-version-keyed Job name
means before-hook-creation only deletes a prior Job on same-version
re-deploys (Argo CD self-heal, Helm re-apply), where the Job is
intentionally re-run anyway. No ttlSecondsAfterFinished: the Job is
part of GitOps desired state, so a Kubernetes TTL delete would make
the resource missing relative to the rendered manifest and trigger an
immediate recreate on the next reconcile.

Fail-closed shell-script semantics (POSIX /bin/sh).
The Job is a correctness mitigation for stale NVML. Exiting 0 after a
classification flake or wait timeout would tell Helm/GitOps
"mitigation succeeded" when it may not have run, which is strictly
worse than a failed release because operators lose the actionable
signal. The script therefore:

Exits 0 only when state is positively determined as a no-op:
DRA DaemonSet absent (fresh install before
nvidia-dra-driver-gpu has applied), nvidia-driver-daemonset
absent (host-managed driver mode — GKE COS, OKE, Kind, Talos),
or zero managed nodes labeled nvidia.com/gpu.deploy.driver=true.
Exits 1 on every other failure: any classification kubectl get returns non-zero, kubectl wait for the per-node
gpu-driver-upgrade-state=upgrade-done label times out at 15m,
kubectl rollout restart fails, or kubectl rollout status
times out at 5m.
Uses a capture-then-check pattern —
KUBECTL_OUT=$(kubectl ... 2>&1) || { echo "ERROR: ..." >&2; exit 1; } — to distinguish "API call failed" from "result is empty".
alpine/k8s's BusyBox /bin/sh lacks set -o pipefail, so the
prior kubectl ... | grep ... | head patterns silently masked
kubectl exit codes and let a failed API call exit 0 after no
rollout occurred.
backoffLimit: 2 on the Job is load-bearing under fail-closed
semantics: a transient apiserver hiccup auto-retries twice before
surfacing as a failed release.

Script mirrors deploy.sh.tmpl line 347:

Detects host-managed-driver overlays (GKE COS, OKE, Kind, Talos) by
absence of nvidia-driver-daemonset and skips the migration wait.
Waits up to 15m for every node labeled
nvidia.com/gpu.deploy.driver=true to reach
nvidia.com/gpu-driver-upgrade-state=upgrade-done.
Discovers the DRA kubelet-plugin DaemonSet at runtime by
app.kubernetes.io/name=nvidia-dra-driver-gpu label across all
namespaces, so os-talos / privileged-* overlays that move the DRA
driver out of its default namespace work without any recipe change.
Issues kubectl rollout restart and waits up to 5m for
kubectl rollout status.

RBAC (ClusterRole + ClusterRoleBinding only):

daemonsets.apps — get/list/watch/patch (cluster-wide; watch
required by kubectl rollout status).
nodes — get/list/watch (for the kubectl wait on the
gpu-driver-upgrade-state label).
pods — get/list/watch (rollout-status pod readiness).

No namespaced Role: gpu-operator-post installs before
nvidia-dra-driver-gpu's chart, so a Role inside nvidia-dra-driver
would fail on a fresh cluster because the namespace doesn't exist yet.

Image: docker.io/alpine/k8s:1.34.8 pinned by multi-arch manifest
list digest @sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be,
matching the repo's digest-pinning standard. alpine/k8s ships kubectl
plus BusyBox sh; the script targets POSIX /bin/sh (no [[…]], no
pipefail). bitnami/kubectl was rejected: Bitnami's catalog
reshuffle deleted all non-latest tags from the free namespace, and
bitnamilegacy/kubectl tops out at 1.33.4.

deploy.sh block kept as defense-in-depth. Per issue #980's stated
preference. Extra rollout-restart on an already-rolled DaemonSet is a
no-op. Follow-up will evaluate removing the script-side block once the
Job has shipped in a release and proven reliable.

Wiring lives on the gpu-operator componentRef in
recipes/overlays/base.yaml so it cascades to every service overlay
(EKS, GKE, AKS, OKE, Kind) via the additive manifestFiles merge.

Cross-Deployer Install-Time Gating

The Job acts as a per-deployer install-time gate via:

helm ./deploy.sh — gpu-operator-post case in
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl sets
--wait-for-jobs + --timeout 25m + COMPONENT_FORCE_WAIT=true
(overrides --no-wait). gpu-operator vendored-charts case lifted
to 25m for the path where the Job ships as a hook on the primary
chart.
helmfile — releaseOverrides in
pkg/bundler/deployer/helmfile/releases.go sets
gpu-operator-post: {timeout: 25m, waitForJobs: true} and
gpu-operator: {timeout: 25m} for the vendored case.
Flux — releaseTimeoutOverrides in
pkg/bundler/deployer/flux/helm.go sets gpu-operator-post: 25m;
templates emit spec.timeout conditionally. helm-controller
already waits for hook Jobs by default.

Known limitations

These are reliable for the common case but NOT airtight in two
scenarios, tracked as follow-ups (intentionally NOT addressed in
this PR to keep scope contained):

Argo CD app-of-apps gating. Sync-waves on child Applications
give ordering but not gating. Argo CD 1.8 removed the default
Health assessment for
argoproj.io/Application,
so the parent app-of-apps treats each child as immediately
Healthy on creation and does not wait for the Job's
status.succeeded before advancing to the next sync-wave (here,
nvidia-dra-driver-gpu). To restore gating, operators must add
a resource.customizations Lua block on argocd-cm that
derives Application Health from status.health.status. Without
that customization, Argo CD users fall back to the Job's own
fail-closed semantics — exit 1 on wait/restart/status failure
surfaces the gpu-operator-post Application as unHealthy, but
nvidia-dra-driver-gpu may still sync in parallel.
Post-bundle driver.version edits. The Job-name slug is
rendered at bundle time
(pkg/bundler/deployer/localformat/local_helm.go:155 calls
manifest.Render which bakes .Values into the static output),
so a driver.version change applied AFTER aicr bundle does
NOT update the Job name — same-chart upgrades skip the
migration mitigation. Common workflow aicr bundle --set gpuoperator:driver.version=… -o new-bundle/ works as expected
(re-bundling produces a uniquely-named Job). Operators editing
the generated values.yaml, helm install --set, or AICR
--dynamic paths on driver.version must re-run aicr bundle
to refresh the Job name. Closing this fully needs deferring the
dra-rollout-hook manifest's render to helm install time — a
structural change to the localformat render pipeline.

The transient race window is also bounded and self-correcting
in the residual cases above (worst-case ~15-20m before the Job's
restart fires, then backoffLimit: 2 on real failure), not a
permanent stuck state. The Job exits 1 on any wait/restart/status
failure, so a release that didn't successfully re-roll the
DaemonSet shows up as a failed Helm release / failing GitOps
Application — operators get the actionable signal that the prior
warning-only design suppressed.

Testing

yamllint -c .yamllint.yaml \
  recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml \
  recipes/overlays/base.yaml   # clean

GOFLAGS="-mod=vendor" go test \
  ./pkg/manifest/... \
  ./pkg/bundler/deployer/localformat/... \
  ./pkg/bundler/deployer/helmfile/... \
  ./pkg/bundler/deployer/helm/... \
  ./pkg/recipe/... \
  ./pkg/bundler/...
# all PASS

make qualify   # PASSED (unit + lint + chainsaw e2e + vuln scan)
make bom-docs  # no diff (alpine/k8s image pin unchanged)

Cross-review by Claude Code + Codex + CodeRabbit surfaced four
findings across two iterations, all addressed in this revision:

[major, blocking] Vendored bundles bypassed re-fire. With
--vendor-charts, injectPostInstallHooks auto-tagged the Job
post-install only — Helm skipped it on upgrade and the feat(gpu-operator-post): add Helm post-install hook to re-roll DRA kubelet plugin across all deployers #980
mitigation was silently absent for vendored Helm / helmfile /
Argo CD / argocd-helm. Fix: author-declare
helm.sh/hook: post-install,post-upgrade on the Job.
[P1, blocking] Vendored hook ordering broke install (caught
by Codex on the follow-up review). The first revision of the
fix above declared a hook only on the Job, leaving
ServiceAccount / ClusterRole / ClusterRoleBinding to be
auto-tagged at weights 100/101/102 while the Job's
author-declared hook took Helm's default weight 0 — running
the Job before its ServiceAccount existed. Fix: author-declare
hooks on all four resources with ascending weights
(SA=1, CR=2, CRB=3, Job=10).
[major] Warning-only timeouts and silent API errors. The prior
kubectl wait and kubectl rollout status timeouts only printed
WARNING: and the Job exited 0; 2>/dev/null | ... || true
pipelines masked classification API failures. Fix: switch to
fail-closed exits + capture-then-check pattern as described above.
[minor] Stale upgrade-note in docs/user/component-catalog.md.
The Helmfile sub-bullet claimed the new bundle no longer
references gpu-operator-post, but this PR reintroduces it for
the DRA rollout hook (different content from the old
dcgm-exporter ConfigMap). Updated to acknowledge the
reintroduction.

TBD before flipping to ready-for-review (why this is still draft):

Generated-artifact parity test: bundle a representative recipe
with the Helm deployer AND a GitOps deployer (AND with
--vendor-charts); assert the rendered output contains the
aicr-dra-rollout-hook-<version> Job + ClusterRole +
ClusterRoleBinding in every variant, and that the vendored
variant carries the author-declared helm.sh/hook
annotations.
Cluster verification: on a cluster running an older
gpu-operator, deploy the new bundle via a GitOps deployer
(NOT ./deploy.sh) AND with --vendor-charts; assert the
DRA kubelet plugin re-rolls after per-node driver migration
completes and aicr validate --phase performance passes
without manual intervention.

Follow-Up Work (Out of Scope for This PR)

Argo CD argocd-cm Application Health customization. Restore
gating on the app-of-apps path by emitting a documented
resource.customizations Lua block (or guidance to apply one)
that derives argoproj.io/Application Health from
status.health.status. See Known limitations.
Post-bundle driver.version render deferral. Defer
recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml
render to helm install time so install-time --set / --dynamic
/ values.yaml edits on driver.version re-key the Job. Structural
change to the localformat render pipeline
(pkg/bundler/deployer/localformat/local_helm.go:155). See
Known limitations.
Old-version Job cleanup mechanism (CronJob or sweep) if Succeeded
Jobs accumulating one-per-past-chart-bump becomes operationally
noisy.
Evaluate removing the redundant kubectl wait + rollout restart
block from pkg/bundler/deployer/helm/templates/deploy.sh.tmpl once
this Job has shipped in a release and proven reliable.

Risk Assessment

Medium — touches the gpu-operator install path for every
recipe in the registry (wired at base.yaml). Failure modes
are now fail-closed: any classification flake, migration
wait timeout, restart failure, or rollout-status timeout
surfaces as a failed Helm release / failing GitOps
Application — operators see it. Job failure does not break
gpu-operator itself (the Job runs after the chart resources
apply). The helm.sh/resource-policy: keep on prior Jobs
keeps a small footprint (one Job per past chart version,
manually GC-able by label).

Rollout notes: First effective run is the next gpu-operator chart
bump on a cluster running an older gpu-operator; the Job fires
automatically as part of the gpu-operator-post chart install/upgrade.
No flag, no opt-in. Reverting is a one-line revert of base.yaml plus
the manifest delete (plus make bom-docs to refresh
container-images.md).

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality — DEFERRED (parity test, see TBD above)
I updated docs if user-facing behavior changed — docs/user/container-images.md (regenerated), docs/user/component-catalog.md (upgrade-note revised)
Changes follow existing patterns in the codebase — mirrors kernel-module-params.yaml Helm-template + deploy.sh.tmpl script
Commits are cryptographically signed (git commit -S)

coderabbitai · 2026-05-26T16:22:42Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds a Helm post-install/post-upgrade Job manifest (chart-version–keyed) and cluster RBAC that detects GPU driver migration state, optionally waits for node upgrade labels (≤15m), discovers the DRA kubelet-plugin DaemonSet, issues a rollout restart and waits (≤5m). The Job runs with a pinned image, restrictive securityContext, restart/backoff settings, and is registered in the base recipe overlay; the BOM/docs add the new container image. Flux bundler is updated to accept a parent chartVersion, normalize it, and pre-render manifests before writing Chart.yaml and manifests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

feat(gpu-operator-post): add Helm post-install hook to re-roll DRA kubelet plugin across all deployers #980: Addresses the same post-install/post-upgrade Job behavior to re-roll the DRA kubelet-plugin across deployers.

Possibly related PRs

NVIDIA/aicr#1017: Related changes to manifest rendering and Flux bundler used alongside chart-templating updates.

Suggested labels

area/bundler

Suggested reviewers

mchmarny

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: adding a gpu-operator-post Job that re-rolls the DRA kubelet plugin across all deployers, directly matching the changeset's primary objective.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description comprehensively explains the changeset, including motivation, implementation details, cross-deployer gating, known limitations, and testing status.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml`:
- Around line 178-182: The daemonset lookup uses a broad substring match
'/kubelet-plugin/' which can yield false positives or miss the intended DRA
daemonset; update the command that sets DRA_DS (the variable computed with
kubectl get daemonset and awk) to match a more specific/anchored name pattern
such as the expected full daemonset name (e.g.,
'nvidia-dra-driver-gpu-kubelet-plugin') or a stricter regex that anchors the
suffix/prefix, so that DRA_DS reliably selects only the intended DaemonSet in
the DRA_NS namespace before the emptiness check and early exit.
- Line 149: Replace the invalid image tag "docker.io/bitnami/kubectl:1.34" in
the dra-rollout hook manifest (the line starting with "image:
docker.io/bitnami/kubectl:1.34") with a valid bitnami/kubectl tag; verify the
correct tag on Docker Hub (or use a known working tag like the latest stable
bitnami/kubectl) and update the image field accordingly so the hook Job can pull
the image successfully.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d14cf4b7-4226-4113-9930-526ca9b649a7

📥 Commits

Reviewing files that changed from the base of the PR and between c210874 and b92ce59.

📒 Files selected for processing (2)

recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml
recipes/overlays/base.yaml

@sha256

Six findings from Codex review on PR NVIDIA#1030's initial commit (all verified against source): P1 — Use only supported manifest template functions pkg/manifest/render.go:98 (helmFuncMap) registers toYaml, nindent, toString, default, replace, trunc, trimSuffix. The Sprig "quote" function is not in the map, so v1's "{{ $draNamespace | quote }}" fails template parsing during aicr bundle. Replaced with a literal string in the script (the namespace is no longer hardcoded — see P2.1 below — so the inline "..." is gone too) and replaced the two label values that used "{{ .Chart.Version | quote }}" with YAML-syntactic quotes around the template expression. P1 — Ensure the DRA namespace exists before RBAC On a fresh cluster, gpu-operator-post installs immediately after gpu-operator-pre/gpu-operator but BEFORE nvidia-dra-driver-gpu has applied its chart with --create-namespace. v1's Role and RoleBinding in $draNamespace would therefore fail because the nvidia-dra-driver namespace does not exist yet. Collapsed the cross-namespace Role+RoleBinding into a single ClusterRole+ ClusterRoleBinding that grants daemonsets get/list/watch/patch cluster-wide; the cluster-scoped binding has no per-namespace dependency. Trade-off is slightly broader RBAC, but the SA is short-lived and scoped to this one Job. P1 — Preserve hook semantics through manifest bundling pkg/bundler/deployer/localformat/local_helm.go:165 unconditionally calls stripHelmHooks on every manifest rendered into a local-helm chart. stripHelmHooks removes helm.sh/hook* annotations for post-install / post-upgrade / pre-install / pre-upgrade so that Argo CD's syncPolicy.automated can apply the resource at all (PostSync hooks never fire on path-based sources under automated sync). v1 relied on those annotations and would have shipped as a regular Job that fires only once at first install, never on upgrades. Re-firing is now achieved by keying the Job name on .Chart.Version (the synthesized gpu-operator-post chart inherits the parent gpu-operator componentRef's version at pkg/bundler/deployer/localformat/writer.go:121), so each chart bump produces a new Job manifest that every deployer applies. helm.sh/resource-policy: keep prevents Helm's three-way merge from pruning prior-version Jobs mid-rollout; ttlSecondsAfterFinished: 600 lets Kubernetes self-clean succeeded Jobs after 10m so the namespace doesn't accumulate one per past chart bump. P2 — Derive the DRA namespace instead of hardcoding it The os-talos mixin places nvidia-dra-driver-gpu in privileged-nvidia-dra-driver-gpu, and any recipe can override the component namespace. v1's hardcoded "nvidia-dra-driver" both bound the RBAC to the wrong namespace and pointed DRA_NS at the wrong place. With RBAC collapsed to ClusterRole (P1.2 above), the remaining script-side concern is locating the DaemonSet at runtime. The script now discovers the DRA kubelet-plugin DaemonSet via kubectl get daemonset -A -l app.kubernetes.io/name=nvidia-dra-driver-gpu, filtering for the kubelet-plugin/dra-driver DaemonSet (not the controller Deployment), with a fallback that takes the first matched DaemonSet for chart variants that don't set a component label. P2 — Grant daemonsets/watch for kubectl rollout status kubectl rollout status daemonset/... uses Watch on the DaemonSet resource, not just Get on Pods. v1's Role granted only get/list/patch on daemonsets, so every hook run would get Forbidden on the watch and skip the readiness gate. Added "watch" to the daemonsets ClusterRole rule. The pods rule retains get/list/watch. P2 — Pin the kubectl image by digest Repo standard is digest-pinned images (validator catalog, on-tag.yaml, EKS images, renovate runner all use @sha256:…). Separately, Bitnami's catalog reshuffle removed all non-latest tags from bitnami/kubectl (only "latest" remains in the free namespace today; bitnamilegacy/kubectl tops out at 1.33.4), so the v1 reference "bitnami/kubectl:1.34" no longer resolves. Switched to docker.io/alpine/k8s:1.34.8 pinned by the multi-arch manifest list digest sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be. alpine/k8s ships kubectl plus a BusyBox shell, so the embedded script was re-targeted to POSIX /bin/sh (no [[…]], no pipefail). Refs NVIDIA#980, PR NVIDIA#1030 (draft)

Two Codex re-review findings on PR NVIDIA#1030: P1 — Drop ttlSecondsAfterFinished on the chart-version-keyed Job v2 set ttlSecondsAfterFinished: 600 so prior-version Jobs would self-clean. With the Job as a regular (non-hook) chart resource, it is now part of the GitOps desired state: Argo CD's self-heal and Helm's three-way merge both treat it as desired. A Kubernetes TTL delete makes the resource missing relative to the rendered manifest, so the next reconcile recreates it and re-runs the DRA rollout-restart on every cycle. Removed the TTL. Old-version Jobs persist as Succeeded records (with helm.sh/resource-policy: keep preserving them across upgrades); operators can garbage-collect by label or via a follow-up cleanup mechanism if they become operationally noisy. P2 — Regenerate docs/user/container-images.md to include alpine/k8s CLAUDE.md requires `make bom-docs` after any change that adds a new image reference. The new dra-rollout-hook manifest pins docker.io/alpine/k8s:1.34.8@sha256:4d352ba…, and the BOM tool picks it up from manifests. gpu-operator's per-component image count goes from 14 → 15, with the new image listed under the gpu-operator section. Refs NVIDIA#980, PR NVIDIA#1030 (draft)

github-actions · 2026-05-26T17:29:09Z

🌿 Preview your docs: https://nvidia-preview-feat-980-dra-rollout-hook.docs.buildwithfern.com/aicr

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml`:
- Around line 119-126: Update the earlier header comment in
dra-rollout-hook.yaml (the lines that currently state TTL cleans up old Jobs) to
match the later rationale: explicitly note that ttlSecondsAfterFinished is
intentionally omitted because Argo CD/Helm treat chart-version-keyed Jobs as
desired state and a TTL-based deletion would cause GitOps to immediately
recreate the Job on reconcile; reference ttlSecondsAfterFinished, Job, Argo CD
and Helm in the revised header so both comments are consistent and avoid
contradictory operator guidance.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 4ab151dc-817c-49ac-8ed0-5d1bb4d9020b

📥 Commits

Reviewing files that changed from the base of the PR and between e67e3ad and 59b88a9.

📒 Files selected for processing (2)

docs/user/container-images.md
recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml

Codex P3 re-review on PR NVIDIA#1030 noted two comments that hadn't kept up with the design pivot: dra-rollout-hook.yaml — the header still claimed ttlSecondsAfterFinished lets Kubernetes self-clean prior Jobs, but the spec intentionally omits TTL (per the prior fixup) and the inline annotation comment already explains why. Rewrote the header paragraph to match: TTL is deliberately absent because the Job is GitOps desired state and a TTL delete would trigger a recreate on every reconcile. base.yaml — the manifestFiles comment described this entry as a "Helm post-install,post-upgrade hook," but it ships as a regular chart-version-keyed Job because the local-helm bundler strips helm.sh/hook* for Argo CD compatibility. Rewrote to describe the actual mechanism (versioned Job name → re-fires on every chart bump) and note the hook-strip rationale. Comment-only change. No spec, RBAC, or script behavior diff. Refs NVIDIA#980, PR NVIDIA#1030 (draft)

… parent PR NVIDIA#1030's Tier 1 KWOK flux-oci tests failed across every recipe with: Job.batch "aicr-dra-rollout-hook-0-1-0-537018ad0a4c" is invalid: metadata.labels: Invalid value: "0.1.0+537018ad0a4c": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', ... The flux deployer was writing manifest templates raw into the synthesized -pre / -post chart and leaving substitution to Helm. The synthesized chart's Chart.yaml hardcoded version: 0.1.0, so Helm resolved .Chart.Version to a value unrelated to the parent component — and under flux-oci, source-controller appended a +<artifact-sha> suffix that the K8s API server rejected as a label value. The chart-version-keyed Job name from NVIDIA#980 only worked under localformat (helm / helmfile / argocd-helm) because that deployer already calls pkg/manifest.Render at bundle time with the parent componentRef's version. Flux didn't, so the same template silently behaved differently across deployers. Fix: thread the parent's normalized chart version into generateManifestHelmChart and (a) pre-render every manifest via pkg/manifest.Render with ChartVersion=<parent>, (b) write Chart.yaml with version: <parent> too so any external reader of the synthesized chart's metadata sees the real identity. After the fix, rendered output is identical across deployers — the dra-rollout-hook Job name becomes aicr-dra-rollout-hook-26-3-1 (the gpu-operator chart pin) under every deployer, and the aicr.nvidia.com/gpu-operator-chart-version label carries "26.3.1" (valid label value, no '+'). Drive-by: refresh the dra-rollout-hook.yaml header comment that incorrectly attributed the .Chart.Version substitution to the synthesized chart's Chart.yaml — the substitution now happens at manifest.Render time, not at Helm install time. Refs NVIDIA#980, PR NVIDIA#1030 (draft)

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/bundler/deployer/flux/helm.go`:
- Around line 233-247: The current call to manifest.Render(...) is consuming
deploy-time Helm values (Values: g.ComponentValues[compName]) which hard-codes
.Values into generated templates; remove/stop passing in component values to
manifest.Render so only chart metadata (ComponentName, Namespace, ChartName,
ChartVersion) are pre-rendered and leave .Values lookups intact for
runtime/Helm. Update the manifest.Render invocation in
pkg/bundler/deployer/flux/helm.go (the manifest.Render and manifest.RenderInput
usage) to omit or pass an empty map for Values, and if any manifests require
DynamicValues, explicitly detect and fail with a clear error until partial
rendering is implemented; ensure downstream HelmRelease spec.values/valuesFrom
wiring logic remains unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 8e1e978f-d5f0-4dcb-ac6b-c8c82378c87b

📥 Commits

Reviewing files that changed from the base of the PR and between 85b5463 and aa478a2.

📒 Files selected for processing (3)

pkg/bundler/deployer/flux/flux.go
pkg/bundler/deployer/flux/helm.go
recipes/components/gpu-operator/manifests/dra-rollout-hook.yaml

…ifest pre-render Codex re-review on PR NVIDIA#1030 caught a contract violation in the flux/helm.go change from commit aa478a2: calling manifest.Render at bundle time substitutes every .Values.* expression in the manifest template with the static values, before splitDynamicPaths separates dynamic paths into the Flux ConfigMap. Operators editing configmap-values.yaml to override a dynamic path then have no effect on the rendered manifest — the static default has already been baked in. The cited test (TestGenerate_WithDynamicValues_ManifestComponent) only verifies ConfigMap wiring, not that .Values.* references survive into the rendered chart, so it passed despite the regression. Fix has two parts: 1. Revert the flux/helm.go manifest.Render call. Manifests are written raw again so Helm substitutes .Values.* at install time from the merged static + dynamic values, restoring the documented Flux dynamic-values behavior. Keep the chartVersion threading that sets Chart.yaml's version field to the parent's normalized value — that's cosmetic metadata only (no template uses .Chart.Version on this path anymore). 2. Plumb the parent gpu-operator chart version through the values map instead. DefaultBundler.injectDRAParentChartVersionValue writes gpu-operator componentValues[_aicrParentChartVersion] = NormalizeVersionWithDefault(ref.Version) at bundle generation, gated on both gpu-operator and nvidia-dra-driver-gpu being enabled in the filtered recipe. The dra-rollout-hook template reads it via `{{ index .Values "gpu-operator" "_aicrParentChartVersion" }}` and uses that for both the Job name slug and the aicr.nvidia.com/gpu-operator-chart-version label. Why the values-map path beats .Chart.Version on every deployer: - localformat / argocd-helm / helmfile already pre-render manifest templates via pkg/manifest.Render with the parent's wrapped values, so the injected key resolves at bundle time to the real gpu-operator version. - flux writes the manifest raw and Helm resolves at install time from the HelmRelease's spec.values (which carries the injected value verbatim, no source-controller '+<artifact-sha>' suffix that would otherwise have broken K8s label validation). - The synthesized chart's Chart.yaml-hardcoded `version: 0.1.0` no longer matters for the rollout-hook's identity. Tests added: bundler_dra_parent_chart_version_test.go covers the helper end-to-end — positive case, both gating branches (DRA disabled, gpu-operator disabled), user --set override is overridden by the bundler-derived value, nil-tolerant inputs. End-to-end verification: `helm template` on locally-bundled flux and helmfile outputs both produce Job name `aicr-dra-rollout-hook-26-3-1` and label value `"26.3.1"` — identical across deployers, no '+', valid against K8s label validation. Refs NVIDIA#980, PR NVIDIA#1030 (draft)

Second Codex pass on PR NVIDIA#1030 flagged a gating mismatch: injectDRAParentChartVersionValue required BOTH gpu-operator and nvidia-dra-driver-gpu enabled, but the dra-rollout-hook manifest is wired into base.yaml's gpu-operator componentRef's manifestFiles and gets emitted whenever gpu-operator is enabled — independent of DRA. Repro per Codex: aicr bundle --recipe recipes/overlays/h100-eks-ubuntu-training.yaml \ --set nvidia-dra-driver-gpu:enabled=false ⇒ gpu-operator-post emits a Job with: metadata.name: aicr-dra-rollout-hook-<nil> aicr.nvidia.com/gpu-operator-chart-version: "<nil>" helm.sh/chart: gpu-operator-<nil> The K8s API server rejects the Job because <nil> isn't valid DNS-1123. Fix: drop the \!draEnabled half of the gate. The injection now runs whenever gpu-operator itself is enabled (which mirrors the manifest's emission condition). When DRA is in fact disabled at runtime, the hook script's existing "no DRA DaemonSet found → exit 0" branch keeps the Job a no-op — over-injecting the value is harmless. The unit test that previously asserted "no injection when DRA disabled" now asserts the inverse, matching the documented behavior. Verified locally: `aicr bundle --deployer flux --set nvidia-dra-driver-gpu:enabled=false` followed by `helm template` on the synthesized gpu-operator-post chart produces `aicr-dra-rollout-hook-26-3-1` Job name with label `"26.3.1"`, identical to the DRA-enabled rendering. Refs NVIDIA#980, PR NVIDIA#1030 (draft)

Fourth Codex pass on PR NVIDIA#1030 flagged that the Job's pod spec has no tolerations, so on tainted CI clusters (KWOK nodes carry accelerator/system taints by default in the Tier-1 KWOK suite) the hook pod stays Pending with FailedScheduling. Flux's HelmRelease waits for the synthesized gpu-operator-post chart to become Ready; the un-schedulable Job pod keeps it in flight, every downstream release stalls on DependencyNotReady, and the cross-deployer rollout protection NVIDIA#980 was designed to provide silently doesn't. Fix: add `tolerations: [{operator: Exists}]` to the Job's pod template. The Job is short-lived (a few kubectl calls + an optional 15m wait for driver migration on managed-driver overlays) and exclusively talks to the kube-apiserver — it does NOT touch the GPU, so landing on a GPU-tainted node is harmless and a system node works equally well. Tolerating every taint lets the scheduler place it on whichever node has capacity first. Refs NVIDIA#980, PR NVIDIA#1030 (draft)

The argocd-helm-oci wrapper script was passing the FULL bundle URL to `helm install --set repoURL=…` (including the per-recipe chart name at the end). That matched the pre-PR-NVIDIA#1032 contract where the parent Application's `source.chart` was hardcoded to `aicr-bundle`. PR NVIDIA#1032 (and NVIDIA#1035's reinforcement) changed the parent App template to expect the parent-namespace-only repoURL and to append .Chart.Name itself via the separate `source.chart` field. The wrapper script wasn't updated to match. Result on every PR with argocd-helm-oci Tier-1 KWOK coverage: the parent App resolves to `oci://registry.aicr-registry.svc.cluster.local:5000/aicr/<recipe>/<recipe>:<tag>`, the OCI artifact lookup 404s, gpu-operator-post's Application can never sync, and the whole stack times out on `GitOps sync timeout strike 1/3`. The failure was masked on `main` because the most-recent KWOK Cluster Validation run on `main` (#26469449378 at 0d3e62d, success) ran *before* PR NVIDIA#1035 merged. After NVIDIA#1035 / NVIDIA#1036 / NVIDIA#1038 all landed on main, no fresh KWOK run has triggered on `main` yet — but the next one will fail the same way every open PR's argocd-helm-oci Tier-1 jobs are currently failing. Fix is a one-line drop of the per-recipe suffix from OCI_IN_CLUSTER_REF in the argocd-helm-oci branch of generate_bundle. The flux branch keeps the per-recipe suffix because flux's OCIRepository CR consumes the FULL artifact URL (recipe segment included). Updated the surrounding comment to point at the post-NVIDIA#1032 contract so the next reader understands the asymmetry. End-to-end check (verified from PR NVIDIA#1030's debug artifact at b3f2296): repo-server log shows `registry.aicr-registry.svc.cluster.local:5000/aicr/<recipe>/<recipe>:<tag>: not found`, caused by the same double-append. With the recipe suffix dropped, Argo's resolution `<repoURL>/<chart>:<tag>` aligns with the pushed artifact at `oci://…/aicr/<recipe>:<tag>`. Refs PR NVIDIA#1030 (where this surfaced), PR NVIDIA#1032 (contract change), PR NVIDIA#1035 (parent App template enforcement).

yuanchen8911 · 2026-05-26T22:11:09Z

Tracked down the Tier-1 argocd-helm-oci CI failures — not a #980 bug, but a pre-existing wrapper-script issue surfaced by my PR's CI re-run.

Root cause: the kwok wrapper script kwok/scripts/validate-scheduling.sh was passing the FULL bundle URL via helm install --set repoURL=oci://…/aicr/<recipe> (the pre-PR-#1032 contract). PR #1032 changed the parent App template to expect the parent-namespace-only repoURL, with .Chart.Name appended via the separate source.chart field. The wrapper wasn't updated to match. Result on every PR with argocd-helm-oci coverage: parent App resolves to oci://…/aicr/<recipe>/<recipe>:<tag>, the OCI lookup 404s, and gpu-operator-post Application never syncs (debug-artifact evidence: failed to resolve revision: …/aicr/aks/aks:0.0.0-…-argocd-helm-oci: not found).

The failure was masked on main because the most-recent KWOK Cluster Validation run on main (#26469449378 at 0d3e62df) predates PR #1035. After #1035 / #1036 / #1038 all landed on main today, no new main KWOK run has fired yet — but it would fail the same way every open PR's argocd-helm-oci jobs are failing.

Fix in commit 5620000 — one-line drop of the per-recipe suffix from OCI_IN_CLUSTER_REF in the argocd-helm-oci branch of generate_bundle. The flux branch keeps the full URL because flux's OCIRepository consumes the full artifact reference. CI will re-run and the Tier-1 argocd-helm-oci jobs should go green alongside the already-passing flux-oci ones.

…d runs) Codex P3 follow-up on the previous trim hardening: two passes of `trimSuffix . | _ | -` still leave the label invalid when a SemVer pre-release lands a three-separator run like `.--` at the truncation boundary. Repro Codex provided: parent = 1.2.3-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.--b → aicr.nvidia.com/gpu-operator-chart-version: "…aaaaaa." (rejected) pkg/manifest's helmFuncMap doesn't register a regex helper, so a chain of `trimSuffix` is the only suffix-stripping primitive available across both the bundler-time render path (limited funcMap) and the Helm install-time path. Going from 2 cycles to 5 cycles handles up to 15 trailing-separator chars of any combo, and tightening the trunc from 63 → 60 leaves 3 chars of headroom for the strip to consume without losing meaningful version digits. Applied the same treatment to $chartLabel which has the identical hazard (plus the `_` from upstream `replace "+" "_"`). Verified with Codex's three-separator repro: …aaaa.--b (trunc-boundary `.--` tail) → "1.2.3-aaaaaa…aaaa" (60 chars, ends alphanumeric) Normal 26.3.1 / 26.3.1+build.1 paths still render correctly. This is not formally bulletproof — a 16-separator tail would still slip through. A true fix needs either a regex helper added to pkg/manifest's helmFuncMap or a Go-side label sanitizer pre-injected via the values map (similar to _aicrParentChartVersion). Filed as a follow-up; the current chain handles every realistic SemVer input. Refs PR NVIDIA#1030.

@sha256

Six findings from Codex review on PR NVIDIA#1030's initial commit (all verified against source): P1 — Use only supported manifest template functions pkg/manifest/render.go:98 (helmFuncMap) registers toYaml, nindent, toString, default, replace, trunc, trimSuffix. The Sprig "quote" function is not in the map, so v1's "{{ $draNamespace | quote }}" fails template parsing during aicr bundle. Replaced with a literal string in the script (the namespace is no longer hardcoded — see P2.1 below — so the inline "..." is gone too) and replaced the two label values that used "{{ .Chart.Version | quote }}" with YAML-syntactic quotes around the template expression. P1 — Ensure the DRA namespace exists before RBAC On a fresh cluster, gpu-operator-post installs immediately after gpu-operator-pre/gpu-operator but BEFORE nvidia-dra-driver-gpu has applied its chart with --create-namespace. v1's Role and RoleBinding in $draNamespace would therefore fail because the nvidia-dra-driver namespace does not exist yet. Collapsed the cross-namespace Role+RoleBinding into a single ClusterRole+ ClusterRoleBinding that grants daemonsets get/list/watch/patch cluster-wide; the cluster-scoped binding has no per-namespace dependency. Trade-off is slightly broader RBAC, but the SA is short-lived and scoped to this one Job. P1 — Preserve hook semantics through manifest bundling pkg/bundler/deployer/localformat/local_helm.go:165 unconditionally calls stripHelmHooks on every manifest rendered into a local-helm chart. stripHelmHooks removes helm.sh/hook* annotations for post-install / post-upgrade / pre-install / pre-upgrade so that Argo CD's syncPolicy.automated can apply the resource at all (PostSync hooks never fire on path-based sources under automated sync). v1 relied on those annotations and would have shipped as a regular Job that fires only once at first install, never on upgrades. Re-firing is now achieved by keying the Job name on .Chart.Version (the synthesized gpu-operator-post chart inherits the parent gpu-operator componentRef's version at pkg/bundler/deployer/localformat/writer.go:121), so each chart bump produces a new Job manifest that every deployer applies. helm.sh/resource-policy: keep prevents Helm's three-way merge from pruning prior-version Jobs mid-rollout; ttlSecondsAfterFinished: 600 lets Kubernetes self-clean succeeded Jobs after 10m so the namespace doesn't accumulate one per past chart bump. P2 — Derive the DRA namespace instead of hardcoding it The os-talos mixin places nvidia-dra-driver-gpu in privileged-nvidia-dra-driver-gpu, and any recipe can override the component namespace. v1's hardcoded "nvidia-dra-driver" both bound the RBAC to the wrong namespace and pointed DRA_NS at the wrong place. With RBAC collapsed to ClusterRole (P1.2 above), the remaining script-side concern is locating the DaemonSet at runtime. The script now discovers the DRA kubelet-plugin DaemonSet via kubectl get daemonset -A -l app.kubernetes.io/name=nvidia-dra-driver-gpu, filtering for the kubelet-plugin/dra-driver DaemonSet (not the controller Deployment), with a fallback that takes the first matched DaemonSet for chart variants that don't set a component label. P2 — Grant daemonsets/watch for kubectl rollout status kubectl rollout status daemonset/... uses Watch on the DaemonSet resource, not just Get on Pods. v1's Role granted only get/list/patch on daemonsets, so every hook run would get Forbidden on the watch and skip the readiness gate. Added "watch" to the daemonsets ClusterRole rule. The pods rule retains get/list/watch. P2 — Pin the kubectl image by digest Repo standard is digest-pinned images (validator catalog, on-tag.yaml, EKS images, renovate runner all use @sha256:…). Separately, Bitnami's catalog reshuffle removed all non-latest tags from bitnami/kubectl (only "latest" remains in the free namespace today; bitnamilegacy/kubectl tops out at 1.33.4), so the v1 reference "bitnami/kubectl:1.34" no longer resolves. Switched to docker.io/alpine/k8s:1.34.8 pinned by the multi-arch manifest list digest sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be. alpine/k8s ships kubectl plus a BusyBox shell, so the embedded script was re-targeted to POSIX /bin/sh (no [[…]], no pipefail). Refs NVIDIA#980, PR NVIDIA#1030 (draft)

Two Codex re-review findings on PR NVIDIA#1030: P1 — Drop ttlSecondsAfterFinished on the chart-version-keyed Job v2 set ttlSecondsAfterFinished: 600 so prior-version Jobs would self-clean. With the Job as a regular (non-hook) chart resource, it is now part of the GitOps desired state: Argo CD's self-heal and Helm's three-way merge both treat it as desired. A Kubernetes TTL delete makes the resource missing relative to the rendered manifest, so the next reconcile recreates it and re-runs the DRA rollout-restart on every cycle. Removed the TTL. Old-version Jobs persist as Succeeded records (with helm.sh/resource-policy: keep preserving them across upgrades); operators can garbage-collect by label or via a follow-up cleanup mechanism if they become operationally noisy. P2 — Regenerate docs/user/container-images.md to include alpine/k8s CLAUDE.md requires `make bom-docs` after any change that adds a new image reference. The new dra-rollout-hook manifest pins docker.io/alpine/k8s:1.34.8@sha256:4d352ba…, and the BOM tool picks it up from manifests. gpu-operator's per-component image count goes from 14 → 15, with the new image listed under the gpu-operator section. Refs NVIDIA#980, PR NVIDIA#1030 (draft)

Codex P3 re-review on PR NVIDIA#1030 noted two comments that hadn't kept up with the design pivot: dra-rollout-hook.yaml — the header still claimed ttlSecondsAfterFinished lets Kubernetes self-clean prior Jobs, but the spec intentionally omits TTL (per the prior fixup) and the inline annotation comment already explains why. Rewrote the header paragraph to match: TTL is deliberately absent because the Job is GitOps desired state and a TTL delete would trigger a recreate on every reconcile. base.yaml — the manifestFiles comment described this entry as a "Helm post-install,post-upgrade hook," but it ships as a regular chart-version-keyed Job because the local-helm bundler strips helm.sh/hook* for Argo CD compatibility. Rewrote to describe the actual mechanism (versioned Job name → re-fires on every chart bump) and note the hook-strip rationale. Comment-only change. No spec, RBAC, or script behavior diff. Refs NVIDIA#980, PR NVIDIA#1030 (draft)

… parent PR NVIDIA#1030's Tier 1 KWOK flux-oci tests failed across every recipe with: Job.batch "aicr-dra-rollout-hook-0-1-0-537018ad0a4c" is invalid: metadata.labels: Invalid value: "0.1.0+537018ad0a4c": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', ... The flux deployer was writing manifest templates raw into the synthesized -pre / -post chart and leaving substitution to Helm. The synthesized chart's Chart.yaml hardcoded version: 0.1.0, so Helm resolved .Chart.Version to a value unrelated to the parent component — and under flux-oci, source-controller appended a +<artifact-sha> suffix that the K8s API server rejected as a label value. The chart-version-keyed Job name from NVIDIA#980 only worked under localformat (helm / helmfile / argocd-helm) because that deployer already calls pkg/manifest.Render at bundle time with the parent componentRef's version. Flux didn't, so the same template silently behaved differently across deployers. Fix: thread the parent's normalized chart version into generateManifestHelmChart and (a) pre-render every manifest via pkg/manifest.Render with ChartVersion=<parent>, (b) write Chart.yaml with version: <parent> too so any external reader of the synthesized chart's metadata sees the real identity. After the fix, rendered output is identical across deployers — the dra-rollout-hook Job name becomes aicr-dra-rollout-hook-26-3-1 (the gpu-operator chart pin) under every deployer, and the aicr.nvidia.com/gpu-operator-chart-version label carries "26.3.1" (valid label value, no '+'). Drive-by: refresh the dra-rollout-hook.yaml header comment that incorrectly attributed the .Chart.Version substitution to the synthesized chart's Chart.yaml — the substitution now happens at manifest.Render time, not at Helm install time. Refs NVIDIA#980, PR NVIDIA#1030 (draft)

…ifest pre-render Codex re-review on PR NVIDIA#1030 caught a contract violation in the flux/helm.go change from commit aa478a2: calling manifest.Render at bundle time substitutes every .Values.* expression in the manifest template with the static values, before splitDynamicPaths separates dynamic paths into the Flux ConfigMap. Operators editing configmap-values.yaml to override a dynamic path then have no effect on the rendered manifest — the static default has already been baked in. The cited test (TestGenerate_WithDynamicValues_ManifestComponent) only verifies ConfigMap wiring, not that .Values.* references survive into the rendered chart, so it passed despite the regression. Fix has two parts: 1. Revert the flux/helm.go manifest.Render call. Manifests are written raw again so Helm substitutes .Values.* at install time from the merged static + dynamic values, restoring the documented Flux dynamic-values behavior. Keep the chartVersion threading that sets Chart.yaml's version field to the parent's normalized value — that's cosmetic metadata only (no template uses .Chart.Version on this path anymore). 2. Plumb the parent gpu-operator chart version through the values map instead. DefaultBundler.injectDRAParentChartVersionValue writes gpu-operator componentValues[_aicrParentChartVersion] = NormalizeVersionWithDefault(ref.Version) at bundle generation, gated on both gpu-operator and nvidia-dra-driver-gpu being enabled in the filtered recipe. The dra-rollout-hook template reads it via `{{ index .Values "gpu-operator" "_aicrParentChartVersion" }}` and uses that for both the Job name slug and the aicr.nvidia.com/gpu-operator-chart-version label. Why the values-map path beats .Chart.Version on every deployer: - localformat / argocd-helm / helmfile already pre-render manifest templates via pkg/manifest.Render with the parent's wrapped values, so the injected key resolves at bundle time to the real gpu-operator version. - flux writes the manifest raw and Helm resolves at install time from the HelmRelease's spec.values (which carries the injected value verbatim, no source-controller '+<artifact-sha>' suffix that would otherwise have broken K8s label validation). - The synthesized chart's Chart.yaml-hardcoded `version: 0.1.0` no longer matters for the rollout-hook's identity. Tests added: bundler_dra_parent_chart_version_test.go covers the helper end-to-end — positive case, both gating branches (DRA disabled, gpu-operator disabled), user --set override is overridden by the bundler-derived value, nil-tolerant inputs. End-to-end verification: `helm template` on locally-bundled flux and helmfile outputs both produce Job name `aicr-dra-rollout-hook-26-3-1` and label value `"26.3.1"` — identical across deployers, no '+', valid against K8s label validation. Refs NVIDIA#980, PR NVIDIA#1030 (draft)

Second Codex pass on PR NVIDIA#1030 flagged a gating mismatch: injectDRAParentChartVersionValue required BOTH gpu-operator and nvidia-dra-driver-gpu enabled, but the dra-rollout-hook manifest is wired into base.yaml's gpu-operator componentRef's manifestFiles and gets emitted whenever gpu-operator is enabled — independent of DRA. Repro per Codex: aicr bundle --recipe recipes/overlays/h100-eks-ubuntu-training.yaml \ --set nvidia-dra-driver-gpu:enabled=false ⇒ gpu-operator-post emits a Job with: metadata.name: aicr-dra-rollout-hook-<nil> aicr.nvidia.com/gpu-operator-chart-version: "<nil>" helm.sh/chart: gpu-operator-<nil> The K8s API server rejects the Job because <nil> isn't valid DNS-1123. Fix: drop the \!draEnabled half of the gate. The injection now runs whenever gpu-operator itself is enabled (which mirrors the manifest's emission condition). When DRA is in fact disabled at runtime, the hook script's existing "no DRA DaemonSet found → exit 0" branch keeps the Job a no-op — over-injecting the value is harmless. The unit test that previously asserted "no injection when DRA disabled" now asserts the inverse, matching the documented behavior. Verified locally: `aicr bundle --deployer flux --set nvidia-dra-driver-gpu:enabled=false` followed by `helm template` on the synthesized gpu-operator-post chart produces `aicr-dra-rollout-hook-26-3-1` Job name with label `"26.3.1"`, identical to the DRA-enabled rendering. Refs NVIDIA#980, PR NVIDIA#1030 (draft)

Fourth Codex pass on PR NVIDIA#1030 flagged that the Job's pod spec has no tolerations, so on tainted CI clusters (KWOK nodes carry accelerator/system taints by default in the Tier-1 KWOK suite) the hook pod stays Pending with FailedScheduling. Flux's HelmRelease waits for the synthesized gpu-operator-post chart to become Ready; the un-schedulable Job pod keeps it in flight, every downstream release stalls on DependencyNotReady, and the cross-deployer rollout protection NVIDIA#980 was designed to provide silently doesn't. Fix: add `tolerations: [{operator: Exists}]` to the Job's pod template. The Job is short-lived (a few kubectl calls + an optional 15m wait for driver migration on managed-driver overlays) and exclusively talks to the kube-apiserver — it does NOT touch the GPU, so landing on a GPU-tainted node is harmless and a system node works equally well. Tolerating every taint lets the scheduler place it on whichever node has capacity first. Refs NVIDIA#980, PR NVIDIA#1030 (draft)

…afe version) Two of three new P2 findings on PR NVIDIA#1030 addressed; the third is filed as a follow-up note in the PR conversation. P2 — Check for DRA before waiting on driver migration When nvidia-dra-driver-gpu was disabled or absent but gpu-operator managed drivers, the script entered the 15-minute gpu-driver-upgrade-state wait BEFORE the DRA-DaemonSet absence check at the end. That made every DRA-disabled deployment or upgrade block for up to 15 minutes on a no-op path. Restructured the script so the DRA discovery step runs FIRST; if no DRA kubelet plugin DaemonSet is found we exit 0 immediately, skipping the migration wait. The driver-daemonset check and the per-node upgrade-state wait now run only when there is a DRA DS to subsequently restart. P2 — Sanitize the chart-version label $parentVersion can legally carry semver build metadata (e.g. 26.3.1+build.1). The Job's name slug and the helm.sh/chart label already replace '+' with '-', but the aicr.nvidia.com/gpu-operator-chart-version label was emitting the raw value, which Kubernetes label validation would reject. Added a sibling $versionLabel that strips '+' (and trims to 63 chars / strips a trailing '-') and use it for both the metadata.labels and the spec.template.metadata.labels sites. Dots remain — they are valid in label values. Out of scope (filed as follow-up): P2 — Preserve the post hook on vendored upgrades When --vendor-charts is used with the helm / helmfile deployers, localformat folds manifestFiles into the vendored gpu-operator wrapper instead of emitting a separate gpu-operator-post chart, and injectPostInstallHooks annotates the result with only helm.sh/hook: post-install. The version-keyed Job would still re-fire on upgrades via its name (per the existing design), but the absence of helm.sh/hook: post-upgrade in the vendored path means Argo CD / Flux paths through that codepath miss the upgrade-time re-trigger. Tracking this as a follow-up issue — the fix is in pkg/bundler/deployer/localformat (or wherever injectPostInstallHooks lives), separate from this manifest. Refs PR NVIDIA#1030.

Codex P1 on PR NVIDIA#1030 at head 0375989: the `$versionLabel` Go-template block comment (`{{- /* … */}}`) breaks raw-YAML parsing of the manifest before Helm rendering. `recipes/manifest_images_test.go` (TestComponentManifestImagesAreFullyQualified / TestComponentManifestImagesAreDigestPinned) decodes every `components/*/manifests/*.yaml` file as YAML directly — the continuation lines inside the `/* */` block are not valid YAML content, so the test fails with `decode yaml: yaml: line 65: did not find expected node content`. Replace the block comment with ordinary YAML `#` comments at column 0 (before the `{{- $saName }}` definitions block) so it survives raw-YAML decode while still being preserved verbatim through Helm's whitespace-trim rules at render time. Verified locally: `go test ./recipes -run TestComponentManifestImagesAre` passes. Refs PR NVIDIA#1030.

Codex P2 on PR NVIDIA#1030 at head 162d1df: the Job's name is keyed only on $versionSlug (the resolved gpu-operator chart version). Job .spec.template is immutable in Kubernetes — any change to the image digest, inline shell script, tolerations, resources, or securityContext on this Job will, under a future AICR release that ships the SAME gpu-operator chart pin as the current bundle, produce an `Forbidden: field is immutable` rejection when Helm/Argo CD/Flux try to update the existing Job. helm.sh/resource-policy: keep prevents Helm-side prune but doesn't make the resource updatable. Fix: introduce a hand-bumped $hookRev counter that's appended to the Job name (e.g. aicr-dra-rollout-hook-26-3-1-r1). Maintainers bump this constant (1 → 2 → 3 → …) in lockstep with any material edit to the Job spec below; the bumped name makes the next install create a fresh Job that the deployer can apply cleanly, while prior Jobs persist as Succeeded records (already protected by helm.sh/resource-policy: keep). The constant lives in the same file as the spec it tracks, so the dependency is local; a future PR adding a content-derived hash via sha256sum on a named template can replace it without changing the Job-name shape. Documented the contract inline at the $hookRev definition and the intent of the trailing "-r<rev>" suffix is now self-explanatory in the rendered Job name. Refs PR NVIDIA#1030.

… cap Codex P2 on PR NVIDIA#1030 at head 9c5f52c: $versionSlug was truncated at 50 chars, then the assembled "aicr-dra-rollout-hook-{slug}-r{rev}" name was truncated at 63 chars. For a long parent-version string (e.g. "1.2.3-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"), the outer trunc drops the entire "-r<rev>" tail, so a future $hookRev bump won't change the Job name and the immutable-pod-template failure mode returns. Budget the slug's truncation to reserve space for the suffix: "aicr-dra-rollout-hook-" (22) + <slug> + "-r<rev>" (≤ 6) ≤ 63 ⇒ <slug> ≤ 35 Reserve 6 chars for "-r<rev>" (covers up to -r9999, far more revisions than this hook will ever see). Trunc the slug at 35 and keep the outer trunc 63 as a defense-in-depth cap. Verified with the long-version repro Codex cited: parent = 1.2.3-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa → name = aicr-dra-rollout-hook-1-2-3-aaaaaaaaaaaaaaaaaaaaaaaaaaaaa-r1 (60 chars, suffix intact) The normal 26.3.1 / 26.3.1+build.1 paths still render correctly. Refs PR NVIDIA#1030.

Codex P2 on PR NVIDIA#1030 at head 1a06ad6: $versionLabel only trimmed a trailing '-' after the 63-char trunc, but a SemVer pre-release with a '.' at exactly position 63 leaves the rendered label value ending in '.', which Kubernetes label validation rejects (label values must start and end with [A-Za-z0-9]). $chartLabel has the same hazard with '_' because the upstream `replace "+" "_"` can leave the underscore at the truncation boundary. Repro Codex cited: parent = 1.2.3-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.b → aicr.nvidia.com/gpu-operator-chart-version: "1.2.3-aaaaaaa…aaaa." (rejected: ends in '.') Fix: chain `trimSuffix` calls for each of '.', '_', '-' on both label sites, doubled-up to handle the rare case of two consecutive separators landing at the truncation boundary. The chained shape is intentionally verbose (no regex available in pkg/manifest's helmFuncMap, only the seven funcs registered there — `trimSuffix` is one of them and runs cleanly through both the localformat bundler-time render path and the Helm install-time path). Verified with the long-pre-release repro and the normal 26.3.1 / 26.3.1+build.1 paths: both labels now end alphanumeric. Refs PR NVIDIA#1030.

…d runs) Codex P3 follow-up on the previous trim hardening: two passes of `trimSuffix . | _ | -` still leave the label invalid when a SemVer pre-release lands a three-separator run like `.--` at the truncation boundary. Repro Codex provided: parent = 1.2.3-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.--b → aicr.nvidia.com/gpu-operator-chart-version: "…aaaaaa." (rejected) pkg/manifest's helmFuncMap doesn't register a regex helper, so a chain of `trimSuffix` is the only suffix-stripping primitive available across both the bundler-time render path (limited funcMap) and the Helm install-time path. Going from 2 cycles to 5 cycles handles up to 15 trailing-separator chars of any combo, and tightening the trunc from 63 → 60 leaves 3 chars of headroom for the strip to consume without losing meaningful version digits. Applied the same treatment to $chartLabel which has the identical hazard (plus the `_` from upstream `replace "+" "_"`). Verified with Codex's three-separator repro: …aaaa.--b (trunc-boundary `.--` tail) → "1.2.3-aaaaaa…aaaa" (60 chars, ends alphanumeric) Normal 26.3.1 / 26.3.1+build.1 paths still render correctly. This is not formally bulletproof — a 16-separator tail would still slip through. A true fix needs either a regex helper added to pkg/manifest's helmFuncMap or a Go-side label sanitizer pre-injected via the values map (similar to _aicrParentChartVersion). Filed as a follow-up; the current chain handles every realistic SemVer input. Refs PR NVIDIA#1030.

@sha256