
fix(bundler): make undeploy.sh resilient to transient kubectl failures#599

Closed
yuanchen8911 wants to merge 1 commit into NVIDIA:main from yuanchen8911:fix/undeploy-pipefail-resilience

Conversation

@yuanchen8911
Contributor

@yuanchen8911 yuanchen8911 commented Apr 16, 2026

Summary

Add a trailing || true to three kubectl | jq | while pipelines in the generated undeploy.sh so a transient API failure during post-uninstall cleanup no longer kills the script before namespace deletion.

Motivation / Context

With set -euo pipefail (already at the top of the generated script), a single non-zero return from kubectl get crd or kubectl get <kind> — even a brief EKS control-plane 502 or an auth-refresh hiccup — fails the entire pipeline and exits the script. In a typical run with ~15 components, the orphaned-CRD cleanup pipeline executes 15 times back-to-back, so any momentary control-plane flap kills undeploy.sh after the Helm uninstalls have succeeded but before taint removal, orphan-CRD cleanup, and namespace deletion. The cluster ends up half-cleaned and the user has to finish manually.

Hit live today on aicr-cuj2: all 15 Helm releases uninstalled, then the script exited 1 immediately after "Removing skyhook.nvidia.com node taints…", leaving every component namespace Active.

The two helper functions delete_release_cluster_resources and force_clear_namespace_finalizers carry the same pipeline shape and the same exposure.

The adjacent ORPHANED_CRD_GROUPS loop and stuck_crds=$(…) extraction already use the … || true pattern; this PR brings the three remaining sites into line.
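The failure mode can be reproduced in a few lines. A minimal sketch, with stand-ins rather than the generated script: false plays a kubectl call hit by a transient API error, cat plays jq.

```shell
#!/usr/bin/env bash
# Minimal repro of the pipefail failure mode. 'false' stands in for a
# kubectl call hit by a transient API error; 'cat' stands in for jq.
set -euo pipefail

run_cleanup() {  # $1 = text appended after the pipeline's closing 'done'
  bash -c "
    set -euo pipefail
    false | cat | while read -r name; do echo \"deleting \$name\"; done $1
    echo 'reached namespace deletion'
  "
}

# Unguarded: pipefail propagates the failure and set -e kills the script.
run_cleanup "" || echo "script died before namespace deletion"

# Guarded: the trailing '|| true' absorbs the failure; cleanup continues.
run_cleanup "|| true"
```

The unguarded call never reaches its final echo; the guarded one does.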

Fixes: N/A
Related: #477 (introduced the per-component CRD cleanup pipeline)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)

Implementation Notes

Three single-token additions (done → done || true) plus one || true on a single kubectl delete crd call, all in pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl:

  • delete_release_cluster_resources (line 103)
  • force_clear_namespace_finalizers (line 154)
  • per-component orphan-CRD cleanup (lines 301–302)

The delete_orphaned_webhooks_for_ns helper at line 117 already wraps its inner pipeline in { … || true; }, so it doesn't need the same treatment.

Trade-off: cleanup steps that fail due to a real (non-transient) API problem will now be silently skipped. The post-flight verification block at the end of the script already warns about leftover namespaces / stuck CRDs / stale webhooks, so a real cleanup failure surfaces there instead of as set -e exit 1 in the middle.
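For reference, the guarded pipeline shape looks roughly like this. An illustrative sketch, not the verbatim template: kubectl and jq are stubbed as always-failing / pass-through so the snippet runs anywhere.

```shell
#!/usr/bin/env bash
# Illustrative shape of a patched cleanup pipeline. The stubs below are
# stand-ins: every kubectl call fails, as during an API flap.
set -euo pipefail
kubectl() { return 1; }
jq() { cat; }   # pass-through stand-in for the real jq filter

kubectl get crd -o json 2>/dev/null \
  | jq -r '.items[].metadata.name' 2>/dev/null \
  | while read -r name; do
      kubectl delete crd "${name}" --ignore-not-found --wait=false || true
    done || true   # <- the single-token addition from this PR

echo "continuing to post-uninstall cleanup"
```

Without the trailing || true, pipefail propagates the stubbed failure and set -e ends the script before the final echo.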

Testing

make lint   # go vet + golangci-lint clean; pre-existing yamllint errors are in untracked .codex-worktrees/
GOFLAGS=-mod=vendor go test -race -count=1 ./pkg/bundler/...
  • pkg/bundler/... tests pass with -race (all subpackages green).
  • Re-built the CLI, regenerated a bundle, and confirmed the four new || true guards appear at the expected positions in bundles/undeploy.sh.
  • Live re-test on aicr-cuj2 after the patch is pending — will run once the cluster's intermittent 502s settle.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: Generated-script change only. No API surface, no data migration. Reverting is a one-commit revert of this PR; existing bundles can either be regenerated or hand-patched.

Checklist

  • Tests pass locally (make test with -race on pkg/bundler/...)
  • Linter passes (make lint — go vet + golangci-lint clean; yamllint failures are in pre-existing untracked .codex-worktrees/)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (no new functionality; behavior change is "tolerate transient failure")
  • I updated docs if user-facing behavior changed (no user-facing behavior change beyond resilience)
  • Changes follow existing patterns in the codebase (matches the … || true pattern already used in the same file)
  • Commits are cryptographically signed (git commit -S)

Summary by CodeRabbit

  • Bug Fixes

    • Undeployment now tolerates transient failures during resource and CRD cleanup: cleanup continues instead of aborting, and non-fatal issues emit warnings so undeploy completes more reliably.
  • Tests

    • Added an integration-style test ensuring transient kubectl failures produce warnings and do not stop the undeploy flow.

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner April 16, 2026 23:41
@coderabbitai

coderabbitai Bot commented Apr 16, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

Modified the undeploy script to prevent strict-shell (set -euo pipefail) from aborting undeploy pipelines by appending safe fallbacks (|| true and done || echo "Warning: ... " >&2) so transient failures emit warnings and cleanup continues; added a test that simulates kubectl failures and asserts warnings are produced.

Changes

Cohort / File(s): Summary

  • Undeploy script (error handling), pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl: Adjusted cleanup pipelines to tolerate transient pipeline failures: appended || true and done || echo "Warning: ..." >&2 fallbacks so failures warn and cleanup continues.
  • Test added, pkg/bundler/deployer/helm/helm_test.go: Added TestUndeployScript_TransientFailureWarnsAndContinues: generates undeploy.sh, stubs kubectl to fail for non-api-resources calls, runs extracted cleanup sections, and asserts successful exit plus expected Warning: stderr output. Imports updated (bytes, os/exec).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through loops where warnings glow,
I nudged the pipes so scripts can go,
A whispered echo, soft and bright,
Lets cleanup trundle through the night,
I nibble errors, then say "all right." 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: Passed. The title clearly and concisely describes the main change (making undeploy.sh resilient to transient kubectl failures), which aligns with the core modification across both modified files.
  • Docstring Coverage: Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Line 103: Occurrences of the silent suppression pattern "done || true" hide
failures during cleanup; change each usage to capture the command exit status
and, on non-zero, emit a clear warning including the failing command/context and
its output (stderr) so operators see RBAC/auth/API/jq failures — e.g., run the
pipeline, store its exit code, and if non‑zero print a descriptive warning to
stderr (matching the style used by the post-flight checks for terminating
namespaces/stale webhooks/unavailable APIs) rather than swallowing errors;
update each "done || true" instance to this conditional-warn pattern and ensure
the warning contains the command name/context for easier debugging.
- Around line 301-302: Add a post‑flight verification that mirrors the deletion
block's Helm-annotation filtering: after the kubectl delete loop that targets
CRDs with the Helm annotation (the block containing kubectl delete crd "${name}"
--ignore-not-found --wait=false), scan remaining CRDs for the same Helm
release/annotation (the same annotation key used in the deletion loop) and
report or fail if any matching CRDs still exist—similar to how
ORPHANED_CRD_GROUPS is checked in the post‑flight section (the check around
ORPHANED_CRD_GROUPS); implement this by enumerating CRDs, filtering by the Helm
annotation used earlier, and surfacing any leftovers in the post‑flight check so
Helm-managed CRDs are covered.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: af834a8f-0d2c-4397-9f56-b49f95e6f28a

📥 Commits

Reviewing files that changed from the base of the PR and between 1b25135 and cd380d6.

📒 Files selected for processing (1)
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl Outdated
Comment on lines +301 to +302
kubectl delete crd "${name}" --ignore-not-found --wait=false || true
done || true


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify annotation-based cleanup exists only in deletion path and not in post-flight verification.
rg -n -C3 'meta\.helm\.sh/release-name|meta\.helm\.sh/release-namespace' pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
rg -n -C3 'orphaned_crds|ORPHANED_CRD_GROUPS|Post-flight' pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

Repository: NVIDIA/aicr

Length of output: 3508


Add post-flight verification for Helm-annotated CRDs to match the deletion block coverage.

Lines 301-302 suppress both delete and enumeration failures. If this block fails silently, Helm-owned CRDs can remain undetected because post-flight checks (lines 425-435) only scan ORPHANED_CRD_GROUPS, which covers operator-created CRDs, not Helm-managed ones. The annotation-based filtering in the deletion block (lines 296-298) has no corresponding post-flight check.

Suggested post-flight check
+# Verify Helm-annotated CRDs from bundle releases are gone
+remaining_release_crds=""
+{{- range .ComponentsReversed }}{{ if .HasChart }}
+left=$(kubectl get crd -o json 2>/dev/null \
+  | jq -r --arg rel "{{ .Name }}" --arg ns "{{ .Namespace }}" \
+    '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)
+if [[ -n "${left}" ]]; then
+  remaining_release_crds="${remaining_release_crds} ${left}"
+fi
+{{- end }}{{ end }}
+if [[ -n "${remaining_release_crds}" ]]; then
+  echo "WARNING: Helm-owned CRDs still present:${remaining_release_crds}"
+  postflight_issues=true
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 301 - 302,
Add a post‑flight verification that mirrors the deletion block's Helm-annotation
filtering: after the kubectl delete loop that targets CRDs with the Helm
annotation (the block containing kubectl delete crd "${name}" --ignore-not-found
--wait=false), scan remaining CRDs for the same Helm release/annotation (the
same annotation key used in the deletion loop) and report or fail if any
matching CRDs still exist—similar to how ORPHANED_CRD_GROUPS is checked in the
post‑flight section (the check around ORPHANED_CRD_GROUPS); implement this by
enumerating CRDs, filtering by the Helm annotation used earlier, and surfacing
any leftovers in the post‑flight check so Helm-managed CRDs are covered.

@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-pipefail-resilience branch from cd380d6 to db18150 Compare April 16, 2026 23:49
@yuanchen8911
Contributor Author

Thanks for the review @coderabbitai.

Suggestion 1 (don't silently suppress): addressed in the force-push — each done || true is now done || echo "Warning: <kind/release/namespace/component context> failed (kubectl get / jq error); ..." >&2. Operators get a stderr signal naming the failing site instead of nothing, while the script still continues into post-flight verification.
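The warn-and-continue guard shape looks something like this. A sketch with illustrative context values, not the exact template wording: false plays the failing kubectl|jq stages.

```shell
#!/usr/bin/env bash
# Sketch of the warn-and-continue guard. 'false' plays the failing
# kubectl|jq stages; the kind/release/ns values are illustrative.
set -euo pipefail
kind=ClusterRole release=gpu-operator ns=gpu-operator

false | while read -r name; do
  echo "cleaning ${name}"
done || echo "Warning: ${kind} cleanup pipeline for release ${release}/${ns} failed (kubectl get / jq error); continuing to post-flight" >&2

echo "still running"   # set -e was not triggered
```

The warning goes to stderr so stdout log parsing stays clean, and the script proceeds to the next step.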

Suggestion 2 (Helm-annotation post-flight check): deferring to a follow-up. It's a useful verification surface but adds new behavior beyond "stop the script from dying on transient flaps", and I'd like to keep this PR's blast radius minimal. Will file a separate issue/PR for the post-flight CRD-by-annotation check.

@coderabbitai

coderabbitai Bot commented Apr 16, 2026


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the current state of the undeploy.sh.tmpl file to verify the warning echo changes
fd 'undeploy.sh.tmpl' --exec cat {}

Length of output: 21031


@yuanchen8911: Confirmed — the force-pushed version looks great. All three pipeline guards now carry meaningful, context-rich warning messages:

Site: warning message
  • delete_release_cluster_resources: "Warning: ${kind} cleanup pipeline for release ${release}/${ns} failed …; leftovers will surface in post-flight"
  • force_clear_namespace_finalizers: "Warning: finalizer-clear pipeline for ${kind} in ${ns} failed …; namespace may stay Terminating"
  • Orphan-CRD cleanup: "Warning: orphan-CRD cleanup for <component>/<ns> failed …; leftovers will surface in post-flight"
Each message names the specific failing site and the kind/release/namespace/component context, routes to stderr so it doesn't pollute stdout log parsing, and directs operators to post-flight verification — a clear improvement over silent suppression.

On Suggestion 2 — totally reasonable to keep this PR's scope tight. Would you like me to open a follow-up GitHub issue to track the Helm-annotation post-flight CRD check so it doesn't get lost?


@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (1)

103-103: ⚠️ Potential issue | 🟠 Major

Post-flight guarantee is overstated for Helm-owned CRDs.

These warnings say leftovers “will surface in post-flight”, but post-flight currently checks ORPHANED_CRD_GROUPS and not Helm-annotation-owned CRDs. A failed annotation-based cleanup can remain undetected.

Consider either adding a matching post-flight check for Helm-owned CRDs or softening the warning text to avoid a false guarantee.

Also applies to: 302-302

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` at line 103, The
warning text in undeploy.sh.tmpl (the echo line that starts "Warning: ${kind}
cleanup pipeline for release ${release}/${ns} failed...") promises leftovers
"will surface in post-flight" which is misleading because post-flight currently
only inspects ORPHANED_CRD_GROUPS and does not detect Helm-annotation-owned
CRDs; either change this message to a softer, accurate statement (e.g., remove
the guarantee about post-flight) or implement a complementary post-flight check
that scans for Helm-owned CRDs via their Helm annotations (e.g.,
meta.helm.sh/release-name/namespace annotation keys) and cross-references them
with release/${ns} so failures are detected. Update the echo in undeploy.sh.tmpl
if softening the message, or add the new post-flight check logic and ensure it
references ORPHANED_CRD_GROUPS and Helm annotation keys consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Line 103: The warning text in undeploy.sh.tmpl (the echo line that starts
"Warning: ${kind} cleanup pipeline for release ${release}/${ns} failed...")
promises leftovers "will surface in post-flight" which is misleading because
post-flight currently only inspects ORPHANED_CRD_GROUPS and does not detect
Helm-annotation-owned CRDs; either change this message to a softer, accurate
statement (e.g., remove the guarantee about post-flight) or implement a
complementary post-flight check that scans for Helm-owned CRDs via their Helm
annotations (e.g., meta.helm.sh/release-name/namespace annotation keys) and
cross-references them with release/${ns} so failures are detected. Update the
echo in undeploy.sh.tmpl if softening the message, or add the new post-flight
check logic and ensure it references ORPHANED_CRD_GROUPS and Helm annotation
keys consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: b3355203-b3ef-4e3e-a478-08c5838395fe

📥 Commits

Reviewing files that changed from the base of the PR and between cd380d6 and db18150.

📒 Files selected for processing (1)
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 16, 2026
Add a verification pass at the end of undeploy.sh that re-enumerates CRDs
filtered by the same meta.helm.sh/release-name annotation used by the
per-component CRD deletion loop, and surfaces any leftovers as a post-flight
warning.

Today the script's post-flight section only checks for terminating namespaces,
unavailable API services, and ORPHANED_CRD_GROUPS-style operator-runtime CRDs.
Helm-managed CRDs that survive the deletion loop go unnoticed until the next
deploy.sh run rejects them. After PR NVIDIA#599 made the deletion pipelines warn-on-
failure instead of script-killing, this check is the missing verification surface
that confirms the warnings did not point to actual leftovers.

The check mirrors the existing ORPHANED_CRD_GROUPS post-flight pattern: per-
release kubectl|jq pipeline guarded with `|| true`, results accumulated into a
helm_orphaned_crds variable, single warning emitted if non-empty. When cleanup
succeeded normally the variable stays empty and the check is silent.

Suggested by CodeRabbit review on PR NVIDIA#599.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Add a verification pass at the end of undeploy.sh that re-enumerates CRDs
filtered by the same meta.helm.sh/release-name annotation used by the
per-component CRD deletion loop, and surfaces any leftovers as a post-flight
warning.

Today the script's post-flight section only checks for terminating namespaces,
unavailable API services, and ORPHANED_CRD_GROUPS-style operator-runtime CRDs.
Helm-managed CRDs that survive the deletion loop go unnoticed until the next
deploy.sh run rejects them. After PR NVIDIA#599 made the deletion pipelines warn-on-
failure instead of script-killing, this check is the missing verification
surface that confirms the warnings did not point to actual leftovers.

The check mirrors the existing ORPHANED_CRD_GROUPS post-flight pattern: per-
release kubectl|jq pipeline guarded with `|| true`, results accumulated into
a helm_orphaned_crds variable, single warning emitted if non-empty. When
cleanup succeeded normally the variable stays empty and the check is silent.

CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded
from the rescan: the earlier `kubectl delete crd ... --wait=false` returns
immediately, so a slow finalizer can leave the CRD still listed even though
it is being cleaned up normally. Only truly stuck (not-yet-terminating) CRDs
should be surfaced as a warning.

Suggested by CodeRabbit review on PR NVIDIA#599; terminating-CRD filter added in
response to review feedback on PR NVIDIA#600.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
The orphaned-CRD cleanup pipeline introduced in NVIDIA#477 and the two helper
functions delete_release_cluster_resources and force_clear_namespace_finalizers
all share the shape:

    kubectl get ... 2>/dev/null \
      | jq -r '...' 2>/dev/null \
      | while read -r name; do ... done

With set -euo pipefail, a single transient kubectl failure (API 502, brief
auth refresh, etc.) causes the pipeline to exit non-zero and terminates the
script. In a typical run with ~15 components, this means any 100 ms control-plane
hiccup can kill undeploy.sh after Helm uninstalls succeeded but before
namespace cleanup, leaving the cluster in a partially-cleaned state that
the user must finish by hand.

Append a warn-on-failure handler to each of the three pipelines so transient
API errors during cleanup are tolerated AND visible: the script keeps going
into the post-flight verification phase, and operators see a clear stderr
warning naming the failing context (kind / release / namespace / component)
instead of nothing. The adjacent ORPHANED_CRD_GROUPS loop and stuck_crds
extraction already use the "tolerate-and-continue" pattern; this PR brings
the three remaining sites into line and adds context-rich warnings per
CodeRabbit review.
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-pipefail-resilience branch from db18150 to a5e873e Compare April 17, 2026 00:19
@github-actions github-actions Bot added size/M and removed size/XS labels Apr 17, 2026
@yuanchen8911
Contributor Author

Addressed the test-coverage residual risk in the latest force-push (commit a5e873e, single squashed commit).

Added TestUndeployScript_TransientFailureWarnsAndContinues in pkg/bundler/deployer/helm/helm_test.go. The test:

  1. Generates undeploy.sh from the template with a synthetic two-component bundle (cert-manager + gpu-operator).
  2. Stubs kubectl on PATH to fail every call except api-resources (which succeeds with a minimal kind list, so the helpers reach the inner pipeline we want to exercise).
  3. Sources each of the three sites and invokes them under set -euo pipefail:
    • L97-L103 (delete_release_cluster_resources) — extracted via awk '/^delete_release_cluster_resources\(\) \{/,/^}/'
    • L150-L154 (force_clear_namespace_finalizers) — same awk extraction pattern
    • L296-L302 (per-component orphan-CRD inline loop) — extracted via sed -n '/Clean up orphaned CRDs that were owned/,/Clean up CRDs created by operators at runtime/p'
  4. Asserts both:
    • Wrapper exited 0 — proves set -e was not triggered (i.e., the done || … guard works).
    • Warning: on stderr with the expected context — proves the warn-pattern is what's catching the failure (catches a hypothetical revert to the silent done || true form).

Both regression directions are covered:

  • Revert to bare done → wrapper exits non-zero → t.Fatalf fires.
  • Revert to silent done || true → no Warning: on stderr → t.Errorf fires.
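The PATH-stubbing idea the test relies on can be sketched as follows. The stub name and the minimal kind list here are hypothetical, not the test's actual fixture: the fake kubectl succeeds only for api-resources so the helpers reach the inner pipeline, and fails everything else to simulate a transient API error.

```shell
#!/usr/bin/env bash
# Place a fake kubectl first on PATH: it succeeds only for 'api-resources'
# and fails for everything else, simulating a transient API failure.
set -euo pipefail
stubdir="$(mktemp -d)"
cat > "${stubdir}/kubectl" <<'EOF'
#!/usr/bin/env bash
if [ "${1:-}" = "api-resources" ]; then
  printf 'clusterroles\nvalidatingwebhookconfigurations\n'  # minimal kind list
  exit 0
fi
exit 1   # every other call simulates a transient API failure
EOF
chmod +x "${stubdir}/kubectl"
export PATH="${stubdir}:${PATH}"

kubectl api-resources                        # succeeds with the kind list
kubectl get crd 2>/dev/null || echo "stubbed kubectl failed as intended"
rm -rf "${stubdir}"
```

Running the extracted cleanup functions with this stub on PATH exercises exactly the done || … guard under set -euo pipefail.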

Local results:

=== RUN   TestUndeployScript_TransientFailureWarnsAndContinues
=== RUN   TestUndeployScript_TransientFailureWarnsAndContinues/delete_release_cluster_resources
=== RUN   TestUndeployScript_TransientFailureWarnsAndContinues/force_clear_namespace_finalizers
=== RUN   TestUndeployScript_TransientFailureWarnsAndContinues/orphan_crd_inline_loop
--- PASS: TestUndeployScript_TransientFailureWarnsAndContinues (0.34s)
    --- PASS: TestUndeployScript_TransientFailureWarnsAndContinues/delete_release_cluster_resources (0.23s)
    --- PASS: TestUndeployScript_TransientFailureWarnsAndContinues/force_clear_namespace_finalizers (0.05s)
    --- PASS: TestUndeployScript_TransientFailureWarnsAndContinues/orphan_crd_inline_loop (0.04s)
PASS
ok  	github.com/NVIDIA/aicr/pkg/bundler/deployer/helm	1.939s

Full pkg/bundler/deployer/helm/... suite remains green with -race.

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),
NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
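The awk 'NF' vs grep -v '^$' point in item 2 is easy to demonstrate. A standalone sketch, independent of the template: grep exits 1 when it selects no lines, which aborts a pipefail script when a release has no CRDs, while awk 'NF' drops blank lines and exits 0 either way.

```shell
#!/usr/bin/env bash
# grep exits 1 when nothing is selected, so under pipefail an empty CRD
# list aborts the script; awk 'NF' drops blank lines but always exits 0.
set -euo pipefail

if bash -c "set -euo pipefail; printf '\n\n' | grep -v '^$'; echo unreachable"; then
  echo "grep path survived"
else
  echo "grep path aborted under pipefail"
fi

printf '\n\n' | awk 'NF'   # nothing to print, still exits 0
echo "awk path continues"
```
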
@yuanchen8911
Contributor Author

Superseded by #602, which consolidates this change with the follow-up CRD-discovery pre-flight and post-flight work (#600, #601). Closing to keep review focused on the combined PR.

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),
NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.
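The real script does this filtering with jq over live `kubectl` output; as a self-contained stand-in, the same exclusion can be shown with a fixed listing (CRD name plus deletionTimestamp, `<none>` when unset) filtered by awk:

```shell
#!/usr/bin/env bash
set -euo pipefail

list_helm_crds() {  # stub for the annotation-filtered CRD listing
  cat <<'EOF'
nimservices.apps.nvidia.com 2026-04-17T10:02:11Z
skyhooks.skyhook.nvidia.com <none>
EOF
}

# Keep only CRDs with no deletionTimestamp -- those are truly stuck.
# Ones already terminating are just slow finalizers finishing up after
# the earlier --wait=false delete, so they are not worth a warning.
stuck=$(list_helm_crds | awk '$2 == "<none>" {print $1}')
if [ -n "$stuck" ]; then
  echo "Warning: CRDs left behind after undeploy:" >&2
  echo "$stuck" >&2
fi
```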

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.
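The shape of that test, sketched in shell with an illustrative helper body (the real helper lives in the generated template):

```shell
#!/usr/bin/env bash
set -euo pipefail

kubectl() { return 1; }  # stub: every API call fails

delete_release_cluster_resources() {
  kubectl get crd -o name | while read -r crd; do
    kubectl delete "$crd" --wait=false
  done || echo "Warning: cluster-resource cleanup failed; continuing" >&2
}

# The helper must return 0 despite the failing stub, and must leave the
# warning on stderr for the caller to see.
rc=0
stderr_out=$(delete_release_cluster_resources 2>&1) || rc=$?
echo "helper rc=${rc}"
```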

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),
NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
yuanchen8911 added a commit that referenced this pull request Apr 17, 2026