
fix(bundler): make undeploy.sh resilient to transient kubectl failures#599

Closed
yuanchen8911 wants to merge 1 commit into NVIDIA:main from yuanchen8911:fix/undeploy-pipefail-resilience

Conversation

@yuanchen8911
Contributor

@yuanchen8911 yuanchen8911 commented Apr 16, 2026

Summary

Add a trailing || true to three kubectl | jq | while pipelines in the generated undeploy.sh so a transient API failure during post-uninstall cleanup no longer kills the script before namespace deletion.

Motivation / Context

With set -euo pipefail (already at the top of the generated script), a single non-zero return from kubectl get crd or kubectl get <kind> — even a brief EKS control-plane 502 or an auth-refresh hiccup — fails the entire pipeline and exits the script. In a typical run with ~15 components, the orphaned-CRD cleanup pipeline executes 15 times back-to-back, so any momentary control-plane flap kills undeploy.sh after the Helm uninstalls have succeeded but before taint removal, orphan-CRD cleanup, and namespace deletion. The cluster ends up half-cleaned and the user has to finish manually.

Hit live today on aicr-cuj2: all 15 Helm releases uninstalled, then the script exited 1 immediately after "Removing skyhook.nvidia.com node taints…", leaving every component namespace Active.

The two helper functions delete_release_cluster_resources and force_clear_namespace_finalizers carry the same pipeline shape and the same exposure.

The adjacent ORPHANED_CRD_GROUPS loop and stuck_crds=$(…) extraction already use the … || true pattern; this PR brings the three remaining sites into line.
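The failure mode can be reproduced in a few lines. A minimal sketch, with stand-ins rather than the generated script: false plays a kubectl call hit by a transient API error, cat plays jq.

```shell
#!/usr/bin/env bash
# Minimal repro of the pipefail failure mode. 'false' stands in for a
# kubectl call hit by a transient API error; 'cat' stands in for jq.
set -euo pipefail

run_cleanup() {  # $1 = text appended after the pipeline's closing 'done'
  bash -c "
    set -euo pipefail
    false | cat | while read -r name; do echo \"deleting \$name\"; done $1
    echo 'reached namespace deletion'
  "
}

# Unguarded: pipefail propagates the failure and set -e kills the script.
run_cleanup "" || echo "script died before namespace deletion"

# Guarded: the trailing '|| true' absorbs the failure; cleanup continues.
run_cleanup "|| true"
```

The unguarded call never reaches its final echo; the guarded one does.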

Fixes: N/A
Related: #477 (introduced the per-component CRD cleanup pipeline)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)

Implementation Notes

Three single-token additions (done → done || true) plus one || true on a single kubectl delete crd call, all in pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl:

  • delete_release_cluster_resources (line 103)
  • force_clear_namespace_finalizers (line 154)
  • per-component orphan-CRD cleanup (lines 301–302)

The delete_orphaned_webhooks_for_ns helper at line 117 already wraps its inner pipeline in { … || true; }, so it doesn't need the same treatment.

Trade-off: cleanup steps that fail due to a real (non-transient) API problem will now be silently skipped. The post-flight verification block at the end of the script already warns about leftover namespaces / stuck CRDs / stale webhooks, so a real cleanup failure surfaces there instead of as set -e exit 1 in the middle.
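For reference, the guarded pipeline shape looks roughly like this. An illustrative sketch, not the verbatim template: kubectl and jq are stubbed as always-failing / pass-through so the snippet runs anywhere.

```shell
#!/usr/bin/env bash
# Illustrative shape of a patched cleanup pipeline. The stubs below are
# stand-ins: every kubectl call fails, as during an API flap.
set -euo pipefail
kubectl() { return 1; }
jq() { cat; }   # pass-through stand-in for the real jq filter

kubectl get crd -o json 2>/dev/null \
  | jq -r '.items[].metadata.name' 2>/dev/null \
  | while read -r name; do
      kubectl delete crd "${name}" --ignore-not-found --wait=false || true
    done || true   # <- the single-token addition from this PR

echo "continuing to post-uninstall cleanup"
```

Without the trailing || true, pipefail propagates the stubbed failure and set -e ends the script before the final echo.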

Testing

make lint   # go vet + golangci-lint clean; pre-existing yamllint errors are in untracked .codex-worktrees/
GOFLAGS=-mod=vendor go test -race -count=1 ./pkg/bundler/...
  • pkg/bundler/... tests pass with -race (all subpackages green).
  • Re-built the CLI, regenerated a bundle, and confirmed the four new || true guards appear at the expected positions in bundles/undeploy.sh.
  • Live re-test on aicr-cuj2 after the patch is pending — will run once the cluster's intermittent 502s settle.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: Generated-script change only. No API surface, no data migration. Reverting is a one-commit revert of this PR; existing bundles can either be regenerated or hand-patched.

Checklist

  • Tests pass locally (make test with -race on pkg/bundler/...)
  • Linter passes (make lint — go vet + golangci-lint clean; yamllint failures are in pre-existing untracked .codex-worktrees/)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (no new functionality; behavior change is "tolerate transient failure")
  • I updated docs if user-facing behavior changed (no user-facing behavior change beyond resilience)
  • Changes follow existing patterns in the codebase (matches the … || true pattern already used in the same file)
  • Commits are cryptographically signed (git commit -S)

Summary by CodeRabbit

  • Bug Fixes

    • Undeployment now tolerates transient failures during resource and CRD cleanup: cleanup continues instead of aborting, and non-fatal issues emit warnings so undeploy completes more reliably.
  • Tests

    • Added an integration-style test ensuring transient kubectl failures produce warnings and do not stop the undeploy flow.

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner April 16, 2026 23:41
@coderabbitai

coderabbitai Bot commented Apr 16, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

Modified the undeploy script to prevent strict-shell (set -euo pipefail) from aborting undeploy pipelines by appending safe fallbacks (|| true and done || echo "Warning: ... " >&2) so transient failures emit warnings and cleanup continues; added a test that simulates kubectl failures and asserts warnings are produced.

Changes

Cohort / File(s): Summary

  • Undeploy script (error handling), pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl: Adjusted cleanup pipelines to tolerate transient pipeline failures: appended || true and done || echo "Warning: ..." >&2 fallbacks so failures warn and cleanup continues.
  • Test added, pkg/bundler/deployer/helm/helm_test.go: Added TestUndeployScript_TransientFailureWarnsAndContinues: generates undeploy.sh, stubs kubectl to fail for non-api-resources calls, runs extracted cleanup sections, and asserts successful exit plus expected Warning: stderr output. Imports updated (bytes, os/exec).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through loops where warnings glow,
I nudged the pipes so scripts can go,
A whispered echo, soft and bright,
Lets cleanup trundle through the night,
I nibble errors, then say "all right." 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: Passed. The title clearly and concisely describes the main change (making undeploy.sh resilient to transient kubectl failures), which aligns with the core modification across both modified files.
  • Docstring Coverage: Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Line 103: Occurrences of the silent suppression pattern "done || true" hide
failures during cleanup; change each usage to capture the command exit status
and, on non-zero, emit a clear warning including the failing command/context and
its output (stderr) so operators see RBAC/auth/API/jq failures — e.g., run the
pipeline, store its exit code, and if non‑zero print a descriptive warning to
stderr (matching the style used by the post-flight checks for terminating
namespaces/stale webhooks/unavailable APIs) rather than swallowing errors;
update each "done || true" instance to this conditional-warn pattern and ensure
the warning contains the command name/context for easier debugging.
- Around line 301-302: Add a post‑flight verification that mirrors the deletion
block's Helm-annotation filtering: after the kubectl delete loop that targets
CRDs with the Helm annotation (the block containing kubectl delete crd "${name}"
--ignore-not-found --wait=false), scan remaining CRDs for the same Helm
release/annotation (the same annotation key used in the deletion loop) and
report or fail if any matching CRDs still exist—similar to how
ORPHANED_CRD_GROUPS is checked in the post‑flight section (the check around
ORPHANED_CRD_GROUPS); implement this by enumerating CRDs, filtering by the Helm
annotation used earlier, and surfacing any leftovers in the post‑flight check so
Helm-managed CRDs are covered.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: af834a8f-0d2c-4397-9f56-b49f95e6f28a

📥 Commits

Reviewing files that changed from the base of the PR and between 1b25135 and cd380d6.

📒 Files selected for processing (1)
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl Outdated
Comment on lines +301 to +302
kubectl delete crd "${name}" --ignore-not-found --wait=false || true
done || true


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify annotation-based cleanup exists only in deletion path and not in post-flight verification.
rg -n -C3 'meta\.helm\.sh/release-name|meta\.helm\.sh/release-namespace' pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
rg -n -C3 'orphaned_crds|ORPHANED_CRD_GROUPS|Post-flight' pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

Repository: NVIDIA/aicr

Length of output: 3508


Add post-flight verification for Helm-annotated CRDs to match the deletion block coverage.

Lines 301-302 suppress both delete and enumeration failures. If this block fails silently, Helm-owned CRDs can remain undetected because post-flight checks (lines 425-435) only scan ORPHANED_CRD_GROUPS, which covers operator-created CRDs, not Helm-managed ones. The annotation-based filtering in the deletion block (lines 296-298) has no corresponding post-flight check.

Suggested post-flight check
+# Verify Helm-annotated CRDs from bundle releases are gone
+remaining_release_crds=""
+{{- range .ComponentsReversed }}{{ if .HasChart }}
+left=$(kubectl get crd -o json 2>/dev/null \
+  | jq -r --arg rel "{{ .Name }}" --arg ns "{{ .Namespace }}" \
+    '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)
+if [[ -n "${left}" ]]; then
+  remaining_release_crds="${remaining_release_crds} ${left}"
+fi
+{{- end }}{{ end }}
+if [[ -n "${remaining_release_crds}" ]]; then
+  echo "WARNING: Helm-owned CRDs still present:${remaining_release_crds}"
+  postflight_issues=true
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 301 - 302,
Add a post‑flight verification that mirrors the deletion block's Helm-annotation
filtering: after the kubectl delete loop that targets CRDs with the Helm
annotation (the block containing kubectl delete crd "${name}" --ignore-not-found
--wait=false), scan remaining CRDs for the same Helm release/annotation (the
same annotation key used in the deletion loop) and report or fail if any
matching CRDs still exist—similar to how ORPHANED_CRD_GROUPS is checked in the
post‑flight section (the check around ORPHANED_CRD_GROUPS); implement this by
enumerating CRDs, filtering by the Helm annotation used earlier, and surfacing
any leftovers in the post‑flight check so Helm-managed CRDs are covered.

@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-pipefail-resilience branch from cd380d6 to db18150 Compare April 16, 2026 23:49
@yuanchen8911
Contributor Author

Thanks for the review @coderabbitai.

Suggestion 1 (don't silently suppress): addressed in the force-push — each done || true is now done || echo "Warning: <kind/release/namespace/component context> failed (kubectl get / jq error); ..." >&2. Operators get a stderr signal naming the failing site instead of nothing, while the script still continues into post-flight verification.
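The warn-and-continue guard shape looks something like this. A sketch with illustrative context values, not the exact template wording: false plays the failing kubectl|jq stages.

```shell
#!/usr/bin/env bash
# Sketch of the warn-and-continue guard. 'false' plays the failing
# kubectl|jq stages; the kind/release/ns values are illustrative.
set -euo pipefail
kind=ClusterRole release=gpu-operator ns=gpu-operator

false | while read -r name; do
  echo "cleaning ${name}"
done || echo "Warning: ${kind} cleanup pipeline for release ${release}/${ns} failed (kubectl get / jq error); continuing to post-flight" >&2

echo "still running"   # set -e was not triggered
```

The warning goes to stderr so stdout log parsing stays clean, and the script proceeds to the next step.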

Suggestion 2 (Helm-annotation post-flight check): deferring to a follow-up. It's a useful verification surface but adds new behavior beyond "stop the script from dying on transient flaps", and I'd like to keep this PR's blast radius minimal. Will file a separate issue/PR for the post-flight CRD-by-annotation check.

@coderabbitai

coderabbitai Bot commented Apr 16, 2026


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the current state of the undeploy.sh.tmpl file to verify the warning echo changes
fd 'undeploy.sh.tmpl' --exec cat {}

Length of output: 21031


@yuanchen8911: Confirmed — the force-pushed version looks great. All three pipeline guards now carry meaningful, context-rich warning messages:

Site: warning message
  • delete_release_cluster_resources: "Warning: ${kind} cleanup pipeline for release ${release}/${ns} failed …; leftovers will surface in post-flight"
  • force_clear_namespace_finalizers: "Warning: finalizer-clear pipeline for ${kind} in ${ns} failed …; namespace may stay Terminating"
  • Orphan-CRD cleanup: "Warning: orphan-CRD cleanup for <component>/<ns> failed …; leftovers will surface in post-flight"
Each message names the specific failing site and the kind/release/namespace/component context, routes to stderr so it doesn't pollute stdout log parsing, and directs operators to post-flight verification — a clear improvement over silent suppression.

On Suggestion 2 — totally reasonable to keep this PR's scope tight. Would you like me to open a follow-up GitHub issue to track the Helm-annotation post-flight CRD check so it doesn't get lost?


@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (1)

103-103: ⚠️ Potential issue | 🟠 Major

Post-flight guarantee is overstated for Helm-owned CRDs.

These warnings say leftovers “will surface in post-flight”, but post-flight currently checks ORPHANED_CRD_GROUPS and not Helm-annotation-owned CRDs. A failed annotation-based cleanup can remain undetected.

Consider either adding a matching post-flight check for Helm-owned CRDs or softening the warning text to avoid a false guarantee.

Also applies to: 302-302

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` at line 103, The
warning text in undeploy.sh.tmpl (the echo line that starts "Warning: ${kind}
cleanup pipeline for release ${release}/${ns} failed...") promises leftovers
"will surface in post-flight" which is misleading because post-flight currently
only inspects ORPHANED_CRD_GROUPS and does not detect Helm-annotation-owned
CRDs; either change this message to a softer, accurate statement (e.g., remove
the guarantee about post-flight) or implement a complementary post-flight check
that scans for Helm-owned CRDs via their Helm annotations (e.g.,
meta.helm.sh/release-name/namespace annotation keys) and cross-references them
with release/${ns} so failures are detected. Update the echo in undeploy.sh.tmpl
if softening the message, or add the new post-flight check logic and ensure it
references ORPHANED_CRD_GROUPS and Helm annotation keys consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Line 103: The warning text in undeploy.sh.tmpl (the echo line that starts
"Warning: ${kind} cleanup pipeline for release ${release}/${ns} failed...")
promises leftovers "will surface in post-flight" which is misleading because
post-flight currently only inspects ORPHANED_CRD_GROUPS and does not detect
Helm-annotation-owned CRDs; either change this message to a softer, accurate
statement (e.g., remove the guarantee about post-flight) or implement a
complementary post-flight check that scans for Helm-owned CRDs via their Helm
annotations (e.g., meta.helm.sh/release-name/namespace annotation keys) and
cross-references them with release/${ns} so failures are detected. Update the
echo in undeploy.sh.tmpl if softening the message, or add the new post-flight
check logic and ensure it references ORPHANED_CRD_GROUPS and Helm annotation
keys consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: b3355203-b3ef-4e3e-a478-08c5838395fe

📥 Commits

Reviewing files that changed from the base of the PR and between cd380d6 and db18150.

📒 Files selected for processing (1)
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 16, 2026
Add a verification pass at the end of undeploy.sh that re-enumerates CRDs
filtered by the same meta.helm.sh/release-name annotation used by the
per-component CRD deletion loop, and surfaces any leftovers as a post-flight
warning.

Today the script's post-flight section only checks for terminating namespaces,
unavailable API services, and ORPHANED_CRD_GROUPS-style operator-runtime CRDs.
Helm-managed CRDs that survive the deletion loop go unnoticed until the next
deploy.sh run rejects them. After PR NVIDIA#599 made the deletion pipelines warn-on-
failure instead of script-killing, this check is the missing verification surface
that confirms the warnings did not point to actual leftovers.

The check mirrors the existing ORPHANED_CRD_GROUPS post-flight pattern: per-
release kubectl|jq pipeline guarded with `|| true`, results accumulated into a
helm_orphaned_crds variable, single warning emitted if non-empty. When cleanup
succeeded normally the variable stays empty and the check is silent.

Suggested by CodeRabbit review on PR NVIDIA#599.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Add a verification pass at the end of undeploy.sh that re-enumerates CRDs
filtered by the same meta.helm.sh/release-name annotation used by the
per-component CRD deletion loop, and surfaces any leftovers as a post-flight
warning.

Today the script's post-flight section only checks for terminating namespaces,
unavailable API services, and ORPHANED_CRD_GROUPS-style operator-runtime CRDs.
Helm-managed CRDs that survive the deletion loop go unnoticed until the next
deploy.sh run rejects them. After PR NVIDIA#599 made the deletion pipelines warn-on-
failure instead of script-killing, this check is the missing verification
surface that confirms the warnings did not point to actual leftovers.

The check mirrors the existing ORPHANED_CRD_GROUPS post-flight pattern: per-
release kubectl|jq pipeline guarded with `|| true`, results accumulated into
a helm_orphaned_crds variable, single warning emitted if non-empty. When
cleanup succeeded normally the variable stays empty and the check is silent.

CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded
from the rescan: the earlier `kubectl delete crd ... --wait=false` returns
immediately, so a slow finalizer can leave the CRD still listed even though
it is being cleaned up normally. Only truly stuck (not-yet-terminating) CRDs
should be surfaced as a warning.

Suggested by CodeRabbit review on PR NVIDIA#599; terminating-CRD filter added in
response to review feedback on PR NVIDIA#600.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
The orphaned-CRD cleanup pipeline introduced in NVIDIA#477 and the two helper
functions delete_release_cluster_resources and force_clear_namespace_finalizers
all share the shape:

    kubectl get ... 2>/dev/null \
      | jq -r '...' 2>/dev/null \
      | while read -r name; do ... done

With set -euo pipefail, a single transient kubectl failure (API 502, brief
auth refresh, etc.) causes the pipeline to exit non-zero and terminates the
script. In a typical run with ~15 components, this means any 100 ms control-plane
hiccup can kill undeploy.sh after Helm uninstalls succeeded but before
namespace cleanup, leaving the cluster in a partially-cleaned state that
the user must finish by hand.

Append a warn-on-failure handler to each of the three pipelines so transient
API errors during cleanup are tolerated AND visible: the script keeps going
into the post-flight verification phase, and operators see a clear stderr
warning naming the failing context (kind / release / namespace / component)
instead of nothing. The adjacent ORPHANED_CRD_GROUPS loop and stuck_crds
extraction already use the "tolerate-and-continue" pattern; this PR brings
the three remaining sites into line and adds context-rich warnings per
CodeRabbit review.
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-pipefail-resilience branch from db18150 to a5e873e Compare April 17, 2026 00:19
@github-actions github-actions Bot added size/M and removed size/XS labels Apr 17, 2026
@yuanchen8911
Contributor Author

Addressed the test-coverage residual risk in the latest force-push (commit a5e873e, single squashed commit).

Added TestUndeployScript_TransientFailureWarnsAndContinues in pkg/bundler/deployer/helm/helm_test.go. The test:

  1. Generates undeploy.sh from the template with a synthetic two-component bundle (cert-manager + gpu-operator).
  2. Stubs kubectl on PATH to fail every call except api-resources (which succeeds with a minimal kind list, so the helpers reach the inner pipeline we want to exercise).
  3. Sources each of the three sites and invokes them under set -euo pipefail:
    • L97-L103 (delete_release_cluster_resources) — extracted via awk '/^delete_release_cluster_resources\(\) \{/,/^}/'
    • L150-L154 (force_clear_namespace_finalizers) — same awk extraction pattern
    • L296-L302 (per-component orphan-CRD inline loop) — extracted via sed -n '/Clean up orphaned CRDs that were owned/,/Clean up CRDs created by operators at runtime/p'
  4. Asserts both:
    • Wrapper exited 0 — proves set -e was not triggered (i.e., the done || … guard works).
    • Warning: on stderr with the expected context — proves the warn-pattern is what's catching the failure (catches a hypothetical revert to the silent done || true form).

Both regression directions are covered:

  • Revert to bare done → wrapper exits non-zero → t.Fatalf fires.
  • Revert to silent done || true → no Warning: on stderr → t.Errorf fires.
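The PATH-stubbing idea the test relies on can be sketched as follows. The stub name and the minimal kind list here are hypothetical, not the test's actual fixture: the fake kubectl succeeds only for api-resources so the helpers reach the inner pipeline, and fails everything else to simulate a transient API error.

```shell
#!/usr/bin/env bash
# Place a fake kubectl first on PATH: it succeeds only for 'api-resources'
# and fails for everything else, simulating a transient API failure.
set -euo pipefail
stubdir="$(mktemp -d)"
cat > "${stubdir}/kubectl" <<'EOF'
#!/usr/bin/env bash
if [ "${1:-}" = "api-resources" ]; then
  printf 'clusterroles\nvalidatingwebhookconfigurations\n'  # minimal kind list
  exit 0
fi
exit 1   # every other call simulates a transient API failure
EOF
chmod +x "${stubdir}/kubectl"
export PATH="${stubdir}:${PATH}"

kubectl api-resources                        # succeeds with the kind list
kubectl get crd 2>/dev/null || echo "stubbed kubectl failed as intended"
rm -rf "${stubdir}"
```

Running the extracted cleanup functions with this stub on PATH exercises exactly the done || … guard under set -euo pipefail.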

Local results:

=== RUN   TestUndeployScript_TransientFailureWarnsAndContinues
=== RUN   TestUndeployScript_TransientFailureWarnsAndContinues/delete_release_cluster_resources
=== RUN   TestUndeployScript_TransientFailureWarnsAndContinues/force_clear_namespace_finalizers
=== RUN   TestUndeployScript_TransientFailureWarnsAndContinues/orphan_crd_inline_loop
--- PASS: TestUndeployScript_TransientFailureWarnsAndContinues (0.34s)
    --- PASS: TestUndeployScript_TransientFailureWarnsAndContinues/delete_release_cluster_resources (0.23s)
    --- PASS: TestUndeployScript_TransientFailureWarnsAndContinues/force_clear_namespace_finalizers (0.05s)
    --- PASS: TestUndeployScript_TransientFailureWarnsAndContinues/orphan_crd_inline_loop (0.04s)
PASS
ok  	github.com/NVIDIA/aicr/pkg/bundler/deployer/helm	1.939s

Full pkg/bundler/deployer/helm/... suite remains green with -race.

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),
NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
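The awk 'NF' vs grep -v '^$' point in item 2 is easy to demonstrate. A standalone sketch, independent of the template: grep exits 1 when it selects no lines, which aborts a pipefail script when a release has no CRDs, while awk 'NF' drops blank lines and exits 0 either way.

```shell
#!/usr/bin/env bash
# grep exits 1 when nothing is selected, so under pipefail an empty CRD
# list aborts the script; awk 'NF' drops blank lines but always exits 0.
set -euo pipefail

if bash -c "set -euo pipefail; printf '\n\n' | grep -v '^$'; echo unreachable"; then
  echo "grep path survived"
else
  echo "grep path aborted under pipefail"
fi

printf '\n\n' | awk 'NF'   # nothing to print, still exits 0
echo "awk path continues"
```
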
@yuanchen8911
Contributor Author

Superseded by #602, which consolidates this change with the follow-up CRD-discovery pre-flight and post-flight work (#600, #601). Closing to keep review focused on the combined PR.

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),
NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.
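The real script does this filtering with jq over live `kubectl` output; as a self-contained stand-in, the same exclusion can be shown with a fixed listing (CRD name plus deletionTimestamp, `<none>` when unset) filtered by awk:

```shell
#!/usr/bin/env bash
set -euo pipefail

list_helm_crds() {  # stub for the annotation-filtered CRD listing
  cat <<'EOF'
nimservices.apps.nvidia.com 2026-04-17T10:02:11Z
skyhooks.skyhook.nvidia.com <none>
EOF
}

# Keep only CRDs with no deletionTimestamp -- those are truly stuck.
# Ones already terminating are just slow finalizers finishing up after
# the earlier --wait=false delete, so they are not worth a warning.
stuck=$(list_helm_crds | awk '$2 == "<none>" {print $1}')
if [ -n "$stuck" ]; then
  echo "Warning: CRDs left behind after undeploy:" >&2
  echo "$stuck" >&2
fi
```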

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.
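The shape of that test, sketched in shell with an illustrative helper body (the real helper lives in the generated template):

```shell
#!/usr/bin/env bash
set -euo pipefail

kubectl() { return 1; }  # stub: every API call fails

delete_release_cluster_resources() {
  kubectl get crd -o name | while read -r crd; do
    kubectl delete "$crd" --wait=false
  done || echo "Warning: cluster-resource cleanup failed; continuing" >&2
}

# The helper must return 0 despite the failing stub, and must leave the
# warning on stderr for the caller to see.
rc=0
stderr_out=$(delete_release_cluster_resources 2>&1) || rc=$?
echo "helper rc=${rc}"
```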

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),
NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
yuanchen8911 added a commit that referenced this pull request Apr 17, 2026