fix(bundler): make undeploy.sh resilient to transient kubectl failures #599
yuanchen8911 wants to merge 1 commit into NVIDIA:main
Conversation
> Caution: Review failed. Pull request was closed or merged during review.

📝 Walkthrough

Modified the undeploy script to prevent strict-shell (`set -euo pipefail`) pipeline failures from aborting cleanup.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Line 103: Occurrences of the silent suppression pattern "done || true" hide
failures during cleanup; change each usage to capture the command exit status
and, on non-zero, emit a clear warning including the failing command/context and
its output (stderr) so operators see RBAC/auth/API/jq failures — e.g., run the
pipeline, store its exit code, and if non‑zero print a descriptive warning to
stderr (matching the style used by the post-flight checks for terminating
namespaces/stale webhooks/unavailable APIs) rather than swallowing errors;
update each "done || true" instance to this conditional-warn pattern and ensure
the warning contains the command name/context for easier debugging.
- Around line 301-302: Add a post‑flight verification that mirrors the deletion
block's Helm-annotation filtering: after the kubectl delete loop that targets
CRDs with the Helm annotation (the block containing kubectl delete crd "${name}"
--ignore-not-found --wait=false), scan remaining CRDs for the same Helm
release/annotation (the same annotation key used in the deletion loop) and
report or fail if any matching CRDs still exist—similar to how
ORPHANED_CRD_GROUPS is checked in the post‑flight section (the check around
ORPHANED_CRD_GROUPS); implement this by enumerating CRDs, filtering by the Helm
annotation used earlier, and surfacing any leftovers in the post‑flight check so
Helm-managed CRDs are covered.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: af834a8f-0d2c-4397-9f56-b49f95e6f28a
📒 Files selected for processing (1)
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
    kubectl delete crd "${name}" --ignore-not-found --wait=false || true
done || true
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify annotation-based cleanup exists only in deletion path and not in post-flight verification.
rg -n -C3 'meta\.helm\.sh/release-name|meta\.helm\.sh/release-namespace' pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
rg -n -C3 'orphaned_crds|ORPHANED_CRD_GROUPS|Post-flight' pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

Repository: NVIDIA/aicr
Length of output: 3508
Add post-flight verification for Helm-annotated CRDs to match the deletion block coverage.
Lines 301-302 suppress both delete and enumeration failures. If this block fails silently, Helm-owned CRDs can remain undetected because post-flight checks (lines 425-435) only scan ORPHANED_CRD_GROUPS, which covers operator-created CRDs, not Helm-managed ones. The annotation-based filtering in the deletion block (lines 296-298) has no corresponding post-flight check.
Suggested post-flight check
+# Verify Helm-annotated CRDs from bundle releases are gone
+remaining_release_crds=""
+{{- range .ComponentsReversed }}{{ if .HasChart }}
+left=$(kubectl get crd -o json 2>/dev/null \
+ | jq -r --arg rel "{{ .Name }}" --arg ns "{{ .Namespace }}" \
+ '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)
+if [[ -n "${left}" ]]; then
+ remaining_release_crds="${remaining_release_crds} ${left}"
+fi
+{{- end }}{{ end }}
+if [[ -n "${remaining_release_crds}" ]]; then
+ echo "WARNING: Helm-owned CRDs still present:${remaining_release_crds}"
+ postflight_issues=true
+fi
Force-pushed cd380d6 to db18150
Thanks for the review @coderabbitai.

Suggestion 1 (don't silently suppress): addressed in the force-push. Each pipeline's trailing `done || true` is now a conditional warn, so enumeration failures are reported instead of swallowed.

Suggestion 2 (Helm-annotation post-flight check): deferring to a follow-up. It's a useful verification surface but adds new behavior beyond "stop the script from dying on transient flaps", and I'd like to keep this PR's blast radius minimal. Will file a separate issue/PR for the post-flight CRD-by-annotation check.
🧩 Analysis chain
🏁 Script executed: Length of output: 21031
Each message names the specific failing site and the kind/release/namespace/component context, routes to stderr so it doesn't pollute stdout log parsing, and directs operators to post-flight verification: a clear improvement over silent suppression.

On Suggestion 2: totally reasonable to keep this PR's scope tight. Would you like me to open a follow-up GitHub issue to track the Helm-annotation post-flight CRD check so it doesn't get lost?
♻️ Duplicate comments (1)
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (1)
103-103: ⚠️ Potential issue | 🟠 Major

Post-flight guarantee is overstated for Helm-owned CRDs.

These warnings say leftovers “will surface in post-flight”, but post-flight currently checks ORPHANED_CRD_GROUPS and not Helm-annotation-owned CRDs. A failed annotation-based cleanup can remain undetected. Consider either adding a matching post-flight check for Helm-owned CRDs or softening the warning text to avoid a false guarantee.
Also applies to: 302-302
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` at line 103, The warning text in undeploy.sh.tmpl (the echo line that starts "Warning: ${kind} cleanup pipeline for release ${release}/${ns} failed...") promises leftovers "will surface in post-flight" which is misleading because post-flight currently only inspects ORPHANED_CRD_GROUPS and does not detect Helm-annotation-owned CRDs; either change this message to a softer, accurate statement (e.g., remove the guarantee about post-flight) or implement a complementary post-flight check that scans for Helm-owned CRDs via their Helm annotations (e.g., meta.helm.sh/release-name/namespace annotation keys) and cross-references them with release/${ns} so failures are detected. Update the echo in undeploy.sh.tmpl if softening the message, or add the new post-flight check logic and ensure it references ORPHANED_CRD_GROUPS and Helm annotation keys consistently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: b3355203-b3ef-4e3e-a478-08c5838395fe
📒 Files selected for processing (1)
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by the same meta.helm.sh/release-name annotation used by the per-component CRD deletion loop, and surfaces any leftovers as a post-flight warning.

Today the script's post-flight section only checks for terminating namespaces, unavailable API services, and ORPHANED_CRD_GROUPS-style operator-runtime CRDs. Helm-managed CRDs that survive the deletion loop go unnoticed until the next deploy.sh run rejects them. After PR NVIDIA#599 made the deletion pipelines warn-on-failure instead of script-killing, this check is the missing verification surface that confirms the warnings did not point to actual leftovers.

The check mirrors the existing ORPHANED_CRD_GROUPS post-flight pattern: per-release kubectl|jq pipeline guarded with `|| true`, results accumulated into a helm_orphaned_crds variable, single warning emitted if non-empty. When cleanup succeeded normally the variable stays empty and the check is silent.

CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded from the rescan: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD still listed even though it is being cleaned up normally. Only truly stuck (not-yet-terminating) CRDs should be surfaced as a warning.

Suggested by CodeRabbit review on PR NVIDIA#599; terminating-CRD filter added in response to review feedback on PR NVIDIA#600.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
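The deletionTimestamp exclusion described above can be sketched with a jq filter like the following. The sample JSON and CRD names are fabricated for illustration; the annotation keys are Helm's standard `meta.helm.sh/*` ones.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Two fake CRDs: one stuck (no deletionTimestamp), one already terminating.
crds_json='{"items":[
  {"metadata":{"name":"stuck.example.com",
    "annotations":{"meta.helm.sh/release-name":"demo","meta.helm.sh/release-namespace":"demo-ns"}}},
  {"metadata":{"name":"terminating.example.com","deletionTimestamp":"2024-01-01T00:00:00Z",
    "annotations":{"meta.helm.sh/release-name":"demo","meta.helm.sh/release-namespace":"demo-ns"}}}
]}'

# Select CRDs owned by the release that are NOT already being deleted:
# a missing deletionTimestamp reads as null in jq, so `== null` keeps it.
leftovers=$(printf '%s' "${crds_json}" | jq -r --arg rel "demo" --arg ns "demo-ns" '
  .items[]
  | select(.metadata.annotations["meta.helm.sh/release-name"] == $rel
       and .metadata.annotations["meta.helm.sh/release-namespace"] == $ns
       and .metadata.deletionTimestamp == null)
  | .metadata.name')
echo "${leftovers}"   # only stuck.example.com survives the filter
```

Only the not-yet-terminating CRD is reported, matching the commit message's "truly stuck" criterion.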
The orphaned-CRD cleanup pipeline introduced in NVIDIA#477 and the two helper functions delete_release_cluster_resources and force_clear_namespace_finalizers all share the shape:

    kubectl get ... 2>/dev/null \
      | jq -r '...' 2>/dev/null \
      | while read -r name; do
          ...
        done

With set -euo pipefail, a single transient kubectl failure (API 502, brief auth refresh, etc.) causes the pipeline to exit non-zero and terminates the script. In a typical run with ~15 components, this means any 100 ms control-plane hiccup can kill undeploy.sh after Helm uninstalls succeeded but before namespace cleanup, leaving the cluster in a partially-cleaned state that the user must finish by hand.

Append a warn-on-failure handler to each of the three pipelines so transient API errors during cleanup are tolerated AND visible: the script keeps going into the post-flight verification phase, and operators see a clear stderr warning naming the failing context (kind / release / namespace / component) instead of nothing.

The adjacent ORPHANED_CRD_GROUPS loop and stuck_crds extraction already use the "tolerate-and-continue" pattern; this PR brings the three remaining sites into line and adds context-rich warnings per CodeRabbit review.
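The failure mode and the fix can be reproduced without a cluster. In this sketch `false` stands in for a kubectl call hit by a transient 502; the function names are illustrative.

```shell
#!/usr/bin/env bash
# Child shells isolate `set -euo pipefail` so each run's fate is observable.

run_unguarded() {
  bash -c 'set -euo pipefail
false | while read -r name; do echo "cleaning ${name}"; done
echo "post-flight reached"'
}

run_guarded() {
  bash -c 'set -euo pipefail
false | while read -r name; do echo "cleaning ${name}"; done \
  || echo "Warning: cleanup pipeline failed; continuing" >&2
echo "post-flight reached"'
}

# The unguarded script dies on the failing pipeline; the guarded one warns
# on stderr and continues into post-flight.
run_unguarded || echo "unguarded script died before post-flight"
run_guarded 2>/dev/null
```

Under pipefail the producer's exit status wins even though the `while` loop itself succeeds, which is exactly why the trailing `done` needs the `||` handler.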
Force-pushed db18150 to a5e873e
Addressed the test-coverage residual risk in the latest force-push (commit a5e873e, single squashed commit). Added `TestUndeployScript_TransientFailureWarnsAndContinues`.

Both regression directions are covered:

Local results: Full
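The kubectl-stubbing technique the test relies on can be sketched like this. `cleanup_helper` is a hypothetical stand-in for the helpers the real test sources from the generated undeploy.sh; only the stub-on-PATH mechanism is the point.

```shell
#!/usr/bin/env bash
set -uo pipefail   # pipefail but no errexit, so statuses can be inspected

# Drop a kubectl that always fails onto PATH, simulating a transient outage.
stub_dir=$(mktemp -d)
trap 'rm -rf "${stub_dir}"' EXIT
cat >"${stub_dir}/kubectl" <<'EOF'
#!/bin/sh
echo "simulated transient API failure" >&2
exit 1
EOF
chmod +x "${stub_dir}/kubectl"
PATH="${stub_dir}:${PATH}"

# Hypothetical cleanup helper written in the warn-on-failure style.
cleanup_helper() {
  kubectl get crd -o name 2>/dev/null \
    | while read -r name; do kubectl delete "${name}" --ignore-not-found; done \
    || echo "Warning: CRD cleanup pipeline failed; continuing" >&2
}

# Helper must return 0 and emit the warning on stderr despite kubectl failing.
if err=$(cleanup_helper 2>&1 >/dev/null); then
  echo "helper returned 0 and warned: ${err}"
fi
```

This mirrors the test's two regression directions: nonzero kubectl must not propagate as the helper's exit status, and the warning must still appear on stderr.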
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template.

1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
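Item 2's `awk 'NF'` rationale is easy to demonstrate: when the input has no non-empty lines (a release with no CRDs), grep's no-match exit status 1 trips pipefail, while awk exits 0 either way. A minimal comparison:

```shell
#!/usr/bin/env bash
set -uo pipefail   # pipefail without errexit so both statuses can be recorded

# Feed two empty lines through each non-empty-line filter.
printf '\n\n' | grep -v '^$'; grep_rc=$?
printf '\n\n' | awk 'NF';     awk_rc=$?

# Under `set -e` the grep form would have aborted the script right here.
echo "grep exit: ${grep_rc}, awk exit: ${awk_rc}"   # → grep exit: 1, awk exit: 0
```

`awk 'NF'` prints lines with at least one field and always exits 0, so it is the safe empty-line filter inside strict-mode pipelines.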
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... 
--wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR). Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... 
--wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR). Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... 
--wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR). Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... 
--wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... 
--wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... 
--wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
Summary

Add a trailing `|| true` to three `kubectl | jq | while` pipelines in the generated `undeploy.sh` so a transient API failure during post-uninstall cleanup no longer kills the script before namespace deletion.

Motivation / Context

With `set -euo pipefail` (already at the top of the generated script), a single `kubectl get crd` / `kubectl get <kind>` returning non-zero (even a brief EKS control-plane 502 or an auth refresh hiccup) fails the entire pipeline and exits the script. In a typical run with ~15 components the orphaned-CRD cleanup pipeline executes 15 times back-to-back, so any momentary control-plane flap kills `undeploy.sh` after the Helm uninstalls succeeded but before taint removal / orphan-CRD cleanup / namespace deletion. The cluster ends up half-cleaned and the user has to finish manually.

Hit live today on `aicr-cuj2`: all 15 Helm releases uninstalled, then the script exited 1 immediately after `Removing skyhook.nvidia.com node taints…`, leaving every component namespace `Active`.

The two helper functions `delete_release_cluster_resources` and `force_clear_namespace_finalizers` carry the same pipeline shape and the same exposure. The adjacent `ORPHANED_CRD_GROUPS` loop and `stuck_crds=$(…)` extraction already use the `… || true` pattern; this PR brings the three remaining sites into line.

Fixes: N/A
Related: #477 (introduced the per-component CRD cleanup pipeline)
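A minimal repro of the failure mode and the fix, with `false` standing in for a kubectl call hit by a transient 502 (the function name is illustrative, not taken from the template):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for one of the three cleanup pipelines: the producer fails
# (as kubectl does during a control-plane blip), the while loop itself
# succeeds, and under pipefail the pipeline status is the failure.
cleanup_guarded() {
  false | while read -r name; do
    echo "deleting ${name}"
  done || echo "Warning: cleanup listing failed; continuing" >&2
}

# Without the trailing `|| ...` on `done`, this call would terminate
# the whole script under `set -e`; with it, we log and keep going.
cleanup_guarded
echo "proceeding to namespace deletion"
```

Running it prints the warning on stderr and still reaches the final line, which is exactly the behavior the patch wants for the post-uninstall cleanup phase.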
Type of Change

Component(s) Affected

- `cmd/aicr`, `pkg/cli`
- `cmd/aicrd`, `pkg/api`, `pkg/server`
- `pkg/recipe`
- `pkg/bundler`, `pkg/component/*`
- `pkg/collector`, `pkg/snapshotter`
- `pkg/validator`
- `pkg/errors`, `pkg/k8s`
- `docs/`, `examples/`

Implementation Notes
Three single-token additions (`done` → `done || true`) plus one `|| true` on a single `kubectl delete crd` call, all in `pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:

- `delete_release_cluster_resources` (line 103)
- `force_clear_namespace_finalizers` (line 154)

The `delete_orphaned_webhooks_for_ns` helper at line 117 already wraps its inner pipeline in `{ … || true; }`, so it doesn't need the same treatment.

Trade-off: cleanup steps that fail due to a real (non-transient) API problem will now be silently skipped. The post-flight verification block at the end of the script already warns about leftover namespaces / stuck CRDs / stale webhooks, so a real cleanup failure surfaces there instead of as a `set -e` exit 1 in the middle.

Testing

- `pkg/bundler/...` tests pass with `-race` (all subpackages green).
- `|| true` guards appear at the expected positions in `bundles/undeploy.sh`.
- Live verification on `aicr-cuj2` after the patch is pending; it will run once the cluster's intermittent 502s settle.

Risk Assessment
Rollout notes: Generated-script change only. No API surface, no data migration. Reverting is a one-commit revert of this PR; existing bundles can either be regenerated or hand-patched.
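The post-flight verification this change leans on can be sketched as a jq filter over the CRD list: keep CRDs whose `meta.helm.sh/release-name` / `meta.helm.sh/release-namespace` annotations match the release and whose `.metadata.deletionTimestamp` is still null. A stand-alone sketch with a canned JSON document in place of `kubectl get crd -o json` (the release name, namespace, and CRD names are illustrative; the annotation keys are the real Helm ones):

```shell
#!/usr/bin/env bash
# Sketch of the post-flight leftover-CRD filter. Input is a canned
# `kubectl get crd -o json`-shaped document; in the real script it
# comes from the API server. Release "demo"/"demo-ns" is illustrative.
set -euo pipefail
command -v jq >/dev/null 2>&1 || { echo "jq not found; skipping sketch" >&2; exit 0; }

release="demo"; release_ns="demo-ns"

leftovers="$(jq -r --arg rel "$release" --arg ns "$release_ns" '
  .items[]
  | select(.metadata.annotations["meta.helm.sh/release-name"] == $rel)
  | select(.metadata.annotations["meta.helm.sh/release-namespace"] == $ns)
  | select(.metadata.deletionTimestamp == null)  # terminating CRDs are fine
  | .metadata.name' <<'EOF'
{"items":[
 {"metadata":{"name":"stuck.example.com",
   "annotations":{"meta.helm.sh/release-name":"demo","meta.helm.sh/release-namespace":"demo-ns"}}},
 {"metadata":{"name":"going.example.com","deletionTimestamp":"2024-01-01T00:00:00Z",
   "annotations":{"meta.helm.sh/release-name":"demo","meta.helm.sh/release-namespace":"demo-ns"}}},
 {"metadata":{"name":"other.example.com",
   "annotations":{"meta.helm.sh/release-name":"other","meta.helm.sh/release-namespace":"demo-ns"}}}
]}
EOF
)" || leftovers=""

[ -n "$leftovers" ] && echo "Warning: Helm-managed CRDs left behind: $leftovers" >&2
```

On this input only `stuck.example.com` is reported: `going.example.com` is excluded because it is already terminating, and `other.example.com` belongs to a different release.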
Checklist
- `make test` with `-race` on `pkg/bundler/...`
- `make lint`: go vet + golangci-lint clean; yamllint failures are in pre-existing untracked `.codex-worktrees/`
- `… || true` pattern already used in the same file
- `git commit -S`

Summary by CodeRabbit
Bug Fixes
Tests