
fix(bundler): fix undeploy template pre/post-flight checks #602

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:fix/undeploy-hardening-combined
Apr 17, 2026
Conversation

Contributor

@yuanchen8911 yuanchen8911 commented Apr 17, 2026

Summary

Fixes bugs in pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (the template that generates the bundle's undeploy.sh) by hardening pre-flight checks, cleanup behavior, and post-flight diagnostics.

Motivation / Context

The generated undeploy.sh had a few real correctness gaps:

  • transient kubectl / jq failures in cleanup could abort the script mid-run under set -euo pipefail
  • pre-flight could miss some release-owned CRDs, or falsely block safe undeploys for bundle-managed resources that are deleted before controller uninstall
  • post-flight did not clearly surface all leftover Helm-owned CRDs after uninstall

This PR fixes those behaviors in the undeploy.sh template so newly generated bundles get the corrected undeploy flow.

Fixes: N/A
Related:

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • Bundlers (pkg/bundler, pkg/component/*)
  • Docs/examples (docs/, examples/)

Implementation Notes

1. Cleanup pipelines are now best-effort under set -euo pipefail

The template now emits warning-and-continue wrappers for cleanup pipelines that were previously able to abort undeploy.sh mid-run on a single transient API failure.

That applies to:

  • delete_release_cluster_resources
  • force_clear_namespace_finalizers
  • the inline Helm-owned orphan-CRD cleanup loop

This keeps undeploy moving into later cleanup phases and post-flight reporting instead of exiting halfway through cleanup.
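The warn-and-continue pattern can be sketched as follows. This is an illustrative reconstruction, not the template's actual code: `run_cleanup_step` and the stubbed `kubectl` are hypothetical names, and the stub simulates a transient API failure.

```shell
#!/usr/bin/env bash
# Sketch of best-effort cleanup under `set -euo pipefail`.
# `run_cleanup_step` and the stubbed `kubectl` are hypothetical, not the
# template's real identifiers.
set -euo pipefail

kubectl() {  # stub simulating a transient API failure
  echo "error: connection refused" >&2
  return 1
}

run_cleanup_step() {
  local desc="$1"; shift
  # A failing command warns instead of aborting the whole script mid-run.
  if ! "$@" 2>/dev/null; then
    echo "Warning: ${desc} failed; continuing with remaining cleanup" >&2
  fi
}

run_cleanup_step "delete release cluster resources" kubectl delete clusterrole example-role
echo "reached post-flight"
```

The key property is that a single transient failure still lets the script reach the later cleanup phases and post-flight reporting.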

2. Pre-flight CRD discovery is release-scoped and explicit

The final pre-flight design intentionally stays narrow. The template now checks CRDs from only these sources:

  • CRDs present in helm get manifest
  • CRDs annotated to the Helm release via meta.helm.sh/release-*
  • a small explicit extra_crds_for_release() list for known crds/-installed operator CRDs that matter for undeploy safety

The explicit override list is currently limited to:

  • gpu-operator
  • kai-scheduler
  • k8s-nim-operator
  • kubeflow-trainer
  • kube-prometheus-stack
  • dynamo-platform
  • network-operator

This preserves the safety value of pre-flight without reintroducing broader API-group ownership inference across the cluster.
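The three-source aggregation might look roughly like this, with `helm`, the annotation query, and the override list all stubbed so the sketch runs standalone; the function names and sample CRD names are illustrative, not the template's.

```shell
#!/usr/bin/env bash
# Sketch of release-scoped CRD discovery from three explicit sources,
# deduplicated with `sort -u`. All names here are hypothetical.
set -euo pipefail

helm() {  # stub: a manifest containing one CRD
  printf 'kind: CustomResourceDefinition\nmetadata:\n  name: widgets.example.com\n'
}
annotated_crds_for() {  # stub for the kubectl | jq meta.helm.sh annotation query
  echo "gadgets.example.com"
}
extra_crds_for_release() {  # small explicit per-release override list
  case "$1" in
    gpu-operator) echo "clusterpolicies.nvidia.com" ;;
    *) ;;  # no extras for unlisted releases
  esac
}

crds_for_release() {
  local release="$1" ns="$2"
  {
    helm get manifest "${release}" -n "${ns}" 2>/dev/null \
      | awk '/name:/{print $2}' || true
    annotated_crds_for "${release}" "${ns}" || true
    extra_crds_for_release "${release}"
  } | sort -u
}

crds_for_release gpu-operator gpu-operator
```

Because every source is explicit, nothing outside the release's manifest, annotations, or the curated list is ever considered.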

3. Pre-flight now skips known manifest-managed false-positive cases

Some releases in the bundle create custom resources from manifests/ that undeploy deletes before uninstalling the controller. Those resources can safely carry controller finalizers during undeploy, so pre-flight should not block on them.

The template now handles those cases with an explicit skip_preflight_for_release() helper instead of trying to infer manifest ownership dynamically:

  • skyhook-operator
  • kgateway

That is the deliberate simplification: explicit known skips instead of broader manifest-parsing / ownership inference.
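An explicit skip list of this kind reduces to a small case statement; the sketch below is hypothetical, not the template's exact helper body.

```shell
#!/usr/bin/env bash
# Sketch of an explicit pre-flight skip list for known manifest-managed
# releases. The helper name mirrors the PR description; the body is
# illustrative.
set -euo pipefail

skip_preflight_for_release() {
  case "$1" in
    skyhook-operator|kgateway) return 0 ;;  # CRs deleted before controller uninstall
    *) return 1 ;;
  esac
}

for release in skyhook-operator gpu-operator; do
  if skip_preflight_for_release "${release}"; then
    echo "pre-flight: skipping ${release} (manifest-managed resources)"
  else
    echo "pre-flight: checking ${release}"
  fi
done
```

Adding a new known-safe release is a one-line edit, which is the tradeoff accepted here instead of manifest-parsing or ownership inference.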

4. Retained pre-flight checks now fail closed and are easier to diagnose

For the remaining pre-flight checks, the generated script now:

  • caches kubectl get crd -o json once per run
  • preserves successful kubectl ... -o json stderr warnings without poisoning JSON parsing
  • surfaces discovery failures instead of silently treating them as "no CRDs"
  • prints per-release progress during pre-flight

The intent is: if the script cannot safely determine whether retained release-owned CRDs still have stuck finalizers, it should say so clearly instead of silently passing pre-flight.
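One way to cache the CRD listing once while keeping stderr warnings visible and failing closed on discovery errors is sketched below; `kubectl` is stubbed, and `fetch_crds_once` plus `CRD_CACHE` are hypothetical names, not the generated script's.

```shell
#!/usr/bin/env bash
# Sketch: cache `kubectl get crd -o json` once per run, surface stderr
# warnings without mixing them into the JSON, and fail closed when the
# discovery call itself errors. Names are illustrative.
set -euo pipefail

kubectl() {  # stub: succeed, but emit a warning on stderr
  echo "Warning: v1beta1 CustomResourceDefinition is deprecated" >&2
  echo '{"items":[]}'
}

CRD_CACHE=""
fetch_crds_once() {
  [[ -n "${CRD_CACHE}" ]] && return 0
  local err
  err=$(mktemp)
  if ! CRD_CACHE=$(kubectl get crd -o json 2>"${err}"); then
    cat "${err}" >&2; rm -f "${err}"
    echo "Error: CRD discovery failed; rerun with --skip-preflight to bypass" >&2
    return 1  # fail closed instead of treating failure as "no CRDs"
  fi
  cat "${err}" >&2  # preserve warnings from a successful call
  rm -f "${err}"
}

fetch_crds_once
echo "${CRD_CACHE}"
```

Redirecting stderr to a temp file keeps the captured stdout parseable as JSON while still letting operators see what the API server said.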

5. Shared-cluster cleanup is intentionally conservative

The template no longer emits destructive cleanup that deletes unannotated CRDs based only on API-group matching.

Destructive cleanup is limited to CRDs that can be attributed to a just-uninstalled Helm release. Post-flight then reports leftover Helm-owned CRDs still present after uninstall.

That is the intended shared-cluster safety tradeoff: avoid deleting foreign-tenant CRDs just because they share an API group.

6. Post-flight now uses one cached CRD listing for both warning paths

The post-flight section now fetches CRDs once and reuses that cached JSON for:

  • leftover Helm-annotated CRD warnings
  • leftover explicit-name crds/ CRD warnings

That keeps post-flight cheaper and avoids repeating the same cluster-wide CRD fetch inside the component loop.
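The fetch-once-reuse-twice shape can be sketched as below, with `kubectl` stubbed so the example is self-contained; the ownership markers and CRD names are invented for illustration.

```shell
#!/usr/bin/env bash
# Sketch: one cluster-wide CRD fetch, reused by both post-flight warning
# paths instead of re-fetching inside the component loop. All names are
# hypothetical.
set -euo pipefail

kubectl() {  # stub: two CRDs, one marked Helm-owned
  printf 'widgets.example.com helm-owned\ngadgets.example.com unowned\n'
}

crd_listing=$(kubectl get crd)   # fetched once, before the component loop

# Path 1: leftover Helm-owned CRD warnings
echo "${crd_listing}" | awk '$2=="helm-owned"{print "Warning: leftover Helm-owned CRD: "$1}'

# Path 2: leftover explicit-name crds/ CRD warnings, no re-fetch per component
for name in widgets.example.com missing.example.com; do
  if echo "${crd_listing}" | grep -q "^${name} "; then
    echo "Warning: leftover crds/ CRD: ${name}"
  fi
done
```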

7. Namespace finalizer clearing now warns on discovery failure

force_clear_namespace_finalizers no longer silently returns when kubectl api-resources fails. It now emits a warning so operators can see why a namespace may remain stuck in Terminating.

8. --skip-preflight is now documented

The CLI reference now documents --skip-preflight, which the script already supports and which fail-closed pre-flight errors point operators to explicitly.

Testing

go test -race ./pkg/bundler/deployer/helm/...
make qualify
  • go test -race ./pkg/bundler/deployer/helm/... passed.
  • make qualify was rerun in a clean temporary worktree and passed the tests, coverage, vet, lint, and e2e stages.
  • The final grype vulnerability scan was still running when I stopped waiting on that clean-worktree run, so I am not claiming a fully completed make qualify for these code changes.
  • Subsequent title/body-only edits to the PR did not rerun the full make qualify, since they did not change repository code.
  • pkg/bundler/deployer/helm coverage remains 88.1%, unchanged from origin/main.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: This changes undeploy safety checks and cleanup behavior in the generated bundle script, but it does not change the bundle format or deploy path. The main user-visible differences are: safer handling of transient cleanup failures, better pre-flight coverage for release-owned CRDs, explicit skipping of known manifest-managed cases, and clearer post-flight leftover diagnostics.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)


coderabbitai Bot commented Apr 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a unit test that stubs kubectl to ensure the generated undeploy script warns on transient kubectl pipeline failures and continues; updates the undeploy script template to improve CRD discovery (manifest + live cluster), add release_groups(), warn on pipeline failures, and add post‑flight Helm-annotation CRD checks.

Changes

  • Test Addition (pkg/bundler/deployer/helm/helm_test.go): Added TestUndeployScript_TransientFailureWarnsAndContinues, which generates undeploy.sh, installs a temporary kubectl shim in PATH, executes extracted cleanup sections under bash -euo pipefail, and asserts each section exits successfully while emitting Warning: on simulated transient failures. Added bytes, os/exec, and time imports for subprocess handling and timeouts.
  • Undeploy Script Enhancements (pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl): Added release_groups() helper; changed CRD discovery in check_release_for_stuck_crds() to aggregate CRDs from helm get manifest and live annotated CRDs (deduplicated via sort -u); updated cleanup loops (delete_release_cluster_resources(), force_clear_namespace_finalizers(), orphan-CRD deletion) to append an explicit Warning: on pipeline failures while preserving per-item …

Sequence Diagram(s)

sequenceDiagram
  participant Test as Test Runner
  participant Script as Generated undeploy.sh
  participant Kubectl as kubectl shim
  participant KubeAPI as Kubernetes API

  Test->>Script: generate & extract cleanup snippets
  Test->>Kubectl: install shim in PATH (stubbed behavior)
  Test->>Script: execute snippet under "bash -euo pipefail"
  Script->>Kubectl: kubectl api-resources / get / delete / patch / helm get manifest
  Kubectl->>KubeAPI: forward allowed calls (api-resources succeeds)
  KubeAPI-->>Kubectl: resource responses
  Kubectl-->>Script: non-zero exit + stderr for simulated transient commands
  Script-->>Test: emit "Warning:" to stderr and continue (exit 0)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hop through manifests and live API streams,

I whisper "Warning" when a pipeline misteers,
I gather CRDs from charts and sky,
I nudge finalizers and let deletes try,
A tidy undeploy — soft rabbit cheers!

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 50.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Title check (❓ Inconclusive): The title claims to fix 'pre/post-flight checks', but the PR primarily addresses transient-failure resilience, CRD discovery hardening, and orphan-CRD detection in undeploy.sh. A more precise title such as 'fix(bundler): harden undeploy.sh resilience and CRD discovery' would better reflect the scope.

✅ Passed checks (1 passed)

  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch from 97a5ac7 to 2f3a1c8 Compare April 17, 2026 00:33

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/deployer/helm/helm_test.go`:
- Around line 18-21: The test invokes exec.CommandContext with
context.Background() (see the CommandContext call in helm_test.go) which can
hang; update the test to create a cancellable timeout context via
context.WithTimeout (add "time" to imports), use the returned ctx for
exec.CommandContext, defer cancel(), and ensure the subprocess is killed if the
timeout elapses (handle context.DeadlineExceeded appropriately in the test
assertions or error handling).

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 204-210: The current manifest capture line uses "manifest=$(helm
get manifest \"${release}\" -n \"${ns}\" 2>/dev/null) || return 0" which aborts
the function on helm failure and prevents computing annotated_crds; change the
failure handling so helm errors do not return (e.g. set manifest to empty on
failure) so that manifest_crds is empty but annotated_crds is still computed;
update the "manifest=$(helm get manifest ... )" invocation to avoid "return 0"
on failure and ensure annotated_crds (from kubectl/jq) still runs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: eaf6e389-a682-44de-ae94-9057814881a7

📥 Commits

Reviewing files that changed from the base of the PR and between b82af6f and 97a5ac7.

📒 Files selected for processing (2)
  • pkg/bundler/deployer/helm/helm_test.go
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

Comment thread pkg/bundler/deployer/helm/helm_test.go
Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/deployer/helm/helm_test.go`:
- Around line 2305-2306: The test currently creates a context with ctx :=
context.Background() and uses it with exec.CommandContext (in helm_test.go),
which can hang; replace it with a cancellable timeout context using
context.WithTimeout (e.g., ctx, cancel :=
context.WithTimeout(context.Background(), <reasonable duration>)) and defer
cancel() so subprocess calls via exec.CommandContext will be killed after the
timeout; update any test helpers or variables that depend on ctx accordingly and
choose a duration appropriate for CI (e.g., 10–30s) to prevent wedged bash
invocations from hanging go test -race.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 4fe412ac-ee75-49f2-a1d3-13aad8c3ba27

📥 Commits

Reviewing files that changed from the base of the PR and between 97a5ac7 and 2f3a1c8.

📒 Files selected for processing (2)
  • pkg/bundler/deployer/helm/helm_test.go
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

Comment thread pkg/bundler/deployer/helm/helm_test.go
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch from 2f3a1c8 to b6bde3b Compare April 17, 2026 00:43
@github-actions github-actions Bot added size/L and removed size/M labels Apr 17, 2026

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (2)
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (1)

229-234: ⚠️ Potential issue | 🟠 Major

helm get manifest failure still bypasses annotation-based discovery.

Line 229 uses || return 0, which exits the function before annotated_crds (line 232) and group_crds (line 239) are computed. This defeats the three-source discovery design: if Helm is flaky but CRDs exist with annotations or matching API groups, the pre-flight check will miss them.

Suggested fix
-  manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0
+  manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true)
   manifest_crds=$(echo "${manifest}" \
     | awk '/^kind:/{kind=$2} /^  name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}')
   annotated_crds=$(kubectl get crd -o json 2>/dev/null \
     | jq -r --arg rel "${release}" --arg ns "${ns}" \
       '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)
+  # Skip if release has no CRDs from any source
+  if [[ -z "${manifest_crds}" && -z "${annotated_crds}" && -z "${group_crds}" ]]; then
+    return 0
+  fi

Move the early return after all three discovery sources are computed, and only return if all are empty.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 229 - 234,
The early return after running helm get manifest prevents annotated_crds and
group_crds discovery from running; update the undeploy logic so that you still
capture manifest (variable manifest) and compute manifest_crds, annotated_crds
(from kubectl/jq) and group_crds before deciding to return: move the current
"return 0" so it only executes when all three discovery results (manifest_crds,
annotated_crds, group_crds) are empty, ensuring annotation-based and
API-group-based CRD detection (variables manifest_crds, annotated_crds,
group_crds) still run even if helm get manifest fails.
pkg/bundler/deployer/helm/helm_test.go (1)

2305-2306: ⚠️ Potential issue | 🟡 Minor

Add timeout to subprocess execution context.

ctx := context.Background() is used with exec.CommandContext at line 2381. A wedged bash invocation will hang go test -race indefinitely.

Suggested fix

Add "time" to imports, then:

-	ctx := context.Background()
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
 	outputDir := t.TempDir()

As per coding guidelines: "Use context.WithTimeout() for all I/O operations."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/helm_test.go` around lines 2305 - 2306, The test
uses a background context (ctx := context.Background()) passed into
exec.CommandContext which can hang; change to use context.WithTimeout to bound
subprocess execution: import "time", create ctx, cancel :=
context.WithTimeout(context.Background(), <reasonable duration>) and defer
cancel(), then pass that ctx into exec.CommandContext (the same variable used in
the test and referenced where exec.CommandContext is invoked) so the subprocess
is killed on timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/bundler/deployer/helm/helm_test.go`:
- Around line 2305-2306: The test uses a background context (ctx :=
context.Background()) passed into exec.CommandContext which can hang; change to
use context.WithTimeout to bound subprocess execution: import "time", create
ctx, cancel := context.WithTimeout(context.Background(), <reasonable duration>)
and defer cancel(), then pass that ctx into exec.CommandContext (the same
variable used in the test and referenced where exec.CommandContext is invoked)
so the subprocess is killed on timeout.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 229-234: The early return after running helm get manifest prevents
annotated_crds and group_crds discovery from running; update the undeploy logic
so that you still capture manifest (variable manifest) and compute
manifest_crds, annotated_crds (from kubectl/jq) and group_crds before deciding
to return: move the current "return 0" so it only executes when all three
discovery results (manifest_crds, annotated_crds, group_crds) are empty,
ensuring annotation-based and API-group-based CRD detection (variables
manifest_crds, annotated_crds, group_crds) still run even if helm get manifest
fails.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 2f28835a-86de-4f9d-ab57-f59795447b9c

📥 Commits

Reviewing files that changed from the base of the PR and between 2f3a1c8 and b6bde3b.

📒 Files selected for processing (2)
  • pkg/bundler/deployer/helm/helm_test.go
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch 3 times, most recently from 527871e to 7191b51 Compare April 17, 2026 01:04

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (2)
pkg/bundler/deployer/helm/helm_test.go (1)

2370-2371: ⚠️ Potential issue | 🟡 Minor

Add timeout to subprocess execution context.

ctx := context.Background() is used with exec.CommandContext at line 2446. A wedged bash invocation will hang go test -race indefinitely.

-	ctx := context.Background()
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()

Also add "time" to imports.

As per coding guidelines: "Use context.WithTimeout() for all I/O operations."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/helm_test.go` around lines 2370 - 2371, Replace the
background context used for subprocesses with a cancellable timeout context:
instead of ctx := context.Background() create a ctx, cancel :=
context.WithTimeout(context.Background(), 10*time.Second) and defer cancel() so
exec.CommandContext uses a timeout; update the import list to include "time".
Ensure references to ctx (used with exec.CommandContext) remain unchanged and
that cancel() is deferred in the same scope where ctx is created.
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (1)

229-250: ⚠️ Potential issue | 🟠 Major

Don't let helm get manifest disable the annotation-based fallback.

Line 233's || return 0 exits the function on any helm failure, preventing annotated_crds from being computed. This reintroduces false-negative pre-flight behavior for cases where Helm is flaky but CRDs with Helm annotations are still present.

-  manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0
+  manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true)
   manifest_crds=$(echo "${manifest}" \
     | awk '/^kind:/{kind=$2} /^  name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}')
   annotated_crds=$(kubectl get crd -o json 2>/dev/null \
     | jq -r --arg rel "${release}" --arg ns "${ns}" \
       '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)
+  # Exit early only if BOTH sources are empty
+  if [[ -z "${manifest_crds}" && -z "${annotated_crds}" ]]; then
+    return 0
+  fi


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 229 - 250,
In check_release_for_stuck_crds change the helm manifest fetch so a failure
doesn't short-circuit the function: replace the current manifest assignment that
uses "helm get manifest ... 2>/dev/null || return 0" with a form that swallows
errors but continues (e.g., "manifest=$(helm get manifest ... 2>/dev/null ||
true)"), so annotated_crds and the fallback annotation-based logic still run;
keep the rest of the function (manifest_crds, annotated_crds, sort -u, while
loop, and check_crd_for_stuck_resources) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/bundler/deployer/helm/helm_test.go`:
- Around line 2370-2371: Replace the background context used for subprocesses
with a cancellable timeout context: instead of ctx := context.Background()
create a ctx, cancel := context.WithTimeout(context.Background(),
10*time.Second) and defer cancel() so exec.CommandContext uses a timeout; update
the import list to include "time". Ensure references to ctx (used with
exec.CommandContext) remain unchanged and that cancel() is deferred in the same
scope where ctx is created.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 229-250: In check_release_for_stuck_crds change the helm manifest
fetch so a failure doesn't short-circuit the function: replace the current
manifest assignment that uses "helm get manifest ... 2>/dev/null || return 0"
with a form that swallows errors but continues (e.g., "manifest=$(helm get
manifest ... 2>/dev/null || true)"), so annotated_crds and the fallback
annotation-based logic still run; keep the rest of the function (manifest_crds,
annotated_crds, sort -u, while loop, and check_crd_for_stuck_resources)
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 221f0650-259b-49c6-a811-064333c9dd68

📥 Commits

Reviewing files that changed from the base of the PR and between b6bde3b and 527871e.

📒 Files selected for processing (2)
  • pkg/bundler/deployer/helm/helm_test.go
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch from 7191b51 to a66afd3 Compare April 17, 2026 01:12

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 240-245: The pipeline that sets annotated_crds can fail and
currently yields an empty string which makes the subsequent fast-path treat
failure as "no CRDs"; modify the logic around annotated_crds to detect pipeline
failure (e.g., enable pipefail or inspect the pipeline exit status) and only
consider annotated_crds empty when the command succeeded and produced no
results—if the kubectl|jq pipeline fails, surface an error or set a non-empty
sentinel so the [[ -z "${manifest_crds}" && -z "${annotated_crds}" ]] fast-path
is not taken; update the block that defines annotated_crds and the following
conditional to bail or continue safely when the discovery command fails
(referencing annotated_crds, manifest_crds and the kubectl/jq pipeline).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 180f99cb-afd1-4bdf-b5fc-d158435a5404

📥 Commits

Reviewing files that changed from the base of the PR and between 527871e and 7191b51.

📒 Files selected for processing (2)
  • pkg/bundler/deployer/helm/helm_test.go
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

Comment thread pkg/bundler/deployer/helm/helm_test.go
Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl Outdated
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch 2 times, most recently from cc2b6a3 to cabc255 Compare April 17, 2026 01:28
@github-actions github-actions Bot added size/XL and removed size/L labels Apr 17, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch 2 times, most recently from 7296352 to 9497ad0 Compare April 17, 2026 01:44

@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch from e0b7b97 to 1f18857 Compare April 17, 2026 02:47
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch 2 times, most recently from 919ffac to bb9462c Compare April 17, 2026 05:23
@yuanchen8911 yuanchen8911 marked this pull request as ready for review April 17, 2026 05:59
@yuanchen8911 yuanchen8911 changed the title fix(bundler): harden undeploy.sh and simplify preflight fix(bundler): fix undeploy.sh pre/post-flight checks Apr 17, 2026
@yuanchen8911 yuanchen8911 changed the title fix(bundler): fix undeploy.sh pre/post-flight checks fix(bundler): fix undeploy template pre/post-flight checks Apr 17, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch from bb9462c to 73f09b9 Compare April 17, 2026 07:16
Member

@mchmarny mchmarny left a comment


Solid hardening of the undeploy template. The key design decisions are well-reasoned: fail-closed on API errors (with clear --skip-preflight escape hatch), explicit CRD override lists over dynamic group inference (shared-cluster safety), and best-effort cleanup pipelines that warn instead of aborting under set -euo pipefail. The removal of ORPHANED_CRD_GROUPS group-based deletion is the most important safety fix — it eliminates a real cross-tenant CRD deletion risk on shared clusters.

Test coverage is thorough: 11 new shell-behavior tests with kubectl/helm stubs that exercise transient failures, annotation discovery, explicit CRD discovery, skip lists, foreign-annotation safety, fail-closed behavior, stderr-preserving JSON parsing, post-flight warnings, kustomize-only bundles, and dynamo platform CRDs.

Two nits on maintainability (the explicit CRD/skip lists need coordinated updates when new operators are added), nothing blocking. CI is green. Note: PR is behind main (mergeable_state: behind) — will need a rebase before merge.

Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
Comment thread pkg/bundler/deployer/helm/helm_test.go
Contributor

@lockwobr lockwobr left a comment


Thanks for the careful work here — the bug analysis and test coverage are solid, and I can see each change maps to a real failure mode.

My main hesitation is about the cumulative complexity of undeploy.sh.tmpl. This is roughly the 7th PR hardening this template (#253, #282, #364, #416, #477, #561, now this one), and each round understandably adds more branches, helpers, and component-specific knowledge. Net of this PR we're at ~700 lines of templated bash with two hardcoded component tables and a custom stderr-capture helper.

Being a deploy tool is a non-goal for this project, but each round of hardening here inches us closer to becoming one. At what point do we say this has crossed the line? Worth opening that conversation before we do round 8.

Left a few inline comments along those lines. Mostly suggestions, not blockers.

Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
Comment thread pkg/bundler/deployer/helm/helm_test.go
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch from 98d2a3f to e1b2513 Compare April 17, 2026 18:38
@yuanchen8911 yuanchen8911 requested a review from lockwobr April 17, 2026 19:25
@yuanchen8911 yuanchen8911 enabled auto-merge (squash) April 17, 2026 19:39
@yuanchen8911 yuanchen8911 merged commit c12c783 into NVIDIA:main Apr 17, 2026
33 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 23, 2026
PR NVIDIA#602 intentionally scoped auto-deletion to CRDs carrying a Helm
release annotation, so release-scoped CRDs installed outside Helm
discovery (chart crds/ subchart, runtime operator, out-of-band) are
only surfaced as post-flight warnings. That default remains the
correct shared-cluster safety choice.

This adds --delete-orphan-crds for single-tenant teardowns. The
rendered script iterates the existing extra_crds_for_release() list,
force-clears residual finalizers on any remaining CRs (so the
delete does not hang on resource-in-use markers from controllers
that are already gone), and calls kubectl delete crd. Default stays
off; caller must opt in.

The opt-in path reuses the same foreign-release guard that
pre-flight discovery uses (release-name AND release-namespace
annotation match), so --delete-orphan-crds cannot remove a CRD
that another Helm release on the cluster owns. Unannotated CRDs
(runtime-installed by an operator, chart crds/ without annotations)
are fair game — that is the entire point of the explicit list.

Also:
 - adds the missing nvidia-dra-driver-gpu entry to
   extra_crds_for_release() (computedomains, computedomaincliques)
 - guards the read < <(jsonpath) regression: jsonpath output had no
   trailing newline, so `read` returned 1 at EOF and `|| return 0`
   swallowed the delete. Fixed by switching to a single
   `kubectl get crd -o json` + jq parse, which also centralizes
   the foreign-release guard and the plural/group/scope lookup.
 - malformed metadata no longer aborts cleanup silently: the helper
   warns to stderr and attempts a direct kubectl delete crd instead
 - test is now table-driven and covers: cluster-scope with own
   release, namespaced-scope with own release (patch -n), foreign
   release skip, unannotated (fair game), and malformed metadata
   fallback
 - updates docs/user/cli-reference.md: both meta.helm.sh/release-name
   AND meta.helm.sh/release-namespace must match for default
   deletion; the foreign-release guard also applies under
   --delete-orphan-crds
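The `read < <(jsonpath)` regression described above comes from a well-defined bash behavior: `read` returns non-zero when it hits EOF before a newline, even though it has already populated the variable. A minimal reproduction (the CRD name is illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail

# printf with no trailing newline mimics `kubectl get ... -o jsonpath=...`,
# whose output does not end in a newline.
if IFS= read -r name < <(printf 'widgets.example.com'); then
  status=ok
else
  status=eof   # this branch is taken, even though $name was assigned
fi
echo "name=$name status=$status"
```

Combined with `|| return 0`, that spurious failure status silently short-circuited the delete path, which is why the fix moved to a single `kubectl get crd -o json` plus jq parse instead of jsonpath lines.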
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 24, 2026
PR NVIDIA#602 intentionally scoped auto-deletion to CRDs carrying a Helm
release annotation, so release-scoped CRDs installed outside Helm
discovery (chart crds/ subchart, runtime operator, out-of-band) are
only surfaced as post-flight warnings. That default remains the
correct shared-cluster safety choice.

This adds --delete-orphan-crds for single-tenant teardowns. The
rendered script iterates the existing extra_crds_for_release() list,
force-clears residual finalizers on any remaining CRs (so the
delete does not hang on resource-in-use markers from controllers
that are already gone), and calls kubectl delete crd. Default stays
off; caller must opt in.

The same foreign-release guard that pre-flight discovery uses
(release-name AND release-namespace annotation match) is applied in
all three places that can surface bundle-named CRDs: pre-flight
warnings (unchanged from PR NVIDIA#602), the new opt-in deletion path
(force_clear_and_delete_crd helper), and the post-flight exact-name
warning. A CRD annotated to a foreign Helm release is now skipped
at every stage, so --delete-orphan-crds cannot remove another
tenant's CRD and post-flight cannot mis-suggest one for manual
`kubectl delete crd` removal. Unannotated CRDs (runtime-installed
by an operator, chart crds/ without annotations) remain fair game —
that is the entire point of the explicit list.

Also:
 - adds the missing nvidia-dra-driver-gpu entry to
   extra_crds_for_release() (computedomains, computedomaincliques)
 - guards the read < <(jsonpath) regression: jsonpath output had no
   trailing newline, so `read` returned 1 at EOF and `|| return 0`
   swallowed the delete. Fixed by switching to a single
   `kubectl get crd -o json` + jq parse, which also centralizes
   the foreign-release guard and the plural/group/scope lookup.
 - malformed metadata no longer aborts cleanup silently: the helper
   warns to stderr and attempts a direct kubectl delete crd instead
 - test is now table-driven and covers: cluster-scope with own
   release, namespaced-scope with own release (patch -n), foreign
   release skip, unannotated (fair game), and malformed metadata
   fallback; plus marker assertions that the helper guard and the
   post-flight ownership filter are both wired
 - updates docs/user/cli-reference.md: both meta.helm.sh/release-name
   AND meta.helm.sh/release-namespace must match for default
   deletion; the foreign-release guard also applies under
   --delete-orphan-crds
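The foreign-release guard semantics described above can be sketched as a predicate over the two Helm ownership annotations. The function name and the sample values are hypothetical; the template centralizes this check differently, but the decision table is the same: annotated-and-mismatched CRDs are skipped, while own-release and unannotated CRDs remain eligible.

```shell
#!/usr/bin/env bash
set -euo pipefail

# True only when the CRD carries a meta.helm.sh/release-name annotation
# that does not match BOTH the release name and the release namespace.
is_foreign_release_crd() {
  local owner_name="$1" owner_ns="$2"
  [[ -n "$owner_name" && ( "$owner_name" != "$RELEASE_NAME" || "$owner_ns" != "$RELEASE_NS" ) ]]
}

RELEASE_NAME=myrelease
RELEASE_NS=myns

is_foreign_release_crd other other-ns      && echo "foreign release: skip"
is_foreign_release_crd "" ""               || echo "unannotated: fair game"
is_foreign_release_crd myrelease myns      || echo "own release: eligible for deletion"
```

Requiring both annotations to match is what prevents `--delete-orphan-crds` from removing a same-named CRD owned by another tenant's release in a different namespace.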