fix(bundler): fix undeploy template pre/post-flight checks #602
Conversation
📝 Walkthrough
Adds a unit test that stubs …
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant Test as Test Runner
    participant Script as Generated undeploy.sh
    participant Kubectl as kubectl shim
    participant KubeAPI as Kubernetes API
    Test->>Script: generate & extract cleanup snippets
    Test->>Kubectl: install shim in PATH (stubbed behavior)
    Test->>Script: execute snippet under "bash -euo pipefail"
    Script->>Kubectl: kubectl api-resources / get / delete / patch / helm get manifest
    Kubectl->>KubeAPI: forward allowed calls (api-resources succeeds)
    KubeAPI-->>Kubectl: resource responses
    Kubectl-->>Script: non-zero exit + stderr for simulated transient commands
    Script-->>Test: emit "Warning:" to stderr and continue (exit 0)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: 1 passed, 2 failed (1 warning, 1 inconclusive).
Force-pushed 97a5ac7 to 2f3a1c8
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/bundler/deployer/helm/helm_test.go`:
- Around line 18-21: The test invokes exec.CommandContext with
context.Background() (see the CommandContext call in helm_test.go) which can
hang; update the test to create a cancellable timeout context via
context.WithTimeout (add "time" to imports), use the returned ctx for
exec.CommandContext, defer cancel(), and ensure the subprocess is killed if the
timeout elapses (handle context.DeadlineExceeded appropriately in the test
assertions or error handling).
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 204-210: The current manifest capture line uses "manifest=$(helm
get manifest \"${release}\" -n \"${ns}\" 2>/dev/null) || return 0" which aborts
the function on helm failure and prevents computing annotated_crds; change the
failure handling so helm errors do not return (e.g. set manifest to empty on
failure) so that manifest_crds is empty but annotated_crds is still computed;
update the "manifest=$(helm get manifest ... )" invocation to avoid "return 0"
on failure and ensure annotated_crds (from kubectl/jq) still runs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: eaf6e389-a682-44de-ae94-9057814881a7
📒 Files selected for processing (2)
- `pkg/bundler/deployer/helm/helm_test.go`
- `pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/bundler/deployer/helm/helm_test.go`:
- Around line 2305-2306: The test currently creates a context with ctx :=
context.Background() and uses it with exec.CommandContext (in helm_test.go),
which can hang; replace it with a cancellable timeout context using
context.WithTimeout (e.g., ctx, cancel :=
context.WithTimeout(context.Background(), <reasonable duration>)) and defer
cancel() so subprocess calls via exec.CommandContext will be killed after the
timeout; update any test helpers or variables that depend on ctx accordingly and
choose a duration appropriate for CI (e.g., 10–30s) to prevent wedged bash
invocations from hanging go test -race.
Force-pushed 2f3a1c8 to b6bde3b
♻️ Duplicate comments (2)
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (1)
229-234: ⚠️ Potential issue | 🟠 Major
`helm get manifest` failure still bypasses annotation-based discovery. Line 229 uses `|| return 0`, which exits the function before `annotated_crds` (line 232) and `group_crds` (line 239) are computed. This defeats the three-source discovery design: if Helm is flaky but CRDs exist with annotations or matching API groups, the pre-flight check will miss them.
Suggested fix:
```diff
- manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0
+ manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true)
  manifest_crds=$(echo "${manifest}" \
    | awk '/^kind:/{kind=$2} /^  name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}')
  annotated_crds=$(kubectl get crd -o json 2>/dev/null \
    | jq -r --arg rel "${release}" --arg ns "${ns}" \
      '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)
+ # Skip if release has no CRDs from any source
+ if [[ -z "${manifest_crds}" && -z "${annotated_crds}" && -z "${group_crds}" ]]; then
+   return 0
+ fi
```
Move the early return after all three discovery sources are computed, and only return if all are empty.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 229-234: the early return after running `helm get manifest` prevents `annotated_crds` and `group_crds` discovery from running. Update the undeploy logic so that you still capture `manifest` and compute `manifest_crds`, `annotated_crds` (from kubectl/jq), and `group_crds` before deciding to return: move the current `return 0` so it only executes when all three discovery results are empty, ensuring annotation-based and API-group-based CRD detection still runs even if `helm get manifest` fails.

pkg/bundler/deployer/helm/helm_test.go (1)
2305-2306: ⚠️ Potential issue | 🟡 Minor
Add timeout to subprocess execution context.
`ctx := context.Background()` is used with `exec.CommandContext` at line 2381. A wedged `bash` invocation will hang `go test -race` indefinitely.
Suggested fix: add `"time"` to imports, then:
```diff
- ctx := context.Background()
+ ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+ defer cancel()
  outputDir := t.TempDir()
```
As per coding guidelines: "Use `context.WithTimeout()` for all I/O operations."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/helm_test.go` around lines 2305 - 2306, The test uses a background context (ctx := context.Background()) passed into exec.CommandContext which can hang; change to use context.WithTimeout to bound subprocess execution: import "time", create ctx, cancel := context.WithTimeout(context.Background(), <reasonable duration>) and defer cancel(), then pass that ctx into exec.CommandContext (the same variable used in the test and referenced where exec.CommandContext is invoked) so the subprocess is killed on timeout.
Force-pushed 527871e to 7191b51
♻️ Duplicate comments (2)
pkg/bundler/deployer/helm/helm_test.go (1)
2370-2371: ⚠️ Potential issue | 🟡 Minor
Add timeout to subprocess execution context.
`ctx := context.Background()` is used with `exec.CommandContext` at line 2446. A wedged `bash` invocation will hang `go test -race` indefinitely.
```diff
- ctx := context.Background()
+ ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+ defer cancel()
```
Also add `"time"` to imports. As per coding guidelines: "Use `context.WithTimeout()` for all I/O operations."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/helm_test.go` around lines 2370-2371: replace the background context used for subprocesses with a cancellable timeout context. Instead of `ctx := context.Background()`, create `ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)` and `defer cancel()` so `exec.CommandContext` uses a timeout; update the import list to include `"time"`. Ensure references to `ctx` (used with `exec.CommandContext`) remain unchanged and that `cancel()` is deferred in the same scope where `ctx` is created.

pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (1)
229-250: ⚠️ Potential issue | 🟠 Major
Don't let `helm get manifest` disable the annotation-based fallback. Line 233's `|| return 0` exits the function on any helm failure, preventing `annotated_crds` from being computed. This reintroduces false-negative pre-flight behavior for cases where Helm is flaky but CRDs with Helm annotations are still present.
```diff
- manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0
+ manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true)
  manifest_crds=$(echo "${manifest}" \
    | awk '/^kind:/{kind=$2} /^  name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}')
  annotated_crds=$(kubectl get crd -o json 2>/dev/null \
    | jq -r --arg rel "${release}" --arg ns "${ns}" \
      '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)
+ # Exit early only if BOTH sources are empty
+ if [[ -z "${manifest_crds}" && -z "${annotated_crds}" ]]; then
+   return 0
+ fi
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 229 - 250, In check_release_for_stuck_crds change the helm manifest fetch so a failure doesn't short-circuit the function: replace the current manifest assignment that uses "helm get manifest ... 2>/dev/null || return 0" with a form that swallows errors but continues (e.g., "manifest=$(helm get manifest ... 2>/dev/null || true)"), so annotated_crds and the fallback annotation-based logic still run; keep the rest of the function (manifest_crds, annotated_crds, sort -u, while loop, and check_crd_for_stuck_resources) unchanged.
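The control flow this prompt asks for can be sketched as a small runnable demo. The `helm`/`kubectl` stubs, the `discover_release_crds` name, and the CRD name `widgets.example.com` are illustrative stand-ins, and the annotation lookup is collapsed into a plain stub rather than the template's real `kubectl ... | jq` pipeline:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stubs: helm is flaky (always fails), kubectl still answers.
helm() { return 1; }
kubectl() { printf 'widgets.example.com\n'; }

discover_release_crds() {
  local release="$1" ns="$2" manifest manifest_crds annotated_crds
  # "|| true" swallows the helm failure so set -e does not abort; an empty
  # manifest simply contributes no manifest-sourced CRDs.
  manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true)
  manifest_crds=$(printf '%s\n' "${manifest}" \
    | awk '/^kind:/{kind=$2} /^  name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}')
  # Annotation-based discovery still runs even though helm failed above.
  annotated_crds=$(kubectl get crd "${release}" 2>/dev/null || true)
  # Return early only when every discovery source came up empty.
  if [[ -z "${manifest_crds}" && -z "${annotated_crds}" ]]; then
    return 0
  fi
  # Intentional word splitting to merge the two newline-separated lists.
  printf '%s\n' ${manifest_crds} ${annotated_crds} | sort -u
}

result=$(discover_release_crds demo apps)
echo "${result}"
```

Even with `helm` failing on every call, the annotation-sourced CRD still surfaces, which is the behavior the review comment is after.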
Force-pushed 7191b51 to a66afd3
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 240-245: The pipeline that sets annotated_crds can fail and
currently yields an empty string which makes the subsequent fast-path treat
failure as "no CRDs"; modify the logic around annotated_crds to detect pipeline
failure (e.g., enable pipefail or inspect the pipeline exit status) and only
consider annotated_crds empty when the command succeeded and produced no
results—if the kubectl|jq pipeline fails, surface an error or set a non-empty
sentinel so the [[ -z "${manifest_crds}" && -z "${annotated_crds}" ]] fast-path
is not taken; update the block that defines annotated_crds and the following
conditional to bail or continue safely when the discovery command fails
(referencing annotated_crds, manifest_crds and the kubectl/jq pipeline).
Force-pushed cc2b6a3 to cabc255
Force-pushed 7296352 to 9497ad0
🌿 Preview your docs: https://nvidia-preview-fix-undeploy-hardening-combined.docs.buildwithfern.com/aicr
Force-pushed e0b7b97 to 1f18857
Force-pushed 919ffac to bb9462c
Force-pushed bb9462c to 73f09b9
mchmarny left a comment
Solid hardening of the undeploy template. The key design decisions are well-reasoned: fail-closed on API errors (with clear --skip-preflight escape hatch), explicit CRD override lists over dynamic group inference (shared-cluster safety), and best-effort cleanup pipelines that warn instead of aborting under set -euo pipefail. The removal of ORPHANED_CRD_GROUPS group-based deletion is the most important safety fix — it eliminates a real cross-tenant CRD deletion risk on shared clusters.
Test coverage is thorough: 11 new shell-behavior tests with kubectl/helm stubs that exercise transient failures, annotation discovery, explicit CRD discovery, skip lists, foreign-annotation safety, fail-closed behavior, stderr-preserving JSON parsing, post-flight warnings, kustomize-only bundles, and dynamo platform CRDs.
Two nits on maintainability (the explicit CRD/skip lists need coordinated updates when new operators are added), nothing blocking. CI is green. Note: PR is behind main (mergeable_state: behind) — will need a rebase before merge.
Force-pushed 73f09b9 to 98d2a3f
lockwobr left a comment
Thanks for the careful work here — the bug analysis and test coverage are solid, and I can see each change maps to a real failure mode.
My main hesitation is about the cumulative complexity of undeploy.sh.tmpl. This is roughly the 7th PR hardening this template (#253, #282, #364, #416, #477, #561, now this one), and each round understandably adds more branches, helpers, and component-specific knowledge. Net of this PR we're at ~700 lines of templated bash with two hardcoded component tables and a custom stderr-capture helper.
Being a deploy tool is a non-goal for this project, but each round of hardening here inches us closer to becoming one. At what point do we say this has crossed the line? Worth opening that conversation before we do round 8.
Left a few inline comments along those lines. Mostly suggestions, not blockers.
Force-pushed 98d2a3f to e1b2513
PR NVIDIA#602 intentionally scoped auto-deletion to CRDs carrying a Helm release annotation, so release-scoped CRDs installed outside Helm discovery (chart crds/ subchart, runtime operator, out-of-band) are only surfaced as post-flight warnings. That default remains the correct shared-cluster safety choice.

This adds --delete-orphan-crds for single-tenant teardowns. The rendered script iterates the existing extra_crds_for_release() list, force-clears residual finalizers on any remaining CRs (so the delete does not hang on resource-in-use markers from controllers that are already gone), and calls kubectl delete crd. Default stays off; caller must opt in.

The opt-in path reuses the same foreign-release guard that pre-flight discovery uses (release-name AND release-namespace annotation match), so --delete-orphan-crds cannot remove a CRD that another Helm release on the cluster owns. Unannotated CRDs (runtime-installed by an operator, chart crds/ without annotations) are fair game: that is the entire point of the explicit list.

Also:
- adds the missing nvidia-dra-driver-gpu entry to extra_crds_for_release() (computedomains, computedomaincliques)
- guards the read < <(jsonpath) regression: jsonpath output had no trailing newline, so `read` returned 1 at EOF and `|| return 0` swallowed the delete. Fixed by switching to a single `kubectl get crd -o json` + jq parse, which also centralizes the foreign-release guard and the plural/group/scope lookup.
- malformed metadata no longer aborts cleanup silently: the helper warns to stderr and attempts a direct kubectl delete crd instead
- test is now table-driven and covers: cluster-scope with own release, namespaced-scope with own release (patch -n), foreign release skip, unannotated (fair game), and malformed metadata fallback
- updates docs/user/cli-reference.md: both meta.helm.sh/release-name AND meta.helm.sh/release-namespace must match for default deletion; the foreign-release guard also applies under --delete-orphan-crds
PR NVIDIA#602 intentionally scoped auto-deletion to CRDs carrying a Helm release annotation, so release-scoped CRDs installed outside Helm discovery (chart crds/ subchart, runtime operator, out-of-band) are only surfaced as post-flight warnings. That default remains the correct shared-cluster safety choice.

This adds --delete-orphan-crds for single-tenant teardowns. The rendered script iterates the existing extra_crds_for_release() list, force-clears residual finalizers on any remaining CRs (so the delete does not hang on resource-in-use markers from controllers that are already gone), and calls kubectl delete crd. Default stays off; caller must opt in.

The same foreign-release guard that pre-flight discovery uses (release-name AND release-namespace annotation match) is applied in all three places that can surface bundle-named CRDs: pre-flight warnings (unchanged from PR NVIDIA#602), the new opt-in deletion path (force_clear_and_delete_crd helper), and the post-flight exact-name warning. A CRD annotated to a foreign Helm release is now skipped at every stage, so --delete-orphan-crds cannot remove another tenant's CRD and post-flight cannot mis-suggest one for manual `kubectl delete crd` removal. Unannotated CRDs (runtime-installed by an operator, chart crds/ without annotations) remain fair game: that is the entire point of the explicit list.

Also:
- adds the missing nvidia-dra-driver-gpu entry to extra_crds_for_release() (computedomains, computedomaincliques)
- guards the read < <(jsonpath) regression: jsonpath output had no trailing newline, so `read` returned 1 at EOF and `|| return 0` swallowed the delete. Fixed by switching to a single `kubectl get crd -o json` + jq parse, which also centralizes the foreign-release guard and the plural/group/scope lookup.
- malformed metadata no longer aborts cleanup silently: the helper warns to stderr and attempts a direct kubectl delete crd instead
- test is now table-driven and covers: cluster-scope with own release, namespaced-scope with own release (patch -n), foreign release skip, unannotated (fair game), and malformed metadata fallback; plus marker assertions that the helper guard and the post-flight ownership filter are both wired
- updates docs/user/cli-reference.md: both meta.helm.sh/release-name AND meta.helm.sh/release-namespace must match for default deletion; the foreign-release guard also applies under --delete-orphan-crds
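The `read < <(jsonpath)` regression described in the commit message is easy to reproduce. This sketch emulates kubectl's newline-free jsonpath output with `printf '%s'`; the `emit_jsonpath` helper and the CRD name are hypothetical:

```shell
#!/usr/bin/env bash
set -euo pipefail

# kubectl -o jsonpath emits no trailing newline; emulate that here.
emit_jsonpath() { printf '%s' 'widgets.example.com'; }

# read assigns the partial last line but returns non-zero at EOF, so a
# trailing "|| return 0" would silently skip the work that follows.
status=0
read -r crd < <(emit_jsonpath) || status=$?

echo "crd=${crd} status=${status}"
```

The variable is populated yet `read` exits 1, which is exactly why `|| return 0` swallowed the delete; parsing a single `kubectl get crd -o json` with jq sidesteps the trap entirely.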
Summary
Fixes bugs in `pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`, which generates the bundle `undeploy.sh`, by hardening pre-flight checks, cleanup behavior, and post-flight diagnostics.

Motivation / Context
The generated `undeploy.sh` had a few real correctness gaps: `kubectl`/`jq` failures in cleanup could abort the script mid-run under `set -euo pipefail`. This PR fixes those behaviors in the `undeploy.sh` template so newly generated bundles get the corrected undeploy flow.

Fixes: N/A
Related:
- #406: [Bug]: undeploy.sh hangs when CRs have finalizers and controller is already uninstalled
- #561: fix(bundler): add pre-flight finalizer check to undeploy.sh (#406)
Component(s) Affected
- `pkg/bundler`, `pkg/component/*`
- `docs/`, `examples/`
1. Cleanup pipelines are now best-effort under `set -euo pipefail`

The template now emits warning-and-continue wrappers for cleanup pipelines that were previously able to abort `undeploy.sh` mid-run on a single transient API failure.

That applies to:

- `delete_release_cluster_resources`
- `force_clear_namespace_finalizers`

This keeps undeploy moving into later cleanup phases and post-flight reporting instead of exiting halfway through cleanup.
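The wrapper shape can be sketched in plain bash; `best_effort` and the message text are illustrative names, not the template's actual identifiers:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Downgrade a cleanup failure to a stderr warning so later cleanup phases
# and post-flight reporting still run.
best_effort() {
  local desc="$1"; shift
  if ! "$@"; then
    echo "Warning: ${desc} failed; continuing undeploy" >&2
  fi
}

best_effort "delete release cluster resources" false  # simulated transient failure
best_effort "clear namespace finalizers" true
echo "reached post-flight"
```

Without the wrapper, the first failing command would terminate the script under `set -e` before post-flight reporting ever ran.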
2. Pre-flight CRD discovery is release-scoped and explicit
The final pre-flight design intentionally stays narrow. The template now checks CRDs from only these sources:
- CRDs rendered in `helm get manifest` output
- CRDs annotated with `meta.helm.sh/release-*`
- an explicit `extra_crds_for_release()` list for known `crds/`-installed operator CRDs that matter for undeploy safety

The explicit override list is currently limited to:

- `gpu-operator`
- `kai-scheduler`
- `k8s-nim-operator`
- `kubeflow-trainer`
- `kube-prometheus-stack`
- `dynamo-platform`
- `network-operator`

This preserves the safety value of pre-flight without reintroducing broader API-group ownership inference across the cluster.
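The override list boils down to a release-name-to-CRDs mapping; a hypothetical sketch of that shape is below. The CRD names are illustrative placeholders, not the template's actual data:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Map a release name to the CRDs its chart installs via crds/, which Helm
# neither renders in `helm get manifest` nor removes on uninstall.
# CRD names below are placeholders for illustration only.
extra_crds_for_release() {
  case "$1" in
    gpu-operator)     echo "clusterpolicies.example.com" ;;
    network-operator) echo "nicclusterpolicies.example.com" ;;
    *)                return 0 ;;  # no extra CRDs known for this release
  esac
}

extra_crds_for_release gpu-operator
extra_crds_for_release some-other-release  # prints nothing, still exits 0
```

Keeping the mapping explicit is what lets undeploy stay release-scoped instead of guessing ownership from API groups.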
3. Pre-flight now skips known manifest-managed false-positive cases
Some releases in the bundle create custom resources from `manifests/` that undeploy deletes before uninstalling the controller. Those resources can safely carry controller finalizers during undeploy, so pre-flight should not block on them.

The template now handles those cases with an explicit `skip_preflight_for_release()` helper instead of trying to infer manifest ownership dynamically:

- `skyhook-operator`
- `kgateway`

That is the deliberate simplification: explicit known skips instead of broader manifest-parsing / ownership inference.
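A minimal sketch of that helper's shape, assuming a plain case statement over the two release names the PR calls out:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Explicit allow-list of releases whose manifests/-created CRs may
# legitimately carry controller finalizers while undeploy runs.
skip_preflight_for_release() {
  case "$1" in
    skyhook-operator|kgateway) return 0 ;;  # names taken from this PR description
    *)                         return 1 ;;
  esac
}

if skip_preflight_for_release kgateway; then
  echo "pre-flight finalizer check skipped for kgateway"
fi
```

The point of the case statement is that adding a new known-safe release is a one-line diff, with no ownership inference to audit.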
4. Retained pre-flight checks now fail closed and are easier to diagnose
For the remaining pre-flight checks, the generated script now:
- fetches the cluster CRD listing with `kubectl get crd -o json` once per run
- fails closed when a `kubectl ... -o json` call errors out
- routes stderr warnings separately so they cannot poison JSON parsing

The intent is: if the script cannot safely determine whether retained release-owned CRDs still have stuck finalizers, it should say so clearly instead of silently passing pre-flight.
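One way to express the fail-closed caching, with the fetch command injectable so the sketch runs without a cluster (`fetch_crds_once` and `CRD_CACHE` are hypothetical names, not the template's):

```shell
#!/usr/bin/env bash
set -euo pipefail

CRD_CACHE=""

# Cache the CRD listing once per run; on failure, warn and return nonzero
# instead of pretending the cluster has no CRDs.
fetch_crds_once() {
  local fetch_cmd="$1"  # e.g. "kubectl get crd -o json"; simple commands only
  if [[ -n "${CRD_CACHE}" ]]; then return 0; fi
  if ! CRD_CACHE="$(${fetch_cmd} 2>/dev/null)" || [[ -z "${CRD_CACHE}" ]]; then
    echo "Warning: could not list CRDs; refusing to pass pre-flight" >&2
    return 1
  fi
}

fetch_crds_once "echo {}" && echo "cached ${#CRD_CACHE} bytes of CRD JSON"
CRD_CACHE=""
fetch_crds_once "false" || echo "caller must treat this as a pre-flight failure"
```

Caching plus an explicit error path is what separates "the cluster has no stuck CRDs" from "the script could not check".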
5. Shared-cluster cleanup is intentionally conservative
The template no longer emits destructive cleanup that deletes unannotated CRDs based only on API-group matching.
Destructive cleanup is limited to CRDs that can be attributed to a just-uninstalled Helm release. Post-flight then reports leftover Helm-owned CRDs still present after uninstall.
That is the intended shared-cluster safety tradeoff: avoid deleting foreign-tenant CRDs just because they share an API group.
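The attribution test itself is small; a sketch of the both-annotations-must-match guard (the function name is illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail

# A CRD is attributable to the just-uninstalled release only when both Helm
# ownership annotations match: release name AND release namespace.
crd_owned_by_release() {
  local ann_name="$1" ann_ns="$2" release="$3" namespace="$4"
  [[ "${ann_name}" == "${release}" && "${ann_ns}" == "${namespace}" ]]
}

crd_owned_by_release my-rel my-ns my-rel my-ns    && echo "ours: eligible for cleanup"
crd_owned_by_release other-rel other-ns my-rel my-ns || echo "foreign tenant: skip"
```

Requiring the namespace annotation as well as the name is what stops two tenants who happen to reuse a release name from deleting each other's CRDs.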
6. Post-flight now uses one cached CRD listing for both warning paths
The post-flight section now fetches CRDs once and reuses that cached JSON for:
- leftover Helm-owned CRD warnings
- chart `crds/` CRD warnings

That keeps post-flight cheaper and avoids repeating the same cluster-wide CRD fetch inside the component loop.
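The fetch-once-reuse-twice shape can be sketched as follows; the `printf` stands in for the single cluster-wide `kubectl get crd` call, and all names are illustrative:

```shell
#!/usr/bin/env bash
set -euo pipefail

# One fetch, cached as text; both warning passes grep the cache instead of
# re-querying the cluster per component.
crd_listing="$(printf 'foo.helm-owned\nbar.crds-installed\n')"

warn_if_present() {
  local pattern="$1" msg="$2"
  if grep -q "${pattern}" <<<"${crd_listing}"; then
    echo "Warning: ${msg}" >&2
  fi
}

warn_if_present 'helm-owned'     "leftover Helm-owned CRDs still present"
warn_if_present 'crds-installed' "crds/-installed CRDs still present"
```

With N components, this turns N cluster-wide listings into one, while keeping both warning paths independent.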
7. Namespace finalizer clearing now warns on discovery failure
`force_clear_namespace_finalizers` no longer silently returns when `kubectl api-resources` fails. It now emits a warning so operators can see why a namespace may remain stuck in `Terminating`.

8. `--skip-preflight` is now documented

The CLI reference now documents `--skip-preflight`, which the script already supports and which fail-closed pre-flight errors point operators to explicitly.

Testing
Commands: `go test -race ./pkg/bundler/deployer/helm/...`, `make qualify`

- `go test -race ./pkg/bundler/deployer/helm/...` passed.
- `make qualify` was rerun in a clean temporary worktree and passed through tests, coverage, vet, lint, and e2e.
- The `grype` vulnerability scan was still running when I stopped waiting on that earlier clean-worktree run, so I am not claiming a fully completed `make qualify` for the code changes.
- The docs-only changes were not re-run through `make qualify`, because they did not change the repo code.
- `pkg/bundler/deployer/helm` coverage remains 88.1%, unchanged from `origin/main`.

Risk Assessment
Rollout notes: This changes undeploy safety checks and cleanup behavior in the generated bundle script, but it does not change the bundle format or deploy path. The main user-visible differences are: safer handling of transient cleanup failures, better pre-flight coverage for release-owned CRDs, explicit skipping of known manifest-managed cases, and clearer post-flight leftover diagnostics.
Checklist
- `make test` (with `-race`)
- `make lint`
- Signed commits (`git commit -S`)