
fix(bundler): fix undeploy template pre/post-flight checks #602

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:fix/undeploy-hardening-combined
Apr 17, 2026
Conversation

Contributor

@yuanchen8911 yuanchen8911 commented Apr 17, 2026

Summary

Fixes bugs in pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (the template that generates the bundle's undeploy.sh) by hardening pre-flight checks, cleanup behavior, and post-flight diagnostics.

Motivation / Context

The generated undeploy.sh had a few real correctness gaps:

  • transient kubectl / jq failures in cleanup could abort the script mid-run under set -euo pipefail
  • pre-flight could miss some release-owned CRDs, or falsely block safe undeploys for bundle-managed resources that are deleted before controller uninstall
  • post-flight did not clearly surface all leftover Helm-owned CRDs after uninstall

This PR fixes those behaviors in the undeploy.sh template so newly generated bundles get the corrected undeploy flow.

Fixes: N/A
Related:

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • Bundlers (pkg/bundler, pkg/component/*)
  • Docs/examples (docs/, examples/)

Implementation Notes

1. Cleanup pipelines are now best-effort under set -euo pipefail

The template now emits warning-and-continue wrappers for cleanup pipelines that were previously able to abort undeploy.sh mid-run on a single transient API failure.

That applies to:

  • delete_release_cluster_resources
  • force_clear_namespace_finalizers
  • the inline Helm-owned orphan-CRD cleanup loop

This keeps undeploy moving into later cleanup phases and post-flight reporting instead of exiting halfway through cleanup.
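The warn-and-continue pattern can be sketched as follows. This is an illustrative reconstruction, not the template's actual code: `run_cleanup_step` and the stubbed `kubectl` are hypothetical names, and the stub simulates a transient API failure.

```shell
#!/usr/bin/env bash
# Sketch of best-effort cleanup under `set -euo pipefail`.
# `run_cleanup_step` and the stubbed `kubectl` are hypothetical, not the
# template's real identifiers.
set -euo pipefail

kubectl() {  # stub simulating a transient API failure
  echo "error: connection refused" >&2
  return 1
}

run_cleanup_step() {
  local desc="$1"; shift
  # A failing command warns instead of aborting the whole script mid-run.
  if ! "$@" 2>/dev/null; then
    echo "Warning: ${desc} failed; continuing with remaining cleanup" >&2
  fi
}

run_cleanup_step "delete release cluster resources" kubectl delete clusterrole example-role
echo "reached post-flight"
```

The key property is that a single transient failure still lets the script reach the later cleanup phases and post-flight reporting.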

2. Pre-flight CRD discovery is release-scoped and explicit

The final pre-flight design intentionally stays narrow. The template now checks CRDs from only these sources:

  • CRDs present in helm get manifest
  • CRDs annotated to the Helm release via meta.helm.sh/release-*
  • a small explicit extra_crds_for_release() list for known crds/-installed operator CRDs that matter for undeploy safety

The explicit override list is currently limited to:

  • gpu-operator
  • kai-scheduler
  • k8s-nim-operator
  • kubeflow-trainer
  • kube-prometheus-stack
  • dynamo-platform
  • network-operator

This preserves the safety value of pre-flight without reintroducing broader API-group ownership inference across the cluster.
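The three-source aggregation might look roughly like this, with `helm`, the annotation query, and the override list all stubbed so the sketch runs standalone; the function names and sample CRD names are illustrative, not the template's.

```shell
#!/usr/bin/env bash
# Sketch of release-scoped CRD discovery from three explicit sources,
# deduplicated with `sort -u`. All names here are hypothetical.
set -euo pipefail

helm() {  # stub: a manifest containing one CRD
  printf 'kind: CustomResourceDefinition\nmetadata:\n  name: widgets.example.com\n'
}
annotated_crds_for() {  # stub for the kubectl | jq meta.helm.sh annotation query
  echo "gadgets.example.com"
}
extra_crds_for_release() {  # small explicit per-release override list
  case "$1" in
    gpu-operator) echo "clusterpolicies.nvidia.com" ;;
    *) ;;  # no extras for unlisted releases
  esac
}

crds_for_release() {
  local release="$1" ns="$2"
  {
    helm get manifest "${release}" -n "${ns}" 2>/dev/null \
      | awk '/name:/{print $2}' || true
    annotated_crds_for "${release}" "${ns}" || true
    extra_crds_for_release "${release}"
  } | sort -u
}

crds_for_release gpu-operator gpu-operator
```

Because every source is explicit, nothing outside the release's manifest, annotations, or the curated list is ever considered.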

3. Pre-flight now skips known manifest-managed false-positive cases

Some releases in the bundle create custom resources from manifests/ that undeploy deletes before uninstalling the controller. Those resources can safely carry controller finalizers during undeploy, so pre-flight should not block on them.

The template now handles those cases with an explicit skip_preflight_for_release() helper instead of trying to infer manifest ownership dynamically:

  • skyhook-operator
  • kgateway

That is the deliberate simplification: explicit known skips instead of broader manifest-parsing / ownership inference.
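An explicit skip list of this kind reduces to a small case statement; the sketch below is hypothetical, not the template's exact helper body.

```shell
#!/usr/bin/env bash
# Sketch of an explicit pre-flight skip list for known manifest-managed
# releases. The helper name mirrors the PR description; the body is
# illustrative.
set -euo pipefail

skip_preflight_for_release() {
  case "$1" in
    skyhook-operator|kgateway) return 0 ;;  # CRs deleted before controller uninstall
    *) return 1 ;;
  esac
}

for release in skyhook-operator gpu-operator; do
  if skip_preflight_for_release "${release}"; then
    echo "pre-flight: skipping ${release} (manifest-managed resources)"
  else
    echo "pre-flight: checking ${release}"
  fi
done
```

Adding a new known-safe release is a one-line edit, which is the tradeoff accepted here instead of manifest-parsing or ownership inference.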

4. Retained pre-flight checks now fail closed and are easier to diagnose

For the remaining pre-flight checks, the generated script now:

  • caches kubectl get crd -o json once per run
  • preserves successful kubectl ... -o json stderr warnings without poisoning JSON parsing
  • surfaces discovery failures instead of silently treating them as "no CRDs"
  • prints per-release progress during pre-flight

The intent is: if the script cannot safely determine whether retained release-owned CRDs still have stuck finalizers, it should say so clearly instead of silently passing pre-flight.
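One way to cache the CRD listing once while keeping stderr warnings visible and failing closed on discovery errors is sketched below; `kubectl` is stubbed, and `fetch_crds_once` plus `CRD_CACHE` are hypothetical names, not the generated script's.

```shell
#!/usr/bin/env bash
# Sketch: cache `kubectl get crd -o json` once per run, surface stderr
# warnings without mixing them into the JSON, and fail closed when the
# discovery call itself errors. Names are illustrative.
set -euo pipefail

kubectl() {  # stub: succeed, but emit a warning on stderr
  echo "Warning: v1beta1 CustomResourceDefinition is deprecated" >&2
  echo '{"items":[]}'
}

CRD_CACHE=""
fetch_crds_once() {
  [[ -n "${CRD_CACHE}" ]] && return 0
  local err
  err=$(mktemp)
  if ! CRD_CACHE=$(kubectl get crd -o json 2>"${err}"); then
    cat "${err}" >&2; rm -f "${err}"
    echo "Error: CRD discovery failed; rerun with --skip-preflight to bypass" >&2
    return 1  # fail closed instead of treating failure as "no CRDs"
  fi
  cat "${err}" >&2  # preserve warnings from a successful call
  rm -f "${err}"
}

fetch_crds_once
echo "${CRD_CACHE}"
```

Redirecting stderr to a temp file keeps the captured stdout parseable as JSON while still letting operators see what the API server said.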

5. Shared-cluster cleanup is intentionally conservative

The template no longer emits destructive cleanup that deletes unannotated CRDs based only on API-group matching.

Destructive cleanup is limited to CRDs that can be attributed to a just-uninstalled Helm release. Post-flight then reports leftover Helm-owned CRDs still present after uninstall.

That is the intended shared-cluster safety tradeoff: avoid deleting foreign-tenant CRDs just because they share an API group.

6. Post-flight now uses one cached CRD listing for both warning paths

The post-flight section now fetches CRDs once and reuses that cached JSON for:

  • leftover Helm-annotated CRD warnings
  • leftover explicit-name crds/ CRD warnings

That keeps post-flight cheaper and avoids repeating the same cluster-wide CRD fetch inside the component loop.
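The fetch-once-reuse-twice shape can be sketched as below, with `kubectl` stubbed so the example is self-contained; the ownership markers and CRD names are invented for illustration.

```shell
#!/usr/bin/env bash
# Sketch: one cluster-wide CRD fetch, reused by both post-flight warning
# paths instead of re-fetching inside the component loop. All names are
# hypothetical.
set -euo pipefail

kubectl() {  # stub: two CRDs, one marked Helm-owned
  printf 'widgets.example.com helm-owned\ngadgets.example.com unowned\n'
}

crd_listing=$(kubectl get crd)   # fetched once, before the component loop

# Path 1: leftover Helm-owned CRD warnings
echo "${crd_listing}" | awk '$2=="helm-owned"{print "Warning: leftover Helm-owned CRD: "$1}'

# Path 2: leftover explicit-name crds/ CRD warnings, no re-fetch per component
for name in widgets.example.com missing.example.com; do
  if echo "${crd_listing}" | grep -q "^${name} "; then
    echo "Warning: leftover crds/ CRD: ${name}"
  fi
done
```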

7. Namespace finalizer clearing now warns on discovery failure

force_clear_namespace_finalizers no longer silently returns when kubectl api-resources fails. It now emits a warning so operators can see why a namespace may remain stuck in Terminating.

8. --skip-preflight is now documented

The CLI reference now documents --skip-preflight, which the script already supports and which fail-closed pre-flight errors point operators to explicitly.

Testing

go test -race ./pkg/bundler/deployer/helm/...
make qualify
  • go test -race ./pkg/bundler/deployer/helm/... passed.
  • make qualify was rerun in a clean temporary worktree and passed the tests, coverage, vet, lint, and e2e stages.
  • The final grype vulnerability scan was still running when I stopped waiting on that clean-worktree run, so I am not claiming a fully completed make qualify for these code changes.
  • Subsequent title/body-only edits to the PR did not rerun the full make qualify, since they did not change repository code.
  • pkg/bundler/deployer/helm coverage remains 88.1%, unchanged from origin/main.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: This changes undeploy safety checks and cleanup behavior in the generated bundle script, but it does not change the bundle format or deploy path. The main user-visible differences are: safer handling of transient cleanup failures, better pre-flight coverage for release-owned CRDs, explicit skipping of known manifest-managed cases, and clearer post-flight leftover diagnostics.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)


coderabbitai Bot commented Apr 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a unit test that stubs kubectl to ensure the generated undeploy script warns on transient kubectl pipeline failures and continues; updates the undeploy script template to improve CRD discovery (manifest + live cluster), add release_groups(), warn on pipeline failures, and add post‑flight Helm-annotation CRD checks.

Changes

  • Test Addition (pkg/bundler/deployer/helm/helm_test.go): Added TestUndeployScript_TransientFailureWarnsAndContinues, which generates undeploy.sh, installs a temporary kubectl shim in PATH, executes extracted cleanup sections under bash -euo pipefail, and asserts each section exits successfully while emitting Warning: on simulated transient failures. Added bytes, os/exec, and time imports for subprocess handling and timeouts.
  • Undeploy Script Enhancements (pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl): Added release_groups() helper; changed CRD discovery in check_release_for_stuck_crds() to aggregate CRDs from helm get manifest and live annotated CRDs (deduplicated via sort -u); updated cleanup loops (delete_release_cluster_resources(), force_clear_namespace_finalizers(), orphan-CRD deletion) to append an explicit Warning: on pipeline failures while preserving per-item …

Sequence Diagram(s)

sequenceDiagram
  participant Test as Test Runner
  participant Script as Generated undeploy.sh
  participant Kubectl as kubectl shim
  participant KubeAPI as Kubernetes API

  Test->>Script: generate & extract cleanup snippets
  Test->>Kubectl: install shim in PATH (stubbed behavior)
  Test->>Script: execute snippet under "bash -euo pipefail"
  Script->>Kubectl: kubectl api-resources / get / delete / patch / helm get manifest
  Kubectl->>KubeAPI: forward allowed calls (api-resources succeeds)
  KubeAPI-->>Kubectl: resource responses
  Kubectl-->>Script: non-zero exit + stderr for simulated transient commands
  Script-->>Test: emit "Warning:" to stderr and continue (exit 0)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hop through manifests and live API streams,

I whisper "Warning" when a pipeline misteers,
I gather CRDs from charts and sky,
I nudge finalizers and let deletes try,
A tidy undeploy — soft rabbit cheers!

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 50.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Title check (❓ Inconclusive): The title claims to fix 'pre/post-flight checks', but the PR primarily addresses transient-failure resilience, CRD discovery hardening, and orphan-CRD detection in undeploy.sh. A more precise title such as 'fix(bundler): harden undeploy.sh resilience and CRD discovery' would better reflect the scope.

✅ Passed checks (1 passed)

  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch from 97a5ac7 to 2f3a1c8 Compare April 17, 2026 00:33

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/deployer/helm/helm_test.go`:
- Around line 18-21: The test invokes exec.CommandContext with
context.Background() (see the CommandContext call in helm_test.go) which can
hang; update the test to create a cancellable timeout context via
context.WithTimeout (add "time" to imports), use the returned ctx for
exec.CommandContext, defer cancel(), and ensure the subprocess is killed if the
timeout elapses (handle context.DeadlineExceeded appropriately in the test
assertions or error handling).

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 204-210: The current manifest capture line uses "manifest=$(helm
get manifest \"${release}\" -n \"${ns}\" 2>/dev/null) || return 0" which aborts
the function on helm failure and prevents computing annotated_crds; change the
failure handling so helm errors do not return (e.g. set manifest to empty on
failure) so that manifest_crds is empty but annotated_crds is still computed;
update the "manifest=$(helm get manifest ... )" invocation to avoid "return 0"
on failure and ensure annotated_crds (from kubectl/jq) still runs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: eaf6e389-a682-44de-ae94-9057814881a7

📥 Commits

Reviewing files that changed from the base of the PR and between b82af6f and 97a5ac7.

📒 Files selected for processing (2)
  • pkg/bundler/deployer/helm/helm_test.go
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

Comment thread pkg/bundler/deployer/helm/helm_test.go
Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/deployer/helm/helm_test.go`:
- Around line 2305-2306: The test currently creates a context with ctx :=
context.Background() and uses it with exec.CommandContext (in helm_test.go),
which can hang; replace it with a cancellable timeout context using
context.WithTimeout (e.g., ctx, cancel :=
context.WithTimeout(context.Background(), <reasonable duration>)) and defer
cancel() so subprocess calls via exec.CommandContext will be killed after the
timeout; update any test helpers or variables that depend on ctx accordingly and
choose a duration appropriate for CI (e.g., 10–30s) to prevent wedged bash
invocations from hanging go test -race.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 4fe412ac-ee75-49f2-a1d3-13aad8c3ba27

📥 Commits

Reviewing files that changed from the base of the PR and between 97a5ac7 and 2f3a1c8.

📒 Files selected for processing (2)
  • pkg/bundler/deployer/helm/helm_test.go
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

Comment thread pkg/bundler/deployer/helm/helm_test.go
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch from 2f3a1c8 to b6bde3b Compare April 17, 2026 00:43
@github-actions github-actions Bot added size/L and removed size/M labels Apr 17, 2026

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (2)
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (1)

229-234: ⚠️ Potential issue | 🟠 Major

helm get manifest failure still bypasses annotation-based discovery.

Line 229 uses || return 0, which exits the function before annotated_crds (line 232) and group_crds (line 239) are computed. This defeats the three-source discovery design: if Helm is flaky but CRDs exist with annotations or matching API groups, the pre-flight check will miss them.

Suggested fix
-  manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0
+  manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true)
   manifest_crds=$(echo "${manifest}" \
     | awk '/^kind:/{kind=$2} /^  name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}')
   annotated_crds=$(kubectl get crd -o json 2>/dev/null \
     | jq -r --arg rel "${release}" --arg ns "${ns}" \
       '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)
+  # Skip if release has no CRDs from any source
+  if [[ -z "${manifest_crds}" && -z "${annotated_crds}" && -z "${group_crds}" ]]; then
+    return 0
+  fi

Move the early return after all three discovery sources are computed, and only return if all are empty.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 229 - 234,
The early return after running helm get manifest prevents annotated_crds and
group_crds discovery from running; update the undeploy logic so that you still
capture manifest (variable manifest) and compute manifest_crds, annotated_crds
(from kubectl/jq) and group_crds before deciding to return: move the current
"return 0" so it only executes when all three discovery results (manifest_crds,
annotated_crds, group_crds) are empty, ensuring annotation-based and
API-group-based CRD detection (variables manifest_crds, annotated_crds,
group_crds) still run even if helm get manifest fails.
pkg/bundler/deployer/helm/helm_test.go (1)

2305-2306: ⚠️ Potential issue | 🟡 Minor

Add timeout to subprocess execution context.

ctx := context.Background() is used with exec.CommandContext at line 2381. A wedged bash invocation will hang go test -race indefinitely.

Suggested fix

Add "time" to imports, then:

-	ctx := context.Background()
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
 	outputDir := t.TempDir()

As per coding guidelines: "Use context.WithTimeout() for all I/O operations."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/helm_test.go` around lines 2305 - 2306, The test
uses a background context (ctx := context.Background()) passed into
exec.CommandContext which can hang; change to use context.WithTimeout to bound
subprocess execution: import "time", create ctx, cancel :=
context.WithTimeout(context.Background(), <reasonable duration>) and defer
cancel(), then pass that ctx into exec.CommandContext (the same variable used in
the test and referenced where exec.CommandContext is invoked) so the subprocess
is killed on timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/bundler/deployer/helm/helm_test.go`:
- Around line 2305-2306: The test uses a background context (ctx :=
context.Background()) passed into exec.CommandContext which can hang; change to
use context.WithTimeout to bound subprocess execution: import "time", create
ctx, cancel := context.WithTimeout(context.Background(), <reasonable duration>)
and defer cancel(), then pass that ctx into exec.CommandContext (the same
variable used in the test and referenced where exec.CommandContext is invoked)
so the subprocess is killed on timeout.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 229-234: The early return after running helm get manifest prevents
annotated_crds and group_crds discovery from running; update the undeploy logic
so that you still capture manifest (variable manifest) and compute
manifest_crds, annotated_crds (from kubectl/jq) and group_crds before deciding
to return: move the current "return 0" so it only executes when all three
discovery results (manifest_crds, annotated_crds, group_crds) are empty,
ensuring annotation-based and API-group-based CRD detection (variables
manifest_crds, annotated_crds, group_crds) still run even if helm get manifest
fails.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 2f28835a-86de-4f9d-ab57-f59795447b9c

📥 Commits

Reviewing files that changed from the base of the PR and between 2f3a1c8 and b6bde3b.

📒 Files selected for processing (2)
  • pkg/bundler/deployer/helm/helm_test.go
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch 3 times, most recently from 527871e to 7191b51 Compare April 17, 2026 01:04

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (2)
pkg/bundler/deployer/helm/helm_test.go (1)

2370-2371: ⚠️ Potential issue | 🟡 Minor

Add timeout to subprocess execution context.

ctx := context.Background() is used with exec.CommandContext at line 2446. A wedged bash invocation will hang go test -race indefinitely.

-	ctx := context.Background()
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()

Also add "time" to imports.

As per coding guidelines: "Use context.WithTimeout() for all I/O operations."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/helm_test.go` around lines 2370 - 2371, Replace the
background context used for subprocesses with a cancellable timeout context:
instead of ctx := context.Background() create a ctx, cancel :=
context.WithTimeout(context.Background(), 10*time.Second) and defer cancel() so
exec.CommandContext uses a timeout; update the import list to include "time".
Ensure references to ctx (used with exec.CommandContext) remain unchanged and
that cancel() is deferred in the same scope where ctx is created.
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (1)

229-250: ⚠️ Potential issue | 🟠 Major

Don't let helm get manifest disable the annotation-based fallback.

Line 233's || return 0 exits the function on any helm failure, preventing annotated_crds from being computed. This reintroduces false-negative pre-flight behavior for cases where Helm is flaky but CRDs with Helm annotations are still present.

-  manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0
+  manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true)
   manifest_crds=$(echo "${manifest}" \
     | awk '/^kind:/{kind=$2} /^  name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}')
   annotated_crds=$(kubectl get crd -o json 2>/dev/null \
     | jq -r --arg rel "${release}" --arg ns "${ns}" \
       '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)
+  # Exit early only if BOTH sources are empty
+  if [[ -z "${manifest_crds}" && -z "${annotated_crds}" ]]; then
+    return 0
+  fi


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 229 - 250,
In check_release_for_stuck_crds change the helm manifest fetch so a failure
doesn't short-circuit the function: replace the current manifest assignment that
uses "helm get manifest ... 2>/dev/null || return 0" with a form that swallows
errors but continues (e.g., "manifest=$(helm get manifest ... 2>/dev/null ||
true)"), so annotated_crds and the fallback annotation-based logic still run;
keep the rest of the function (manifest_crds, annotated_crds, sort -u, while
loop, and check_crd_for_stuck_resources) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/bundler/deployer/helm/helm_test.go`:
- Around line 2370-2371: Replace the background context used for subprocesses
with a cancellable timeout context: instead of ctx := context.Background()
create a ctx, cancel := context.WithTimeout(context.Background(),
10*time.Second) and defer cancel() so exec.CommandContext uses a timeout; update
the import list to include "time". Ensure references to ctx (used with
exec.CommandContext) remain unchanged and that cancel() is deferred in the same
scope where ctx is created.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 229-250: In check_release_for_stuck_crds change the helm manifest
fetch so a failure doesn't short-circuit the function: replace the current
manifest assignment that uses "helm get manifest ... 2>/dev/null || return 0"
with a form that swallows errors but continues (e.g., "manifest=$(helm get
manifest ... 2>/dev/null || true)"), so annotated_crds and the fallback
annotation-based logic still run; keep the rest of the function (manifest_crds,
annotated_crds, sort -u, while loop, and check_crd_for_stuck_resources)
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 221f0650-259b-49c6-a811-064333c9dd68

📥 Commits

Reviewing files that changed from the base of the PR and between b6bde3b and 527871e.

📒 Files selected for processing (2)
  • pkg/bundler/deployer/helm/helm_test.go
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch from 7191b51 to a66afd3 Compare April 17, 2026 01:12

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 240-245: The pipeline that sets annotated_crds can fail and
currently yields an empty string which makes the subsequent fast-path treat
failure as "no CRDs"; modify the logic around annotated_crds to detect pipeline
failure (e.g., enable pipefail or inspect the pipeline exit status) and only
consider annotated_crds empty when the command succeeded and produced no
results—if the kubectl|jq pipeline fails, surface an error or set a non-empty
sentinel so the [[ -z "${manifest_crds}" && -z "${annotated_crds}" ]] fast-path
is not taken; update the block that defines annotated_crds and the following
conditional to bail or continue safely when the discovery command fails
(referencing annotated_crds, manifest_crds and the kubectl/jq pipeline).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 180f99cb-afd1-4bdf-b5fc-d158435a5404

📥 Commits

Reviewing files that changed from the base of the PR and between 527871e and 7191b51.

📒 Files selected for processing (2)
  • pkg/bundler/deployer/helm/helm_test.go
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

Comment thread pkg/bundler/deployer/helm/helm_test.go
Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl Outdated
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch 2 times, most recently from cc2b6a3 to cabc255 Compare April 17, 2026 01:28
@github-actions github-actions Bot added size/XL and removed size/L labels Apr 17, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch 2 times, most recently from 7296352 to 9497ad0 Compare April 17, 2026 01:44

@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch from e0b7b97 to 1f18857 Compare April 17, 2026 02:47
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch 2 times, most recently from 919ffac to bb9462c Compare April 17, 2026 05:23
@yuanchen8911 yuanchen8911 marked this pull request as ready for review April 17, 2026 05:59
@yuanchen8911 yuanchen8911 changed the title fix(bundler): harden undeploy.sh and simplify preflight fix(bundler): fix undeploy.sh pre/post-flight checks Apr 17, 2026
@yuanchen8911 yuanchen8911 changed the title fix(bundler): fix undeploy.sh pre/post-flight checks fix(bundler): fix undeploy template pre/post-flight checks Apr 17, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch from bb9462c to 73f09b9 Compare April 17, 2026 07:16
Member

@mchmarny mchmarny left a comment


Solid hardening of the undeploy template. The key design decisions are well-reasoned: fail-closed on API errors (with clear --skip-preflight escape hatch), explicit CRD override lists over dynamic group inference (shared-cluster safety), and best-effort cleanup pipelines that warn instead of aborting under set -euo pipefail. The removal of ORPHANED_CRD_GROUPS group-based deletion is the most important safety fix — it eliminates a real cross-tenant CRD deletion risk on shared clusters.

Test coverage is thorough: 11 new shell-behavior tests with kubectl/helm stubs that exercise transient failures, annotation discovery, explicit CRD discovery, skip lists, foreign-annotation safety, fail-closed behavior, stderr-preserving JSON parsing, post-flight warnings, kustomize-only bundles, and dynamo platform CRDs.

Two nits on maintainability (the explicit CRD/skip lists need coordinated updates when new operators are added), nothing blocking. CI is green. Note: PR is behind main (mergeable_state: behind) — will need a rebase before merge.

Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
Comment thread pkg/bundler/deployer/helm/helm_test.go
Contributor

@lockwobr lockwobr left a comment


Thanks for the careful work here — the bug analysis and test coverage are solid, and I can see each change maps to a real failure mode.

My main hesitation is about the cumulative complexity of undeploy.sh.tmpl. This is roughly the 7th PR hardening this template (#253, #282, #364, #416, #477, #561, now this one), and each round understandably adds more branches, helpers, and component-specific knowledge. Net of this PR we're at ~700 lines of templated bash with two hardcoded component tables and a custom stderr-capture helper.

Being a deploy tool is a non-goal for this project, but each round of hardening here inches us closer to becoming one. At what point do we say this has crossed the line? Worth opening that conversation before we do round 8.

Left a few inline comments along those lines. Mostly suggestions, not blockers.

Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
Comment thread pkg/bundler/deployer/helm/helm_test.go
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-hardening-combined branch from 98d2a3f to e1b2513 Compare April 17, 2026 18:38
@yuanchen8911 yuanchen8911 requested a review from lockwobr April 17, 2026 19:25
@yuanchen8911 yuanchen8911 enabled auto-merge (squash) April 17, 2026 19:39
@yuanchen8911 yuanchen8911 merged commit c12c783 into NVIDIA:main Apr 17, 2026
33 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 23, 2026
PR NVIDIA#602 intentionally scoped auto-deletion to CRDs carrying a Helm
release annotation, so release-scoped CRDs installed outside Helm
discovery (chart crds/ subchart, runtime operator, out-of-band) are
only surfaced as post-flight warnings. That default remains the
correct shared-cluster safety choice.

This adds --delete-orphan-crds for single-tenant teardowns. The
rendered script iterates the existing extra_crds_for_release() list,
force-clears residual finalizers on any remaining CRs (so the
delete does not hang on resource-in-use markers from controllers
that are already gone), and calls kubectl delete crd. Default stays
off; caller must opt in.

The opt-in path reuses the same foreign-release guard that
pre-flight discovery uses (release-name AND release-namespace
annotation match), so --delete-orphan-crds cannot remove a CRD
that another Helm release on the cluster owns. Unannotated CRDs
(runtime-installed by an operator, chart crds/ without annotations)
are fair game — that is the entire point of the explicit list.

Also:
 - adds the missing nvidia-dra-driver-gpu entry to
   extra_crds_for_release() (computedomains, computedomaincliques)
 - guards the read < <(jsonpath) regression: jsonpath output had no
   trailing newline, so `read` returned 1 at EOF and `|| return 0`
   swallowed the delete. Fixed by switching to a single
   `kubectl get crd -o json` + jq parse, which also centralizes
   the foreign-release guard and the plural/group/scope lookup.
 - malformed metadata no longer aborts cleanup silently: the helper
   warns to stderr and attempts a direct kubectl delete crd instead
 - test is now table-driven and covers: cluster-scope with own
   release, namespaced-scope with own release (patch -n), foreign
   release skip, unannotated (fair game), and malformed metadata
   fallback
 - updates docs/user/cli-reference.md: both meta.helm.sh/release-name
   AND meta.helm.sh/release-namespace must match for default
   deletion; the foreign-release guard also applies under
   --delete-orphan-crds
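The `read < <(jsonpath)` regression described above comes from a well-defined bash behavior: `read` returns non-zero when it hits EOF before a newline, even though it has already populated the variable. A minimal reproduction (the CRD name is illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail

# printf with no trailing newline mimics `kubectl get ... -o jsonpath=...`,
# whose output does not end in a newline.
if IFS= read -r name < <(printf 'widgets.example.com'); then
  status=ok
else
  status=eof   # this branch is taken, even though $name was assigned
fi
echo "name=$name status=$status"
```

Combined with `|| return 0`, that spurious failure status silently short-circuited the delete path, which is why the fix moved to a single `kubectl get crd -o json` plus jq parse instead of jsonpath lines.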
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 24, 2026
PR NVIDIA#602 intentionally scoped auto-deletion to CRDs carrying a Helm
release annotation, so release-scoped CRDs installed outside Helm
discovery (chart crds/ subchart, runtime operator, out-of-band) are
only surfaced as post-flight warnings. That default remains the
correct shared-cluster safety choice.

This adds --delete-orphan-crds for single-tenant teardowns. The
rendered script iterates the existing extra_crds_for_release() list,
force-clears residual finalizers on any remaining CRs (so the
delete does not hang on resource-in-use markers from controllers
that are already gone), and calls kubectl delete crd. Default stays
off; caller must opt in.

The same foreign-release guard that pre-flight discovery uses
(release-name AND release-namespace annotation match) is applied in
all three places that can surface bundle-named CRDs: pre-flight
warnings (unchanged from PR NVIDIA#602), the new opt-in deletion path
(force_clear_and_delete_crd helper), and the post-flight exact-name
warning. A CRD annotated to a foreign Helm release is now skipped
at every stage, so --delete-orphan-crds cannot remove another
tenant's CRD and post-flight cannot mis-suggest one for manual
`kubectl delete crd` removal. Unannotated CRDs (runtime-installed
by an operator, chart crds/ without annotations) remain fair game —
that is the entire point of the explicit list.

Also:
 - adds the missing nvidia-dra-driver-gpu entry to
   extra_crds_for_release() (computedomains, computedomaincliques)
 - guards the read < <(jsonpath) regression: jsonpath output had no
   trailing newline, so `read` returned 1 at EOF and `|| return 0`
   swallowed the delete. Fixed by switching to a single
   `kubectl get crd -o json` + jq parse, which also centralizes
   the foreign-release guard and the plural/group/scope lookup.
 - malformed metadata no longer aborts cleanup silently: the helper
   warns to stderr and attempts a direct kubectl delete crd instead
 - test is now table-driven and covers: cluster-scope with own
   release, namespaced-scope with own release (patch -n), foreign
   release skip, unannotated (fair game), and malformed metadata
   fallback; plus marker assertions that the helper guard and the
   post-flight ownership filter are both wired
 - updates docs/user/cli-reference.md: both meta.helm.sh/release-name
   AND meta.helm.sh/release-namespace must match for default
   deletion; the foreign-release guard also applies under
   --delete-orphan-crds
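The foreign-release guard semantics described above can be sketched as a predicate over the two Helm ownership annotations. The function name and the sample values are hypothetical; the template centralizes this check differently, but the decision table is the same: annotated-and-mismatched CRDs are skipped, while own-release and unannotated CRDs remain eligible.

```shell
#!/usr/bin/env bash
set -euo pipefail

# True only when the CRD carries a meta.helm.sh/release-name annotation
# that does not match BOTH the release name and the release namespace.
is_foreign_release_crd() {
  local owner_name="$1" owner_ns="$2"
  [[ -n "$owner_name" && ( "$owner_name" != "$RELEASE_NAME" || "$owner_ns" != "$RELEASE_NS" ) ]]
}

RELEASE_NAME=myrelease
RELEASE_NS=myns

is_foreign_release_crd other other-ns      && echo "foreign release: skip"
is_foreign_release_crd "" ""               || echo "unannotated: fair game"
is_foreign_release_crd myrelease myns      || echo "own release: eligible for deletion"
```

Requiring both annotations to match is what prevents `--delete-orphan-crds` from removing a same-named CRD owned by another tenant's release in a different namespace.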