
feat: add explicit uninstall to support uninstalls that require the config information#200

Merged
ayuskauskas merged 26 commits into main from feat/uninstall-enhancement-collapsed on May 1, 2026

Conversation

@ayuskauskas
Collaborator

@ayuskauskas ayuskauskas commented Apr 21, 2026

Description

Closes: https://github.com/NVIDIA/nodewright/issues/186 (#186)

Checklist

  • I am familiar with the Contributing Guidelines.
  • My commits are signed off (git commit -s) per the DCO.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@ayuskauskas force-pushed the feat/uninstall-enhancement-collapsed branch from 4c39a6d to d2dfe07 on April 21, 2026 at 00:57
@coveralls

coveralls commented Apr 21, 2026

Coverage Report for CI Build 25223484198

Coverage increased (+0.4%) to 81.782%

Details

  • Coverage increased (+0.4%) from the base build.
  • Patch coverage: 112 uncovered changes across 5 files (619 of 731 lines covered, 84.68%).
  • 12 coverage regressions across 3 files.

Uncovered Changes

| File | Changed | Covered | % |
| --- | ---: | ---: | ---: |
| operator/internal/controller/skyhook_controller.go | 366 | 285 | 77.87% |
| operator/api/v1alpha1/skyhook_webhook.go | 61 | 50 | 81.97% |
| operator/api/v1alpha1/zz_generated.deepcopy.go | 15 | 5 | 33.33% |
| operator/internal/controller/pod_controller.go | 30 | 23 | 76.67% |
| operator/internal/controller/cluster_state_v2.go | 182 | 179 | 98.35% |

Coverage Regressions

12 previously-covered lines in 3 files lost coverage.

| File | Lines Losing Coverage | Coverage |
| --- | ---: | ---: |
| operator/internal/controller/skyhook_controller.go | 9 | 83.3% |
| operator/api/v1alpha1/skyhook_types.go | 2 | 89.81% |
| operator/internal/controller/pod_controller.go | 1 | 78.26% |

Coverage Stats

| Metric | Value |
| --- | ---: |
| Relevant Lines | 9463 |
| Covered Lines | 7739 |
| Line Coverage | 81.78% |
| Coverage Strength | 3.93 hits per line |

💛 - Coveralls

@lockwobr
Collaborator

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 22, 2026

✅ Actions performed

Full review triggered.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Walkthrough

Introduces an explicit per-package uninstall API (spec.packages[*].uninstall with enabled/apply) and implements the operator-side behavior: CRD/schema and Go API types (with deepcopy), node-state helpers, admission webhook validation/warnings and downgrade gating, controller reconcilers for explicit uninstall requests and cancellation, StageUninstallInterrupt, finalizer-driven uninstall on CR deletion, node metadata cleanup, blocked/uninstall/malformed Skyhook conditions, interrupt and pod lifecycle updates, wrapper/node cleanup, extensive unit tests, many Chainsaw e2e tests and fixtures, plus documentation and minor metadata/.gitignore tweaks.
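The new API surface described above can be pictured with a minimal sketch. The package name, version, and image are placeholders, and the field names are taken from the walkthrough, not verified against the CRD:

```yaml
apiVersion: skyhook.nvidia.com/v1alpha1
kind: Skyhook
metadata:
  name: example
spec:
  packages:
    mypkg:
      version: "1.2.3"
      image: ghcr.io/nvidia/skyhook/agentless
      uninstall:
        enabled: true   # package supports explicit uninstall
        apply: true     # request the uninstall now; requires enabled: true
```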

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~150 minutes

🚥 Pre-merge checks: ✅ 4 passed

  • Title check (✅ Passed): The title 'feat: add explicit uninstall to support uninstalls that require the config information' clearly and specifically summarizes the main change: implementing explicit uninstall support for packages needing ConfigMap and environment variable information.
  • Description check (✅ Passed): The PR description references issue #186, provides a complete checklist affirming contributor guidelines compliance, DCO sign-off, test coverage, and documentation updates, demonstrating proper context for the uninstall enhancement.
  • Linked Issues check (✅ Passed): The PR successfully implements all core requirements from issue #186: explicit uninstall support with full ConfigMap/env variable access (new uninstall config fields, multiple test scenarios validating uninstall pod creation with proper env/config volumes), webhook enforcement (preventing apply without enabled), and comprehensive documentation plus tests covering the uninstall lifecycle.
  • Out of Scope Changes check (✅ Passed): All changes directly support explicit uninstall functionality: API schema extensions (uninstall fields), controller logic for uninstall orchestration and state management, comprehensive test scenarios validating uninstall workflows, documentation, and helpers, with no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
operator/internal/controller/skyhook_controller.go (1)

600-666: ⚠️ Potential issue | 🔴 Critical

Scope uninstall work to the current node.

toExplicitUninstall is collected once for the whole Skyhook and then prepended to every selected node's queue. On a node where that package is already absent, ApplyPackage() falls back to StageApply, so another node's uninstall request can re-install the package here.

Suggested fix
-		// prepend the uninstall packages so they are ran first
-		toRun = append(toUninstall, toRun...)
+		nodeUninstall := make([]*v1alpha1.Package, 0, len(toUninstall))
+		for _, pkg := range toUninstall {
+			if nodeState.IsUninstallCycleInProgress(pkg.GetUniqueName()) {
+				nodeUninstall = append(nodeUninstall, pkg)
+			}
+		}
+
+		// prepend only the uninstall packages that are active on this node
+		toRun = append(nodeUninstall, toRun...)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@operator/internal/controller/skyhook_controller.go` around lines 600 - 666,
The global toExplicitUninstall list is being prepended to every node's toRun,
causing uninstalls from other nodes to be applied here; instead compute a
node-scoped uninstall list per selected node by filtering toExplicitUninstall
against the current nodeState: for each pkg in toExplicitUninstall include it
only if the nodeState.IsUninstallCycleInProgress(pkg.GetUniqueName()) is false
and not (pkg.IsUninstalling() && nodeState.IsUninstalled(pkg.GetUniqueName())),
then prepend that node-specific list (rather than the global
toExplicitUninstall) to toRun; update the code around node.State(),
toExplicitUninstall and the final toRun = append(...) to use this filtered list.
🧹 Nitpick comments (8)
k8s-tests/chainsaw/skyhook/uninstall-on-delete/skyhook.yaml (1)

34-36: Make the disabled uninstall intent explicit for disabled-pkg.

Relying on defaults here makes the test more fragile if CRD defaults change later.

Proposed fixture hardening
     disabled-pkg:
       version: "2.1.4"
       image: ghcr.io/nvidia/skyhook/agentless
+      uninstall:
+        enabled: false
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s-tests/chainsaw/skyhook/uninstall-on-delete/skyhook.yaml` around lines 34
- 36, The test fixture relies on a default uninstall intent for the package
entry named "disabled-pkg"; make this explicit by adding an uninstall-intent
field set to "disabled" under the disabled-pkg mapping in skyhook.yaml (e.g.,
add uninstall-intent: "disabled" alongside version and image for the
disabled-pkg entry) so the uninstall behavior no longer depends on CRD defaults.
k8s-tests/chainsaw/skyhook/explicit-uninstall/update-remove-package.yaml (1)

27-27: Use an explicit empty object for packages removal intent.

Line 27 currently sets packages: with no value (null). Prefer packages: {} so “remove all packages” is unambiguous across validation and apply behavior.

Proposed change
-  packages:
+  packages: {}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s-tests/chainsaw/skyhook/explicit-uninstall/update-remove-package.yaml` at
line 27, The top-level YAML key packages is currently set with no value (null);
change it to an explicit empty object by replacing "packages:" with "packages:
{}" so the intended "remove all packages" intent is unambiguous to validators
and apply logic; update the same key in update-remove-package.yaml (the packages
field) and ensure no trailing comments or whitespace cause it to be interpreted
as null.
k8s-tests/chainsaw/skyhook/uninstall-cancel/chainsaw-test.yaml (1)

68-71: Replace fixed sleep with condition-based wait.

Line 68’s fixed 15s delay is timing-sensitive and can make this scenario flaky under cluster load variance.

⏱️ Suggested stabilization
-    - sleep:
-        duration: 15s
+    - script:
+        ## wait until uninstall stage is observed, then proceed to cancel
+        content: |
+          for i in $(seq 1 30); do
+            if kubectl get node -l skyhook.nvidia.com/test-node=skyhooke2e -o jsonpath='{.items[0].metadata.annotations.skyhook\.nvidia\.com/nodeState_uninstall-cancel}' | grep -q '"stage":"uninstall"'; then
+              exit 0
+            fi
+            sleep 1
+          done
+          echo "ERROR: uninstall stage not observed before cancel"
+          exit 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s-tests/chainsaw/skyhook/uninstall-cancel/chainsaw-test.yaml` around lines
68 - 71, The fixed 15s sleep step makes the uninstall-cancel test flaky; replace
the "- sleep: duration: 15s" step with a condition-based wait that polls the
cluster (e.g., wait for a Pod/Deployment/Job status or specific resource
annotation/label) before proceeding to the "- update: file:
update-cancel-uninstall.yaml" step; update the test to use a deterministic
readiness check (pod/deployment ready, job completed, or a custom kubectl/wait
condition) with a sensible timeout and error reporting so the update only runs
when the cluster is actually in the expected state.
k8s-tests/chainsaw/helm/helm-webhook-test/chainsaw-test.yaml (1)

216-219: Avoid masking cleanup failures in teardown.

Line 218 uses || true, which can hide stuck-finalizer regressions and leave residue for later suites.

🧹 Suggested cleanup tweak
     - script:
         content: |
-          kubectl delete skyhook uninstall-webhook-reject --timeout=120s || true
+          kubectl delete skyhook uninstall-webhook-reject --ignore-not-found=true --timeout=120s
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s-tests/chainsaw/helm/helm-webhook-test/chainsaw-test.yaml` around lines
216 - 219, The teardown currently masks failures by appending "|| true" to the
kubectl delete command in the script block that deletes the
"uninstall-webhook-reject" Skyhook resource; remove the "|| true" and instead
make the deletion explicit so failures surface (either let the step fail on
non-zero exit or implement a small retry loop that attempts "kubectl delete
skyhook uninstall-webhook-reject --timeout=120s" a few times with short sleeps
and then fails/report the final error), updating the script that contains the
kubectl delete invocation so stuck-finalizer or other delete errors are visible
to the suite logs.
k8s-tests/chainsaw/skyhook/uninstall-fix-config/chainsaw-test.yaml (1)

102-110: Assert that UninstallFailed is cleared after recovery.

This step says the condition should be gone, but the assertion only checks status: complete. A stale skyhook.nvidia.com/UninstallFailed condition would still pass this test, so the recovery path is under-verified right now.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s-tests/chainsaw/skyhook/uninstall-fix-config/chainsaw-test.yaml` around
lines 102 - 110, The test currently only asserts status: complete for the
Skyhook resource named uninstall-fix-config; add an explicit assertion that the
skyhook.nvidia.com/UninstallFailed condition is cleared by verifying that the
resource's status.conditions does not contain a condition with type:
UninstallFailed and status: "True" (or that a condition with type
UninstallFailed is absent or has status: "False"); update the assert block for
the Skyhook (apiVersion: skyhook.nvidia.com/v1alpha1, kind: Skyhook,
metadata.name: uninstall-fix-config) to include this additional conditions check
so the recovery path is fully verified.
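One way to express the absence check in Chainsaw is an `error` entry, which fails the step if the given resource state is found. This is a sketch only; the exact condition type string is assumed from the comment above:

```yaml
- error:
    resource:
      apiVersion: skyhook.nvidia.com/v1alpha1
      kind: Skyhook
      metadata:
        name: uninstall-fix-config
      status:
        conditions:
          - type: skyhook.nvidia.com/UninstallFailed
            status: "True"
```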
k8s-tests/chainsaw/skyhook/downgrade-after-uninstall/chainsaw-test.yaml (1)

82-91: Prefer stable-field checks over full JSON equality.

This assertion hard-codes the entire node-state payload shape. If PackageStatus includes optional fields like containerSHA, or gains new ones later, the downgrade flow will start failing for the wrong reason. Matching just name/version/image/stage/state is safer here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s-tests/chainsaw/skyhook/downgrade-after-uninstall/chainsaw-test.yaml`
around lines 82 - 91, The assertion currently compares the entire JSON payload
under the annotation "skyhook.nvidia.com/nodeState_downgrade-after-uninstall",
which will break if PackageStatus adds optional fields; instead modify the test
to assert only the important fields (name, version, image, stage, state) from
parse_json("skyhook.nvidia.com/nodeState_downgrade-after-uninstall") for key
"mypkg|1.2.3"—e.g., access that object's properties and compare name == "mypkg",
version == "1.2.3", image == "ghcr.io/nvidia/skyhook/agentless", stage ==
"config", and state == "complete" rather than requiring full JSON equality.
k8s-tests/chainsaw/skyhook/uninstall-mixed-packages/chainsaw-test.yaml (1)

87-96: Assert only the stable node-state fields here.

This exact object match will get brittle as soon as the annotation includes optional PackageStatus fields like containerSHA, or if the status schema grows. Using the same field-by-field predicates you use in the last step would make this test much less flaky while still covering the intended behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s-tests/chainsaw/skyhook/uninstall-mixed-packages/chainsaw-test.yaml`
around lines 87 - 96, The test currently does an exact object match against the
annotation "skyhook.nvidia.com/nodeState_uninstall-mixed-packages" (the
"pkg-keep|2.1.4" entry), which is brittle; change the assertion to only validate
stable fields (e.g., name, version, stage, state) using field-by-field
predicates rather than matching the whole PackageStatus object so optional
fields like containerSHA or future additions don't break the test—locate the
assertion that references
"skyhook.nvidia.com/nodeState_uninstall-mixed-packages" and replace the strict
equality check with individual predicates/assertions for .name == "pkg-keep",
.version == "2.1.4", .stage == "config", and .state == "complete".
operator/internal/controller/skyhook_controller_test.go (1)

1856-2938: Keep the new controller tests in Ginkgo specs.

These additions introduce a large new block of testing.T suites under operator/**/*_test.go, which keeps pushing this file farther from the repo’s test convention and makes shared setup/filtering less consistent. Please fold the new cases into Describe/Context/It blocks instead of adding more standalone Test... functions. As per coding guidelines, operator/**/*_test.go: Use Ginkgo/Gomega framework for Go unit tests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@operator/internal/controller/skyhook_controller_test.go` around lines 1856 -
2938, The new tests were added as standard testing.T functions (e.g.
TestHandleUninstallRequests, TestHandleVersionChange_WI3,
TestHandleCompletePod_WI4, TestHandleCancelledUninstalls,
TestHandleUninstallRequests_FinalizerPath,
TestHandleUninstallRequests_InstallCompleteGuard, TestUpdateBlockedCondition,
TestHasUninstallWork, TestHandleCompletePod_VersionComparison,
TestHandleVersionChange_DowngradeIsNoOp) but the project uses Ginkgo/Gomega;
convert each Test* function into Ginkgo specs by folding their cases into
Describe/Context/It blocks (preserve the individual t.Run scenarios as It or
Context), replace testing.T-based assertions with Gomega (NewWithT -> Gomega's
Expect within It), and update imports accordingly; ensure helpers that reference
symbols like HandleUninstallRequests, HandleVersionChange, HandleCompletePod,
HandleCancelledUninstalls, updateBlockedCondition, hasUninstallWork, and
SkyhookReconciler tests are moved into the existing Ginkgo suite file so shared
setup/filters are reused and remove the standalone Test* functions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/uninstall.md`:
- Around line 125-128: The migration note is backwards: update the paragraph
describing removal behavior for spec.packages so it matches the PR behavior —
when uninstall is disabled (enabled: false) removing a package from spec should
preserve the node annotation/old host state (do not call RemoveState or clear
node annotations), and when uninstall is enabled (enabled: true) the webhook
should block removal until the uninstall completes; revise the text to reflect
that enabled: false preserves node-state entries and enabled: true blocks
removal via the webhook until uninstall finishes.
- Around line 103-106: Update the rejection text for the table row that
currently reads "`apply: true` with `enabled: false` — **Rejected** — apply
requires enabled`" to a clearer phrase that spells out the rule, e.g. "`Rejected
— apply requires uninstall.enabled: true`" or "`Rejected — apply requires
enabled to be true`", replacing the ambiguous "apply requires enabled" wording;
ensure the strings `apply: true`, `enabled: false`, and the new clearer
rejection message are used in the table entry so readers see the exact
requirement.

In `@k8s-tests/chainsaw/skyhook/explicit-uninstall/chainsaw-test.yaml`:
- Around line 81-108: The test currently only asserts the Pod and initContainer
names/args (mypkg-init, mypkg-uninstall, mypkg-uninstallcheck) but does not
verify that the ConfigMap and environment variables are present; update the
assertion block to explicitly check the Pod spec contains the expected ConfigMap
volume (volumes[] referencing the ConfigMap name used in annotations) and that
each relevant initContainer (mypkg-uninstall, mypkg-uninstallcheck) includes the
expected env entries (e.g., SKYHOOK_PACKAGE, SKYHOOK_STAGE or other keys set
from the ConfigMap) or volumeMounts that map that ConfigMap into the container
filesystem; ensure you reference the Pod resource metadata annotations (the
parsed "skyhook.nvidia.com/package") when deriving expected config names so the
assertions fail if config/env are missing.

In `@k8s-tests/chainsaw/skyhook/uninstall-on-delete/chainsaw-test.yaml`:
- Around line 64-83: The test currently only asserts that a Pod for the enabled
package exists (resource: kind: Pod with label skyhook.nvidia.com/package:
enabled-pkg-2.1.4 and initContainer names like enabled-pkg-uninstall); add a
complementary negative assertion that no Pod is created for the disabled package
by adding an assert that a Pod with label key skyhook.nvidia.com/package set to
the disabled package value (e.g., disabled-pkg-2.1.4 or the repo's disabled
package name) does not exist and/or that it does not contain an initContainer
named uninstall/uninstall-check; implement this as an additional assert block
mirroring the existing Pod resource check but expecting absence (no resource) so
the test fails if an uninstall pod is created for disabled packages.
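The negative check can be sketched as a Chainsaw `error` entry that fails if any pod exists for the disabled package; the label key and value are assumed from the comment above:

```yaml
- error:
    resource:
      apiVersion: v1
      kind: Pod
      metadata:
        labels:
          skyhook.nvidia.com/package: disabled-pkg-2.1.4
```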

In `@operator/api/v1alpha1/skyhook_webhook.go`:
- Around line 146-173: The downgrade gating block is rejecting downgrades even
when uninstall is disabled; update the logic in the downgrade handling (around
oldPkg.IsUninstalling() and isPackageFullyUninstalled()) to first detect whether
uninstall is enabled for the package (e.g. check the package's
Uninstall.Enabled/uninstall.enabled flag or equivalent on oldPkg) and if
uninstall is not enabled, skip the "must set uninstall.apply and wait" checks
and allow the downgrade path to proceed; only perform the IsUninstalling() and
isPackageFullyUninstalled() checks when uninstall is enabled.

In `@operator/internal/controller/skyhook_controller.go`:
- Around line 1471-1486: The finalizer loop in skyhook_controller.go currently
swallows errors from node.State(), which can let the CR be finalized even when
uninstall state couldn't be verified; change the handling in the loop that
iterates skyhook.GetSkyhook().Spec.Packages and nodes so that if node.State()
returns an error you return (false, err) (or wrap and return the error) instead
of continuing, ensuring the finalizer fails-closed and prevents removal until
node state can be verified.
- Around line 385-389: The loop in RunSkyhookPackages() incorrectly skips
paused/disabled Skyhooks even when they are being deleted, preventing
HandleFinalizer() from triggering uninstall; update the early continue checks so
they only skip processing for paused/disabled Skyhooks when the object is not
being deleted (i.e., only continue if skyhook.IsDisabled() || skyhook.IsPaused()
AND skyhook.DeletionTimestamp == nil), and likewise ensure the IsComplete()
check also considers deletion (only continue when IsComplete() &&
!hasUninstallWork(skyhook) && skyhook.DeletionTimestamp == nil); this preserves
normal skipping behavior but allows finalizer-driven uninstall to run for
deletions.

In `@README.md`:
- Line 54: The README still says "uninstall runs on downgrade/removal" in two
places which conflicts with the new explicit uninstall model; update those
occurrences to reflect that uninstall is explicit and controlled via the
uninstall.enabled and uninstall.apply fields (and webhook/finalizer behavior)
and remove any wording that implies automatic uninstall on downgrade/removal so
the README aligns with the new "Explicit Uninstall" bullet.
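The webhook rule referenced throughout these comments, that `uninstall.apply: true` is rejected unless `uninstall.enabled: true`, can be sketched in isolation. The `Uninstall` struct and `validateUninstall` function here are hypothetical stand-ins, not the operator's actual API types:

```go
package main

import "fmt"

// Uninstall mirrors the shape of spec.packages[*].uninstall described in this PR.
type Uninstall struct {
	Enabled bool
	Apply   bool
}

// validateUninstall enforces the rule: apply requires enabled.
// A nil uninstall block is valid (uninstall simply not configured).
func validateUninstall(u *Uninstall) error {
	if u != nil && u.Apply && !u.Enabled {
		return fmt.Errorf("uninstall.apply requires uninstall.enabled: true")
	}
	return nil
}

func main() {
	// prints: uninstall.apply requires uninstall.enabled: true
	fmt.Println(validateUninstall(&Uninstall{Enabled: false, Apply: true}))
}
```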


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 2c48bd9c-1933-43b8-a725-463133f9b7ff

📥 Commits

Reviewing files that changed from the base of the PR and between 8c5214b and d7241e3.

📒 Files selected for processing (55)
  • .gitignore
  • CLAUDE.md
  • README.md
  • docs/uninstall.md
  • k8s-tests/chainsaw/helm/helm-webhook-test/chainsaw-test.yaml
  • k8s-tests/chainsaw/helm/helm-webhook-test/skyhook-apply-without-enabled.yaml
  • k8s-tests/chainsaw/helm/helm-webhook-test/skyhook-uninstall-valid.yaml
  • k8s-tests/chainsaw/helm/helm-webhook-test/update-downgrade-enabled-pkg.yaml
  • k8s-tests/chainsaw/helm/helm-webhook-test/update-remove-enabled-pkg.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-after-uninstall/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-after-uninstall/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-after-uninstall/update-downgrade.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-after-uninstall/update-trigger-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-enabled-false-preserves-state/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-enabled-false-preserves-state/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-enabled-false-preserves-state/update-downgrade.yaml
  • k8s-tests/chainsaw/skyhook/explicit-uninstall/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/explicit-uninstall/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/explicit-uninstall/update-remove-package.yaml
  • k8s-tests/chainsaw/skyhook/explicit-uninstall/update-trigger-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel/update-cancel-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel/update-trigger-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-fix-config/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-fix-config/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-fix-config/update-fix-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-fix-config/update-trigger-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-mixed-packages/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-mixed-packages/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-mixed-packages/update-remove-keep.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-mixed-packages/update-uninstall-one.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-on-delete/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-on-delete/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/README.md
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/assert-cm-update.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/assert-install.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/skyhook.yaml
  • operator/CHANGELOG.md
  • operator/api/v1alpha1/skyhook_types.go
  • operator/api/v1alpha1/skyhook_types_test.go
  • operator/api/v1alpha1/skyhook_webhook.go
  • operator/api/v1alpha1/zz_generated.deepcopy.go
  • operator/config/crd/bases/skyhook.nvidia.com_skyhooks.yaml
  • operator/internal/controller/metrics.go
  • operator/internal/controller/pod_controller.go
  • operator/internal/controller/skyhook_controller.go
  • operator/internal/controller/skyhook_controller_test.go
  • operator/internal/controller/suite_test.go
  • operator/internal/wrapper/mock/SkyhookNode.go
  • operator/internal/wrapper/node.go
  • operator/internal/wrapper/node_test.go
  • operator/internal/wrapper/skyhook.go
  • operator/internal/wrapper/skyhook_test.go
💤 Files with no reviewable changes (5)
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/README.md
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/assert-cm-update.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/assert-install.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/skyhook.yaml

@ayuskauskas ayuskauskas force-pushed the feat/uninstall-enhancement-collapsed branch 2 times, most recently from 1fe3ece to 2c9d1bb Compare April 23, 2026 17:32
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
operator/api/v1alpha1/skyhook_types.go (1)

563-566: ⚠️ Potential issue | 🟠 Major

Don't auto-advance a completed uninstall into StageApply.

NextStage still maps StageUninstall to StageApply for the no-interrupt path. That makes a successful explicit uninstall look like the start of a reinstall anywhere this helper is used, but explicit uninstall should terminate in RemoveState unless some separate upgrade/apply flow has been selected.

🔧 Suggested direction
  nextStage := map[Stage]Stage{
-		StageUninstall: StageApply,
 		StageApply:     StageConfig,
 		StageUpgrade:   StageConfig,
 	}

If downgrade/replace flows still need uninstall -> apply, that decision should be made by the caller that knows there is a replacement package, not unconditionally in NodeState.NextStage.

Based on learnings: For packages WITHOUT interrupts, follow the sequence: Uninstall (if downgrading) → Apply/Upgrade (if upgrading) → Config.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@operator/api/v1alpha1/skyhook_types.go` around lines 563 - 566, NextStage
currently maps StageUninstall -> StageApply causing a completed uninstall to be
treated as a reinstall; update NodeState.NextStage so StageUninstall does not
auto-advance to StageApply (either remove that mapping or map StageUninstall to
RemoveState) and ensure any downgrade/replace flows that require
uninstall->apply are handled by the caller that has context about replacement
packages rather than unconditionally in NextStage.
operator/internal/controller/skyhook_controller.go (2)

1568-1583: ⚠️ Potential issue | 🟠 Major

Propagate interrupt state persistence failures.

Line 1582 drops the skyhookNode.Upsert(...) error after the pod is created. If that write fails, the controller has a live interrupt pod with no matching node state, and later reconciles can invalidate or duplicate the work.

Suggested fix
-	_ = skyhookNode.Upsert(_package.PackageRef, _package.Image, v1alpha1.StateInProgress, stage, 0, _package.ContainerSHA)
+	if err := skyhookNode.Upsert(_package.PackageRef, _package.Image, v1alpha1.StateInProgress, stage, 0, _package.ContainerSHA); err != nil {
+		return fmt.Errorf(
+			"error updating interrupt state for package %s on node %s: %w",
+			_package.GetUniqueName(),
+			skyhookNode.GetNode().Name,
+			err,
+		)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@operator/internal/controller/skyhook_controller.go` around lines 1568 - 1583,
The call to skyhookNode.Upsert(...) currently ignores errors, risking a created
interrupt pod without persisted node state; modify the logic after r.Create(ctx,
pod) so you capture and check the error returned by
skyhookNode.Upsert(packageRef, image, StateInProgress, stage, 0, containerSHA)
and return a wrapped error (e.g. fmt.Errorf("error persisting interrupt state:
%w", err)) if it fails; additionally consider cleaning up the created pod by
calling r.Delete(ctx, pod) on Upsert failure to avoid orphaned pods (refer to
createInterruptPodForPackage, r.Create, skyhookNode.Upsert, and r.Delete to
locate the relevant code).

600-666: ⚠️ Potential issue | 🔴 Critical

Scope uninstall work to the current node and fail closed on unreadable node state.

toUninstall is built once per skyhook, but Line 666 prepends it to every selected node’s toRun. If node A is uninstalling pkg-x and node B is only progressing some other package, node B can still receive pkg-x here; because Line 652 ignores node.State() errors, ApplyPackage() can fall back to StageApply and recreate the package instead of skipping it. Read node.State() once per node, return the error, and only prepend uninstall packages that are actually in uninstall cycle on that node.

Suggested fix
-		nodeState, _ := node.State()
+		nodeState, err := node.State()
+		if err != nil {
+			return nil, fmt.Errorf("error reading node state for node %s: %w", node.GetNode().Name, err)
+		}
+
+		nodeUninstall := make([]*v1alpha1.Package, 0, len(toUninstall))
+		for _, pkg := range toUninstall {
+			if nodeState.IsUninstallCycleInProgress(pkg.GetUniqueName()) {
+				nodeUninstall = append(nodeUninstall, pkg)
+			}
+		}
+
 		filtered := make([]*v1alpha1.Package, 0, len(toRun))
 		for _, pkg := range toRun {
 			if nodeState.IsUninstallCycleInProgress(pkg.GetUniqueName()) {
 				continue // uninstall running on this node — skip apply
 			}
@@
-		toRun = append(toUninstall, toRun...)
+		toRun = append(nodeUninstall, toRun...)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@operator/internal/controller/skyhook_controller.go` around lines 600 - 666,
toUninstall is being prepended to every node's toRun and node.State() errors are
ignored; change the loop over selectedNode so you call node.State() once per
node, return the error if node.State() fails (fail closed), and only prepend
those entries from toUninstall that actually have
nodeState.IsUninstallCycleInProgress(pkg.GetUniqueName()) true for that node
(instead of appending the whole toUninstall slice to every node's toRun); this
ensures IsUninstallCycleInProgress, IsUninstalled and ApplyPackage/StageApply
behavior are correctly scoped per node.
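The per-node scoping asked for above can be illustrated with a small self-contained sketch. The `nodeState` type and `perNodeRunList` helper are invented for illustration; they mimic `IsUninstallCycleInProgress` on package names, showing that only uninstalls active on *this* node are prepended rather than the whole shared `toUninstall` slice:

```go
package main

import "fmt"

// nodeState maps package name -> whether an uninstall cycle is in
// progress for that package on this particular node.
type nodeState map[string]bool

func (s nodeState) IsUninstallCycleInProgress(pkg string) bool { return s[pkg] }

// perNodeRunList prepends only the uninstall packages active on this
// node, and skips the apply of any package mid-uninstall here.
func perNodeRunList(state nodeState, toUninstall, toRun []string) []string {
	nodeUninstall := make([]string, 0, len(toUninstall))
	for _, pkg := range toUninstall {
		if state.IsUninstallCycleInProgress(pkg) {
			nodeUninstall = append(nodeUninstall, pkg)
		}
	}
	filtered := make([]string, 0, len(toRun))
	for _, pkg := range toRun {
		if state.IsUninstallCycleInProgress(pkg) {
			continue // uninstall running on this node; skip apply
		}
		filtered = append(filtered, pkg)
	}
	return append(nodeUninstall, filtered...)
}

func main() {
	nodeB := nodeState{} // pkg-x is NOT uninstalling on node B
	fmt.Println(perNodeRunList(nodeB, []string{"pkg-x"}, []string{"pkg-y"})) // [pkg-y]
}
```

With a shared slice, node B would have received pkg-x's uninstall work; scoped per node, it only sees its own run list.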
♻️ Duplicate comments (2)
k8s-tests/chainsaw/skyhook/uninstall-on-delete/chainsaw-test.yaml (1)

64-83: ⚠️ Potential issue | 🟠 Major

Add a negative assertion for uninstall-disabled package pods.

Line 64 currently validates only the enabled path. Add an explicit check that disabled packages do not get uninstall init containers, otherwise this scenario can miss regressions.

Suggested test hardening
     - assert:
         ## uninstall pod for enabled package
         resource:
           kind: Pod
           apiVersion: v1
           metadata:
             namespace: skyhook
             labels:
               skyhook.nvidia.com/name: uninstall-on-delete
               skyhook.nvidia.com/package: enabled-pkg-2.1.4
           spec:
             initContainers:
               - name: enabled-pkg-init
               - name: enabled-pkg-uninstall
                 args:
                   ([0]): uninstall
               - name: enabled-pkg-uninstallcheck
                 args:
                   ([0]): uninstall-check
+    - script:
+        ## ensure uninstall-disabled package did not get uninstall init container
+        content: |
+          if kubectl get pods -n skyhook -l skyhook.nvidia.com/name=uninstall-on-delete -o jsonpath='{range .items[*]}{range .spec.initContainers[*]}{.name}{"\n"}{end}{end}' | grep -q 'disabled-pkg-uninstall'; then
+            echo "ERROR: uninstall-disabled package got uninstall init container"
+            kubectl get pods -n skyhook -l skyhook.nvidia.com/name=uninstall-on-delete -o yaml
+            exit 1
+          fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s-tests/chainsaw/skyhook/uninstall-on-delete/chainsaw-test.yaml` around
lines 64 - 83, Add a negative assertion that pods for the disabled package do
not include uninstall-related initContainers: locate the Pod resource validation
block (kind: Pod, apiVersion: v1) that uses labels skyhook.nvidia.com/name:
uninstall-on-delete and skyhook.nvidia.com/package: enabled-pkg-2.1.4 and add a
new assert for the disabled package label (e.g., skyhook.nvidia.com/package:
disabled-pkg-2.1.4 or the repo’s disabled package label) which checks
spec.initContainers does not contain entries with names enabled-pkg-uninstall or
enabled-pkg-uninstallcheck (or the uninstall container names used elsewhere);
ensure the new assert mirrors the enabled check structure but asserts absence of
those initContainers before continuing to the existing script step.
docs/uninstall.md (1)

88-90: ⚠️ Potential issue | 🟡 Minor

This enabled: false deletion bullet contradicts itself.

It says state is cleaned up immediately, then says the package state remains on nodes. Those are different behaviors, and the rest of the doc describes preserving the node-state entry when uninstall is disabled.

📝 Suggested wording
-- **`enabled: false` packages (or nil)**: No uninstall pods — state is cleaned up immediately. The package state remains on nodes so administrators can see what was previously applied.
+- **`enabled: false` packages (or nil)**: No uninstall pods run. Skyhook-owned cleanup still happens, but the old package entry is preserved in node state so administrators can see what was previously applied.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/uninstall.md` around lines 88 - 90, The deletion bullet for "enabled:
false packages (or nil)" is contradictory: it first says "No uninstall pods —
state is cleaned up immediately" then says "The package state remains on nodes";
update this sentence to consistently state that when enabled is false or nil no
uninstall pods are created and node state entries are preserved (i.e., package
object is removed but per-node state remains) — locate and edit the "enabled:
false packages (or nil)" bullet text to reflect that behavior unambiguously.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.gitignore:
- Line 12: Fix the malformed .gitignore entry `.claude.jsonworktrees/` by
replacing it with the intended `.claude/worktrees/` pattern so the `.claude`
worktrees directory is correctly ignored; update the line that currently reads
`.claude.jsonworktrees/` to `.claude/worktrees/` (note nearby pattern
`.claude/settings.local.json` as confirmation of the `.claude/` directory
structure).

In `@k8s-tests/chainsaw/skyhook/explicit-uninstall/update-remove-package.yaml`:
- Line 27: The YAML sets the key "packages" with no value, which serializes to
null and violates the CRD, which expects an object; change the "packages"
field in update-remove-package.yaml to an explicit empty object
(packages: {}) so spec.packages is an object rather than null and
Kubernetes validation accepts the patch.

In `@k8s-tests/chainsaw/skyhook/uninstall-fix-config/chainsaw-test.yaml`:
- Around line 102-110: The current assert for the Skyhook resource (kind:
Skyhook, metadata.name: uninstall-fix-config) only checks status: complete but
does not verify the skyhook.nvidia.com/UninstallFailed condition was removed;
update the assert block that verifies status: complete to also include a
negative assertion for the UninstallFailed condition (either assert the
condition is absent or that its status is False/not present) so the test proves
the UninstallFailed condition is cleared.

In `@k8s-tests/chainsaw/skyhook/uninstall-fix-config/skyhook.yaml`:
- Around line 27-38: The fixture currently only varies env under packages ->
mypkg -> uninstall, so add a small ConfigMap-backed payload and make the
uninstall step consume it to exercise the missing ConfigMap path: create a
ConfigMap with a script (e.g., uninstall.sh) and reference it from
packages.mypkg.uninstall (or the uninstall scripts section) so the uninstall pod
mounts or reads that ConfigMap (and the script is executed by uninstall.sh);
ensure the ConfigMap name/key matches the pod mount or script path used by
uninstall.sh so tests will fail if ConfigMap-backed content is dropped.

In `@k8s-tests/chainsaw/skyhook/uninstall-on-delete/skyhook.yaml`:
- Around line 34-36: The fixture for package "disabled-pkg" relies on schema
defaults instead of being explicit about uninstall intent; update the
"disabled-pkg" entry to include an explicit uninstall section by adding
uninstall.enabled: false (or true as intended) and uninstall.apply: false (or
true) to clearly assert disable/uninstall behavior so the test validates
explicit-disable behavior directly and conforms to the YAML guideline of using
uninstall.enabled and uninstall.apply for package uninstall control.

In `@operator/internal/controller/cluster_state_v2.go`:
- Around line 493-503: The code silently ignores errors from node.State(), which
can let HasUninstallWork() incorrectly report no pending uninstall; change the
error handling in the loop that calls node.State() so that a failed state read
is treated as blocking completion instead of continuing: replace the current "if
err != nil { continue }" with logic that returns true (or otherwise signals
pending work) when node.State() fails, referencing the HasUninstallWork check
that calls node.State() and the subsequent
nodeState.IsUninstallCycleInProgress(pkg.GetUniqueName()) /
nodeState.IsUninstalled(pkg.GetUniqueName()) checks; alternatively, if the
function already supports error propagation, propagate the node.State() error
out instead of skipping the node.
- Around line 571-582: The loop currently skips packages when
pkg.IsUninstalling() / beingDeleted is false, which hides nodes that are still
in uninstall/error states; change the loop to consult nodeState first and only
skip a package if there is no uninstall cycle and the CR isn’t requesting
uninstall (i.e. if !nodeState.IsUninstallCycleInProgress(pkg.GetUniqueName()) &&
!pkg.IsUninstalling() && !(beingDeleted && pkg.UninstallEnabled()) then
continue). When nodeState.IsUninstallCycleInProgress(pkg.GetUniqueName()) is
true, always mark inProgress and inspect nodeState[pkg.GetUniqueName()].State
(compare to v1alpha1.StateErroring) to set hasErrors, so
UninstallInProgress/UninstallFailed remain driven by nodeState until the cycle
actually exits.

In `@operator/internal/controller/pod_controller.go`:
- Around line 240-257: The early-return branches that call
skyhookNode.RemoveState(packagePtr.PackageRef) but then return without clearing
metrics leave per-package metrics stale; before returning from both the "package
removed from spec" and the "version mismatch" branches (the blocks that call
skyhookNode.RemoveState and set updated = true), invoke
zeroOutSkyhookPackageMetrics with the same package reference
(packagePtr.PackageRef) and any required args, handle its error (or ignore per
project convention) and then proceed to return updated, nil; ensure the call is
placed immediately after successful RemoveState and before setting/returning
updated so metrics are always cleared on cleanup.

In `@operator/internal/controller/skyhook_controller_test.go`:
- Around line 1856-2159: The tests were added as a standard-library
TestHandleUninstallRequests with t.Run subtests, which mixes testing frameworks;
convert them to Ginkgo style to match the rest of the suite. Replace func
TestHandleUninstallRequests(...) { ... t.Run("...", func(t *testing.T) { ... })
... } with a Ginkgo Describe/Context/It (or a DescribeTable with Entry) block
named "HandleUninstallRequests" and convert each t.Run case into an It (or
Entry) case, using Gomega's Expect directly (or NewWithT inside each It if you
prefer) and keeping the same setup/assertions and references to
HandleUninstallRequests, skyhookNodes, wrapperMock.NewMockSkyhookNode,
node.EXPECT(), and wrapper.NewSkyhookWrapper so the logic and mocks remain
identical. Ensure you remove imports/usages of testing.T and t.Run and update
the test file imports to use Ginkgo/Gomega instead.

In `@operator/internal/controller/skyhook_controller.go`:
- Around line 1203-1215: The loop building uninstallingPkgs currently swallows
errors from node.State() by continuing, which can make the set incomplete;
modify the code in the block iterating skyhook.GetNodes() so that when
node.State() returns an error you return that error (include node context such
as node.Name or node.ID) from Reconcile instead of continue; keep using
ns.IsUninstallCycleInProgress(_package.GetUniqueName()) and mark
uninstallingPkgs[_package.Name] = true when appropriate, but do not ignore
node.State() failures—return a wrapped error referencing the node and the
failure so controller-runtime will requeue the reconcile.

In `@operator/internal/wrapper/node.go`:
- Around line 571-603: CleanupSCRMetadata currently calls State() and treats any
decode error as empty (preserveNodeState := len(state) > 0) and never clears the
in-memory cache, leaving node.nodeState stale; change the logic to call state,
err := node.State() and set preserveNodeState only when err == nil && len(state)
> 0, and when you remove the annotations nodeState_<skyhook>/version_<skyhook>
also clear the cached value (node.nodeState) so subsequent reconciles don't read
stale data; reference the CleanupSCRMetadata function, the State() call/site,
node.nodeState field, and the nodeStateKey/versionKey variables when making this
change.
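The CleanupSCRMetadata fix described above (preserve state only on a successful, non-empty read, and drop the in-memory cache when annotations are removed) can be sketched with a toy stand-in. The `node` struct, annotation keys, and `cleanup` method here are illustrative only and do not reproduce the wrapper's real types:

```go
package main

import (
	"errors"
	"fmt"
)

// node is a toy stand-in for the wrapper node: raw annotations plus a
// cached decoded state (standing in for node.nodeState).
type node struct {
	annotations map[string]string
	cachedState map[string]string
}

// State decodes the nodeState annotation; "garbage" simulates a decode error.
func (n *node) State() (map[string]string, error) {
	raw, ok := n.annotations["nodeState"]
	if !ok || raw == "" {
		return nil, nil
	}
	if raw == "garbage" {
		return nil, errors.New("decode error")
	}
	return map[string]string{"pkg": raw}, nil
}

// cleanup preserves state only when the read succeeded AND was non-empty,
// and clears both the annotations and the in-memory cache otherwise, so
// subsequent reconciles never see stale cached data.
func (n *node) cleanup() {
	state, err := n.State()
	preserve := err == nil && len(state) > 0
	if !preserve {
		delete(n.annotations, "nodeState")
		delete(n.annotations, "version")
		n.cachedState = nil // drop the in-memory copy too
	}
}

func main() {
	n := &node{annotations: map[string]string{"nodeState": ""}, cachedState: map[string]string{"stale": "yes"}}
	n.cleanup()
	fmt.Println(len(n.annotations), n.cachedState == nil) // 0 true
}
```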

---

Outside diff comments:
In `@operator/api/v1alpha1/skyhook_types.go`:
- Around line 563-566: NextStage currently maps StageUninstall -> StageApply
causing a completed uninstall to be treated as a reinstall; update
NodeState.NextStage so StageUninstall does not auto-advance to StageApply
(either remove that mapping or map StageUninstall to RemoveState) and ensure any
downgrade/replace flows that require uninstall->apply are handled by the caller
that has context about replacement packages rather than unconditionally in
NextStage.

In `@operator/internal/controller/skyhook_controller.go`:
- Around line 1568-1583: The call to skyhookNode.Upsert(...) currently ignores
errors, risking a created interrupt pod without persisted node state; modify the
logic after r.Create(ctx, pod) so you capture and check the error returned by
skyhookNode.Upsert(packageRef, image, StateInProgress, stage, 0, containerSHA)
and return a wrapped error (e.g. fmt.Errorf("error persisting interrupt state:
%w", err)) if it fails; additionally consider cleaning up the created pod by
calling r.Delete(ctx, pod) on Upsert failure to avoid orphaned pods (refer to
createInterruptPodForPackage, r.Create, skyhookNode.Upsert, and r.Delete to
locate the relevant code).
- Around line 600-666: toUninstall is being prepended to every node's toRun and
node.State() errors are ignored; change the loop over selectedNode so you call
node.State() once per node, return the error if node.State() fails (fail
closed), and only prepend those entries from toUninstall that actually have
nodeState.IsUninstallCycleInProgress(pkg.GetUniqueName()) true for that node
(instead of appending the whole toUninstall slice to every node's toRun); this
ensures IsUninstallCycleInProgress, IsUninstalled and ApplyPackage/StageApply
behavior are correctly scoped per node.

---

Duplicate comments:
In `@docs/uninstall.md`:
- Around line 88-90: The deletion bullet for "enabled: false packages (or nil)"
is contradictory: it first says "No uninstall pods — state is cleaned up
immediately" then says "The package state remains on nodes"; update this
sentence to consistently state that when enabled is false or nil no uninstall
pods are created and node state entries are preserved (i.e., package object is
removed but per-node state remains) — locate and edit the "enabled: false
packages (or nil)" bullet text to reflect that behavior unambiguously.

In `@k8s-tests/chainsaw/skyhook/uninstall-on-delete/chainsaw-test.yaml`:
- Around line 64-83: Add a negative assertion that pods for the disabled package
do not include uninstall-related initContainers: locate the Pod resource
validation block (kind: Pod, apiVersion: v1) that uses labels
skyhook.nvidia.com/name: uninstall-on-delete and skyhook.nvidia.com/package:
enabled-pkg-2.1.4 and add a new assert for the disabled package label (e.g.,
skyhook.nvidia.com/package: disabled-pkg-2.1.4 or the repo’s disabled package
label) which checks spec.initContainers does not contain entries with names
enabled-pkg-uninstall or enabled-pkg-uninstallcheck (or the uninstall container
names used elsewhere); ensure the new assert mirrors the enabled check structure
but asserts absence of those initContainers before continuing to the existing
script step.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: bfdc687d-f5ea-4e4f-b58f-456ad5873c17

📥 Commits

Reviewing files that changed between d7241e3 and 2c9d1bb, relative to the base of the PR.

📒 Files selected for processing (70)
  • .gitignore
  • README.md
  • docs/uninstall.md
  • k8s-tests/chainsaw/helm/helm-webhook-test/chainsaw-test.yaml
  • k8s-tests/chainsaw/helm/helm-webhook-test/skyhook-apply-without-enabled.yaml
  • k8s-tests/chainsaw/helm/helm-webhook-test/skyhook-uninstall-valid.yaml
  • k8s-tests/chainsaw/helm/helm-webhook-test/update-downgrade-enabled-pkg.yaml
  • k8s-tests/chainsaw/helm/helm-webhook-test/update-remove-enabled-pkg.yaml
  • k8s-tests/chainsaw/skyhook/delete-blocked-when-paused/README.md
  • k8s-tests/chainsaw/skyhook/delete-blocked-when-paused/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/delete-blocked-when-paused/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-after-uninstall/README.md
  • k8s-tests/chainsaw/skyhook/downgrade-after-uninstall/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-after-uninstall/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-after-uninstall/update-downgrade.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-after-uninstall/update-trigger-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-enabled-false-preserves-state/README.md
  • k8s-tests/chainsaw/skyhook/downgrade-enabled-false-preserves-state/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-enabled-false-preserves-state/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-enabled-false-preserves-state/update-downgrade.yaml
  • k8s-tests/chainsaw/skyhook/explicit-uninstall/README.md
  • k8s-tests/chainsaw/skyhook/explicit-uninstall/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/explicit-uninstall/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/explicit-uninstall/update-remove-package.yaml
  • k8s-tests/chainsaw/skyhook/explicit-uninstall/update-trigger-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/package-upgrade/README.md
  • k8s-tests/chainsaw/skyhook/package-upgrade/assert-cm-update.yaml
  • k8s-tests/chainsaw/skyhook/package-upgrade/assert-install.yaml
  • k8s-tests/chainsaw/skyhook/package-upgrade/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/package-upgrade/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/package-upgrade/update.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel/README.md
  • k8s-tests/chainsaw/skyhook/uninstall-cancel/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel/update-cancel-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel/update-trigger-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-fix-config/README.md
  • k8s-tests/chainsaw/skyhook/uninstall-fix-config/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-fix-config/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-fix-config/update-fix-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-fix-config/update-trigger-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-mixed-packages/README.md
  • k8s-tests/chainsaw/skyhook/uninstall-mixed-packages/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-mixed-packages/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-mixed-packages/update-remove-keep.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-mixed-packages/update-uninstall-one.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-on-delete/README.md
  • k8s-tests/chainsaw/skyhook/uninstall-on-delete/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-on-delete/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/README.md
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/assert-install.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/chainsaw-test.yaml
  • operator/CHANGELOG.md
  • operator/api/v1alpha1/skyhook_types.go
  • operator/api/v1alpha1/skyhook_types_test.go
  • operator/api/v1alpha1/skyhook_webhook.go
  • operator/api/v1alpha1/zz_generated.deepcopy.go
  • operator/config/crd/bases/skyhook.nvidia.com_skyhooks.yaml
  • operator/internal/controller/cluster_state_v2.go
  • operator/internal/controller/metrics.go
  • operator/internal/controller/mock/SkyhookNodes.go
  • operator/internal/controller/pod_controller.go
  • operator/internal/controller/skyhook_controller.go
  • operator/internal/controller/skyhook_controller_test.go
  • operator/internal/controller/suite_test.go
  • operator/internal/wrapper/mock/SkyhookNode.go
  • operator/internal/wrapper/node.go
  • operator/internal/wrapper/node_test.go
  • operator/internal/wrapper/skyhook.go
  • operator/internal/wrapper/skyhook_test.go
💤 Files with no reviewable changes (3)
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/README.md
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/assert-install.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/chainsaw-test.yaml

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
operator/internal/controller/skyhook_controller.go (1)

1677-1681: ⚠️ Potential issue | 🟠 Major

Propagate the interrupt state-write failure.

If skyhookNode.Upsert() fails after the pod is created, the reconcile path reports success but the node annotation never records the running interrupt. The next validation pass can then treat that pod as out-of-state and delete or recreate it.

Suggested fix
-	_ = skyhookNode.Upsert(_package.PackageRef, _package.Image, v1alpha1.StateInProgress, stage, 0, _package.ContainerSHA)
+	if err := skyhookNode.Upsert(
+		_package.PackageRef,
+		_package.Image,
+		v1alpha1.StateInProgress,
+		stage,
+		0,
+		_package.ContainerSHA,
+	); err != nil {
+		return fmt.Errorf("error upserting interrupt state: %w", err)
+	}

As per coding guidelines, "Return errors from Reconcile to trigger controller-runtime's exponential backoff requeue; do not swallow errors to avoid retries."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@operator/internal/controller/skyhook_controller.go` around lines 1677 - 1681,
After creating the interruption pod with r.Create, the call to
skyhookNode.Upsert(_package.PackageRef, _package.Image,
v1alpha1.StateInProgress, stage, 0, _package.ContainerSHA) is currently ignored;
if Upsert fails the controller reports success while the node annotation isn't
written, causing out-of-state pods. Fix by checking the error returned from
skyhookNode.Upsert and return a wrapped error (e.g., fmt.Errorf("error upserting
skyhook node state: %w", err)) so Reconcile surfaces the failure and triggers
controller-runtime retries.
♻️ Duplicate comments (3)
docs/uninstall.md (1)

88-90: ⚠️ Potential issue | 🟡 Minor

Fix the enabled: false deletion description.

This paragraph says the state is "cleaned up immediately" and also says it remains on nodes. Those are opposite behaviors. The rest of this PR documents the preserved-node-state behavior, so this section should say that uninstall pods do not run, but the node-state/version annotations stay behind as markers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/uninstall.md` around lines 88 - 90, The paragraph for `enabled: false`
currently contradicts itself; update the description to state that when
`enabled: false` (or nil) the uninstall pods are not created and the finalizer
does not run them, but the node-state/version annotations (and other package
state metadata) are intentionally left on nodes as markers of the previous
installation rather than being immediately removed. Ensure the text replaces
"state is cleaned up immediately" with a clear statement that state is preserved
on nodes and uninstall pods are skipped.
operator/internal/controller/cluster_state_v2.go (1)

491-513: ⚠️ Potential issue | 🟠 Major

Keep scanning nodes before returning uninstall work.

HasUninstallWork() exits on the first pkg.IsUninstalling() match, so unreadable nodeState_* annotations are never surfaced in that case. That breaks the method's own contract and lets callers treat "uninstall requested" as success even when node state is malformed and cleanup should stay blocked.

💡 Suggested fix
 func (s *skyhookNodes) HasUninstallWork() (bool, error) {
 	beingDeleted := !s.skyhook.DeletionTimestamp.IsZero()
+	hasWork := false
 	for _, pkg := range s.skyhook.Spec.Packages {
 		if pkg.IsUninstalling() {
-			return true, nil
+			hasWork = true
 		}
 	}
 	for _, node := range s.nodes {
 		nodeState, err := node.State()
 		if err != nil {
 			return false, fmt.Errorf("node %s: reading state: %w", node.GetNode().Name, err)
 		}
 		for _, pkg := range s.skyhook.Spec.Packages {
 			if nodeState.IsUninstallCycleInProgress(pkg.GetUniqueName()) {
-				return true, nil
+				hasWork = true
 			}
 			// Finalizer case: CR deleting, package enabled, still present on node
 			if beingDeleted && pkg.UninstallEnabled() && !nodeState.IsUninstalled(pkg.GetUniqueName()) {
-				return true, nil
+				hasWork = true
 			}
 		}
 	}
-	return false, nil
+	return hasWork, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@operator/internal/controller/cluster_state_v2.go` around lines 491 - 513,
HasUninstallWork currently returns immediately on the first pkg.IsUninstalling()
and skips scanning nodes, which hides nodeState read errors; change it to record
an uninstallRequested flag when any pkg.IsUninstalling() is seen (do not return
yet), then continue scanning all nodes and if any node.State() returns an error
immediately return that error, and if any
nodeState.IsUninstallCycleInProgress(pkg.GetUniqueName()) or the finalizer
condition (beingDeleted && pkg.UninstallEnabled() &&
!nodeState.IsUninstalled(pkg.GetUniqueName())) is true return true,nil; after
scanning all nodes, return true,nil if uninstallRequested is set, otherwise
return false,nil. Ensure you reference HasUninstallWork, pkg.IsUninstalling,
node.State, nodeState.IsUninstallCycleInProgress, nodeState.IsUninstalled, and
pkg.UninstallEnabled when implementing this change.
operator/internal/controller/skyhook_controller_test.go (1)

1856-2946: 🛠️ Refactor suggestion | 🟠 Major

Keep the new controller coverage in Ginkgo/Gomega.

These additions reintroduce func Test*/t.Run into a file that otherwise uses Ginkgo. Please fold them into Describe/It or DescribeTable blocks so the package keeps a single test harness and the new mocks continue to use GinkgoT(). As per coding guidelines operator/**/*_test.go: Use Ginkgo/Gomega for tests in Go, not stdlib t.Run; match the file and patterns of the package you're adding to.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@operator/internal/controller/skyhook_controller_test.go` around lines 1856 -
2946, The tests were added as stdlib-style Test.../t.Run functions (e.g.,
TestHandleUninstallRequests, TestHandleVersionChange_WI3,
TestHandleCompletePod_WI4, etc.) into a file that uses Ginkgo; convert each
top-level Test* function into Ginkgo Describe/Context blocks and each t.Run case
into It (or DescribeTable) blocks, replacing NewWithT(t) with
NewWithT(GinkgoT()) and all mock constructors (e.g.,
wrapperMock.NewMockSkyhookNode, dalMock.NewMockDAL,
wrapperMock.NewMockSkyhookNodeOnly) to be called with GinkgoT() so the mocks
integrate with Ginkgo’s test harness; ensure calls to HandleUninstallRequests,
HandleVersionChange, HandleCompletePod, HandleCancelledUninstalls,
UpdateBlockedCondition, HasUninstallWork, and related assertions remain the same
but are executed inside It blocks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/uninstall.md`:
- Around line 131-135: The docs mention the wrong condition name — replace every
occurrence of "nodewright.nvidia.com/Blocked" with the actual controller-emitted
condition "skyhook.nvidia.com/Blocked" (e.g., in the dependency troubleshooting
section and the other occurrence currently referencing
nodewright.nvidia.com/Blocked) so the documentation points to the real Blocked
condition emitted by the controller.

In `@k8s-tests/chainsaw/skyhook/malformed-node-state/chainsaw-test.yaml`:
- Around line 66-67: The test saves the node annotation to a fixed file
(/tmp/malformed-node-state-backup.json) which can collide across parallel runs;
update the kubectl command and related restore logic to write/read a unique temp
path (e.g., using mktemp or $TMPDIR with $$ or a random suffix) instead of the
hard-coded filename so each test run uses its own backup file; update every
occurrence that references /tmp/malformed-node-state-backup.json (including the
kubectl get node call and the restore/echo logic later in the script) to use the
new unique-temp variable.

In `@k8s-tests/chainsaw/skyhook/malformed-node-state/README.md`:
- Around line 11-13: The README currently says NewSkyhookNodeOnly returns a
wrapper with an "empty" cached state; update the wording to state that
NewSkyhookNodeOnly (in internal/wrapper/node.go) returns an "uncached" wrapper
(nodeState is left nil) so that downstream callers of State() will re-encounter
parse errors rather than having them suppressed by an empty cache.

In
`@k8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/chainsaw-test.yaml`:
- Around line 156-168: The fast-cleanup step currently patches
metadata.finalizers to null for the Skyhook CR
(uninstall-cancel-during-interrupt), which bypasses the controller and risks
leaving skyhook.nvidia.com annotations/labels/conditions on the shared node;
remove that patch and instead replace it with a wait/poll step that confirms the
controller finished cleaning the node (i.e., poll until HasSkyhookAnnotations()
is false or until the node no longer has skyhook.nvidia.com labels/annotations)
before finishing the test so the controller-run cleanup path is honored.

In `@operator/internal/controller/skyhook_controller.go`:
- Around line 625-640: The current code builds toUninstall (from
HandleVersionChange and toExplicitUninstall) globally and then applies it to
each node, which causes packages uninstalled on node A to be considered for node
B; fix by scoping uninstall lists to each node: inside the per-node processing
loop, compute or filter the global lists (results of HandleUninstallRequests and
HandleVersionChange) against the node's nodeState/selection so that only
packages actually marked for uninstall on that node are included in that node's
toUninstall before calling ApplyPackage; update the code paths around
HandleUninstallRequests, HandleVersionChange, the
toExplicitUninstall/toUninstall merge, and where ApplyPackage is invoked to use
the node-scoped list (or compute per-node via a helper that inspects nodeState)
and apply the same change to the other block noted (lines 687-705 equivalent).
- Around line 1466-1482: The current branch for skyhook.IsDisabled() &&
hasPendingUninstall incorrectly clears the deletion guard by calling
skyhook.GetSkyhook().RemoveCondition(deletionBlockedType) and falls through to
Phase 3, allowing deletion even though uninstall pods will never run; change
this so that when IsDisabled() && hasPendingUninstall we do NOT remove the
deletionBlockedType condition (keep or set it) and return/stop reconciliation
instead of falling through, and emit the same Warning via r.recorder.Eventf to
explain deletion is blocked; this references the skyhook.IsDisabled(),
hasPendingUninstall variable, the RemoveCondition(deletionBlockedType) call, and
the processSkyhooksPerNode() behavior so reviewers can locate and update the
branch logic.

---

Outside diff comments:
In `@operator/internal/controller/skyhook_controller.go`:
- Around line 1677-1681: After creating the interruption pod with r.Create, the
call to skyhookNode.Upsert(_package.PackageRef, _package.Image,
v1alpha1.StateInProgress, stage, 0, _package.ContainerSHA) is currently ignored;
if Upsert fails the controller reports success while the node annotation isn't
written, causing out-of-state pods. Fix by checking the error returned from
skyhookNode.Upsert and return a wrapped error (e.g., fmt.Errorf("error upserting
skyhook node state: %w", err)) so Reconcile surfaces the failure and triggers
controller-runtime retries.

---

Duplicate comments:
In `@docs/uninstall.md`:
- Around line 88-90: The paragraph for `enabled: false` currently contradicts
itself; update the description to state that when `enabled: false` (or nil) the
uninstall pods are not created and the finalizer does not run them, but the
node-state/version annotations (and other package state metadata) are
intentionally left on nodes as markers of the previous installation rather than
being immediately removed. Ensure the text replaces "state is cleaned up
immediately" with a clear statement that state is preserved on nodes and
uninstall pods are skipped.

In `@operator/internal/controller/cluster_state_v2.go`:
- Around line 491-513: HasUninstallWork currently returns immediately on the
first pkg.IsUninstalling() and skips scanning nodes, which hides nodeState read
errors; change it to record an uninstallRequested flag when any
pkg.IsUninstalling() is seen (do not return yet), then continue scanning all
nodes and if any node.State() returns an error immediately return that error,
and if any nodeState.IsUninstallCycleInProgress(pkg.GetUniqueName()) or the
finalizer condition (beingDeleted && pkg.UninstallEnabled() &&
!nodeState.IsUninstalled(pkg.GetUniqueName())) is true return true,nil; after
scanning all nodes, return true,nil if uninstallRequested is set, otherwise
return false,nil. Ensure you reference HasUninstallWork, pkg.IsUninstalling,
node.State, nodeState.IsUninstallCycleInProgress, nodeState.IsUninstalled, and
pkg.UninstallEnabled when implementing this change.

In `@operator/internal/controller/skyhook_controller_test.go`:
- Around line 1856-2946: The tests were added as stdlib-style Test.../t.Run
functions (e.g., TestHandleUninstallRequests, TestHandleVersionChange_WI3,
TestHandleCompletePod_WI4, etc.) into a file that uses Ginkgo; convert each
top-level Test* function into Ginkgo Describe/Context blocks and each t.Run case
into It (or DescribeTable) blocks, replacing NewWithT(t) with
NewWithT(GinkgoT()) and all mock constructors (e.g.,
wrapperMock.NewMockSkyhookNode, dalMock.NewMockDAL,
wrapperMock.NewMockSkyhookNodeOnly) to be called with GinkgoT() so the mocks
integrate with Ginkgo’s test harness; ensure calls to HandleUninstallRequests,
HandleVersionChange, HandleCompletePod, HandleCancelledUninstalls,
UpdateBlockedCondition, HasUninstallWork, and related assertions remain the same
but are executed inside It blocks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: bc5c36bb-3e60-4295-8be3-60d7498eee20

📥 Commits

Reviewing files that changed from the base of the PR and between 2c9d1bb and bfda1c1.

📒 Files selected for processing (17)
  • .gitignore
  • docs/uninstall.md
  • k8s-tests/chainsaw/skyhook/explicit-uninstall/update-remove-package.yaml
  • k8s-tests/chainsaw/skyhook/malformed-node-state/README.md
  • k8s-tests/chainsaw/skyhook/malformed-node-state/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/malformed-node-state/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/README.md
  • k8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/update-cancel-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/update-trigger-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-fix-config/chainsaw-test.yaml
  • operator/internal/controller/cluster_state_v2.go
  • operator/internal/controller/mock/SkyhookNodes.go
  • operator/internal/controller/skyhook_controller.go
  • operator/internal/controller/skyhook_controller_test.go
  • operator/internal/wrapper/node.go

Comment thread docs/uninstall.md
Comment thread k8s-tests/chainsaw/skyhook/malformed-node-state/chainsaw-test.yaml
Comment thread k8s-tests/chainsaw/skyhook/malformed-node-state/README.md
Comment thread k8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/chainsaw-test.yaml Outdated
Comment thread operator/internal/controller/skyhook_controller.go
Comment thread operator/internal/controller/skyhook_controller.go Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
operator/internal/controller/skyhook_controller.go (1)

1698-1712: ⚠️ Potential issue | 🟠 Major

Propagate interrupt-state update failures.

skyhookNode.Upsert(...) is ignored after the pod is created. If that write fails, the controller can leave the interrupt pod running without the matching node-state transition and churn it on the next reconcile.

💡 Suggested fix
-	_ = skyhookNode.Upsert(_package.PackageRef, _package.Image, v1alpha1.StateInProgress, stage, 0, _package.ContainerSHA)
+	if err := skyhookNode.Upsert(
+		_package.PackageRef,
+		_package.Image,
+		v1alpha1.StateInProgress,
+		stage,
+		0,
+		_package.ContainerSHA,
+	); err != nil {
+		return fmt.Errorf("error recording interrupt state for %s: %w", _package.GetUniqueName(), err)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@operator/internal/controller/skyhook_controller.go` around lines 1698 - 1712,
The call to skyhookNode.Upsert(...) is currently ignored, so any failure
updating the node state after creating the interrupt pod is lost; capture the
error returned by skyhookNode.Upsert (after r.Create(ctx, pod)), check it, and
return a wrapped fmt.Errorf if non-nil so reconcile fails and the pod/state
remain consistent; locate the sequence around createInterruptPodForPackage,
SetPackages, ctrl.SetControllerReference, r.Create and add error handling for
skyhookNode.Upsert to propagate failures.
♻️ Duplicate comments (2)
docs/uninstall.md (1)

131-135: ⚠️ Potential issue | 🟡 Minor

Use the emitted condition name (skyhook.nvidia.com/Blocked).

Line 134 and Line 181 reference nodewright.nvidia.com/Blocked, which does not match the condition namespace used elsewhere in this PR/docs.

Suggested doc fix
-- A `nodewright.nvidia.com/Blocked` condition is set with a message indicating the broken dependency
+- A `skyhook.nvidia.com/Blocked` condition is set with a message indicating the broken dependency
...
-Look for `nodewright.nvidia.com/Blocked` condition with the dependency chain message.
+Look for `skyhook.nvidia.com/Blocked` condition with the dependency chain message.

Also applies to: 181-181

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/uninstall.md` around lines 131 - 135, The docs reference the wrong
emitted condition name; replace any use of nodewright.nvidia.com/Blocked with
the correct emitted condition skyhook.nvidia.com/Blocked so the documentation
matches the condition namespace used across the PR; search for occurrences
(e.g., the condition string in the uninstall description) and update them to
skyhook.nvidia.com/Blocked to keep docs consistent.
operator/internal/controller/skyhook_controller_test.go (1)

1856-3017: ⚠️ Potential issue | 🟠 Major

Port the new uninstall coverage into the existing Ginkgo suite.

These cases are still added as func Test... + t.Run(...) blocks in a file that otherwise uses Ginkgo/Gomega, so the suite now mixes two harnesses in the same package. Describe/It/DescribeTable would keep this aligned with the rest of the file and the repo’s test conventions.

#!/bin/bash
# Verify the file still mixes stdlib tests with the Ginkgo suite.
rg -n '^\s*func Test|^\s*t\.Run\(' operator/internal/controller/skyhook_controller_test.go

As per coding guidelines, "operator/**/*_test.go: Use Ginkgo/Gomega for tests in Go, not stdlib t.Run; match the file and patterns of the package you're adding to."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@operator/internal/controller/skyhook_controller_test.go` around lines 1856 -
3017, The file mixes stdlib tests with the repo's Ginkgo/Gomega suite (you added
TestHandleUninstallRequests, TestHandleVersionChange_WI3,
TestHandleCompletePod_WI4, TestHandleCancelledUninstalls,
TestHandleUninstallRequests_FinalizerPath,
TestHandleUninstallRequests_InstallCompleteGuard, TestFilterUninstallForNode,
TestUpdateBlockedCondition, TestHasUninstallWork,
TestHandleCompletePod_VersionComparison, TestHandleVersionChange_DowngradeIsNoOp
and their t.Run subtests) which violates the package convention; convert each
top-level func Test... and its t.Run cases into the file's Ginkgo style
(Describe/Context/It or DescribeTable) using the same assertions (Gomega) and
remove t.Run/NewWithT usage so the suite is uniform, keeping the same test logic
and mocks (references: HandleUninstallRequests, HandleVersionChange,
HandleCompletePod, HandleCancelledUninstalls, filterUninstallForNode,
skyhookNodes.UpdateBlockedCondition, skyhookNodes.HasUninstallWork,
SkyhookReconciler.HandleCompletePod) and ensure setup/teardown use existing
BeforeEach/AfterEach patterns in the file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/uninstall.md`:
- Around line 42-46: The ordered list numbering after the fenced block in the
uninstall docs is continuing from the previous list and violates markdownlint
MD029; update the list so it restarts at "1." (or alternatively nest the fenced
block under the preceding item) — specifically change the items beginning "2.
The operator creates uninstall pods..." / "3. After uninstall completes..." /
"4. You may now safely remove..." to start at "1." (or indent the fenced block
as a sub-item of the prior list entry) so markdownlint no longer reports MD029.
- Around line 88-90: The current text for "enabled: false" is contradictory;
update the paragraph for the `enabled: false` (or nil) package deletion path to
state that no uninstall pods are created and the finalizer does not perform
node-level cleanup, so package state remains on nodes for admins to inspect;
mention that the finalizer/CR is removed immediately without uncordoning,
removing SCR labels/annotations, or running uninstall pods. Reference the
`enabled: false` flag, the finalizer, uninstall pods, and SCR labels/annotations
when making this change so it's unambiguous.
- Around line 164-179: The markdown has fenced code blocks that lack surrounding
blank lines (MD031); update the sections containing the kubectl command blocks
(e.g., the blocks starting with "kubectl logs -n skyhook <pod-name> -c
<package>-uninstall" and the "kubectl get nodes ... jsonpath" and "kubectl get
skyhook ... jsonpath" blocks) to ensure there is an empty line before and after
each triple-backtick fence so the linter passes.

In `@k8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/README.md`:
- Around line 21-25: Update the SLEEP_LEN value in the chainsaw-test manifest’s
description to match the actual value used by the test: change the documented
SLEEP_LEN from 20 to 5 in
k8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/chainsaw-test.yaml
so it matches the SLEEP_LEN="5" configured in skyhook.yaml; look for the
SLEEP_LEN mention in the chainsaw-test.yaml description block and correct it to
5 for consistency.

In `@operator/internal/controller/skyhook_controller.go`:
- Around line 823-826: The error returned from node.State() in the uninstall
helpers loses node context; update both HandleUninstallRequests and
HandleCancelledUninstalls to wrap failures from node.State() with the node name
using fmt.Errorf so you return something like fmt.Errorf("node %s: failed to get
state: %w", node.Name, err). Locate the node.State() call inside the loop over
skyhook.GetNodes() and replace the bare return of err with the wrapped error
(preserving the original err with %w) and ensure fmt is imported if not already.
- Around line 677-703: The filter currently only skips absent-but-uninstalled
packages when pkg.IsUninstalling() is true, which misses finalizer-driven
deletes where Uninstall.Apply=false; update the loop after node.State() to also
skip packages that have UninstallEnabled() &&
nodeState.IsUninstalled(pkg.GetUniqueName()) so finalizer uninstalls won’t be
re-enqueued by RunNext(); keep the existing node.State() error propagation and
the IsUninstallCycleInProgress check intact and reference the symbols
node.State(), IsUninstallCycleInProgress, pkg.IsUninstalling(),
pkg.UninstallEnabled(), nodeState.IsUninstalled(), and RunNext() when making the
change.

---

Outside diff comments:
In `@operator/internal/controller/skyhook_controller.go`:
- Around line 1698-1712: The call to skyhookNode.Upsert(...) is currently
ignored, so any failure updating the node state after creating the interrupt pod
is lost; capture the error returned by skyhookNode.Upsert (after r.Create(ctx,
pod)), check it, and return a wrapped fmt.Errorf if non-nil so reconcile fails
and the pod/state remain consistent; locate the sequence around
createInterruptPodForPackage, SetPackages, ctrl.SetControllerReference, r.Create
and add error handling for skyhookNode.Upsert to propagate failures.

---

Duplicate comments:
In `@docs/uninstall.md`:
- Around line 131-135: The docs reference the wrong emitted condition name;
replace any use of nodewright.nvidia.com/Blocked with the correct emitted
condition skyhook.nvidia.com/Blocked so the documentation matches the condition
namespace used across the PR; search for occurrences (e.g., the condition string
in the uninstall description) and update them to skyhook.nvidia.com/Blocked to
keep docs consistent.

In `@operator/internal/controller/skyhook_controller_test.go`:
- Around line 1856-3017: The file mixes stdlib tests with the repo's
Ginkgo/Gomega suite (you added TestHandleUninstallRequests,
TestHandleVersionChange_WI3, TestHandleCompletePod_WI4,
TestHandleCancelledUninstalls, TestHandleUninstallRequests_FinalizerPath,
TestHandleUninstallRequests_InstallCompleteGuard, TestFilterUninstallForNode,
TestUpdateBlockedCondition, TestHasUninstallWork,
TestHandleCompletePod_VersionComparison, TestHandleVersionChange_DowngradeIsNoOp
and their t.Run subtests) which violates the package convention; convert each
top-level func Test... and its t.Run cases into the file's Ginkgo style
(Describe/Context/It or DescribeTable) using the same assertions (Gomega) and
remove t.Run/NewWithT usage so the suite is uniform, keeping the same test logic
and mocks (references: HandleUninstallRequests, HandleVersionChange,
HandleCompletePod, HandleCancelledUninstalls, filterUninstallForNode,
skyhookNodes.UpdateBlockedCondition, skyhookNodes.HasUninstallWork,
SkyhookReconciler.HandleCompletePod) and ensure setup/teardown use existing
BeforeEach/AfterEach patterns in the file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 51456cc1-0015-462d-8eb8-1a946846ca53

📥 Commits

Reviewing files that changed from the base of the PR and between bfda1c1 and 2f8a084.

📒 Files selected for processing (11)
  • docs/uninstall.md
  • k8s-tests/chainsaw/skyhook/delete-blocked-when-disabled/README.md
  • k8s-tests/chainsaw/skyhook/delete-blocked-when-disabled/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/delete-blocked-when-disabled/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/delete-blocked-when-paused/README.md
  • k8s-tests/chainsaw/skyhook/delete-blocked-when-paused/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/README.md
  • k8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/chainsaw-test.yaml
  • operator/internal/controller/cluster_state_v2.go
  • operator/internal/controller/skyhook_controller.go
  • operator/internal/controller/skyhook_controller_test.go

Comment thread docs/uninstall.md
Comment thread docs/uninstall.md
Comment thread docs/uninstall.md
Comment thread operator/internal/controller/skyhook_controller.go Outdated
Comment thread operator/internal/controller/skyhook_controller.go Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
k8s-tests/chainsaw/skyhook/malformed-node-state/chainsaw-test.yaml (1)

66-67: ⚠️ Potential issue | 🟡 Minor

Avoid shared backup path across runs.

Line 66/67 and Line 93 still use a single global file in /tmp. This can collide in parallel jobs/retries and restore the wrong annotation payload.

♻️ Suggested hardening
-          kubectl get node "$node" -o jsonpath='{.metadata.annotations.skyhook\.nvidia\.com/nodeState_malformed-node-state}' > /tmp/malformed-node-state-backup.json
-          echo "Saved backup: $(wc -c < /tmp/malformed-node-state-backup.json) bytes"
+          backup="/tmp/malformed-node-state-backup-${node}.json"
+          kubectl get node "$node" -o jsonpath='{.metadata.annotations.skyhook\.nvidia\.com/nodeState_malformed-node-state}' > "$backup"
+          echo "Saved backup: $(wc -c < "$backup") bytes"
@@
-          value=$(cat /tmp/malformed-node-state-backup.json)
+          backup="/tmp/malformed-node-state-backup-${node}.json"
+          value=$(cat "$backup")

Also applies to: 93-93

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s-tests/chainsaw/skyhook/malformed-node-state/chainsaw-test.yaml` around
lines 66 - 67, The backup uses a fixed path
(/tmp/malformed-node-state-backup.json) which can collide across parallel runs;
change the save/restore to use a unique temp file (e.g., create a variable like
MALFORMED_BACKUP=$(mktemp) or include the node name/PID) and write kubectl
output to that variable (replace the literal > /tmp/... with >
"$MALFORMED_BACKUP") and then use the same "$MALFORMED_BACKUP" when restoring
the annotation so each run uses its own file and no cross-run overwrite occurs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@k8s-tests/chainsaw/skyhook/malformed-node-state/chainsaw-test.yaml`:
- Around line 66-67: The backup uses a fixed path
(/tmp/malformed-node-state-backup.json) which can collide across parallel runs;
change the save/restore to use a unique temp file (e.g., create a variable like
MALFORMED_BACKUP=$(mktemp) or include the node name/PID) and write kubectl
output to that variable (replace the literal > /tmp/... with >
"$MALFORMED_BACKUP") and then use the same "$MALFORMED_BACKUP" when restoring
the annotation so each run uses its own file and no cross-run overwrite occurs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f1513ead-6ae0-4e95-94cb-f0063b451759

📥 Commits

Reviewing files that changed from the base of the PR and between 2f8a084 and 377b8c1.

📒 Files selected for processing (1)
  • k8s-tests/chainsaw/skyhook/malformed-node-state/chainsaw-test.yaml

@ayuskauskas
Collaborator Author

@CodeRabbit full review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 28, 2026

✅ Actions performed

Full review triggered.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
k8s-tests/chainsaw/helm/helm-webhook-test/update-remove-enabled-pkg.yaml (1)

26-29: ⚠️ Potential issue | 🟠 Major

Use an explicit empty object for spec.packages to ensure webhook-path coverage.

Line 29 currently leaves `packages:` as null. That can fail schema/type validation before the admission webhook logic ever runs, so this fixture may not exercise the intended “enabled package removal rejection” path. Prefer an explicit empty map.

Suggested fix
 spec:
   nodeSelectors:
     matchLabels:
       skyhook.nvidia.com/test-node: skyhooke2e
   interruptionBudget:
     count: 1
-  packages:
+  packages: {}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s-tests/chainsaw/helm/helm-webhook-test/update-remove-enabled-pkg.yaml`
around lines 26 - 29, The fixture leaves spec.packages as null which can
short-circuit validation before the webhook; update the YAML so that the
packages field is an explicit empty map (i.e. set spec.packages to {} rather
than leaving it empty/null) to ensure the admission webhook sees an empty object
and the “enabled package removal rejection” path is exercised; locate the
packages entry (near interruptionBudget/count and the spec block) and replace
the implicit null with an explicit empty object.
operator/internal/controller/skyhook_controller.go (1)

1729-1744: ⚠️ Potential issue | 🟠 Major

Propagate Upsert failures after creating the interrupt pod.

If the pod is created but this state write fails, the node still looks like it's on the previous stage. Later reconciles can then treat a live interrupt pod as stray or recreate it against stale state.

🛠️ Suggested fix
-	_ = skyhookNode.Upsert(_package.PackageRef, _package.Image, v1alpha1.StateInProgress, stage, 0, _package.ContainerSHA)
+	if err := skyhookNode.Upsert(
+		_package.PackageRef,
+		_package.Image,
+		v1alpha1.StateInProgress,
+		stage,
+		0,
+		_package.ContainerSHA,
+	); err != nil {
+		return fmt.Errorf(
+			"error updating node state for interrupt pod [%s:%s] stage [%s]: %w",
+			_package.Name,
+			_package.Version,
+			stage,
+			err,
+		)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@operator/internal/controller/skyhook_controller.go` around lines 1729 - 1744,
The call to skyhookNode.Upsert is currently ignoring errors (using `_ =
skyhookNode.Upsert(...)`) after successfully creating the interrupt pod via
createInterruptPodForPackage and r.Create; change this so Upsert's returned
error is checked and propagated (e.g., capture err := skyhookNode.Upsert(...);
if err != nil { return fmt.Errorf("error upserting package state after creating
interrupt pod: %w", err) }) so failures updating node state are not dropped and
reconciles won't be left with a live pod but stale state.
♻️ Duplicate comments (9)
k8s-tests/chainsaw/skyhook/malformed-node-state/chainsaw-test.yaml (1)

66-67: ⚠️ Potential issue | 🟡 Minor

Use a unique backup filepath instead of a fixed /tmp filename.

The hard-coded backup path can collide across parallel/retried runs and restore the wrong annotation value.

♻️ Suggested hardening
-          kubectl get node "$node" -o jsonpath='{.metadata.annotations.skyhook\.nvidia\.com/nodeState_malformed-node-state}' > /tmp/malformed-node-state-backup.json
-          echo "Saved backup: $(wc -c < /tmp/malformed-node-state-backup.json) bytes"
+          backup="/tmp/malformed-node-state-backup-${node}.json"
+          kubectl get node "$node" -o jsonpath='{.metadata.annotations.skyhook\.nvidia\.com/nodeState_malformed-node-state}' > "$backup"
+          echo "Saved backup: $(wc -c < "$backup") bytes"
@@
-          value=$(cat /tmp/malformed-node-state-backup.json)
+          backup="/tmp/malformed-node-state-backup-${node}.json"
+          value=$(cat "$backup")

Also applies to: 93-94

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s-tests/chainsaw/skyhook/malformed-node-state/chainsaw-test.yaml` around
lines 66 - 67, Replace the fixed /tmp/malformed-node-state-backup.json usage
with a unique temporary filepath for each run (e.g., use mktemp or include the
node name and PID/timestamp) when saving and later restoring the annotation;
update both occurrences shown (the kubectl get redirect and the echo that
references the file) so they use the same generated temp filename instead of the
hard-coded path to avoid cross-run collisions.
k8s-tests/chainsaw/skyhook/malformed-node-state/README.md (1)

11-13: ⚠️ Potential issue | 🟡 Minor

Use “uncached” instead of “empty cached state.”

This wording is inaccurate: an uncached (nil) state and an empty cached state imply different downstream behavior.

✏️ Suggested wording fix
-1. `NewSkyhookNodeOnly` (in `internal/wrapper/node.go`) no longer aborts
-   construction on a parse failure — it returns a wrapper whose cached state
-   is empty so downstream `State()` callers re-encounter the error.
+1. `NewSkyhookNodeOnly` (in `internal/wrapper/node.go`) no longer aborts
+   construction on a parse failure — it returns a wrapper with uncached state
+   (`nodeState` left nil), so downstream `State()` callers re-encounter the error.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s-tests/chainsaw/skyhook/malformed-node-state/README.md` around lines 11 -
13, Update the README wording to say the wrapper returns an "uncached (nil)
state" rather than "an empty cached state" so the behavior distinction is clear;
reference the constructor NewSkyhookNodeOnly and downstream callers of State()
in the sentence to keep context.
k8s-tests/chainsaw/skyhook/uninstall-on-delete/chainsaw-test.yaml (1)

64-83: ⚠️ Potential issue | 🟠 Major

Add an explicit negative assertion for disabled-package uninstall pod creation.

This still only proves the enabled uninstall pod exists; it does not directly fail when a disabled package also gets uninstall init containers.

Suggested hardening
     - assert:
         ## uninstall pod for enabled package
         resource:
           kind: Pod
           apiVersion: v1
           metadata:
             namespace: skyhook
             labels:
               skyhook.nvidia.com/name: uninstall-on-delete
               skyhook.nvidia.com/package: enabled-pkg-2.1.4
           spec:
             initContainers:
               - name: enabled-pkg-init
               - name: enabled-pkg-uninstall
                 args:
                   ([0]): uninstall
               - name: enabled-pkg-uninstallcheck
                 args:
                   ([0]): uninstall-check
+    - script:
+        ## ensure disabled package did not get uninstall init containers
+        content: |
+          if kubectl get pods -n skyhook -l skyhook.nvidia.com/name=uninstall-on-delete -o jsonpath='{range .items[*]}{range .spec.initContainers[*]}{.name}{"\n"}{end}{end}' | grep -q 'disabled-pkg-uninstall'; then
+            echo "ERROR: uninstall-disabled package got uninstall init container"
+            kubectl get pods -n skyhook -l skyhook.nvidia.com/name=uninstall-on-delete -o yaml
+            exit 1
+          fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s-tests/chainsaw/skyhook/uninstall-on-delete/chainsaw-test.yaml` around
lines 64 - 83, Add a negative assertion to ensure a disabled package does not
create uninstall init containers: check for a Pod resource in namespace
"skyhook" with label "skyhook.nvidia.com/package: disabled-pkg-2.1.4" (or the
disabled package label used in tests) and assert it either does not exist or
does not contain initContainers named
"enabled-pkg-uninstall"/"enabled-pkg-uninstallcheck" (or the
uninstall/uninstall-check args); update the assert block that currently verifies
the enabled package Pod to also include this negative check so the test fails if
uninstall init containers appear for the disabled package.
docs/uninstall.md (5)

163-179: ⚠️ Potential issue | 🟡 Minor

Add blank lines around these fenced code blocks.

The troubleshooting section still has fences immediately after prose lines, which keeps tripping MD031.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/uninstall.md` around lines 163 - 179, Add blank lines immediately before
and after each fenced code block that begins with "kubectl logs -n skyhook",
"kubectl get nodes -l skyhook.nvidia.com/test-node=skyhooke2e", and "kubectl get
skyhook <name>" so the fences are separated from surrounding prose (this will
satisfy MD031). Edit the sections containing those three fenced blocks to insert
one empty line above the opening ``` and one empty line below the closing ```,
keeping the exact command contents and headings unchanged.

88-90: ⚠️ Potential issue | 🟡 Minor

This enabled: false deletion note is still contradictory.

“state is cleaned up immediately” conflicts with “The package state remains on nodes.” Please make it explicit that no uninstall pods run and the node-state entry is preserved for inspection.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/uninstall.md` around lines 88 - 90, Clarify the contradictory "enabled:
false" description by explicitly stating that for packages with `enabled: false`
(or nil) no uninstall pods are created or run, and that the controller skips
uninstall actions while preserving the package state entries on nodes for
administrator inspection (i.e., node-state entries are not removed); update the
two conflicting sentences so they read something like “No uninstall pods are
run; controller skips uninstall actions and immediately proceeds without
removing node-state entries — the package state remains on nodes for
administrators to inspect.” Reference the `enabled: false` wording and the
“uninstall pods”/“package state remains on nodes” phrases when making the
change.

42-46: ⚠️ Potential issue | 🟡 Minor

Restart this ordered list after the fenced block.

Markdown parses the fenced YAML block as ending the previous list, so these items should restart at 1./2./3. instead of continuing as 2./3./4.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/uninstall.md` around lines 42 - 46, Restart the ordered list after the
fenced YAML block so numbering begins at 1; specifically update the list entries
starting "2. The operator creates uninstall pods..." / "3. After uninstall
completes..." / "4. You may now safely remove..." to start at "1.", "2.", "3."
(or insert a blank line / an explicit HTML break after the fenced block) to
ensure markdown treats them as a new ordered list following the fenced YAML
block.

127-127: ⚠️ Potential issue | 🟡 Minor

Say packages are re-installed, not nodes.

“nodes may need to re-install” reads as if the node itself is being reinstalled. The behavior described elsewhere in this doc is that packages may need to be re-installed on some nodes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/uninstall.md` at line 127, Update the table cell text that currently
reads "Cancel (`apply: true` -> `false`) | **Warning** — nodes may need to
re-install |" so it accurately says that packages—not the nodes—are
re-installed; replace the phrase "nodes may need to re-install" with something
like "packages may need to be re-installed on some nodes" (or similar wording)
to clarify the subject; locate and edit the exact table row string to make this
change.

131-135: ⚠️ Potential issue | 🟡 Minor

Use the actual skyhook.nvidia.com/Blocked condition name.

These sections still point readers at nodewright.nvidia.com/Blocked, which is not the condition namespace used elsewhere in this change set.

Also applies to: 181-181

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/uninstall.md` around lines 131 - 135, Replace the incorrect condition
namespace `nodewright.nvidia.com/Blocked` with the correct
`skyhook.nvidia.com/Blocked` wherever it appears in the uninstall documentation
(the bullet list describing blocked packages and the other occurrence noted);
update the example message text if necessary to use `skyhook.nvidia.com/Blocked`
so all references to the Blocked condition are consistent with the rest of the
change set.
operator/internal/controller/skyhook_controller_test.go (1)

1857-3325: 🛠️ Refactor suggestion | 🟠 Major

Port these new controller tests into the existing Ginkgo suite.

This file already uses Ginkgo/Gomega, but the new uninstall coverage is added as func Test... blocks with t.Run(...). Please keep the new cases in the same suite style (Describe, It, DescribeTable, Entry) instead of mixing frameworks. As per coding guidelines: operator/**/*_test.go: Use Ginkgo/Gomega for tests in Go, not stdlib t.Run; match the file and patterns of the package you're adding to.

#!/bin/bash
# Verify this file mixes Ginkgo with stdlib test entrypoints/subtests.
sed -n '21,40p' operator/internal/controller/skyhook_controller_test.go
rg -n '^(func Test|[[:space:]]*t\.Run\()' operator/internal/controller/skyhook_controller_test.go
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@operator/internal/controller/skyhook_controller_test.go` around lines 1857 -
3325, The tests currently defined as func Test... wrappers with t.Run blocks
must be converted into Ginkgo-style specs: replace each
TestHandleUninstallRequests, TestHandleVersionChange_WI3,
TestHandleCompletePod_WI4, TestHandleCancelledUninstalls,
TestHandleUninstallRequests_FinalizerPath,
TestHandleUninstallRequests_InstallCompleteGuard, TestFilterUninstallForNode,
TestShouldSkipApplyForUninstall, TestUpdateBlockedCondition,
TestHasUninstallWork, TestHandleCompletePod_VersionComparison, and
TestHandleVersionChange_DowngradeIsNoOp blocks into Describe/Context/It (or
DescribeTable/Entry where appropriate) suites and move each t.Run case into an
It; update mock constructions (wrapperMock.NewMockSkyhookNode(t) etc.) to pass
GinkgoT() and change NewWithT(t) to NewWithT(GinkgoT()); ensure imports use
Ginkgo/Gomega (e.g. . "github.com/onsi/ginkgo/v2" and .
"github.com/onsi/gomega") and remove usage of testing.T entrypoints so there are
no func Test* top-level stdlib tests left, keeping all references to symbols
like HandleUninstallRequests, HandleVersionChange, HandleCompletePod,
HandleCancelledUninstalls, filterUninstallForNode, shouldSkipApplyForUninstall,
skyhookNodes.UpdateBlockedCondition and skyhookNodes.HasUninstallWork intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@k8s-tests/chainsaw/skyhook/package-upgrade/assert-install.yaml`:
- Around line 77-82: The assertion currently expects status.observedGeneration:
2 which is incorrect for an initial install; update the expected value to
status.observedGeneration: 1 in the assert-install fixture so the first
create/assertion passes; locate the YAML key status.observedGeneration in the
assert-install.yaml and change its value from 2 to 1.

In `@k8s-tests/chainsaw/skyhook/package-upgrade/chainsaw-test.yaml`:
- Around line 165-168: The expected observedGeneration is incorrect: update the
status.observedGeneration value from 4 to 2 in the YAML test assertion (the
block containing status: status: complete observedGeneration: 4 completeNodes:
1/1) so it matches the single create + single update workflow; ensure only the
status.observedGeneration field is changed to 2.

In `@k8s-tests/chainsaw/skyhook/uninstall-cancel/chainsaw-test.yaml`:
- Around line 68-69: Replace the fixed "sleep: duration: 15s" step with a
deterministic poll that waits for the uninstall stage/condition to appear and
then triggers the cancel; specifically, remove the "sleep" block and implement a
loop or test harness "wait/until" step that checks the target resource's
status/phase (the uninstall stage/condition used by the test) e.g., by
inspecting the CR status or a pod/job label, retrying with short intervals and a
reasonable timeout, and then immediately execute the cancel step once the
uninstall condition is observed (reference the existing "sleep" block and the
subsequent cancel step to place the new wait before cancellation).

In `@k8s-tests/chainsaw/skyhook/uninstall-cancel/update-trigger-uninstall.yaml`:
- Around line 34-38: The fixture currently injects MY_VAR but doesn't validate
it; update the uninstall.sh script (the uninstall.sh block) to assert the
environment variable is present and correct by checking MY_VAR at runtime and
exiting non-zero on mismatch so the test fails if env propagation regresses;
ensure the script prints a clear error on failure and continues to perform the
original uninstall steps on success.

In `@k8s-tests/chainsaw/skyhook/uninstall-mixed-packages/chainsaw-test.yaml`:
- Line 120: Update the inline comment at the current comment line that reads
"Skyhook returns to complete — pkg-keep no longer tracked, pkg-remove already
uninstalled" so it accurately reflects the assertions on Lines 116-118: indicate
that pkg-keep remains in node-state (still tracked) and pkg-remove is
uninstalled; reference the package names "pkg-keep" and "pkg-remove" and the
"node-state" assertion in the revised wording to avoid confusion.

In `@operator/internal/controller/cluster_state_v2.go`:
- Around line 743-756: The condition message currently lists all malformed nodes
via badNodes/truncated and can grow unbounded; modify the s.skyhook.AddCondition
call to limit the output to the first N entries (e.g., N=10) by slicing badNodes
and building truncated using truncateNodeName on only those, then if
len(badNodes) > N append "and X more" where X = len(badNodes)-N, and use
strings.Join on the limited truncated slice in the fmt.Sprintf message while
still reporting the full count (len(badNodes)); update references around
badNodes, truncated, truncateNodeName, and the s.skyhook.AddCondition Message
construction accordingly.
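The bounded-message approach this prompt describes can be sketched in plain Go. This is an illustrative sketch only: `truncateNodeName`, the message format, and the limit of 10 are stand-ins for the real symbols and constants in `cluster_state_v2.go`.

```go
package main

import (
	"fmt"
	"strings"
)

// truncateNodeName mirrors the described behavior: names longer than 10
// characters are cut to their first 10 characters plus "...".
func truncateNodeName(name string) string {
	if len(name) > 10 {
		return name[:10] + "..."
	}
	return name
}

// buildMalformedMessage caps the listed nodes at limit and appends an
// "and X more" suffix, while still reporting the full count, so the
// condition message stays bounded no matter how many nodes are malformed.
func buildMalformedMessage(badNodes []string, limit int) string {
	shown := badNodes
	if len(shown) > limit {
		shown = shown[:limit]
	}
	truncated := make([]string, 0, len(shown))
	for _, n := range shown {
		truncated = append(truncated, truncateNodeName(n))
	}
	msg := fmt.Sprintf("%d node(s) have malformed state: %s",
		len(badNodes), strings.Join(truncated, ", "))
	if extra := len(badNodes) - limit; extra > 0 {
		msg += fmt.Sprintf(" and %d more", extra)
	}
	return msg
}

func main() {
	nodes := []string{"ip-10-0-0-1.ec2.internal", "ip-10-0-0-2.ec2.internal", "n3"}
	fmt.Println(buildMalformedMessage(nodes, 2))
}
```

The same slicing pattern applies directly inside the `s.skyhook.AddCondition` call: slice `badNodes` before building `truncated`, and keep `len(badNodes)` for the count.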

In `@operator/internal/controller/skyhook_controller_test.go`:
- Around line 2257-2271: The tests TestHandleCompletePod_WI4 and
TestHandleCompletePod_VersionComparison use stdlib t.Run and
context.Background() and must be converted to Ginkgo/Gomega style so they use
the suite-level ctx and Gomega assertions; replace the t.Run blocks with
Describe/It (or Context/It) blocks, remove NewWithT(t) usage, update mock
expectations (mockDAL.NewMockDAL, wrapperMock.NewMockSkyhookNodeOnly) to expect
the suite ctx variable instead of context.Background(), invoke
SkyhookReconciler.HandleCompletePod with the suite ctx and use Gomega
Expect/Matchers for assertions on updated and err, and adjust any exact-match
context expectations to gomock.Any() if appropriate to avoid brittle context
equality checks.

In `@operator/internal/controller/skyhook_controller.go`:
- Around line 854-864: The bug is that in the StageUninstall branch we
unconditionally re-add packages (toUninstall) even when the node state is
StateComplete, which lets ApplyPackage/NextStage move a completed uninstall back
to StageApply; change the logic in the StageUninstall case inside Handle
packages progressing through the uninstall cycle to detect status.State ==
v1alpha1.StateComplete and handle it specially (remove node state or promote it
to v1alpha1.StageUninstallInterrupt/StateInterrupted) instead of appending to
toUninstall; only append the package (using the existing appendIfNotPresent and
p := skyhook.GetSkyhook().Spec.Packages[name]) when the state is not
StateComplete so ApplyPackage is not fed completed uninstall records.
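A minimal sketch of the guard this comment asks for. The types here are hypothetical, simplified stand-ins for the `v1alpha1` stage/state enums and the controller's `toUninstall` bookkeeping; the point is only that `StateComplete` records are never re-queued.

```go
package main

import "fmt"

// Stand-ins for the real v1alpha1 enums.
type State string
type Stage string

const (
	StateComplete   State = "complete"
	StateInProgress State = "in_progress"
	StageUninstall  Stage = "uninstall"
)

type packageStatus struct {
	Name  string
	Stage Stage
	State State
}

func appendIfNotPresent(list []string, name string) []string {
	for _, n := range list {
		if n == name {
			return list
		}
	}
	return append(list, name)
}

// collectUninstalls sketches the fix: packages whose uninstall already
// reached StateComplete are skipped (to be handled separately, e.g. state
// removal or promotion to the uninstall-interrupt stage) instead of being
// fed back to ApplyPackage, which would bounce them to StageApply.
func collectUninstalls(statuses []packageStatus) []string {
	var toUninstall []string
	for _, status := range statuses {
		if status.Stage != StageUninstall {
			continue
		}
		if status.State == StateComplete {
			// Completed uninstalls are never re-queued.
			continue
		}
		toUninstall = appendIfNotPresent(toUninstall, status.Name)
	}
	return toUninstall
}

func main() {
	fmt.Println(collectUninstalls([]packageStatus{
		{Name: "done", Stage: StageUninstall, State: StateComplete},
		{Name: "pending", Stage: StageUninstall, State: StateInProgress},
	}))
}
```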

---

Outside diff comments:
In `@k8s-tests/chainsaw/helm/helm-webhook-test/update-remove-enabled-pkg.yaml`:
- Around line 26-29: The fixture leaves spec.packages as null which can
short-circuit validation before the webhook; update the YAML so that the
packages field is an explicit empty map (i.e. set spec.packages to {} rather
than leaving it empty/null) to ensure the admission webhook sees an empty object
and the “enabled package removal rejection” path is exercised; locate the
packages entry (near interruptionBudget/count and the spec block) and replace
the implicit null with an explicit empty object.

In `@operator/internal/controller/skyhook_controller.go`:
- Around line 1729-1744: The call to skyhookNode.Upsert is currently ignoring
errors (using `_ = skyhookNode.Upsert(...)`) after successfully creating the
interrupt pod via createInterruptPodForPackage and r.Create; change this so
Upsert's returned error is checked and propagated (e.g., capture err :=
skyhookNode.Upsert(...); if err != nil { return fmt.Errorf("error upserting
package state after creating interrupt pod: %w", err) }) so failures updating
node state are not dropped and reconciles won't be left with a live pod but
stale state.

---

Duplicate comments:
In `@docs/uninstall.md`:
- Around line 163-179: Add blank lines immediately before and after each fenced
code block that begins with "kubectl logs -n skyhook", "kubectl get nodes -l
skyhook.nvidia.com/test-node=skyhooke2e", and "kubectl get skyhook <name>" so
the fences are separated from surrounding prose (this will satisfy MD031). Edit
the sections containing those three fenced blocks to insert one empty line above
the opening ``` and one empty line below the closing ```, keeping the exact
command contents and headings unchanged.
- Around line 88-90: Clarify the contradictory "enabled: false" description by
explicitly stating that for packages with `enabled: false` (or nil) no uninstall
pods are created or run, and that the controller skips uninstall actions while
preserving the package state entries on nodes for administrator inspection
(i.e., node-state entries are not removed); update the two conflicting sentences
so they read something like “No uninstall pods are run; controller skips
uninstall actions and immediately proceeds without removing node-state entries —
the package state remains on nodes for administrators to inspect.” Reference the
`enabled: false` wording and the “uninstall pods”/“package state remains on
nodes” phrases when making the change.
- Around line 42-46: Restart the ordered list after the fenced YAML block so
numbering begins at 1; specifically update the list entries starting "2. The
operator creates uninstall pods..." / "3. After uninstall completes..." / "4.
You may now safely remove..." to start at "1.", "2.", "3." (or insert a blank
line / an explicit HTML break after the fenced block) to ensure markdown treats
them as a new ordered list following the fenced YAML block.
- Line 127: Update the table cell text that currently reads "Cancel (`apply:
true` -> `false`) | **Warning** — nodes may need to re-install |" so it
accurately says that packages—not the nodes—are re-installed; replace the phrase
"nodes may need to re-install" with something like "packages may need to be
re-installed on some nodes" (or similar wording) to clarify the subject; locate
and edit the exact table row string to make this change.
- Around line 131-135: Replace the incorrect condition namespace
`nodewright.nvidia.com/Blocked` with the correct `skyhook.nvidia.com/Blocked`
wherever it appears in the uninstall documentation (the bullet list describing
blocked packages and the other occurrence noted); update the example message
text if necessary to use `skyhook.nvidia.com/Blocked` so all references to the
Blocked condition are consistent with the rest of the change set.

In `@k8s-tests/chainsaw/skyhook/malformed-node-state/chainsaw-test.yaml`:
- Around line 66-67: Replace the fixed /tmp/malformed-node-state-backup.json
usage with a unique temporary filepath for each run (e.g., use mktemp or include
the node name and PID/timestamp) when saving and later restoring the annotation;
update both occurrences shown (the kubectl get redirect and the echo that
references the file) so they use the same generated temp filename instead of the
hard-coded path to avoid cross-run collisions.

In `@k8s-tests/chainsaw/skyhook/malformed-node-state/README.md`:
- Around line 11-13: Update the README wording to say the wrapper returns an
"uncached (nil) state" rather than "an empty cached state" so the behavior
distinction is clear; reference the constructor NewSkyhookNodeOnly and
downstream callers of State() in the sentence to keep context.

In `@k8s-tests/chainsaw/skyhook/uninstall-on-delete/chainsaw-test.yaml`:
- Around line 64-83: Add a negative assertion to ensure a disabled package does
not create uninstall init containers: check for a Pod resource in namespace
"skyhook" with label "skyhook.nvidia.com/package: disabled-pkg-2.1.4" (or the
disabled package label used in tests) and assert it either does not exist or
does not contain initContainers named
"enabled-pkg-uninstall"/"enabled-pkg-uninstallcheck" (or the
uninstall/uninstall-check args); update the assert block that currently verifies
the enabled package Pod to also include this negative check so the test fails if
uninstall init containers appear for the disabled package.

In `@operator/internal/controller/skyhook_controller_test.go`:
- Around line 1857-3325: The tests currently defined as func Test... wrappers
with t.Run blocks must be converted into Ginkgo-style specs: replace each
TestHandleUninstallRequests, TestHandleVersionChange_WI3,
TestHandleCompletePod_WI4, TestHandleCancelledUninstalls,
TestHandleUninstallRequests_FinalizerPath,
TestHandleUninstallRequests_InstallCompleteGuard, TestFilterUninstallForNode,
TestShouldSkipApplyForUninstall, TestUpdateBlockedCondition,
TestHasUninstallWork, TestHandleCompletePod_VersionComparison, and
TestHandleVersionChange_DowngradeIsNoOp blocks into Describe/Context/It (or
DescribeTable/Entry where appropriate) suites and move each t.Run case into an
It; update mock constructions (wrapperMock.NewMockSkyhookNode(t) etc.) to pass
GinkgoT() and change NewWithT(t) to NewWithT(GinkgoT()); ensure imports use
Ginkgo/Gomega (e.g. . "github.com/onsi/ginkgo/v2" and .
"github.com/onsi/gomega") and remove usage of testing.T entrypoints so there are
no func Test* top-level stdlib tests left, keeping all references to symbols
like HandleUninstallRequests, HandleVersionChange, HandleCompletePod,
HandleCancelledUninstalls, filterUninstallForNode, shouldSkipApplyForUninstall,
skyhookNodes.UpdateBlockedCondition and skyhookNodes.HasUninstallWork intact.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: e49a1e6b-642a-4b6a-ae3b-f3fef72795f6

📥 Commits

Reviewing files that changed from the base of the PR and between b4090a6 and 4a31b6c.

📒 Files selected for processing (81)
  • .gitignore
  • README.md
  • docs/uninstall.md
  • k8s-tests/chainsaw/helm/helm-webhook-test/chainsaw-test.yaml
  • k8s-tests/chainsaw/helm/helm-webhook-test/skyhook-apply-without-enabled.yaml
  • k8s-tests/chainsaw/helm/helm-webhook-test/skyhook-uninstall-valid.yaml
  • k8s-tests/chainsaw/helm/helm-webhook-test/update-downgrade-enabled-pkg.yaml
  • k8s-tests/chainsaw/helm/helm-webhook-test/update-remove-enabled-pkg.yaml
  • k8s-tests/chainsaw/skyhook/delete-blocked-when-disabled/README.md
  • k8s-tests/chainsaw/skyhook/delete-blocked-when-disabled/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/delete-blocked-when-disabled/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/delete-blocked-when-paused/README.md
  • k8s-tests/chainsaw/skyhook/delete-blocked-when-paused/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/delete-blocked-when-paused/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-after-uninstall/README.md
  • k8s-tests/chainsaw/skyhook/downgrade-after-uninstall/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-after-uninstall/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-after-uninstall/update-downgrade.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-after-uninstall/update-trigger-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-enabled-false-preserves-state/README.md
  • k8s-tests/chainsaw/skyhook/downgrade-enabled-false-preserves-state/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-enabled-false-preserves-state/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/downgrade-enabled-false-preserves-state/update-downgrade.yaml
  • k8s-tests/chainsaw/skyhook/explicit-uninstall/README.md
  • k8s-tests/chainsaw/skyhook/explicit-uninstall/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/explicit-uninstall/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/explicit-uninstall/update-remove-package.yaml
  • k8s-tests/chainsaw/skyhook/explicit-uninstall/update-trigger-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/malformed-node-state/README.md
  • k8s-tests/chainsaw/skyhook/malformed-node-state/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/malformed-node-state/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/package-upgrade/README.md
  • k8s-tests/chainsaw/skyhook/package-upgrade/assert-cm-update.yaml
  • k8s-tests/chainsaw/skyhook/package-upgrade/assert-install.yaml
  • k8s-tests/chainsaw/skyhook/package-upgrade/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/package-upgrade/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/package-upgrade/update.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/README.md
  • k8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/update-cancel-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel-during-interrupt/update-trigger-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel/README.md
  • k8s-tests/chainsaw/skyhook/uninstall-cancel/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel/update-cancel-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-cancel/update-trigger-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-fix-config/README.md
  • k8s-tests/chainsaw/skyhook/uninstall-fix-config/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-fix-config/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-fix-config/update-fix-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-fix-config/update-trigger-uninstall.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-mixed-packages/README.md
  • k8s-tests/chainsaw/skyhook/uninstall-mixed-packages/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-mixed-packages/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-mixed-packages/update-remove-keep.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-mixed-packages/update-uninstall-one.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-on-delete/README.md
  • k8s-tests/chainsaw/skyhook/uninstall-on-delete/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-on-delete/skyhook.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/README.md
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/assert-install.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/chainsaw-test.yaml
  • operator/CHANGELOG.md
  • operator/api/v1alpha1/skyhook_types.go
  • operator/api/v1alpha1/skyhook_types_test.go
  • operator/api/v1alpha1/skyhook_webhook.go
  • operator/api/v1alpha1/zz_generated.deepcopy.go
  • operator/config/crd/bases/skyhook.nvidia.com_skyhooks.yaml
  • operator/internal/controller/cluster_state_v2.go
  • operator/internal/controller/metrics.go
  • operator/internal/controller/mock/SkyhookNodes.go
  • operator/internal/controller/pod_controller.go
  • operator/internal/controller/skyhook_controller.go
  • operator/internal/controller/skyhook_controller_test.go
  • operator/internal/controller/suite_test.go
  • operator/internal/wrapper/mock/SkyhookNode.go
  • operator/internal/wrapper/node.go
  • operator/internal/wrapper/node_test.go
  • operator/internal/wrapper/skyhook.go
  • operator/internal/wrapper/skyhook_test.go
💤 Files with no reviewable changes (3)
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/README.md
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/chainsaw-test.yaml
  • k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook/assert-install.yaml

@lockwobr lockwobr assigned ayuskauskas and unassigned ayuskauskas Apr 30, 2026
ayuskauskas and others added 3 commits May 1, 2026 09:06
Adds an opt-in declarative uninstall workflow to the Skyhook CRD and
operator, replacing the prior "remove-from-spec-to-uninstall" behavior.

Packages expose a new `uninstall` block with `enabled` and `apply`
flags. Setting `apply=true` triggers an uninstall pod on every target
node with the package's full config. Packages with an `interrupt`
configured (reboot, service restart, etc.) run the interrupt after
uninstall via a new `StageUninstallInterrupt` stage, distinct from
the install-cycle `StageInterrupt`.

The webhook enforces: `apply` requires `enabled`; removing an enabled
package requires prior uninstall completion; version downgrades are
rejected unless the package is already fully uninstalled.

CR deletion drives uninstall via a finalizer that waits for all nodes
to complete before cleaning up labels, annotations, and ConfigMaps.

The legacy "downgrade auto-triggers uninstall" path is removed.
Downgrades of `enabled=false` packages are allowed but preserve the
old version's node-state entry (D2 semantics: non-absent signals
"not cleanly uninstalled").

Ships with unit tests covering every new code path and chainsaw e2e
tests for the full lifecycle, cancellation, failure recovery,
finalizer cleanup, and webhook gating. Documentation in
`docs/uninstall.md`.
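The first webhook rule in this commit message (`apply` requires `enabled`) can be sketched as follows. This is a hedged illustration with hypothetical, simplified types; the real `Package`/`Uninstall` types live in `operator/api/v1alpha1` and the real webhook also enforces the removal and downgrade rules.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the real CRD types.
type Uninstall struct {
	Enabled bool
	Apply   bool
}

type Package struct {
	Version   string
	Uninstall *Uninstall
}

// validateUninstallFlags rejects apply=true unless enabled=true,
// mirroring the first rule the webhook enforces.
func validateUninstallFlags(p Package) error {
	if p.Uninstall != nil && p.Uninstall.Apply && !p.Uninstall.Enabled {
		return errors.New("uninstall.apply requires uninstall.enabled")
	}
	return nil
}

func main() {
	bad := Package{Uninstall: &Uninstall{Apply: true}}
	fmt.Println(validateUninstallFlags(bad))
}
```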
Adds a stage-agnostic skyhook.nvidia.com/NodeStateMalformed condition and
the plumbing that makes it reliably reachable even when every downstream
state reader bails on a parse error.

Key pieces:

- NewSkyhookNodeOnly no longer aborts on a json.Unmarshal failure of the
  nodeState annotation. It returns a wrapper with an uncached (nil) state,
  so downstream State() callers re-encounter the error and handle it
  themselves. Previously the error propagated out of BuildState and
  aborted the entire reconcile before any condition-setting code could
  run, leaving operators with nothing but "Reconciler error" logs.

- UpdateNodeStateMalformedCondition (on SkyhookNodes) collects every node
  whose annotation fails to parse and writes a condition distinct from
  UninstallFailed, with Reason=ParseError. Node names longer than 10 characters
  are truncated to <first10>... so the message stays compact on clusters
  with long DNS-style names.

- The condition is set and persisted at the top of the main per-skyhook
  Reconcile loop, before HandleFinalizer / ReportState / IntrospectSkyhook
  / HandleVersionChange run. Any of those can still abort the reconcile,
  but the user-visible signal is already persisted and keeps getting
  re-applied until the annotation is repaired. RemoveCondition is a
  no-op when nothing's wrong, so the healthy path incurs no extra PATCH.

- HasUninstallWork, UpdateBlockedCondition, UpdateUninstallConditions
  return errors instead of silently skipping unreadable nodes. The
  upstream callers (processSkyhooksPerNode, hasReadyNodesForSkyhook,
  RunSkyhookPackages) append them to the reconcile error aggregate for
  backoff-driven retry. UpdateUninstallConditions keeps focus on actual
  uninstall failures; the state-parse surface lives in the dedicated
  condition above.

Also tightens the uninstall-fix-config chainsaw test: after the spec fix
it now asserts the UninstallFailed condition is actually cleared
(previously only status=complete was checked, so a lingering condition
wouldn't have failed the test).

Chainsaw test malformed-node-state/ exercises the whole path: install →
overwrite annotation with invalid JSON → assert NodeStateMalformed with
reason=ParseError and a message listing the affected node → restore →
assert condition gone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
ayuskauskas and others added 23 commits May 1, 2026 09:06
UpdateUninstallConditions gated its per-package work on the spec-only
pkg.IsUninstalling() || (beingDeleted && UninstallEnabled()) predicate,
which hid nodes still in an uninstall cycle after the spec stopped
requesting uninstall. Concretely: a user flipping uninstall.apply true ->
false while the node was at StageUninstallInterrupt (the uncancellable
phase — interrupt pod has fired and must complete) caused the
UninstallInProgress / UninstallFailed conditions to clear while the
uninstall was objectively still running.

Fix: consult nodeState.IsUninstallCycleInProgress first. Skip a package
only when no cycle is in progress AND the spec isn't requesting uninstall.
When a cycle is in progress, always mark inProgress and inspect
nodeState[...].State for StateErroring to drive hasErrors, so conditions
remain accurate until the cycle actually exits (RemoveState on
uninstall-interrupt completion).

Adds a chainsaw test (uninstall-cancel-during-interrupt) guarding the
behavior: install with a service interrupt and SLEEP_LEN=5 to widen the
StageUninstallInterrupt window, flip apply=true, wait for the interrupt
phase, flip apply=false (uncancellable cancel attempt), assert
UninstallInProgress remains True, wait for cycle exit + package
re-install, assert condition cleared. The test strips the finalizer via
`kubectl patch` as a final step so chainsaw's built-in teardown doesn't
trigger another full uninstall cycle and exceed the context deadline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
…filter

Audited the remaining node.State() callers after the earlier propagation
work. Two silently-skipping sites had real operational hazards:

- UpsertConfigmaps (fixed in the previous commit) — an unreadable node
  could leave a mid-uninstall package out of uninstallingPkgs, letting
  the loop blow away its ConfigMap and interfere with the uninstall.

- RunSkyhookPackages runnable-package filter — `nodeState, _ :=
  node.State()` silently produced a nil map on parse errors. The
  IsUninstallCycleInProgress check below is nil-safe (returns false on
  nil), so an apply pod could be queued against a node that was
  actually mid-uninstall. Now returns a wrapped error naming the node;
  the user-visible NodeStateMalformed condition set at the top of
  Reconcile remains the surface, and controller-runtime requeues.

Other State() callers either already propagate (HasUninstallWork,
UpdateBlockedCondition, HandleUninstallRequests, HandleCancelledUninstalls,
HandleVersionChange, HandleFinalizer Phase 2, pods-vs-state validation,
pod_controller StageUpgrade, migration 0.5.0) or silently skip by design
(UpdateUninstallConditions — NodeStateMalformed owns that surface;
CleanupSCRMetadata — shielded by Phase 2 propagation of the finalizer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
…s it

CleanupSCRMetadata called `state, _ := node.State()` and relied on
`len(state) > 0` to decide whether to preserve the nodeState annotation.
Two issues:

- The discarded error was implicit: a parse failure produced a nil map
  which happened to give `len == 0`, so the correct "wipe" behavior
  fell out by accident. Switched to `state, err := node.State()` and
  `preserveNodeState = err == nil && len(state) > 0` so the intent is
  explicit and a future refactor of State() can't regress it.

- When the annotation gets deleted the in-memory cache `node.nodeState`
  was left populated. State() returns the cached map whenever it's
  non-nil, so any subsequent State() read on the same wrapper in this
  reconcile would serve pre-wipe data. Added `node.nodeState = nil`
  whenever we wipe, keeping cache and annotation in sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
…blocked

The finalizer's single paused-or-disabled guard was blocking deletion for
every paused/disabled Skyhook, including ones with no uninstall work to
drive. That caused chainsaw tests ending with enabled=false to hit
context-deadline-exceeded during cleanup.

Split the guard by actual blocker:

- Malformed nodeState: still DeletionBlocked (Reason MalformedNodeState),
  return error for controller-runtime backoff.
- Paused + uninstall-enabled package still in nodeState: DeletionBlocked
  (Reason PausedWithPendingUninstall), hold finalizer -- pause is a
  "resume later" signal.
- Disabled + uninstall-enabled package still in nodeState: emit Warning
  DeletedWhileDisabled and fall through to Phase 3. CleanupSCRMetadata
  already preserves the non-empty nodeState (D2), so host-side remnants
  remain traceable without the finalizer hanging.
- Normal pending-uninstall wait and no-pending fall-through unchanged.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
… finalizer

The uninstall-cancel-during-interrupt test previously patched
metadata.finalizers to null during teardown to skirt a second full
uninstall cycle on CR delete. Stripping the finalizer bypasses the
controller, leaving skyhook.nvidia.com/* annotations, labels, and
conditions on the shared test node for follow-on tests to trip over.

Replace fast-cleanup with drain-uninstall-before-cleanup: re-trigger
uninstall so nodeState goes to '{}', then let chainsaw's built-in CR
delete run. Phase 2 of the finalizer sees no pending uninstall work and
falls straight through to Phase 3, which properly clears per-Skyhook
node metadata and uncordons — fast enough for chainsaw's default
context deadline.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
The previous iteration let a disabled Skyhook fall through the finalizer's
Phase 2 and proceed to cleanup, even with uninstall.enabled=true packages
still tracked in nodeState. That silently discards host-side state the
user explicitly asked to be cleaned.

Treat disabled + hasPendingUninstall the same as paused + hasPendingUninstall:
emit a DeletionBlocked condition (reason DisabledWithPendingUninstall) and
hold the finalizer until the user re-enables the Skyhook so the uninstall
flow can run. Remove the DeletedWhileDisabled warning event that announced
the now-removed fall-through path.

Packages with uninstall.enabled=false are unaffected -- Phase 2 already
skips them, and CleanupSCRMetadata preserves the non-empty nodeState
annotation (D2 semantics: non-absent entry means files remain on host).

Tests:
- Update delete-blocked-when-paused expected reason from the legacy
  PausedOrDisabled to PausedWithPendingUninstall.
- Add delete-blocked-when-disabled, symmetric to the paused test:
  install -> disable -> delete -> assert DeletionBlocked +
  DisabledWithPendingUninstall reason + Warning event -> re-enable ->
  assert CR drains.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
…nstall

HandleUninstallRequests and HandleVersionChange build toUninstall
globally across all of a Skyhook's nodes. In a multi-node rollout, a
package can be pending uninstall on node B while already absent on
node A, yet the global list was prepended to every node's toRun
unconditionally. On node A, ApplyPackage then saw a package not in
nodeState, fell through to StageApply (the default), and created an
install pod -- re-installing a package the user explicitly uninstalled.
Once node B's uninstall finished and the user left apply=true, the
next reconcile re-uninstalled it, setting up an uninstall/reinstall
loop.

Extract filterUninstallForNode and apply it in RunSkyhookPackages
before prepending toUninstall to toRun, so entries absent from the
current node's state are dropped.

All chainsaw tests are single-node, which is why this wasn't caught
before. The regression is covered by a new TestFilterUninstallForNode
stdlib test (five cases: empty, present, absent, mixed, version
mismatch).

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
Flip !(beingDeleted && pkg.UninstallEnabled()) to
(!beingDeleted || !pkg.UninstallEnabled()) in UpdateUninstallConditions
so golangci-lint staticcheck QF1001 stops blocking CI.

Pure refactor -- no behavior change.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
The repair-and-verify-cleanup step used a bash here-string (wc -c <<<"$value")
to byte-count the restored annotation. CI's /usr/bin/sh is dash, which
rejects <<< with "Syntax error: redirection unexpected", failing the test
before it can run the kubectl annotate that follows.

Replace with printf %s "$value" | wc -c -- same byte count, POSIX-portable.
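For reference, the portable form next to the one dash rejects (the annotation value here is a placeholder):

```shell
#!/bin/sh
value='{"skyhook":"state"}'
# wc -c <<<"$value" is a bash/zsh here-string; dash fails with
# "Syntax error: redirection unexpected". The POSIX-portable byte count:
printf %s "$value" | wc -c
```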

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
… helpers

HandleUninstallRequests and HandleCancelledUninstalls returned the bare
err from node.State(), losing the node identifier in the error chain.
That violates the project rule against returning bare err from a
function that does more than one thing -- when reconcile retries on
backoff, the operator log can't say which node's annotation is
unreadable without re-deriving it from elsewhere.

Wrap with fmt.Errorf("node %s: reading state in <Helper>: %w", ...)
to mirror the pattern already used at the other node.State() call sites.
No behavior change.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
HandleVersionChange returned bare err from node.State(); the call site
in the stage-metrics initialization loop wrapped the error but dropped
the node name. Both lose the node identifier in the chain when
reconcile retries on backoff.

Wrap with fmt.Errorf("node %s: ... : %w", ...) for consistency with the
other node.State() call sites.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
The pre-ApplyPackage filter in RunSkyhookPackages skipped a package
absent from nodeState only when pkg.IsUninstalling() was true
(uninstall.apply=true). Finalizer-driven uninstalls run for every
UninstallEnabled package regardless of Apply (see
HandleUninstallRequests' needsUninstall predicate), so after the
finalizer completed uninstall on a node with apply=false the filter
missed it and RunNext would re-enqueue the package -- ApplyPackage
fell through to StageApply and re-installed what was just uninstalled,
setting up a reinstall loop until all nodes' uninstalls finished.

Extract shouldSkipApplyForUninstall mirroring HandleUninstallRequests:
"uninstall requested" = IsUninstalling() || (beingDeleted &&
UninstallEnabled()). Apply it to the filter; routine first-installs of
uninstall-enabled packages (absent + CR not deleting) are unaffected.
Covered by a new TestShouldSkipApplyForUninstall with six cases,
including the never-installed-enabled-package case that guards against
over-skipping.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
isPackageFullyUninstalled returned false whenever Status.NodeState was
empty, which permanently blocked removal of uninstall.enabled=true
packages from any Skyhook whose NodeSelector matched zero nodes or that
hadn't yet been reconciled. Zero tracked nodes means there's nothing to
uninstall, so treat it as done rather than as "not yet confirmed."

Also document the separate finalizer-deadlock when a package is stuck
at state: erroring on an install stage — the CR can't be deleted
because HandleUninstallRequests only transitions from install-cycle
stages when state is complete. Added to docs/uninstall.md as a Known
Issue with diagnosis steps and ranked workarounds (fix install, CLI
reset, finalizer strip) pending a proper fix.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
UpdateBlockedCondition only set the condition while the dependency was
in an active uninstall cycle (StageUninstall / StageUninstallInterrupt).
Once the dep's uninstall pod completed and the package went absent from
nodeState, the condition cleared even though the dependent was still
broken — the Skyhook would report complete with an unsatisfiable
DependsOn silently in the spec.

Two changes:

1. Treat "gone" as either inCycle OR (IsUninstalling && absent from
   every node). The second case is the post-uninstall terminal state
   per D2 and must also trigger Blocked.

2. Only flag Blocked when the dependent has pending work — if
   IsPackageComplete is true on every node the orphaned DependsOn
   blocks nothing in-flight, so the Skyhook can go complete. The
   condition is raised only when it would otherwise be in_progress.

Message now distinguishes "is being uninstalled" (in-cycle) from
"has been uninstalled" (terminal) so users know whether cancelling is
still an option.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
Move UpdateBlockedCondition and UpdateUninstallConditions from
RunSkyhookPackages up to the pre-pause branch of Reconcile, next to
UpdateNodeStateMalformedCondition. processSkyhooksPerNode short-circuits
on paused/disabled Skyhooks, which meant the conditions went stale the
moment a user paused a Skyhook with in-flight uninstall work — or
during CR deletion, where HandleFinalizer may set DeletionBlocked but
the uninstall-state conditions never refreshed.

HandleFinalizer stays focused on deletion gating; condition mirroring
is a separate concern that runs for every Skyhook regardless of
pause/disable/delete state.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
A rare but permanently stuck scenario: if the uninstall pod has already
transitioned the node to stage: uninstall-interrupt but the interrupt
pod isn't present (never fired, manually deleted, evicted), and the
user then removes the interrupt: block from the package spec,
ProcessInterrupt short-circuits on HasInterrupt=false, no pod gets
created, and HandleUninstallRequests only calls RemoveState once
state == complete — which never happens without a pod.

Documented with trigger sequence, the jq one-liner to confirm the
state, re-add-interrupt and CLI-reset workarounds, and the proposed
long-term fix (commit the uninstall-interrupt decision irrevocably at
HandleCompletePod time, ignore later spec edits to the interrupt:
block).

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
After moving the condition updates to the pre-pause branch of Reconcile,
a state-read error in UpdateBlockedCondition short-circuited the entire
per-Skyhook loop and made HandleFinalizer's malformed-nodeState branch
unreachable — the branch that sets DeletionBlocked with
Reason=MalformedNodeState and emits a deletion-specific Warning event.

Skip unreadable nodes silently and keep processing; the parse failure
is already surfaced by UpdateNodeStateMalformedCondition at the top of
Reconcile. When a node's state is unreadable, stay conservative on the
"terminal uninstalled" determination: we can't rule out the package
still being present on the missing node, so neither the inCycle nor
done branch fires for its packages. inCycle observations from readable
nodes still surface — that signal is monotonic (one node confirms the
cycle).

Matches the tolerant pattern already used in UpdateUninstallConditions.
HandleFinalizer's stateErr branch is reachable again and its
deletion-specific DeletionBlocked condition will fire on malformed
state during CR deletion.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
Four user-reachable behaviors that don't yet have code fixes:

1. apply: true on a never-installed package is a silent no-op. The
   IsUninstalling() + absent-from-nodeState collision treats the
   package as "already uninstalled" and skips install. Easy to hit by
   copy-pasting YAML across clusters.

2. Changing version while apply: true is in flight strands the new
   version. The webhook only blocks downgrades during uninstall;
   upgrades slip through and shouldSkipApplyForUninstall then treats
   the new version as uninstalled.

3. Cancelling during stage: uninstall may "uninstall then reinstall"
   instead of resuming the install. HandleCancelledUninstalls resets
   the stage but doesn't kill the running pod, and HandleCompletePod
   erases the reset entry if the pod finishes first.

4. CleanupSCRMetadata's suffix match can over-delete autoTaint_* when
   the Skyhook name happens to equal a taint key. Almost never
   reachable in practice but worth documenting.

Each entry includes symptom, why, confirmation commands, workaround,
and proposed long-term fix.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
…iteral

Two lint failures from the recent uninstall fixes:

1. gocyclo: Reconcile crossed the cyclomatic-complexity ceiling (31 > 30)
   after H4 added two more error/return branches at the top of the
   per-Skyhook loop. Pull the four-step "refresh per-Skyhook conditions"
   block (NodeStateMalformed → save → Blocked → UninstallInProgress /
   UninstallFailed) into a helper, leaving Reconcile with one
   error-handling branch in its place.

2. goconst: the "pkg-b" dependent-package fixture appeared 18 times in
   TestUpdateBlockedCondition. Promote it to a const dependentPkgName
   and a matchesPkgB matcher closure used by every IsPackageComplete
   expectation.

No behavior change. Tests still pass.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
Skyhook.AddCondition compared LastTransitionTime for equality, but every
caller passes metav1.Now(), so the comparison never matched and the
function set Updated=true on every invocation — even when nothing about
the condition had functionally changed.

This interacted badly with refreshSkyhookConditions (introduced when
UpdateBlockedCondition / UpdateUninstallConditions were moved to the top
of the per-skyhook reconcile loop): every reconcile re-asserted the
existing conditions, set Updated=true, and ReportState then short-circuited
the rest of Reconcile. validateAndUpsertSkyhookData (which
calls ValidateRunningPackages to invalidate pods whose env no longer
matches the spec) never ran, so a failing uninstall pod could not be
recreated when the user fixed its config — surfaced by the
uninstall-fix-config chainsaw test.

Follow K8s API conventions: preserve LastTransitionTime when Status is
unchanged, and treat the call as a true no-op when Status, Reason,
Message, and ObservedGeneration all match. Updated is only flipped when
something functionally differs.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
UpdateNodeStateMalformedCondition listed every malformed node by name in
the condition message — only the per-name length was bounded by
truncateNodeName. On a large cluster with many malformed nodes this
could grow the message into the kilobytes, where it stops being a
readable signal and starts pressuring the status object size.

Cap the inlined list at maxMalformedNodesListed (5) entries after the
deterministic sort, then summarise any remainder as "and N more". The
total affected count was already in the message and is preserved, so
users still see the full scale of the problem.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
Adds a Known Issue covering the narrow path where force-deleting a
Skyhook (--force --grace-period=0) bypasses the finalizer mid-uninstall,
leaving the node annotation at stage: uninstall / state: complete. On
recreate, HandleUninstallRequests re-adds the package and ApplyPackage's
NextStage call maps uninstall -> apply for completed packages, running
one stray apply pod before the controller re-triggers uninstall.

Issue is self-correcting (one extra apply cycle, then proper uninstall)
and not exercised by chainsaw because force-delete + recreate isn't a
tested flow. Captured here with confirmation snippet, avoidance guidance,
manual workaround (kubectl skyhook reset on the affected nodes), and the
shape of the long-term fix in HandleUninstallRequests.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
… helpers

Aligns the explicit-uninstall conditions with main's new convention
(internal/wrapper/skyhook_conditions.go): bare types like "Blocked",
"UninstallInProgress", "UninstallFailed", "NodeStateMalformed", and
"DeletionBlocked" routed through wrapper.AddSkyhookCondition /
RemoveSkyhookConditionTypes instead of legacy "skyhook.nvidia.com/<Type>"
strings via the wrapper's AddCondition / RemoveCondition methods.

Why: these conditions are introduced in this branch — there are no
existing consumers to support — so they get the bare style without a
legacy alias. Registering them as constants in skyhook_conditions.go
gives the next reader a single place to find every supported type.

Updated chainsaw assertions and READMEs in delete-blocked-when-paused,
delete-blocked-when-disabled, malformed-node-state,
uninstall-cancel-during-interrupt, and uninstall-fix-config to query
the bare type names. docs/uninstall.md, the design doc, and the plan
doc updated to match.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
@ayuskauskas ayuskauskas force-pushed the feat/uninstall-enhancement-collapsed branch from b51f8d7 to 56d68a7 Compare May 1, 2026 16:50
Collaborator

@lockwobr lockwobr left a comment


LGTM

@ayuskauskas ayuskauskas merged commit 71ca22b into main May 1, 2026
15 checks passed
@ayuskauskas ayuskauskas deleted the feat/uninstall-enhancement-collapsed branch May 1, 2026 20:44
@lockwobr lockwobr mentioned this pull request May 7, 2026
4 tasks


Development

Successfully merging this pull request may close these issues.

Uninstall pod does not receive package ConfigMap or Env — user-provided scripts and environment variables are silently dropped

4 participants