chore: Cherry pick again#254
Merged
Merged
Conversation
…fter-uninstall (#241) The test left mypkg@1.2.3 freshly installed at chainsaw's automatic CLEANUP, where the new delete-time uninstall finalizer (from #200) must run uninstall pods on every node before releasing the CR. That exceeds chainsaw's default cleanup window, causing "context deadline exceeded". Add a final uninstall-v1 step that flips uninstall.apply=true on the downgraded version and waits for nodeState to empty, so the CR has no installed packages when cleanup deletes it. Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
* fix(test): harden core e2e pool against finalizer-cleanup races and improve diagnosability Two changes across the six core-pool chainsaw tests: 1. Add `timeouts.cleanup: 120s` to every core-pool test. Chainsaw's default cleanup window (~30s) is too tight for the project's CR finalizer, which must uncordon nodes, remove labels/annotations, and GC package pods before releasing. Same failure mode that hit downgrade-after-uninstall (PR #241). 2. Add `catch:` blocks to `depends-on` and `simple-skyhook`, mirroring the pattern used by the other four core tests. When these tests flake, CI now captures the Node, Skyhook CR, and package Pods for diagnosis instead of failing without artifacts. No assert logic changes — this is a foundation pass to remove a known flake class and make future flakes diagnosable. Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com> * fix(operator): close StageInterrupt trap when configUpdates signal decays A package whose only interrupt is a `configInterrupts` entry (no top-level `interrupt` block) could get stuck at `StageInterrupt/StateSkipped` permanently when `Status.ConfigUpdates` cleared or never persisted (e.g. due to a 409 on the spec patch). The state machine queried the dynamic `HasInterrupt(config)` signal at four points, and once that signal decayed to false, the package was untouchable: `ProgressSkipped` wouldn't promote it, `NextStage` wouldn't advance it past Interrupt, and `GetComplete`/`IsPackageComplete` wouldn't count it complete. Decouple progression-past-Interrupt from the dynamic signal: - `NextStage`: add `StageInterrupt → StagePostInterrupt` to the no-interrupt default map. The with-interrupt full-replacement map is preserved as-is — PR #200 deliberately omits `StageUninstall → StageApply` from it so with-interrupt uninstalls route via `StageUninstallInterrupt`; collapsing the maps would silently re-enable that transition. - `ProgressSkipped`: drop the `HasInterrupt` gate. The only writer of `StateSkipped` at `StageInterrupt` is `ProcessInterrupt`'s budget-contention branch, which already decided the package needed an interrupt to schedule. Stage alone is sufficient. - `GetComplete`, `IsPackageComplete`: treat `StagePostInterrupt` as unconditionally terminal. The only way to reach it is via the interrupt cycle; gating the terminal check on a signal that can decay is redundant with the entry gate and only becomes load-bearing when the trap fires. Reproducer in `skyhook_types_test.go` exercises all three sites. Surfaced by the `chainsaw/config-skyhook` "update while running" step, which patches the spec mid-flight and concurrently triggers the `HandleMigrations` and `processSkyhooksPerNode` 409 conflict paths — those stretch convergence enough that ConfigUpdates can be lost between the package reaching StageInterrupt and ProgressSkipped firing. Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com> * refactor(operator): GetComplete delegates to IsPackageComplete GetComplete and IsPackageComplete encoded the same terminal-state logic twice. Have GetComplete iterate the spec packages and call IsPackageComplete for each, and move the explanatory comment to IsPackageComplete where the predicate lives. Behavior preserved: the prior implementation iterated node state and filtered to spec packages by name+version match; the new one iterates spec packages and looks them up in node state by unique name (name|version). Same set of packages either way. Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com> --------- Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
rice-riley
approved these changes
May 22, 2026
Contributor
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (184)
📝 WalkthroughWalkthroughAdds a design doc and implements a separate webhook-bootstrap manager/lease with concurrent startup, adjusts Skyhook NodeState completion/interrupt transitions with tests, extends Chainsaw tests and cleanup timeouts, updates Makefile and deps (govulncheck, docker default), and vendors significant golang.org/x/* (http2, idna, sys, tools) and Windows/Unix syscall changes. Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Suggested reviewers
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
7 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
git cherry-pick 2dfec98 6d6f126 bdfa3e6