Skip to content

util: tools to drain and uncordon cluster#1060

Merged
Arvindthiru merged 42 commits intoAzure:mainfrom
Arvindthiru:cordon-tool
Apr 10, 2025
Merged

util: tools to drain and uncordon cluster#1060
Arvindthiru merged 42 commits intoAzure:mainfrom
Arvindthiru:cordon-tool

Conversation

@Arvindthiru
Copy link
Copy Markdown
Contributor

@Arvindthiru Arvindthiru commented Feb 28, 2025

Description of your changes

Adding new tools to drain and uncordon cluster

Fixes #

I have:

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

Special notes for your reviewer

If drain fails we automatically uncordon cluster

Comment thread tools/cordon-cluster/main/main.go Outdated
Comment thread tools/cordon-cluster/main/main.go Outdated
Comment thread tools/cordon-cluster/main/main.go Outdated
Comment thread tools/cordon-cluster/main/main.go Outdated
Comment thread tools/cordon-cluster/main/main.go Outdated
Comment thread tools/cordon-cluster/main/main.go Outdated
@Arvindthiru Arvindthiru marked this pull request as ready for review March 4, 2025 21:08
Comment thread tools/README.md Outdated
Comment thread tools/cordon-cluster/main/main.go Outdated
Comment thread tools/README.md Outdated
Comment thread tools/README.md Outdated
Comment thread tools/README.md Outdated
Comment thread tools/cordon-cluster/main/main.go Outdated
Comment thread tools/cordon-cluster/main/main.go Outdated
Comment thread tools/cordon-cluster/main/main.go Outdated
Comment thread tools/cordon-cluster/main/main.go Outdated
Comment thread tools/cordon-cluster/main/main.go Outdated
@Arvindthiru Arvindthiru marked this pull request as draft March 7, 2025 22:29
@Arvindthiru Arvindthiru changed the title util: simple tool to cordon cluster util: tools to drain and uncordon cluster Mar 11, 2025
@Arvindthiru Arvindthiru marked this pull request as ready for review March 12, 2025 00:29
Comment thread tools/draincluster/main.go Outdated
Comment thread tools/draincluster/README.md Outdated
Comment thread tools/draincluster/README.md Outdated
Comment thread tools/draincluster/README.md Outdated
Comment thread tools/draincluster/README.md
Comment thread tools/draincluster/README.md Outdated
Comment thread tools/draincluster/main.go Outdated
Comment thread tools/utils/common.go
return nil, err
}

hubClient, err := client.New(restConfig, client.Options{Scheme: scheme})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this client use cache?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread tools/draincluster/main.go Outdated
Comment thread tools/draincluster/main.go Outdated
Comment thread tools/draincluster/main.go Outdated
@michaelawyu
Copy link
Copy Markdown
Contributor

Hi Arvind! Below are some of the points I've been thinking about since our earlier discussion; hopefully they can be of use/relevance to you:

@michaelawyu
Copy link
Copy Markdown
Contributor

Hi Arvind! Below are some of the points I've been thinking about since our earlier discussion; hopefully they can be of use/relevance to you:

a) The kubectl drain command functions similar to what we have in this PR right now, which first marks a node as unschedulable (cordoned) then attempts to evict all pods (if applicable -> certain types of pods will be exempted). The tool is not fully consistent; there still leaves a (albeit quite short) time window when a pod can be assigned to a node during the draining process -> this can happen if the draining runs when the scheduler is about to finish scheduling a pod. In this regard, however, K8s has an advantage over us: in K8s we have first the pods, then the scheduling decisions (whereas in Fleet, bindings are provisioned after the scheduling is completed); so if strong consistency is desired (i.e., must drain everything applicable), kubectl can just watch for unassigned pods in addition to assigned pods, and wait until scheduling is complete on these unassigned pods (successful or not, either way's OK).

@michaelawyu
Copy link
Copy Markdown
Contributor

michaelawyu commented Mar 18, 2025

b) as discussed earlier in our sync, if we would like to implement stronger consistency in the draining tool, options would include

  1. use the NoExecute semantics -> this is not yet implemented
  2. set up the rollout controller to reject bindings associated with a unschedulable cluster -> the concerns are mostly that *) this incurs coupling/logic leaking in our code + *) since we have multiple rollout controller options now it would not feel right to make this case a part of our rollout contract
  3. set up the scheduler to be able to ACK cordoning/draining attempts, so that we know for sure no bindings would be created any more -> this option also incurs concerns regarding coupling/logic leaking

There might not be a very easily approachable solution here 😞

@michaelawyu
Copy link
Copy Markdown
Contributor

c) With that being said, it might make more sense for us to evaluate how important the stronger consistency is to our users, esp. considering that kubectl does not give the same guarantee either (though in theory it can implement this much more easily).

@michaelawyu
Copy link
Copy Markdown
Contributor

d) A side note is that, kubectl drain/cordon marks a cluster as unschedulable by setting its node spec (there is a schedulable boolean value) -> it does not use taints (consequently no toleration can be set up). Of course this might not necessarily apply to our case

Arvind Thirumurugan added 2 commits March 31, 2025 18:03
@kaito-pr-agent
Copy link
Copy Markdown

kaito-pr-agent Bot commented Apr 4, 2025

Persistent review updated to latest commit 7fa0c3f

@kaito-pr-agent
Copy link
Copy Markdown

kaito-pr-agent Bot commented Apr 4, 2025

Persistent review updated to latest commit 729c015

@kaito-pr-agent
Copy link
Copy Markdown

kaito-pr-agent Bot commented Apr 4, 2025

Persistent review updated to latest commit e657966

@kaito-pr-agent
Copy link
Copy Markdown

kaito-pr-agent Bot commented Apr 4, 2025

Persistent review updated to latest commit d7eebb8

Copy link
Copy Markdown
Contributor

@michaelawyu michaelawyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a number of comments, PTAL

Comment thread tools/draincluster/README.md
Comment thread tools/draincluster/drain/drain.go
Comment thread tools/draincluster/drain/drain.go
Comment thread tools/draincluster/drain/drain.go Outdated
Comment thread tools/draincluster/drain/drain.go Outdated
Comment thread tools/draincluster/drain/drain.go Outdated
Comment thread tools/draincluster/drain/drain.go
Comment thread tools/draincluster/drain/drain.go Outdated
Comment thread tools/draincluster/main.go Outdated
Comment thread tools/uncordoncluster/uncordon/uncordon.go Outdated
@kaito-pr-agent
Copy link
Copy Markdown

kaito-pr-agent Bot commented Apr 7, 2025

Persistent review updated to latest commit fc6ae91

@kaito-pr-agent
Copy link
Copy Markdown

kaito-pr-agent Bot commented Apr 7, 2025

Persistent review updated to latest commit 7c34d23

@kaito-pr-agent
Copy link
Copy Markdown

kaito-pr-agent Bot commented Apr 9, 2025

Persistent review updated to latest commit c715acf

@kaito-pr-agent
Copy link
Copy Markdown

kaito-pr-agent Bot commented Apr 9, 2025

Persistent review updated to latest commit e4717e6

@kaito-pr-agent
Copy link
Copy Markdown

kaito-pr-agent Bot commented Apr 9, 2025

Persistent review updated to latest commit eb20eee

@kaito-pr-agent
Copy link
Copy Markdown

kaito-pr-agent Bot commented Apr 9, 2025

Persistent review updated to latest commit a369776

@kaito-pr-agent
Copy link
Copy Markdown

kaito-pr-agent Bot commented Apr 9, 2025

Persistent review updated to latest commit f00d3ad

Copy link
Copy Markdown
Contributor

@michaelawyu michaelawyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Arvind! Apologies, I just need a little bit more time to review the code for one more time. Will add approval or more comments later if you do not mind.

Comment thread tools/draincluster/drain/drain.go
Comment thread tools/draincluster/drain/drain.go
}
}
executedCondition := eviction.GetCondition(string(placementv1beta1.PlacementEvictionConditionTypeExecuted))
if executedCondition == nil || executedCondition.Status == metav1.ConditionFalse {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Arvind! It's not a blocker per se but I still believe that we should reply on API semantics for validations in this regard.

}
}
executedCondition := eviction.GetCondition(string(placementv1beta1.PlacementEvictionConditionTypeExecuted))
if executedCondition == nil || executedCondition.Status == metav1.ConditionFalse {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plus, even with the function you mentioned, unknown Valid condition can still pass the check, so are unknown Executed conditions.

Copy link
Copy Markdown
Contributor

@michaelawyu michaelawyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few other nits; otherwise LGTM.

Comment thread tools/draincluster/drain/drain.go Outdated
Comment thread tools/draincluster/drain/drain.go Outdated
Comment thread tools/draincluster/main.go
@kaito-pr-agent
Copy link
Copy Markdown

Persistent review updated to latest commit 1885458

Copy link
Copy Markdown
Contributor

@michaelawyu michaelawyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ;)

@Arvindthiru Arvindthiru merged commit d246699 into Azure:main Apr 10, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants