Cordon all outdated nodes before any rolling update action #41
The reason I haven't designed the algorithm to cordon everything before beginning the draining phase is that if you cordon everything and there's a spike in load, scaling will be delayed by at least the time it takes for a new node to spin up, and possibly longer, because the rolling update handler may also be draining a node while the scheduler is trying to schedule pods created by an HPA scale-up. TL;DR: This was done on purpose; cordoning all nodes makes the upgrade faster, but it also increases the risk that the upgrade will no longer be "transparent"/"graceful" and may cause degraded application performance. That said, I'm not completely against implementing it as an optional feature if you believe it to be necessary for your use case(s).
I understand, that totally makes sense.
Sounds good!
Resolved by #42 |
Hey @TwiN, is this getting released any time soon? |
@someone-stole-my-name It was already available through the
Describe the feature request
The current behaviour is to iterate over every outdated node, cordoning and then immediately draining each one. I think the behaviour should instead be to first cordon all outdated nodes before doing anything else, and then proceed as usual.
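To make the difference concrete, here is a minimal sketch of the two orderings. The `Node` struct, node names, and the action log are hypothetical stand-ins; in the real handler these steps would be Kubernetes API calls (cordon = mark unschedulable, drain = evict pods):

```go
package main

import "fmt"

// Node is a hypothetical stand-in for a cluster node.
type Node struct {
	Name     string
	Outdated bool
}

// perNode mirrors the current behaviour: cordon and immediately drain
// each outdated node in turn. Pods evicted from an early node can still
// be rescheduled onto a later, not-yet-cordoned outdated node.
func perNode(nodes []Node) []string {
	var actions []string
	for _, n := range nodes {
		if n.Outdated {
			actions = append(actions, "cordon "+n.Name, "drain "+n.Name)
		}
	}
	return actions
}

// cordonFirst is the requested behaviour: cordon every outdated node up
// front so evicted pods can only land on up-to-date nodes, then drain.
func cordonFirst(nodes []Node) []string {
	var actions []string
	for _, n := range nodes {
		if n.Outdated {
			actions = append(actions, "cordon "+n.Name)
		}
	}
	for _, n := range nodes {
		if n.Outdated {
			actions = append(actions, "drain "+n.Name)
		}
	}
	return actions
}

func main() {
	nodes := []Node{{"a", true}, {"b", true}, {"c", false}}
	fmt.Println(perNode(nodes))     // [cordon a drain a cordon b drain b]
	fmt.Println(cordonFirst(nodes)) // [cordon a cordon b drain a drain b]
}
```

With `perNode`, pods drained from node `a` may land on still-schedulable node `b` and be evicted a second time when `b` is drained; `cordonFirst` rules that out at the cost of temporarily shrinking schedulable capacity.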
Why do you personally want this feature to be implemented?
I wish for this feature to be implemented because the current behaviour often (in my experience) leads to pods being rescheduled onto another outdated instance. This causes a lot of pod restarts during rolling updates, as pods get replaced more than once. It is especially bad for pods with a long terminationGracePeriod or a long startup period; it can happen that a pod doesn't even become ready after one replacement before it gets replaced again.
How long have you been using this project?
~3-4 months
Additional information
I would volunteer to implement this feature, even with backward compatibility if required.