feat: variable upgrade timeout based on num nodes #3752
Conversation
👏 much better than what I suggested by having a configurable parameter. You are right. A user should not need to know the details. The software should have a good idea of how long it should take. |
Codecov Report
@@ Coverage Diff @@
## master #3752 +/- ##
==========================================
+ Coverage 73.17% 73.18% +0.01%
==========================================
Files 147 147
Lines 25322 25333 +11
==========================================
+ Hits 18529 18540 +11
Misses 5655 5655
Partials 1138 1138
Continue to review full report at Codecov.
|
I wonder if there are clusters in this range that could be impacted. Perhaps some VM type and/or region that is particularly slow to provision might currently be expecting more of the three-hour limit for But no, that sounds pathological anyway once I've typed it out, and this is a more generous timeout for clusters where it's likely to matter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jackfrancis, mboersma The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Reason for Change:
This PR replaces the static 3 hour upgrade timeout with a "per-node" timeout. We calculate a a sum 20 mins per node for every node-to-be upgraded, and allow the aks-engine binary that much total time before throwing a general
aks-engine upgrade
operation timeout error.For small clusters (less than 9 nodes including control plane nodes), this has the practical effect of reducing the total timeout tolerance for
aks-engine upgrade
operations. For larger clusters (10 or more total nodes), this has the practical effect of increasing the timeout tolerance. For really large clusters, that timeout tolerance becomes quite large.In summary, the observed ~5 mins per-node upgrade time means that clusters with > 30 nodes cannot be upgraded in one operation successfully: they will consistently run into the 3 hour
aks-engine upgrade
timeout. This change unblocks such large clusteraks-engine upgrade
scenarios, with the tradeoff of waiting a really long time in the event of a "terminally stuck somewhere in Azure" encounter.Issue Fixed:
Fixes #3746
Requirements:
Notes: