feat: variable upgrade timeout based on num nodes #3752

jackfrancis · 2020-08-26T16:03:39Z

Reason for Change:

This PR replaces the static 3 hour upgrade timeout with a "per-node" timeout. We allot 20 mins to every node to be upgraded, sum those allotments, and allow the aks-engine binary that much total time before throwing a general aks-engine upgrade operation timeout error.

For small clusters (fewer than 9 nodes including control plane nodes), this has the practical effect of reducing the total timeout tolerance for aks-engine upgrade operations. For larger clusters (10 or more total nodes), this has the practical effect of increasing the timeout tolerance. Because the timeout grows linearly with node count, it becomes quite large for really large clusters.
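The break-even point follows from simple arithmetic, using only the 20 mins per node and the previous 3 hour (180 min) limit stated in this description, with n as the total node count:

```latex
\[
20\ \text{min} \times n = 180\ \text{min}
\quad\Longrightarrow\quad
n = \frac{180}{20} = 9
\]
```

A cluster with exactly 9 nodes therefore keeps the same 3 hour budget, which is why the reduction applies below 9 nodes and the increase starts at 10.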

In summary, the observed ~5 mins per-node upgrade time means that clusters with > 30 nodes cannot be upgraded in one operation successfully: they will consistently run into the 3 hour aks-engine upgrade timeout. This change unblocks such large cluster aks-engine upgrade scenarios, with the tradeoff of waiting a really long time in the event of a "terminally stuck somewhere in Azure" encounter.

Issue Fixed:

Fixes #3746

Requirements:

uses conventional commit messages
includes documentation
adds unit tests
tested upgrade from previous version

Notes:

devigned · 2020-08-26T16:16:23Z

👏 much better than what I suggested by having a configurable parameter. You are right. A user should not need to know the details. The software should have a good idea of how long it should take.

codecov · 2020-08-26T16:17:46Z

Codecov Report

Merging #3752 into master will increase coverage by 0.01%.
The diff coverage is 93.75%.

@@            Coverage Diff             @@
##           master    #3752      +/-   ##
==========================================
+ Coverage   73.17%   73.18%   +0.01%     
==========================================
  Files         147      147              
  Lines       25322    25333      +11     
==========================================
+ Hits        18529    18540      +11     
  Misses       5655     5655              
  Partials     1138     1138

| Impacted Files | Coverage Δ |
|---|---|
| pkg/operations/kubernetesupgrade/upgrader.go | `64.27% <93.75%> (+0.71%)` ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 498c927...ff34157.

mboersma · 2020-08-26T18:46:50Z

> For small clusters (less than 9 nodes including control plane nodes), this has the practical effect of reducing the total timeout tolerance

I wonder if there are clusters in this range that could be impacted. Perhaps some VM type and/or region that is particularly slow to provision might currently be expecting more of the three-hour limit for aks-engine upgrade than 20min/node will give them?

But no, that sounds pathological anyway once I've typed it out, and this is a more generous timeout for clusters where it's likely to matter.

mboersma

/lgtm

acs-bot · 2020-08-26T18:50:45Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, mboersma

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jackfrancis,mboersma]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

feat: variable upgrade timeout based on num nodes

ff34157

acs-bot added size/S approved labels Aug 26, 2020

mboersma approved these changes Aug 26, 2020

acs-bot assigned mboersma Aug 26, 2020

acs-bot added the lgtm label Aug 26, 2020

jackfrancis merged commit 6b4dfcf into Azure:master Aug 26, 2020

jackfrancis deleted the upgrade-dynamic-timeout branch August 26, 2020 20:05

jackfrancis added a commit that referenced this pull request Aug 26, 2020

feat: variable upgrade timeout based on num nodes (#3752)

9a2d071

fmotrifork mentioned this pull request Aug 28, 2020

aks-engine 0.55.0 fishworks/fish-food#911

Merged

fmotrifork mentioned this pull request Sep 21, 2020

aks-engine 0.56.0 fishworks/fish-food#964

Merged

penggu pushed a commit to penggu/aks-engine that referenced this pull request Oct 28, 2020

feat: variable upgrade timeout based on num nodes (Azure#3752)

5a0e23e
