Skip to content
This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

feat: variable upgrade timeout based on num nodes #3752

Merged
merged 1 commit into from Aug 26, 2020

Conversation

jackfrancis
Copy link
Member

Reason for Change:

This PR replaces the static 3 hour upgrade timeout with a "per-node" timeout. We calculate a a sum 20 mins per node for every node-to-be upgraded, and allow the aks-engine binary that much total time before throwing a general aks-engine upgrade operation timeout error.

For small clusters (less than 9 nodes including control plane nodes), this has the practical effect of reducing the total timeout tolerance for aks-engine upgrade operations. For larger clusters (10 or more total nodes), this has the practical effect of increasing the timeout tolerance. For really large clusters, that timeout tolerance becomes quite large.

In summary, the observed ~5 mins per-node upgrade time means that clusters with > 30 nodes cannot be upgraded in one operation successfully: they will consistently run into the 3 hour aks-engine upgrade timeout. This change unblocks such large cluster aks-engine upgrade scenarios, with the tradeoff of waiting a really long time in the event of a "terminally stuck somewhere in Azure" encounter.

Issue Fixed:

Fixes #3746

Requirements:

Notes:

@devigned
Copy link
Member

👏 much better than what I suggested by having a configurable parameter. You are right. A user should not need to know the details. The software should have a good idea of how long it should take.

@codecov
Copy link

codecov bot commented Aug 26, 2020

Codecov Report

Merging #3752 into master will increase coverage by 0.01%.
The diff coverage is 93.75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3752      +/-   ##
==========================================
+ Coverage   73.17%   73.18%   +0.01%     
==========================================
  Files         147      147              
  Lines       25322    25333      +11     
==========================================
+ Hits        18529    18540      +11     
  Misses       5655     5655              
  Partials     1138     1138              
Impacted Files Coverage Δ
pkg/operations/kubernetesupgrade/upgrader.go 64.27% <93.75%> (+0.71%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 498c927...ff34157. Read the comment docs.

@mboersma
Copy link
Member

For small clusters (less than 9 nodes including control plane nodes), this has the practical effect of reducing the total timeout tolerance

I wonder if there are clusters in this range that could be impacted. Perhaps some VM type and/or region that is particularly slow to provision might currently be expecting more of the three-hour limit for aks-engine upgrade than 20min/node will give them?

But no, that sounds pathological anyway once I've typed it out, and this is a more generous timeout for clusters where it's likely to matter.

Copy link
Member

@mboersma mboersma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@acs-bot
Copy link

acs-bot commented Aug 26, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, mboersma

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [jackfrancis,mboersma]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jackfrancis jackfrancis merged commit 6b4dfcf into Azure:master Aug 26, 2020
@jackfrancis jackfrancis deleted the upgrade-dynamic-timeout branch August 26, 2020 20:05
penggu pushed a commit to penggu/aks-engine that referenced this pull request Oct 28, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Upgrade has a Timeout of 3 hrs
4 participants