This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

docs: upgrade + cluster-autoscaler notes #381

Merged
merged 5 commits into from Jan 31, 2019

jackfrancis
Member

Reason for Change:

Document current issues and limitations of aks-engine upgrade and cluster-autoscaler

Issue Fixed:

Requirements:

Notes:

@jackfrancis
Member Author

@feiskyer Can you kindly read through my initial description of upgrade + VMSS + cluster-autoscaler and comment if it's an accurate summary of the current situation? Thanks!


### Cluster-autoscaler + VMSS

At present, the Azure cloudprovider cluster-autoscaler implementation for VMSS relies on the original ARM template deployment specification to determine the Azure IaaS configuration (VM, NIC, CustomScriptExtension, etc.) used to scale out new nodes.
Member

`aks-engine upgrade` would also update the model of the VMSS itself, right? If so, newly scaled-out nodes should also use this new model.

Member Author

In practice, this is not the case. The `kubectl get nodes` output I pasted below is from a VMSS upgrade followed by a cluster-autoscaler event. So apparently the answer is "no, `aks-engine upgrade` does not update the model of the VMSS itself".

Member

OK, then scale-up for VMSS would still use the original, old model. Do you have any ideas for making CA also work for upgraded clusters?

Member Author

O.K., so I have evidence that what I said is not true. Perhaps I was testing a VMAS agent pool. I will upgrade my VMSS test cluster a few more times and validate that cluster-autoscaler continues to respect the versions as they move forward.

Member

Thanks. I wonder whether that is only true if the model of the VMSS itself is also upgraded.


### Cluster-autoscaler + VMAS

A similar scenario exists for VMAS as well, but because the cluster-autoscaler spec includes a configurable ARM template deployment reference, you may manually maintain that reference over time to be current with the ARM template deployment that `aks-engine upgrade` creates during an upgrade operation.
Member

Add another paragraph on how to get the upgraded ARM templates and parameters?


The upgrade operation is long-running and, for large clusters, more susceptible to single operational failures, because upgrade enumerates through the nodes in the cluster one at a time. A transient Azure resource allocation error could thus interrupt the progression of the overall transaction. At present, the upgrade operation is implemented to "fail fast", so if a well-formed upgrade operation fails before completing, it can be manually retried by invoking it with the exact same command line arguments as were sent originally. The upgrade operation will enumerate through the cluster nodes, skipping any nodes that have already been upgraded to the desired Kubernetes version. Those nodes that match the *original* Kubernetes version will then, one at a time, be cordoned, drained, and upgraded to the desired version. Put another way, an upgrade command is designed to be idempotent across retry scenarios.
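
The retry behavior described in the paragraph above can be illustrated with a minimal sketch. This is not the aks-engine implementation: `Node`, `TransientError`, `upgrade_cluster`, and the `upgrade_node` callback (which stands in for cordon, drain, and upgrade of a single node) are hypothetical names, and the sketch simplifies version matching to "skip any node already at the target version".

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    # Hypothetical, simplified view of a cluster node.
    name: str
    version: str


class TransientError(Exception):
    """Stand-in for a transient Azure resource allocation error."""


def upgrade_cluster(
    nodes: List[Node],
    target: str,
    upgrade_node: Callable[[Node, str], None],
) -> List[str]:
    """Upgrade nodes one at a time, failing fast on the first error.

    Returns the names of the nodes upgraded during this invocation.
    Re-running with the same arguments skips nodes already at `target`.
    """
    upgraded = []
    for node in nodes:
        if node.version == target:
            # Already upgraded by an earlier attempt: skip.
            continue
        # Assumed to cordon, drain, and upgrade the node. Any exception
        # propagates to the caller immediately (fail fast).
        upgrade_node(node, target)
        node.version = target
        upgraded.append(node.name)
    return upgraded
```

Under these assumptions, if the first invocation raises partway through the node list, calling `upgrade_cluster` again with the same arguments only touches the nodes that are not yet at the target version, and a further call after success upgrades nothing.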

### Cluster-autoscaler + VMSS
Contributor

Let's add a link/reference to this doc in the cluster-autoscaler doc examples/addons/cluster-autoscaler/README.md

Member

yep, we can add one after this is merged.

Contributor

why not add it in the same PR?

Member

Sorry, I meant adding a link in the cluster-autoscaler repo back to here.

And yep, in examples, it could be added in the same PR.

@jackfrancis jackfrancis added this to In progress in backlog Jan 28, 2019
@acs-bot acs-bot added size/S and removed size/M labels Jan 29, 2019
@jackfrancis jackfrancis changed the title WIP docs: upgrade + cluster-autoscaler notes docs: upgrade + cluster-autoscaler notes Jan 30, 2019
@jackfrancis
Member Author

@CecileRobertMichon this is ready for a re-review

@codecov

codecov bot commented Jan 30, 2019

Codecov Report

Merging #381 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #381   +/-   ##
=======================================
  Coverage   53.42%   53.42%           
=======================================
  Files          95       95           
  Lines       14361    14361           
=======================================
  Hits         7673     7673           
  Misses       6025     6025           
  Partials      663      663

Contributor

@CecileRobertMichon CecileRobertMichon left a comment

/lgtm

@acs-bot

acs-bot commented Jan 31, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon, jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [CecileRobertMichon,jackfrancis]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jackfrancis jackfrancis merged commit 1a58ba3 into Azure:master Jan 31, 2019
backlog automation moved this from In progress to Done Jan 31, 2019
@jackfrancis jackfrancis deleted the docs-upgrade-cluster-autoscaler branch January 31, 2019 18:54
juhacket pushed a commit to juhacket/aks-engine that referenced this pull request Mar 14, 2019