docs: upgrade + cluster-autoscaler notes #381
## Known Limitations
### Manual reconciliation | ||

The upgrade operation is long running and, for large clusters, more susceptible to individual operational failures. By design, upgrade enumerates through each node in the cluster, one at a time; a transient Azure resource allocation error can thus interrupt the successful progression of the overall transaction. At present, the upgrade operation is implemented to "fail fast", so if a well-formed upgrade operation fails before completing, it can be manually retried by invoking the exact same command-line arguments as were sent originally. The upgrade operation will enumerate through the cluster nodes, skipping any nodes that have already been upgraded to the desired Kubernetes version. Nodes still at the *original* Kubernetes version will then, one at a time, be cordoned, drained, and upgraded to the desired version. Put another way, an upgrade command is designed to be idempotent across retry scenarios.
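A retry, then, is simply the original command again. As a minimal sketch (the subscription ID, resource group, location, API model path, and target version below are illustrative placeholders, not values from this document):

```shell
# Re-run the exact same upgrade command that failed; nodes already at the
# target version are skipped, so the retry resumes where the failure occurred.
# All values below are illustrative placeholders.
aks-engine upgrade \
  --subscription-id xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
  --resource-group my-cluster-rg \
  --location westus2 \
  --api-model _output/my-cluster/apimodel.json \
  --upgrade-version 1.11.5
```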

### Cluster-autoscaler + VMSS

At present, the Azure cloudprovider cluster-autoscaler implementation for VMSS relies upon the original ARM template deployment specification to inform the Azure IaaS configuration (VM, NIC, CustomScriptExtension, etc.) used to scale out new nodes.
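To see which configuration new scale-out instances would inherit, you can inspect the VMSS model directly. A sketch using the Azure CLI (the VMSS and resource group names are placeholders):

```shell
# Show the model that new VMSS instances will be built from; after an
# `aks-engine upgrade`, compare this against what you expect for the
# upgraded Kubernetes version. Names below are placeholders.
az vmss show \
  --resource-group my-cluster-rg \
  --name k8s-agentpool-27988949-vmss \
  --query "virtualMachineProfile.storageProfile.imageReference"
```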

Because `aks-engine upgrade` also employs ARM template deployments to evolve the cluster state going forward, as soon as you upgrade your cluster the cluster-autoscaler will no longer scale out nodes at the latest version. You will see cluster-autoscaler scale-out scenarios that look like this one:

```
$ kubectl get nodes
NAME                                STATUS    ROLES     AGE       VERSION
k8s-agentpool-27988949-vmss00000a   Ready     agent     1h        v1.11.5
k8s-agentpool-27988949-vmss00000m   Ready     agent     1m        v1.10.12
k8s-agentpool-27988949-vmss00000n   Ready     agent     1m        v1.10.12
k8s-agentpool-27988949-vmss00000o   Ready     agent     1m        v1.10.12
k8s-agentpool-27988949-vmss00000p   Ready     agent     1m        v1.10.12
k8s-agentpool-27988949-vmss00000q   Ready     agent     1m        v1.10.12
k8s-agentpool-27988949-vmss00000r   Ready     agent     1m        v1.10.12
k8s-master-27988949-0               Ready     master    2h        v1.11.5
```

For this reason, we do not recommend incorporating `aks-engine upgrade` into your operational workflow if you also rely upon cluster-autoscaler + VMSS.
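If you do run both, a quick way to spot the symptom above is to list kubelet versions per node; more than one distinct version after a scale-out event suggests new nodes were built from the stale, pre-upgrade model. A sketch:

```shell
# Print each node's name and kubelet version, one per line.
# Mixed versions after a scale-out indicate nodes from the old template.
kubectl get nodes \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'
```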

### Cluster-autoscaler + VMAS

A similar scenario exists for VMAS, but because the cluster-autoscaler spec includes a configurable ARM template deployment reference, you may manually maintain that reference over time so that it stays current with the ARM template deployment that `aks-engine upgrade` creates during an upgrade operation.
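If you choose to maintain that reference manually, one hedged sketch for locating the deployment that `aks-engine upgrade` produced (the resource group name is a placeholder, and the exact field to update depends on your cluster-autoscaler spec, so inspect your own manifest first):

```shell
# List ARM deployments in the cluster's resource group, oldest to newest;
# after an upgrade, the last entry is the deployment `aks-engine upgrade`
# created. Point the cluster-autoscaler spec's deployment reference at it.
# "my-cluster-rg" is a placeholder. (Older Azure CLI versions use
# `az group deployment list` instead.)
az deployment group list \
  --resource-group my-cluster-rg \
  --query "sort_by([], &properties.timestamp)[].name" \
  --output tsv
```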

// TODO describe the above

By its nature, the upgrade operation is long running and can fail for various reasons, such as a temporary lack of resources. In that case, rerun the command; the `upgrade` command is idempotent and will pick up execution from the point where it failed.