This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

fix: make Windows scaling work #2643

Merged
merged 1 commit into from Jan 30, 2020

Conversation

jackfrancis
Member

@jackfrancis jackfrancis commented Jan 28, 2020

Reason for Change:

This PR makes Windows scaling actually work. The current scale implementation parses the VMSS name substring incorrectly when deriving the "pool index suffix", which is used at cluster creation time to compose the name of a Windows VMSS resource. This incorrect parsing has two bad side-effects:

  • it prevents the scale operation from correctly identifying the VMSS resource when the --node-pool argument value is the pool name of a Windows node pool
  • the above has the side-effect of creating a new Windows node pool with the desired --new-node-count, instead of failing the scale operation

Additionally, this PR fixes a bug that caused VHD-backed Linux node pools to hit the first side-effect above and to log a spurious error, ERRO[0004] resource name was missing from identifier, in such Linux node pool scaling scenarios.

Issue Fixed:

Fixes #2642
Fixes #2637

Requirements:

Notes:

@codecov

codecov bot commented Jan 28, 2020

Codecov Report

Merging #2643 into master will decrease coverage by <.01%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master    #2643      +/-   ##
==========================================
- Coverage   71.76%   71.75%   -0.01%     
==========================================
  Files         134      134              
  Lines       25001    25003       +2     
==========================================
  Hits        17941    17941              
- Misses       6010     6012       +2     
  Partials     1050     1050

@jackfrancis
Member Author

jackfrancis commented Jan 28, 2020

Tested VMSS scale up and scale down against a cluster with a single Windows node pool and against a cluster with multiple Windows node pools, then repeated the same tests for VMAS.

@@ -374,8 +375,8 @@ func (sc *scaleCmd) run(cmd *cobra.Command, args []string) error {
 			return errors.Wrap(err, "failed to get VMSS list in the resource group")
 		}
 		for _, vmss := range vmssListPage.Values() {
-			vmName := *vmss.Name
-			if !sc.vmInAgentPool(vmName, vmss.Tags) {
+			vmssName := *vmss.Name
Member Author

This is a code hygiene change: the variable holds the name of the VMSS, not the name of an individual VM instance, so naming it vmssName is clearer.

_, _, winPoolIndex, _, err = utils.WindowsVMNameParts(vmName)
log.Errorln(err)
if sc.agentPool.OSType == api.Windows {
winPoolIndexStr := vmssName[len(vmssName)-2:]
Member Author

This is bad. See:

https://github.com/Azure/aks-engine/blob/master/pkg/api/types.go#L916

Essentially, we bake a two-digit integer into the pool name, and we can only determine it conclusively at cluster creation time by enumerating the api model array. Nothing in the JSON schema prevents the api model from being changed over time in a way that re-orders the agentPoolProfiles array items. So the only way to determine the index that cluster creation appended to the VMSS name is to take the last two chars and convert them to an int.

And then we use that int starting in Line 466 below.

I think this solution is the most practical way forward without a significant refactor.
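
The suffix parsing described in this comment can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: `poolIndexFromVMSSName` is a hypothetical helper, and it assumes the Windows VMSS name always ends in exactly two decimal digits, as in the `4128k8s00` example from this thread.

```go
package main

import (
	"fmt"
	"strconv"
)

// poolIndexFromVMSSName is a hypothetical helper (not part of aks-engine).
// It reads the last two characters of a Windows VMSS name and converts
// them to the pool index that cluster creation appended to the name.
func poolIndexFromVMSSName(vmssName string) (int, error) {
	if len(vmssName) < 2 {
		return 0, fmt.Errorf("VMSS name %q is too short to carry a two-digit pool index", vmssName)
	}
	suffix := vmssName[len(vmssName)-2:]
	// Check for digits explicitly: strconv.Atoi alone would also accept
	// signed values such as "+1" or "-1".
	for _, c := range suffix {
		if c < '0' || c > '9' {
			return 0, fmt.Errorf("VMSS name %q does not end in a two-digit pool index", vmssName)
		}
	}
	return strconv.Atoi(suffix)
}

func main() {
	idx, err := poolIndexFromVMSSName("4128k8s00")
	fmt.Println(idx, err) // prints "0 <nil>"
}
```

Returning an error for a malformed suffix, rather than defaulting to index 0, keeps a name that does not follow the convention from being mistaken for the first pool.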

@jackfrancis
Member Author

@jadarsie Can you confirm that Azure Stack Windows VMSS names look like this:

https://github.com/Azure/aks-engine/blob/master/pkg/api/types.go#L916

E.g.: 4128k8s00

@jackfrancis jackfrancis added this to Under Review in backlog Jan 29, 2020
@jackfrancis jackfrancis added this to the v0.47.0 milestone Jan 29, 2020
Contributor

@CecileRobertMichon CecileRobertMichon left a comment

lgtm

@acs-bot

acs-bot commented Jan 30, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon, jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [CecileRobertMichon,jackfrancis]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jackfrancis jackfrancis merged commit ba43d1e into Azure:master Jan 30, 2020
backlog automation moved this from Under Review to Done Jan 30, 2020
@jackfrancis jackfrancis deleted the windows-scale branch January 30, 2020 22:45
3 participants