fix: Ensure pods scheduled onto new nodes during upgrade respect the original node's labels/taints #1044
Conversation
…original node's labels/taints
Codecov Report
@@            Coverage Diff             @@
##           master    #1044      +/-   ##
==========================================
+ Coverage   74.74%   74.78%   +0.03%
==========================================
  Files         128      128
  Lines       18352    18373      +21
==========================================
+ Hits        13717    13740      +23
+ Misses       3843     3837       -6
- Partials      792      796       +4
/lgtm
/hold
… make the node schedulable
@jackfrancis what's the reason for holding this one? I think maybe you meant to hold #976
I noticed that E2E looked good and thought this one needed more review/testing. Just didn't want it to auto-merge too quickly.
This lgtm @chengliangli0918. Just a thought: in the future, should we taint an upgraded node at creation time so that no workloads are initially scheduled? Then we do all of the post-creation operations, and finally we untaint the node. Something like this should reduce the edge-case surface area of workloads being scheduled "during" an upgrade.
Agree. In the future, we should set Unschedulable=true when the node is created, and then set Unschedulable=false once creation and post-creation configuration are done. This will need code changes for several scenarios, like new cluster creation, cluster scale-up, and cluster upgrade, for both AKS and aks-engine. This might take more effort to achieve.
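The cordon-first flow discussed above can be sketched with a toy model. This is a minimal sketch with hypothetical types, not aks-engine's actual code; a real implementation would use client-go and the node's `spec.unschedulable` field:

```go
package main

import "fmt"

// Node is a hypothetical, minimal stand-in for a Kubernetes node object.
type Node struct {
	Name          string
	Unschedulable bool
}

// upgradeNode models the proposed flow: create the node cordoned, run
// post-creation configuration, then uncordon only on success.
func upgradeNode(name string, configure func(*Node) error) (*Node, error) {
	n := &Node{Name: name, Unschedulable: true} // cordoned at creation
	if err := configure(n); err != nil {
		return n, err // leave the node cordoned on failure
	}
	n.Unschedulable = false // safe to schedule workloads now
	return n, nil
}

func main() {
	n, err := upgradeNode("k8s-agentpool-1", func(n *Node) error {
		// post-creation work: copy labels/taints, wait for readiness, etc.
		return nil
	})
	fmt.Println(n.Unschedulable, err)
}
```

Because the node is only uncordoned at the very end, there is no window during which the scheduler can place a workload on a half-configured node.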
@jackfrancis are we ready to merge this?
Running upgrade validation test now |
force-pushed from b6f149a to 983e331
FYI rebased/force-pushed so we can do back-compat tests
…ili/taintstoleration Merge with master
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: chengliangli0918, jackfrancis, palma21
Reason for Change:
Ensure pods scheduled onto new nodes during upgrade respect the original node's labels/taints.
Edge case to fix: on a multi-node AKS/aks-engine cluster, manually set taints that prevent a pod from being scheduled on any agent node, so that the pod stays Pending. After upgrading, that pod ends up scheduled on one of the new nodes, even though the annotations/labels/taints are copied over from the old nodes to the new nodes: the pod is scheduled in the window before the copy completes.
Issue Fixed:
Ensure pods scheduled onto new nodes during upgrade respect the original node's labels/taints by evicting any pods that might have been scheduled onto the newly created agent nodes before copying over the node properties (annotations/labels/taints) from the old node to the new node.
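The evict-then-copy ordering can be sketched as follows. This is a simplified model with hypothetical minimal types (`Pod`, `Node`, `evictPodsOn`, `copyProperties` are illustrative names, not the PR's actual identifiers); the real code performs these steps against the Kubernetes API:

```go
package main

import "fmt"

// Hypothetical, minimal stand-ins for Kubernetes objects.
type Pod struct {
	Name     string
	NodeName string
}

type Node struct {
	Name        string
	Labels      map[string]string
	Annotations map[string]string
	Taints      []string
}

// evictPodsOn returns the pods not running on node, modelling the eviction
// of anything that was scheduled onto the new node before its properties
// were set.
func evictPodsOn(node string, pods []Pod) []Pod {
	var kept []Pod
	for _, p := range pods {
		if p.NodeName != node {
			kept = append(kept, p)
		}
	}
	return kept
}

// copyProperties copies annotations/labels/taints from the old node to the
// new node, as the upgrade does after eviction.
func copyProperties(oldNode, newNode *Node) {
	newNode.Labels = map[string]string{}
	for k, v := range oldNode.Labels {
		newNode.Labels[k] = v
	}
	newNode.Annotations = map[string]string{}
	for k, v := range oldNode.Annotations {
		newNode.Annotations[k] = v
	}
	newNode.Taints = append([]string(nil), oldNode.Taints...)
}

func main() {
	oldNode := &Node{Name: "old", Taints: []string{"reserved=true:NoSchedule"}}
	newNode := &Node{Name: "new"}
	pods := []Pod{{Name: "pending-pod", NodeName: "new"}}

	pods = evictPodsOn(newNode.Name, pods) // evict first...
	copyProperties(oldNode, newNode)       // ...then copy taints over
	fmt.Println(len(pods), newNode.Taints)
}
```

Evicting before copying means an evicted pod is rescheduled only after the new node already carries the old node's taints, so it correctly goes back to Pending instead of landing on the new node.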
Move the retry logic recently added to VMSS agent nodes by xizhamsft to a common place, so that both VMAS and VMSS agent pools share the same code path for copying node properties from old nodes to new nodes.
Requirements:
Notes: