This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

fix: Ensure pods scheduled onto new nodes during upgrade respect the original node's labels/taints #1044

Merged
merged 13 commits into from May 13, 2019

Conversation

chengliangli0918
Contributor

@chengliangli0918 chengliangli0918 commented Apr 12, 2019

Reason for Change:
Ensure pods scheduled onto new nodes during upgrade respect the original node's labels/taints.
Edge case to fix: on a multi-node AKS/aks-engine cluster, taints are manually set so that a pod cannot be scheduled on any agent node, leaving it permanently Pending. After an upgrade, that pod is scheduled onto one of the new nodes, even though the annotations/labels/taints are copied over from the old nodes to the corresponding new nodes.

Issue Fixed:

  1. Ensure pods scheduled onto new nodes during upgrade respect the original node's labels/taints, by evicting any pods that may already have been scheduled onto the newly created agent nodes before copying the node properties (annotations/labels/taints) from the old node to the new node.

  2. Move the retry logic recently added for VMSS agent nodes by xizhamsft to a common location, so that the VMAS and VMSS agent pools share the same code path for copying node properties from old nodes to new nodes.
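The evict-then-copy ordering in item 1 can be sketched with plain structs; the names here (`Taint`, `copyNodeProperties`) are illustrative, not the PR's actual types, which operate on `*v1.Node` objects via client-go. The point is that the copy runs only after any prematurely scheduled pods are evicted, so nothing can land on the new node before the original taints are in place:

```go
package main

import "fmt"

// Taint mirrors the shape of a Kubernetes node taint (key/value/effect).
type Taint struct {
	Key, Value, Effect string
}

// Node holds only the properties the upgrade copies over.
type Node struct {
	Annotations map[string]string
	Labels      map[string]string
	Taints      []Taint
}

// copyNodeProperties copies annotations, labels, and taints from the old
// node onto the new one. In the PR's flow this runs only AFTER pods already
// scheduled onto the new node have been evicted.
func copyNodeProperties(src, dst *Node) {
	for k, v := range src.Annotations {
		dst.Annotations[k] = v
	}
	for k, v := range src.Labels {
		dst.Labels[k] = v
	}
	dst.Taints = append(dst.Taints, src.Taints...)
}

func main() {
	old := &Node{
		Annotations: map[string]string{"example/owner": "teamA"},
		Labels:      map[string]string{"agentpool": "pool1"},
		Taints:      []Taint{{Key: "dedicated", Value: "gpu", Effect: "NoSchedule"}},
	}
	fresh := &Node{Annotations: map[string]string{}, Labels: map[string]string{}}
	copyNodeProperties(old, fresh)
	fmt.Println(fresh.Labels["agentpool"], len(fresh.Taints)) // pool1 1
}
```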

Requirements:

Notes:

@codecov

codecov bot commented Apr 13, 2019

Codecov Report

Merging #1044 into master will increase coverage by 0.03%.
The diff coverage is 63.63%.

@@            Coverage Diff             @@
##           master    #1044      +/-   ##
==========================================
+ Coverage   74.74%   74.78%   +0.03%     
==========================================
  Files         128      128              
  Lines       18352    18373      +21     
==========================================
+ Hits        13717    13740      +23     
+ Misses       3843     3837       -6     
- Partials      792      796       +4

@CecileRobertMichon CecileRobertMichon changed the title Fix: Ensure pods scheduled onto new nodes during upgrade respect the original node's labels/taints fix: Ensure pods scheduled onto new nodes during upgrade respect the original node's labels/taints Apr 15, 2019
@palma21
Member

palma21 commented Apr 15, 2019

/lgtm

@jackfrancis
Member

/hold

@acs-bot acs-bot removed the lgtm label Apr 16, 2019
@jackfrancis jackfrancis added this to In progress in backlog Apr 16, 2019
@CecileRobertMichon
Contributor

@jackfrancis what's the reason for holding this one? I think maybe you meant to hold #976

@jackfrancis
Member

I noticed that E2E looked good and thought this one needed more review/testing. Just didn't want it to auto-merge too quickly.

backlog automation moved this from In progress to Under Review Apr 22, 2019
@jackfrancis
Member

This lgtm

@chengliangli0918 just a thought: in the future, should we taint an upgraded node at creation time so that no workloads are initially scheduled? And then we do all of the post-creation operations, and then finally we untaint the node? Basically something like this should reduce the edge case surface area of workloads being scheduled "during" an upgrade.
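The taint-at-creation idea suggested above could look roughly like this; everything here (`upgradeTaintKey`, `provisionNode`, `postCreationSetup`) is a hypothetical sketch, not an aks-engine API. A transient NoSchedule taint is applied before any setup runs and removed only once setup finishes:

```go
package main

import "fmt"

// Taint is a minimal stand-in for a Kubernetes node taint.
type Taint struct{ Key, Effect string }

// Node is a minimal stand-in for a cluster node.
type Node struct {
	Name   string
	Taints []Taint
}

// upgradeTaintKey is an illustrative key for the transient upgrade taint.
const upgradeTaintKey = "aksengine.azure.com/upgrading"

func addUpgradeTaint(n *Node) {
	n.Taints = append(n.Taints, Taint{Key: upgradeTaintKey, Effect: "NoSchedule"})
}

func removeUpgradeTaint(n *Node) {
	kept := n.Taints[:0]
	for _, t := range n.Taints {
		if t.Key != upgradeTaintKey {
			kept = append(kept, t)
		}
	}
	n.Taints = kept
}

// provisionNode taints the node before setup and untaints it afterwards, so
// no workload can be scheduled during the post-creation window.
func provisionNode(name string, postCreationSetup func(*Node)) *Node {
	n := &Node{Name: name}
	addUpgradeTaint(n)    // nothing can schedule here yet
	postCreationSetup(n)  // e.g. copy labels/taints from the old node
	removeUpgradeTaint(n) // open the node for workloads
	return n
}

func main() {
	n := provisionNode("aks-agent-0", func(n *Node) {
		n.Taints = append(n.Taints, Taint{Key: "dedicated", Effect: "NoSchedule"})
	})
	fmt.Println(len(n.Taints)) // carried-over taint remains; upgrade taint is gone
}
```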

@chengliangli0918
Contributor Author

Agreed. In the future, we should set Unschedulable=true when the node is created, and set Unschedulable=false once creation and post-creation configuration are done. This would require code changes across several scenarios (new cluster creation, cluster scale-up, cluster upgrade) for both AKS and aks-engine, so it would take more effort to achieve.


@palma21
Member

palma21 commented May 10, 2019

@jackfrancis are we ready to merge this?

@jackfrancis
Member

Running upgrade validation test now

@jackfrancis
Member

FYI rebased/force-pushed so we can do back-compat tests

@jackfrancis
Member

/lgtm

@acs-bot acs-bot added the lgtm label May 13, 2019
@jackfrancis jackfrancis merged commit fa37a95 into Azure:master May 13, 2019
backlog automation moved this from Under Review to Done May 13, 2019
@acs-bot

acs-bot commented May 13, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chengliangli0918, jackfrancis, palma21

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
