How to add tolerations to kube-proxy / kube-svc-redirect #363

Closed
derekperkins opened this issue May 11, 2018 · 25 comments

Comments

@derekperkins
Contributor

I've added some node taints using NoSchedule, so it hasn't impacted kube-proxy or kube-svc-redirect, which are the only kube-system DaemonSets. If those get reset or upgraded, they currently only have a toleration to run on master nodes, so I don't think that they will get recreated on my tainted nodes. Is there some way to add a toleration, or is there something I'm not seeing that will force them to be deployed on every node?
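For reference, the usual way a DaemonSet is made to run on every node regardless of taints is a blanket toleration in its pod template. A minimal sketch of what that would look like (the exact tolerations AKS ships for these DaemonSets may differ):

```yaml
# Pod-template fragment for a DaemonSet that should tolerate any taint.
spec:
  template:
    spec:
      tolerations:
        # An empty key with operator "Exists" matches every taint,
        # whatever its key, value, or effect.
        - operator: Exists
```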

@derekperkins
Contributor Author

Any insights here? I ran into this after upgrading my cluster and neither of these were recreated properly and broke my applications.

@derekperkins
Contributor Author

It looks like it has already been implemented in both core k8s AND acs-engine, so I'm surprised that it isn't already in AKS. For reference:

@sancyx

sancyx commented Oct 18, 2018

I'm running into the same issue. I tried manually editing the tolerations on the daemonsets, which holds for a few minutes, but then they are replaced with the originals; this seems to be the same issue as Azure/acs-engine#2509.
@derekperkins Did you find any workarounds?

@derekperkins
Contributor Author

@sancyx No, and I just got bit by this when Azure redeployed these services to all my nodes except ones with taints. I honestly don't understand how Azure calls AKS a GA product.

@janbrunrasmussen

We discussed using node affinity with labels and pod presets, but we don't really like having to add scheduling metadata to all the deployments that are unrelated to the taint. Our MS contact suggested the ACI connector for AKS, but our memory requirements exceed ACI's limits. So we are also not sure how to continue, other than creating a new cluster for the workloads we intended to put on the tainted nodes.
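For anyone weighing the node-affinity-with-labels route mentioned above, it looks roughly like this; the deployment name, node label, and image are made up for illustration:

```yaml
# Hypothetical deployment pinned to dedicated nodes via a node label
# instead of a taint. Note the drawback described above: a label only
# attracts this workload, it does not repel everything else, so every
# unrelated deployment would need its own scheduling metadata too.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: big-memory-app               # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: big-memory-app
  template:
    metadata:
      labels:
        app: big-memory-app
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload    # hypothetical node label
                    operator: In
                    values:
                      - big-memory
      containers:
        - name: app
          image: example.azurecr.io/app:latest   # placeholder image
```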

@erewok

erewok commented Jan 29, 2019

I think I'm seeing the same issue. I added a taint to a node and now the only kube-system deployment/daemonset on the node is kube-svc-redirect. Networking then falls apart for all deployments on the tainted node.

The weirdest part is that I tested this exact same setup for over a month in my dev cluster and didn't see any issues. I can't figure out why it worked there; I wish it hadn't, so I wouldn't have deployed with this solution in the first place.

Edit:

I removed the taint from my node and kube-proxy was immediately scheduled. After re-adding the taint (NoSchedule), it wasn't evicted, of course, but I assume it won't be rescheduled if it ever needs to be scheduled again.
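That matches the documented taint effects: NoSchedule only blocks new placements, while NoExecute also evicts running pods that lack a matching toleration. On the node object, a taint applied with `kubectl taint nodes <node-name> dedicated=batch:NoSchedule` (key and value illustrative) ends up looking like:

```yaml
# Node spec fragment; key/value are illustrative.
spec:
  taints:
    - key: dedicated
      value: batch
      effect: NoSchedule   # blocks new pods; NoExecute would also evict
```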

@brendandburns
Member

This is clearly a bug and should be fixed.

For now, there is the trick of removing and then re-adding the taint; that workaround will stick until you upgrade the cluster, at which point you will have to repeat it. Sorry, that's hacky.

An alternative would be to create a second DaemonSet that places kube-proxy onto the tainted nodes (also a hack, but it would work as a temporary patch).
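A sketch of that second-DaemonSet workaround, assuming you copy the pod spec from the managed kube-proxy DaemonSet and add a blanket toleration (the name and labels here are illustrative, and the real manifest has more fields):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-proxy-tainted       # distinct name, so the managed addon
  namespace: kube-system         # reconciler leaves it alone
spec:
  selector:
    matchLabels:
      app: kube-proxy-tainted
  template:
    metadata:
      labels:
        app: kube-proxy-tainted
    spec:
      tolerations:
        - operator: Exists       # run on every node, whatever the taint
      containers:
        - name: kube-proxy
          # Placeholder: copy the real containers/volumes sections from
          #   kubectl get ds kube-proxy -n kube-system -o yaml
          image: kube-proxy
```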

Ultimately the right fix is to do this in AKS; we'll get to it.

@larsduelfer

We are currently facing a similar issue. Are there any plans to be able to add tolerations to aks components?

@martin2176

martin2176 commented Feb 14, 2019

Has this issue been resolved? There are quite a few critical pods; below is the output from a freshly built cluster.
I added a NoExecute taint to one of the agent nodes, and all of the pods listed below were evicted from the tainted node. It looks like the functionality is still not implemented.

kubectl get pods --all-namespaces
NAMESPACE     NAME                                    READY   STATUS    RESTARTS   AGE
kube-system   azure-cni-networkmonitor-7b7pj          1/1     Running   0          52m
kube-system   azure-cni-networkmonitor-9jthw          1/1     Running   0          57m
kube-system   azure-ip-masq-agent-rv6tm               1/1     Running   0          57m
kube-system   azure-ip-masq-agent-zfk5f               1/1     Running   0          52m
kube-system   coredns-754f947b4-fvw4h                 1/1     Running   0          52m
kube-system   coredns-754f947b4-st6mv                 1/1     Running   0          57m
kube-system   coredns-autoscaler-589dd89ffd-pf8sk     1/1     Running   0          52m
kube-system   heapster-5d6f9b846c-nnwcm               2/2     Running   0          57m
kube-system   kube-proxy-26pff                        1/1     Running   0          52m
kube-system   kube-proxy-bnpf8                        1/1     Running   0          57m
kube-system   kube-svc-redirect-b9g5r                 2/2     Running   0          52m
kube-system   kube-svc-redirect-t78r5                 2/2     Running   0          57m
kube-system   kubernetes-dashboard-67bdc65878-zdnk2   1/1     Running   0          52m
kube-system   metrics-server-5cbc77f79f-m7vtn         1/1     Running   0          52m
kube-system   tunnelfront-647865c5c4-wvmpf            1/1     Running   0          52m

@jnoller
Contributor

jnoller commented Feb 14, 2019

This issue is unresolved - taints and tolerations are not supported at this time.

@martin2176

I think the Azure AKS documentation should make it very clear that taints and tolerations are not supported.
For example, the AKS best practices guide below says to go ahead and use taints and tolerations:
https://docs.microsoft.com/en-us/azure/aks/operator-best-practices-advanced-scheduler

@martin2176

With the fix for issue #468, node labels are now preserved on upgrade.
To be able to use taints and tolerations, the issue discussed in this thread has to be resolved as well. Is a fix coming soon?

@janbrunrasmussen

https://github.com/Azure/AKS/releases/tag/2019-05-06

An issue where AKS managed pods (within kube-system) did not have the correct
tolerations, preventing them from being scheduled when customers use
taints/tolerations, has been fixed

Does this mean a fix has been made for this?

@jnoller
Contributor

jnoller commented May 13, 2019

@janbrunrasmussen No, we're still ironing out taint/toleration support.

@janbrunrasmussen

So what is this quote in the release notes referring to, @jnoller? It does sound suspiciously like this issue :)

@jnoller
Contributor

jnoller commented May 16, 2019

@janbrunrasmussen Which quote? We're still ironing out issues with taints/tolerations.

@janbrunrasmussen

Here: https://github.com/Azure/AKS/releases/tag/2019-05-06

An issue where AKS managed pods (within kube-system) did not have the correct
tolerations, preventing them from being scheduled when customers use
taints/tolerations, has been fixed

@jnoller
Contributor

jnoller commented May 16, 2019

@janbrunrasmussen Yep, that was one of the fixes, we're tracking some additional ones to repair ASAP.

@jnoller
Contributor

jnoller commented May 17, 2019

Linking Issue #971

@jluk
Contributor

jluk commented Jun 6, 2019

@derekperkins and @janbrunrasmussen could you both give this scenario another try? We've made some changes and fixes that should resolve this issue; it would be good to get your confirmation.

@janbrunrasmussen

@jluk Could you be more specific - which version? which region?

@jluk
Contributor

jluk commented Jun 6, 2019

@janbrunrasmussen any region should work, but it may require a new cluster to guarantee the latest changes. Could you please try spinning up a new cluster and rerunning your scenario?

@jluk
Contributor

jluk commented Jun 6, 2019

Any supported AKS k8s version should be fine

@janbrunrasmussen

I ran through a couple of scenarios, and for me it seems to work: I am able to add a taint, force a restart of the proxy pod on the tainted node(s) in various ways, and see the proxy pod come back up.
Thanks for resolving the issue (even if it took over a year to fix). I am not sure if others in this issue are facing slightly different use cases, though.
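For anyone else re-verifying, one quick check is to dump the managed DaemonSet with `kubectl get ds kube-proxy -n kube-system -o yaml` and confirm its pod template now carries something like a blanket toleration (the exact list AKS ships may differ):

```yaml
# Expected shape of the toleration in the pod template.
tolerations:
  - operator: Exists   # matches every taint key, value, and effect
```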

@jluk
Contributor

jluk commented Jun 7, 2019

@janbrunrasmussen thanks for the confirmation and the patience. We've been tracking through a long backlog of things to fix and will be moving on to many more to improve the experience. Closing this issue out based on the feedback; if folks still hit issues, please feel free to reopen.

@jluk jluk closed this as completed Jun 7, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Aug 8, 2020