GPU node-pool with nvidia.com/gpu taint: missing ExtendedResourceToleration admission controller? #1449

thomas-riccardi · 2020-02-14T15:04:10Z

What happened:
The ExtendedResourceToleration admission controller seems to be missing: Pods using nvidia.com/gpu extended resources don't have the nvidia.com/gpu toleration automatically added.

What you expected to happen:
I would expect that requesting an nvidia.com/gpu resource would add an nvidia.com/gpu toleration (via the ExtendedResourceToleration admission controller), so that I can properly and easily use multiple node pools.
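
For reference, here is a rough Python sketch of the behavior I would expect from the admission controller, based on the Kubernetes documentation linked below. It is an illustration only, not the actual implementation: the function names are made up, the extended-resource check is simplified, and it looks at both requests and limits because GPU pods often specify the resource under limits only.

```python
from copy import deepcopy


def is_extended_resource(name: str) -> bool:
    """Simplified check (assumption): an extended resource is a
    domain-prefixed name outside the kubernetes.io namespace,
    e.g. "nvidia.com/gpu". Native resources such as "cpu" have no prefix."""
    if "/" not in name:
        return False
    domain = name.split("/", 1)[0]
    return not (domain == "kubernetes.io" or domain.endswith(".kubernetes.io"))


def add_extended_resource_tolerations(pod: dict) -> dict:
    """Return a copy of the pod with one toleration per extended resource
    requested by any container or init container (hypothetical helper)."""
    pod = deepcopy(pod)
    spec = pod.setdefault("spec", {})
    containers = spec.get("containers", []) + spec.get("initContainers", [])

    # Collect every extended resource name used by the pod.
    names = set()
    for container in containers:
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            for name in resources.get(section, {}):
                if is_extended_resource(name):
                    names.add(name)

    # Add a toleration keyed on the resource name, unless already present.
    tolerations = spec.setdefault("tolerations", [])
    for name in sorted(names):
        toleration = {"key": name, "operator": "Exists", "effect": "NoSchedule"}
        if toleration not in tolerations:
            tolerations.append(toleration)
    return pod
```

With this behavior, a pod asking for nvidia.com/gpu would tolerate any NoSchedule taint whose key is nvidia.com/gpu, regardless of the taint value.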

How to reproduce it (as minimally and precisely as possible):

I created an AKS cluster with multiple node pools (using Terraform), one of them with NVIDIA GPUs.
To keep non-GPU pods off the GPU nodes, I added the usual taint to these nodes: nvidia.com/gpu=present:NoSchedule
I then installed the NVIDIA device plugin; nvidia.com/gpu resources are available on the node.
I created a Pod requesting a GPU.
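
Without the admission controller, the toleration has to be added to each Pod explicitly. The sketch below builds such a Pod manifest as a Python dict and checks it against the taint from the steps above, using a simplified version of the taint/toleration matching rules. The Pod name, the image, and both helper functions are placeholders for illustration.

```python
import json

# The taint applied to the GPU node pool in the steps above.
GPU_TAINT = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}


def gpu_pod_manifest(name: str, image: str) -> dict:
    """Build a Pod that requests one GPU and explicitly tolerates the taint.
    The name and image are supplied by the caller (placeholders here)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": name,
                "image": image,
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }],
            # Explicit toleration that the admission controller would
            # otherwise add automatically.
            "tolerations": [{
                "key": "nvidia.com/gpu",
                "operator": "Exists",
                "effect": "NoSchedule",
            }],
        },
    }


def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified taint/toleration matching (sketch, not the scheduler code)."""
    # An empty effect in the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An empty key (valid only with Exists) matches every key.
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


if __name__ == "__main__":
    # example.invalid is a placeholder registry, not a real image.
    manifest = gpu_pod_manifest("gpu-test", "example.invalid/cuda-app:latest")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl apply -f, since kubectl accepts JSON manifests as well as YAML.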

Anything else we need to know?:

https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#extendedresourcetoleration
https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#example-use-cases "Nodes with Special Hardware"
https://github.com/Azure/aks-engine/blob/master/docs/topics/clusterdefinitions.md does mention ExtendedResourceToleration

Environment:

Kubernetes version (use kubectl version): AKS v1.15.7
Size of cluster (how many worker nodes are in the cluster?): one default node pool with one node, and the GPU node pool with one node of type Standard_NC6_Promo
General description of workloads in the cluster: Pod requesting a nvidia.com/gpu


martinhartig · 2020-04-09T08:31:38Z

Any progress on this issue?

github-actions · 2020-07-21T01:15:38Z

This issue has been automatically marked as stale because it has not had activity in 90 days. It will be closed in 30 days if no further activity occurs. Thank you!

palma21 · 2020-07-22T09:22:39Z

Thanks for the feedback, it's a great ask. Added to the backlog

CC @xuto2 @bowang-666

SahilChaudhary25 · 2020-08-12T12:20:36Z

@palma21 Is there any update on this issue ?

rajivml · 2020-08-12T12:22:52Z

I have been scratching my head for the last 2 days over why this isn't working, whereas it used to work on GCP without adding a toleration block to each pod explicitly.

Is there any update on when we can expect this feature in AKS? It looks like the ticket was opened 6 months ago and it's still in the backlog :( It's quite an old feature; we have been using it on GCP for more than a year. We are now migrating some of our workloads from GCP to Azure, and we need this feature to make sure non-GPU workloads don't land on GPU nodes. We have quite a lot of them, so this would be a blocker for our migration.

palma21 · 2020-08-17T23:50:03Z

The issue is fairly old, but it was only added to the backlog 27 days ago 😄 (as of this comment)

This is something we are actively working on and plan to have for next month. Thanks for calling out the difference; it might be something other folks encounter too as they move. We'll prioritize accordingly.

ghost · 2020-10-31T18:02:19Z

Thank you for the feature request. I'm closing this issue as this feature has shipped and it hasn't had activity for 7 days.

triage-new-issues bot added the triage label Feb 14, 2020

github-actions bot added the stale Stale issue label Jul 21, 2020

triage-new-issues bot removed the triage label Jul 21, 2020

palma21 self-assigned this Jul 22, 2020

palma21 added feature feature-request Requested Features and removed stale Stale issue labels Jul 22, 2020

palma21 added this to Backlog in Azure Kubernetes Service Roadmap (Public) Jul 22, 2020

palma21 moved this from Backlog to Generally Available (Done) in Azure Kubernetes Service Roadmap (Public) Oct 24, 2020

github-actions bot mentioned this issue Oct 24, 2020

[AKS] Release 2020-10-19 dev-obs/actus#253

Open

palma21 added the resolution/shipped label Oct 24, 2020

palma21 moved this from Generally Available (Done) to Public Preview (Shipped & Improving) in Azure Kubernetes Service Roadmap (Public) Oct 24, 2020

ghost closed this as completed Oct 31, 2020

Azure locked as resolved and limited conversation to collaborators Dec 1, 2020

palma21 moved this from Public Preview (Shipped & Improving) to Generally Available (Done) in Azure Kubernetes Service Roadmap (Public) Dec 9, 2020

palma21 moved this from Generally Available (Done) to Archive (GA older than 6 months) in Azure Kubernetes Service Roadmap (Public) Dec 10, 2020

This issue was closed.
