GPU node-pool with nvidia.com/gpu taint: missing ExtendedResourceToleration admission controller? #1449

Closed
thomas-riccardi opened this issue Feb 14, 2020 · 7 comments

Comments

@thomas-riccardi

What happened:
The ExtendedResourceToleration admission controller seems to be missing: Pods using nvidia.com/gpu extended resources don't have the nvidia.com/gpu toleration automatically added.

What you expected to happen:
I would expect that requesting an nvidia.com/gpu resource would automatically add an nvidia.com/gpu toleration (via the ExtendedResourceToleration admission controller), so I can properly and easily use multiple node pools.
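
For readers unfamiliar with the controller, here is a rough Python sketch of the mutation it is expected to perform: for every extended resource a Pod requests, a toleration with that resource name as key is added. This is a simplified model and not the Kubernetes implementation. The function names, the check for what counts as an extended resource, and the choice to read `limits` as well as `requests` are assumptions made for illustration.

```python
import copy


def is_extended_resource(name: str) -> bool:
    # Simplified assumption: extended resources are domain-prefixed names
    # (e.g. "nvidia.com/gpu") outside the kubernetes.io domain.
    return "/" in name and "kubernetes.io/" not in name and not name.startswith("requests.")


def add_extended_resource_tolerations(pod: dict) -> dict:
    """Return a copy of the pod with one toleration per requested extended resource."""
    pod = copy.deepcopy(pod)
    spec = pod.setdefault("spec", {})

    # Collect extended resource names from all containers.
    # Assumption: both requests and limits are inspected here for simplicity.
    names = set()
    for container in spec.get("initContainers", []) + spec.get("containers", []):
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            for name in resources.get(section, {}):
                if is_extended_resource(name):
                    names.add(name)

    tolerations = spec.get("tolerations", [])
    for name in sorted(names):
        toleration = {"key": name, "operator": "Exists", "effect": "NoSchedule"}
        if toleration not in tolerations:
            tolerations.append(toleration)
    if tolerations:
        spec["tolerations"] = tolerations
    return pod


if __name__ == "__main__":
    gpu_pod = {
        "spec": {
            "containers": [
                {"name": "main", "resources": {"limits": {"nvidia.com/gpu": 1, "cpu": "1"}}}
            ]
        }
    }
    print(add_extended_resource_tolerations(gpu_pod)["spec"]["tolerations"])
```

With such a toleration in place, a Pod requesting nvidia.com/gpu could be scheduled onto nodes carrying an nvidia.com/gpu taint with the NoSchedule effect, while Pods that do not request the resource would still be kept off them.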

How to reproduce it (as minimally and precisely as possible):

  • I created an AKS cluster with multiple node-pools (using terraform), one with nvidia GPUs.
  • To avoid non-gpu pods on gpu nodes I added the usual taint on these nodes: nvidia.com/gpu=present:NoSchedule
  • I then installed the nvidia device plugin: nvidia.com/gpu Resources are available on the node
  • I created a Pod requesting a gpu.
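
The last step can be sketched as follows. This is a hypothetical reconstruction, not the manifest actually used: the Pod name, container name, and image are placeholders. The helper builds a Pod that requests one nvidia.com/gpu and, optionally, carries an explicit toleration matching the nvidia.com/gpu=present:NoSchedule taint from the second step, which is the manual workaround while the admission controller is unavailable.

```python
import json


def build_gpu_pod(name="gpu-test", image="example.invalid/cuda-sample:latest",
                  explicit_toleration=False):
    """Build a Pod manifest requesting one GPU.

    The name and image defaults are placeholders, not values from the issue.
    """
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # Requesting the extended resource exposed by the device plugin.
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }
            ],
        },
    }
    if explicit_toleration:
        # Manual workaround: tolerate the taint nvidia.com/gpu=present:NoSchedule.
        pod["spec"]["tolerations"] = [
            {
                "key": "nvidia.com/gpu",
                "operator": "Equal",
                "value": "present",
                "effect": "NoSchedule",
            }
        ]
    return pod


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(build_gpu_pod(explicit_toleration=True), indent=2))
```

Without the explicit toleration, the Pod has nothing that tolerates the taint on the GPU nodes, which matches the behavior reported above.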

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): AKS v1.15.7
  • Size of cluster (how many worker nodes are in the cluster?): one default node pool with one node, and the GPU node pool with one node of type Standard_NC6_Promo
  • General description of workloads in the cluster: Pod requesting a nvidia.com/gpu
@martinhartig

Any progress on this issue?

@github-actions

This issue has been automatically marked as stale because it has not had activity in 90 days. It will be closed in 30 days if no further activity occurs. Thank you!

@github-actions github-actions bot added the stale Stale issue label Jul 21, 2020
@palma21 palma21 self-assigned this Jul 22, 2020
@palma21 palma21 added feature feature-request Requested Features and removed stale Stale issue labels Jul 22, 2020
@palma21
Member

palma21 commented Jul 22, 2020

Thanks for the feedback, it's a great ask. Added to the backlog

CC @xuto2 @bowang-666

@SahilChaudhary25

@palma21 Is there any update on this issue?

@rajivml

rajivml commented Aug 12, 2020

I have been scratching my head for the last 2 days over why it isn't working, whereas it used to work on GCP without adding a toleration block to each pod explicitly.

Any update on when we can expect this feature in AKS? It looks like the ticket was opened 6 months ago and it's still in the backlog :( It's quite an old feature; we have been using it on GCP for more than a year. Right now we are migrating some of our workloads from GCP to Azure, and we require this feature to make sure non-GPU workloads don't land on GPU nodes, and we have quite a lot of them. It would be a blocker for our migration.

@palma21
Member

palma21 commented Aug 17, 2020

The issue is fairly old, but it was only added to the backlog 27 days ago 😄 (as of this comment)

This is something we are actively working on and plan to have for next month. Thanks for calling out the difference; it might be something other folks encounter too as they move. We'll prioritize accordingly.

@palma21 palma21 moved this from Backlog to Generally Available (Done) in Azure Kubernetes Service Roadmap (Public) Oct 24, 2020
@palma21 palma21 moved this from Generally Available (Done) to Public Preview (Shipped & Improving) in Azure Kubernetes Service Roadmap (Public) Oct 24, 2020
@ghost

ghost commented Oct 31, 2020

Thank you for the feature request. I'm closing this issue as this feature has shipped and it hasn't had activity for 7 days.

@ghost ghost closed this as completed Oct 31, 2020
@Azure Azure locked as resolved and limited conversation to collaborators Dec 1, 2020
@palma21 palma21 moved this from Public Preview (Shipped & Improving) to Generally Available (Done) in Azure Kubernetes Service Roadmap (Public) Dec 9, 2020
@palma21 palma21 moved this from Generally Available (Done) to Archive (GA older than 6 months) in Azure Kubernetes Service Roadmap (Public) Dec 10, 2020

5 participants