Error running GPU pod: "Insufficient nvidia.com/gpu" #36

Closed
JimmyWhitaker opened this issue Mar 5, 2018 · 6 comments

@JimmyWhitaker

I am unable to get GPU device support through k8s.
I am running 2 p2.xlarge nodes on AWS with a manual installation of K8s.
nvidia-docker2 is installed and set as the default runtime. I verified this by running the following and getting the expected output:
docker run --rm nvidia/cuda nvidia-smi

I followed all the steps in the README of this repo, but I cannot get containers to have GPU access. The pod from nvidia-device-plugin.yml appears to be up and running, but the digits job's pod fails to schedule with this error:

$ kubectl get pod gpu-pod --template '{{.status.conditions}}'
[map[type:PodScheduled lastProbeTime:<nil> lastTransitionTime:2018-02-26T21:58:32Z message:0/2 nodes are available: 1 PodToleratesNodeTaints, 2 Insufficient nvidia.com/gpu. reason:Unschedulable status:False]]

I thought it might be that I was requesting too many GPUs (2 per node), but even after lowering the request in the yml I got the same result. Any ideas where things could be going wrong?
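
For reference, the resource request in question looks roughly like this. This is a minimal sketch, not the exact manifest from the issue; the pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod                  # placeholder name
spec:
  containers:
    - name: digits
      image: nvidia/digits:6.0   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1      # lowering this from 2 to 1 did not change the scheduling error
```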

@RenaudWasTaken
Contributor

Hello!

Can you check and share the following:

  • Make sure you enabled the device plugin feature gate (see the sketch after this list)
  • The output of kubectl describe node for your GPU node
  • The k8s-device-plugin container logs
  • The output of nvidia-smi -a on your host
  • Your Docker configuration file (e.g. /etc/docker/daemon.json)
  • The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)
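
For context on the first item: on Kubernetes 1.8/1.9 (current at the time of this thread) the DevicePlugins feature gate had to be enabled on the kubelet explicitly. A minimal sketch of what that looks like; the drop-in file path and the KUBELET_EXTRA_ARGS variable are assumptions based on a typical kubeadm install:

```
# In the kubelet systemd drop-in, e.g. /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
# (path assumed; adjust to your installation):
Environment="KUBELET_EXTRA_ARGS=--feature-gates=DevicePlugins=true"

# Then reload systemd and restart the kubelet so the flag takes effect:
sudo systemctl daemon-reload
sudo systemctl restart kubelet
```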

@mrajjay91

@JimmyWhitaker: How did you solve this error? I am getting the same one. Thanks

@RenaudWasTaken
Contributor

Hello @mrajjay91:

Can you check and share the same items as above:

  • Make sure you enabled the device plugin feature gate
  • The output of kubectl describe node for your GPU node
  • The k8s-device-plugin container logs
  • The output of nvidia-smi -a on your host
  • Your Docker configuration file (e.g. /etc/docker/daemon.json)
  • The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)

@mrajjay91

@RenaudWasTaken: Thanks for the quick reply. I was able to resolve the issue by looking at the logs of the k8s-device-plugin container. It was failing with Error: Could not load NVML library, so I had to:

export PATH=/usr/local/cuda-9.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.1/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

I was also getting an error that the default runtime was not set to nvidia. I had edited /etc/docker/daemon.json to set the default runtime, but I had not restarted Docker; after restarting Docker the issue was fixed.
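
For anyone hitting the same thing, the daemon.json change in question looks roughly like the standard nvidia-docker2 configuration below; the runtime path assumes the default nvidia-container-runtime install location:

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

Docker only picks this file up after a restart (e.g. sudo systemctl restart docker), which was the missing step here.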

Is there a community where we can post questions?

@RenaudWasTaken
Contributor

Is there a community where we can post questions?

Questions like that should go to the GitHub repository of either:

  • The nvidia device plugin
  • nvidia-docker

@meirhazonAnyVision

Hello Renaud Gaubert,
I am having the same issue.
Can you please describe how you solved it? Did you define the PATH etc. on the Kubernetes master? Where did you define this?
Thanks
