Error running GPU pod: "Insufficient nvidia.com/gpu" #36
Comments
Hello! Can you show:
@JimnyCricket: How did you solve this error? I am getting the same one. Thanks!
Hello @mrajjay91: Can you show:
@RenaudWasTaken: Thanks for the quick reply. I was able to resolve the issue by looking at the logs of the k8s-device-plugin container. It was reporting Error: Could not load NVML library, along with an error telling me to check that the default runtime is nvidia. I had edited the /etc/docker/daemon.json file to set the default runtime, but I had not restarted Docker; restarting Docker fixed the issue. Is there a community where we can post questions?
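For reference, a minimal sketch of the fix described above. The daemon.json contents follow the nvidia-docker2 documentation; the runtime path and the systemd service name assume a stock Linux install and may differ on your distribution:

```shell
# /etc/docker/daemon.json -- register nvidia as the default runtime
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF

# The edit only takes effect after Docker is restarted
sudo systemctl restart docker

# Verify: the default runtime should now report "nvidia"
docker info | grep -i 'default runtime'
```

The key point from the comment above is the restart step: Docker only reads daemon.json at startup, so editing the file alone changes nothing.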
That should be on the github repository of either:
Hello Renaud Gaubert,
I am unable to get GPU device support through k8s.
I am running 2 p2.xlarge nodes on AWS with a manual installation of K8s.
The nvidia-docker2 is installed and set as the default runtime. I tested this by running the following and getting the expected output.
docker run --rm nvidia/cuda nvidia-smi
I followed all the steps in the README of this repo, but cannot get the containers to access the GPUs. The nvidia-device-plugin.yml DaemonSet appears to be up and running, but launching the digits pod gives this error:
$ kubectl get pod gpu-pod --template '{{.status.conditions}}'
[map[type:PodScheduled lastProbeTime:<nil> lastTransitionTime:2018-02-26T21:58:32Z message:0/2 nodes are available: 1 PodToleratesNodeTaints, 2 Insufficient nvidia.com/gpu. reason:Unschedulable status:False]]
I thought that it might be that I was requiring too many resources (2 per node), but even lowering the requirements in the yml still yielded the same result. Any ideas where things could be going wrong?
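One way to narrow this down is to check whether the scheduler actually sees any GPUs on the nodes. A rough sketch, assuming kubectl is pointed at the cluster and the device plugin was deployed into kube-system (the pod name placeholder must be filled in from the output):

```shell
# Do the nodes advertise nvidia.com/gpu in their allocatable resources?
# If this prints nothing, the device plugin never registered the GPUs.
kubectl describe nodes | grep -i 'nvidia.com/gpu'

# Find the device-plugin pod and inspect its logs for errors such as
# "Could not load NVML library" or a missing default runtime
kubectl get pods -n kube-system | grep nvidia-device-plugin
kubectl logs -n kube-system <device-plugin-pod-name>
```

"Insufficient nvidia.com/gpu" from the scheduler usually means the nodes report zero allocatable GPUs, which points at the device plugin or the Docker runtime configuration rather than at the pod's resource request.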