Error running GPU pod: "Insufficient nvidia.com/gpu" #36
Comments
Hello! Can you show:
@JimnyCricket: How did you solve this error? I am getting the same one. Thanks!
Hello @mrajjay91: Can you show:
@RenaudWasTaken: Thanks for the quick reply. I was able to resolve the issue by looking at the logs of the k8s-device-plugin container. It was reporting Error: Could not load NVML library, along with an error telling me to check that the default runtime is nvidia. I had edited the /etc/docker/daemon.json file to set the default runtime, but I had not restarted Docker; restarting Docker fixed the issue. Is there a community where we can post questions?
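For reference, a minimal sketch of the fix described above. The daemon.json contents follow the nvidia-docker2 documentation; the runtime path and the systemd service name assume a stock Linux install and may differ on your distribution:

```shell
# /etc/docker/daemon.json -- register nvidia as the default runtime
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF

# The edit only takes effect after Docker is restarted
sudo systemctl restart docker

# Verify: the default runtime should now report "nvidia"
docker info | grep -i 'default runtime'
```

The key point from the comment above is the restart step: Docker only reads daemon.json at startup, so editing the file alone changes nothing.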
That should be on the github repository of either:
Hello Renaud Gaubert,
I am unable to get GPU device support through k8s.
I am running 2 p2.xlarge nodes on AWS with a manual installation of K8s.
The nvidia-docker2 is installed and set as the default runtime. I tested this by running the following and getting the expected output.
docker run --rm nvidia/cuda nvidia-smi
I followed all the steps in the README of this repo, but cannot get the containers to access the GPUs. The nvidia-device-plugin.yml DaemonSet appears to be up and running, but launching the digits pod gives this error:
$ kubectl get pod gpu-pod --template '{{.status.conditions}}'
[map[type:PodScheduled lastProbeTime:<nil> lastTransitionTime:2018-02-26T21:58:32Z message:0/2 nodes are available: 1 PodToleratesNodeTaints, 2 Insufficient nvidia.com/gpu. reason:Unschedulable status:False]]
I thought that it might be that I was requiring too many resources (2 per node), but even lowering the requirements in the yml still yielded the same result. Any ideas where things could be going wrong?
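One way to narrow this down is to check whether the scheduler actually sees any GPUs on the nodes. A rough sketch, assuming kubectl is pointed at the cluster and the device plugin was deployed into kube-system (the pod name placeholder must be filled in from the output):

```shell
# Do the nodes advertise nvidia.com/gpu in their allocatable resources?
# If this prints nothing, the device plugin never registered the GPUs.
kubectl describe nodes | grep -i 'nvidia.com/gpu'

# Find the device-plugin pod and inspect its logs for errors such as
# "Could not load NVML library" or a missing default runtime
kubectl get pods -n kube-system | grep nvidia-device-plugin
kubectl logs -n kube-system <device-plugin-pod-name>
```

"Insufficient nvidia.com/gpu" from the scheduler usually means the nodes report zero allocatable GPUs, which points at the device plugin or the Docker runtime configuration rather than at the pod's resource request.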