Creating a pod fails with nvidia-container-cli: device error: unknown device id: no-gpu-has-10MiB-to-run #80
NVIDIA driver 410.48; NVIDIA Corporation GP102 [TITAN X]; nvidia-docker2-2.2.2; CentOS 7.6
Take a look at the gpushare-device-plugin logs. I suspect gpushare-scheduler-extender is not configured correctly.
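For anyone unsure where to find those logs, a minimal sketch, assuming the device plugin runs in kube-system as in the default install (the pod name below is a placeholder):

```sh
# Locate the gpushare device-plugin pod, then dump its logs
kubectl -n kube-system get pods | grep gpushare
kubectl -n kube-system logs <gpushare-device-plugin-pod-name>
```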
I have the same issue. It shows:
I also checked gpushare-device-plugin. The log is here:
Hi guys, I also have this issue when I use demo1. My Docker version is 19.3.3, my NVIDIA driver version is 418.116.00, and when I use kubectl-inspect-gpushare it shows:
NAME IPADDRESS GPU0(Allocated/Total) PENDING(Allocated) GPU Memory(GiB)
You can try adding an env to the container, as follows (see the pod spec excerpt later in this thread):
This worked for a similar error of mine.
Fixed my issue by re-creating
I tried what @Carlosnight said, but when I run
Same issue here, any feedback on this?
@Svendegroote91 in my case, it was the wrong configuration for
@Mhs-220 I have 3 controller nodes but only did the kube-scheduler update on the controller node (node "ctr-1" in my case) through which I am using the KubeAPI (I have no HA on top of the controller nodes at the moment). That should be sufficient, no? You can see that the kube-scheduler restarted after updating the file. Maybe it helps if I share the logs of the gpushare-schd-extender together with the logs of the actual container. Can you elaborate on what exactly you forgot to apply to the kube-scheduler.yaml file, or point to the mistake in my attached file?
I solved it - I had to update all my master nodes following the instructions from the installation guide.
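For reference, the installation guide has every master's kube-scheduler load a policy file that registers the extender via the --policy-config-file flag. A sketch of that file, assuming the default endpoint from the project's sample config (adjust host and port to your deployment):

```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "extenders": [
    {
      "urlPrefix": "http://127.0.0.1:32766/gpushare-scheduler",
      "filterVerb": "filter",
      "bindVerb": "bind",
      "enableHttps": false,
      "nodeCacheCapable": true,
      "managedResources": [
        { "name": "aliyun.com/gpu-mem", "ignoredByScheduler": false }
      ],
      "ignorable": false
    }
  ]
}
```

If only one master is updated, a scheduler instance without this policy can end up doing the binding and bypass the extender entirely, which matches the symptom above.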
Can somebody explain why the environment variable
The error is happening again after updating Kubernetes to 1.18. Any idea?
@Mhs-220 my cluster version is v1.15.11 (because I am using Kubeflow on top of Kubernetes, and v1.15 is fully supported)
In my case, it does not work with v1.17.4 but works well with v1.15.11. Thanks to @Mhs-220 and @Svendegroote91. I do not know why one works and the other does not.
In our case, we tried adding the env as well. The problem we ran into was that the container was actually using more GPU memory than the limit we set in k8s (or some processes not managed by k8s were occupying memory on the card), so even though the setting was in place, the error still occurred.
In the end, we made sure that every container strictly uses less GPU memory than the limit we assign to it, and moved all processes not managed by k8s to other machines (or clusters).
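A quick way to check for the overuse described above, assuming nvidia-smi is available (it is in the nvidia/cuda images):

```sh
# Inside the container: memory actually in use vs. the card's total;
# compare against the aliyun.com/gpu-mem limit you set
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# On the node: every process holding GPU memory, including ones
# not managed by k8s
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```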
The pod spec excerpt with the suggested env and gpu-mem limit:

```yaml
containers:
- name: cuda
  image: nvidia/cuda:latest
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "all"
  resources:
    limits:
      # GiB
      aliyun.com/gpu-mem: 1
```
```
$ kubectl inspect gpushare
NAME    IPADDRESS     GPU0(Allocated/Total)  GPU Memory(GiB)
k8s243  10.3.171.243  2/7                    2/7
k8s25   10.3.144.25   0/14                   0/14
------------------------------------------------
Allocated/Total GPU Memory In Cluster:
2/21 (9%)
```
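If a per-pod breakdown would help with debugging, the inspect plugin also has a detailed mode (per the gpushare project docs; verify the flag against your plugin version):

```sh
kubectl inspect gpushare -d
```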
I encountered the same problem after installing according to the instructions. Later, I found that the other master had not been adjusted according to the instructions. After I modified the scheduler component configuration on both masters, everything worked normally.
Sharing my solution for everyone's reference. My problem came from modifying
The wrong way I did it before:
The problem was in step 2: in
The correct way:
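One way to confirm such a change actually took effect, a sketch assuming kube-scheduler runs as a static pod on each master (names will differ per cluster):

```sh
# The static pod restarts when its manifest changes; check its age
kubectl -n kube-system get pods | grep kube-scheduler

# Then check the scheduler logs for errors loading the policy file
kubectl -n kube-system logs kube-scheduler-<master-node-name>
```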
Seems like this is how it works: AliyunContainerService/gpushare-device-plugin#55 (comment)
May I ask which installation guide you followed?
My cluster is version 1.16. After installing according to the guide, the node reports aliyun.com/gpu-count: 2 and gpu-mem: 22.
When I create a pod with aliyun.com/gpu-mem: 10, it fails with stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-10MiB-to-run; if I specify gpu-count: 1, it fails with Back-off restarting failed container.
What could be the cause?
Environment:
docker 19.03.5
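For comparison, a minimal pod spec of the shape discussed in this thread; the limit value is only an example, and it has to fit on a single physical GPU rather than the cluster total:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-mem-demo
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:latest
    resources:
      limits:
        aliyun.com/gpu-mem: 10  # GiB; must fit on one card
```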