
Creating a pod fails with nvidia-container-cli: device error: unknown device id: no-gpu-has-10MiB-to-run #80

Open
HistoryGift opened this issue Dec 18, 2019 · 24 comments


@HistoryGift

I am using a 1.16 cluster. After installing per the tutorial, the node reports aliyun.com/gpu-count: 2 and gpu-mem: 22.
When I create a pod with aliyun.com/gpu-mem: 10, it fails with stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-10MiB-to-run; if I also specify gpu-count: 1, it fails with Back-off restarting failed container.

What could be the cause?

Environment:
docker 19.03.5

@HistoryGift
Author

NVIDIA driver 410.48; NVIDIA Corporation GP102 [TITAN X]; nvidia-docker2-2.2.2; CentOS 7.6;
go 1.13.5

@cheyang
Collaborator

cheyang commented Feb 3, 2020

You need to check the gpushare-device-plugin logs. I suspect the gpushare-scheduler-extender is not configured correctly.
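
For reference, a minimal sketch of how those logs could be pulled, assuming a default gpushare install in the kube-system namespace (the pod names below are placeholders; list the real ones first and adjust the namespace if yours differs):

  # List the gpushare pods, then tail the device-plugin and scheduler-extender logs.
  kubectl -n kube-system get pods | grep gpushare
  kubectl -n kube-system logs <gpushare-device-plugin-pod> --tail=100
  kubectl -n kube-system logs <gpushare-schd-extender-pod> --tail=100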

@bluelml

bluelml commented Feb 20, 2020

I have the same issue. It shows:

Events:
  Type     Reason     Age                            From                  Message
  ----     ------     ----                           ----                  -------
  Normal   Scheduled  28s                            default-scheduler     Successfully assigned kong/binpack-2-54df84c8d7-nknfx to 192.168.3.4
  Normal   Pulled     <invalid> (x3 over <invalid>)  kubelet, 192.168.3.4  Container image "cheyang/gpu-player:v2" already present on machine
  Normal   Created    <invalid> (x3 over <invalid>)  kubelet, 192.168.3.4  Created container binpack-2
  Warning  Failed     <invalid> (x3 over <invalid>)  kubelet, 192.168.3.4  Error: failed to start container "binpack-2": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-2MiB-to-run\\\\n\\\"\"": unknown

I also checked gpushare-device-plugin. The log is here:

I0220 10:16:27.337772       1 main.go:18] Start gpushare device plugin
I0220 10:16:27.337870       1 gpumanager.go:28] Loading NVML
I0220 10:16:27.365052       1 gpumanager.go:37] Fetching devices.
I0220 10:16:27.365082       1 gpumanager.go:43] Starting FS watcher.
I0220 10:16:27.365204       1 gpumanager.go:51] Starting OS watcher.
I0220 10:16:27.392614       1 nvidia.go:64] Deivce GPU-5f44aae0-ca45-9038-5202-a033fa4f471a's Path is /dev/nvidia0
I0220 10:16:27.392682       1 nvidia.go:69] # device Memory: 12036
I0220 10:16:27.392691       1 nvidia.go:40] set gpu memory: 11
I0220 10:16:27.392699       1 nvidia.go:76] # Add first device ID: GPU-5f44aae0-ca45-9038-5202-a033fa4f471a-_-0
I0220 10:16:27.392713       1 nvidia.go:79] # Add last device ID: GPU-5f44aae0-ca45-9038-5202-a033fa4f471a-_-10
I0220 10:16:27.421574       1 nvidia.go:64] Deivce GPU-28071eed-0993-c165-e123-ea818a546f14's Path is /dev/nvidia1
I0220 10:16:27.421596       1 nvidia.go:69] # device Memory: 12036
I0220 10:16:27.421604       1 nvidia.go:76] # Add first device ID: GPU-28071eed-0993-c165-e123-ea818a546f14-_-0
I0220 10:16:27.421628       1 nvidia.go:79] # Add last device ID: GPU-28071eed-0993-c165-e123-ea818a546f14-_-10
I0220 10:16:27.453463       1 nvidia.go:64] Deivce GPU-203884e8-4afc-ad03-1a15-2ff5b34fc01c's Path is /dev/nvidia2
I0220 10:16:27.453482       1 nvidia.go:69] # device Memory: 12036
I0220 10:16:27.453490       1 nvidia.go:76] # Add first device ID: GPU-203884e8-4afc-ad03-1a15-2ff5b34fc01c-_-0
I0220 10:16:27.453502       1 nvidia.go:79] # Add last device ID: GPU-203884e8-4afc-ad03-1a15-2ff5b34fc01c-_-10
I0220 10:16:27.480145       1 nvidia.go:64] Deivce GPU-d0cd36b5-9221-facd-203c-b2342b207439's Path is /dev/nvidia3
I0220 10:16:27.480166       1 nvidia.go:69] # device Memory: 12036
I0220 10:16:27.480172       1 nvidia.go:76] # Add first device ID: GPU-d0cd36b5-9221-facd-203c-b2342b207439-_-0
I0220 10:16:27.480190       1 nvidia.go:79] # Add last device ID: GPU-d0cd36b5-9221-facd-203c-b2342b207439-_-10
I0220 10:16:27.501184       1 nvidia.go:64] Deivce GPU-c3ea63e8-ccde-4e49-7093-d570e14d82c2's Path is /dev/nvidia4
I0220 10:16:27.501203       1 nvidia.go:69] # device Memory: 12036
I0220 10:16:27.501209       1 nvidia.go:76] # Add first device ID: GPU-c3ea63e8-ccde-4e49-7093-d570e14d82c2-_-0
I0220 10:16:27.501216       1 nvidia.go:79] # Add last device ID: GPU-c3ea63e8-ccde-4e49-7093-d570e14d82c2-_-10
I0220 10:16:27.524208       1 nvidia.go:64] Deivce GPU-b663687a-4219-c707-1c3c-8ddcc59b9dbd's Path is /dev/nvidia5
I0220 10:16:27.524226       1 nvidia.go:69] # device Memory: 12036
I0220 10:16:27.524231       1 nvidia.go:76] # Add first device ID: GPU-b663687a-4219-c707-1c3c-8ddcc59b9dbd-_-0
I0220 10:16:27.524243       1 nvidia.go:79] # Add last device ID: GPU-b663687a-4219-c707-1c3c-8ddcc59b9dbd-_-10
I0220 10:16:27.547600       1 nvidia.go:64] Deivce GPU-8d72b13d-0942-e5f4-e2bc-e77d29e88be1's Path is /dev/nvidia6
I0220 10:16:27.547627       1 nvidia.go:69] # device Memory: 12036
I0220 10:16:27.547635       1 nvidia.go:76] # Add first device ID: GPU-8d72b13d-0942-e5f4-e2bc-e77d29e88be1-_-0
I0220 10:16:27.547653       1 nvidia.go:79] # Add last device ID: GPU-8d72b13d-0942-e5f4-e2bc-e77d29e88be1-_-10
I0220 10:16:27.573674       1 nvidia.go:64] Deivce GPU-7d1c58db-fc7d-8042-f86e-f37297c2a1c9's Path is /dev/nvidia7
I0220 10:16:27.573696       1 nvidia.go:69] # device Memory: 12036
I0220 10:16:27.573704       1 nvidia.go:76] # Add first device ID: GPU-7d1c58db-fc7d-8042-f86e-f37297c2a1c9-_-0
I0220 10:16:27.573718       1 nvidia.go:79] # Add last device ID: GPU-7d1c58db-fc7d-8042-f86e-f37297c2a1c9-_-10
I0220 10:16:27.573736       1 server.go:43] Device Map: map[GPU-b663687a-4219-c707-1c3c-8ddcc59b9dbd:5 GPU-8d72b13d-0942-e5f4-e2bc-e77d29e88be1:6 GPU-7d1c58db-fc7d-8042-f86e-f37297c2a1c9:7 GPU-5f44aae0-ca45-9038-5202-a033fa4f471a:0 GPU-28071eed-0993-c165-e123-ea818a546f14:1 GPU-203884e8-4afc-ad03-1a15-2ff5b34fc01c:2 GPU-d0cd36b5-9221-facd-203c-b2342b207439:3 GPU-c3ea63e8-ccde-4e49-7093-d570e14d82c2:4]
I0220 10:16:27.573807       1 server.go:44] Device List: [GPU-5f44aae0-ca45-9038-5202-a033fa4f471a GPU-28071eed-0993-c165-e123-ea818a546f14 GPU-203884e8-4afc-ad03-1a15-2ff5b34fc01c GPU-d0cd36b5-9221-facd-203c-b2342b207439 GPU-c3ea63e8-ccde-4e49-7093-d570e14d82c2 GPU-b663687a-4219-c707-1c3c-8ddcc59b9dbd GPU-8d72b13d-0942-e5f4-e2bc-e77d29e88be1 GPU-7d1c58db-fc7d-8042-f86e-f37297c2a1c9]
I0220 10:16:27.592888       1 podmanager.go:68] No need to update Capacity aliyun.com/gpu-count
I0220 10:16:27.593476       1 server.go:222] Starting to serve on /var/lib/kubelet/device-plugins/aliyungpushare.sock
I0220 10:16:27.594247       1 server.go:230] Registered device plugin with Kubelet
I0220 16:33:58.048842       1 allocate.go:46] ----Allocating GPU for gpu mem is started----
I0220 16:33:58.048863       1 allocate.go:57] RequestPodGPUs: 2
I0220 16:33:58.048868       1 allocate.go:61] checking...
I0220 16:33:58.063205       1 podmanager.go:112] all pod list [{{ } {binpack-2-54df84c8d7-nknfx binpack-2-54df84c8d7- kong /api/v1/namespaces/kong/pods/binpack-2-54df84c8d7-nknfx ea316ce3-fd64-413e-b5fe-c06baff6444b 366992 0 2020-02-20 16:32:54 +0000 UTC <nil> <nil> map[app:binpack-2 pod-template-hash:54df84c8d7] map[] [{apps/v1 ReplicaSet binpack-2-54df84c8d7 a0faf71e-cd14-4614-a9f4-d87a236badd1 0xc4203bfe6a 0xc4203bfe6b}] nil [] } {[{default-token-kj8jt {nil nil nil nil nil &SecretVolumeSource{SecretName:default-token-kj8jt,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}] [] [{binpack-2 cheyang/gpu-player:v2 [] []  [] [] [] {map[aliyun.com/gpu-mem:{{2 0} {<nil>} 2 DecimalSI}] map[aliyun.com/gpu-mem:{{2 0} {<nil>} 2 DecimalSI}]} [{default-token-kj8jt true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}] Always 0xc4203bff08 <nil> ClusterFirst map[] default default <nil> 192.168.3.4 false false false <nil> &PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],} []   nil default-scheduler [{node.kubernetes.io/not-ready Exists  NoExecute 0xc4203bffa0} {node.kubernetes.io/unreachable Exists  NoExecute 0xc4203bffc0}] []  0xc4203bffd0 nil []} {Pending [{PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-02-20 16:32:54 +0000 UTC  }]      <nil> [] [] BestEffort}}]
I0220 16:33:58.063418       1 podmanager.go:123] list pod binpack-2-54df84c8d7-nknfx in ns kong in node 192.168.3.4 and status is Pending
I0220 16:33:58.063430       1 podutils.go:81] No assume timestamp for pod binpack-2-54df84c8d7-nknfx in namespace kong, so it's not GPUSharedAssumed assumed pod.
W0220 16:33:58.063439       1 allocate.go:152] invalid allocation requst: request GPU memory 2 can't be satisfied.

@HaKuLaMeTAT

Hi guys, I also have this issue when I use demo1:
Warning Failed 3m41s (x5 over 5m20s) kubelet, k8s-master Error: failed to start container "binpack-1": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-2MiB-to-run\\\\n\\\"\"": unknown

My Docker version is 19.3.3 and my NVIDIA driver version is 418.116.00. When I use kubectl-inspect-gpushare, it shows:

NAME IPADDRESS GPU0(Allocated/Total) PENDING(Allocated) GPU Memory(GiB)
k8s-master 192.168.1.103 0/7 2 2/7

Allocated/Total GPU Memory In Cluster:
2/7 (28%)

The gpushare-device-plugin log is the same as above.
Is there any solution you can share with me? Thanks.

@Carlosnight

NVIDIA driver 410.48; NVIDIA Corporation GP102 [TITAN X]; nvidia-docker2-2.2.2; CentOS 7.6;
go 1.13.5

I am using a 1.16 cluster. After installing per the tutorial, the node reports aliyun.com/gpu-count: 2 and gpu-mem: 22.
When I create a pod with aliyun.com/gpu-mem: 10, it fails with stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-10MiB-to-run; if I also specify gpu-count: 1, it fails with Back-off restarting failed container.

What could be the cause?

Environment:
docker 19.03.5

You can try adding an env entry to the container, like this:

containers:
    - name: cuda
      image: nvidia/cuda:latest
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
      resources:
        limits:
          # GiB
          aliyun.com/gpu-mem: 1

This worked for my similar error.

@wynn5a

wynn5a commented May 11, 2020

Fixed my issue by re-creating /etc/kubernetes/manifests/kube-scheduler.yaml so that the static scheduler pod uses the correct configuration.
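
A rough sketch of how one might confirm the static scheduler pod actually picked up the change (the component=kube-scheduler label matches kubeadm-style control planes, and the --policy-config-file flag is the one the gpushare installation guide adds; both are assumptions that may not hold for your setup):

  # Check that the running kube-scheduler carries the extender policy flag and started cleanly.
  kubectl -n kube-system get pod -l component=kube-scheduler -o yaml | grep -i policy-config-file
  kubectl -n kube-system logs -l component=kube-scheduler --tail=50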

@Mhs-220

Mhs-220 commented Jun 23, 2020

I tried what @Carlosnight suggested, but when I run kubectl inspect gpushare I see a new column called PENDING(Allocated), and it seems that GPU isolation did not happen.
I checked the value of ALIYUN_COM_GPU_MEM_IDX inside my pod and it is -1.
P.S.: There is a warning in gpushare-schd-extender that says: Pod gpu-pod in ns production is not set the GPU ID -1 in node xxxxxxxxx
P.S.: The DaemonSet log shows that it found the GPU, but it repeatedly says it cannot assign 3 GiB of memory to the pod.
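
For anyone hitting the same symptom, a quick sketch of the check above (the namespace and pod name are taken from this comment and are placeholders):

  # A correctly isolated pod should carry a concrete device index and a specific GPU UUID,
  # not ALIYUN_COM_GPU_MEM_IDX=-1 or NVIDIA_VISIBLE_DEVICES=all.
  kubectl -n production exec gpu-pod -- env | grep -E 'ALIYUN_COM_GPU|NVIDIA_VISIBLE_DEVICES'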

@Svendegroote91

Same issue here, any feedback on this?

@Mhs-220

Mhs-220 commented Jul 15, 2020

@Svendegroote91 in my case it was a wrong configuration in kube-scheduler.yaml. I recommend you check it and read the installation guide again.

@Svendegroote91

Svendegroote91 commented Jul 16, 2020

@Mhs-220
I attached the kube-scheduler.yaml.zip from my controller node.

I have 3 controller nodes, but I only did the kube-scheduler update on the controller node through which I am using the KubeAPI (node "ctr-1" in my case; I have no HA on top of the controller nodes at the moment). That should be sufficient, no? You can see that the kube-scheduler restarted after updating the file:
(screenshot showing the kube-scheduler restart)

Maybe it helps if I share the logs of the gpushare-schd-extender together with the logs of the actual container:
What strikes me is that the pod first goes into the "Pending" state and subsequently to "Running", but the status in kubectl inspect gpushare does not change.

(screenshot of the gpushare-schd-extender and container logs)

Can you elaborate on what exactly you forgot to apply to the kube-scheduler.yaml file, or point out the mistake in my attached file?
Your help would be appreciated a lot!

@Svendegroote91

I solved it - I had to update all my master nodes following the instructions from the installation guide.

@Svendegroote91

Can somebody explain why the environment variable NVIDIA_VISIBLE_DEVICES fixes this and why it is needed in the manifest file?

@Mhs-220

Mhs-220 commented Aug 19, 2020

The error is happening again after updating Kubernetes to 1.18. Any idea?
@Svendegroote91 what is your cluster version?

@Svendegroote91

@Mhs-220 my cluster version is v1.15.11 (because I am using Kubeflow on top of Kubernetes and v1.15 is fully supported)

@Panlichen

In my case, it does not work with v1.17.4 but works well with v1.15.11. Thanks to @Mhs-220 and @Svendegroote91.

I do not know why it works on one version and not on the other.

@Panlichen

v1.16.15 has the same problem. It seems that v1.16 is a release where the API changed a lot. I have seen some deprecated "pre-1.16" API versions in other projects, as shown here. Maybe @cheyang can help us out?

@chenaoxd

chenaoxd commented Jul 27, 2021

NVIDIA driver 410.48; NVIDIA Corporation GP102 [TITAN X]; nvidia-docker2-2.2.2; CentOS 7.6;
go 1.13.5

I am using a 1.16 cluster. After installing per the tutorial, the node reports aliyun.com/gpu-count: 2 and gpu-mem: 22.
When I create a pod with aliyun.com/gpu-mem: 10, it fails with stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-10MiB-to-run; if I also specify gpu-count: 1, it fails with Back-off restarting failed container.
What could be the cause?
Environment:
docker 19.03.5

You can try adding an env entry to the container, like this:

containers:
    - name: cuda
      image: nvidia/cuda:latest
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
      resources:
        limits:
          # GiB
          aliyun.com/gpu-mem: 1

This worked for my similar error.

In our experience, adding the NVIDIA_VISIBLE_DEVICES=all env entry is not a reasonable fix for this problem. With it set, the GPU the container actually uses no longer matches the GPU the scheduler assigned, which causes other issues. You can verify this with k exec [pod] -- env on a correctly scheduled pod: the scheduler controls which GPU the container uses through an environment variable like NVIDIA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, even if you never set one yourself. With the =all setting, nvidia-smi also shows that the GPU actually in use does not match the assigned one.

The cause in our case was that containers were using more GPU memory than the limit set in k8s (or some processes not managed by k8s were occupying GPU memory). Although the scheduler believed the card still had free memory (and therefore scheduled the pod onto it), there was actually not enough memory left when the workload ran.

Our final solution was to strictly ensure that actual GPU memory usage stays below the assigned limit, and to move all processes not managed by k8s to other machines (or clusters).
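
A quick sketch of those checks (the pod name is a placeholder):

  # NVIDIA_VISIBLE_DEVICES should name the specific GPU UUID chosen by the scheduler, not "all".
  kubectl exec <pod> -- env | grep NVIDIA_VISIBLE_DEVICES
  # Compare the GPU(s) visible inside the container with the one the scheduler assigned.
  kubectl exec <pod> -- nvidia-smi -L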

@golimix

golimix commented Sep 9, 2021

versions

  • docker 19.03.11
  • kubernetes v1.19.8

processing

containers:
    - name: cuda
      image: nvidia/cuda:latest
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
      resources:
        limits:
          # GiB
          aliyun.com/gpu-mem: 1
  • After adding the environment variable NVIDIA_VISIBLE_DEVICES=all, my application starts normally, but the nvidia-smi output does not look right: the reported GPU memory usage seems wrong, or something else is off, e.g. No running processes found
[root@k8s243 models]# nvidia-smi 
 Thu Sep  9 17:21:00 2021       
 +-----------------------------------------------------------------------------+
 | NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
 |-------------------------------+----------------------+----------------------+
 | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |===============================+======================+======================|
 |   0  Tesla P4            Off  | 00000000:00:10.0 Off |                    0 |
 | N/A   31C    P8     6W /  75W |      0MiB /  7611MiB |      0%      Default |
 +-------------------------------+----------------------+----------------------+
                                                                                
 +-----------------------------------------------------------------------------+
 | Processes:                                                       GPU Memory |
 |  GPU       PID   Type   Process name                             Usage      |
 |=============================================================================|
 |  No running processes found                                                 |
 +-----------------------------------------------------------------------------+
kubectl inspect gpushare
 NAME    IPADDRESS     GPU0(Allocated/Total)  GPU Memory(GiB)
 k8s243  10.3.171.243  2/7                    2/7
 k8s25   10.3.144.25   0/14                   0/14
 ------------------------------------------------
 Allocated/Total GPU Memory In Cluster:
 2/21 (9%)  

@karlhjm

karlhjm commented Nov 16, 2021

I encountered the same problem after installing according to the instructions. Later, I found that the other master had not been adjusted according to the instructions. After I modified the scheduler component configuration on both masters, everything worked normally.

@serend1p1ty

Sharing my solution for reference. My problem was in how I modified kube-scheduler.yaml.

The incorrect approach I used before:

  1. cd /etc/kubernetes/manifests
  2. cp kube-scheduler.yaml kube-scheduler.yaml.backup
  3. Edit kube-scheduler.yaml in place, following the installation guide

The problem is step 2: /etc/kubernetes/manifests now contains two files, kube-scheduler.yaml and kube-scheduler.yaml.backup. Kubernetes loads every manifest in this directory, and the backup probably overrides the edited file, so the configuration change never takes effect.

The correct approach (the same steps are shown as shell commands below):

  1. cd /etc/kubernetes
  2. mv manifests/kube-scheduler.yaml ., moving kube-scheduler.yaml out of the manifests directory first.
  3. cp kube-scheduler.yaml kube-scheduler.yaml.backup; skip this step if you do not want a backup.
  4. Edit kube-scheduler.yaml in place, following the installation guide.
  5. mv kube-scheduler.yaml manifests, putting the edited file back into the manifests directory.
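
The same procedure as a minimal shell sketch (paths as in the steps above; the edit itself still follows the installation guide):

  cd /etc/kubernetes
  mv manifests/kube-scheduler.yaml .                   # take the manifest out of the watched directory
  cp kube-scheduler.yaml kube-scheduler.yaml.backup    # optional backup, kept outside manifests/
  # ... edit kube-scheduler.yaml per the installation guide ...
  mv kube-scheduler.yaml manifests/                    # putting it back recreates the static scheduler pod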

@fenwuyaoji

fenwuyaoji commented Aug 10, 2022 via email

@k0nstantinv

k0nstantinv commented Nov 8, 2022

Can somebody explain why the environment variable NVIDIA_VISIBLE_DEVICES fixes this and why it is needed in the manifest file?

It seems like this is how it works: AliyunContainerService/gpushare-device-plugin#55 (comment)

@karlhjm

karlhjm commented Nov 8, 2022 via email

@icovej

icovej commented Apr 30, 2023

May I ask which installation guide you followed?
