GPU-mem is the whole GB value, not MB value #16

Open
reverson opened this issue Mar 15, 2019 · 13 comments

@reverson

Right now on a g3s.xlarge instance I'm seeing the gpu-mem value being set to 7 though the host has 1 GPU with 7GB of memory (7618MiB according to nvidia-smi).

If I try to schedule a fraction of gpu-mem (1.5 for example) I'm told I need to use a whole integer.

Should the plugin be exporting 7618 as the gpu-mem value?

@cheyang
Collaborator

cheyang commented Mar 16, 2019

Yes, if you want to use 7618 MiB, you should change the unit to MiB in https://github.com/AliyunContainerService/gpushare-device-plugin/blob/master/device-plugin-ds.yaml#L28.
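
For reference, the change amounts to the memory-unit argument in that DaemonSet manifest. Below is a rough sketch of the relevant section (the container and binary names here are approximations, not a verbatim copy of the file; the `--memory-unit` flag itself is confirmed later in this thread):

```yaml
# Hypothetical excerpt of device-plugin-ds.yaml; field layout follows the
# standard DaemonSet container spec and may not match the file exactly.
containers:
  - name: gpushare-device-plugin
    command:
      - gpushare-device-plugin-v2
      - -logtostderr
      - --v=5
      - --memory-unit=MiB   # change from the default GiB so gpu-mem is advertised in MiB
```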

@guunergooner
Contributor

Yes, if you want to use 7618 MiB, you should change the unit to MiB in https://github.com/AliyunContainerService/gpushare-device-plugin/blob/master/device-plugin-ds.yaml#L28.

  • I changed the unit to MiB and recreated device-plugin-ds, and kubelet.service on the node reported a gRPC error:
Mar 19 06:58:31 k8s-node-1 kubelet[12836]: E0319 06:58:31.266996   12836 endpoint.go:106] listAndWatch ended unexpectedly for device plugin aliyun.com/gpu-mem with error rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5862768 vs. 4194304)
Mar 19 06:58:31 k8s-node-1 kubelet[12836]: I0319 06:58:31.267070   12836 manager.go:430] Mark all resources Unhealthy for resource aliyun.com/gpu-mem
  • The gpushare device plugin pod logs are as below:
I0319 13:58:30.902668       1 main.go:18] Start gpushare device plugin
I0319 13:58:30.902780       1 gpumanager.go:28] Loading NVML
I0319 13:58:30.908589       1 gpumanager.go:37] Fetching devices.
I0319 13:58:30.908639       1 gpumanager.go:43] Starting FS watcher.
I0319 13:58:30.908785       1 gpumanager.go:51] Starting OS watcher.
I0319 13:58:30.924544       1 nvidia.go:64] Deivce GPU-bda0bcfa-022d-e4a5-ecb7-0ca863a47e75's Path is /dev/nvidia0
I0319 13:58:30.924630       1 nvidia.go:69] # device Memory: 12196
I0319 13:58:30.924649       1 nvidia.go:40] set gpu memory: 12196
I0319 13:58:30.924659       1 nvidia.go:76] # Add first device ID: GPU-bda0bcfa-022d-e4a5-ecb7-0ca863a47e75-_-0
I0319 13:58:30.935332       1 nvidia.go:79] # Add last device ID: GPU-bda0bcfa-022d-e4a5-ecb7-0ca863a47e75-_-12195
I0319 13:58:30.950346       1 nvidia.go:64] Deivce GPU-a12a3921-ea32-1160-c3b0-394b977ffc84's Path is /dev/nvidia1
I0319 13:58:30.950378       1 nvidia.go:69] # device Memory: 12196
I0319 13:58:30.950388       1 nvidia.go:76] # Add first device ID: GPU-a12a3921-ea32-1160-c3b0-394b977ffc84-_-0
I0319 13:58:30.959102       1 nvidia.go:79] # Add last device ID: GPU-a12a3921-ea32-1160-c3b0-394b977ffc84-_-12195
I0319 13:58:30.985063       1 nvidia.go:64] Deivce GPU-4f7ecd0f-69ca-45ab-558e-f0d798c8d181's Path is /dev/nvidia2
I0319 13:58:30.985110       1 nvidia.go:69] # device Memory: 12196
I0319 13:58:30.985119       1 nvidia.go:76] # Add first device ID: GPU-4f7ecd0f-69ca-45ab-558e-f0d798c8d181-_-0
I0319 13:58:30.995293       1 nvidia.go:79] # Add last device ID: GPU-4f7ecd0f-69ca-45ab-558e-f0d798c8d181-_-12195
I0319 13:58:31.047900       1 nvidia.go:64] Deivce GPU-17f59c6f-0e44-f0d8-675f-30833e525c5c's Path is /dev/nvidia3
I0319 13:58:31.047935       1 nvidia.go:69] # device Memory: 12196
I0319 13:58:31.047946       1 nvidia.go:76] # Add first device ID: GPU-17f59c6f-0e44-f0d8-675f-30833e525c5c-_-0
I0319 13:58:31.054558       1 nvidia.go:79] # Add last device ID: GPU-17f59c6f-0e44-f0d8-675f-30833e525c5c-_-12195
I0319 13:58:31.087392       1 nvidia.go:64] Deivce GPU-c9d55403-db94-541a-098e-aa1a4fac438c's Path is /dev/nvidia4
I0319 13:58:31.087415       1 nvidia.go:69] # device Memory: 12196
I0319 13:58:31.087423       1 nvidia.go:76] # Add first device ID: GPU-c9d55403-db94-541a-098e-aa1a4fac438c-_-0
I0319 13:58:31.093386       1 nvidia.go:79] # Add last device ID: GPU-c9d55403-db94-541a-098e-aa1a4fac438c-_-12195
I0319 13:58:31.124518       1 nvidia.go:64] Deivce GPU-6c5d0cb4-ab2c-3eb8-5c1f-531d39d11579's Path is /dev/nvidia5
I0319 13:58:31.124535       1 nvidia.go:69] # device Memory: 12196
I0319 13:58:31.124541       1 nvidia.go:76] # Add first device ID: GPU-6c5d0cb4-ab2c-3eb8-5c1f-531d39d11579-_-0
I0319 13:58:31.134973       1 nvidia.go:79] # Add last device ID: GPU-6c5d0cb4-ab2c-3eb8-5c1f-531d39d11579-_-12195
I0319 13:58:31.171276       1 nvidia.go:64] Deivce GPU-d5ac7a2c-c032-3f23-6244-2fc08f8aa363's Path is /dev/nvidia6
I0319 13:58:31.171312       1 nvidia.go:69] # device Memory: 12196
I0319 13:58:31.171323       1 nvidia.go:76] # Add first device ID: GPU-d5ac7a2c-c032-3f23-6244-2fc08f8aa363-_-0
I0319 13:58:31.179836       1 nvidia.go:79] # Add last device ID: GPU-d5ac7a2c-c032-3f23-6244-2fc08f8aa363-_-12195
I0319 13:58:31.215859       1 nvidia.go:64] Deivce GPU-0dd2b0c3-3f55-5872-3e17-d6b889e77750's Path is /dev/nvidia7
I0319 13:58:31.215904       1 nvidia.go:69] # device Memory: 12196
I0319 13:58:31.215916       1 nvidia.go:76] # Add first device ID: GPU-0dd2b0c3-3f55-5872-3e17-d6b889e77750-_-0
I0319 13:58:31.223627       1 nvidia.go:79] # Add last device ID: GPU-0dd2b0c3-3f55-5872-3e17-d6b889e77750-_-12195
I0319 13:58:31.223647       1 server.go:43] Device Map: map[GPU-bda0bcfa-022d-e4a5-ecb7-0ca863a47e75:0 GPU-a12a3921-ea32-1160-c3b0-394b977ffc84:1 GPU-4f7ecd0f-69ca-45ab-558e-f0d798c8d181:2 GPU-17f59c6f-0e44-f0d8-675f-30833e525c5c:3 GPU-c9d55403-db94-541a-098e-aa1a4fac438c:4 GPU-6c5d0cb4-ab2c-3eb8-5c1f-531d39d11579:5 GPU-d5ac7a2c-c032-3f23-6244-2fc08f8aa363:6 GPU-0dd2b0c3-3f55-5872-3e17-d6b889e77750:7]
I0319 13:58:31.223707       1 server.go:44] Device List: [GPU-d5ac7a2c-c032-3f23-6244-2fc08f8aa363 GPU-0dd2b0c3-3f55-5872-3e17-d6b889e77750 GPU-bda0bcfa-022d-e4a5-ecb7-0ca863a47e75 GPU-a12a3921-ea32-1160-c3b0-394b977ffc84 GPU-4f7ecd0f-69ca-45ab-558e-f0d798c8d181 GPU-17f59c6f-0e44-f0d8-675f-30833e525c5c GPU-c9d55403-db94-541a-098e-aa1a4fac438c GPU-6c5d0cb4-ab2c-3eb8-5c1f-531d39d11579]
I0319 13:58:31.248160       1 podmanager.go:68] No need to update Capacity aliyun.com/gpu-count
I0319 13:58:31.249329       1 server.go:222] Starting to serve on /var/lib/kubelet/device-plugins/aliyungpushare.sock
I0319 13:58:31.250685       1 server.go:230] Registered device plugin with Kubelet
  • The nvidia-smi output from my physical machine is below; I think using the MiB unit with multiple cards causes the gRPC message to overflow:
Tue Mar 19 07:09:10 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    On   | 00000000:04:00.0 Off |                  N/A |
| 23%   30C    P8     7W / 250W |      1MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    On   | 00000000:05:00.0 Off |                  N/A |
| 23%   29C    P8     7W / 250W |      1MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN X (Pascal)    On   | 00000000:08:00.0 Off |                  N/A |
| 23%   26C    P8     7W / 250W |      1MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN X (Pascal)    On   | 00000000:09:00.0 Off |                  N/A |
| 23%   24C    P8     9W / 250W |      1MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  TITAN X (Pascal)    On   | 00000000:84:00.0 Off |                  N/A |
| 23%   28C    P8     9W / 250W |      1MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  TITAN X (Pascal)    On   | 00000000:85:00.0 Off |                  N/A |
| 23%   31C    P8     7W / 250W |      1MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  TITAN X (Pascal)    On   | 00000000:88:00.0 Off |                  N/A |
| 23%   23C    P8     7W / 250W |      1MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  TITAN X (Pascal)    On   | 00000000:89:00.0 Off |                  N/A |
| 23%   24C    P8     8W / 250W |      1MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@cheyang
Collaborator

cheyang commented Mar 19, 2019

I think it's due to the gRPC max message size. If you'd like to fix it, it should be similar to helm/helm#3514.

@guunergooner
Contributor

I think it's due to the gRPC max message size. If you'd like to fix it, it should be similar to helm/helm#3514.

That doesn't fix my problem. I reviewed the gpushare-device-plugin code (https://github.com/AliyunContainerService/gpushare-device-plugin/blob/master/pkg/gpu/nvidia/nvidia.go#L82) and found that the minimum size of the fake device IDs alone is roughly len("GPU-17f59c6f-0e44-f0d8-675f-30833e525c5c-_-0") * 12196 IDs * 8 GPUs > 4194304 bytes, which overflows the gRPC library's default 4 MB limit.
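
To sanity-check that estimate, here is a standalone sketch (not code from the plugin); the ID format and device counts are taken from the log output above:

```go
package main

import "fmt"

func main() {
	// Fake device IDs observed in the plugin log: "GPU-<uuid>-_-<index>".
	sampleID := "GPU-17f59c6f-0e44-f0d8-675f-30833e525c5c-_-0"
	idsPerGPU := 12196 // one fake device per MiB on a 12196 MiB card
	gpus := 8

	// Lower bound on the ListAndWatch payload: the ID strings alone,
	// ignoring protobuf field overhead and per-device health data.
	minBytes := len(sampleID) * idsPerGPU * gpus
	const grpcDefaultMax = 4 * 1024 * 1024 // 4194304, the gRPC default limit

	fmt.Printf("at least %d bytes vs limit %d\n", minBytes, grpcDefaultMax)
	// Prints roughly 4.3 MB vs 4194304, consistent with the kubelet error
	// "received message larger than max (5862768 vs. 4194304)".
}
```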

@cheyang
Collaborator

cheyang commented Mar 20, 2019

I mean you can increase the default gRPC max message size in the source code of the kubelet and the device plugin to 16 MB, compile them into new binaries, and then deploy them. I think that would work. Otherwise, you can use GiB as the memory unit.
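
For anyone taking that route, the change amounts to a dial option on the gRPC client that receives the ListAndWatch responses. Below is a minimal standalone sketch (the socket path comes from the plugin log above; the dial helper is a placeholder, not the actual kubelet code):

```go
package main

import (
	"context"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
)

// dialPlugin illustrates the kind of change meant here: raising the gRPC
// client's maximum receive size so a large ListAndWatch response is not
// rejected. The helper and socket handling are placeholders.
func dialPlugin(socket string) (*grpc.ClientConn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	return grpc.DialContext(ctx, socket,
		grpc.WithInsecure(),
		grpc.WithBlock(),
		// Dial the device-plugin unix socket directly.
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", addr)
		}),
		// The default limit is 4 MiB; raise it to 16 MiB as suggested above.
		grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(16*1024*1024)),
	)
}

func main() {
	conn, err := dialPlugin("/var/lib/kubelet/device-plugins/aliyungpushare.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```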

@guunergooner
Contributor

I mean you can increase the default gRPC max message size in the source code of the kubelet and the device plugin to 16 MB, compile them into new binaries, and then deploy them. I think that would work. Otherwise, you can use GiB as the memory unit.

Thanks, I agree with that solution. It would be good to add this case to the User Guide.

@cheyang
Collaborator

cheyang commented Mar 23, 2019

Thank you for your suggestions. Would you like to help?

@therc

therc commented May 7, 2019

In that case, it added almost 100,000 device IDs (object+string) just for that machine. It's a big waste of CPU and memory and risks causing crashes in the kubelet. This is an example of gRPC limits being helpful.

Rather than messing with gRPC and building custom plugins and custom kubelets, you could and should just use a different unit. Something like 64MB, 100MB or 128MB is a reasonable compromise. Having to round up numbers also prevents you from packing things perfectly, which is perhaps a good idea if your pods will compete a lot for the same GPU.
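
To put rough numbers on that trade-off, here is a standalone sketch using the 12196 MiB cards from the logs above (note that a later comment points out the plugin's --memory-unit flag currently only accepts MiB or GiB, so a 128 MiB unit would need a code change):

```go
package main

import "fmt"

func main() {
	const gpuMemMiB = 12196 // per-card memory from the plugin logs above
	const gpus = 8

	// Number of fake device IDs advertised per node for a given unit size.
	for _, unitMiB := range []int{1, 128, 1024} {
		ids := (gpuMemMiB + unitMiB - 1) / unitMiB * gpus // round up per card
		fmt.Printf("unit %4d MiB -> %6d device IDs per node\n", unitMiB, ids)
	}
	// 1 MiB   -> 97568 IDs (the case that overflows the 4 MiB gRPC limit)
	// 128 MiB ->   768 IDs
	// 1 GiB   ->    96 IDs
}
```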

@zlingqu

zlingqu commented Apr 6, 2021

Add the parameter grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(1024*1024*16)) to the dial method.

  • Result:
[root@jenkins ~]# kubectl inspect gpushare
NAME           IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  GPU4(Allocated/Total)  GPU5(Allocated/Total)  GPU6(Allocated/Total)  GPU7(Allocated/Total)  GPU Memory(MiB)
192.168.68.13  192.168.68.13  0/12066                0/12066                0/12066                0/12066                0/12066                0/12066                0/12066                0/12066                0/96528
192.168.68.5   192.168.68.5   0/11178                0/11178                0/11178                0/11178                0/11178                0/11178                0/11178                0/11178                0/89424
------------------------------------------------------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
0/185952 (0%) 

@joy717

joy717 commented Oct 12, 2021

@cheyang @therc
Hi, can you tell me how to set the unit to 128MiB?
I've checked the code, and --memory-unit only accepts MiB or GiB.
If I set it to 128MiB, the unit falls back to GiB.

@debMan

debMan commented Nov 1, 2021

Yes, if you want to use 7618 MiB, you should change the unit to MiB in https://github.com/AliyunContainerService/gpushare-device-plugin/blob/master/device-plugin-ds.yaml#L28.

Thanks, worked for me 👍.

@sloth2012

Yes, if you want to use 7618 MiB, you should change the unit to MiB in https://github.com/AliyunContainerService/gpushare-device-plugin/blob/master/device-plugin-ds.yaml#L28.

Thanks, worked for me 👍.

It doesn't work for me: --memory-unit is set to MiB, but aliyun.com/gpu-mem still uses GiB.

@harrymore

My case is that if I set MiB, the command "kubectl inspect gpushare" displays GPU memory with the MiB unit, but when I request GPU memory for a pod, it tells me: 0/3 nodes are available: 3 Insufficient GPU Memory in one device.
