GPU-mem is the whole GB value, not MB value #16

Open
reverson opened this issue Mar 15, 2019 · 13 comments

@reverson

Right now on a g3s.xlarge instance I'm seeing the gpu-mem value being set to 7 though the host has 1 GPU with 7GB of memory (7618MiB according to nvidia-smi).

If I try to schedule a fraction of gpu-mem (1.5 for example) I'm told I need to use a whole integer.

Should the plugin be exporting 7618 as the gpu-mem value?

@cheyang
Collaborator

cheyang commented Mar 16, 2019

Yes, if you want to use 7618 MiB, you should change the unit to MiB in https://github.com/AliyunContainerService/gpushare-device-plugin/blob/master/device-plugin-ds.yaml#L28.
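
For reference, the change amounts to the memory-unit argument in that DaemonSet manifest. Below is a rough sketch of the relevant section (the container and binary names here are approximations, not a verbatim copy of the file; the `--memory-unit` flag itself is confirmed later in this thread):

```yaml
# Hypothetical excerpt of device-plugin-ds.yaml; field layout follows the
# standard DaemonSet container spec and may not match the file exactly.
containers:
  - name: gpushare-device-plugin
    command:
      - gpushare-device-plugin-v2
      - -logtostderr
      - --v=5
      - --memory-unit=MiB   # change from the default GiB so gpu-mem is advertised in MiB
```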

@guunergooner
Contributor

Yes, if you want to use 7618 MiB, you should change the unit to MiB in https://github.com/AliyunContainerService/gpushare-device-plugin/blob/master/device-plugin-ds.yaml#L28.

  • I changed the unit to MiB and recreated device-plugin-ds, and kubelet.service on the node reported a gRPC error:
Mar 19 06:58:31 k8s-node-1 kubelet[12836]: E0319 06:58:31.266996   12836 endpoint.go:106] listAndWatch ended unexpectedly for device plugin aliyun.com/gpu-mem with error rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5862768 vs. 4194304)
Mar 19 06:58:31 k8s-node-1 kubelet[12836]: I0319 06:58:31.267070   12836 manager.go:430] Mark all resources Unhealthy for resource aliyun.com/gpu-mem
  • The gpushare device plugin pod logs are as below:
I0319 13:58:30.902668       1 main.go:18] Start gpushare device plugin
I0319 13:58:30.902780       1 gpumanager.go:28] Loading NVML
I0319 13:58:30.908589       1 gpumanager.go:37] Fetching devices.
I0319 13:58:30.908639       1 gpumanager.go:43] Starting FS watcher.
I0319 13:58:30.908785       1 gpumanager.go:51] Starting OS watcher.
I0319 13:58:30.924544       1 nvidia.go:64] Deivce GPU-bda0bcfa-022d-e4a5-ecb7-0ca863a47e75's Path is /dev/nvidia0
I0319 13:58:30.924630       1 nvidia.go:69] # device Memory: 12196
I0319 13:58:30.924649       1 nvidia.go:40] set gpu memory: 12196
I0319 13:58:30.924659       1 nvidia.go:76] # Add first device ID: GPU-bda0bcfa-022d-e4a5-ecb7-0ca863a47e75-_-0
I0319 13:58:30.935332       1 nvidia.go:79] # Add last device ID: GPU-bda0bcfa-022d-e4a5-ecb7-0ca863a47e75-_-12195
I0319 13:58:30.950346       1 nvidia.go:64] Deivce GPU-a12a3921-ea32-1160-c3b0-394b977ffc84's Path is /dev/nvidia1
I0319 13:58:30.950378       1 nvidia.go:69] # device Memory: 12196
I0319 13:58:30.950388       1 nvidia.go:76] # Add first device ID: GPU-a12a3921-ea32-1160-c3b0-394b977ffc84-_-0
I0319 13:58:30.959102       1 nvidia.go:79] # Add last device ID: GPU-a12a3921-ea32-1160-c3b0-394b977ffc84-_-12195
I0319 13:58:30.985063       1 nvidia.go:64] Deivce GPU-4f7ecd0f-69ca-45ab-558e-f0d798c8d181's Path is /dev/nvidia2
I0319 13:58:30.985110       1 nvidia.go:69] # device Memory: 12196
I0319 13:58:30.985119       1 nvidia.go:76] # Add first device ID: GPU-4f7ecd0f-69ca-45ab-558e-f0d798c8d181-_-0
I0319 13:58:30.995293       1 nvidia.go:79] # Add last device ID: GPU-4f7ecd0f-69ca-45ab-558e-f0d798c8d181-_-12195
I0319 13:58:31.047900       1 nvidia.go:64] Deivce GPU-17f59c6f-0e44-f0d8-675f-30833e525c5c's Path is /dev/nvidia3
I0319 13:58:31.047935       1 nvidia.go:69] # device Memory: 12196
I0319 13:58:31.047946       1 nvidia.go:76] # Add first device ID: GPU-17f59c6f-0e44-f0d8-675f-30833e525c5c-_-0
I0319 13:58:31.054558       1 nvidia.go:79] # Add last device ID: GPU-17f59c6f-0e44-f0d8-675f-30833e525c5c-_-12195
I0319 13:58:31.087392       1 nvidia.go:64] Deivce GPU-c9d55403-db94-541a-098e-aa1a4fac438c's Path is /dev/nvidia4
I0319 13:58:31.087415       1 nvidia.go:69] # device Memory: 12196
I0319 13:58:31.087423       1 nvidia.go:76] # Add first device ID: GPU-c9d55403-db94-541a-098e-aa1a4fac438c-_-0
I0319 13:58:31.093386       1 nvidia.go:79] # Add last device ID: GPU-c9d55403-db94-541a-098e-aa1a4fac438c-_-12195
I0319 13:58:31.124518       1 nvidia.go:64] Deivce GPU-6c5d0cb4-ab2c-3eb8-5c1f-531d39d11579's Path is /dev/nvidia5
I0319 13:58:31.124535       1 nvidia.go:69] # device Memory: 12196
I0319 13:58:31.124541       1 nvidia.go:76] # Add first device ID: GPU-6c5d0cb4-ab2c-3eb8-5c1f-531d39d11579-_-0
I0319 13:58:31.134973       1 nvidia.go:79] # Add last device ID: GPU-6c5d0cb4-ab2c-3eb8-5c1f-531d39d11579-_-12195
I0319 13:58:31.171276       1 nvidia.go:64] Deivce GPU-d5ac7a2c-c032-3f23-6244-2fc08f8aa363's Path is /dev/nvidia6
I0319 13:58:31.171312       1 nvidia.go:69] # device Memory: 12196
I0319 13:58:31.171323       1 nvidia.go:76] # Add first device ID: GPU-d5ac7a2c-c032-3f23-6244-2fc08f8aa363-_-0
I0319 13:58:31.179836       1 nvidia.go:79] # Add last device ID: GPU-d5ac7a2c-c032-3f23-6244-2fc08f8aa363-_-12195
I0319 13:58:31.215859       1 nvidia.go:64] Deivce GPU-0dd2b0c3-3f55-5872-3e17-d6b889e77750's Path is /dev/nvidia7
I0319 13:58:31.215904       1 nvidia.go:69] # device Memory: 12196
I0319 13:58:31.215916       1 nvidia.go:76] # Add first device ID: GPU-0dd2b0c3-3f55-5872-3e17-d6b889e77750-_-0
I0319 13:58:31.223627       1 nvidia.go:79] # Add last device ID: GPU-0dd2b0c3-3f55-5872-3e17-d6b889e77750-_-12195
I0319 13:58:31.223647       1 server.go:43] Device Map: map[GPU-bda0bcfa-022d-e4a5-ecb7-0ca863a47e75:0 GPU-a12a3921-ea32-1160-c3b0-394b977ffc84:1 GPU-4f7ecd0f-69ca-45ab-558e-f0d798c8d181:2 GPU-17f59c6f-0e44-f0d8-675f-30833e525c5c:3 GPU-c9d55403-db94-541a-098e-aa1a4fac438c:4 GPU-6c5d0cb4-ab2c-3eb8-5c1f-531d39d11579:5 GPU-d5ac7a2c-c032-3f23-6244-2fc08f8aa363:6 GPU-0dd2b0c3-3f55-5872-3e17-d6b889e77750:7]
I0319 13:58:31.223707       1 server.go:44] Device List: [GPU-d5ac7a2c-c032-3f23-6244-2fc08f8aa363 GPU-0dd2b0c3-3f55-5872-3e17-d6b889e77750 GPU-bda0bcfa-022d-e4a5-ecb7-0ca863a47e75 GPU-a12a3921-ea32-1160-c3b0-394b977ffc84 GPU-4f7ecd0f-69ca-45ab-558e-f0d798c8d181 GPU-17f59c6f-0e44-f0d8-675f-30833e525c5c GPU-c9d55403-db94-541a-098e-aa1a4fac438c GPU-6c5d0cb4-ab2c-3eb8-5c1f-531d39d11579]
I0319 13:58:31.248160       1 podmanager.go:68] No need to update Capacity aliyun.com/gpu-count
I0319 13:58:31.249329       1 server.go:222] Starting to serve on /var/lib/kubelet/device-plugins/aliyungpushare.sock
I0319 13:58:31.250685       1 server.go:230] Registered device plugin with Kubelet
  • The nvidia-smi output from my physical machine is below; I think using the MiB unit with multiple cards causes the gRPC message to overflow:
Tue Mar 19 07:09:10 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    On   | 00000000:04:00.0 Off |                  N/A |
| 23%   30C    P8     7W / 250W |      1MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    On   | 00000000:05:00.0 Off |                  N/A |
| 23%   29C    P8     7W / 250W |      1MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN X (Pascal)    On   | 00000000:08:00.0 Off |                  N/A |
| 23%   26C    P8     7W / 250W |      1MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN X (Pascal)    On   | 00000000:09:00.0 Off |                  N/A |
| 23%   24C    P8     9W / 250W |      1MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  TITAN X (Pascal)    On   | 00000000:84:00.0 Off |                  N/A |
| 23%   28C    P8     9W / 250W |      1MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  TITAN X (Pascal)    On   | 00000000:85:00.0 Off |                  N/A |
| 23%   31C    P8     7W / 250W |      1MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  TITAN X (Pascal)    On   | 00000000:88:00.0 Off |                  N/A |
| 23%   23C    P8     7W / 250W |      1MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  TITAN X (Pascal)    On   | 00000000:89:00.0 Off |                  N/A |
| 23%   24C    P8     8W / 250W |      1MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@cheyang
Collaborator

cheyang commented Mar 19, 2019

I think it's due to the gRPC max message size. If you'd like to fix it, it should be similar to helm/helm#3514.

@guunergooner
Contributor

I think it's due to the gRPC max message size. If you'd like to fix it, it should be similar to helm/helm#3514.

That doesn't fix my problem. I reviewed the gpushare-device-plugin code (https://github.com/AliyunContainerService/gpushare-device-plugin/blob/master/pkg/gpu/nvidia/nvidia.go#L82) and found that the minimum size of the fake device IDs alone is roughly len("GPU-17f59c6f-0e44-f0d8-675f-30833e525c5c-_-0") * 12196 IDs * 8 GPUs > 4194304 bytes, which overflows the gRPC library's default 4 MB limit.
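
To sanity-check that estimate, here is a standalone sketch (not code from the plugin); the ID format and device counts are taken from the log output above:

```go
package main

import "fmt"

func main() {
	// Fake device IDs observed in the plugin log: "GPU-<uuid>-_-<index>".
	sampleID := "GPU-17f59c6f-0e44-f0d8-675f-30833e525c5c-_-0"
	idsPerGPU := 12196 // one fake device per MiB on a 12196 MiB card
	gpus := 8

	// Lower bound on the ListAndWatch payload: the ID strings alone,
	// ignoring protobuf field overhead and per-device health data.
	minBytes := len(sampleID) * idsPerGPU * gpus
	const grpcDefaultMax = 4 * 1024 * 1024 // 4194304, the gRPC default limit

	fmt.Printf("at least %d bytes vs limit %d\n", minBytes, grpcDefaultMax)
	// Prints roughly 4.3 MB vs 4194304, consistent with the kubelet error
	// "received message larger than max (5862768 vs. 4194304)".
}
```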

@cheyang
Collaborator

cheyang commented Mar 20, 2019

I mean you can increase the default gRPC max message size in the source code of the kubelet and the device plugin to 16 MB, compile them into new binaries, and then deploy them. I think that would work. Otherwise, you can use GiB as the memory unit.
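
For anyone taking that route, the change amounts to a dial option on the gRPC client that receives the ListAndWatch responses. Below is a minimal standalone sketch (the socket path comes from the plugin log above; the dial helper is a placeholder, not the actual kubelet code):

```go
package main

import (
	"context"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
)

// dialPlugin illustrates the kind of change meant here: raising the gRPC
// client's maximum receive size so a large ListAndWatch response is not
// rejected. The helper and socket handling are placeholders.
func dialPlugin(socket string) (*grpc.ClientConn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	return grpc.DialContext(ctx, socket,
		grpc.WithInsecure(),
		grpc.WithBlock(),
		// Dial the device-plugin unix socket directly.
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", addr)
		}),
		// The default limit is 4 MiB; raise it to 16 MiB as suggested above.
		grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(16*1024*1024)),
	)
}

func main() {
	conn, err := dialPlugin("/var/lib/kubelet/device-plugins/aliyungpushare.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```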

@guunergooner
Contributor

I mean you can increase the default gRPC max message size in the source code of the kubelet and the device plugin to 16 MB, compile them into new binaries, and then deploy them. I think that would work. Otherwise, you can use GiB as the memory unit.

Thanks, I agree with that solution. It would be good to add this case to the User Guide.

@cheyang
Collaborator

cheyang commented Mar 23, 2019

Thank you for your suggestions. Would you like to help?

@therc

therc commented May 7, 2019

In that case, it added almost 100,000 device IDs (object+string) just for that machine. It's a big waste of CPU and memory and risks causing crashes in the kubelet. This is an example of gRPC limits being helpful.

Rather than messing with gRPC and building custom plugins and custom kubelets, you could and should just use a different unit. Something like 64MB, 100MB or 128MB is a reasonable compromise. Having to round up numbers also prevents you from packing things perfectly, which is perhaps a good idea if your pods will compete a lot for the same GPU.
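
To put rough numbers on that trade-off, here is a standalone sketch using the 12196 MiB cards from the logs above (note that a later comment points out the plugin's --memory-unit flag currently only accepts MiB or GiB, so a 128 MiB unit would need a code change):

```go
package main

import "fmt"

func main() {
	const gpuMemMiB = 12196 // per-card memory from the plugin logs above
	const gpus = 8

	// Number of fake device IDs advertised per node for a given unit size.
	for _, unitMiB := range []int{1, 128, 1024} {
		ids := (gpuMemMiB + unitMiB - 1) / unitMiB * gpus // round up per card
		fmt.Printf("unit %4d MiB -> %6d device IDs per node\n", unitMiB, ids)
	}
	// 1 MiB   -> 97568 IDs (the case that overflows the 4 MiB gRPC limit)
	// 128 MiB ->   768 IDs
	// 1 GiB   ->    96 IDs
}
```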

@zlingqu

zlingqu commented Apr 6, 2021

Add the parameter grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(1024*1024*16)) to the dial method.

  • Result:
[root@jenkins ~]# kubectl inspect gpushare
NAME           IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  GPU4(Allocated/Total)  GPU5(Allocated/Total)  GPU6(Allocated/Total)  GPU7(Allocated/Total)  GPU Memory(MiB)
192.168.68.13  192.168.68.13  0/12066                0/12066                0/12066                0/12066                0/12066                0/12066                0/12066                0/12066                0/96528
192.168.68.5   192.168.68.5   0/11178                0/11178                0/11178                0/11178                0/11178                0/11178                0/11178                0/11178                0/89424
------------------------------------------------------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
0/185952 (0%) 

@joy717

joy717 commented Oct 12, 2021

@cheyang @therc
Hi, can you tell me how to set the unit to 128MiB?
I've checked the code, and --memory-unit only accepts MiB or GiB.
If I set it to 128MiB, the unit falls back to GiB.

@debMan

debMan commented Nov 1, 2021

Yes, if you want to use 7618 MiB, you should change the unit to MiB in https://github.com/AliyunContainerService/gpushare-device-plugin/blob/master/device-plugin-ds.yaml#L28.

Thanks, worked for me 👍.

@sloth2012

Yes, if you want to use 7618 MiB, you should change the unit to MiB in https://github.com/AliyunContainerService/gpushare-device-plugin/blob/master/device-plugin-ds.yaml#L28.

Thanks, worked for me 👍.

It doesn't work for me: --memory-unit is set to MiB, but aliyun.com/gpu-mem still uses GiB.

@harrymore

My case is that if I set MiB, the command "kubectl inspect gpushare" displays GPU memory with the MiB unit, but when I request GPU memory for a pod, it tells me: 0/3 nodes are available: 3 Insufficient GPU Memory in one device.
