GPU-mem is the whole GB value, not MB value #16
Comments
Yes, if you want to use 7618MiB, you should change the unit into MiB.
I think it's due to the gRPC max message size. If you'd like to fix it, the fix should be similar to helm/helm#3514.
That didn't fix my problem. I reviewed the gpushare-device-plugin project code (https://github.com/AliyunContainerService/gpushare-device-plugin/blob/master/pkg/gpu/nvidia/nvidia.go#L82) and found that the minimum fake-ID payload size is len("GPU-17f59c6f-0e44-f0d8-675f-30833e525c5c-_-0") * 12195 * 8 > 4194304 bytes, which overflows the gRPC library's 4MB default.
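For context, here's a back-of-the-envelope check of that overflow in Go (the 12195 ID count and the ×8 per-ID factor are taken from the comment above and treated as estimates, not exact protobuf sizes):

```go
package main

import "fmt"

func main() {
	// Fake device ID format used by the plugin, per the comment above.
	fakeID := "GPU-17f59c6f-0e44-f0d8-675f-30833e525c5c-_-0"

	// With a 1MiB memory unit, the plugin advertises one fake device per MiB;
	// 12195 is the ID count reported in the comment above.
	numIDs := 12195

	// The x8 per-ID overhead factor also comes from the comment above.
	payload := len(fakeID) * numIDs * 8

	const grpcDefaultMax = 4 * 1024 * 1024 // gRPC's default max message size (4MiB)
	fmt.Printf("estimated payload %d bytes > %d byte limit: %v\n",
		payload, grpcDefaultMax, payload > grpcDefaultMax)
	// Output: estimated payload 4292640 bytes > 4194304 byte limit: true
}
```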
I mean you can increase the default gRPC max message size to 16MB in the source code of the kubelet and the device plugin, compile them into new binaries, and deploy those. I think that would work. Otherwise, you can use GiB as the memory unit.
Thanks, I agree with the solution. I'd recommend adding this case to the User Guide.
Thank you for your suggestions. Would you like to help?
In that case, it added almost 100,000 device IDs (object+string) just for that machine. It's a big waste of CPU and memory and risks causing crashes in the kubelet. This is an example of gRPC limits being helpful. Rather than messing with gRPC and building custom plugins and custom kubelets, you could and should just use a different unit. Something like 64MB, 100MB or 128MB is a reasonable compromise. Having to round up numbers also prevents you from packing things perfectly, which is perhaps a good idea if your pods will compete a lot for the same GPU.
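To put numbers on that trade-off, a quick sketch (assuming the plugin floors the total to whole units; its exact rounding may differ):

```go
package main

import "fmt"

func main() {
	totalMiB := 7618 // the g3s.xlarge GPU from this issue, per nvidia-smi

	// How many fake device IDs the plugin would advertise per candidate unit.
	for _, unitMiB := range []int{1, 64, 100, 128, 1024} {
		fmt.Printf("unit %4d MiB -> %4d device IDs\n", unitMiB, totalMiB/unitMiB)
	}
	// unit    1 MiB -> 7618 device IDs (per-MiB granularity, the risky extreme)
	// unit  128 MiB ->   59 device IDs (a reasonable compromise)
	// unit 1024 MiB ->    7 device IDs (the GiB default)
}
```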
Add the parameter grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(1024*1024*16)) to the dial method.
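For anyone following along, this is roughly where that option goes on the client side; a minimal sketch, assuming a plain grpc.Dial against the plugin's unix socket (the socket path here is illustrative, not the exact one from the kubelet or plugin source):

```go
package main

import (
	"log"

	"google.golang.org/grpc"
)

func dialDevicePlugin() (*grpc.ClientConn, error) {
	// Raise the client-side receive limit from the 4MiB default to 16MiB.
	return grpc.Dial(
		"unix:///var/lib/kubelet/device-plugins/gpushare.sock", // illustrative path
		grpc.WithInsecure(), // device-plugin sockets are local unix sockets
		grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(1024*1024*16)),
	)
}

func main() {
	conn, err := dialDevicePlugin()
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```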
Thanks, worked for me 👍.
It doesn't work for me. My case is that if I set MiB, the command "kubectl inspect gpushare" displays GPU memory in MiB, but when I request GPU memory in a pod, it reports: 0/3 nodes are available: 3 Insufficient GPU Memory in one device.
Right now on a g3s.xlarge instance I'm seeing the gpu-mem value being set to 7 though the host has 1 GPU with 7GB of memory (7618MiB according to nvidia-smi).
If I try to schedule a fraction of gpu-mem (1.5 for example) I'm told I need to use a whole integer.
Should the plugin be exporting 7618 as the gpu-mem value?
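For what it's worth, 7 is exactly what you'd get if the plugin floors the MiB total down to whole GiB, as sketched below (an assumption about the conversion, but consistent with the comments above about GiB being the default unit):

```go
package main

import "fmt"

func main() {
	totalMiB := 7618          // per nvidia-smi on the g3s.xlarge
	gpuMem := totalMiB / 1024 // integer division floors ~7.44GiB down to 7

	fmt.Println("gpu-mem =", gpuMem) // gpu-mem = 7
}
```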