fix: nil pointer dereference around getDevices:d.CPUAffinity #141
@@ -40,13 +40,18 @@ func getDevices() []*pluginapi.Device {
	for i := uint(0); i < n; i++ {
		d, err := nvml.NewDeviceLite(i)
		check(err)
		var cpu_affinity int64 = 0
		if d.CPUAffinity != nil {
			cpu_affinity = int64(*(d.CPUAffinity))
If d.CPUAffinity is not set, then we shouldn’t force it to 0. Instead, we should just not set the Topology field of the device at all.
When is it the case that d.CPUAffinity is not set, such that this occurs?
I got the same problem.
Since NewDeviceLite does not getCPUAffinity, CPUAffinity will always be nil (maybe).
https://github.com/NVIDIA/gpu-monitoring-tools/blob/0474d08c7a070ac97e9723d8f18f9cfaa64d5918/bindings/go/nvml/nvml.go#L395
If NewDevice is used, CPUAffinity may be obtained.
https://github.com/NVIDIA/gpu-monitoring-tools/blob/0474d08c7a070ac97e9723d8f18f9cfaa64d5918/bindings/go/nvml/nvml.go#L371
@klueska So should NewDeviceLite() be replaced by NewDevice(), which always sets CPUAffinity in numaNode()?
I think it may be more complicated than that, because NewDevice() currently always sets CPUAffinity to "something" (e.g. 0) even if the underlying host doesn't have any NUMA info for it (i.e. the system returns -1). I think we will need to change NVML to change the way it reports things, in addition to changing the plugin. I plan to look into this later today.
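To illustrate the distinction being asked for here, below is a minimal sketch of how a binding could report "no NUMA info" as nil rather than as 0. The helper name numaNodeToAffinity and the *uint return type are assumptions made for illustration; this is not the actual code in the NVML bindings.

```go
package main

import "fmt"

// numaNodeToAffinity is a hypothetical helper (not part of the NVML bindings).
// It maps the NUMA node reported by the system to an optional affinity:
// a negative value such as -1 means "no NUMA info", so nil is returned
// instead of silently collapsing it to node 0.
func numaNodeToAffinity(node int) *uint {
	if node < 0 {
		return nil
	}
	n := uint(node)
	return &n
}

func main() {
	for _, node := range []int{-1, 0, 1} {
		if a := numaNodeToAffinity(node); a == nil {
			fmt.Printf("node %d: no affinity reported\n", node)
		} else {
			fmt.Printf("node %d: affinity %d\n", node, *a)
		}
	}
}
```

With this shape, a real node 0 and a missing NUMA node are no longer indistinguishable to the caller.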
Once this merges https://gitlab.com/nvidia/container-toolkit/gpu-monitoring-tools/merge_requests/4, we can update the vendoring for the NVML bindings and update this patch to selectively set Topology based on whether d.CPUAffinity is nil or not.
OK. That change has merged. You will need to update the vendoring to pull it in and then update the logic here to only set Topology if d.CPUAffinity != nil.
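A rough sketch of the requested logic follows, for reference. The types below are local stand-ins that mirror the shape of nvml.Device and of pluginapi.Device, TopologyInfo and NUMANode as I understand them; the function name buildDevice and the CPUAffinity *uint field type are assumptions, so treat this as an illustration rather than the final patch.

```go
package main

import "fmt"

// gpuDevice is a stand-in for nvml.Device with only the fields used here.
// Assumption: CPUAffinity is a pointer that is nil when no NUMA info exists.
type gpuDevice struct {
	UUID        string
	CPUAffinity *uint
}

// Stand-ins mirroring the device plugin API types (assumed shapes).
type NUMANode struct{ ID int64 }
type TopologyInfo struct{ Nodes []*NUMANode }
type Device struct {
	ID       string
	Health   string
	Topology *TopologyInfo
}

// buildDevice (hypothetical name) sets Topology only when CPUAffinity is
// known. When it is nil, Topology stays unset instead of being forced to 0.
func buildDevice(d *gpuDevice) *Device {
	dev := &Device{ID: d.UUID, Health: "Healthy"}
	if d.CPUAffinity != nil {
		dev.Topology = &TopologyInfo{
			Nodes: []*NUMANode{{ID: int64(*d.CPUAffinity)}},
		}
	}
	return dev
}

func main() {
	node := uint(1)
	withNUMA := buildDevice(&gpuDevice{UUID: "GPU-a", CPUAffinity: &node})
	withoutNUMA := buildDevice(&gpuDevice{UUID: "GPU-b"})
	fmt.Println(withNUMA.Topology != nil, withoutNUMA.Topology != nil)
}
```

The nil check both avoids the dereference crash and keeps a device without NUMA info from being advertised as belonging to node 0.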
Hello! Can you move your PR to: https://gitlab.com/nvidia/kubernetes/device-plugin Also, it looks like you haven't signed off your commits. Do you mind taking care of it? Thanks a lot!
Merged :)
I hope it didn’t merge without my suggested changes. Where can I see the final PR?
@klueska I believe your changes were taken into account, the MR was opened on gitlab here:
Yep. Looks good. Thanks!
As mentioned in issue #140 by @jucrouzet and @RenaudWasTaken, when CPUAffinity is not set, *(d.CPUAffinity) dereferences a nil pointer, which causes the whole process to exit.