nVidia GPUs not listed when configured for PCI-Passthrough #5968
Comments
After further investigation, commenting out lines 114 to 118 in https://github.com/OpenNebula/one/blob/master/src/im_mad/remotes/node-probes.d/pci.rb#L114 allows the GPUs to be listed (albeit without the names showing, only IDs). The offending lines are:
So it looks like a bug introduced in 6.4, when vGPU support was added with commit 7f71959.
@JungleCatSW @gavin-cudo Do you have a workaround to actually get PCI passthrough to work on 6.4? We found ourselves in the same boat with an invisible GPU until we commented out the filtering in pci.rb. We face a similar problem to this ON Forum Post, with the full error message (04:00.0 is our host PCI address).
@JungleCatSW Unfortunately, applying this change only fixes the invisibility of the PCI device, but not the passthrough error when booting a new VM with a passthrough GPU (not a vGPU). It looks to me like #5968 and the passthrough problem are related, as ON currently wants to get a mediated device (vGPU) instead of a PCI passthrough device. Did I maybe miss a configuration option from the official PCI passthrough documentation which enables vGPUs by default? (Error formatted for clarity, taken from the GUI when spawning a new VM.)
@cirquit We had the same issue. Once you have added a host with the old pci.rb and pci.conf, the UUID gets stored, so even when you correct them the PCI data just gets merged with the old, incorrect data. Try querying the host, scroll up, and look for the PCI section to see if there is a UUID field (see the sketch below):
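(A minimal sketch of that check, assuming the standard onehost CLI; the PCI excerpt is illustrative with placeholder values, and the UUID attribute appears only when the device was registered as a vGPU.)

```
# Query the host and scroll up to the PCI devices in the host template
onehost show <host_id>

# Illustrative excerpt of one PCI entry (placeholder values):
PCI = [
  ADDRESS = "0000:04:00:0",
  CLASS = "0302",
  DEVICE_NAME = "NVIDIA Corporation ...",
  SHORT_ADDRESS = "04:00.0",
  VENDOR = "10de",
  UUID = "..." ]    # <- present only when the card was added as a vGPU
```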
The way ONE knows whether to use passthrough or vGPU is whether the UUID field exists in the PCI section of the host.
If you enroll a new host it should work, but to clear an existing host you have to remove the stale PCI data first (one possible approach is sketched below). Let me know if that works for you.
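(A hedged sketch of one way to do that; these are standard onehost commands, not necessarily the exact ones the commenter had in mind, and the host name/ID are placeholders.)

```
# Push the corrected probes (pci.rb / pci.conf) to the hosts
onehost sync --force

# Re-enroll the affected host so the stale PCI data (including the old UUID) is dropped
onehost delete <host_id>
onehost create <hostname> --im kvm --vm kvm
```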
@JungleCatSW Thanks for the detailed explanation! It worked out exactly as you said. One interesting detail: when the PCI device was added as a vGPU, it did not follow the natural ordering of the device address (04:00.0) in the PCI tab or in the onehost show output.

For other people who find this issue and have problems with GPU PCI passthrough with KVM: make sure that you have the correct owner and group rights on your /dev/vfio device files (a quick check is sketched below).

Also, in my case I needed to reduce the memory size of the VM by ~2 GB compared to a no-PCI-passthrough VM, as I would otherwise get an OOM from qemu. The host and VM became unresponsive via SSH and only came back after a few hours, when (I presume) the qemu process was terminated by the OS.
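(A hedged illustration of that permissions check; the group number, user, and group names below are assumptions for a typical oneadmin/KVM setup, not taken from the original report.)

```
# Find the IOMMU group device node that belongs to the GPU and check its ownership
ls -l /dev/vfio/
# crw-rw---- 1 root kvm 241, 0 ... /dev/vfio/22     <- illustrative output

# If the qemu process cannot open it, adjust owner/group accordingly (placeholder values)
chown oneadmin:kvm /dev/vfio/22
```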
The problem should be solved with this patch: 3f300f3. The source of the issue comes from forcing the use of vGPU, preventing the use of the physical GPU for PCI-Passthrough. As @gavin-cudo commented, one of the problems resided here:
However, removing those lines means that both GPUs and vGPUs can be used at the same time, which is not correct. On the other hand, the configuration modification that @JungleCatSW suggested avoids adding the UUID to the device when it works as a physical GPU for PCI-Passthrough, but it does not properly handle the use of vGPUs since, as he indicated, OpenNebula uses this field in order to use the vGPU.
With the patch I propose, GPUs and vGPUs should be listed correctly depending on whether GPU virtualization is enabled with the NVIDIA drivers (as indicated in the official documentation). The patch also ensures that the UUID is added only to vGPUs, leaving physical GPUs configured as regular PCI devices.
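(For reference, a plain PCI passthrough request in a VM template needs no UUID at all; the snippet below uses the PCI template attributes with placeholder device IDs, and SHORT_ADDRESS is newer syntax, so check the docs for your version.)

```
# Request the GPU by vendor/device/class (placeholder IDs; 10de = NVIDIA)
PCI = [
  VENDOR = "10de",
  DEVICE = "1db6",
  CLASS  = "0302" ]

# Or pin a specific card by its host address
PCI = [ SHORT_ADDRESS = "04:00.0" ]
```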
Description
nVidia GPUs are not listed among the PCI devices on a host configured for PCI-Passthrough.
To Reproduce
Configure a host with nVidia GPUs for PCI-Passthrough as per the documentation at https://docs.opennebula.io/6.4/open_cluster_deployment/kvm_node/pci_passthrough.html
Set the filter under /var/lib/one/remotes/etc/im/kvm-probes.d/pci.conf on the frontend to be:
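(The filter value used in the original report is not shown; a typical NVIDIA-only filter, following the pci.conf format, would look like this, where 10de is the NVIDIA vendor ID.)

```
# /var/lib/one/remotes/etc/im/kvm-probes.d/pci.conf (illustrative)
# List only devices whose vendor ID is 10de (NVIDIA)
:filter: '10de:*'
```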
Expected behavior
All PCI devices, including the nVidia GPUs, are listed by onehost show <host_id>.
Actual behavior
All PCI devices are listed except nVidia GPUs.
Details
Additional context
GPUs are listed fine on the host with:
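(The exact command from the report is not shown; a typical way to check this is lspci, for example:)

```
# List NVIDIA devices on the host (illustrative; output abridged)
lspci -nn | grep -i nvidia
# 04:00.0 3D controller [0302]: NVIDIA Corporation ... [10de:....]
```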
The vfio driver is confirmed working, as seen below:
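(Again illustrative, since the original output is not shown; the key line is the kernel driver actually bound to the device.)

```
# Confirm that vfio-pci has claimed the GPU at 04:00.0
lspci -nnk -s 04:00.0
#   ...
#   Kernel driver in use: vfio-pci    <- confirms the binding
```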
The above configuration was known to be working on version 6.2.0.