[BUG] Unable to enable the GPU's via the GUI #5793

Open
IASN-CCC opened this issue May 13, 2024 · 18 comments
Labels
  • kind/bug: Issues that are defects reported by users or that we know have reached a real release
  • reproduce/needed: Reminder to add a reproduce label and to remove this one
  • severity/needed: Reminder to add a severity label and to remove this one

Comments

@IASN-CCC

IASN-CCC commented May 13, 2024

Describe the bug
Unable to enable the GPUs that are installed in the server from the UI.

Two GPUs are listed (2x NVIDIA L40); however, selecting one and clicking Enable does nothing.

To Reproduce
Go to SR-IOV GPU Devices, select a listed GPU, and click the three dots to enable it.
[image]

Expected behavior
GPU should enable

Support bundle
Please reach out to request one securely

Environment

  • Harvester ISO version: 1.3.0
  • Underlying Infrastructure: DELL R760XA Baremetal with Dual NVIDIA L40 GPUs

Additional context
Add any other context about the problem here.

IASN-CCC added the kind/bug, reproduce/needed, and severity/needed labels May 13, 2024
@ibrokethecloud
Contributor

Can you please confirm that the nvidia driver addon is enabled?

@IASN-CCC
Author

> Can you please confirm that the nvidia driver addon is enabled?

Hi, yes, I've enabled the driver addon and it's showing as working.
[image]

If I look in the PCI device list, I can also see the two GPUs, and I have not enabled them for passthrough.

@bathomas

Are the pcidevices-controllers crashing? I had to increase the limits on them and was able to enable.

@IASN-CCC
Author

> Are the pcidevices-controllers crashing? I had to increase the limits on them and was able to enable.

Hi, I'm not sure how I can check this.
How did you increase the limits?
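
One way to check whether the controller pods are crashing is to list the pods and look for any whose status is not Running or whose restart count is non-zero. The following is a minimal sketch, not an official procedure: it assumes `kubectl` access to the cluster and that the controllers run in the `harvester-system` namespace (an assumption, not confirmed in this thread). The function names are illustrative.

```python
import subprocess
from typing import List, Tuple

# Pod statuses treated as healthy in this sketch.
HEALTHY_STATUSES = {"Running", "Completed"}


def find_unhealthy_pods(kubectl_output: str, name_filter: str = "") -> List[Tuple[str, str, int]]:
    """Parse the text output of `kubectl get pods`.

    Returns (name, status, restarts) for every pod whose name contains
    `name_filter` and that is either not healthy or has restarted.
    """
    unhealthy = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 4 or fields[0] == "NAME":
            continue
        # Only the first four columns are used, so a RESTARTS value such as
        # "3 (5m ago)" is still parsed correctly.
        name, _ready, status, restarts = fields[:4]
        if name_filter not in name or not restarts.isdigit():
            continue
        restart_count = int(restarts)
        if status not in HEALTHY_STATUSES or restart_count > 0:
            unhealthy.append((name, status, restart_count))
    return unhealthy


def list_pods(namespace: str = "harvester-system") -> str:
    """Run `kubectl get pods` in the given namespace (assumed) and return its output."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    for name, status, restarts in find_unhealthy_pods(list_pods(), "pcidevices"):
        print(f"{name}: status={status}, restarts={restarts}")
```

A pod that keeps restarting or sits in a status such as CrashLoopBackOff would show up in this list; its logs are then the next place to look.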

rebeccazzzz added this to New in Community Issue Review via automation May 14, 2024
@ibrokethecloud
Contributor

The nvidia-driver-toolkit needs the driver location, which is an HTTP endpoint where the NVIDIA KVM driver is hosted.

From that screenshot, it looks like this has not been edited, so no real driver has been installed on the underlying hosts.

@IASN-CCC
Author

> The nvidia-driver-toolkit needs the driver location, which is an HTTP endpoint where the NVIDIA KVM driver is hosted.
>
> From that screenshot, it looks like this has not been edited, so no real driver has been installed on the underlying hosts.

Oh OK, that makes sense then.

The host is not currently connected to the internet (it sits behind a proxy, and I am trying to locate all the URLs to whitelist in our proxy).

Will internet access resolve this, or do I still need to find a location?

@ibrokethecloud
Contributor

The HTTP endpoint is supposed to be an internal HTTP server where you can host the drivers. You will need access to the NVIDIA portal to download the NVIDIA KVM drivers. These are different from the open-source drivers.

Please refer to the docs: https://docs.harvesterhci.io/v1.3/advanced/addons/nvidiadrivertoolkit

@IASN-CCC
Author

> https://docs.harvesterhci.io/v1.3/advanced/addons/nvidiadrivertoolkit

OK, I downloaded the latest KVM drivers, put them on an internal web server, and updated the addon, but still no luck.
[image]

I tested that I can hit the URL from my PC and the file starts downloading. I can also ping the web server host from the Harvester host.
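
A successful ping and a download from a PC do not by themselves show that the driver file is reachable over HTTP from the Harvester network. The following is a minimal sketch of a reachability check that could be run from a machine on the same network as the nodes, assuming Python 3 is available there. The function name is illustrative, and the check deliberately bypasses any proxy environment variables on the assumption that the driver server is internal.

```python
import sys
import urllib.error
import urllib.request


def driver_url_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to `url` succeeds with a 2xx status.

    Proxies from the environment are ignored (assumption: the driver
    server is internal and should be reached directly).
    """
    # An empty ProxyHandler disables proxy lookup from environment variables.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener.open(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors (e.g. 404), connection failures, and malformed URLs.
        return False


if __name__ == "__main__":
    # Usage: python3 check_driver_url.py <driver-location-url>
    ok = driver_url_reachable(sys.argv[1])
    print("reachable" if ok else "NOT reachable")
    sys.exit(0 if ok else 1)
```

If this check passes but the addon still fails, the cause is likely elsewhere than the driver location, for example in pulling the container image.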

@ibrokethecloud
Contributor

Any chance I could get a support bundle to figure out what is going on? There should be messages in the nvidia driver toolkit container / pcidevices logs that would provide insight into what is going on.

@IASN-CCC
Author

> Any chance I could get a support bundle to figure out what is going on? There should be messages in the nvidia driver toolkit container / pcidevices logs that would provide insight into what is going on.

Sure. What is the best way to provide it to you securely?

@ibrokethecloud
Contributor

please email the bundle to harvester-support-bundle@suse.com

@IASN-CCC
Author

> please email the bundle to harvester-support-bundle@suse.com

Sent. Thank you

@ibrokethecloud
Contributor

The nvidia-driver-runtime image cannot be pulled by your nodes:

nvidia-driver-runtime-5vvwn                             0/1     ImagePullBackOff   0              21h

This image is not shipped in the ISO and needs to be mirrored to your private registry if your nodes do not have access to Docker Hub.

Once the image is available, please update the image details in the addon to point to your private registry.

This is mentioned in the docs: https://docs.harvesterhci.io/v1.3/advanced/addons/nvidiadrivertoolkit
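
Mirroring an image to a private registry usually comes down to a pull, a tag, and a push from a machine that can reach both Docker Hub and the registry. The following is a minimal sketch that only builds those three `docker` commands as strings so they can be reviewed before being run. The registry host `registry.example.internal` is a hypothetical placeholder, the image name is the one quoted elsewhere in this thread, and the function names are illustrative.

```python
from typing import List


def mirrored_reference(image: str, registry: str) -> str:
    """Prefix an image reference with a private registry host."""
    return f"{registry.rstrip('/')}/{image}"


def mirror_commands(image: str, registry: str) -> List[str]:
    """Return the docker commands that copy `image` into `registry`."""
    target = mirrored_reference(image, registry)
    return [
        f"docker pull {image}",           # fetch from the public registry
        f"docker tag {image} {target}",   # re-tag for the private registry
        f"docker push {target}",          # upload to the private registry
    ]


if __name__ == "__main__":
    # Hypothetical registry host; replace with your own.
    for command in mirror_commands(
        "rancher/harvester-nvidia-driver-toolkit:v1.3-20240307",
        "registry.example.internal",
    ):
        print(command)
```

After the push, the image details in the addon would need to point at the mirrored reference rather than the Docker Hub one.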

@IASN-CCC
Author

> The nvidia-driver-runtime image cannot be pulled by your nodes:
>
> nvidia-driver-runtime-5vvwn                             0/1     ImagePullBackOff   0              21h
>
> This image is not shipped in the ISO and needs to be mirrored to your private registry if your nodes do not have access to Docker Hub.
>
> Once the image is available, please update the image details in the addon to point to your private registry.
>
> This is mentioned in the docs: https://docs.harvesterhci.io/v1.3/advanced/addons/nvidiadrivertoolkit

Oh OK, I didn't realise that. I thought I just had to download the driver and host it on a web server, which I did.

Do I need to set up this private registry as well? Do I just deploy SUSE MicroOS and set it up as a private registry?

Thanks

@ibrokethecloud
Contributor

The private registry is a container registry. I do not think microOS contains a registry of its own. You could use something like goharbor to get started with a private registry

@IASN-CCC
Author

> The private registry is a container registry. I do not think microOS contains a registry of its own. You could use something like goharbor to get started with a private registry

Thanks for that, we will look into it.

Interestingly, I whitelisted in our proxy all the domains the system was trying to reach and then configured the proxy in the Harvester UI, but still no luck. It should be able to get out now.

@ibrokethecloud
Contributor

Are you able to SSH to all your nodes and run docker pull rancher/harvester-nvidia-driver-toolkit:v1.3-20240307?

If the nodes can pull this image, then the addon should work.

@IASN-CCC
Author

> docker pull rancher/harvester-nvidia-driver-toolkit:v1.3-20240307

No such luck.
[image]
