Problems after upgrading the host nvidia driver #365
Comments
Are your containers still running? If so, you need to relaunch the containers after a driver upgrade.
I have rebooted the machine and relaunched the container. Neither works for me.
I found that the path (/var/lib/nvidia-docker/volumes/nvidia_driver/375.26) exists on my host PC using the command "docker volume inspect ...".
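For anyone checking the same thing: the following sketch lists the driver volumes on the host, assuming the nvidia-docker 1.x naming scheme (`nvidia_driver_<version>`); it prints a note rather than failing when docker is unavailable.

```shell
# List nvidia-docker driver volumes on the host (sketch; assumes
# nvidia-docker 1.x volume names like "nvidia_driver_375.26").
list_driver_volumes() {
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker not available"
        return 0
    fi
    # Suppress daemon errors; fall back to a message if nothing matches.
    docker volume ls --format '{{.Name}}' 2>/dev/null \
        | grep nvidia_driver || echo "no nvidia_driver volumes found"
}
list_driver_volumes
```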
I have found new clues.
Questions:
You can delete the old volume but you don't need to; new containers will pick up the new driver. In your case, it looks like the kernel module version didn't match the userland drivers. This usually happens when the driver wasn't installed properly or conflicted with a previous one.
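To see whether the kernel module and userland libraries agree, you can compare the two versions directly — a minimal sketch, assuming the standard `/proc/driver/nvidia/version` interface and `nvidia-smi` on the PATH; on a machine without the driver it reports the versions as unknown instead of erroring.

```shell
# Compare the kernel-module driver version with the userland one.
check_driver_match() {
    # Kernel side: parse the version out of /proc (empty if no driver).
    kernel_ver=$(sed -n 's/.*Kernel Module *\([0-9.]*\).*/\1/p' \
        /proc/driver/nvidia/version 2>/dev/null)
    # Userland side: ask nvidia-smi (empty if not installed).
    user_ver=$(nvidia-smi --query-gpu=driver_version \
        --format=csv,noheader 2>/dev/null | head -n1)
    if [ -n "$kernel_ver" ] && [ "$kernel_ver" = "$user_ver" ]; then
        echo "match: $kernel_ver"
    else
        echo "mismatch or unknown: kernel='$kernel_ver' userland='$user_ver'"
    fi
}
check_driver_match
```

A mismatch here is exactly the situation that produces "Failed to initialize NVML: Driver/library version mismatch".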
I encountered a similar error when upgrading nvidia-docker from 1.0.0 to 1.0.1:
My driver was pretty recent (361). I'm not sure how it was installed (someone else installed it for me), but I "fixed" the situation by installing the latest driver from the APT repository (
@flx42, @3XX0, do either of you have any ideas as to why a preexisting driver installation would be broken by the nvidia-docker deb installer? Does the deb installer assume that the drivers are installed in a certain way? If my "driver wasn't installed properly", what does that mean? Thanks!
I had to restart the nvidia-docker service after upgrading the nvidia driver. If that doesn't look too kosher, try to restart
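Restarting the service might look like the sketch below. It assumes nvidia-docker 1.x registered a service named `nvidia-docker`; the `--dry-run` flag is just a convenience here for previewing the command without sudo.

```shell
# Restart the nvidia-docker service (sketch; assumes the service is
# named "nvidia-docker" as in the 1.x deb/rpm packages).
restart_plugin() {
    cmd="service nvidia-docker restart"
    # Prefer systemctl where available.
    command -v systemctl >/dev/null 2>&1 && cmd="systemctl restart nvidia-docker"
    if [ "$1" = "--dry-run" ]; then
        echo "would run: sudo $cmd"
    else
        sudo $cmd
    fi
}
restart_plugin --dry-run
```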
@grisaitis
@flx42 Thanks for the reply! That makes sense. EDIT: I'm pretty sure I broke things because of a Linux kernel update... and the driver wasn't built with DKMS.
Here's what I observe:
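A quick way to tell whether the driver is DKMS-managed (and thus rebuilt automatically on kernel updates) — a sketch, assuming the standard `dkms` CLI:

```shell
# Check whether the NVIDIA kernel module is registered with DKMS.
check_dkms() {
    if ! command -v dkms >/dev/null 2>&1; then
        echo "dkms not installed"
        return 0
    fi
    # Print the nvidia entry if present, otherwise a diagnostic note.
    dkms status 2>/dev/null | grep -i nvidia \
        || echo "nvidia module not registered with dkms"
}
check_dkms
```

If the module isn't registered, a kernel update leaves you with a module built for the old kernel, which is one way to end up with the version-mismatch error.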
I've run into problems recently on Red Hat systems with the latest kernel updates and the 375.39 driver. This is separate from nvidia-docker, but may be your problem. Right now, I have to apply a patch prior to installing the NVIDIA driver. Instructions here. Without the patch, the NVIDIA kernel modules failed to compile and load, hence no output from the nvidia-smi program.
Once I managed to patch and install the 375.39 driver, I still had a few problems with the nvidia-docker installation. For some reason, the service no longer starts and I have to manually run the nvidia-docker-plugin (similar to the instructions for non-CentOS/non-Debian systems). For now, I'm fine with that and will wait until the dust settles and either Red Hat or NVIDIA fixes the problems on the newest kernel.
John.
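Running the plugin by hand, as mentioned above, might look like this — a sketch based on the manual-installation route; the log path is just a convention, not a requirement.

```shell
# Launch nvidia-docker-plugin manually when the service won't start.
run_plugin() {
    if command -v nvidia-docker-plugin >/dev/null 2>&1; then
        # Start it in the background, detached, logging to /tmp.
        sudo -b nohup nvidia-docker-plugin > /tmp/nvidia-docker.log 2>&1
        echo "plugin launched; logs in /tmp/nvidia-docker.log"
    else
        echo "nvidia-docker-plugin not found in PATH"
    fi
}
run_plugin
```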
Closing, as it's most likely a driver installation issue and/or fixed in 2.0/master.
How do I install the latest driver from the APT repository (
I committed the container to a Docker image, then recreated a container from that image and the problem was gone.
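That workaround lines up with the earlier comment that new containers pick up the new driver, since the driver volume is attached at run time. A sketch of the steps — `my_old_container` and `myimage:fixed` are placeholder names, and `--dry-run` just prints the commands instead of running them:

```shell
# Commit an existing container and recreate it so the new driver
# volume is mounted (sketch; names below are placeholders).
recreate_from_commit() {
    # Usage: recreate_from_commit [--dry-run] <container> <image:tag>
    if [ "$1" = "--dry-run" ]; then shift; run="echo"; else run=""; fi
    $run docker commit "$1" "$2"
    $run nvidia-docker run --rm "$2" nvidia-smi
}
recreate_from_commit --dry-run my_old_container myimage:fixed
```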
I have upgraded the host nvidia driver to 375.39. However, nvidia-docker and some containers were installed while 375.26 was the current driver. As a result, I get this error when running the "nvidia-smi" command in the container: Failed to initialize NVML: Driver/library version mismatch.
One solution would be to upgrade the driver inside the container. Is there a simpler way to solve the problem?
Thanks!