This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

Problems after upgrading the host NVIDIA driver #365

Closed · SandW opened this issue Apr 14, 2017 · 14 comments

Comments

@SandW commented Apr 14, 2017

I have upgraded the host NVIDIA driver to 375.39. However, nvidia-docker and some containers were set up while driver 375.26 was installed, so I now get the following error when running "nvidia-smi" inside a container: Failed to initialize NVML: Driver/library version mismatch.
One possible solution would be to upgrade the driver inside the container. Is there a simpler way to solve this?
Thanks!
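
A quick, hedged way to confirm the mismatch is to compare the kernel module version on the host with what NVML reports from inside the affected container (the container name below is a placeholder):

$ cat /proc/driver/nvidia/version        # host kernel module, should report 375.39
$ docker exec <old-container> nvidia-smi # fails with the NVML mismatch while the 375.26 libraries are still mounted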

@flx42 (Member) commented Apr 14, 2017

Are your containers still running? If so, you need to relaunch the containers after a driver upgrade.
If that's happening with new containers, you should probably just reboot your machine.
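
As a rough sketch of relaunching a container so it picks up the newly created driver volume (container name is illustrative):

$ docker rm -f my-cuda-container
$ nvidia-docker run -it --name my-cuda-container nvidia/cuda nvidia-smi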

@SandW (Author) commented Apr 15, 2017

I have rebooted the machine and relaunched the container. Neither worked for me.

@SandW (Author) commented Apr 16, 2017

I found that the path (/var/lib/nvidia-docker/volumes/nvidia_driver/375.26) still exists on my host PC, using the command "docker volume inspect ...".
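
For example, something along these lines (volume name taken from the listing in the next comment):

$ docker volume inspect nvidia_driver_375.26   # Mountpoint should be under /var/lib/nvidia-docker/volumes/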

@SandW (Author) commented Apr 17, 2017

I have found new clues.
After upgrading the host GPU driver from 375.26 to 375.39:

  1. The driver can be seen when I use nvidia-docker run -it nvidia/cuda.
     Therefore, I ran docker volume ls and 2 items are listed:
     nvidia-docker nvidia_driver_375.26
     nvidia-docker nvidia_driver_375.39
     The nvidia_driver_375.39 volume was newly created after I ran nvidia-docker run -it nvidia/cuda.
     However, a container created with nvidia-docker run before the upgrade still shows the same error.

  2. I remembered that the Docker image that causes the "Failed to initialize NVML: Driver/library version mismatch" error was pulled from Docker Hub; it is not based on nvidia/cuda.

Questions:

  1. How can I solve the problem? Reinstall the new 375.39 driver inside Docker?
  2. Since a volume is automatically mounted during nvidia-docker run, should I delete the nvidia_driver_375.26 volume (see the sketch below)? Will that affect other Docker images?
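
A minimal sketch for question 2, assuming no existing container still references the old volume (nvidia-docker should presumably recreate a driver volume whenever one is needed):

$ docker volume ls | grep nvidia_driver
$ docker volume rm nvidia_driver_375.26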

@3XX0 (Member) commented Apr 18, 2017

You can delete the old volume, but you don't need to; new containers will pick up the new driver. In your case, it looks like the kernel module version didn't match the userland drivers. This usually happens when the driver wasn't installed properly or conflicted with a previous one.

@StevenLOL

If I am not wrong, you don't need the driver installed inside your nvidia-docker container.

REF:
nvidia-docker

@grisaitis commented Apr 24, 2017

I encountered a similar error when upgrading nvidia-docker from 1.0.0 to 1.0.1:

$ nvidia-smi
modprobe: FATAL: Module nvidia not found.
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

My driver was pretty recent (361). I'm not sure how it was installed (someone else installed it for me), but I "fixed" the situation by installing the latest driver from the APT repository (nvidia-375-dev=375.51-0ubuntu1) and then nvidia-modprobe=375.51-0ubuntu1. Then nvidia-smi worked.
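
In command form, that fix would presumably look like this (package names and pinned versions as given above; availability depends on the configured APT sources):

$ sudo apt-get install nvidia-375-dev=375.51-0ubuntu1
$ sudo apt-get install nvidia-modprobe=375.51-0ubuntu1
$ nvidia-smi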

@flx42, @3XX0 do either of you have any ideas as to why a preexisting driver installation would be broken by the nvidia-docker deb installer? Does the deb installer assume that the drivers are installed in a certain way? If my "driver wasn't installed properly", what does that mean?

Thanks!

@sentient commented Apr 24, 2017

I had to restart the nvidia-docker service after upgrading the NVIDIA driver.
Check the output of
service nvidia-docker status

If that doesn't look kosher, restart it:
service nvidia-docker stop
service nvidia-docker start
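
On systems where the service is managed by systemd, the equivalent would presumably be:

$ sudo systemctl restart nvidia-docker
$ systemctl status nvidia-docker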

nvidia-smi
Mon Apr 24 09:02:13 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 0000:01:00.0      On |                  N/A |
| 56%   30C    P8    17W / 185W |   1326MiB /  8112MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1209    G   /usr/lib/xorg/Xorg                             827MiB |
|    0      2364    G   compiz                                         359MiB |
+-----------------------------------------------------------------------------+

@flx42 (Member) commented Apr 24, 2017

@grisaitis nvidia-docker shouldn't be able to tamper with your existing driver installation. If nvidia-smi isn't working, it's likely that your setup wasn't working before you even installed nvidia-docker. We don't assume the drivers are installed in a certain way; nvidia-docker should work whether you installed the drivers through an NVIDIA-[...].run script or through a deb package. However, if you mixed the two methods, you might end up in a broken state.
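
A hedged way to check which installation method was used on a given machine:

$ dpkg -l | grep -i nvidia   # driver packages installed through deb
$ which nvidia-uninstall     # typically present only if the NVIDIA .run installer was used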

@grisaitis commented Apr 28, 2017

@flx42 Thanks for the reply! That makes sense.

EDIT: I'm pretty sure I broke things because of a Linux kernel update, and the driver wasn't built with DKMS.

I'm pretty sure the driver was installed with the run script and nothing else. I ran a few relevant commands below. Any ideas? Something looks fishy about my kernel version versus how the driver is installed...

Here's what I observe:

$ nvidia-smi
modprobe: FATAL: Module nvidia not found.
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

$ sudo modprobe nvidia
modprobe: FATAL: Module nvidia not found.
$ sudo nvidia-modprobe -u -c=0
modprobe: FATAL: Module nvidia-uvm not found.
$ dkms status
The program 'dkms' is currently not installed. To run 'dkms' please ask your administrator to install the package 'dkms'
$ uname -a
Linux <hostname> 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ find /lib/modules/ -name "*nvidia*"
/lib/modules/3.13.0-88-generic/kernel/drivers/video/nvidia.ko
/lib/modules/3.13.0-88-generic/kernel/drivers/video/nvidia-uvm.ko
/lib/modules/3.13.0-88-generic/kernel/drivers/video/nvidia
/lib/modules/3.13.0-88-generic/kernel/drivers/video/nvidia/nvidiafb.ko
/lib/modules/3.13.0-88-generic/kernel/drivers/net/ethernet/nvidia
/lib/modules/3.13.0-116-generic/kernel/drivers/video/nvidia
/lib/modules/3.13.0-116-generic/kernel/drivers/video/nvidia/nvidiafb.ko
/lib/modules/3.13.0-116-generic/kernel/drivers/net/ethernet/nvidia
$ ls -l  /usr/src/nvidia* | grep uvm
-rw-r--r-- 1 root root    21408 Jun 13  2016 nv_uvm_interface.c
-rw-r--r-- 1 root root    24889 Jun 13  2016 nv_uvm_interface.h
-rw-r--r-- 1 root root     6092 Jun 13  2016 nv_uvm_types.h
drwxr-xr-x 2 root root     4096 Jun 13  2016 uvm
$ ls -l  /usr/src/nvidia* | grep nvidia
-rw-r--r-- 1 root root     9952 Jun 13  2016 nvidia-modules-common.mk
-rw-r--r-- 1 root root      769 Jun 13  2016 nvidia-sources.mk
$ which nvidia-modprobe
/usr/bin/nvidia-modprobe
$ nvidia-modprobe --version

nvidia-modprobe:  version 352.39  (buildmeister@swio-display-x64-rhel04-18) 
Fri Aug 14 17:40:56 PDT 2015
$ dpkg -l | grep nvidia
ii  nvidia-docker                            1.0.1-1                                    amd64        NVIDIA Docker container tools
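
Given that no nvidia module exists for the running 3.13.0-116 kernel, one hedged way to recover would be to reinstall the driver with DKMS registration so future kernel updates rebuild the module automatically (the installer filename and version are illustrative):

$ sudo apt-get install dkms linux-headers-$(uname -r)
$ sudo ./NVIDIA-Linux-x86_64-375.39.run --dkms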

@HDVUCAIR commented May 4, 2017

I've run into problems recently on Red Hat systems with the latest kernel updates and the 375.39 driver. This is separate from nvidia-docker, but it may be your problem. Right now, I have to apply a patch prior to installing the NVIDIA driver. Instructions here.

Without the patch, the NVIDIA kernel modules fail to compile and load, hence no output from nvidia-smi.

Once I managed to patch and install the 375.39 driver, I still had a few problems with the nvidia-docker installation. For some reason, the service no longer starts and I have to run nvidia-docker-plugin manually (similar to the instructions for non-CentOS/non-Debian systems; see the sketch below). For now, I'm fine with that and will wait until the dust settles and either Red Hat or NVIDIA fixes the problems on the newest kernel.

John.
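
For reference, a rough sketch of running the plugin by hand (assuming nvidia-docker-plugin is on the PATH; the log path is just an example):

$ sudo -b nohup nvidia-docker-plugin > /tmp/nvidia-docker.log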

@3XX0 (Member) commented Nov 14, 2017

Closing, as this is most likely a driver installation issue and/or fixed in 2.0/master.

3XX0 closed this as completed Nov 14, 2017
@linrio commented Nov 5, 2018

Quoting @grisaitis's comment of Apr 24, 2017 above:

    My driver was pretty recent (361). I'm not sure how it was installed (someone else installed it for me), but I "fixed" the situation by installing the latest driver from the APT repository (nvidia-375-dev=375.51-0ubuntu1) and then nvidia-modprobe=375.51-0ubuntu1. Then nvidia-smi worked.

How do I install the latest driver from the APT repository (nvidia-375-dev=375.51-0ubuntu1)?

@beratkurar

I committed the container to a Docker image, then recreated a container from that image, and the problem was gone.
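
As a minimal sketch of that workaround (container and image names are hypothetical):

$ docker commit broken-container my-image:snapshot
$ nvidia-docker run -it my-image:snapshot nvidia-smi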
