This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

error #35 - installed CUDA driver version older than runtime even though 361.42 driver is installed? #191

Closed
Motherboard opened this issue Sep 6, 2016 · 10 comments

Comments

@Motherboard

Motherboard commented Sep 6, 2016

Using an Amazon EC2 machine with NVIDIA driver version 361.42, and nvidia-docker / nvidia-docker-plugin installed and running.

Running the latest DIGITS (4.0) shows in the log:

cudaRuntimeGetVersion() failed with error #35

nvidia-docker volume ls on my machine shows

nvidia-docker nvidia_driver_361.42

there are no CUDA binaries (e.g. deviceQuery or nvidia-smi) that I could find in the DIGITS image, but running

nvidia-docker run --rm nvidia/cuda nvidia-smi

results in

+------------------------------------------------------+                       
| NVIDIA-SMI 361.42     Driver Version: 361.42         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   35C    P8    17W / 125W |     11MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Trying to nvidia-docker build a Dockerfile based on nvidia/cuda:7.0-cudnn4-devel-ubuntu14.04, which clones the master branch of Caffe and compiles it with cuDNN enabled, fails at the beginning of testing with the following error:

    Cuda number of devices: 0
    Setting to use device 0
    Current device id: 0
    Current device name: 
    Note: Randomizing tests' orders with a seed of 21847 .
    [==========] Running 2081 tests from 277 test cases.
    [----------] Global test environment set-up.
    [----------] 50 tests from NeuronLayerTest/3, where TypeParam = caffe::GPUDevice<double>
    [ RUN      ] NeuronLayerTest/3.TestSigmoidGradient
    E0905 10:18:15.161348   263 common.cpp:113] Cannot create Cublas handle. Cublas won't be available.
    E0905 10:18:15.162796   263 common.cpp:120] Cannot create Curand generator. Curand won't be available.
    F0905 10:18:15.162914   263 syncedmem.hpp:18] Check failed: error == cudaSuccess (35 vs. 0)  CUDA driver version is insufficient for CUDA runtime version
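For context, error 35 in the CUDA runtime headers is cudaErrorInsufficientDriver, i.e. the runtime inside the image is newer than what the host driver advertises. A minimal sketch of the mapping (the helper name is mine; the message text is what cudaGetErrorString reports and what Caffe printed above):

```shell
# Hypothetical helper mapping the CUDA error code from this thread to its
# meaning (35 = cudaErrorInsufficientDriver in the CUDA 7.x runtime headers).
cuda_err() {
  case "$1" in
    35) echo "CUDA driver version is insufficient for CUDA runtime version" ;;
    *)  echo "unknown error $1" ;;
  esac
}

cuda_err 35
# -> CUDA driver version is insufficient for CUDA runtime version
```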

But oddly enough, beniz/deepdetect_gpu does seem to work properly with the GPU...

Any Ideas?

@3XX0
Member

3XX0 commented Sep 7, 2016

Looks like your driver wasn't installed properly. How did you install it?

@Motherboard
Author

It's Ubuntu 15.10 (GNU/Linux 4.2.0-42-generic x86_64), this is what I did from the beginning:

$ sudo apt-get update
$ sudo apt-get install --no-install-recommends -y gcc make libc-dev
$ wget -P /tmp http://us.download.nvidia.com/XFree86/Linux-x86_64/361.42/NVIDIA-Linux-x86_64-361.42.run
$ sudo sh /tmp/NVIDIA-Linux-x86_64-361.42.run --silent
$ wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.0-rc.3/nvidia-docker_1.0.0.rc.3-1_amd64.deb
$ sudo dpkg -i /tmp/nvidia-docker_*.deb && rm /tmp/nvidia-docker_*.deb
$ sudo apt-get install dkms build-essential linux-headers-generic
$ sudo nano /etc/modprobe.d/blacklist-nouveau.conf

adding the following lines:

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

save and quit
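(The nano step can also be scripted non-interactively; a sketch below writes the same five lines with a heredoc, using a temporary file rather than /etc/modprobe.d/blacklist-nouveau.conf so it can run unprivileged:)

```shell
# Non-interactive version of the nano step above (sketch: writing to a
# temp file here; on a real host this would be
# /etc/modprobe.d/blacklist-nouveau.conf via sudo tee).
conf=$(mktemp)
tee "$conf" >/dev/null <<'EOF'
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
EOF

grep -c nouveau "$conf"   # prints 5: every line mentions nouveau
```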

$ echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
$ sudo update-initramfs -u

I may have had to re-run the NVIDIA installer again at this stage (exactly the same two commands as before).

And finally:

$ sudo usermod -aG docker ubuntu
$ sudo service nvidia-docker start

Then I made sure both the docker and nvidia-docker services are up:

$ service nvidia-docker status
$ service docker status

And as mentioned above, the nvidia/cuda image is able to run nvidia-smi, the GPU and driver versions show up as expected, and beniz/deepdetect_gpu does seem to work properly with the GPU.

@3XX0
Member

3XX0 commented Sep 7, 2016

What's the output of ldconfig -p | grep libcuda and sudo ls -lR /var/lib/nvidia-docker | grep libcuda?

@Motherboard
Author

$ ldconfig -p | grep libcuda

libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
libcuda.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so

$ sudo ls -lR /var/lib/nvidia-docker | grep libcuda

lrwxrwxrwx 1 nvidia-docker nvidia-docker       17 Sep  1 09:36 libcuda.so -> libcuda.so.361.42
lrwxrwxrwx 1 nvidia-docker nvidia-docker       17 Sep  1 09:36 libcuda.so.1 -> libcuda.so.361.42
-rwxr-xr-x 2 root          root          16881416 Aug 31 22:54 libcuda.so.361.42
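(Those symlinks look consistent. A quick way to compare the host's libcuda against the copy in the nvidia-docker volume is to strip the version suffix from each resolved path; a sketch, where the helper name is mine and the path below is just the 361.42 file from the listing above:)

```shell
# Hypothetical helper: extract the driver version suffix from a
# fully-resolved libcuda path, e.g. .../libcuda.so.361.42 -> 361.42.
libcuda_version() {
  printf '%s\n' "${1##*libcuda.so.}"
}

# On a real host you would resolve the symlinks first, e.g.
#   libcuda_version "$(readlink -f /usr/lib/x86_64-linux-gnu/libcuda.so.1)"
# and compare against the version in the nvidia-docker volume; a mismatch
# between the two would explain a driver/runtime disagreement.
libcuda_version /var/lib/nvidia-docker/volumes/nvidia_driver/361.42/lib64/libcuda.so.361.42
# -> 361.42
```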

@lukeyeager
Member

Hmm. I just got a similar unexpected error while playing with a Torch-based docker image.

THCudaCheck FAIL file=/torch/extra/cutorch/lib/THC/THCGeneral.c line=20 error=35 : CUDA driver version is insufficient for CUDA runtime version
/torch/install/bin/luajit: /torch/install/share/lua/5.1/trepl/init.lua:384: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at /torch/extra/cutorch/lib/THC/THCGeneral.c:20
stack traceback:
    [C]: in function 'error'
    /torch/install/share/lua/5.1/trepl/init.lua:384: in function 'require'
    neural_style.lua:51: in function 'main'
    neural_style.lua:515: in main chunk
    [C]: in function 'dofile'
    /torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670
Host:
- OS: Ubuntu 16.04
- Driver: 367.48
- nvidia-docker: 1.0.0~rc.3-1
- docker: 1.12.1-0~xenial

Image:
- Base: nvidia/cuda:7.5-cudnn5-devel-ubuntu14.04

@lukeyeager
Member

... but a DIGITS image and an NVcaffe image work fine? Not sure what's happening here.

@lukeyeager
Member

@3XX0 helped me figure out my problem. I was trying to use CUDA while building the image, but the driver isn't available at build time. When I changed the last step in my Dockerfile from a RUN to a CMD, everything worked fine. Nevermind!
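For reference, the fix looks roughly like this (a sketch, not my actual Dockerfile; the script name is just the one from the Torch traceback above):

```dockerfile
FROM nvidia/cuda:7.5-cudnn5-devel-ubuntu14.04

# Build-time steps are fine: compiling needs no GPU access.
COPY neural_style.lua /workspace/
WORKDIR /workspace

# BAD:  RUN th neural_style.lua
#       -- executes during `docker build`, where nvidia-docker has not
#          mounted the driver volume yet, so CUDA fails with error 35.
# GOOD: defer GPU work to container runtime:
CMD ["th", "neural_style.lua"]
```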

@Motherboard what does this command do for you?

nvidia-docker run --rm --entrypoint 'digits/device_query.py' nvidia/digits

@Motherboard
Author

Motherboard commented Sep 7, 2016

nvidia-docker doesn't like it when I don't give it all the volumes declared in the Dockerfile, so

$ nvidia-docker run --rm --entrypoint 'digits/device_query.py' nvidia/digits

gives

docker: Error response from daemon: create f64b902e8ee8344f2a45a9e0420aa63b2d70349473229877a65cb9ac47152029: bad volume format: f64b902e8ee8344f2a45a9e0420aa63b2d70349473229877a65cb9ac47152029.
But
$ nvidia-docker run --rm -v /home/ubuntu/notebook:/data -v /home/ubuntu/jobs:/jobs --entrypoint 'digits/device_query.py' nvidia/digits

gives

Device #0:
>>> CUDA attributes:
  name                         GRID K520
  totalGlobalMem               4294770688
  clockRate                    797000
  major                        3
  minor                        0
>>> NVML attributes:
  Total memory                 4095 MB
  Used memory                  48 MB
  Memory utilization           0%
  GPU utilization              0%
  Temperature                  36 C

@Motherboard
Author

I don't know what was wrong before, but I've tried running DIGITS again, and it seems to be fine...

Can't reproduce the error anymore...

@rcoborod

rcoborod commented Mar 4, 2017

I struggled a lot trying to use my GTX 860M in a Lenovo Y70 machine with an i7 and an integrated Intel graphics card, and one of the errors was quite similar to the ones you are getting. I discovered something about how to activate the NVIDIA card before anything tries to access it through the drivers. Just to give you ideas and open a possible solution path:

when I type NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery, I get:

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

But if I try with $ optirun NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery, the result is the one we want:
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 860M"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 4044 MBytes (4240965632 bytes)
( 5) Multiprocessors, (128) CUDA Cores/MP: 640 CUDA Cores
GPU Max Clock rate: 1020 MHz (1.02 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 860M
Result = PASS

That makes me think all my problems are related to the way I invoke programs. Now I'm investigating how to make it work with Torch for recurrent neural networks, but on the GPU...
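If the Bumblebee prefix really is the difference, one way to avoid typing it everywhere is a small wrapper that only uses optirun when it exists (a sketch; run_cuda is my own name, not a Bumblebee tool, and the assumption is that optirun is what routes the call to the NVIDIA GPU on Optimus laptops):

```shell
# Hypothetical wrapper: prefix CUDA programs with optirun when Bumblebee
# is installed, otherwise run them directly.
run_cuda() {
  if command -v optirun >/dev/null 2>&1; then
    optirun "$@"
  else
    "$@"
  fi
}

# e.g. run_cuda NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
run_cuda echo "deviceQuery would run here"
```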
