Failed to initialize NVML: GPU access blocked by the operating system #1

Closed
bshillingford opened this issue Apr 8, 2015 · 23 comments

@bshillingford

Hi, first of all, thanks for sharing these Dockerfiles. I've been trying to use your kaixhin/cuda, but I can't access the GPUs within the container. I'm fairly certain both the host and container are running the same CUDA version, 7.0.28, but nvidia-smi always outputs Failed to initialize NVML: GPU access blocked by the operating system. nvidia-smi -a produces the same error, so I can't find a way to get more information about it. Do you have any idea what could be causing this?

Thanks!

Brendan

Within the docker container:

$ docker run -ti -v `pwd`/NVIDIA_CUDA-7.0_Samples:/cudasamples --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia1:/dev/nvidia1 --device /dev/nvidia2:/dev/nvidia2 --device /dev/nvidia3:/dev/nvidia3 --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm kaixhin/cuda /bin/bash
root@9279fc160f42:/# nvidia-smi 
Failed to initialize NVML: GPU access blocked by the operating system
root@9279fc160f42:/# /cudasamples/1_Utilities/deviceQuery/deviceQuery 
/cudasamples/1_Utilities/deviceQuery/deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL
root@9279fc160f42:/# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Mon_Feb_16_22:59:02_CST_2015
Cuda compilation tools, release 7.0, V7.0.27

On the host:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Mon_Feb_16_22:59:02_CST_2015
Cuda compilation tools, release 7.0, V7.0.27
$ modinfo nvidia | grep version
version:        346.47
vermagic:       3.16.0-31-generic SMP mod_unload modversions 
$ nvidia-smi
Wed Apr  8 23:47:44 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 346.47     Driver Version: 346.47         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  On   | 0000:04:00.0     Off |                  N/A |
| 26%   28C    P8    14W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  On   | 0000:08:00.0     Off |                  N/A |
| 26%   28C    P8    14W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  On   | 0000:85:00.0     Off |                  N/A |
| 26%   29C    P8    13W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  On   | 0000:89:00.0     Off |                  N/A |
| 26%   28C    P8    14W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
@bshillingford
Author

Never mind, I fooled myself. My host has driver 346.47, but the driver version included in the CUDA installer is 346.46. Thanks!

@Kaixhin
Owner

Kaixhin commented Apr 9, 2015

That's quite a subtle problem, so I'm glad you managed to solve the issue yourself. I'll be pushing an update soon which includes the CUDA driver version in the readmes so that people can check their hosts first.

@leopd

leopd commented Jun 4, 2015

Got this same error with a mismatch of driver versions: 346.46 to 346.72. Make sure they match exactly!
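For anyone hitting this later: a quick way to confirm such a mismatch is to compare the kernel module version on the host with the user-space NVIDIA libraries inside the container. A rough sketch (library paths vary by distribution and install method):

# On the host: version of the loaded kernel module
cat /proc/driver/nvidia/version

# Inside the container: the suffix of the library file names is the user-space driver version
find / -name 'libnvidia-ml.so.*' 2>/dev/null

If the two versions differ, even by a minor revision (e.g. 346.47 vs 346.46 above), NVML will refuse to initialize.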

@nh2

nh2 commented Aug 10, 2015

Useful to know for passers-by: you cannot combine this Dockerfile with a host that has CUDA installed via one of the DEB packages from https://developer.nvidia.com/cuda-downloads, because those usually contain newer driver versions than the .run file (e.g. cuda_7.0.28_linux.run still bundles 346.46 while the DEBs already contain 346.82).

To get around this, either downgrade your host driver to also use the one from the .run file, or build a newer container based on the DEBs - I chose to do that and built it like this:

# Install CUDA
docker run -ti -v $HOME/Downloads/:/cuda-installer ubuntu:14.04 bash

# Inside the container:
dpkg -i /cuda-installer/cuda-repo-ubuntu1404-7-0-local_7.0-28_amd64.deb
apt-get update
apt-get install cuda
apt-get purge cuda-repo-ubuntu1404-7-0-local
apt-get clean
rm -rf /var/lib/apt/lists/*

# Outside, commit the container
docker commit CONTAINER_ID $USER/cuda-deb:7.0.28-346.82

(Not using a Dockerfile here because it makes it really hard to get a file from the host into the image without blowing up the image size.)
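To use an image committed like this, run it the same way as the earlier example, passing the NVIDIA device nodes through (the tag below is simply the one committed above):

docker run -ti --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 $USER/cuda-deb:7.0.28-346.82 nvidia-smi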

@tengpeng

tengpeng commented Feb 8, 2016

@nh2 I followed your guidance and everything looks fine, but the problem persists.

My host runs CentOS, while the Docker image is built on Ubuntu. Does that make a difference?

@tengpeng

tengpeng commented Feb 9, 2016

#host 
nvidia-smi
Tue Feb  9 01:30:58 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.79     Driver Version: 352.79         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  Off  | 0000:01:00.0      On |                  N/A |
| 40%   27C    P8     1W /  38W |    274MiB /  2047MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     10505    G   /usr/bin/X                                      91MiB |
|    0     10647    G   /usr/bin/gnome-shell                            91MiB |
|    0     12965    G   ...s-passed-by-fd --v8-snapshot-passed-by-fd    82MiB |
+-----------------------------------------------------------------------------+
#docker
root@84b9f6e2d047:~# nvidia-smi
Failed to initialize NVML: GPU access blocked by the operating system
root@fefd58c66866:~# lsmod | grep nv
nvidia               8536971  54 
drm                   349210  5 i915,drm_kms_helper,nvidia
i2c_core               40582  7 drm,i915,i2c_i801,i2c_hid,drm_kms_helper,i2c_algo_bit,nvidia
root@fefd58c66866:~# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.79  Wed Jan 13 16:17:53 PST 2016
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) 

@Kaixhin
Owner

Kaixhin commented Feb 9, 2016

@tengpeng The OS mismatch is almost certainly the problem when it comes to drivers. I am only going to support Ubuntu (but will take PRs for CentOS if anyone wants to maintain that), so I suggest trying NVIDIA's nvidia-docker project and letting me know how it goes so that I can update my documentation.

@tengpeng

tengpeng commented Feb 9, 2016

@Kaixhin I tried NVIDIA's nvidia-docker. All of its tests pass and I can use DIGITS. However, I can't get the mxnet image to work.

sudo nvidia-docker run kaixhin/cuda-mxnet:latest nvidia-smi
Failed to initialize NVML: Unknown Error

@Kaixhin
Owner

Kaixhin commented Feb 10, 2016

@tengpeng Looks like you will have to build your own image, using this Dockerfile for reference. It will need to start from an NVIDIA CentOS image and replace any Ubuntu-specific commands with their CentOS equivalents (e.g. apt-get with yum).
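For reference, a rough, untested sketch of what such a CentOS-based Dockerfile might look like; the base image tag and the mxnet build flags here are assumptions, so check Docker Hub and the mxnet build docs for the exact values:

# Hypothetical CentOS variant of the cuda-mxnet Dockerfile (untested sketch)
FROM nvidia/cuda:7.5-devel-centos7

# Build dependencies, installed with yum instead of apt-get
RUN yum install -y git make gcc-c++ atlas-devel && yum clean all

# Build mxnet with CUDA support (OpenCV disabled to keep the dependency list short)
RUN git clone --recursive https://github.com/dmlc/mxnet /mxnet && \
    cd /mxnet && \
    make -j"$(nproc)" USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_OPENCV=0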

@bshillingford
Author

If you want to use NVIDIA's nvidia-docker and build off an Ubuntu image instead of CentOS, see this Dockerfile as an example:

https://github.com/NVIDIA/nvidia-docker/blob/master/ubuntu-14.04/caffe/Dockerfile

That is, you can start from cuda:7.0-cudnn4-runtime.


@tengpeng

@bshillingford Thanks!

@tengpeng

@Kaixhin Thanks! I had thought I could use the cuda image to run the mxnet image.

@Yunrui

Yunrui commented Feb 24, 2016

@tengpeng have you fixed the issue? If so, please share your solution/Dockerfile. I am encountering the same issue. My host is CentOS with NVIDIA driver 352.79 installed, and I am trying to figure out how to make mxnet run on it.

@tengpeng

@Yunrui I ended up installing mxnet directly on the host, rather than running it in Docker. If I understand the author of these Dockerfiles correctly, the Docker images support Ubuntu only.

@jamborta

jamborta commented Apr 17, 2016

Hi, I have the same problem, running on a g2.2xlarge EC2 instance with Ubuntu 14.04. When I run

sudo docker run -ti --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm kaixhin/cuda bash

I get

Failed to initialize NVML: GPU access blocked by the operating system

I compared the driver versions; both are the same:

ubuntu@ip-172-31-43-191:~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.63  Sat Nov 7 21:25:42 PST 2015
GCC version:  gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.1)

Any help would be appreciated.

@nh2

nh2 commented Apr 17, 2016

@jamborta I cannot give you detailed advice, but NVIDIA now provides its own Docker images and tooling (https://github.com/NVIDIA/nvidia-docker), which you might want to try.

@Kaixhin
Owner

Kaixhin commented Apr 18, 2016

@jamborta it appears that this driver version has problems on EC2 (see #7), so you'll want to either try running the container on an instance with upgraded drivers (see @nh2's solution above) or use one of NVIDIA's official images - the latter is probably easier.
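The quickest sanity check with NVIDIA's tooling (assuming the nvidia-docker 1.x wrapper is installed) is along these lines:

# Pulls NVIDIA's official CUDA image and runs nvidia-smi through the wrapper
nvidia-docker run --rm nvidia/cuda nvidia-smi

If that works, the driver plumbing on the host is fine and the problem is limited to the images from this repo.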

@jamborta

jamborta commented Apr 18, 2016

@nh2 @Kaixhin thanks for the replies. I would prefer to use the Docker images from this repo, as I'm using the Torch version. I'm not sure what you mean by not using the latest driver; #7 mentions 352.63, which is the one I have.

@Kaixhin
Owner

Kaixhin commented Apr 18, 2016

@jamborta I found the issues with that driver version on this NVIDIA thread, which is why you will probably want to make your own Docker images using NVIDIA's official tools if you want to use EC2.

Otherwise, if you can (I don't use EC2 and don't know what's possible), you could try reinstalling the drivers and/or CUDA on the host using the runfile.
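If you go the runfile route, something along these lines should install the driver bundled with the toolkit; the exact flags vary between installer versions, so check the installer's --help output first:

# Stop X / anything using the GPU first; --silent --driver installs only the bundled driver
sudo sh cuda_7.0.28_linux.run --silent --driver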

@YIFanH

YIFanH commented Jul 18, 2016

@tengpeng I want to use NVIDIA driver version 352.79 for some research, but the driver in the working Docker setup is 346.46. Could you share a link to your Docker CUDA container? Thank you for your reply.

@rangsimanketkaew

Try updating the NVIDIA driver to the latest version:
sudo nvidia-installer --update

@tommyjcarpenter

Does the container need to be running the same version of CUDA as the host machine?

If so, this is an unprecedented kind of runtime dependency for a container: a Docker container can now only run on specific Docker hosts. Is there even a way to specify such a dependency? I had assumed Docker containers made no assumptions about the underlying Docker host.

@Kaixhin
Owner

Kaixhin commented Oct 2, 2018

@tommyjcarpenter this is a kernel driver issue - you can find more details on the NVIDIA Docker repo. FYI, all CUDA images in this repo now use their setup, so they are subject to whatever restrictions NVIDIA has needed to impose.
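Concretely, the nvidia-docker tooling mounts the host's user-space driver libraries into the container at runtime, so the image itself no longer pins a specific host driver version. A minimal check, assuming nvidia-docker 2 is installed (the image tag is only an example):

# Uses the NVIDIA container runtime to expose the host GPUs and driver libraries
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi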
