Failed to initialize NVML: GPU access blocked by the operating system #1

Closed
bshillingford opened this issue Apr 8, 2015 · 23 comments

@bshillingford

Hi, first of all, thanks for sharing these Dockerfiles. I've been trying to use your kaixhin/cuda, but I can't access the GPUs within the container. I'm fairly certain both the host and container are running the same CUDA version, 7.0.28, but nvidia-smi always outputs Failed to initialize NVML: GPU access blocked by the operating system. nvidia-smi -a produces the same error, so I can't find a way to get more information about it. Do you have any idea what could be causing this?

Thanks!

Brendan

Within the docker container:

$ docker run -ti -v `pwd`/NVIDIA_CUDA-7.0_Samples:/cudasamples --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia1:/dev/nvidia1 --device /dev/nvidia2:/dev/nvidia2 --device /dev/nvidia3:/dev/nvidia3 --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm kaixhin/cuda /bin/bash
root@9279fc160f42:/# nvidia-smi 
Failed to initialize NVML: GPU access blocked by the operating system
root@9279fc160f42:/# /cudasamples/1_Utilities/deviceQuery/deviceQuery 
/cudasamples/1_Utilities/deviceQuery/deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL
root@9279fc160f42:/# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Mon_Feb_16_22:59:02_CST_2015
Cuda compilation tools, release 7.0, V7.0.27

On the host:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Mon_Feb_16_22:59:02_CST_2015
Cuda compilation tools, release 7.0, V7.0.27
$ modinfo nvidia | grep version
version:        346.47
vermagic:       3.16.0-31-generic SMP mod_unload modversions 
$ nvidia-smi
Wed Apr  8 23:47:44 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 346.47     Driver Version: 346.47         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  On   | 0000:04:00.0     Off |                  N/A |
| 26%   28C    P8    14W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  On   | 0000:08:00.0     Off |                  N/A |
| 26%   28C    P8    14W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  On   | 0000:85:00.0     Off |                  N/A |
| 26%   29C    P8    13W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  On   | 0000:89:00.0     Off |                  N/A |
| 26%   28C    P8    14W / 250W |     15MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
@bshillingford
Author

Never mind, I fooled myself. My host has driver 346.47, but the driver version included in the CUDA installer is 346.46. Thanks!

@Kaixhin
Owner

Kaixhin commented Apr 9, 2015

That's quite a subtle problem, so I'm glad you managed to solve the issue yourself. I'll be pushing an update soon which includes the CUDA driver version in the readmes so that people can check their hosts first.

@leopd

leopd commented Jun 4, 2015

Got this same error with a mismatch of driver versions: 346.46 to 346.72. Make sure they match exactly!
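For anyone hitting this later: a quick way to confirm such a mismatch is to compare the kernel module version on the host with the user-space NVIDIA libraries inside the container. A rough sketch (library paths vary by distribution and install method):

# On the host: version of the loaded kernel module
cat /proc/driver/nvidia/version

# Inside the container: the suffix of the library file names is the user-space driver version
find / -name 'libnvidia-ml.so.*' 2>/dev/null

If the two versions differ, even by a minor revision (e.g. 346.47 vs 346.46 above), NVML will refuse to initialize.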

@nh2

nh2 commented Aug 10, 2015

Useful to know for passers-by: you cannot combine this Dockerfile with a host that has CUDA installed via one of the DEB packages from https://developer.nvidia.com/cuda-downloads, because those usually contain newer driver versions than the .run file (e.g. cuda_7.0.28_linux.run still bundles 346.46 while the DEBs already contain 346.82).

To get around this, either downgrade your host driver to also use the one from the .run file, or build a newer container based on the DEBs - I chose to do that and built it like this:

# Install CUDA
docker run -ti -v $HOME/Downloads/:/cuda-installer ubuntu:14.04 bash

# Inside the container:
dpkg -i /cuda-installer/cuda-repo-ubuntu1404-7-0-local_7.0-28_amd64.deb
apt-get update
apt-get install cuda
apt-get purge cuda-repo-ubuntu1404-7-0-local
apt-get clean
rm -rf /var/lib/apt/lists/*

# Outside, commit the container
docker commit CONTAINER_ID $USER/cuda-deb:7.0.28-346.82

(Not using a Dockerfile here because it makes it really hard to get a file from the host into the image without blowing up the image size.)
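To use an image committed like this, run it the same way as the earlier example, passing the NVIDIA device nodes through (the tag below is simply the one committed above):

docker run -ti --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 $USER/cuda-deb:7.0.28-346.82 nvidia-smi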

@tengpeng

tengpeng commented Feb 8, 2016

@nh2 I followed your guidance and everything looks fine, but the problem persists.

My host runs CentOS, while the Docker image is built on Ubuntu. Does that make a difference?

@tengpeng

tengpeng commented Feb 9, 2016

#host 
nvidia-smi
Tue Feb  9 01:30:58 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.79     Driver Version: 352.79         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  Off  | 0000:01:00.0      On |                  N/A |
| 40%   27C    P8     1W /  38W |    274MiB /  2047MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     10505    G   /usr/bin/X                                      91MiB |
|    0     10647    G   /usr/bin/gnome-shell                            91MiB |
|    0     12965    G   ...s-passed-by-fd --v8-snapshot-passed-by-fd    82MiB |
+-----------------------------------------------------------------------------+
#docker
root@84b9f6e2d047:~# nvidia-smi
Failed to initialize NVML: GPU access blocked by the operating system
root@fefd58c66866:~# lsmod | grep nv
nvidia               8536971  54 
drm                   349210  5 i915,drm_kms_helper,nvidia
i2c_core               40582  7 drm,i915,i2c_i801,i2c_hid,drm_kms_helper,i2c_algo_bit,nvidia
root@fefd58c66866:~# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.79  Wed Jan 13 16:17:53 PST 2016
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) 

@Kaixhin
Owner

Kaixhin commented Feb 9, 2016

@tengpeng The OS mismatch is almost certainly the problem when it comes to drivers. I am only going to support Ubuntu (but will take PRs for CentOS if anyone wants to maintain that), so I suggest trying NVIDIA's nvidia-docker project and letting me know how it goes so that I can update my documentation.

@tengpeng

tengpeng commented Feb 9, 2016

@Kaixhin I tried NVIDIA's nvidia-docker. All of its tests pass and I can use DIGITS. However, I can't get the mxnet image to work.

sudo nvidia-docker run kaixhin/cuda-mxnet:latest nvidia-smi
Failed to initialize NVML: Unknown Error

@Kaixhin
Owner

Kaixhin commented Feb 10, 2016

@tengpeng Looks like you will have to build your own image, using this Dockerfile for reference. It will need to start from an NVIDIA CentOS image and replace any Ubuntu-specific commands with their CentOS equivalents (e.g. apt-get with yum).
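For reference, a rough, untested sketch of what such a CentOS-based Dockerfile might look like; the base image tag and the mxnet build flags here are assumptions, so check Docker Hub and the mxnet build docs for the exact values:

# Hypothetical CentOS variant of the cuda-mxnet Dockerfile (untested sketch)
FROM nvidia/cuda:7.5-devel-centos7

# Build dependencies, installed with yum instead of apt-get
RUN yum install -y git make gcc-c++ atlas-devel && yum clean all

# Build mxnet with CUDA support (OpenCV disabled to keep the dependency list short)
RUN git clone --recursive https://github.com/dmlc/mxnet /mxnet && \
    cd /mxnet && \
    make -j"$(nproc)" USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_OPENCV=0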

@bshillingford
Author

If you want to use NVIDIA's nvidia-docker and build off an Ubuntu image instead of CentOS, see this Dockerfile as an example:

https://github.com/NVIDIA/nvidia-docker/blob/master/ubuntu-14.04/caffe/Dockerfile

That is, you can start from cuda:7.0-cudnn4-runtime.


@tengpeng

@bshillingford Thanks!

@tengpeng

@Kaixhin Thanks! I had thought I could use the cuda image to run the mxnet image.

@Yunrui

Yunrui commented Feb 24, 2016

@tengpeng have you fixed the issue? If so, please share your solution/Dockerfile. I am encountering the same issue. My host is CentOS with NVIDIA driver 352.79 installed, and I am trying to figure out how to make mxnet run on it.

@tengpeng

@Yunrui I ended up installing mxnet directly on the host, rather than running it in Docker. If I understand the author of these Dockerfiles correctly, the Docker images support Ubuntu only.

@jamborta

jamborta commented Apr 17, 2016

Hi, I have the same problem, running on a g2.2xlarge EC2 instance with Ubuntu 14.04. When I run

sudo docker run -ti --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm kaixhin/cuda bash

I get

Failed to initialize NVML: GPU access blocked by the operating system

I compared the driver versions; both are the same:

ubuntu@ip-172-31-43-191:~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.63  Sat Nov 7 21:25:42 PST 2015
GCC version:  gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.1)

Any help would be appreciated.

@nh2

nh2 commented Apr 17, 2016

@jamborta I cannot give you detailed advice, but NVIDIA now provides its own Docker images and tooling (https://github.com/NVIDIA/nvidia-docker), which you might want to try.

@Kaixhin
Owner

Kaixhin commented Apr 18, 2016

@jamborta it appears that this driver version has problems on EC2 (see #7), so you'll want to either try running the container on an instance with upgraded drivers (see @nh2's solution above) or use one of NVIDIA's official images - the latter is probably easier.
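The quickest sanity check with NVIDIA's tooling (assuming the nvidia-docker 1.x wrapper is installed) is along these lines:

# Pulls NVIDIA's official CUDA image and runs nvidia-smi through the wrapper
nvidia-docker run --rm nvidia/cuda nvidia-smi

If that works, the driver plumbing on the host is fine and the problem is limited to the images from this repo.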

@jamborta

jamborta commented Apr 18, 2016

@nh2 @Kaixhin thanks for the replies. I would prefer to use the Docker images from this repo, as I'm using the Torch version. I'm not sure what you mean by not using the latest driver; #7 mentions 352.63, which is the one I have.

@Kaixhin
Owner

Kaixhin commented Apr 18, 2016

@jamborta I found the issues with that driver version on this NVIDIA thread, which is why you will probably want to make your own Docker images using NVIDIA's official tools if you want to use EC2.

Otherwise, if you can (I don't use EC2 and don't know what's possible), you could try reinstalling the drivers and/or CUDA on the host using the runfile.
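If you go the runfile route, something along these lines should install the driver bundled with the toolkit; the exact flags vary between installer versions, so check the installer's --help output first:

# Stop X / anything using the GPU first; --silent --driver installs only the bundled driver
sudo sh cuda_7.0.28_linux.run --silent --driver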

@YIFanH

YIFanH commented Jul 18, 2016

@tengpeng I want to use NVIDIA driver version 352.79 for some research, but the driver in the working Docker setup is 346.46. Could you share a link to your Docker CUDA container? Thank you for your reply.

@rangsimanketkaew

Try updating the NVIDIA driver to the latest version:
sudo nvidia-installer --update

@tommyjcarpenter

Does the container need to be running the same version of CUDA as the host machine?

If so, this is an unprecedented kind of runtime dependency for a container: a Docker container can now only run on specific Docker hosts. Is there even a way to specify such a dependency? I had assumed Docker containers made no assumptions about the underlying Docker host.

@Kaixhin
Owner

Kaixhin commented Oct 2, 2018

@tommyjcarpenter this is a kernel driver issue - you can find more details on the NVIDIA Docker repo. FYI, all CUDA images in this repo now use their setup, so they are subject to whatever restrictions NVIDIA has needed to impose.
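Concretely, the nvidia-docker tooling mounts the host's user-space driver libraries into the container at runtime, so the image itself no longer pins a specific host driver version. A minimal check, assuming nvidia-docker 2 is installed (the image tag is only an example):

# Uses the NVIDIA container runtime to expose the host GPUs and driver libraries
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi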
