
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available #464

Closed
lxzh opened this issue Sep 11, 2017 · 15 comments

Comments

@lxzh commented Sep 11, 2017

After installing Docker and nvidia-docker, I tried to run nvidia-smi through the NVIDIA Caffe image, but there is a warning:
The NVIDIA Driver was not detected. GPU functionality will not be available.
The full log:

nvidia-docker run --rm 1e07735bc788 nvidia-smi

==================
== NVIDIA Caffe ==
==================

NVIDIA Release 17.03 (build 12375)

Container image Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.
Copyright (c) 2014, 2015, The Regents of the University of California (Regents)
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use 'nvidia-docker run' to start this container; see
   https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker .

/usr/local/bin/nvidia_entrypoint.sh: line 31: exec: nvidia-smi: not found
@3XX0 (Member) commented Sep 11, 2017

Does it work with the nvidia/cuda image?
Did you modify the original 17.03 image? It might be similar to #457.
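
For reference, a quick sanity check with the stock CUDA image usually looks like this (a sketch; pick whatever image tag matches your setup):

# pull the plain CUDA base image and check that the driver is visible inside the container
docker pull nvidia/cuda
nvidia-docker run --rm nvidia/cuda nvidia-smi

If the driver is healthy, you should see the host's nvidia-smi table printed from inside the container.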

@lxzh (Author) commented Sep 11, 2017

I haven't modified the original 17.03 image.
It does not work with the nvidia/cuda image either:

nvidia-docker run --rm nvidia/cuda nvidia-smi
container_linux.go:247: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH"
docker: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH".

@3XX0 (Member) commented Sep 12, 2017

What's the output of sudo journalctl -u nvidia-docker?

@lxzh (Author) commented Sep 12, 2017

My OS is Ubuntu 14.04.3, which does not support journalctl (journalctl: command not found).
This is the latest content of nvidia-docker.log, obtained with:
cat /var/log/upstart/nvidia-docker.log

/usr/bin/nvidia-docker-plugin | 2017/09/12 10:52:01 Loading NVIDIA unified memory
/usr/bin/nvidia-docker-plugin | 2017/09/12 10:52:01 Loading NVIDIA management library
/usr/bin/nvidia-docker-plugin | 2017/09/12 10:52:02 Discovering GPU devices
/usr/bin/nvidia-docker-plugin | 2017/09/12 10:52:02 Error: nvml: Unknown Error
/usr/bin/nvidia-docker-plugin | 2017/09/12 10:52:02 Loading NVIDIA unified memory
/usr/bin/nvidia-docker-plugin | 2017/09/12 10:52:02 Loading NVIDIA management library
/usr/bin/nvidia-docker-plugin | 2017/09/12 10:52:02 Discovering GPU devices
/usr/bin/nvidia-docker-plugin | 2017/09/12 10:52:02 Error: nvml: Unknown Error
/usr/bin/nvidia-docker-plugin | 2017/09/12 10:52:02 Loading NVIDIA unified memory
/usr/bin/nvidia-docker-plugin | 2017/09/12 10:52:02 Loading NVIDIA management library
/usr/bin/nvidia-docker-plugin | 2017/09/12 10:52:02 Discovering GPU devices
/usr/bin/nvidia-docker-plugin | 2017/09/12 10:52:02 Error: nvml: Unknown Error
/usr/bin/nvidia-docker-plugin | 2017/09/12 10:52:02 Loading NVIDIA unified memory
/usr/bin/nvidia-docker-plugin | 2017/09/12 10:52:02 Loading NVIDIA management library
/usr/bin/nvidia-docker-plugin | 2017/09/12 10:52:02 Discovering GPU devices
/usr/bin/nvidia-docker-plugin | 2017/09/12 10:52:02 Error: nvml: Unknown Error

@3XX0 (Member) commented Sep 12, 2017

That's not good. Is nvidia-smi working on the host? If not, your driver installation might have gone wrong.

Can you stop the daemon (sudo service nvidia-docker stop) and run it manually to get a debug trace as instructed here?
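
On an Upstart-based system such as Ubuntu 14.04, that roughly amounts to the following (a sketch; NV_DEBUG=1 turns on the plugin's debug output):

# stop the Upstart job, then run the plugin in the foreground with debug logging
sudo service nvidia-docker stop
sudo NV_DEBUG=1 nvidia-docker-plugin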

@lxzh (Author) commented Sep 12, 2017

The nvidia-smi command is working fine on my host.
There is something wrong with stopping the nvidia-docker service:

root@kirin:/etc/default# sudo service stop nvidia-docker
stop: unrecognized service
root@kirin:/etc/default# sudo service nvidia-docker stop
stop: Unknown instance: 
root@kirin:/etc/default# sudo service nvidia-docker restart
stop: Unknown instance: 
nvidia-docker start/post-stop, process 5198
root@kirin:/etc/default# sudo service nvidia-docker stop
stop: Unknown instance: 

The whole log is below (I'm sorry, I can't attach it as a file):

@lxzh (Author) commented Sep 12, 2017

nvidia-smi log:

root@kirin:/etc/default# nvidia-smi
Tue Sep 12 11:28:58 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  ERR!                Off  | 0000:4B:00.0      On |                  N/A |
| 23%   37C    P8    10W / 250W |     43MiB / 11170MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1440    G   /usr/bin/X                                      40MiB |
+-----------------------------------------------------------------------------+

@lxzh (Author) commented Sep 12, 2017

nvidia-smi -q log:
root@kirin:/etc/default# nvidia-smi -q

==============NVSMI LOG==============

Timestamp : Tue Sep 12 11:30:23 2017
Driver Version : 375.26

Attached GPUs : 1
GPU 0000:4B:00.0
Product Name : Unknown Error
Product Brand : GeForce
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321117105335
GPU UUID : GPU-0cdc2da2-a7b7-737c-6551-4514fa7e0070
Minor Number : 0
VBIOS Version : 86.02.39.00.01
MultiGPU Board : No
Board ID : 0x4b00
GPU Part Number : 900-1G611-0050-000
Inforom Version
Image Version : G001.0000.01.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x4B
Device : 0x00
Domain : 0x0000
Device Id : 0x1B0610DE
Bus Id : 0000:4B:00.0
Sub System Id : 0x85E21043
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 23 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
Unknown : Not Active
FB Memory Usage
Total : 11170 MiB
Used : 43 MiB
Free : 11127 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 5 MiB
Free : 251 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 2 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 37 C
GPU Shutdown Temp : 96 C
GPU Slowdown Temp : 93 C
Power Readings
Power Management : Supported
Power Draw : 10.28 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 139 MHz
SM : 139 MHz
Memory : 405 MHz
Video : 544 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 1911 MHz
SM : 1911 MHz
Memory : 5505 MHz
Video : 1708 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 1440
Type : G
Name : /usr/bin/X
Used GPU Memory : 40 MiB

@lxzh (Author) commented Sep 12, 2017

root@kirin:/etc/default# sudo nvidia-docker-plugin
nvidia-docker-plugin | 2017/09/12 11:31:15 Loading NVIDIA unified memory
nvidia-docker-plugin | 2017/09/12 11:31:15 Loading NVIDIA management library
nvidia-docker-plugin | 2017/09/12 11:31:15 Discovering GPU devices
nvidia-docker-plugin | 2017/09/12 11:31:15 Error: nvml: Unknown Error
root@kirin:/etc/default# dmesg | grep -i nvidia
[ 9.900668] nvidia: module license 'NVIDIA' taints kernel.
[ 9.903857] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 9.906932] nvidia-nvlink: Nvlink Core is being initialized, major device number 247
[ 9.906948] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 375.26 Thu Dec 8 18:36:43 PST 2016 (using threaded interrupts)
[ 9.910522] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 375.26 Thu Dec 8 18:04:14 PST 2016
[ 9.911811] [drm] [nvidia-drm] [GPU ID 0x00004b00] Loading driver
[ 9.982120] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 246
[ 10.445858] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.0/0000:4b:00.1/sound/card1/input14
[ 10.445913] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.0/0000:4b:00.1/sound/card1/input15
[ 10.445956] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.0/0000:4b:00.1/sound/card1/input16
[ 10.446000] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.0/0000:4b:00.1/sound/card1/input17
[ 10.780878] NVRM: on the primary VGA device. The NVIDIA Linux graphics driver
[ 10.904484] init: nvidia-docker main process (777) terminated with status 1
[ 10.904493] init: nvidia-docker main process ended, respawning
[ 11.488958] init: nvidia-prime main process (1183) terminated with status 127
[ 11.673292] init: nvidia-docker main process (1108) terminated with status 1
[ 11.673301] init: nvidia-docker main process ended, respawning
[ 12.439932] init: nvidia-docker main process (1395) terminated with status 1
[ 12.439941] init: nvidia-docker main process ended, respawning
[ 13.083939] nvidia-modeset: Allocated GPU:0 (GPU-0cdc2da2-a7b7-737c-6551-4514fa7e0070) @ PCI:0000:4b:00.0
[ 13.084377] init: nvidia-docker main process (1542) terminated with status 1
[ 13.084385] init: nvidia-docker main process ended, respawning
[ 13.097214] init: nvidia-docker main process (1556) terminated with status 1
[ 13.097221] init: nvidia-docker main process ended, respawning
[ 13.105832] init: nvidia-docker main process (1570) terminated with status 1
[ 13.105839] init: nvidia-docker respawning too fast, stopped
[ 2270.745125] init: nvidia-docker main process (5186) terminated with status 1
[ 2270.745148] init: nvidia-docker main process ended, respawning
[ 2270.762785] init: nvidia-docker main process (5200) terminated with status 1
[ 2270.762805] init: nvidia-docker main process ended, respawning
[ 2270.779454] init: nvidia-docker main process (5215) terminated with status 1
[ 2270.779473] init: nvidia-docker main process ended, respawning
[ 2270.797194] init: nvidia-docker main process (5229) terminated with status 1
[ 2270.797214] init: nvidia-docker main process ended, respawning
[ 2270.812970] init: nvidia-docker main process (5243) terminated with status 1
[ 2270.812990] init: nvidia-docker main process ended, respawning
[ 2270.829314] init: nvidia-docker main process (5258) terminated with status 1
[ 2270.829331] init: nvidia-docker respawning too fast, stopped

@lxzh (Author) commented Sep 12, 2017

root@kirin:/etc/default# NV_DEBUG=1 nvidia-docker-plugin
nvidia-docker-plugin | 2017/09/12 11:31:59 Loading NVIDIA unified memory
nvidia-docker-plugin | 2017/09/12 11:31:59 Loading NVIDIA management library
nvidia-docker-plugin | 2017/09/12 11:31:59 Discovering GPU devices
nvidia-docker-plugin | 2017/09/12 11:31:59 Error: nvml: Unknown Error
nvidia-docker-plugin | 2017/09/12 11:31:59 /go/src/github.com/NVIDIA/nvidia-docker/src/nvidia-docker-plugin/main.go:48 (0x404790)
/usr/local/go/src/runtime/asm_amd64.s:437 (0x468cce)
/usr/local/go/src/runtime/panic.go:423 (0x438b89)
/usr/local/go/src/log/log.go:334 (0x48bf41)
/go/src/github.com/NVIDIA/nvidia-docker/src/nvidia-docker-plugin/main.go:38 (0x40461a)
/go/src/github.com/NVIDIA/nvidia-docker/src/nvidia-docker-plugin/main.go:75 (0x404e0c)
/usr/local/go/src/runtime/proc.go:111 (0x43b070)
/usr/local/go/src/runtime/asm_amd64.s:1721 (0x46b021)
root@kirin:/etc/default# ltrace -f nvidia-docker-plugin
[pid 5450] __libc_start_main(0x46bf60, 1, 0x7fff690fe1d8, 0x752bd0 <unfinished ...>
[pid 5450] pthread_once(0xd003a8, 0x728db0, 0x7fff690fe1e8, 0 <unfinished ...>
[pid 5450] malloc(104) = 0x25ac010
[pid 5450] pthread_mutexattr_init(0x7fff690fe030, 0x25ac070, 0x25ac010, 0x7fb9adc83760) = 0
[pid 5450] pthread_mutexattr_settype(0x7fff690fe030, 1, 0x25ac010, 0x7fb9adc83760) = 0
[pid 5450] pthread_mutex_init(0xd008c0, 0x7fff690fe030, 0, 0x7fb9adc83760) = 0
[pid 5450] pthread_mutexattr_destroy(0x7fff690fe030, 0x7fff690fe030, 0, 0) = 0
[pid 5450] pthread_mutexattr_init(0x7fff690fe040, 0x7fff690fe030, 0, 0) = 0
[pid 5450] pthread_mutexattr_settype(0x7fff690fe040, 1, 0, 0) = 0
[pid 5450] pthread_mutex_init(0xd00900, 0x7fff690fe040, 0, 0) = 0
[pid 5450] pthread_mutexattr_destroy(0x7fff690fe040, 0x7fff690fe040, 0, 0) = 0
[pid 5450] __cxa_atexit(0x72a7f0, 0, 0, 0) = 0
[pid 5450] <... pthread_once resumed> ) = 0
[pid 5450] __cxa_atexit(0x72a830, 0, 0xcbbaa8, -1) = 0
[pid 5450] pthread_attr_init(0x7fff690fe080, 0x46a640, 0xbfebfbff, 0xcdbce0) = 0
[pid 5450] pthread_attr_getstacksize(0x7fff690fe080, 0x7fff690fe078, 0xbfebfbff, 0) = 0
[pid 5450] pthread_attr_destroy(0x7fff690fe080, 1, 0x800000, 0) = 0
[pid 5450] malloc(24) = 0x25ac080
[pid 5450] sigfillset(<31-32>) = 0
[pid 5450] pthread_sigmask(2, 0x7fff690fdec0, 0x7fff690fdf40, 0x7fb9adc83760) = 0
[pid 5450] pthread_attr_init(0x7fff690fde80, 0x7fff690fdec0, 0, 0) = 0
[pid 5450] pthread_attr_getstacksize(0x7fff690fde80, 0x7fff690fde78, 0, 0) = 0
[pid 5450] pthread_create(0x7fff690fde70, 0x7fff690fde80, 0x711c40, 0x25ac080) = 0
[pid 5450] pthread_sigmask(2, 0x7fff690fdf40, 0, -1 <unfinished ...>
[pid 5451] free(0x25ac080 <unfinished ...>
[pid 5450] <... pthread_sigmask resumed> ) = 0
[pid 5451] <... free resumed> ) =
[pid 5450] malloc(24) = 0x25ac080
[pid 5450] sigfillset(
<31-32>) = 0
[pid 5450] pthread_sigmask(2, 0x7fff690fddf0, 0x7fff690fde70, 0x7fb9adc83760) = 0
[pid 5450] pthread_attr_init(0x7fff690fddb0, 0x7fff690fddf0, 0, 0) = 0
[pid 5450] pthread_attr_getstacksize(0x7fff690fddb0, 0x7fff690fdda8, 0, 0) = 0
[pid 5450] pthread_create(0x7fff690fdda0, 0x7fff690fddb0, 0x711c40, 0x25ac080) = 0
[pid 5450] pthread_sigmask(2, 0x7fff690fde70, 0, -1 <unfinished ...>
[pid 5452] free(0x25ac080 <unfinished ...>
[pid 5450] <... pthread_sigmask resumed> ) = 0
[pid 5452] <... free resumed> ) =
[pid 5450] malloc(24) = 0x25ac080
[pid 5450] sigfillset(<31-32>) = 0
[pid 5450] pthread_sigmask(2, 0x7fff690fdca0, 0x7fff690fdd20, 0x7fb9adc83760) = 0
[pid 5450] pthread_attr_init(0x7fff690fdc60, 0x7fff690fdca0, 0, 0) = 0
[pid 5452] malloc(24 <unfinished ...>
[pid 5450] pthread_attr_getstacksize(0x7fff690fdc60, 0x7fff690fdc58, 0, 0) = 0
[pid 5450] pthread_create(0x7fff690fdc50, 0x7fff690fdc60, 0x711c40, 0x25ac080) = 0
[pid 5450] pthread_sigmask(2, 0x7fff690fdd20, 0, -1 <unfinished ...>
[pid 5453] free(0x25ac080 <unfinished ...>
[pid 5452] <... malloc resumed> ) = 0x7fb9a80008c0
[pid 5450] <... pthread_sigmask resumed> ) = 0
[pid 5453] <... free resumed> ) =
[pid 5452] sigfillset(
<31-32>) = 0
[pid 5452] pthread_sigmask(2, 0x7fb9ad082c30, 0x7fb9ad082cb0, 0x7fb9a8000020) = 0
[pid 5452] pthread_attr_init(0x7fb9ad082bf0, 0x7fb9ad082c30, 0, 0) = 0
[pid 5452] pthread_attr_getstacksize(0x7fb9ad082bf0, 0x7fb9ad082be8, 0, 0) = 0
[pid 5452] pthread_create(0x7fb9ad082be0, 0x7fb9ad082bf0, 0x711c40, 0x7fb9a80008c0 <unfinished ...>
[pid 5454] free(0x7fb9a80008c0 <unfinished ...>
[pid 5452] <... pthread_create resumed> ) = 0
[pid 5454] <... free resumed> ) =
[pid 5452] pthread_sigmask(2, 0x7fb9ad082cb0, 0, -1) = 0
[pid 5450] pthread_mutex_lock(0xcffe20, 0xcdbce0, 0xc820057f18, 0) = 0
[pid 5450] pthread_cond_broadcast(0xcffe60, 0, 0, 0) = 0
[pid 5450] pthread_mutex_unlock(0xcffe20, 1, 0, 0) = 0
nvidia-docker-plugin | 2017/09/12 11:32:36 Loading NVIDIA unified memory
[pid 5455] --- Called exec() ---
[pid 5455] __libc_start_main(0x4017a0, 3, 0x7ffc562295e8, 0x401290 <unfinished ...>
[pid 5455] __strdup(0x7ffc5622a76c, 0x7ffc562295e8, 0x7ffc562295e8, 0) = 0xded010
[pid 5455] free(0xded010) =
[pid 5455] __strdup(0x7ffc5622a76f, 0x7ffc562295e8, 0x7ffc562295e8, 0) = 0xded010
[pid 5455] __strtol_internal("0", 0x7ffc56229360, 0) = 0
[pid 5455] free(0xded010) =
[pid 5455] fopen("/proc/modules", "r") = 0xded090
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, -1) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 118) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 105) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 102) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 120) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 120) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 105) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 102) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 102) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 102) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 120) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 120) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 102) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 102) = 1

@lxzh (Author) commented Sep 12, 2017

[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 98) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 98) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 115) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 108) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 111) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 120) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 105) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 105) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 120) = 1
[pid 5455] fscanf(0xded090, 0x405a6c, 0x7ffc56229340, 115) = 1
[pid 5455] fclose(0xded090) = 0
[pid 5455] fopen("/proc/devices", "r") = 0xded090
[pid 5455] fgets("Character devices:\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] ferror(0xded090) = 0
[pid 5455] fgets(" 1 mem\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr(" 1 mem\n", "nvidia-uvm") = nil
[pid 5455] fgets(" 4 /dev/vc/0\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr(" 4 /dev/vc/0\n", "nvidia-uvm") = nil
[pid 5455] fgets(" 4 tty\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr(" 4 tty\n", "nvidia-uvm") = nil
[pid 5455] fgets(" 4 ttyS\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr(" 4 ttyS\n", "nvidia-uvm") = nil
[pid 5455] fgets(" 5 /dev/tty\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr(" 5 /dev/tty\n", "nvidia-uvm") = nil
[pid 5455] fgets(" 5 /dev/console\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr(" 5 /dev/console\n", "nvidia-uvm") = nil
[pid 5455] fgets(" 5 /dev/ptmx\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr(" 5 /dev/ptmx\n", "nvidia-uvm") = nil
[pid 5455] fgets(" 5 ttyprintk\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr(" 5 ttyprintk\n", "nvidia-uvm") = nil
[pid 5455] fgets(" 6 lp\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr(" 6 lp\n", "nvidia-uvm") = nil
[pid 5455] fgets(" 7 vcs\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr(" 7 vcs\n", "nvidia-uvm") = nil
[pid 5455] fgets(" 10 misc\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr(" 10 misc\n", "nvidia-uvm") = nil
[pid 5455] fgets(" 13 input\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr(" 13 input\n", "nvidia-uvm") = nil
[pid 5455] fgets(" 21 sg\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr(" 21 sg\n", "nvidia-uvm") = nil
[pid 5455] fgets(" 29 fb\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr(" 29 fb\n", "nvidia-uvm") = nil
[pid 5455] fgets(" 89 i2c\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr(" 89 i2c\n", "nvidia-uvm") = nil
[pid 5455] fgets(" 99 ppdev\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr(" 99 ppdev\n", "nvidia-uvm") = nil
[pid 5455] fgets("108 ppp\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr("108 ppp\n", "nvidia-uvm") = nil
[pid 5455] fgets("116 alsa\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr("116 alsa\n", "nvidia-uvm") = nil
[pid 5455] fgets("128 ptm\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr("128 ptm\n", "nvidia-uvm") = nil
[pid 5455] fgets("136 pts\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr("136 pts\n", "nvidia-uvm") = nil
[pid 5455] fgets("180 usb\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr("180 usb\n", "nvidia-uvm") = nil
[pid 5455] fgets("189 usb_device\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr("189 usb_device\n", "nvidia-uvm") = nil
[pid 5455] fgets("195 nvidia-frontend\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr("195 nvidia-frontend\n", "nvidia-uvm") = nil
[pid 5455] fgets("216 rfcomm\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr("216 rfcomm\n", "nvidia-uvm") = nil
[pid 5455] fgets("226 drm\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr("226 drm\n", "nvidia-uvm") = nil
[pid 5455] fgets("246 nvidia-uvm\n", 255, 0xded090) = 0x7ffc56229260
[pid 5455] strstr("246 nvidia-uvm\n", "nvidia-uvm") = "nvidia-uvm\n"
[pid 5455] sscanf(0x7ffc56229260, 0x405b3b, 0x7ffc5622936c, 0) = 1
[pid 5455] fclose(0xded090) = 0
[pid 5455] __xstat(1, "/dev/nvidia-uvm", 0x7ffc56229160) = 0
[pid 5455] __xstat(1, "/dev/nvidia-uvm-tools", 0x7ffc56229160) = 0
[pid 5455] +++ exited (status 0) +++
[pid 5450] --- SIGCHLD (Child exited) ---
nvidia-docker-plugin | 2017/09/12 11:32:36 Loading NVIDIA management library
[pid 5450] setenv("CUDA_DISABLE_UNIFIED_MEMORY", "1", 1) = 0
[pid 5450] setenv("CUDA_CACHE_DISABLE", "1", 1) = 0
[pid 5450] unsetenv("@\352\n \310") =
[pid 5450] dlopen("libnvidia-ml.so.1", 257) = 0x25ac650
[pid 5450] nvmlInit_v2(0x7fb9ae6dd968, 0x25ac5f8, 0x25ac650, 0) = 0
nvidia-docker-plugin | 2017/09/12 11:32:36 Discovering GPU devices
[pid 5450] nvmlDeviceGetCount_v2(0xc82007cf50, 0xcdbce0, 0xc820057b40, 0xc820057ba8) = 0
[pid 5450] nvmlDeviceGetHandleByIndex_v2(0, 0xc82008c048, 0xc82008c048, 0xc820057a98) = 0
[pid 5450] nvmlDeviceGetName(0x7fb9a7fe67a8, 0xc820078580, 64, 0xc820078580) = 999
[pid 5450] nvmlErrorString(999, 0xcdbce0, 0xc820057978, 0xc8200579e8) = 0x7fb9a7dbe074
nvidia-docker-plugin | 2017/09/12 11:32:36 Error: nvml: Unknown Error
[pid 5450] nvmlShutdown(0xc820057af0, 0xcdbce0, 0xc820057a98, 0xc820057af0) = 0
[pid 5450] dlclose(0x25ac650) = 0
[pid 5454] +++ exited (status 1) +++
[pid 5453] +++ exited (status 1) +++
[pid 5452] +++ exited (status 1) +++
[pid 5451] +++ exited (status 1) +++
[pid 5450] +++ exited (status 1) +++
root@kirin:/etc/default# nvidia-smi topo -m
GPU0 CPU Affinity
GPU0 X 0-11

Legend:

X = Self
SOC = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI)
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks

@lxzh (Author) commented Sep 12, 2017

I'm sorry, I can't post much text at one time due to our company's information security policy.

@3XX0 (Member) commented Sep 12, 2017

It's clearly a driver issue, as shown by nvidia-smi (ERR!). Can you try upgrading to the latest stable driver (i.e. 384.XX)?
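
On Ubuntu 14.04 one possible upgrade path is via the graphics-drivers PPA (a sketch only; the nvidia-384 package name is an assumption about what the PPA provides, and a reboot is needed afterwards):

# add the PPA, install a newer driver package, then reboot so the new kernel module loads
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-384
sudo reboot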

@lxzh (Author) commented Sep 12, 2017

Success after upgrading the driver to 384.69. Thank you very much.

nvidia-docker run --rm 1e07735bc788 nvidia-smi

==================
== NVIDIA Caffe ==
==================

NVIDIA Release 17.03 (build 12375)

Container image Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.
Copyright (c) 2014, 2015, The Regents of the University of California (Regents)
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

Tue Sep 12 06:25:42 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.69                 Driver Version: 384.69                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:4B:00.0  On |                  N/A |
| 23%   37C    P8    10W / 250W |     53MiB / 11171MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

nvidia-docker run --rm nvidia/cuda:latest nvidia-smi

Tue Sep 12 06:29:17 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.69                 Driver Version: 384.69                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:4B:00.0  On |                  N/A |
| 23%   37C    P8    10W / 250W |     53MiB / 11171MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

nvidia-docker run -it --name ljf_caffe -v /home:/home nvcr.io/nvidia/caffe:17.03 /bin/bash

==================
== NVIDIA Caffe ==
==================

NVIDIA Release 17.03 (build 12375)

Container image Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.
Copyright (c) 2014, 2015, The Regents of the University of California (Regents)
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

@lxzh closed this as completed on Sep 12, 2017
@paolorota commented:

I feel obliged to reopen this; with the latest upgrades I have the same problem on 5/5 machines.
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available. Use 'nvidia-docker run' to start this container; see https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker

So basically Docker does not see the drivers, and therefore TensorFlow no longer works.

I am using the following image: nvcr.io/nvidia/tensorflow:19.08-py3, and I have installed NVIDIA driver 390.116.

Docker version:
Docker version 19.03.1, build 74b1e89

Can anyone help? I tried purging the drivers and reinstalling a different version, but the result is the same. nvidia-docker is not found, so I run docker normally as described in the guide.
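
For reference, with Docker 19.03 the nvidia-docker wrapper is no longer required; GPU access is requested with the native --gpus flag (a sketch, assuming the NVIDIA Container Toolkit is installed):

# request all GPUs via Docker 19.03's built-in flag and check driver visibility in the container
docker run --rm --gpus all nvcr.io/nvidia/tensorflow:19.08-py3 nvidia-smi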
