This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

Driver timeout error during initialization while running "docker run --gpus 2 nvidia/cuda:9.0-base nvidia-smi" #1133

Closed
silencekev opened this issue Nov 22, 2019 · 12 comments


@silencekev

Hi,
I'm trying to get nvidia-docker to work in a full-stack simulator flow. I hit a driver timeout issue while running "docker run --gpus 2 nvidia/cuda:9.0-base nvidia-smi".

1. Issue or feature description

Here are the messages I see:

root@fsf-linux-x64:~/host-shared# docker run -it --rm --gpus 1 ubuntu /bin/bash
docker: Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: timed out\\n\""": unknown.
ERRO[0098] error waiting for container: context canceled

I found a similar issue, #628, which was fixed by running nvidia-persistenced. I did the same thing and confirmed persistence mode is enabled by running nvidia-smi on the host.
root@fsf-linux-x64:~/host-shared# nvidia-smi
Fri Nov 22 06:34:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 445.00       Driver Version: 445.00       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Graphics Device     On   | 00000000:00:01.0 Off |                  N/A |
| N/A  ERR!   N/A    N/A /  N/A |      0MiB /   860MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  Graphics Device     On   | 00000000:00:02.0 Off |                  N/A |
| N/A  ERR!   N/A    N/A /  N/A |      0MiB /   860MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I also checked the driver's dmesg output. I don't see any obvious error.

2. Steps to reproduce the issue

This is running on a simulator rather than silicon. Setting up the simulator environment is a long story.

3. Information

docker_issue_report_info.txt

driver_info.log
dmesg.log

Since this is running on a simulator (fmodel), it's expected to be slow. Can persistence mode disable the timeout check? Is there any other way to disable the timeout check?

@RenaudWasTaken
Contributor

You didn't attach the output of the most important command: nvidia-container-cli -k -d /dev/tty info

My guess is that your driver installation isn't set up correctly. You should try to install and run a normal CUDA sample (e.g. deviceQuery).

Please re-install your driver: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
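For reference, a minimal sanity check against the CUDA driver API can serve the same purpose as deviceQuery. This is only a sketch; the file name and build line below are illustrative:

/* sanity_check.c -- minimal sketch, not the actual deviceQuery sample.
 * Build (with the driver installed): gcc sanity_check.c -o sanity_check -lcuda */
#include <stdio.h>
#include <cuda.h>

int main(void)
{
        /* cuInit is the same driver call the nvidia-container prestart hook waits on. */
        CUresult rc = cuInit(0);
        if (rc != CUDA_SUCCESS) {
                fprintf(stderr, "cuInit failed with error %d\n", (int)rc);
                return 1;
        }

        int count = 0;
        if (cuDeviceGetCount(&count) != CUDA_SUCCESS) {
                fprintf(stderr, "cuDeviceGetCount failed\n");
                return 1;
        }
        printf("cuInit OK, %d CUDA device(s) visible\n", count);
        return 0;
}

If this check succeeds quickly on the host, the driver install itself is probably fine; if cuInit itself is very slow, a timeout further up the stack becomes plausible.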

@silencekev
Author

silencekev commented Nov 25, 2019

It's in the attached docker_issue_report_info.txt. Pasting it again here:

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
    NVRM version: 445.00
    CUDA version: 11.0

Device Index: 0
Device Minor: 0
Model: Graphics Device
Brand: Tesla
GPU UUID: GPU-f6725d35-79da-1beb-34cb-e45266743217
Bus Location: 00000000:00:01.0
Architecture: 8.0

Device Index: 1
Device Minor: 1
Model: Graphics Device
Brand: Tesla
GPU UUID: GPU-f6725d35-79da-1beb-34cb-e45266743217
Bus Location: 00000000:00:02.0
Architecture: 8.0

On the host, I can successfully run a CUDA application named "acos". Again, I'm running the container in a virtualized QEMU OS of the FSF flow.

@RenaudWasTaken
Contributor

On my machine the command nvidia-container-cli -k -d /dev/tty info prints a lot of log information that is omitted here.
e.g.:

I1125 03:14:01.688101 16493 nvc.c:281] initializing library context (version=1.0.5, build=13b836390888f7b7c7dca115d16d7e28ab15a836)
I1125 03:14:01.688303 16493 nvc.c:255] using root /
I1125 03:14:01.688372 16493 nvc.c:256] using ldcache /etc/ld.so.cache
I1125 03:14:01.688400 16493 nvc.c:257] using unprivileged user 1000:1000
W1125 03:14:01.690219 16494 nvc.c:186] failed to set inheritable capabilities
W1125 03:14:01.690339 16494 nvc.c:187] skipping kernel modules load due to failure
I1125 03:14:01.690843 16495 driver.c:133] starting driver service
I1125 03:14:01.751051 16493 nvc_info.c:437] requesting driver information with ''
I1125 03:14:01.751314 16493 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.410.48
I1125 03:14:01.751473 16493 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.410.48
I1125 03:14:01.751520 16493 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/tls/libnvidia-tls.so.410.48
I1125 03:14:01.751559 16493 nvc_info.c:153] skipping /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.410.48
I1125 03:14:01.751605 16493 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.410.48
I1125 03:14:01.751649 16493 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.48
I1125 03:14:01.751706 16493 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.410.48
I1125 03:14:01.751774 16493 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.410.48
I1125 03:14:01.751814 16493 nvc_info.c:151] selecting /usr/local/cuda-10.0/lib64/libnvidia-ml.so.410.48
I1125 03:14:01.751860 16493 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.410.48 over /usr/local/cuda-10.0/lib64/libnv

@RenaudWasTaken
Contributor

Typically the timeout errors are caused by a bad driver install.
Note that the timeout is 10 seconds and is not configurable.

@silencekev
Author

Oh, I didn't realize those messages were important. I tried the command again, but it reports a driver error this time:
root@fsf-linux-x64:/mnt/tmp/input# nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I1125 09:57:37.507340 1396 nvc.c:281] initializing library context (version=1.0.5, build=13b836390888f7b7c7dca115d16d7e28ab15a836)
I1125 09:57:37.507420 1396 nvc.c:255] using root /
I1125 09:57:37.507430 1396 nvc.c:256] using ldcache /etc/ld.so.cache
I1125 09:57:37.507438 1396 nvc.c:257] using unprivileged user 65534:65534
I1125 09:57:37.528841 1397 nvc.c:191] loading kernel module nvidia
I1125 09:57:37.529745 1397 nvc.c:203] loading kernel module nvidia_uvm
I1125 09:57:37.564577 1397 nvc.c:211] loading kernel module nvidia_modeset
I1125 09:57:37.610527 1416 driver.c:133] starting driver service
W1125 09:58:03.487165 1396 driver.c:220] terminating driver service (forced)
I1125 10:02:20.196894 1396 driver.c:233] driver service terminated with signal 15
nvidia-container-cli: initialization error: driver error: timed out

Let me try to re-install the driver.

> Note that the timeout is 10 seconds and is not configurable.
If I want to change the value, I have to compile the source code myself, right?

@silencekev
Author

It looks "nvidia-container-cli -k -d /dev/tty info" failed just because I forgot to enable persistence mode.
After enabling it, I got the information:
root@fsf-linux-x64:/mnt/tmp/input# nvidia-container-cli -k -d /dev/tty info |& tee info.txt

-- WARNING, the following logs are for debugging purposes only --

I1125 10:30:07.223710 1612 nvc.c:281] initializing library context (version=1.0.5, build=13b836390888f7b7c7dca115d16d7e28ab15a836)
I1125 10:30:07.223804 1612 nvc.c:255] using root /
I1125 10:30:07.223823 1612 nvc.c:256] using ldcache /etc/ld.so.cache
I1125 10:30:07.223838 1612 nvc.c:257] using unprivileged user 65534:65534
I1125 10:30:07.252028 1614 nvc.c:191] loading kernel module nvidia
I1125 10:30:07.252466 1614 nvc.c:203] loading kernel module nvidia_uvm
I1125 10:30:07.252598 1614 nvc.c:211] loading kernel module nvidia_modeset
I1125 10:30:07.293319 1615 driver.c:133] starting driver service
I1125 10:30:12.479316 1612 nvc_info.c:437] requesting driver information with ''
I1125 10:30:12.515486 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.445.00
I1125 10:30:12.515642 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.445.00
I1125 10:30:12.515728 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.445.00
I1125 10:30:12.515811 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.445.00
I1125 10:30:12.515887 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.445.00
I1125 10:30:12.515986 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.445.00
I1125 10:30:12.516169 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.445.00
I1125 10:30:12.516252 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.445.00
I1125 10:30:12.516351 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.445.00
I1125 10:30:12.516454 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.445.00
I1125 10:30:12.516523 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.445.00
I1125 10:30:12.516594 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.445.00
I1125 10:30:12.516665 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.445.00
I1125 10:30:12.516779 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.445.00
I1125 10:30:12.516856 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.445.00
I1125 10:30:12.516955 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.445.00
I1125 10:30:12.517032 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.445.00
I1125 10:30:12.517107 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.445.00
I1125 10:30:12.517210 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.445.00
I1125 10:30:12.517405 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.445.00
I1125 10:30:12.517554 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.445.00
I1125 10:30:12.517632 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.445.00
I1125 10:30:12.517705 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.445.00
I1125 10:30:12.517792 1612 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.445.00
W1125 10:30:12.517840 1612 nvc_info.c:306] missing compat32 library libnvidia-ml.so
W1125 10:30:12.517860 1612 nvc_info.c:306] missing compat32 library libnvidia-cfg.so
W1125 10:30:12.517876 1612 nvc_info.c:306] missing compat32 library libcuda.so
W1125 10:30:12.517891 1612 nvc_info.c:306] missing compat32 library libnvidia-opencl.so
W1125 10:30:12.517906 1612 nvc_info.c:306] missing compat32 library libnvidia-ptxjitcompiler.so
W1125 10:30:12.517922 1612 nvc_info.c:306] missing compat32 library libnvidia-fatbinaryloader.so
W1125 10:30:12.517937 1612 nvc_info.c:306] missing compat32 library libnvidia-compiler.so
W1125 10:30:12.517952 1612 nvc_info.c:306] missing compat32 library libvdpau_nvidia.so
W1125 10:30:12.517968 1612 nvc_info.c:306] missing compat32 library libnvidia-encode.so
W1125 10:30:12.517983 1612 nvc_info.c:306] missing compat32 library libnvidia-opticalflow.so
W1125 10:30:12.517998 1612 nvc_info.c:306] missing compat32 library libnvcuvid.so
W1125 10:30:12.518014 1612 nvc_info.c:306] missing compat32 library libnvidia-eglcore.so
W1125 10:30:12.518029 1612 nvc_info.c:306] missing compat32 library libnvidia-glcore.so
W1125 10:30:12.518044 1612 nvc_info.c:306] missing compat32 library libnvidia-tls.so
W1125 10:30:12.518059 1612 nvc_info.c:306] missing compat32 library libnvidia-glsi.so
W1125 10:30:12.518074 1612 nvc_info.c:306] missing compat32 library libnvidia-fbc.so
W1125 10:30:12.518089 1612 nvc_info.c:306] missing compat32 library libnvidia-ifr.so
W1125 10:30:12.518105 1612 nvc_info.c:306] missing compat32 library libnvidia-rtcore.so
W1125 10:30:12.518120 1612 nvc_info.c:306] missing compat32 library libnvoptix.so
W1125 10:30:12.518135 1612 nvc_info.c:306] missing compat32 library libGLX_nvidia.so
W1125 10:30:12.518149 1612 nvc_info.c:306] missing compat32 library libEGL_nvidia.so
W1125 10:30:12.518164 1612 nvc_info.c:306] missing compat32 library libGLESv2_nvidia.so
W1125 10:30:12.518179 1612 nvc_info.c:306] missing compat32 library libGLESv1_CM_nvidia.so
W1125 10:30:12.518194 1612 nvc_info.c:306] missing compat32 library libnvidia-glvkspirv.so
I1125 10:30:12.518617 1612 nvc_info.c:232] selecting /usr/bin/nvidia-smi
I1125 10:30:12.518682 1612 nvc_info.c:232] selecting /usr/bin/nvidia-debugdump
I1125 10:30:12.518724 1612 nvc_info.c:232] selecting /usr/bin/nvidia-persistenced
I1125 10:30:12.518774 1612 nvc_info.c:232] selecting /usr/bin/nvidia-cuda-mps-control
I1125 10:30:12.518821 1612 nvc_info.c:232] selecting /usr/bin/nvidia-cuda-mps-server
I1125 10:30:12.518875 1612 nvc_info.c:369] listing device /dev/nvidiactl
I1125 10:30:12.518891 1612 nvc_info.c:369] listing device /dev/nvidia-uvm
I1125 10:30:12.518906 1612 nvc_info.c:369] listing device /dev/nvidia-uvm-tools
I1125 10:30:12.518921 1612 nvc_info.c:369] listing device /dev/nvidia-modeset
I1125 10:30:12.518980 1612 nvc_info.c:273] listing ipc /run/nvidia-persistenced/socket
W1125 10:30:12.519016 1612 nvc_info.c:277] missing ipc /tmp/nvidia-mps
I1125 10:30:12.519034 1612 nvc_info.c:493] requesting device information with ''
I1125 10:30:13.037711 1612 nvc_info.c:523] listing device /dev/nvidia0 (GPU-f6725d35-79da-1beb-34cb-e45266743217 at 00000000:00:01.0)
I1125 10:30:13.525357 1612 nvc_info.c:523] listing device /dev/nvidia1 (GPU-f6725d35-79da-1beb-34cb-e45266743217 at 00000000:00:02.0)
I1125 10:30:13.558389 1612 nvc.c:318] shutting down library context
I1125 10:30:13.676495 1615 driver.c:192] W1125 10:30:13.713572 1612 driver.c:220] terminating driver service (forced)
I1125 10:30:14.302469 1612 driver.c:233] driver service terminated with signal 9
NVRM version: 445.00
CUDA version: 11.0

Device Index: 0
Device Minor: 0
Model: Graphics Device
Brand: Tesla
GPU UUID: GPU-f6725d35-79da-1beb-34cb-e45266743217
Bus Location: 00000000:00:01.0
Architecture: 8.0

Device Index: 1
Device Minor: 1
Model: Graphics Device
Brand: Tesla
GPU UUID: GPU-f6725d35-79da-1beb-34cb-e45266743217
Bus Location: 00000000:00:02.0
Architecture: 8.0

By the way, after this, my container run still failed due to the driver timeout.

root@fsf-linux-x64:/mnt/tmp/input# docker run --env CUDA_c702f783=0xa15c1ed9 --gpus 2 nvidia/cuda:9.0-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: timed out\\n\""": unknown.

Since the simulator here is fmodel rather than silicon, I expect slow GPU responses. Does that sound like a plausible cause?

@silencekev
Author

This is the call stack I see.

#0 0x00007f8c575a2bc4 in __GI___poll (fds=fds@entry=0x7fff140b9098, nfds=nfds@entry=1, timeout=timeout@entry=25000)
at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007f8c575e166f in readunix (ctptr=0x55fc5a6c71e0 "\004", buf=0x55fc5a6c8330 "", len=4000) at clnt_unix.c:549
#2 0x00007f8c575dd460 in fill_input_buf (rstrm=0x55fc5a6c7300) at xdr_rec.c:567
#3 get_input_bytes (len=4, addr=0x7fff140b9154 "", rstrm=<optimized out>) at xdr_rec.c:585
#4 set_input_fragment (rstrm=<optimized out>) at xdr_rec.c:603
#5 xdrrec_getbytes (len=4, addr=0x7fff140b9150 "}", xdrs=<optimized out>) at xdr_rec.c:263
#6 xdrrec_getlong (xdrs=<optimized out>, lp=0x7fff140b91a0) at xdr_rec.c:219
#7 0x00007f8c575ea689 in __GI_xdr_u_long (xdrs=xdrs@entry=0x55fc5a6c72a8, ulp=ulp@entry=0x7fff140b9230) at xdr.c:215
#8 0x00007f8c575dbac1 in __GI_xdr_replymsg (xdrs=xdrs@entry=0x55fc5a6c72a8, rmsg=rmsg@entry=0x7fff140b9230) at rpc_prot.c:135
#9 0x00007f8c575e128a in clntunix_call (h=0x55fc5a6c72e0, proc=<optimized out>, xdr_args=0x7f8c56e78890,
args_ptr=0x7fff140b92f8 "\370plZ\374U", xdr_results=0x7f8c56e788b0, results_ptr=0x7fff140b9320 "", timeout=...) at clnt_unix.c:265
#10 0x00007f8c56e79279 in ?? () from /usr/local/lib/libnvidia-container.so.1
#11 0x00007f8c56e6dabb in ?? () from /usr/local/lib/libnvidia-container.so.1
#12 0x00007f8c56e71672 in nvc_init () from /usr/local/lib/libnvidia-container.so.1
#13 0x000055fc5a0a2c51 in ?? ()
#14 0x000055fc5a0a063b in ?? ()
#15 0x00007f8c574afb97 in __libc_start_main (main=0x55fc5a0a05f0, argc=10, argv=0x7fff140ba928, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fff140ba918) at ../csu/libc-start.c:310
#16 0x000055fc5a0a069a in ?? ()

I think the timeout happens in
driver_init_1_svc()
{
        if (call_cuda(ctx, cuInit, 0) < 0)
                goto fail;
}

Is this cuInit call made through RPC? If yes, how can I increase the timeout limit in the poll?
#0 0x00007f8c575a2bc4 in __GI___poll (fds=fds@entry=0x7fff140b9098, nfds=nfds@entry=1, timeout=timeout@entry=25000)

It looks like the relevant code is in setup_rpc_client / CLSET_TIMEOUT, but the timeout value appears to be 10 seconds in the code, so why is it 25000 in the call stack?
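(For reference, this is how a SunRPC per-call timeout is normally adjusted with clnt_control() and CLSET_TIMEOUT. A minimal sketch only; the handle name "clnt" and the 60-second value are placeholders, not the actual libnvidia-container code.)

/* Sketch: raising a SunRPC client's call timeout via clnt_control().
 * "clnt" and the 60-second value are placeholders, not libnvidia-container code. */
#include <rpc/rpc.h>
#include <sys/time.h>

static void raise_rpc_timeout(CLIENT *clnt)
{
        struct timeval tv = { .tv_sec = 60, .tv_usec = 0 };

        /* Once CLSET_TIMEOUT has been used, the timeout argument passed to
         * clnt_call() (and thus clntunix_call()) is ignored in favor of this value. */
        clnt_control(clnt, CLSET_TIMEOUT, (char *)&tv);
}

If the value really is fixed inside the library, changing it would indeed mean rebuilding libnvidia-container with a different timeout.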

@silencekev
Author

It looks like 25000 is passed in when calling clntunix_call(). Is clntunix_call() hardcoded in some dependency library that libnvidia-container.so.1 links against?

@RenaudWasTaken
Contributor

I believe we resolved this through email.

@silencekev
Author

Yes, it's resolved. Thanks for your help.

@thuzhf

thuzhf commented Dec 28, 2021

@silencekev @RenaudWasTaken Could you please share how you resolved this problem? I have encountered the very same problem and don't know how to resolve it. My driver version is 470.57.02 and the GPU is an NVIDIA A100 80GB PCIe.

@willn52

willn52 commented Feb 15, 2022

@thuzhf I recently had this same issue.
In my case the solution involved enabling persistence mode by starting nvidia-persistenced.
If that resolves your issue, add it to systemd (see the sketch below) and you should be good to go.
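
A minimal systemd unit sketch, assuming nvidia-persistenced is installed at /usr/bin/nvidia-persistenced (the driver package usually ships its own nvidia-persistenced.service, which should be preferred if present):

[Unit]
Description=NVIDIA Persistence Daemon

[Service]
Type=forking
ExecStart=/usr/bin/nvidia-persistenced --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target

Save it as /etc/systemd/system/nvidia-persistenced.service and enable it with "systemctl enable --now nvidia-persistenced".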
