x.cuda() not able to access cuda and hangs inside docker #149

@prashkmr

Description

2. Steps to reproduce the issue

  1. sudo docker run --gpus 0 -it --rm nvcr.io/nvidia/pytorch:19.06-py3
  2. Inside the container, start the Python interpreter and run this code:
    import torch
    import numpy as np
    x=torch.from_numpy(np.array([1,2]))
    y=x.cuda()

The terminal hangs at x.cuda() and does not proceed.
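
To tell a true hang apart from a very slow first CUDA call, the reproduction can be run in a child process with a time limit. The following is a minimal sketch using only the standard library; the helper name `run_with_timeout` and the 120-second limit mentioned afterwards are illustrative assumptions, not part of the original report.

```python
import subprocess
import sys

# The reproduction from the steps above, as a single script.
REPRO = """
import numpy as np
import torch
x = torch.from_numpy(np.array([1, 2]))
y = x.cuda()
print(y)
"""


def run_with_timeout(code, timeout_s):
    """Run `code` in a fresh Python process and classify the outcome.

    Returns "ok", "error", or "timeout". Hypothetical helper for triage.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed once this expires
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if proc.returncode == 0 else "error"
```

Inside the container, `run_with_timeout(REPRO, timeout_s=120)` would return "timeout" if the call is still blocked after two minutes, and "error" if CUDA initialization fails outright instead of hanging.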

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --

I1211 16:03:56.895535 862955 nvc.c:376] initializing library context (version=1.11.0, build=c8f267be0bac1c654d59ad4ea5df907141149977)
I1211 16:03:56.895630 862955 nvc.c:350] using root /
I1211 16:03:56.895650 862955 nvc.c:351] using ldcache /etc/ld.so.cache
I1211 16:03:56.895668 862955 nvc.c:352] using unprivileged user 1013:1014
I1211 16:03:56.895712 862955 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1211 16:03:56.895983 862955 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W1211 16:03:56.898601 862956 nvc.c:273] failed to set inheritable capabilities
W1211 16:03:56.898686 862956 nvc.c:274] skipping kernel modules load due to failure
I1211 16:03:56.899197 862957 rpc.c:71] starting driver rpc service
I1211 16:03:56.914387 862958 rpc.c:71] starting nvcgo rpc service
I1211 16:03:56.916258 862955 nvc_info.c:766] requesting driver information with ''
I1211 16:03:56.918835 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.525.60.13
I1211 16:03:56.918925 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.525.60.13
I1211 16:03:56.918982 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.525.60.13
I1211 16:03:56.919039 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.525.60.13
I1211 16:03:56.919135 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.525.60.13
I1211 16:03:56.919260 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.525.60.13
I1211 16:03:56.919321 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.525.60.13
I1211 16:03:56.919375 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.525.60.13
I1211 16:03:56.919451 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.525.60.13
I1211 16:03:56.919502 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.525.60.13
I1211 16:03:56.919551 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.525.60.13
I1211 16:03:56.919603 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.525.60.13
I1211 16:03:56.919716 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.525.60.13
I1211 16:03:56.919816 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.525.60.13
I1211 16:03:56.919887 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.525.60.13
I1211 16:03:56.919959 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.525.60.13
I1211 16:03:56.920067 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.525.60.13
I1211 16:03:56.920176 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.525.60.13
I1211 16:03:56.920761 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcudadebugger.so.525.60.13
I1211 16:03:56.920835 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.525.60.13
I1211 16:03:56.921120 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.525.60.13
I1211 16:03:56.921206 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.525.60.13
I1211 16:03:56.921283 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.525.60.13
I1211 16:03:56.921369 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.525.60.13
I1211 16:03:56.921484 862955 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.525.60.13
I1211 16:03:56.921589 862955 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.525.60.13
I1211 16:03:56.921696 862955 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.525.60.13
I1211 16:03:56.921769 862955 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.525.60.13
I1211 16:03:56.921873 862955 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.525.60.13
I1211 16:03:56.921978 862955 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.525.60.13
I1211 16:03:56.922051 862955 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.525.60.13
I1211 16:03:56.922209 862955 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libcuda.so.525.60.13
W1211 16:03:56.922309 862955 nvc_info.c:399] missing library libnvidia-nscq.so
W1211 16:03:56.922323 862955 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W1211 16:03:56.922338 862955 nvc_info.c:399] missing library libnvidia-pkcs11.so
W1211 16:03:56.922353 862955 nvc_info.c:399] missing library libvdpau_nvidia.so
W1211 16:03:56.922368 862955 nvc_info.c:399] missing library libnvidia-ifr.so
W1211 16:03:56.922382 862955 nvc_info.c:399] missing library libnvidia-cbl.so
W1211 16:03:56.922397 862955 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W1211 16:03:56.922412 862955 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W1211 16:03:56.922427 862955 nvc_info.c:403] missing compat32 library libcudadebugger.so
W1211 16:03:56.922441 862955 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W1211 16:03:56.922454 862955 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W1211 16:03:56.922468 862955 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W1211 16:03:56.922483 862955 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W1211 16:03:56.922495 862955 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W1211 16:03:56.922507 862955 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so
W1211 16:03:56.922522 862955 nvc_info.c:403] missing compat32 library libnvidia-glcore.so
W1211 16:03:56.922534 862955 nvc_info.c:403] missing compat32 library libnvidia-tls.so
W1211 16:03:56.922547 862955 nvc_info.c:403] missing compat32 library libnvidia-glsi.so
W1211 16:03:56.922562 862955 nvc_info.c:403] missing compat32 library libnvidia-fbc.so
W1211 16:03:56.922576 862955 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W1211 16:03:56.922590 862955 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W1211 16:03:56.922605 862955 nvc_info.c:403] missing compat32 library libnvoptix.so
W1211 16:03:56.922620 862955 nvc_info.c:403] missing compat32 library libGLX_nvidia.so
W1211 16:03:56.922635 862955 nvc_info.c:403] missing compat32 library libEGL_nvidia.so
W1211 16:03:56.922649 862955 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so
W1211 16:03:56.922663 862955 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so
W1211 16:03:56.922676 862955 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so
W1211 16:03:56.922691 862955 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I1211 16:03:56.924411 862955 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I1211 16:03:56.924457 862955 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I1211 16:03:56.924495 862955 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I1211 16:03:56.924559 862955 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I1211 16:03:56.924600 862955 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W1211 16:03:56.924729 862955 nvc_info.c:425] missing binary nv-fabricmanager
W1211 16:03:56.924793 862955 nvc_info.c:349] missing firmware path /lib/firmware/nvidia/525.60.13/gsp.bin
I1211 16:03:56.924846 862955 nvc_info.c:529] listing device /dev/nvidiactl
I1211 16:03:56.924862 862955 nvc_info.c:529] listing device /dev/nvidia-uvm
I1211 16:03:56.924877 862955 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I1211 16:03:56.924889 862955 nvc_info.c:529] listing device /dev/nvidia-modeset
I1211 16:03:56.924944 862955 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket
W1211 16:03:56.924993 862955 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W1211 16:03:56.925026 862955 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I1211 16:03:56.925042 862955 nvc_info.c:822] requesting device information with ''
I1211 16:03:56.931600 862955 nvc_info.c:713] listing device /dev/nvidia0 (GPU-15132fc8-6058-2032-32e5-8d742940df99 at 00000000:03:00.0)
NVRM version: 525.60.13
CUDA version: 12.0

Device Index: 0
Device Minor: 0
Model: NVIDIA GeForce RTX 3090
Brand: GeForce
GPU UUID: GPU-15132fc8-6058-2032-32e5-8d742940df99
Bus Location: 00000000:03:00.0
Architecture: 8.6
I1211 16:03:56.931670 862955 nvc.c:434] shutting down library context
I1211 16:03:56.931753 862958 rpc.c:95] terminating nvcgo rpc service
I1211 16:03:56.932462 862955 rpc.c:135] nvcgo rpc service terminated successfully
I1211 16:03:56.935845 862957 rpc.c:95] terminating driver rpc service
I1211 16:03:56.936030 862955 rpc.c:135] driver rpc service terminated successfully
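
The device block in this log reports `Architecture: 8.6`, and the driver advertises CUDA 12.0. One mismatch worth ruling out is a CUDA toolkit inside the image that predates the first release able to target this architecture. The sketch below parses the `Key: value` lines and looks up a minimum toolkit version. The table `MIN_CUDA_FOR_ARCH` is a small hand-written assumption that covers only a few architectures, the function names are hypothetical, and the toolkit version passed in must come from inside the image (for example from `torch.version.cuda`); it is not stated in the report.

```python
import re

# Minimum CUDA toolkit release that can target each compute capability.
# Hand-written and deliberately incomplete (an assumption of this sketch).
MIN_CUDA_FOR_ARCH = {
    "7.0": (9, 0),
    "7.5": (10, 0),
    "8.0": (11, 0),
    "8.6": (11, 1),
}

# Matches "Key: value" lines; glog lines start with digits in the key
# position ("I1211 16:03:...") and therefore do not match.
_KEY_VALUE = re.compile(r"^([A-Za-z ]+):\s*(\S.*)$")


def parse_device_info(text):
    """Collect `Key: value` lines from `nvidia-container-cli info` output."""
    info = {}
    for line in text.splitlines():
        match = _KEY_VALUE.match(line)
        if match:
            info[match.group(1).strip()] = match.group(2).strip()
    return info


def toolkit_supports(arch, toolkit_version):
    """Return True/False for known architectures, None if not in the table."""
    minimum = MIN_CUDA_FOR_ARCH.get(arch)
    if minimum is None:
        return None
    parts = toolkit_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor) >= minimum


# Excerpt of the device block from the log above.
SAMPLE = """\
I1211 16:03:56.931600 862955 nvc_info.c:713] listing device /dev/nvidia0
NVRM version: 525.60.13
CUDA version: 12.0

Device Index: 0
Model: NVIDIA GeForce RTX 3090
Bus Location: 00000000:03:00.0
Architecture: 8.6
"""

info = parse_device_info(SAMPLE)
```

With `info["Architecture"]` in hand, `toolkit_supports(info["Architecture"], version)` indicates whether the toolkit version found in the image is new enough for this GPU.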

  • Kernel version from uname -a
    Linux visionlab 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

  • Any relevant kernel output lines from dmesg

  • Driver information from nvidia-smi -a
    ==============NVSMI LOG==============

Timestamp : Sun Dec 11 21:35:37 2022
Driver Version : 525.60.13
CUDA Version : 12.0

Attached GPUs : 1
GPU 00000000:03:00.0
Product Name : NVIDIA GeForce RTX 3090
Product Brand : GeForce
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-15132fc8-6058-2032-32e5-8d742940df99
Minor Number : 0
VBIOS Version : 94.02.42.40.72
MultiGPU Board : No
Board ID : 0x300
Board Part Number : N/A
GPU Part Number : 2204-300-A1
Module ID : 0
Inforom Version
Image Version : G001.0000.03.03
OEM Object : 2.0
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x03
Device : 0x00
Domain : 0x0000
Device Id : 0x220410DE
Bus Id : 00000000:03:00.0
Sub System Id : 0x145410DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Device Current : 1
Device Max : 4
Host Max : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 24576 MiB
Reserved : 317 MiB
Used : 553 MiB
Free : 23705 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 7 MiB
Free : 249 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 51 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 93 C
GPU Target Temperature : 83 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 23.19 W
Power Limit : 370.00 W
Default Power Limit : 370.00 W
Enforced Power Limit : 370.00 W
Min Power Limit : 100.00 W
Max Power Limit : 370.00 W
Clocks
Graphics : 0 MHz
SM : 0 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 2115 MHz
SM : 2115 MHz
Memory : 9751 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 0.000 mV
Fabric
State : N/A
Status : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 2052
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 55 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 2252
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 12 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 79713
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 53 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 105354
Type : G
Name : /usr/lib/firefox/firefox
Used GPU Memory : 296 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 150301
Type : G
Name : /usr/share/code/code --type=gpu-process --disable-color-correct-rendering --enable-crashpad --crashpad-handler-pid=150289 --enable-crash-reporter=220d09c7-ace4-4d8a-a019-f5551f55bb4e,no_channel --user-data-dir=/home/vishesh/.config/Code --gpu-preferences=UAAAAAAAAAAgAAAIAAAAAAAAAAAAAAAAAABgAAAAAAAwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAQAAABgAAAAAAAAAGAAAAAAAAAAIAAAAAAAAAAgAAAAAAAAACAAAAAAAAAA= --shared-files --field-trial-handle=0,5632948566441798013,1364472480275554198,131072 --disable-features=PlzServiceWorker,SpareRendererForSitePerProcess
Used GPU Memory : 27 MiB
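
The `nvidia-smi -a` dump above is long, and only a handful of fields matter for a first pass. A small sketch that pulls selected `Key : value` fields out of such a dump follows; the helper name `pick_fields` and the choice of fields are assumptions for illustration. Keys such as `Current` repeat across sections, so only the first occurrence of each requested key is kept.

```python
def pick_fields(dump, wanted):
    """Return the first value seen for each key in `wanted`."""
    found = {}
    for line in dump.splitlines():
        # Split on the first " : " so values containing ":" stay intact.
        key, sep, value = line.partition(" : ")
        key = key.strip()
        if sep and key in wanted and key not in found:
            found[key] = value.strip()
    return found


# A few lines copied from the dump above.
SMI_SAMPLE = """\
Timestamp : Sun Dec 11 21:35:37 2022
Driver Version : 525.60.13
CUDA Version : 12.0
Product Name : NVIDIA GeForce RTX 3090
Product Architecture : Ampere
Persistence Mode : Enabled
"""

summary = pick_fields(
    SMI_SAMPLE,
    {"Driver Version", "CUDA Version", "Product Name", "Product Architecture"},
)
```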

  • Docker version from docker version
    Client: Docker Engine - Community
     Version: 20.10.21
     API version: 1.41
     Go version: go1.18.7
     Git commit: baeda1f
     Built: Tue Oct 25 18:02:21 2022
     OS/Arch: linux/amd64
     Context: default
     Experimental: true

    Server: Docker Engine - Community
     Engine:
      Version: 20.10.21
      API version: 1.41 (minimum version 1.12)
      Go version: go1.18.7
      Git commit: 3056208
      Built: Tue Oct 25 18:00:04 2022
      OS/Arch: linux/amd64
      Experimental: false
     containerd:
      Version: 1.6.12
      GitCommit: a05d175400b1145e5e6a735a6710579d181e7fb0
     runc:
      Version: 1.1.4
      GitCommit: v1.1.4-0-g5fd4c4d
     docker-init:
      Version: 0.19.0
      GitCommit: de40ad0

  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'

  • NVIDIA container library version from nvidia-container-cli -V
    cli-version: 1.11.0
    lib-version: 1.11.0
    build date: 2022-09-06T09:21+00:00
    build revision: c8f267be0bac1c654d59ad4ea5df907141149977
    build compiler: x86_64-linux-gnu-gcc-7 7.5.0
    build platform: x86_64
    build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

  • NVIDIA container library logs (see troubleshooting)

  • Docker command, image and tag used
    Same as step 1 of the steps to reproduce:
    sudo docker run --gpus 0 -it --rm nvcr.io/nvidia/pytorch:19.06-py3
