nvidia-container-cli.real: initialization error: change root failed: no such file or directory: unknown on Flatcar OS #134

Roshnichauhan1 opened this issue Sep 12, 2023 · 0 comments


1. Issue or feature description

Hello,
I am running a GCP VM with Flatcar Linux as the base OS; the VM has a Tesla T4 GPU. Any Docker container I try to run is unable to use the GPU.

2. Steps to reproduce the issue

Please note that Flatcar has an immutable filesystem. The base OS is pulled from the official Flatcar alpha release.
To install the NVIDIA container runtime, I used the command below.
docker run --rm --privileged \
  -v "/etc/docker:/etc/docker" \
  -v "/run/nvidia:/run/nvidia" \
  -v "/run/docker.sock:/run/docker.sock" \
  -v "/opt/nvidia-runtime:/opt/nvidia-runtime" \
  -e "RUNTIME=docker" \
  -e "RUNTIME_ARGS=--socket /run/docker.sock" \
  -e "DOCKER_SOCKET=/run/docker.sock" \
  nvcr.io/nvidia/k8s/container-toolkit:v1.14.1-ubuntu20.04 \
  "/opt/nvidia-runtime"

Restart Docker:
sudo systemctl restart docker

Then check that the runtime is registered:
docker info | grep -i nvidia

Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia nvidia-cdi nvidia-legacy runc
Default Runtime: nvidia

which nvidia-container-cli
/opt/nvidia-runtime/toolkit/nvidia-container-cli
The NVIDIA runtime is installed in a custom path, which I have also added to $PATH.
The Docker daemon configuration has the exact path to the runtime set.

3. Information to attach (optional if deemed irrelevant)

  • nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0912 13:31:23.410040 3345 nvc.c:376] initializing library context (version=1.14.1, build=1eb5a30a6ad0415550a9df632ac8832bf7e2bbba)
I0912 13:31:23.410076 3345 nvc.c:350] using root /
I0912 13:31:23.410087 3345 nvc.c:351] using ldcache /etc/ld.so.cache
I0912 13:31:23.410097 3345 nvc.c:352] using unprivileged user 500:500
I0912 13:31:23.410132 3345 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0912 13:31:23.410268 3345 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0912 13:31:23.411705 3346 nvc.c:273] failed to set inheritable capabilities
W0912 13:31:23.411757 3346 nvc.c:274] skipping kernel modules load due to failure
I0912 13:31:23.412126 3347 rpc.c:71] starting driver rpc service
I0912 13:31:23.986322 3356 rpc.c:71] starting nvcgo rpc service
I0912 13:31:24.113234 3345 nvc_info.c:798] requesting driver information with ''
I0912 13:31:24.126048 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libvdpau_nvidia.so.525.105.17
I0912 13:31:24.151152 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvoptix.so.525.105.17
I0912 13:31:24.163793 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvidia-tls.so.525.105.17
I0912 13:31:24.188779 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvidia-rtcore.so.525.105.17
I0912 13:31:24.194283 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvidia-ptxjitcompiler.so.525.105.17
I0912 13:31:24.203207 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvidia-opticalflow.so.525.105.17
I0912 13:31:24.231203 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvidia-opencl.so.525.105.17
I0912 13:31:24.251847 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvidia-nvvm.so.525.105.17
I0912 13:31:24.270706 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvidia-ngx.so.525.105.17
I0912 13:31:24.270832 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvidia-ml.so.525.105.17
I0912 13:31:24.287942 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvidia-glvkspirv.so.525.105.17
I0912 13:31:24.308553 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvidia-glsi.so.525.105.17
I0912 13:31:24.321221 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvidia-glcore.so.525.105.17
I0912 13:31:24.336188 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvidia-fbc.so.525.105.17
I0912 13:31:24.344048 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvidia-encode.so.525.105.17
I0912 13:31:24.363013 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvidia-eglcore.so.525.105.17
I0912 13:31:24.381439 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvidia-compiler.so.525.105.17
I0912 13:31:24.396212 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvidia-cfg.so.525.105.17
I0912 13:31:24.425090 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvidia-allocator.so.525.105.17
I0912 13:31:24.449588 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libnvcuvid.so.525.105.17
I0912 13:31:24.457108 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libcudadebugger.so.525.105.17
I0912 13:31:24.457234 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libcuda.so.525.105.17
I0912 13:31:24.482342 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libGLX_nvidia.so.525.105.17
I0912 13:31:24.493154 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libGLESv2_nvidia.so.525.105.17
I0912 13:31:24.493876 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libGLESv1_CM_nvidia.so.525.105.17
I0912 13:31:24.505424 3345 nvc_info.c:176] selecting /opt/nvidia/525.105.17/6.1.50-flatcar/usr/lib64/libEGL_nvidia.so.525.105.17
W0912 13:31:24.505513 3345 nvc_info.c:402] missing library libnvidia-nscq.so
W0912 13:31:24.505532 3345 nvc_info.c:402] missing library libnvidia-gpucomp.so
W0912 13:31:24.505568 3345 nvc_info.c:402] missing library libnvidia-fatbinaryloader.so
W0912 13:31:24.505582 3345 nvc_info.c:402] missing library libnvidia-pkcs11.so
W0912 13:31:24.505593 3345 nvc_info.c:402] missing library libnvidia-pkcs11-openssl3.so
W0912 13:31:24.505607 3345 nvc_info.c:402] missing library libnvidia-ifr.so
W0912 13:31:24.505617 3345 nvc_info.c:402] missing library libnvidia-cbl.so
W0912 13:31:24.505628 3345 nvc_info.c:406] missing compat32 library libnvidia-ml.so
W0912 13:31:24.505642 3345 nvc_info.c:406] missing compat32 library libnvidia-cfg.so
W0912 13:31:24.505657 3345 nvc_info.c:406] missing compat32 library libnvidia-nscq.so
W0912 13:31:24.505669 3345 nvc_info.c:406] missing compat32 library libcuda.so
W0912 13:31:24.505680 3345 nvc_info.c:406] missing compat32 library libcudadebugger.so
W0912 13:31:24.505692 3345 nvc_info.c:406] missing compat32 library libnvidia-opencl.so
W0912 13:31:24.505711 3345 nvc_info.c:406] missing compat32 library libnvidia-gpucomp.so
W0912 13:31:24.505722 3345 nvc_info.c:406] missing compat32 library libnvidia-ptxjitcompiler.so
W0912 13:31:24.505734 3345 nvc_info.c:406] missing compat32 library libnvidia-fatbinaryloader.so
W0912 13:31:24.505750 3345 nvc_info.c:406] missing compat32 library libnvidia-allocator.so
W0912 13:31:24.505759 3345 nvc_info.c:406] missing compat32 library libnvidia-compiler.so
W0912 13:31:24.505767 3345 nvc_info.c:406] missing compat32 library libnvidia-pkcs11.so
W0912 13:31:24.505774 3345 nvc_info.c:406] missing compat32 library libnvidia-pkcs11-openssl3.so
W0912 13:31:24.505781 3345 nvc_info.c:406] missing compat32 library libnvidia-nvvm.so
W0912 13:31:24.505789 3345 nvc_info.c:406] missing compat32 library libnvidia-ngx.so
W0912 13:31:24.505796 3345 nvc_info.c:406] missing compat32 library libvdpau_nvidia.so
W0912 13:31:24.505803 3345 nvc_info.c:406] missing compat32 library libnvidia-encode.so
W0912 13:31:24.505810 3345 nvc_info.c:406] missing compat32 library libnvidia-opticalflow.so
W0912 13:31:24.505825 3345 nvc_info.c:406] missing compat32 library libnvcuvid.so
W0912 13:31:24.505839 3345 nvc_info.c:406] missing compat32 library libnvidia-eglcore.so
W0912 13:31:24.505850 3345 nvc_info.c:406] missing compat32 library libnvidia-glcore.so
W0912 13:31:24.505861 3345 nvc_info.c:406] missing compat32 library libnvidia-tls.so
W0912 13:31:24.505871 3345 nvc_info.c:406] missing compat32 library libnvidia-glsi.so
W0912 13:31:24.505883 3345 nvc_info.c:406] missing compat32 library libnvidia-fbc.so
W0912 13:31:24.505894 3345 nvc_info.c:406] missing compat32 library libnvidia-ifr.so
W0912 13:31:24.505910 3345 nvc_info.c:406] missing compat32 library libnvidia-rtcore.so
W0912 13:31:24.505925 3345 nvc_info.c:406] missing compat32 library libnvoptix.so
W0912 13:31:24.505937 3345 nvc_info.c:406] missing compat32 library libGLX_nvidia.so
W0912 13:31:24.505953 3345 nvc_info.c:406] missing compat32 library libEGL_nvidia.so
W0912 13:31:24.505968 3345 nvc_info.c:406] missing compat32 library libGLESv2_nvidia.so
W0912 13:31:24.505979 3345 nvc_info.c:406] missing compat32 library libGLESv1_CM_nvidia.so
W0912 13:31:24.505990 3345 nvc_info.c:406] missing compat32 library libnvidia-glvkspirv.so
W0912 13:31:24.506000 3345 nvc_info.c:406] missing compat32 library libnvidia-cbl.so
I0912 13:31:24.506240 3345 nvc_info.c:302] selecting /opt/bin/nvidia-smi
I0912 13:31:24.506309 3345 nvc_info.c:302] selecting /opt/bin/nvidia-debugdump
I0912 13:31:24.506352 3345 nvc_info.c:302] selecting /opt/bin/nvidia-persistenced
I0912 13:31:24.506423 3345 nvc_info.c:302] selecting /opt/bin/nvidia-cuda-mps-control
I0912 13:31:24.506465 3345 nvc_info.c:302] selecting /opt/bin/nvidia-cuda-mps-server
W0912 13:31:24.506708 3345 nvc_info.c:428] missing binary nv-fabricmanager
W0912 13:31:24.506758 3345 nvc_info.c:471] missing firmware path /usr/lib/firmware/nvidia/525.105.17/gsp*.bin
I0912 13:31:24.506805 3345 nvc_info.c:561] listing device /dev/nvidiactl
I0912 13:31:24.506815 3345 nvc_info.c:561] listing device /dev/nvidia-uvm
I0912 13:31:24.506821 3345 nvc_info.c:561] listing device /dev/nvidia-uvm-tools
I0912 13:31:24.506831 3345 nvc_info.c:561] listing device /dev/nvidia-modeset
W0912 13:31:24.506870 3345 nvc_info.c:352] missing ipc path /var/run/nvidia-persistenced/socket
W0912 13:31:24.506907 3345 nvc_info.c:352] missing ipc path /var/run/nvidia-fabricmanager/socket
W0912 13:31:24.506938 3345 nvc_info.c:352] missing ipc path /tmp/nvidia-mps
I0912 13:31:24.506949 3345 nvc_info.c:854] requesting device information with ''
I0912 13:31:24.513376 3345 nvc_info.c:745] listing device /dev/nvidia0 (GPU-15ee4e6f-c7f7-34f9-b76d-90d920d3b8a9 at 00000000:00:04.0)
NVRM version: 525.105.17
CUDA version: 12.0

Device Index: 0
Device Minor: 0
Model: Tesla T4
Brand: Nvidia
GPU UUID: GPU-15ee4e6f-c7f7-34f9-b76d-90d920d3b8a9
Bus Location: 00000000:00:04.0
Architecture: 7.5
I0912 13:31:24.513449 3345 nvc.c:434] shutting down library context
I0912 13:31:24.513580 3356 rpc.c:95] terminating nvcgo rpc service
I0912 13:31:24.514246 3345 rpc.c:135] nvcgo rpc service terminated successfully
I0912 13:31:24.612794 3347 rpc.c:95] terminating driver rpc service
I0912 13:31:24.613062 3345 rpc.c:135] driver rpc service terminated successfully

  • Kernel version from uname -a
    Linux 6.1.50-flatcar #1 SMP PREEMPT_DYNAMIC Fri Sep 1 20:08:33 -00 2023 x86_64 Intel(R) Xeon(R) CPU @ 2.20GHz GenuineIntel GNU/Linux

  • dmesg | grep -iE 'secure|nvidia'
    [ 0.000368] Secure boot disabled
    [ 10.291502] systemd[1]: /usr/lib/systemd/system/nvidia.service:9: Unknown key name 'RemainsAfterExit' in section 'Service', ignoring.
    [ 19.040786] nvidia: loading out-of-tree module taints kernel.
    [ 19.046691] nvidia: module license 'NVIDIA' taints kernel.
    [ 19.081282] nvidia: module verification failed: signature and/or required key missing - tainting kernel
    [ 19.192681] nvidia-nvlink: Nvlink Core is being initialized, major device number 246
    [ 19.219859] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 525.105.17 Tue Mar 28 18:02:59 UTC 2023
    [ 19.315344] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 525.105.17 Tue Mar 28 22:18:37 UTC 2023
    [ 21.894900] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
    [ 21.943546] nvidia-uvm: Loaded the UVM driver, major device number 244.
    [ 22.652528] nvidia 0000:00:04.0: Direct firmware load for nvidia/525.105.17/gsp_tu10x.bin failed with error -2
    [ 614.869434] nvidia 0000:00:04.0: Direct firmware load for nvidia/525.105.17/gsp_tu10x.bin failed with error -2
    [ 4173.236622] nvidia 0000:00:04.0: Direct firmware load for nvidia/525.105.17/gsp_tu10x.bin failed with error -2
    [ 4476.758972] nvidia 0000:00:04.0: Direct firmware load for nvidia/525.105.17/gsp_tu10x.bin failed with error -2
    [ 4691.022984] nvidia 0000:00:04.0: Direct firmware load for nvidia/525.105.17/gsp_tu10x.bin failed with error -2

  • nvidia-smi -a

==============NVSMI LOG==============

Timestamp : Tue Sep 12 13:33:53 2023
Driver Version : 525.105.17
CUDA Version : 12.0

Attached GPUs : 1
GPU 00000000:00:04.0
Product Name : Tesla T4
Product Brand : NVIDIA
Product Architecture : Turing
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1560120004432
GPU UUID : GPU-15ee4e6f-c7f7-34f9-b76d-90d920d3b8a9
Minor Number : 0
VBIOS Version : 90.04.A7.00.01
MultiGPU Board : No
Board ID : 0x4
Board Part Number : 900-2G183-6300-T00
GPU Part Number : 1EB8-895-A1
Module ID : 1
Inforom Version
Image Version : G183.0200.00.02
OEM Object : 1.1
ECC Object : 5.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Pass-Through
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x00
Device : 0x04
Domain : 0x0000
Device Id : 0x1EB810DE
Bus Id : 00000000:00:04.0
Sub System Id : 0x12A210DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Device Current : 3
Device Max : 3
Host Max : N/A
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 15360 MiB
Reserved : 258 MiB
Used : 0 MiB
Free : 15101 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 2 MiB
Free : 254 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending Page Blacklist : No
Remapped Rows : N/A
Temperature
GPU Current Temp : 57 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 96 C
GPU Slowdown Temp : 93 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 27.62 W
Power Limit : 70.00 W
Default Power Limit : 70.00 W
Enforced Power Limit : 70.00 W
Min Power Limit : 60.00 W
Max Power Limit : 70.00 W
Clocks
Graphics : 585 MHz
SM : 585 MHz
Memory : 5000 MHz
Video : 780 MHz
Applications Clocks
Graphics : 585 MHz
Memory : 5001 MHz
Default Applications Clocks
Graphics : 585 MHz
Memory : 5001 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1590 MHz
SM : 1590 MHz
Memory : 5001 MHz
Video : 1470 MHz
Max Customer Boost Clocks
Graphics : 1590 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Fabric
State : N/A
Status : N/A
Processes : None

  • Docker version from docker version

  • Docker version 20.10.24, build e78084afe5
    docker-compose version 1.28.4, build cabd5cfb

  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'

  • NVIDIA container library version from nvidia-container-cli -V

  • nvidia-container-cli -V
    cli-version: 1.14.1
    lib-version: 1.14.1
    build date: 2023-09-07T16:05+00:00
    build revision: 1eb5a30a6ad0415550a9df632ac8832bf7e2bbba
    build compiler: x86_64-linux-gnu-gcc-7 7.5.0
    build platform: x86_64
    build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

  • NVIDIA container library logs (see troubleshooting)

  • Docker command, image and tag used

  • docker run --runtime=nvidia nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi
    docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: change root failed: no such file or directory: unknown.
    ERRO[0001] error waiting for container: context canceled

I've tried other images (e.g. ubuntu) as well, with the same error. Using --gpus all in place of --runtime=nvidia does not work either:
docker run --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled

I've read several articles reporting the same error, but none of the suggested fixes worked for me.
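
One thing that may be worth checking: the "change root failed: no such file or directory" message indicates that nvidia-container-cli tried to chroot into a directory that does not exist. The manual run above logs "using root /", so the hook may be picking up a different root from the toolkit's config file. The sketch below reads a `root = "..."` line from a config.toml and reports whether that directory exists. The config path is an assumption about where the toolkit container wrote its files, and the helper names `configured_root` and `check_root` are made up for illustration; this is a diagnostic aid, not a confirmed fix.

```shell
#!/bin/sh
# Print the value of the first uncommented `root = "..."` line in a TOML file.
# Only handles the simple `key = "value"` form; this is not a full TOML parser.
configured_root() {
    sed -n 's/^[[:space:]]*root[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' "$1" | head -n 1
}

# Report whether the configured root directory exists on this host.
# Returns 1 only when a root is configured and the directory is missing.
check_root() {
    root=$(configured_root "$1")
    if [ -z "$root" ]; then
        echo "no root configured in $1"
    elif [ -d "$root" ]; then
        echo "root $root exists"
    else
        echo "root $root is missing"
        return 1
    fi
}

# ASSUMPTION: hypothetical config location under the custom install prefix;
# override with CONFIG=/path/to/config.toml if the file lives elsewhere.
CONFIG="${CONFIG:-/opt/nvidia-runtime/toolkit/.config/nvidia-container-runtime/config.toml}"
if [ -f "$CONFIG" ]; then
    check_root "$CONFIG"
fi
```

If this reports a missing root, that would be consistent with the hook failing at chroot while the manual `nvidia-container-cli info` run against `/` succeeds.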

@elezar elezar transferred this issue from NVIDIA/nvidia-docker Oct 30, 2023