@elezar (Member) commented Apr 29, 2025

This change adds a config option that controls the --cuda-compat-mode flag added to the nvidia-container-cli in NVIDIA/libnvidia-container#307.

The nvidia-container-runtime.modes.legacy.cuda-compat-mode option controls the behaviour of both the nvidia-container-runtime-hook and the nvidia-container-runtime. The main motivation is that the hook-based CUDA Forward Compat support added as part of the v1.17.5 release requires the nvidia-container-runtime, and as such does not work when only the Docker --gpus flag is specified.
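As a sketch only (the exact file location, typically /etc/nvidia-container-runtime/config.toml, may vary by installation), the new option would be set along these lines:

```toml
# Sketch: controls how CUDA Forward Compat libraries are made discoverable
# in the container; valid values are "ldconfig", "mount", "hook", "disabled".
[nvidia-container-runtime.modes.legacy]
cuda-compat-mode = "ldconfig"
```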

Testing:

$ cat check-compat.sh
#!/bin/bash
set -x
# Which libcuda entries does the dynamic loader resolve?
ldconfig -p | grep libcuda
# Which libcuda files are present in the container filesystem?
ls -al /usr/lib/x86_64-linux-gnu/libcuda.so*
# Which libcuda files are mounted in from the host or image?
mount | grep libcuda

cuda-compat-mode unset (upgrades)

--gpus flag

$ docker run --rm -ti --gpus=all -e NVIDIA_DISABLE_REQUIRE=1 --runtime=runc -v $(pwd):/work -w /work nvidia/cuda:12.8.1-base-ubuntu20.04 bash -c "./check-compat.sh"
+ ldconfig -p
+ grep libcuda
        libcudart.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
        libcudadebugger.so.1 (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcudadebugger.so.1
        libcuda.so.1 (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcuda.so.1
        libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
        libcuda.so (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcuda.so
+ ls -al /usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
lrwxrwxrwx 1 root root       12 May 13 13:15 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       21 May 13 13:15 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.515.105.01
-rw-r--r-- 1 root root 20988032 Feb 27  2023 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
+ mount
+ grep libcuda
/dev/sda2 on /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01 type ext4 (ro,nosuid,nodev,relatime,nodelalloc,errors=remount-ro,stripe=64)

nvidia runtime

$ docker run --rm -ti --gpus=all -e NVIDIA_DISABLE_REQUIRE=1 --runtime=nvidia -v $(pwd):/work -w /work nvidia/cuda:12.8.1-base-ubuntu20.04 bash -c "./check-compat.sh"
+ ldconfig -p
+ grep libcuda
        libcudart.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
        libcudadebugger.so.1 (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcudadebugger.so.1
        libcuda.so.1 (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcuda.so.1
        libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
        libcuda.so (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcuda.so
+ ls -al /usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
lrwxrwxrwx 1 root root       12 May 13 13:15 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       21 May 13 13:15 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.515.105.01
-rw-r--r-- 1 root root 20988032 Feb 27  2023 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
+ mount
+ grep libcuda
/dev/sda2 on /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01 type ext4 (ro,nosuid,nodev,relatime,nodelalloc,errors=remount-ro,stripe=64)

cuda-compat-mode=ldconfig

$ nvidia-ctk config | grep cuda-compat-mode
cuda-compat-mode = "ldconfig"

--gpus flag

$ docker run --rm -ti --gpus=all -e NVIDIA_DISABLE_REQUIRE=1 --runtime=runc -v $(pwd):/work -w /work nvidia/cuda:12.8.1-base-ubuntu20.04 bash -c "./check-compat.sh"
+ ldconfig -p
+ grep libcuda
        libcudart.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
        libcudadebugger.so.1 (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcudadebugger.so.1
        libcuda.so.1 (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcuda.so.1
        libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
        libcuda.so (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcuda.so
+ ls -al /usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
lrwxrwxrwx 1 root root       12 May 13 13:16 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       21 May 13 13:16 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.515.105.01
-rw-r--r-- 1 root root 20988032 Feb 27  2023 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
+ mount
+ grep libcuda
/dev/sda2 on /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01 type ext4 (ro,nosuid,nodev,relatime,nodelalloc,errors=remount-ro,stripe=64)

nvidia runtime

$ docker run --rm -ti --gpus=all -e NVIDIA_DISABLE_REQUIRE=1 --runtime=nvidia -v $(pwd):/work -w /work nvidia/cuda:12.8.1-base-ubuntu20.04 bash -c "./check-compat.sh"
+ ldconfig -p
+ grep libcuda
        libcudart.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
        libcudadebugger.so.1 (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcudadebugger.so.1
        libcuda.so.1 (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcuda.so.1
        libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
        libcuda.so (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcuda.so
+ ls -al /usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
lrwxrwxrwx 1 root root       12 May 13 13:16 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       21 May 13 13:16 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.515.105.01
-rw-r--r-- 1 root root 20988032 Feb 27  2023 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
+ mount
+ grep libcuda
/dev/sda2 on /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01 type ext4 (ro,nosuid,nodev,relatime,nodelalloc,errors=remount-ro,stripe=64)

cuda-compat-mode=mount

$ nvidia-ctk config | grep cuda-compat-mode
cuda-compat-mode = "mount"

--gpus flag

$ docker run --rm -ti --gpus=all -e NVIDIA_DISABLE_REQUIRE=1 --runtime=runc -v $(pwd):/work -w /work nvidia/cuda:12.8.1-base-ubuntu20.04 bash -c "./check-compat.sh"
+ ldconfig -p
+ grep libcuda
        libcudart.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
        libcudadebugger.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1
        libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
+ ls -al /usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01 /usr/lib/x86_64-linux-gnu/libcuda.so.570.124.06
lrwxrwxrwx 1 root root       12 May 13 13:11 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       21 May 13 13:11 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.570.124.06
-rw-r--r-- 1 root root 20988032 Feb 27  2023 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
-rw-r--r-- 1 root root 71373816 Feb 26 02:04 /usr/lib/x86_64-linux-gnu/libcuda.so.570.124.06
+ mount
+ grep libcuda
/dev/sda2 on /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01 type ext4 (ro,nosuid,nodev,relatime,nodelalloc,errors=remount-ro,stripe=64)
overlay on /usr/lib/x86_64-linux-gnu/libcuda.so.570.124.06 type overlay (ro,nosuid,nodev,relatime,lowerdir=/var/lib/docker/overlay2/l/YNAMYYCHXHBXE5PZBTHKRLESOU:/var/lib/docker/overlay2/l/2KDSGP7VKFMJR4C3GFANVX4QBN:/var/lib/docker/overlay2/l/RQGLGB33HTKTBTSRXW53V7PNXQ:/var/lib/docker/overlay2/l/PPCDKWEUS6OY7NNDRHKSWKFUHY:/var/lib/docker/overlay2/l/SF4RIIOHIGRUCTKUXZB6FDSZD7:/var/lib/docker/overlay2/l/VORH6BSJPVXUCIXSQPI4V7YOZ5,upperdir=/var/lib/docker/overlay2/9bd3f6b8aaa8519b0af1256454858dddb38d14e2682aa50a57b0253f9e00cd61/diff,workdir=/var/lib/docker/overlay2/9bd3f6b8aaa8519b0af1256454858dddb38d14e2682aa50a57b0253f9e00cd61/work,xino=off)
overlay on /usr/lib/x86_64-linux-gnu/libcudadebugger.so.570.124.06 type overlay (ro,nosuid,nodev,relatime,lowerdir=/var/lib/docker/overlay2/l/YNAMYYCHXHBXE5PZBTHKRLESOU:/var/lib/docker/overlay2/l/2KDSGP7VKFMJR4C3GFANVX4QBN:/var/lib/docker/overlay2/l/RQGLGB33HTKTBTSRXW53V7PNXQ:/var/lib/docker/overlay2/l/PPCDKWEUS6OY7NNDRHKSWKFUHY:/var/lib/docker/overlay2/l/SF4RIIOHIGRUCTKUXZB6FDSZD7:/var/lib/docker/overlay2/l/VORH6BSJPVXUCIXSQPI4V7YOZ5,upperdir=/var/lib/docker/overlay2/9bd3f6b8aaa8519b0af1256454858dddb38d14e2682aa50a57b0253f9e00cd61/diff,workdir=/var/lib/docker/overlay2/9bd3f6b8aaa8519b0af1256454858dddb38d14e2682aa50a57b0253f9e00cd61/work,xino=off)

nvidia runtime

$ docker run --rm -ti --gpus=all -e NVIDIA_DISABLE_REQUIRE=1 --runtime=nvidia -v $(pwd):/work -w /work nvidia/cuda:12.8.1-base-ubuntu20.04 bash -c "./check-compat.sh"
+ ldconfig -p
+ grep libcuda
        libcudart.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
        libcudadebugger.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1
        libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
+ ls -al /usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01 /usr/lib/x86_64-linux-gnu/libcuda.so.570.124.06
lrwxrwxrwx 1 root root       12 May 13 13:12 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       21 May 13 13:12 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.570.124.06
-rw-r--r-- 1 root root 20988032 Feb 27  2023 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
-rw-r--r-- 1 root root 71373816 Feb 26 02:04 /usr/lib/x86_64-linux-gnu/libcuda.so.570.124.06
+ mount
+ grep libcuda
/dev/sda2 on /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01 type ext4 (ro,nosuid,nodev,relatime,nodelalloc,errors=remount-ro,stripe=64)
overlay on /usr/lib/x86_64-linux-gnu/libcuda.so.570.124.06 type overlay (ro,nosuid,nodev,relatime,lowerdir=/var/lib/docker/overlay2/l/S76G4EOPKJMKO7EANJT54FYT35:/var/lib/docker/overlay2/l/2KDSGP7VKFMJR4C3GFANVX4QBN:/var/lib/docker/overlay2/l/RQGLGB33HTKTBTSRXW53V7PNXQ:/var/lib/docker/overlay2/l/PPCDKWEUS6OY7NNDRHKSWKFUHY:/var/lib/docker/overlay2/l/SF4RIIOHIGRUCTKUXZB6FDSZD7:/var/lib/docker/overlay2/l/VORH6BSJPVXUCIXSQPI4V7YOZ5,upperdir=/var/lib/docker/overlay2/67e07f9319d2c2df1ead5fd2b837a9e67402237d32e6f77f9a3168852af6c3e6/diff,workdir=/var/lib/docker/overlay2/67e07f9319d2c2df1ead5fd2b837a9e67402237d32e6f77f9a3168852af6c3e6/work,xino=off)
overlay on /usr/lib/x86_64-linux-gnu/libcudadebugger.so.570.124.06 type overlay (ro,nosuid,nodev,relatime,lowerdir=/var/lib/docker/overlay2/l/S76G4EOPKJMKO7EANJT54FYT35:/var/lib/docker/overlay2/l/2KDSGP7VKFMJR4C3GFANVX4QBN:/var/lib/docker/overlay2/l/RQGLGB33HTKTBTSRXW53V7PNXQ:/var/lib/docker/overlay2/l/PPCDKWEUS6OY7NNDRHKSWKFUHY:/var/lib/docker/overlay2/l/SF4RIIOHIGRUCTKUXZB6FDSZD7:/var/lib/docker/overlay2/l/VORH6BSJPVXUCIXSQPI4V7YOZ5,upperdir=/var/lib/docker/overlay2/67e07f9319d2c2df1ead5fd2b837a9e67402237d32e6f77f9a3168852af6c3e6/diff,workdir=/var/lib/docker/overlay2/67e07f9319d2c2df1ead5fd2b837a9e67402237d32e6f77f9a3168852af6c3e6/work,xino=off)

cuda-compat-mode=disabled

$ nvidia-ctk config | grep cuda-compat-mode
cuda-compat-mode = "disabled"

--gpus flag

$ docker run --rm -ti --gpus=all -e NVIDIA_DISABLE_REQUIRE=1 --runtime=runc -v $(pwd):/work -w /work nvidia/cuda:12.8.1-base-ubuntu20.04 bash -c "./check-compat.sh"
+ ldconfig -p
+ grep libcuda
        libcudart.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
        libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
+ ls -al /usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
lrwxrwxrwx 1 root root       12 May 13 13:17 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       21 May 13 13:17 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.515.105.01
-rw-r--r-- 1 root root 20988032 Feb 27  2023 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
+ mount
+ grep libcuda
/dev/sda2 on /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01 type ext4 (ro,nosuid,nodev,relatime,nodelalloc,errors=remount-ro,stripe=64)

nvidia runtime

$ docker run --rm -ti --gpus=all -e NVIDIA_DISABLE_REQUIRE=1 --runtime=nvidia -v $(pwd):/work -w /work nvidia/cuda:12.8.1-base-ubuntu20.04 bash -c "./check-compat.sh"
+ ldconfig -p
+ grep libcuda
        libcudart.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
        libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
+ ls -al /usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
lrwxrwxrwx 1 root root       12 May 13 13:18 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       21 May 13 13:18 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.515.105.01
-rw-r--r-- 1 root root 20988032 Feb 27  2023 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
+ mount
+ grep libcuda
/dev/sda2 on /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01 type ext4 (ro,nosuid,nodev,relatime,nodelalloc,errors=remount-ro,stripe=64)

cuda-compat-mode=hook

--gpus flag

Note that CUDA Forward Compat is disabled in this case, since the enable-cuda-compat hook requires the nvidia-container-runtime.

$ docker run --rm -ti --gpus=all -e NVIDIA_DISABLE_REQUIRE=1 --runtime=runc -v $(pwd):/work -w /work nvidia/cuda:12.8.1-base-ubuntu20.04 bash -c "./check-compat.sh"
+ ldconfig -p
+ grep libcuda
        libcudart.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
        libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
+ ls -al /usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
lrwxrwxrwx 1 root root       12 May 13 13:19 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       21 May 13 13:19 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.515.105.01
-rw-r--r-- 1 root root 20988032 Feb 27  2023 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
+ mount
+ grep libcuda
/dev/sda2 on /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01 type ext4 (ro,nosuid,nodev,relatime,nodelalloc,errors=remount-ro,stripe=64)

nvidia runtime

$ docker run --rm -ti --gpus=all -e NVIDIA_DISABLE_REQUIRE=1 --runtime=nvidia -v $(pwd):/work -w /work nvidia/cuda:12.8.1-base-ubuntu20.04 bash -c "./check-compat.sh"
+ ldconfig -p
+ grep libcuda
        libcudart.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
        libcudadebugger.so.1 (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcudadebugger.so.1
        libcuda.so.1 (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcuda.so.1
        libcuda.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so.1
        libcuda.so (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcuda.so
        libcuda.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so
+ ls -al /usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
lrwxrwxrwx 1 root root       12 May 13 13:20 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       21 May 13 13:20 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.515.105.01
-rw-r--r-- 1 root root 20988032 Feb 27  2023 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
+ mount
+ grep libcuda
/dev/sda2 on /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01 type ext4 (ro,nosuid,nodev,relatime,nodelalloc,errors=remount-ro,stripe=64)

nvidia runtime with disable-cuda-compat-lib-hook = true

$ docker run --rm -ti --gpus=all -e NVIDIA_DISABLE_REQUIRE=1 --runtime=runc -v $(pwd):/work -w /work nvidia/cuda:12.8.1-base-ubuntu20.04 bash -c "./check-compat.sh"
+ ldconfig -p
+ grep libcuda
        libcudart.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
        libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
+ ls -al /usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
lrwxrwxrwx 1 root root       12 May 13 13:22 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       21 May 13 13:22 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.515.105.01
-rw-r--r-- 1 root root 20988032 Feb 27  2023 /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01
+ mount
+ grep libcuda
/dev/sda2 on /usr/lib/x86_64-linux-gnu/libcuda.so.515.105.01 type ext4 (ro,nosuid,nodev,relatime,nodelalloc,errors=remount-ro,stripe=64)

@elezar elezar self-assigned this Apr 29, 2025
@elezar elezar force-pushed the add-cuda-compat-mode branch 2 times, most recently from 3b27c30 to 1db1b88 on April 29, 2025 13:54
@elezar elezar changed the title from "TODO" to "Add CUDA forward compat support using a folder" Apr 29, 2025
@elezar elezar force-pushed the add-cuda-compat-mode branch 4 times, most recently from 977723e to aa0cb99 on April 30, 2025 12:53
@elezar elezar changed the title from "Add CUDA forward compat support using a folder" to "Add nvidia-container-cli.compat-mode config option" Apr 30, 2025
@klueska (Contributor) left a comment:

First pass

Contributor:

It seems the test only tests with the "ldconfig" method and nothing else. Is this intentional?

Member Author:

It's just to check that the default config is generated correctly. We don't currently have a setting to override this, but we probably want to expose it.

Comment on lines 26 to 29
CUDACompatModeMount = "mount"
CUDACompatModeLdconfig = "ldconfig"
CUDACompatModeHook = "hook"
CUDACompatModeDisabled = "disabled"
Contributor:

So the runtime has one more "mode" than the nvidia-container-cli, i.e. the hook mode -- is that the interpretation here?

Member Author:

Yes, that is correct. Thinking about it now, I don't think we should put this setting in the nvidia-container-cli section, but rather in the nvidia-container-runtime section.

Comment on lines 68 to 70
// CUDACompatMode sets the mode to be used to make CUDA Forward Compat
// libraries discoverable in the container.
CUDACompatMode cudaCompatMode `toml:"cuda-compat-mode,omitempty"`
Member Author:

Should this rather be at:

nvidia-container-runtime.cuda-compat-mode = ldconfig

or

nvidia-container-runtime.modes.legacy.cuda-compat-mode = ldconfig

I think the latter, since it only affects the behaviour of the legacy runtime mode; in the other modes the hook is always injected unless it is explicitly opted out of via the feature flag.

@cdesiniotis (Contributor) commented May 7, 2025:

I don't have a strong opinion here, but I think your latter suggestion makes sense. At first, I thought this made more sense in the nvidia-container-cli section, but because of the value hook, which only applies when the NVIDIA Container Runtime is used, I think it is okay to nest this option under nvidia-container-runtime.modes.legacy

Member Author:

Yes, I think it's cleaner this way. I have updated the implementation accordingly.

@elezar elezar force-pushed the add-cuda-compat-mode branch 2 times, most recently from d2daf38 to 742ff37 on May 9, 2025 12:19
@elezar elezar added the must-backport The changes in PR need to be backported to at least one stable release branch. label May 9, 2025
@elezar elezar added this to the v1.17.7 milestone May 9, 2025
args = append(args, "--no-cntlibs")
switch hook.NVIDIAContainerRuntimeConfig.Modes.Legacy.CUDACompatMode {
case config.CUDACompatModeLdconfig:
args = append(args, "--cuda-compat-mode=ldconfig")
Contributor:

Instead of repeating args = append(args, "<flag>") statements, can we assign the resolved flag to a local variable and append it to args after the switch-case block?

Member Author:

I added a method to the hookConfig type to pull this logic into. I can see us doing something similar for the other flags as a follow up.
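The refactor discussed above could look roughly like the following. This is a hedged sketch, not the actual implementation: the type, method, and constant names here (hookConfig, cudaCompatModeFlag) are stand-ins for whatever the PR actually uses; only the mode values and the --cuda-compat-mode / --no-cntlibs flags come from the thread.

```go
package main

import "fmt"

// Mode values mirroring the CUDACompatMode constants quoted in the review.
const (
	cudaCompatModeMount    = "mount"
	cudaCompatModeLdconfig = "ldconfig"
	cudaCompatModeHook     = "hook"
	cudaCompatModeDisabled = "disabled"
)

// hookConfig is a hypothetical stand-in for the toolkit's hook config type.
type hookConfig struct {
	cudaCompatMode string
}

// cudaCompatModeFlag resolves the single --cuda-compat-mode flag to pass to
// nvidia-container-cli, so the caller appends it once instead of repeating
// args = append(args, "<flag>") in every switch case.
func (c hookConfig) cudaCompatModeFlag() string {
	switch c.cudaCompatMode {
	case cudaCompatModeLdconfig, cudaCompatModeMount, cudaCompatModeDisabled:
		return "--cuda-compat-mode=" + c.cudaCompatMode
	case cudaCompatModeHook:
		// In hook mode the CLI is told to do nothing; the
		// enable-cuda-compat hook provides forward compatibility.
		return "--cuda-compat-mode=disabled"
	default:
		return ""
	}
}

func main() {
	args := []string{"--no-cntlibs"}
	if flag := (hookConfig{cudaCompatMode: cudaCompatModeLdconfig}).cudaCompatModeFlag(); flag != "" {
		args = append(args, flag)
	}
	fmt.Println(args)
}
```

The single exit point after the switch keeps the flag resolution testable in isolation, which matches the reviewer's suggestion.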

@elezar elezar force-pushed the add-cuda-compat-mode branch from 742ff37 to 50959cf on May 13, 2025 12:56
@elezar elezar mentioned this pull request May 13, 2025
This change adds an nvidia-container-runtime.modes.legacy.cuda-compat-mode
config option. This can be set to one of four values:

* ldconfig (default): the --cuda-compat-mode=ldconfig flag is passed to the nvidia-container-cli
* mount: the --cuda-compat-mode=mount flag is passed to the nvidia-container-cli
* disabled: the --cuda-compat-mode=disabled flag is passed to the nvidia-container-cli
* hook: the --cuda-compat-mode=disabled flag is passed to the nvidia-container-cli AND the
  enable-cuda-compat hook is used to provide forward compatibility.

Note that the disable-cuda-compat-lib-hook feature flag will prevent the enable-cuda-compat
hook from being used. This change also means that the allow-cuda-compat-libs-from-container
feature flag no longer has any effect.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the add-cuda-compat-mode branch from 50959cf to f4981f0 on May 13, 2025 19:50
@elezar elezar marked this pull request as ready for review May 13, 2025 19:50
@elezar elezar merged commit 72b2ee9 into NVIDIA:main May 13, 2025
15 checks passed
@elezar elezar deleted the add-cuda-compat-mode branch May 13, 2025 19:56