Docker and GPUs
This is a planning document for supporting GPU computing in Docker/Podman with BOINC. The eventual goal is to support:
- multiple GPU types (discrete: NVIDIA, AMD; integrated: Intel, Apple Silicon, ARM Mali, Qualcomm Adreno);
- multiple APIs/toolkits: CUDA, OpenCL, Metal;
- multiple OSs: Windows (with WSL), Linux, macOS.
A BOINC project scientist wanting to do Docker/GPU computing would need to supply, for each CPU architecture they support (Intel and/or ARM):
- A Dockerfile specifying a base image and possibly some libraries.
- An executable that runs in the container. They'd build this on Linux, possibly in a container; we call this the 'build environment'.
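As a rough illustration of these two items, the sketch below writes a minimal Dockerfile from a shell script. The base image tag and the /cuda/add_test path are borrowed from the podman experiments later on this page; the helper name write_dockerfile and the build directory layout are assumptions for illustration, and a real BUDA app may well need further directives.

```shell
#!/bin/sh
# Sketch only: write a minimal Dockerfile for a CUDA test app into a build directory.
# write_dockerfile is a hypothetical helper; the image tag and the /cuda/add_test
# path are taken from the podman commands used elsewhere on this page.
write_dockerfile() {
    build_dir="$1"
    base_image="${2:-nvcr.io/nvidia/cuda:13.2.0-cudnn-runtime-ubuntu24.04}"
    mkdir -p "$build_dir" || return 1
    cat > "$build_dir/Dockerfile" <<EOF
# Base image: supplies the CUDA runtime libraries
FROM $base_image
# The executable produced in the build environment
COPY add_test /cuda/add_test
EOF
}

demo_dir="$(mktemp -d)"
write_dockerfile "$demo_dir"
cat "$demo_dir/Dockerfile"
```

With add_test copied into the same directory, running podman build -t boinc_cuda "$demo_dir" should give an image that podman refers to as localhost/boinc_cuda.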
The scientist would then create BUDA variants for each processor/GPU combination, with plan classes like 'docker_nvidia_opencl'.
If the scientist uses OpenCL, they could create a 'generic app' that can use any GPU that supports OpenCL, and maybe multicore CPUs as well (they'd use multiple BUDA variants with the same executable). However, to maximize performance they might want to create versions for each GPU type.
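To get a feel for how many variants this implies, here is a small sketch that prints one candidate plan class name per GPU-vendor/API pair. Only 'docker_nvidia_opencl' comes from the text above; the docker_<vendor>_<api> pattern and every other name are guesses, not an agreed naming scheme.

```shell
#!/bin/sh
# Sketch only: print one candidate plan class name per GPU-vendor/API pair.
# The docker_<vendor>_<api> pattern is extrapolated from the single example
# 'docker_nvidia_opencl'; the other names are hypothetical.
plan_class_names() {
    for pair in nvidia:cuda nvidia:opencl amd:opencl intel:opencl apple:metal; do
        vendor="${pair%%:*}"   # text before the colon
        api="${pair#*:}"       # text after the colon
        printf 'docker_%s_%s\n' "$vendor" "$api"
    done
}

plan_class_names
```

A generic OpenCL app would reuse one executable across all the *_opencl entries, while a tuned app would ship a different executable for each.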
We need to:
- create cookbooks showing projects how to do the above;
- figure out what changes are needed in the client and docker_wrapper.
And we also need to define the 'execution environment':
- for Windows, what does our WSL distro (boinc-buda-runner) need to contain?
- for Linux/macOS, what libraries (if any) does the volunteer need to install?
To set up a CUDA build environment in a Debian container (commands are run as root; prefix them with sudo otherwise):
apt-get update
apt-get install -y wget
wget https://developer.download.nvidia.com/compute/cuda/repos/debian13/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt-get update
apt-get -y install cuda-toolkit-13-2
Instructions for building test app?
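One possible starting point for these instructions, assuming the toolkit was installed from NVIDIA's Debian packages as above and that the test source is a single file called add_test.cu (a hypothetical name, chosen to match the /cuda/add_test binary used later). The toolkit's bin directory is usually not on PATH after installation, so the helper also looks under /usr/local/cuda, which is an assumed default prefix.

```shell
#!/bin/sh
# Sketch only: locate nvcc, then print the compile command for the test app.
# find_nvcc and add_test.cu are hypothetical names used for illustration.
find_nvcc() {
    prefix="${1:-/usr/local/cuda}"   # assumed default install prefix
    if command -v nvcc >/dev/null 2>&1; then
        command -v nvcc              # already on PATH
    elif [ -x "$prefix/bin/nvcc" ]; then
        printf '%s\n' "$prefix/bin/nvcc"
    else
        return 1                     # toolkit not found
    fi
}

if nvcc_path="$(find_nvcc)"; then
    echo "build with: $nvcc_path -o add_test add_test.cu"
else
    echo "nvcc not found; is the CUDA toolkit installed?"
fi
```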
Carl created a WSL distro as follows:
wsl.exe --install Debian
wsl.exe -d Debian (to open a shell in the distro, if the install step did not already do so)
sudo apt-get update
sudo apt-get install -y podman
sudo apt-get install -y nvidia-container-toolkit
Then, as described in https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html:
sudo apt-get update && sudo apt-get install -y --no-install-recommends \
    ca-certificates \
    curl \
    gnupg2
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.19.0-1
sudo apt-get install -y \
    nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
NVIDIA recommends CDI (Container Device Interface) for Podman: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
Check the status of the CDI refresh unit in WSL:
sudo systemctl status nvidia-cdi-refresh.path
Enable the NVIDIA CDI refresh units in WSL:
sudo systemctl enable --now nvidia-cdi-refresh.path
sudo systemctl enable --now nvidia-cdi-refresh.service
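These systemctl commands only work if systemd is running inside the WSL distro, which WSL controls through the [boot] section of /etc/wsl.conf. A minimal check, where wsl_systemd_enabled is a hypothetical helper name and the parsing is deliberately simple:

```shell
#!/bin/sh
# Sketch only: report whether systemd=true is set under [boot] in wsl.conf.
# wsl_systemd_enabled is a hypothetical helper; pass another path for testing.
wsl_systemd_enabled() {
    conf="${1:-/etc/wsl.conf}"
    [ -r "$conf" ] || return 1
    awk '
        # Track whether the current section header is [boot]
        /^[ \t]*\[/ { in_boot = ($0 ~ /^[ \t]*\[boot\]/) }
        in_boot && /^[ \t]*systemd[ \t]*=[ \t]*true/ { found = 1 }
        END { exit found ? 0 : 1 }
    ' "$conf"
}

if wsl_systemd_enabled; then
    echo "systemd is enabled in /etc/wsl.conf"
else
    echo "add 'systemd=true' under [boot] in /etc/wsl.conf, then run 'wsl.exe --shutdown'"
fi
```

If boinc-buda-runner ships its own /etc/wsl.conf, this setting could simply be baked into the distro.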
Example podman run command:
podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L
I needed to manually regenerate the CDI spec in WSL:
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
This lists my GPU:
nvidia-smi -L
outputs: GPU 0: NVIDIA GeForce RTX 5060 Ti (UUID: GPU-0c5b2b4f-7b5e-0e50-b7b6-934235d41d3e)
podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L
outputs: GPU 0: NVIDIA GeForce RTX 5060 Ti (UUID: GPU-0c5b2b4f-7b5e-0e50-b7b6-934235d41d3e)
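The comparison above (same UUID outside and inside the container) could be automated roughly as follows. This is a sketch that assumes the "GPU n: name (UUID: ...)" line format shown in the output above; gpu_uuids and same_gpus are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: compare the GPU UUIDs reported by two 'nvidia-smi -L' listings.
# gpu_uuids and same_gpus are hypothetical helpers.

# Read 'nvidia-smi -L' output on stdin and print one UUID per line.
gpu_uuids() {
    sed -n 's/^GPU [0-9][0-9]*: .*(UUID: \(GPU-[0-9a-fA-F-]*\))$/\1/p'
}

# $1: listing from outside the container, $2: listing from inside it.
# Succeeds only if at least one GPU is listed and both UUID sets match.
same_gpus() {
    host="$(printf '%s\n' "$1" | gpu_uuids | sort)"
    cont="$(printf '%s\n' "$2" | gpu_uuids | sort)"
    [ -n "$host" ] && [ "$host" = "$cont" ]
}

# Demo with the line recorded above standing in for both listings.
sample='GPU 0: NVIDIA GeForce RTX 5060 Ti (UUID: GPU-0c5b2b4f-7b5e-0e50-b7b6-934235d41d3e)'
if same_gpus "$sample" "$sample"; then
    echo "container sees the same GPUs"
fi
```

In practice the two arguments would be "$(nvidia-smi -L)" and the output of the podman run command above.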
View the CDI config file:
cat /etc/cdi/nvidia.yaml
You can manually edit the CDI config file to give the GPU a name:
sudo vi /etc/cdi/nvidia.yaml
cdiVersion: 0.3.0
kind: nvidia.com/gpu
devices:
- name: gpu0
  containerEdits:
    deviceNodes:
    - path: /dev/dxg
      major: 10
      minor: 125
      fileMode: 438
      permissions: rwm
podman run --rm --device nvidia.com/gpu=gpu0 --security-opt=label=disable ubuntu nvidia-smi -L
So instead of "all" I used "gpu0". The name seems to be just an arbitrary label: I changed it to "carl", and as long as podman run used gpu=carl it worked fine.
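To make the "it's just a label" point concrete, the sketch below emits the same fragment with a caller-chosen device name, plus the matching --device argument. It reproduces only the fields shown above (a generated spec normally contains more, such as mounts and hooks), so treat it as a way to see where the label lives, not as a replacement for /etc/cdi/nvidia.yaml. The helper names are hypothetical.

```shell
#!/bin/sh
# Sketch only: emit the CDI fragment shown above with an arbitrary device name.
# write_cdi_spec and cdi_device_arg are hypothetical helpers; the device node
# values are copied from the WSL example on this page.
write_cdi_spec() {
    name="$1"
    cat <<EOF
cdiVersion: 0.3.0
kind: nvidia.com/gpu
devices:
- name: $name
  containerEdits:
    deviceNodes:
    - path: /dev/dxg
      major: 10
      minor: 125
      fileMode: 438
      permissions: rwm
EOF
}

# The podman --device argument must use the same name after "gpu=".
cdi_device_arg() {
    printf '%s\n' "--device nvidia.com/gpu=$1"
}

write_cdi_spec carl
cdi_device_arg carl
```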
Using the nvidia/cuda container in podman:
podman run -it --rm --device nvidia.com/gpu=gpu0 nvcr.io/nvidia/cuda:13.2.0-cudnn-runtime-ubuntu24.04 nvidia-smi -L
If you go back to an /etc/cdi/nvidia.yaml with gpu=all set and run:
podman run -it --rm --device nvidia.com/gpu=all nvcr.io/nvidia/cuda:13.2.0-cudnn-runtime-ubuntu24.04 nvidia-smi
It seems you then get the GPU's index (0 here), and BOINC or the application can choose which GPU to run on via an environment variable: export CUDA_VISIBLE_DEVICES=0 (where 0 is the index of the GPU to run on):
podman run -e CUDA_VISIBLE_DEVICES=0 -it --rm --device nvidia.com/gpu=all localhost/boinc_cuda /cuda/add_test
podman run -e CUDA_VISIBLE_DEVICES=1 -it --rm --device nvidia.com/gpu=all localhost/boinc_cuda /cuda/add_test
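If the client or docker_wrapper ends up choosing the GPU this way, the command line could be assembled along these lines. This is only a sketch: gpu_run_cmd is a hypothetical helper, it prints the command rather than running it, and it drops -it on the assumption that a BOINC job has no terminal attached.

```shell
#!/bin/sh
# Sketch only: build (but do not run) a podman command that exposes all GPUs
# through CDI and selects one of them with CUDA_VISIBLE_DEVICES.
# gpu_run_cmd is a hypothetical helper.
gpu_run_cmd() {
    [ "$#" -ge 3 ] || return 1            # need: index, image, command
    gpu_index="$1"
    image="$2"
    shift 2
    case "$gpu_index" in
        ''|*[!0-9]*) return 1 ;;          # index must be a non-negative integer
    esac
    printf 'podman run -e CUDA_VISIBLE_DEVICES=%s --rm --device nvidia.com/gpu=all %s %s\n' \
        "$gpu_index" "$image" "$*"
}

gpu_run_cmd 0 localhost/boinc_cuda /cuda/add_test
```

Passing gpu=all and filtering with the environment variable keeps the CDI file untouched, at the cost of relying on the application to honour CUDA_VISIBLE_DEVICES.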
Carl did the above starting with a Debian WSL distro.
- Can we use Alpine (with its musl libc)?
- If not, is there a thin glibc distro we can use?
There are a lot of steps, and this is only for CUDA. Should we do this configuration:
- Ourselves, hardwired into boinc-buda-runner? (downside: it might get big, with stuff that a particular host would never use)
- As commands from the BOINC client? (could tailor these based on GPUs present on host).
- As a script included in boinc-buda-runner? (e.g. the client would run config_cuda in the distro if an NVIDIA GPU is present).
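For the third option, the dispatch logic might look something like the sketch below. It assumes the client passes the GPU vendors it detected as arguments. Only config_cuda is named in the text above; config_opencl and the helper name config_scripts_for are invented for illustration.

```shell
#!/bin/sh
# Sketch only: map detected GPU vendors to the config scripts that the client
# would run inside boinc-buda-runner. config_scripts_for is a hypothetical helper.
config_scripts_for() {
    for vendor in "$@"; do
        case "$vendor" in
            nvidia)    echo config_cuda ;;      # name taken from the text above
            amd|intel) echo config_opencl ;;    # hypothetical script name
            *)         echo "no config script for '$vendor'" >&2 ;;
        esac
    done | sort -u    # run each script at most once
}

config_scripts_for nvidia intel
```

This keeps the distro small until a script actually runs, but each script would still have to download packages on the volunteer's host.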