Docker and GPUs

This is a planning document for supporting GPU computing in Docker/Podman with BOINC. The eventual goal is to support

multiple GPU types (discrete: NVIDIA, AMD; integrated: Intel, Apple Silicon, ARM Mali, Qualcomm Adreno);
multiple APIs/toolkits: CUDA, OpenCL, Metal;
multiple OSs: Windows (with WSL), Linux, MacOS.

A BOINC project scientist wanting to do Docker/GPU computing would need to supply (for Intel/x86 and/or ARM CPUs):

A Dockerfile specifying a base image and possibly some libraries.
An executable that runs in the container. They'd build this on Linux, possibly in a container; we call this the 'build environment'.

The scientist would then create BUDA variants for each processor/GPU combination, with plan classes like 'docker_nvidia_opencl'.

If the scientist uses OpenCL, they could create a 'generic app' that can use any GPU that supports OpenCL, and perhaps a multicore CPU as well (using multiple BUDA variants with the same executable). However, to maximize performance they might want to create versions for each GPU type.
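The variant naming above can be captured in a small helper. This is a hypothetical sketch in POSIX sh: the pattern docker_<vendor>_<api> is extrapolated from the single example 'docker_nvidia_opencl', and the function name and the vendor/API keys are assumptions, not existing BUDA or BOINC code.

```shell
# Hypothetical helper: print a BUDA plan class name of the form
# docker_<vendor>_<api>, extrapolated from the example 'docker_nvidia_opencl'.
# The body runs in a subshell ( ... ), so 'exit' only leaves the function.
plan_class_name() (
  vendor=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  api=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')

  # assumed vendor and API keys, taken from the goals listed above
  case $vendor in
    nvidia|amd|intel|apple|arm|qualcomm) ;;
    *) echo "unknown GPU vendor: $1" >&2; exit 1 ;;
  esac
  case $api in
    cuda|opencl|metal) ;;
    *) echo "unknown API: $2" >&2; exit 1 ;;
  esac

  # CUDA only runs on NVIDIA GPUs, and Metal only on Apple GPUs
  if { [ "$api" = cuda ] && [ "$vendor" != nvidia ]; } ||
     { [ "$api" = metal ] && [ "$vendor" != apple ]; }; then
    echo "$api is not available on $vendor GPUs" >&2
    exit 1
  fi

  printf 'docker_%s_%s\n' "$vendor" "$api"
)
```

For example, plan_class_name NVIDIA OpenCL prints docker_nvidia_opencl, while plan_class_name amd cuda fails, so a 'generic' OpenCL app would get one variant per vendor with the same executable.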

We need to:

create cookbooks showing projects how to do the above;
figure out what changes are needed in the client and docker_wrapper.

We also need to define the 'execution environment':

for Windows, what does our WSL distro (boinc-buda-runner) need to contain?
for Linux/MacOS, what libraries (if any) does the volunteer need to install?

Build environment

In a Debian 13 container (e.g. started with docker run -it debian:13 bash), where you are root and sudo is not installed:

apt-get update
apt-get install -y wget ca-certificates
wget https://developer.download.nvidia.com/compute/cuda/repos/debian13/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt-get update
apt-get -y install cuda-toolkit-13-2
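The versioned names in these commands follow a pattern that can be captured in a few helpers, which makes it easier to switch distro or CUDA version. A sketch in POSIX sh, assuming NVIDIA keeps the URL and package naming used above; the function names are hypothetical. It also records where nvcc ends up: NVIDIA's packages install the toolkit under /usr/local/cuda-<version> and do not add it to PATH.

```shell
# Hypothetical helpers deriving the names used in the install commands,
# assuming NVIDIA keeps the naming patterns seen there.

# cuda_keyring_url <distro> <arch>, e.g. debian13 x86_64
cuda_keyring_url() {
  printf 'https://developer.download.nvidia.com/compute/cuda/repos/%s/%s/cuda-keyring_1.1-1_all.deb\n' "$1" "$2"
}

# cuda_toolkit_package <major.minor>, e.g. 13.2 -> cuda-toolkit-13-2
cuda_toolkit_package() {
  case $1 in
    ''|*[!0-9.]*) echo "bad CUDA version: $1" >&2; return 1 ;;
  esac
  printf 'cuda-toolkit-%s\n' "$(printf '%s' "$1" | tr '.' '-')"
}

# cuda_bin_dir <major.minor>: directory holding nvcc (not on PATH by default)
cuda_bin_dir() {
  printf '/usr/local/cuda-%s/bin\n' "$1"
}
```

For example, after the installation, export PATH="$(cuda_bin_dir 13.2):$PATH" followed by nvcc --version should report the 13.2 compiler.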

Instructions for building a test app?
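One possible way to package a test app is a runtime image containing only the compiled binary. A hypothetical sketch: it assumes a binary named add_test, built with nvcc in the build environment, and reuses the image name localhost/boinc_cuda and the path /cuda/add_test from the podman examples further down. Basing the image on debian:13-slim keeps the glibc version the same as in the Debian 13 build environment, and because nvcc links the CUDA runtime statically by default, the only CUDA library needed at run time should be the driver library that the CDI spec makes available in the container.

```shell
# Hypothetical sketch: write a Containerfile that installs a prebuilt
# CUDA test binary (assumed name: add_test) at /cuda/add_test.
# $1 = build context directory (must already contain add_test)
write_test_containerfile() {
  cat > "$1/Containerfile" <<'EOF'
# same Debian release as the build environment, so glibc matches
FROM debian:13-slim
COPY add_test /cuda/add_test
RUN chmod 0755 /cuda/add_test
EOF
}
```

With add_test in the current directory, write_test_containerfile . && podman build -t boinc_cuda . should produce an image that podman lists as localhost/boinc_cuda.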

Execution environment: Windows

CUDA

Download a Debian 13 slim root filesystem tarball (e.g. debian13_slim.tar.gz), or build it yourself from the Docker image:

https://hub.docker.com/layers/library/debian/13-slim/images/sha256-16299d0907c6383b73d648a0eb69a1c76b753b859e4998a490f8fe3cd909bebf

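# get the Debian 13 slim image
docker pull debian:13-slim

# wsl --import needs a root filesystem tarball, not an image archive (docker save),
# so export the filesystem of a container created (not started) from the image
docker create --name debian13_slim debian:13-slim
docker export debian13_slim | gzip > debian13_slim.tar.gz
docker rm debian13_slim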
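wsl --import expects a root filesystem tarball; an image archive written by docker save has a different layout (a top-level manifest.json plus layer tarballs) and will not give a usable distro. A quick sanity check before importing, as a sketch: it assumes GNU tar or bsdtar, which detect gzip compression automatically when listing, and uses etc/os-release as the marker of a root filesystem.

```shell
# Sketch: succeed only if the tarball looks like a root filesystem
# (contains etc/os-release) and not like a 'docker save' image archive
# (which has a top-level manifest.json). Exit status 2 = unreadable file.
is_rootfs_tarball() (
  listing=$(tar -tf "$1") || exit 2
  if printf '%s\n' "$listing" | grep -Eq '^(\./)?manifest\.json$'; then
    exit 1
  fi
  printf '%s\n' "$listing" | grep -Eq '^(\./)?etc/os-release$'
)
```

For example: is_rootfs_tarball debian13_slim.tar.gz && echo "ready for wsl --import".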

Now, in Windows, create a WSL distro from this small Debian 13 tarball by running, in an administrator PowerShell:

# NB: replace [username] with your Users\* directory
wsl --import debian_slim C:\Users\[username]\AppData\Local\wsl\debian_slim ./debian13_slim.tar.gz
# to launch this distro
wsl -d debian_slim

# we'll at least need sudo right away (an imported distro logs in as root)
apt update ; apt install -y sudo

# probably sensible to add a boinc user, with a home directory and bash as its shell
useradd -m -s /bin/bash boinc
# set a password; sudo will ask for it
passwd boinc

# give boinc sudo rights; a drop-in file is safer than editing /etc/sudoers directly
echo 'boinc    ALL=(ALL:ALL) ALL' > /etc/sudoers.d/boinc
chmod 0440 /etc/sudoers.d/boinc

# optionally make boinc the default login user (the [user] section of /etc/wsl.conf)

# do the rest from this new boinc account
su - boinc
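Setting the default login user, and enabling systemd (which the systemctl commands for the nvidia-cdi-refresh units further down rely on), are both done in /etc/wsl.conf. A minimal sketch that writes it, assuming the freshly imported distro has no wsl.conf yet (the function overwrites the file); run it as root, then restart the distro from PowerShell with wsl --terminate debian_slim so the settings take effect.

```shell
# Sketch: write /etc/wsl.conf for the distro.
# $1 = default login user, $2 = output file (defaults to /etc/wsl.conf)
# Note: this overwrites any existing file.
write_wsl_conf() {
  cat > "${2:-/etc/wsl.conf}" <<EOF
[boot]
systemd=true

[user]
default=$1
EOF
}
```

For example, as root: write_wsl_conf boinc.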

Windows needs its native NVIDIA driver, of course, but containers launched from WSL also need the NVIDIA Container Toolkit, which makes the Windows driver's GPU device (/dev/dxg) and libraries available inside the container. No NVIDIA driver needs to be installed in WSL itself.
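On WSL2 the Windows GPU driver appears inside the distro as the /dev/dxg device (the same device node that shows up in the CDI spec) and as user-space libraries under /usr/lib/wsl/lib. A quick check that both are present before installing anything, as a sketch: the exact library file name (libcuda.so.1) is an assumption based on typical WSL installs, and the optional root argument exists only to make the function testable.

```shell
# Sketch: check that Windows exposes an NVIDIA GPU to this WSL2 distro.
# $1 = optional root prefix (default: the real filesystem root)
wsl_nvidia_present() (
  root=${1:-}
  if [ ! -e "$root/dev/dxg" ]; then
    echo "no /dev/dxg: not WSL2, or no GPU paravirtualization" >&2
    exit 1
  fi
  if [ ! -e "$root/usr/lib/wsl/lib/libcuda.so.1" ]; then
    echo "no libcuda in /usr/lib/wsl/lib: Windows NVIDIA driver missing?" >&2
    exit 1
  fi
)
```

In the running distro, wsl_nvidia_present && echo "NVIDIA driver visible from WSL" should succeed before going on.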

In the running WSL distro (e.g. after wsl -d debian_slim), install the NVIDIA Container Toolkit and podman, following https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html:

sudo apt-get update && sudo apt-get install -y --no-install-recommends \
   ca-certificates \
   curl \
   gnupg2

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update

export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.19.0-1
sudo apt-get install -y \
    nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}

# the important step: configure WSL for NVIDIA containers by generating the CDI spec (writing to /etc/cdi needs root)
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# now install podman and run containers with CUDA GPU apps
sudo apt-get install -y podman

NVIDIA recommends CDI (Container Device Interface) for podman: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html

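Check the status in WSL:

sudo systemctl status nvidia-cdi-refresh.path

Enable the NVIDIA CDI refresh units in WSL (systemctl only works if systemd is enabled for the distro, via [boot] systemd=true in /etc/wsl.conf):

sudo systemctl enable --now nvidia-cdi-refresh.path
sudo systemctl enable --now nvidia-cdi-refresh.service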

Example podman run command:

podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L

I needed to manually regenerate the CDI spec in WSL:

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

In WSL, this lists my GPU:

nvidia-smi -L
Output: GPU 0: NVIDIA GeForce RTX 5060 Ti (UUID: GPU-0c5b2b4f-7b5e-0e50-b7b6-934235d41d3e)

and the same GPU is visible from inside a container:

podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L
Output: GPU 0: NVIDIA GeForce RTX 5060 Ti (UUID: GPU-0c5b2b4f-7b5e-0e50-b7b6-934235d41d3e)
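Comparing the host and container listings can be automated by reducing nvidia-smi -L output to index/UUID pairs. A sketch, assuming the output format shown above ("GPU <n>: <name> (UUID: GPU-...)"); the function name is hypothetical and indented MIG sub-device lines are deliberately ignored.

```shell
# Sketch: reduce 'nvidia-smi -L' output to "<index> <UUID>" lines, e.g.
#   GPU 0: NVIDIA GeForce RTX 5060 Ti (UUID: GPU-0c5b...)  ->  0 GPU-0c5b...
# Indented MIG sub-device lines do not start with "GPU" and are skipped.
parse_gpu_list() {
  sed -nE 's/^GPU ([0-9]+): .*\(UUID: (GPU-[0-9a-fA-F-]+)\)$/\1 \2/p'
}
```

For example, [ "$(nvidia-smi -L | parse_gpu_list)" = "$(podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L | parse_gpu_list)" ] && echo "container sees the same GPUs" checks that passthrough works.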

View the CDI config file:

cat /etc/cdi/nvidia.yaml

You can manually edit the CDI config file to give the device a name of your choice (only the relevant part is shown):

sudo vi /etc/cdi/nvidia.yaml

cdiVersion: 0.3.0
kind: nvidia.com/gpu
devices:
    - name: gpu0
      containerEdits:
        deviceNodes:
            - path: /dev/dxg
              major: 10
              minor: 125
              fileMode: 438
              permissions: rwm

podman run --rm --device nvidia.com/gpu=gpu0 --security-opt=label=disable ubuntu nvidia-smi -L

So instead of "all" I named the device "gpu0". The name is just a label: when I renamed it "carl", podman run --device nvidia.com/gpu=carl worked fine.
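Since the device name is just a label, the names a CDI spec file defines, qualified by the spec's kind, are exactly what podman's --device option accepts. The real tool for listing them is nvidia-ctk cdi list; the awk sketch below only illustrates the idea and assumes the simple layout shown above (a top-level kind: line and one "- name:" line per device).

```shell
# Sketch: print fully qualified CDI device names (kind=name) from a spec file,
# e.g. "nvidia.com/gpu=gpu0" for the example above. Assumes the only
# "- name:" entries in the file are device names.
cdi_qualified_names() {
  awk '
    /^kind:/             { kind = $2 }
    /^[ \t]*- name:/     { names[++n] = $3 }
    END { for (i = 1; i <= n; i++) print kind "=" names[i] }
  ' "$1"
}
```

For example, podman run --rm --device "$(cdi_qualified_names /etc/cdi/nvidia.yaml | head -n 1)" --security-opt=label=disable ubuntu nvidia-smi -L uses whatever the first device is called.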

Using the NVIDIA CUDA container image in podman:

podman run -it --rm --device nvidia.com/gpu=gpu0 nvcr.io/nvidia/cuda:13.2.0-cudnn-runtime-ubuntu24.04 nvidia-smi -L

If you go back to an /etc/cdi/nvidia.yaml with the device named "all" and run:

podman run -it --rm --device nvidia.com/gpu=all nvcr.io/nvidia/cuda:13.2.0-cudnn-runtime-ubuntu24.04 nvidia-smi

then the container sees the GPUs by index (GPU 0 here), and BOINC or the application can choose which GPU to use with the environment variable CUDA_VISIBLE_DEVICES (e.g. CUDA_VISIBLE_DEVICES=0, where 0 is the index of the GPU to run on):

podman run -e CUDA_VISIBLE_DEVICES=0 -it --rm --device nvidia.com/gpu=all localhost/boinc_cuda /cuda/add_test

podman run -e CUDA_VISIBLE_DEVICES=1 -it --rm --device nvidia.com/gpu=all localhost/boinc_cuda /cuda/add_test
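A sketch of how docker_wrapper might build that command for the GPU the BOINC client assigned. It assumes the GPU index is already known (BOINC reports it to apps, e.g. as gpu_device_num in init_data.xml; how docker_wrapper would pass it on is not decided here). The function name and the PODMAN override are hypothetical; -it is dropped because BOINC runs tasks non-interactively.

```shell
# Hypothetical: run a CUDA container on one GPU chosen by index.
# $1 = GPU index, $2 = image, remaining args = command inside the container.
# PODMAN can be overridden (e.g. PODMAN=echo) to print instead of run.
run_on_gpu() (
  gpu=$1
  image=$2
  case $gpu in
    ''|*[!0-9]*) echo "GPU index must be a non-negative integer: $gpu" >&2; exit 1 ;;
  esac
  shift 2
  # all GPUs are passed in via CDI; CUDA_VISIBLE_DEVICES selects one
  ${PODMAN:-podman} run --rm -e "CUDA_VISIBLE_DEVICES=$gpu" \
    --device nvidia.com/gpu=all "$image" "$@"
)
```

For example, run_on_gpu 1 localhost/boinc_cuda /cuda/add_test is equivalent to the second command above without -it, and PODMAN=echo run_on_gpu 1 localhost/boinc_cuda /cuda/add_test only prints it.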

OpenCL

Discussion

Carl did the above starting with a Debian WSL distro.

Can we use Alpine (with its musl libc)?
If not, is there a thin glibc distro we can use?

There are a lot of steps, and this is only for CUDA. Should we do this configuration:

Ourselves, hardwired into boinc-buda-runner? (downside: it might get big, with stuff that a particular host would never use)
As commands from the BOINC client? (could tailor these based on GPUs present on host).
As a script included in boinc-buda-runner? (e.g. the client would run config_cuda in the distro if an NVIDIA GPU is present).
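For the third option, a script the client may run on every start should be idempotent. A minimal sketch of a hypothetical config_cuda, assuming the toolkit and podman are already in the distro (so only the per-host CDI spec is generated), that the Windows driver is detected by /usr/lib/wsl/lib/libcuda.so.1, and that generating the spec is only needed when it is missing (refreshing it after Windows driver updates is presumably what the nvidia-cdi-refresh units are for). The CDI_SPEC, WSL_LIB and NVIDIA_CTK overrides exist only for testing.

```shell
# Hypothetical config_cuda for boinc-buda-runner (run as root).
# Safe to run repeatedly: it does nothing if there is no NVIDIA driver,
# or if a CDI spec has already been generated.
config_cuda() (
  spec=${CDI_SPEC:-/etc/cdi/nvidia.yaml}
  wsl_lib=${WSL_LIB:-/usr/lib/wsl/lib}
  ctk=${NVIDIA_CTK:-nvidia-ctk}

  # no NVIDIA driver exposed by Windows: nothing to configure
  [ -e "$wsl_lib/libcuda.so.1" ] || exit 0

  # already configured
  [ -s "$spec" ] && exit 0

  mkdir -p "$(dirname "$spec")"
  $ctk cdi generate --output="$spec"
)
```

The client could then call config_cuda unconditionally, or only when it has detected an NVIDIA GPU.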
