Using docker-compose and cdi to passthrough gpu to container via podman #126

@raldone01

Running docker and podman directly

Works:

  • sudo docker run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L
  • sudo podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L

Does not work:

  • sudo docker run --rm --gpus all ubuntu nvidia-smi -L
  • sudo podman run --rm --gpus all ubuntu nvidia-smi -L

The --gpus all commands fail with the following output:

Error: crun: executable file `nvidia-smi` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found

The NVIDIA device files are also missing.

I have installed nvidia-container-toolkit and run sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml.
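
For anyone reproducing this, a quick sanity check of the CDI setup might look like the sketch below. It only assumes the spec was written to /etc/cdi as in the command above; list_cdi_specs is a hypothetical helper name, not part of any tool. The nvidia-ctk cdi list call is skipped if the CLI is not in PATH.

```shell
#!/bin/sh
# Hypothetical helper: print the CDI spec files (*.yaml, *.yml, *.json)
# found in a directory; prints nothing if there are none.
list_cdi_specs() {
    dir="$1"
    for f in "$dir"/*.yaml "$dir"/*.yml "$dir"/*.json; do
        # An unmatched glob stays literal, so keep only paths that exist.
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# /etc/cdi is the directory the spec was written to in this report.
specs=$(list_cdi_specs /etc/cdi)
if [ -z "$specs" ]; then
    echo "no CDI spec files found in /etc/cdi"
else
    echo "$specs"
fi

# If the toolkit CLI is installed, ask it which CDI device names it resolves.
if command -v nvidia-ctk >/dev/null 2>&1; then
    nvidia-ctk cdi list
else
    echo "nvidia-ctk not found in PATH"
fi
```

If a name such as nvidia.com/gpu=all shows up in that listing, the --device form used in the working commands above should be resolvable by the runtime.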

Running emulated docker-compose

version: '3.8'
services:
  resource_test: # this is not working
    image: ubuntu:20.04
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu, utility, compute]
    tty: true
    stdin_open: true
    command:
      - bash
      - -c
      - |
        nvidia-smi -L
  runtime_test: # this is working
    image: ubuntu:20.04
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    tty: true
    stdin_open: true
    command:
      - bash
      - -c
      - |
        nvidia-smi -L

sudo docker-compose up

[+] Running 2/2
 ✔ Container pytorch_test-resource_test-1  Recreated  0.3s
 ✔ Container pytorch_test-runtime_test-1   Recreated  0.3s
Attaching to pytorch_test-resource_test-1, pytorch_test-runtime_test-1
pytorch_test-resource_test-1  | bash: nvidia-smi: command not found
pytorch_test-runtime_test-1   | GPU 0: NVIDIA T600 (UUID: GPU-XXXXX)
pytorch_test-resource_test-1 exited with code 127
pytorch_test-runtime_test-1 exited with code 0

I have no idea what is wrong and would appreciate any advice.
