vulkan is broken #140

Closed

qhaas opened this issue Apr 1, 2021 · 8 comments

qhaas commented Apr 1, 2021

1. Issue or feature description

Vulkan appears to be broken on Enterprise Linux 8.3 x86-64 hosts. It has worked for us before; I am not sure what changed or when. OpenGL appears to be working fine: the X window appears and renders such applications' GUIs as expected. Vulkan applications launch, the window briefly appears, then they segfault. I doubt it is a bug with the NVIDIA Vulkan container image itself, since I can run OpenGL/Vulkan applications after converting it into a Singularity container.

2. Steps to reproduce the issue

  1. This BASH function is used to facilitate viewing the X window on the host from the container:
# Based on: http://wiki.ros.org/docker/Tutorials/GUI
xForwardDockerRunArgs() {
 # Copy the host's X authentication cookies into a temp file the container can read,
 # then emit the docker run arguments that share the X socket and credentials.
 XAUTH=`mktemp`
 XSOCK='/tmp/.X11-unix'
 xauth nlist ${DISPLAY} | sed -e 's/^..../ffff/' | xauth -f ${XAUTH} nmerge -
 echo "-v ${XSOCK}:${XSOCK}:rw -v ${XAUTH}:${XAUTH}:rw -e XAUTHORITY=${XAUTH} -e DISPLAY=${DISPLAY}"
}
  2. Launch the NVIDIA Vulkan container with docker run --net=host --rm -it $(xForwardDockerRunArgs) --gpus=all nvidia/vulkan:1.1.121-cuda-10.1-beta.1-ubuntu18.04 (the fully expanded command is sketched after this list)
  3. Install glxgears and vulkan-smoketest inside the container with apt-get update && apt-get install -y vulkan-utils mesa-utils
  4. Verify that glxgears opens an X window with spinning gears and uses the GPU: run glxgears in the container and, after the window appears, run nvidia-smi on the host. This confirms that X forwarding to the host works and that OpenGL is using the GPU.
  5. Run vulkan-smoketest and watch a black window briefly appear and disappear, with Segmentation fault (core dumped) in the container's terminal. dmesg | tail on the host reports something like segfault at 0 ip ... sp ... error 4 in vulkan-smoketest...
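
For reference, a sketch of what step 2 expands to once xForwardDockerRunArgs runs; the /tmp/tmp.XXXXXXXXXX path stands in for whatever mktemp returns on the host:

# Sketch of the expanded launch command (xauth temp file name is illustrative)
docker run --net=host --rm -it \
    -v /tmp/.X11-unix:/tmp/.X11-unix:rw \
    -v /tmp/tmp.XXXXXXXXXX:/tmp/tmp.XXXXXXXXXX:rw \
    -e XAUTHORITY=/tmp/tmp.XXXXXXXXXX \
    -e DISPLAY=$DISPLAY \
    --gpus=all \
    nvidia/vulkan:1.1.121-cuda-10.1-beta.1-ubuntu18.04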

As a sanity check with Singularity 3.7, convert the same image to a Singularity image and run it there; Vulkan works fine:

$ cat vulkan.def 
Bootstrap: docker
From: nvidia/vulkan:1.1.121-cuda-10.1-beta.1-ubuntu18.04

%post
	apt-get update && apt-get install -y mesa-utils vulkan-utils && apt-get clean
$ singularity build --fakeroot vulkan.sif vulkan.def
...
$ singularity exec --nv vulkan.sif vulkan-smoketest

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info:
    nvidia-container-cli.txt

  • Kernel version from uname -a: Linux fedorarouge 4.18.0-240.15.1.el8_3.x86_64 #1 SMP Wed Feb 3 03:12:15 EST 2021 x86_64 x86_64 x86_64 GNU/Linux

  • Any relevant kernel output lines from dmesg

    [ 6991.274805] vulkan-smoketes[60028]: segfault at 0 ip 00005588fe864b7e sp 00007ffd3c596460 error 4 in vulkan-smoketest[5588fe84e000+2c000]
    [ 6991.274810] Code: 29 c8 48 c1 f8 03 48 39 c6 77 61 73 09 48 8d 04 f1 48 89 44 24 18 4c 89 ea 4c 89 e6 48 89 ef ff 15 77 57 21 00 48 8b 7c 24 10 <48> 8b 07 48 c7 83 f8 00 00 00 00 00 00 00 48 c7 83 00 01 00 00 ff
    
  • Driver information from nvidia-smi -a:
    nvidia-smi.txt

  • Docker version from docker version
    docker.txt

  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
    nvidia_rpm.txt

  • NVIDIA container library version from nvidia-container-cli -V:
    nvidia_container_cli.txt

  • NVIDIA container library logs (see troubleshooting): nvidia-container-toolkit.txt

  • Docker command, image and tag used: docker run --net=host --rm -it $(xForwardDockerRunArgs) --gpus=all nvidia/vulkan:1.1.121-cuda-10.1-beta.1-ubuntu18.04

qhaas (Author) commented Apr 6, 2021

Also tested on a PopOS 18.04 (Ubuntu 18.04 based) system with a Pascal GeForce, NVIDIA driver 460.67, nvidia-container 1.3.3, and Docker CE 20.10.5... same error.

klueska (Contributor) commented Apr 13, 2021

Could this possibly be related to:
NVIDIA/libnvidia-container#134

qhaas (Author) commented Apr 13, 2021

Could this possibly be related to:
NVIDIA/libnvidia-container#134

Thanks; that is unlikely to be the same issue, since vulkan-smoketest works fine outside of containers (i.e. bare metal) and also from inside a Singularity container. The tested systems had only one GPU.

diadatp commented Sep 7, 2021

@qhaas Did you figure out a workaround?

denwi248 commented Nov 9, 2021

While granting access to the XAUTHORITY file and the X socket as you describe works for OpenGL, it somehow does not work for Vulkan. You have to enable the display option in NVIDIA_DRIVER_CAPABILITIES, which is not set in the nvidia/vulkan base image (see driver-capabilities in the container toolkit documentation).

So instead of --gpus=all use --gpus='all,"capabilities=compute,utility,graphics,display"' --env DISPLAY=$DISPLAY.
This cost me days to find out 😞
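
A sketch of the full launch command with this workaround applied, reusing the xForwardDockerRunArgs helper from the original report (the helper already passes DISPLAY, so the extra --env is optional):

# Sketch: original repro command with the display capability enabled
docker run --net=host --rm -it \
    $(xForwardDockerRunArgs) \
    --gpus='all,"capabilities=compute,utility,graphics,display"' \
    nvidia/vulkan:1.1.121-cuda-10.1-beta.1-ubuntu18.04

With the display capability enabled, vulkan-smoketest inside the container should render instead of segfaulting, per this workaround.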

flxai commented Oct 30, 2023

@elezar Can you please add some information on how this issue was solved?

elezar (Member) commented Oct 30, 2023

@flxai I closed this issue due to inactivity. We are consolidating our repos and this includes where issues are handled.

Since this has not been addressed, I will reopen and transfer to the https://github.com/NVIDIA/nvidia-container-toolkit repo instead.

elezar reopened this Oct 30, 2023
elezar transferred this issue from NVIDIA/nvidia-docker Oct 30, 2023

elezar (Member) commented Feb 5, 2024

@flxai I have had a chance to confirm the instructions for running a vulkan container image.

Assuming the following Dockerfile:

cat << EOF > Dockerfile
FROM ubuntu

RUN apt-get update && apt-get install -y --no-install-recommends \
    libglvnd0 \
    libgl1 \
    libglx0 \
    libegl1  \
    libgles2  \
    libxcb1-dev \
    wget \
    vulkan-tools \
    && rm -rf /var/lib/apt/lists/*

CMD ["vulkaninfo"]
EOF

Building it:

docker build -t vulkantest .

And running the image:

docker run --rm -ti --runtime=nvidia \
    -e NVIDIA_VISIBLE_DEVICES=all \
    -e NVIDIA_DRIVER_CAPABILITIES=all \
        vulkantest | grep -B5 -A5 NVIDIA

This should output the same information as vulkaninfo on the host.
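
A quick way to compare the two, assuming vulkaninfo is also installed on the host and reports a deviceName line for the GPU (as standard vulkaninfo output does):

# On the host:
vulkaninfo | grep -m1 deviceName

# In the container (should report the same NVIDIA GPU):
docker run --rm --runtime=nvidia \
    -e NVIDIA_VISIBLE_DEVICES=all \
    -e NVIDIA_DRIVER_CAPABILITIES=all \
    vulkantest vulkaninfo | grep -m1 deviceName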

This was tested with v1.14.4 of the NVIDIA Container Toolkit. If this does not work as expected, please create a new issue against this repo.

elezar closed this as completed Feb 5, 2024