
libnvidia-ml.so.1: file exists: unknown and nvidia-opencl-dev #371

Closed
romainGuiet opened this issue Feb 19, 2024 · 8 comments

@romainGuiet

Dear team, Dear community,

I'm building an image on my Windows machine (Win11 23H2) using WSL ( Ubuntu 22.04.2 LTS (GNU/Linux 5.15.133.1-microsoft-standard-WSL2 x86_64) ).

Ideally, the image can be started either locally or on a cluster to take advantage of bigger GPUs (see the biop-desktop documentation for more information about the image).

My issue is that the images I build can no longer be started on my Windows machine; they fail with the error:

>docker run -it --rm -p 8888:8888 --gpus device=0 biop/biop-fiji:20240205
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/6011c57d3118616b019e01106490fff1f1683485a0402225199f794a3184f4bb/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file exists: unknown.

The older image ( biop/biop-fiji:20231129 ) starts fine, but whenever I build a new image, the build succeeds and yet the image won't start on my Windows machine (for example biop/biop-fiji:20240205, which produces the error above).

I'm aware that some functionality (OpenCL, for example) isn't accessible via WSL, but until now the image would still start, so I could use the features that do work.

If I run docker run -it --rm -p 8888:8888 biop/biop-fiji:20240205 (without --gpus), it starts, BUT then I can no longer use GPU processing...
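
For reference, a quick sanity check that GPU passthrough itself still works on this machine, using the older tag (this assumes the NVIDIA runtime mounts nvidia-smi into the container; the jupyter entrypoint is overridden so the command runs directly):

>docker run --rm --gpus device=0 --entrypoint nvidia-smi biop/biop-fiji:20231129

If that prints the usual GPU table, passthrough itself is fine and the problem is specific to the newer images.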

The issue seems to come from the installation of nvidia-opencl-dev :

RUN apt-get update \
    && apt-get install -y nvidia-opencl-dev  \
    && apt remove -y libnvidia-compute-535-server \
    && apt-get autoremove --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && find /var/log -type f -exec cp /dev/null \{\} \;

I also tried pinning a specific version:

RUN apt-get update \
    && apt-get install -y nvidia-opencl-dev=11.5.1-1ubuntu1  \
    && apt remove -y libnvidia-compute-535-server \
    && apt-get autoremove --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && find /var/log -type f -exec cp /dev/null \{\} \;

but without much success!

A minimal Dockerfile that builds AND starts:

# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-07.html 
# FROM nvcr.io/nvidia/pytorch:23.07-py3
ARG BASE_IMAGE=0.0.3
ARG ALIAS=biop/
FROM ${ALIAS}biop-vnc-base:${BASE_IMAGE}

USER root

RUN apt-get update -y \
    && apt upgrade -y \
    && apt-get install -y wget git unzip \
    && apt-get autoremove --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && find /var/log -type f -exec cp /dev/null \{\} \;


RUN chown -R biop:biop /home/biop/ \
    && chmod -R a+rwx /home/biop/   


#################################################################
# Container start
USER biop
WORKDIR /home/biop
ENTRYPOINT ["/usr/local/bin/jupyter"]
CMD ["lab", "--allow-root", "--ip=*", "--port=8888", "--no-browser", "--NotebookApp.token=''", "--NotebookApp.allow_origin='*'", "--notebook-dir=/home/biop"]

A minimal Dockerfile that builds but does NOT start:

# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-07.html 
# FROM nvcr.io/nvidia/pytorch:23.07-py3
ARG BASE_IMAGE=0.0.3
ARG ALIAS=biop/
FROM ${ALIAS}biop-vnc-base:${BASE_IMAGE}

USER root

RUN apt-get update -y \
    && apt upgrade -y \
    && apt-get install -y wget git unzip \
    && apt-get autoremove --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && find /var/log -type f -exec cp /dev/null \{\} \;

RUN apt-get update \
    && apt-get install -y nvidia-opencl-dev=11.5.1-1ubuntu1  \
    && apt remove -y libnvidia-compute-535-server \
    && apt-get autoremove --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && find /var/log -type f -exec cp /dev/null \{\} \;

RUN chown -R biop:biop /home/biop/ \
    && chmod -R a+rwx /home/biop/   


#################################################################
# Container start
USER biop
WORKDIR /home/biop
ENTRYPOINT ["/usr/local/bin/jupyter"]
CMD ["lab", "--allow-root", "--ip=*", "--port=8888", "--no-browser", "--NotebookApp.token=''", "--NotebookApp.allow_origin='*'", "--notebook-dir=/home/biop"]

I'm looking forward to any suggestions you may have, and I'll be happy to provide more information if necessary...

Best regards,

Romain

@romainGuiet
Author

A simpler minimal non-starting example:

# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-07.html 
FROM nvcr.io/nvidia/pytorch:23.07-py3

RUN apt-get update \
    && apt-get install -y nvidia-opencl-dev \
    && apt remove -y libnvidia-compute-535-server \
    && apt-get autoremove --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && find /var/log -type f -exec cp /dev/null \{\} \;

ENTRYPOINT ["/usr/local/bin/jupyter"]
CMD ["lab", "--allow-root", "--ip=*", "--port=8888", "--no-browser", "--NotebookApp.token=''", "--NotebookApp.allow_origin='*'", "--notebook-dir=/home/biop"]

@elezar
Member

elezar commented Feb 28, 2024

The issue with the existing file in the image is that the NVIDIA runtime is set as the default runtime, meaning that the .so.1 symlinks are generated at build time and are present in the image.
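
For reference, whether the NVIDIA runtime is the default on the build machine can be checked with something like the following (standard daemon config location assumed; Docker Desktop keeps this in its own settings instead):

$ docker info | grep -i "default runtime"
$ cat /etc/docker/daemon.json

If the config contains "default-runtime": "nvidia", the build steps also run through the NVIDIA runtime hooks, which is what injects the .so.1 symlinks into the image layers.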

Does one of the packages in your minimal example install the libnvidia-ml.so.1 file, or is it present in the base image?
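
One way to check, as a sketch (running under plain runc so that nothing is injected at container start; <your-built-image> is a placeholder):

# is the file already in the base image?
$ docker run --rm --runtime=runc nvcr.io/nvidia/pytorch:23.07-py3 ls -la /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1

# in the built image, does any package claim ownership of it?
$ docker run --rm --runtime=runc <your-built-image> dpkg -S /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1

If dpkg -S reports no matching package, the file was most likely created by the runtime hook at build time rather than installed by apt.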

@romainGuiet
Author

romainGuiet commented Feb 29, 2024

Hi @elezar ,
thank you for your suggestion. I just checked, and the libnvidia-ml.so.1 file is present in the base image nvcr.io/nvidia/pytorch:23.07-py3; in the output of an ls command it appears in green (executable):

[screenshot: ls output in the base image, libnvidia-ml.so.1 shown in green (executable)]

After installing nvidia-opencl-dev, with

RUN apt-get update \
    && apt-get install -y nvidia-opencl-dev \
    && apt remove -y libnvidia-compute-535-server \
    && apt-get autoremove --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && find /var/log -type f -exec cp /dev/null \{\} \;

it appears in cyan:
[screenshot: ls output after installing nvidia-opencl-dev, libnvidia-ml.so.1 now shown in cyan (symbolic link)]

Can I prevent apt-get install from turning the file from green (executable) into cyan (symbolic link)? Or, alternatively, can I install something else so that it works?
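
For reference, colour aside, the difference can also be checked explicitly from a shell inside the container, e.g.:

$ ls -la /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
$ file /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1

(a symbolic link shows up as "-> target" in ls -la, and file reports "symbolic link to ...")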

Thank you again,

R

@elezar
Member

elezar commented Feb 29, 2024

It seems as if one of those packages is installing the CUDA driver in the container. Does nvidia-opencl-dev depend on the driver at all? (What provides the *.545.29.06 files in this case?)
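
Both can be checked from a shell inside the built image with standard apt/dpkg queries, for example (after an apt-get update, since the Dockerfile cleans the apt lists):

# does nvidia-opencl-dev pull in driver libraries?
$ apt-cache depends nvidia-opencl-dev

# which installed package ships the versioned driver files?
$ dpkg -S 545.29.06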

@elezar
Member

elezar commented Feb 29, 2024

Note that when I checked the image locally, I could not find libnvidia-ml.so:

$ docker run --rm -ti --runtime=runc nvcr.io/nvidia/pytorch:23.07-py3 find / | grep libnvidia
/usr/local/cuda-12.1/compat/lib.real/libnvidia-nvvm.so
/usr/local/cuda-12.1/compat/lib.real/libnvidia-ptxjitcompiler.so.530.30.02
/usr/local/cuda-12.1/compat/lib.real/libnvidia-nvvm.so.530.30.02
/usr/local/cuda-12.1/compat/lib.real/libnvidia-nvvm.so.4
/usr/local/cuda-12.1/compat/lib.real/libnvidia-ptxjitcompiler.so.1
/usr/local/cuda-12.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so

This was for:

$ docker manifest inspect nvcr.io/nvidia/pytorch:23.07-py3
{
   "schemaVersion": 2,
   "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
   "manifests": [
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 10194,
         "digest": "sha256:60b2fa36b72f08a63b3778cc657b4b168e931a9864bdf3f6fc6b50102045b913",
         "platform": {
            "architecture": "amd64",
            "os": "linux"
         }
      },
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 9995,
         "digest": "sha256:1cc09bbcb94d34a1cc6279e7a753f257a7c5f08d73e3714a56f678de04c23785",
         "platform": {
            "architecture": "arm64",
            "os": "linux"
         }
      }
   ]
}

@romainGuiet
Author

romainGuiet commented Feb 29, 2024

Does nvidia-opencl-dev depend on the driver at all?

How can I check?

(What provides the *.545.29.06 files in this case?)

I'm not sure what I can do to check this either ...

Note that when I checked the image locally, I could not find libnvidia-ml.so

Me neither! I had to manually cd to /usr/lib/x86_64-linux-gnu/, then ls and look for the file libnvidia-ml.so.1, which was there (screenshots above).
Using find always returns find: ‘libnvidia-ml.so.1’: No such file or directory.
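
For the record, find expects a starting directory as its first argument, so a search by name would be, for example:

$ find / -name 'libnvidia-ml.so*' 2>/dev/null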

Thank you again for your help , much appreciated!

@romainGuiet
Author

romainGuiet commented Mar 15, 2024

Hi @elezar ,

Thank you for your previous suggestions, I'm still unsure what I can try next to solve the problem...

I found this "old" thread and this post.

So I tested the image on Linux (a Kubernetes grid) and it started and worked, so the issue seems to be specific to Windows / WSL2 / Docker Desktop.

Thank you again for your help,

R

@romainGuiet
Author

romainGuiet commented Mar 26, 2024

From a bash shell inside the container, running dpkg -l '*nvidia*' gave different results:

  • in nvcr.io/nvidia/pytorch:23.07-py3 : dpkg-query: no packages found matching *nvidia*
  • in the minimal non-starting example :

Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===========================-==========================-============-=================================
un libgldispatch0-nvidia (no description available)
un libnvidia-compute (no description available)
ii libnvidia-compute-545:amd64 545.29.06-0ubuntu0.22.04.2 amd64 NVIDIA libcompute package
un libnvidia-ml.so.1 (no description available)
un nvidia-libopencl1 (no description available)
un nvidia-libopencl1-dev (no description available)
ii nvidia-opencl-dev:amd64 11.5.1-1ubuntu1 amd64 NVIDIA OpenCL development files
un nvidia-opencl-icd (no description available)

I noticed the libnvidia-compute-545 package, so I modified the code from:

RUN apt-get update \
    && apt-get install -y nvidia-opencl-dev \
    && apt remove -y libnvidia-compute-535-server \
    && apt-get autoremove --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && find /var/log -type f -exec cp /dev/null \{\} \;

to:

RUN apt-get update \
    && apt-get install -y nvidia-opencl-dev \
    && apt remove -y libnvidia-compute-545 \
    && apt-get autoremove --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && find /var/log -type f -exec cp /dev/null \{\} \;

and now it works on both Windows and Linux.
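
For anyone hitting the same issue, a quick check that the rebuilt image no longer ships its own copy of the library (the tag is a placeholder, and the jupyter entrypoint is overridden):

>docker run --rm --runtime=runc --entrypoint ls <rebuilt-image> -la /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1

If this reports "No such file or directory", the NVIDIA runtime is free to create the symlink itself when the container is started with --gpus.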

Thank you again @elezar for the suggestions,

Best

R
