
libnvidia-ml.so.1: file exists: unknown and nvidia-opencl-dev #371

Closed
romainGuiet opened this issue Feb 19, 2024 · 8 comments

@romainGuiet

Dear team, Dear community,

I'm building an image on my Windows machine (Win11 23H2) using WSL ( Ubuntu 22.04.2 LTS (GNU/Linux 5.15.133.1-microsoft-standard-WSL2 x86_64) ).

Ideally, the image can be started either locally or on a cluster to take advantage of bigger GPUs (see the biop-desktop documentation for more information about the image).

My issue is that the images I build can no longer be started on my Windows machine; they fail with the error:

>docker run -it --rm -p 8888:8888 --gpus device=0 biop/biop-fiji:20240205
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/6011c57d3118616b019e01106490fff1f1683485a0402225199f794a3184f4bb/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file exists: unknown.

The older image ( biop/biop-fiji:20231129 ) starts fine, but whenever I build a new image, the build succeeds and yet the image won't start on my Windows machine (for example biop/biop-fiji:20240205, which produces the error above).

I'm aware that some functionality (OpenCL, for example) isn't accessible via WSL, but until now the image would still start, so I could use the features that do work.

If I run docker run -it --rm -p 8888:8888 biop/biop-fiji:20240205 (without --gpus), it starts, BUT then I can no longer use GPU processing...
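
For reference, a quick sanity check that GPU passthrough itself still works on this machine, using the older tag (this assumes the NVIDIA runtime mounts nvidia-smi into the container; the jupyter entrypoint is overridden so the command runs directly):

>docker run --rm --gpus device=0 --entrypoint nvidia-smi biop/biop-fiji:20231129

If that prints the usual GPU table, passthrough itself is fine and the problem is specific to the newer images.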

The issue seems to come from the installation of nvidia-opencl-dev :

RUN apt-get update \
    && apt-get install -y nvidia-opencl-dev  \
    && apt remove -y libnvidia-compute-535-server \
    && apt-get autoremove --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && find /var/log -type f -exec cp /dev/null \{\} \;

I also tried pinning a specific version:

RUN apt-get update \
    && apt-get install -y nvidia-opencl-dev=11.5.1-1ubuntu1  \
    && apt remove -y libnvidia-compute-535-server \
    && apt-get autoremove --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && find /var/log -type f -exec cp /dev/null \{\} \;

but without much success!

A minimal Dockerfile that builds AND starts:

# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-07.html 
# FROM nvcr.io/nvidia/pytorch:23.07-py3
ARG BASE_IMAGE=0.0.3
ARG ALIAS=biop/
FROM ${ALIAS}biop-vnc-base:${BASE_IMAGE}

USER root

RUN apt-get update -y \
    && apt upgrade -y \
    && apt-get install -y wget git unzip \
    && apt-get autoremove --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && find /var/log -type f -exec cp /dev/null \{\} \;


RUN chown -R biop:biop /home/biop/ \
    && chmod -R a+rwx /home/biop/   


#################################################################
# Container start
USER biop
WORKDIR /home/biop
ENTRYPOINT ["/usr/local/bin/jupyter"]
CMD ["lab", "--allow-root", "--ip=*", "--port=8888", "--no-browser", "--NotebookApp.token=''", "--NotebookApp.allow_origin='*'", "--notebook-dir=/home/biop"]

A minimal Dockerfile that builds but does NOT start:

# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-07.html 
# FROM nvcr.io/nvidia/pytorch:23.07-py3
ARG BASE_IMAGE=0.0.3
ARG ALIAS=biop/
FROM ${ALIAS}biop-vnc-base:${BASE_IMAGE}

USER root

RUN apt-get update -y \
    && apt upgrade -y \
    && apt-get install -y wget git unzip \
    && apt-get autoremove --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && find /var/log -type f -exec cp /dev/null \{\} \;

RUN apt-get update \
    && apt-get install -y nvidia-opencl-dev=11.5.1-1ubuntu1  \
    && apt remove -y libnvidia-compute-535-server \
    && apt-get autoremove --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && find /var/log -type f -exec cp /dev/null \{\} \;

RUN chown -R biop:biop /home/biop/ \
    && chmod -R a+rwx /home/biop/   


#################################################################
# Container start
USER biop
WORKDIR /home/biop
ENTRYPOINT ["/usr/local/bin/jupyter"]
CMD ["lab", "--allow-root", "--ip=*", "--port=8888", "--no-browser", "--NotebookApp.token=''", "--NotebookApp.allow_origin='*'", "--notebook-dir=/home/biop"]

I'm looking forward to any suggestions you may have, and I'll be happy to provide more information if necessary...

Best regards,

Romain

@romainGuiet
Author

A simpler minimal non-starting example:

# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-07.html 
FROM nvcr.io/nvidia/pytorch:23.07-py3

RUN apt-get update \
    && apt-get install -y nvidia-opencl-dev \
    && apt remove -y libnvidia-compute-535-server \
    && apt-get autoremove --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && find /var/log -type f -exec cp /dev/null \{\} \;

ENTRYPOINT ["/usr/local/bin/jupyter"]
CMD ["lab", "--allow-root", "--ip=*", "--port=8888", "--no-browser", "--NotebookApp.token=''", "--NotebookApp.allow_origin='*'", "--notebook-dir=/home/biop"]

@elezar
Member

elezar commented Feb 28, 2024

The issue with the existing file in the image is that the NVIDIA runtime is set as the default runtime, meaning that the .so.1 symlinks are generated at build time and are present in the image.
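
For reference, whether the NVIDIA runtime is the default on the build machine can be checked with something like the following (standard daemon config location assumed; Docker Desktop keeps this in its own settings instead):

$ docker info | grep -i "default runtime"
$ cat /etc/docker/daemon.json

If the config contains "default-runtime": "nvidia", the build steps also run through the NVIDIA runtime hooks, which is what injects the .so.1 symlinks into the image layers.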

Does one of the packages in your minimal example install the libnvidia-ml.so.1 file, or is it present in the base image?
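
One way to check, as a sketch (running under plain runc so that nothing is injected at container start; <your-built-image> is a placeholder):

# is the file already in the base image?
$ docker run --rm --runtime=runc nvcr.io/nvidia/pytorch:23.07-py3 ls -la /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1

# in the built image, does any package claim ownership of it?
$ docker run --rm --runtime=runc <your-built-image> dpkg -S /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1

If dpkg -S reports no matching package, the file was most likely created by the runtime hook at build time rather than installed by apt.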

@romainGuiet
Author

romainGuiet commented Feb 29, 2024

Hi @elezar ,
thank you for your suggestion. I just checked, and the libnvidia-ml.so.1 file is present in the base image nvcr.io/nvidia/pytorch:23.07-py3; in the output of an ls command it appears in green (executable):

[screenshot: ls output in the base image, libnvidia-ml.so.1 shown in green (executable)]

After installing nvidia-opencl-dev, with

RUN apt-get update \
    && apt-get install -y nvidia-opencl-dev \
    && apt remove -y libnvidia-compute-535-server \
    && apt-get autoremove --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && find /var/log -type f -exec cp /dev/null \{\} \;

it appears in cyan:
[screenshot: ls output after installing nvidia-opencl-dev, libnvidia-ml.so.1 now shown in cyan (symbolic link)]

Can I prevent apt-get install from turning the file from green (executable) into cyan (symbolic link)? Or, alternatively, can I install something else so that it works?
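
For reference, colour aside, the difference can also be checked explicitly from a shell inside the container, e.g.:

$ ls -la /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
$ file /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1

(a symbolic link shows up as "-> target" in ls -la, and file reports "symbolic link to ...")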

Thank you again,

R

@elezar
Member

elezar commented Feb 29, 2024

It seems as if one of those packages is installing the CUDA driver in the container. Does nvidia-opencl-dev depend on the driver at all? (What provides the *.545.29.06 files in this case?)
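
Both can be checked from a shell inside the built image with standard apt/dpkg queries, for example (after an apt-get update, since the Dockerfile cleans the apt lists):

# does nvidia-opencl-dev pull in driver libraries?
$ apt-cache depends nvidia-opencl-dev

# which installed package ships the versioned driver files?
$ dpkg -S 545.29.06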

@elezar
Member

elezar commented Feb 29, 2024

Note that when I checked the image locally, I could not find libnvidia-ml.so:

$ docker run --rm -ti --runtime=runc nvcr.io/nvidia/pytorch:23.07-py3 find / | grep libnvidia
/usr/local/cuda-12.1/compat/lib.real/libnvidia-nvvm.so
/usr/local/cuda-12.1/compat/lib.real/libnvidia-ptxjitcompiler.so.530.30.02
/usr/local/cuda-12.1/compat/lib.real/libnvidia-nvvm.so.530.30.02
/usr/local/cuda-12.1/compat/lib.real/libnvidia-nvvm.so.4
/usr/local/cuda-12.1/compat/lib.real/libnvidia-ptxjitcompiler.so.1
/usr/local/cuda-12.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so

This was for:

$ docker manifest inspect nvcr.io/nvidia/pytorch:23.07-py3
{
   "schemaVersion": 2,
   "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
   "manifests": [
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 10194,
         "digest": "sha256:60b2fa36b72f08a63b3778cc657b4b168e931a9864bdf3f6fc6b50102045b913",
         "platform": {
            "architecture": "amd64",
            "os": "linux"
         }
      },
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 9995,
         "digest": "sha256:1cc09bbcb94d34a1cc6279e7a753f257a7c5f08d73e3714a56f678de04c23785",
         "platform": {
            "architecture": "arm64",
            "os": "linux"
         }
      }
   ]
}

@romainGuiet
Author

romainGuiet commented Feb 29, 2024

Does nvidia-opencl-dev depend on the driver at all?

How can I check?

(What provides the *.545.29.06 files in this case?)

I'm not sure what I can do to check this either ...

Note that when I checked the image locally, I could not find libnvidia-ml.so

Me neither! I had to manually cd to /usr/lib/x86_64-linux-gnu/, then ls and look for the file libnvidia-ml.so.1, which was there (screenshots above).
Using find always returns find: ‘libnvidia-ml.so.1’: No such file or directory.
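
For the record, find expects a starting directory as its first argument, so a search by name would be, for example:

$ find / -name 'libnvidia-ml.so*' 2>/dev/null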

Thank you again for your help , much appreciated!

@romainGuiet
Author

romainGuiet commented Mar 15, 2024

Hi @elezar ,

Thank you for your previous suggestions, I'm still unsure what I can try next to solve the problem...

I found this "old" thread and this post.

So I tested the image on Linux (a Kubernetes grid) and it started and worked, so the issue seems to be specific to Windows / WSL2 / Docker Desktop.

Thank you again for your help,

R

@romainGuiet
Author

romainGuiet commented Mar 26, 2024

From a bash shell inside the container, running dpkg -l '*nvidia*' gave different results:

  • in nvcr.io/nvidia/pytorch:23.07-py3 : dpkg-query: no packages found matching *nvidia*
  • in the minimal non-starting example :

Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===========================-==========================-============-=================================
un libgldispatch0-nvidia (no description available)
un libnvidia-compute (no description available)
ii libnvidia-compute-545:amd64 545.29.06-0ubuntu0.22.04.2 amd64 NVIDIA libcompute package
un libnvidia-ml.so.1 (no description available)
un nvidia-libopencl1 (no description available)
un nvidia-libopencl1-dev (no description available)
ii nvidia-opencl-dev:amd64 11.5.1-1ubuntu1 amd64 NVIDIA OpenCL development files
un nvidia-opencl-icd (no description available)

I noticed the libnvidia-compute-545 package, so I modified the code from:

RUN apt-get update \
    && apt-get install -y nvidia-opencl-dev \
    && apt remove -y libnvidia-compute-535-server \
    && apt-get autoremove --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && find /var/log -type f -exec cp /dev/null \{\} \;

to:

RUN apt-get update \
    && apt-get install -y nvidia-opencl-dev \
    && apt remove -y libnvidia-compute-545 \
    && apt-get autoremove --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* \
    && find /var/log -type f -exec cp /dev/null \{\} \;

and now it works on both Windows and Linux.
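
For anyone hitting the same issue, a quick check that the rebuilt image no longer ships its own copy of the library (the tag is a placeholder, and the jupyter entrypoint is overridden):

>docker run --rm --runtime=runc --entrypoint ls <rebuilt-image> -la /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1

If this reports "No such file or directory", the NVIDIA runtime is free to create the symlink itself when the container is started with --gpus.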

Thank you again @elezar for the suggestions,

Best

R
