Error when setting up Docker image of 3DUnet with GPU support #24

Eddymorphling · 2023-11-21T19:02:16Z

Hi,
I just tried running the Docker GUI (followed all instructions) to run the 3DUnet notebook with GPU support and ended up with the following error when building the Docker image. Any clue what could be wrong? Thank you.

Eddymorphling · 2023-11-21T19:54:42Z

Edit: Adding some more context. I have an NVIDIA TITAN X Pascal GPU with the necessary NVIDIA/CUDA/cuDNN drivers setup on a Win10 PC.

jinxsfe · 2023-11-21T20:22:37Z

@esgomezm
I am running into the same error. Do you have any ideas? I would appreciate any help.

ctr26 · 2023-11-21T20:28:18Z

Does this fail?

docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
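For anyone who wants to script this check, here is a minimal Python sketch of the same smoke test. The helper names (build_smoke_test_command, run_smoke_test) are made up for illustration and are not part of DL4MicEverywhere; the image tag is the one from the command above.

```python
import shutil
import subprocess
from typing import List, Optional

# Image tag copied from the command quoted in this thread.
CUDA_IMAGE = "nvidia/cuda:11.8.0-base-ubuntu22.04"


def build_smoke_test_command(image: str = CUDA_IMAGE) -> List[str]:
    """Return the argv for the GPU smoke test (hypothetical helper)."""
    return ["docker", "run", "--rm", "--gpus", "all", image, "nvidia-smi"]


def run_smoke_test(image: str = CUDA_IMAGE) -> Optional[bool]:
    """Run the smoke test.

    Returns True if nvidia-smi exited with status 0 inside the container,
    False if it failed, and None if the docker CLI is not on PATH.
    """
    if shutil.which("docker") is None:
        return None
    result = subprocess.run(
        build_smoke_test_command(image),
        capture_output=True,
        text=True,
    )
    # Show whatever the container printed so the failure mode is visible.
    print(result.stdout or result.stderr)
    return result.returncode == 0
```

Calling run_smoke_test() only tells you whether the NVIDIA runtime can start a plain CUDA container; a True result does not rule out problems in the DL4MicEverywhere image itself.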

ctr26 · 2023-11-21T20:28:42Z

Are you running as admin/sudo?

Eddymorphling · 2023-11-21T21:01:43Z

Running docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi created a new nvidia/cuda Docker image and it detects the GPU, but the 3DUnet-GPU notebook still fails to launch.

I also tried in admin/sudo mode (sudo -E bash launch.sh) and it fails too.

Eddymorphling · 2023-11-21T21:56:43Z

@ctr26 Is this related?

ctr26 · 2023-11-21T22:18:25Z

NVIDIA/nvidia-container-toolkit#289

Maybe this is? @IvanHCenalmor

ctr26 · 2023-11-21T22:49:38Z

If this is the fix specifically on Windows WSL, it'd have to be upstreamed as an actual bug fix.

ctr26 · 2023-11-21T22:51:06Z

Otherwise if you're feeling brave

https://github.com/HenriquesLab/DL4MicEverywhere/blob/main/Dockerfile

You'd have to add the code to (I'm guessing) line 3 in this file.

Eddymorphling · 2023-11-21T22:56:38Z

Thanks. I will have to wait as I am not sure how to implement the fix.

IvanHCenalmor · 2023-11-22T00:07:55Z

Hi @Eddymorphling and @jinxsfe ,
I will try to replicate this issue on the Windows workstation that we have and solve it.

@ctr26 and @Eddymorphling , thanks a lot for providing those links and feedback 🙏 ❤️

Both issues (the one linked by @Eddymorphling here and the one linked by @ctr26 here or here) mention that the problem is related to some 'ghost' files that are automatically injected during containerization (maybe because of using the default nvidia image mentioned here). The proposed solution is to remove those files inside the container. In fact, the last part of the error that you are getting is ...x86_64-linux-gnu/libnvidia-ml.so.1: file exists: unknown, and that is one of the files removed in this proposal.

I will try adding these lines from the proposal to the Dockerfile and check whether the issue can be solved with this or a similar command (one that does not need to remove so many files):

RUN rm -rf \
    /usr/lib/x86_64-linux-gnu/libcuda.so* \
    /usr/lib/x86_64-linux-gnu/libnvcuvid.so* \
    /usr/lib/x86_64-linux-gnu/libnvidia-*.so* \
    /usr/local/cuda/compat/lib/*.515.65.01

They also mention here that installing cuda-toolkit in addition to nvidia-cuda-toolkit might be needed. That is something I also need to check.

In any case, I will try it as soon as I can and report the results here.

Again thanks a lot for the feedback! 🫶

IvanHCenalmor · 2023-11-22T16:25:40Z

Okay, so it is still not fixed, but I want to share the updates that I have.

I tried to remove the files in /usr/lib/x86_64-linux-gnu, but apparently just adding that command does not remove them, and the error still persists.

I tested a lighter version of the Dockerfile, with just the build from the nvidia image and some minor installations, to check at what point these files appear and whether they could be removed in this lighter Dockerfile. With this new Dockerfile, there was no problem either when building it or when running it. After checking, the files were inside the container, so I assume that those files are not the ones causing the error. Additionally, I have not been able to remove them, so I decided to change strategy.
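As a rough way to repeat that check, the following Python sketch lists the files matching the library patterns from the proposed rm -rf command. It is meant to be run inside the container; the function name find_driver_libs is an illustrative assumption, and the version-specific /usr/local/cuda/compat pattern is left out on purpose.

```python
import glob
from typing import Iterable, List

# Glob patterns taken from the proposed `RUN rm -rf` command in this thread.
DRIVER_LIB_PATTERNS = (
    "/usr/lib/x86_64-linux-gnu/libcuda.so*",
    "/usr/lib/x86_64-linux-gnu/libnvcuvid.so*",
    "/usr/lib/x86_64-linux-gnu/libnvidia-*.so*",
)


def find_driver_libs(patterns: Iterable[str] = DRIVER_LIB_PATTERNS) -> List[str]:
    """Return a sorted list of existing files matching any of the patterns."""
    matches = set()
    for pattern in patterns:
        matches.update(glob.glob(pattern))
    return sorted(matches)
```

Running find_driver_libs() once in a container started with --gpus all and once without it would show whether the files are part of the image or only appear at run time, which might explain why removing them at build time has no visible effect.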

I continued from this lighter Dockerfile, adding back, step by step, all the lines that I had commented out, to check which one caused the problem. Apparently, the error comes from installing nvidia-cuda-toolkit inside the Dockerfile in this line. The error does not happen when building the image (it builds successfully) but when running it.

Therefore, I tried adding cuda-toolkit alongside nvidia-cuda-toolkit (in the actual Dockerfile with all the code), as suggested here. The error remained, so I tried removing nvidia-cuda-toolkit from the Dockerfile installation, leaving just cuda-toolkit. Removing the installation of nvidia-cuda-toolkit (keeping the rest of the code in the Dockerfile) SOLVED the error, BUT without nvidia-cuda-toolkit, TensorFlow is not able to use or initialize the GPUs.

Everything in TensorFlow is correctly installed: running tf.test.is_built_with_cuda() returns True, but running tf.config.list_physical_devices("GPU") returns an empty list []. It also prints warnings explaining the problem:

2023-11-22 14:31:36.102893: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-22 14:31:36.210519: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:10778] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-22 14:31:36.210545: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-22 14:31:36.216230: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1533] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-22 14:31:36.277350: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-22 14:31:36.277826: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
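The two TensorFlow checks mentioned above can be bundled into one small report. This is only a sketch: the function name tensorflow_gpu_report and the shape of the returned dictionary are assumptions made for illustration, while tf.test.is_built_with_cuda() and tf.config.list_physical_devices("GPU") are the calls used in this thread.

```python
from typing import Any, Dict


def tensorflow_gpu_report() -> Dict[str, Any]:
    """Summarize whether TensorFlow is installed, built with CUDA, and sees a GPU."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not available in this environment at all.
        return {"installed": False, "built_with_cuda": None, "gpus": []}

    gpus = tf.config.list_physical_devices("GPU")
    return {
        "installed": True,
        # True only means the binary was compiled against CUDA,
        # not that a usable driver was found at run time.
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "gpus": [device.name for device in gpus],
    }
```

In the situation described here, the report would be expected to show built_with_cuda as True together with an empty gpus list.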

To check whether everything else was correctly installed, I used the Docker image from PyTorch to see if the GPUs were working, and it works without any problem.

So, the conclusion so far is that the problem is with the nvidia-cuda-toolkit library: it is the one causing the error, but it is also needed for TensorFlow to be able to use the GPUs. So there is a real conflict here.

Now that I know that the problem is with that library and not with the file itself, I have an idea of what to change to fix this problem. I think the origin of the problem might be that the paths to that library are not correctly configured or linked, so what I will try now is to check whether the problem is with the PATHs in the Docker image or on the local machine.

I will keep you informed of the progress, and if you have any ideas or feedback on the progress so far, I will be glad to read them 😄

ctr26 · 2023-11-22T16:48:34Z

If we use conda to install the cudatoolkit then you won't need to use apt-get to install it.

ctr26 · 2023-11-22T16:48:54Z

Is this only testable on Windows?

IvanHCenalmor · 2023-11-22T18:19:14Z

Yeaaah, on the Linux machine it works perfectly. It is only on Windows that nvidia-cuda-toolkit gives problems.

IvanHCenalmor · 2023-11-22T18:42:18Z

I am attaching the output from a terminal running inside the Docker container that was built with cuda-toolkit and not nvidia-cuda-toolkit. It contains the output of ls /usr/lib/x86_64-linux-gnu and the output of checking the TensorFlow GPU configuration: output.txt

esgomezm · 2024-02-09T19:40:18Z

Hey @Eddymorphling

We have considerably updated the tool and this issue should be solved. Could you give it a try please?

IvanHCenalmor · 2024-08-29T09:24:32Z

Hi @Eddymorphling ,

We haven't heard back from you in a while, so we’re going to close this issue for now. If you encounter any further issues or have any questions, please feel free to reopen this issue or create a new one. We’re here to help!

Thanks for your understanding,
Iván

Eddymorphling added the bug (Something isn't working) label Nov 21, 2023

Eddymorphling assigned IvanHCenalmor Nov 21, 2023

esgomezm added the good first issue (Good for newcomers) label Nov 22, 2023

IvanHCenalmor closed this as completed Aug 29, 2024
