Error when setting up Docker image of 3DUnet with GPU support #24

Closed
Eddymorphling opened this issue Nov 21, 2023 · 18 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers)

Comments

@Eddymorphling

Hi,
I just tried running the Docker GUI (following all the instructions) to run the 3DUNET notebook with GPU support, and ended up with the following error when building the Docker image. Any clue what could be wrong? Thank you.

image

@Eddymorphling Eddymorphling added the bug Something isn't working label Nov 21, 2023
@Eddymorphling
Author

Edit: Adding some more context. I have an NVIDIA TITAN X Pascal GPU with the necessary NVIDIA/CUDA/cuDNN drivers set up on a Win10 PC.

@jinxsfe

jinxsfe commented Nov 21, 2023

@esgomezm
[Screenshot 2023-11-21 151908] I am running into the same error. Do you have any ideas about it? I would appreciate any help.

@ctr26
Collaborator

ctr26 commented Nov 21, 2023 via email

@ctr26
Collaborator

ctr26 commented Nov 21, 2023 via email

@Eddymorphling
Author

Eddymorphling commented Nov 21, 2023

docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

This created a new nvidia/cuda Docker image and it detects the GPU, but the 3DUnet-GPU notebook still fails to launch.

I also tried in admin/sudo mode (sudo -E bash launch.sh) and it fails too.
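For anyone repeating the GPU check quoted above, here is a minimal Python sketch that wraps the same docker run ... nvidia-smi command. The helper name gpu_visible_in_docker and the timeout value are assumptions for illustration. It only tells you whether the command exited cleanly; it says nothing about why the notebook image fails.

```python
import shutil
import subprocess

# Same command as quoted in this thread.
SMOKE_TEST_CMD = [
    "docker", "run", "--rm", "--gpus", "all",
    "nvidia/cuda:11.8.0-base-ubuntu22.04", "nvidia-smi",
]


def gpu_visible_in_docker(cmd=SMOKE_TEST_CMD, timeout=600):
    """Hypothetical helper: True/False from the exit code, None if the binary is missing."""
    if shutil.which(cmd[0]) is None:
        # docker (or whatever cmd[0] is) is not on the PATH, so nothing to test.
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0
```

Calling gpu_visible_in_docker() with no arguments runs the check. A True result only confirms that the base CUDA image can see the GPU, which matches what was reported above.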

@Eddymorphling
Author

@ctr26 Is this related?

@ctr26
Collaborator

ctr26 commented Nov 21, 2023

NVIDIA/nvidia-container-toolkit#289

Maybe this is? @IvanHCenalmor

@ctr26
Collaborator

ctr26 commented Nov 21, 2023

If this is the fix specifically on Windows WSL, it'd have to be upstreamed as an actual bug fix.

@ctr26
Collaborator

ctr26 commented Nov 21, 2023

Otherwise if you're feeling brave

https://github.com/HenriquesLab/DL4MicEverywhere/blob/main/Dockerfile

You'd have to add the code to (I'm guessing) line 3 in this file.

@Eddymorphling
Author

Thanks. I will have to wait as I am not sure how to implement the fix.

@IvanHCenalmor
Collaborator

Hi @Eddymorphling and @jinxsfe ,
I will try to replicate this issue on the Windows workstation that we have, and to solve it.

@ctr26 and @Eddymorphling , thanks a lot for providing those links and feedback 🙏 ❤️

In both issues (the one given by @Eddymorphling here and the one given by @ctr26 here or here), they mention that the problem is related to some 'ghost' files that are automatically injected into the container (maybe because of using the default nvidia image mentioned here). The proposed solution is to remove those files inside the container. In fact, the last part of the error that you are getting is ...x86_64-linux-gnu/libnvidia-ml.so.1: file exists: unknown, which is one of the files removed in this proposal.

I will try adding the lines from the proposal to the Dockerfile and check whether the issue can be solved with this or a similar command (one that does not need to remove so many files):

RUN rm -rf \
    /usr/lib/x86_64-linux-gnu/libcuda.so* \
    /usr/lib/x86_64-linux-gnu/libnvcuvid.so* \
    /usr/lib/x86_64-linux-gnu/libnvidia-*.so* \
    /usr/local/cuda/compat/lib/*.515.65.01
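Before and after applying a command like this, it can help to list which of those files are actually present in the container. The following is a minimal sketch using only the Python standard library; the function name find_injected_libs and the root parameter are assumptions for illustration, and the patterns are copied from the proposal above.

```python
import glob
import os

# Glob patterns copied from the RUN rm -rf proposal above, made relative
# so they can be joined to any root directory.
PATTERNS = [
    "usr/lib/x86_64-linux-gnu/libcuda.so*",
    "usr/lib/x86_64-linux-gnu/libnvcuvid.so*",
    "usr/lib/x86_64-linux-gnu/libnvidia-*.so*",
    "usr/local/cuda/compat/lib/*.515.65.01",
]


def find_injected_libs(root="/"):
    """Hypothetical helper: return sorted paths under root matching the patterns."""
    matches = []
    for pattern in PATTERNS:
        matches.extend(glob.glob(os.path.join(root, pattern)))
    return sorted(matches)
```

Running find_injected_libs() inside the container, once before and once after the removal step, shows whether the rm command had any effect.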

They also mention here that installing cuda-toolkit in addition to nvidia-cuda-toolkit might be needed. That is something I also need to check.

In any case, I will try it as soon as I can and share the results with you.

Again thanks a lot for the feedback! 🫶

@IvanHCenalmor
Collaborator

IvanHCenalmor commented Nov 22, 2023

Okay, so it is still not fixed, but I want to share the updates that I have.

I tried to remove the files in /usr/lib/x86_64-linux-gnu, but apparently just adding that command does not remove them, and the error still persists.

I checked a lighter version of the Dockerfile, with just the build from the nvidia image and some minor installations, to see at what point these files appear and whether they could be removed in this lighter Dockerfile. With this new Dockerfile, there was no problem either when building it or when running it. After checking, the files were inside the container, so I assume that those files are not the ones causing the error. Additionally, I have not been able to remove them, so I decided to change strategy.

I continued from this lighter Dockerfile, adding back step by step all the lines that I had commented out, to check which one created the problem. Apparently, the error comes from installing nvidia-cuda-toolkit inside the Dockerfile in this line. The error does not happen when building the image (it builds successfully) but when running it.
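The step-by-step procedure described above is a linear search over the Dockerfile lines. A minimal sketch of that idea follows; first_failing_step is a hypothetical name, and works stands in for "build and run an image from these lines", which is faked here with a simple string check. The example lines are illustrative only and are not the project's Dockerfile.

```python
def first_failing_step(steps, works):
    """Add steps one at a time; return the index of the first step that breaks things.

    `works` is a caller-supplied callback (an assumption of this sketch) that
    receives the list of steps applied so far and returns True if the result
    still builds and runs. Returns None if every prefix works.
    """
    prefix = []
    for index, step in enumerate(steps):
        prefix.append(step)
        if not works(list(prefix)):
            return index
    return None


# Illustrative lines only, not the real Dockerfile.
example_steps = [
    "FROM nvidia/cuda:11.8.0-base-ubuntu22.04",
    "RUN apt-get update",
    "RUN apt-get install -y nvidia-cuda-toolkit",
    "RUN pip install tensorflow",
]


def fake_works(lines):
    # Stand-in for a real build-and-run check: fails once the package appears.
    return not any("nvidia-cuda-toolkit" in line for line in lines)


culprit = first_failing_step(example_steps, fake_works)
```

With a real works callback each call costs a full build and run, so a binary search over prefixes would need fewer builds if the Dockerfile were long.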

Therefore, I tried adding cuda-toolkit alongside nvidia-cuda-toolkit (in the actual Dockerfile with all the code), as suggested here. The error remained, so I tried removing nvidia-cuda-toolkit from the Dockerfile installation, leaving just cuda-toolkit. Removing the installation of the nvidia-cuda-toolkit library (keeping the rest of the code in the Dockerfile) SOLVED the error, BUT without the nvidia-cuda-toolkit library, TensorFlow is not able to use or initialize the GPUs.

Everything in TensorFlow is correctly installed: tf.test.is_built_with_cuda() returns True, but tf.config.list_physical_devices("GPU") returns an empty list []. It also gives warnings explaining the problem:

2023-11-22 14:31:36.102893: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-22 14:31:36.210519: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:10778] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-22 14:31:36.210545: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-22 14:31:36.216230: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1533] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-22 14:31:36.277350: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-22 14:31:36.277826: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.  
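The two TensorFlow checks mentioned above can be collected into one small report. This is a sketch; the function name tensorflow_gpu_report and the dictionary keys are assumptions for illustration, while tf.test.is_built_with_cuda() and tf.config.list_physical_devices("GPU") are the calls quoted in this comment. The import is guarded so the script also runs where TensorFlow is not installed.

```python
def tensorflow_gpu_report():
    """Hypothetical helper: summarize whether TensorFlow can see a GPU."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return {"tensorflow_installed": False}
    gpus = tf.config.list_physical_devices("GPU")
    return {
        "tensorflow_installed": True,
        "version": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "gpus": [device.name for device in gpus],
    }


if __name__ == "__main__":
    print(tensorflow_gpu_report())
```

In the situation described here, the report would show built_with_cuda as True together with an empty gpus list.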

To check whether everything else was correctly installed, I used the Docker image from PyTorch to see if the GPUs were working, and it works without any problem.

So the conclusion so far is that the problem is with the nvidia-cuda-toolkit library: it is the one causing the error, but it is also needed for TensorFlow to be able to use the GPUs. So there is a big conflict here.

Now that I know that the problem is with that library and not with the file itself, I have an idea of what to change to fix it. I think the origin of the problem might be that the paths to that library are not correctly configured or linked, so what I will try now is to check whether the problem is with the PATHs in the Docker image or on the local machine.
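One way to start on that path check is to look for the CUDA driver library in the directories listed in a search-path variable. The sketch below uses only the standard library; find_library is a hypothetical name. LD_LIBRARY_PATH is only one of the places the dynamic linker consults, so an empty result here does not prove the library is unreachable.

```python
import glob
import os


def find_library(name_pattern, search_path):
    """Hypothetical helper: glob for name_pattern in each directory of search_path.

    search_path is an os.pathsep-separated list of directories, for example
    the value of the LD_LIBRARY_PATH environment variable.
    """
    hits = []
    for directory in search_path.split(os.pathsep):
        if directory:  # skip empty entries such as a trailing separator
            hits.extend(glob.glob(os.path.join(directory, name_pattern)))
    return sorted(hits)


if __name__ == "__main__":
    print(find_library("libcuda.so*", os.environ.get("LD_LIBRARY_PATH", "")))
```

Running the same script inside the container and on the host gives two lists that can be compared directly.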

I will keep you informed of the progress, and if you have any ideas or feedback on the progress so far, I will be glad to read them 😄

@ctr26
Collaborator

ctr26 commented Nov 22, 2023

If we use conda to install the cudatoolkit then you won't need to use apt-get to install it.

@ctr26
Collaborator

ctr26 commented Nov 22, 2023

Is this only testable on windows?

@IvanHCenalmor
Collaborator

Yeaaah, on the Linux machine it works perfectly. It is only on Windows that nvidia-cuda-toolkit gives problems.

@IvanHCenalmor
Collaborator

Here I attach the output from a terminal run inside the Docker container that was built with cuda-toolkit and not nvidia-cuda-toolkit. It includes the output of ls /usr/lib/x86_64-linux-gnu and the output of checking the TensorFlow GPU configuration: output.txt

@esgomezm esgomezm added the good first issue Good for newcomers label Nov 22, 2023
@esgomezm
Collaborator

esgomezm commented Feb 9, 2024

Hey @Eddymorphling

We have considerably updated the tool and this issue should be solved. Could you give it a try please?

@IvanHCenalmor
Collaborator

Hi @Eddymorphling ,

We haven't heard back from you in a while, so we’re going to close this issue for now. If you encounter any further issues or have any questions, please feel free to reopen this issue or create a new one. We’re here to help!

Thanks for your understanding,
Iván

5 participants