Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update dockerfile.gpu #6452

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

NisuSan
Copy link

@NisuSan NisuSan commented May 9, 2024

Create blank file.

Create blank file
@NisuSan
Copy link
Author

NisuSan commented May 9, 2024

It seems like I`ve found the solution. So, there are questions starts:

  1. Can I use own doker-image with preinstaled drivers for spead up the entire initialization process? If use nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 image as starting point, there is a 10-20 minutes spend only for installing new drivers.
  2. The same about LightGBM itself. Instalation is quite time costly, so I can make precompiled image with it.
  3. The whole resulted image have size about 10GB. It's may be optimized by use layers in docker and the final layer will be smth about 3-4GB.

What do you think about that optimizations?

@jameslamb
Copy link
Collaborator

Please use an official image from NVIDIA (or some other official base image like ubuntu:latest), not one that you own.

Please keep compilation of LightGBM from source, not pulling from pre-compiled sources.

I strongly recommend that you try changing the base image (first FROM statement) and modifying the existing Dockerfile, instead of completely re-writing this from scratch. If you do continue with completely re-writing it, you should expect more review comments and a longer time until this is completed.

@NisuSan
Copy link
Author

NisuSan commented May 9, 2024

Understood. I will only change base image, add missing driver and change the libnvidia-opencl.so.1 location (according to the new driver and new image) inside required file.

@NisuSan
Copy link
Author

NisuSan commented May 9, 2024

@microsoft-github-policy-service agree

Change the base image, add missing driver and change the libnvidia-opencl.so.1 location (according to the new driver and new image) inside required file.
@NisuSan
Copy link
Author

NisuSan commented May 10, 2024

I did all changes so what next? :)

@NisuSan NisuSan marked this pull request as ready for review May 10, 2024 08:13
@jameslamb
Copy link
Collaborator

I did all changes so what next?

Did you test this?

@@ -88,7 +89,7 @@ RUN cd /usr/local/src && mkdir lightgbm && cd lightgbm && \

ENV PATH /usr/local/src/lightgbm/LightGBM:${PATH}

RUN /bin/bash -c "source activate py3 && cd /usr/local/src/lightgbm/LightGBM && sh ./build-python.sh install --precompile && source deactivate"
RUN /bin/bash -c "source activate py3 && cd /usr/local/src/lightgbm/LightGBM && sh ./build-python.sh install --gpu && source deactivate"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct. lib_lightgbm.so has already been compiled a few lines up (the line running cmake --build build), so --precompile is necessary to build a Python package bundling it in.

Using --gpu makes that previous compilation unnecessary... and will not use the same OpenCL library and headers that was passed there.

This should be reverted.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get it. I'll try to work with it today if I have spare time and ckeck everything one more time.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I turn back to --precompile I got this errors:

[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Fatal] CUDA Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_CUDA=1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That error suggests to me that you're passing {"device": "cuda"} through parameters. That isn't appropriate for this image, where the library hasn't been built with -DUSE_CUDA=1.

In this Dockerfile, lib_lightgbm is being built only with -DUSE_GPU=1, which means you'd need to pass {"device": "gpu"} through params.

Copy link
Author

@NisuSan NisuSan May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried different versions of building like cmake -DUSE_GPU=1 or cmake -DUSE_CUDA=1, and then in the installation command, I also tried all possible variants: sh ./build-python.sh install --gpu, sh ./build-python.sh install --cuda, and sh ./build-python.sh install --precompile as well. I even found your reply on StackOverflow and tried to change some installation steps, but it still didn't work.

The good news is that I fixed the missing files and driver in the Docker image, so now we just need to figure out how to install it properly :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jameslamb, @shiyu1994, Today I decide to install it with simple pip command like pip install --no-binary lightgbm --config-settings=cmake.define.USE_CUDA=ON 'lightgbm>=4.0.0' and after run code with device: cuda, I get already known error from this issue. This gave me an idea that promblem with instalation from the sorce can be inside the build-python.sh or cmakelists.txt files. I ask you to get look at this if you can

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's difficult for me to help you because you're reporting error messages but not showing the code you ran the led to them.

This Dockerfile is about the -DUSE_GPU version of LightGBM (OpenCL-based), not the -DUSE_CUDA version (CUDA kernels). Please keep it that way.

Stop passing -DUSE_CUDA or using {"device": "cuda"} with images built from this Dockerfile.

@jameslamb jameslamb added the fix label May 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants