installing dropout_layer_norm #131

Open
sgunasekar opened this issue Mar 1, 2023 · 21 comments

@sgunasekar

I am trying to install dropout_layer_norm to use fused_dropout_add_ln, but the pip installation has been running for over an hour (not done yet, but no errors either). Is this normal?

@tridao

tridao commented Mar 1, 2023

It takes about 5-6 minutes to install if ninja is available to parallelize the build (ninja should come with PyTorch, but you can check with ninja --version and use htop to check whether many CPU cores are being used).

@sgunasekar

Thanks for the very quick response. I do have ninja installed, so I'm not sure where the issue is. The specs are below.

ninja --version
1.11.1.git.kitware.jobserver-1

gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

The installation of the main package works fine, though.

@tridao

tridao commented Mar 1, 2023

Are multiple CPU cores being used to compile?
You can check with htop.

@sgunasekar

Yup, all 24 cores seem to be fully occupied.
I do NOT have the same issue with the other packages in the csrc directory -- e.g., rotary or fused_dense_lib.

@sgunasekar

When I run with pip install -v, it also appears to be stuck processing the ln_fwd_xxx.cu and ln_bwd_xxx.cu files -- it's still progressing, but very slowly.

@tridao

tridao commented Mar 1, 2023

dropout_layer_norm uses a lot of templating so compilation can be slow, but it takes about 5-6 mins in my case (and the main package takes 4-5 mins for similar reasons).
I'm not sure why it would take much longer to compile.

You can comment out these 2 lines if you're only targeting Ampere; that should roughly halve the compilation time.
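
For reference, a hypothetical sketch of what those arch flags look like in csrc/layer_norm/setup.py; the cc_flag variable name is an assumption, but the -gencode values match the nvcc command shown later in this thread:

# Hypothetical sketch of the arch flags in csrc/layer_norm/setup.py.
# The cc_flag name is assumed; the -gencode values match the build log below.
cc_flag = []
cc_flag.append("-gencode")                    # comment out these two lines
cc_flag.append("arch=compute_70,code=sm_70")  # if you only target Ampere (sm_80)
cc_flag.append("-gencode")
cc_flag.append("arch=compute_80,code=sm_80")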

@sgunasekar

Thanks, I'll try this. One of the warnings I get is this, but it's a little beyond me to know whether it's an issue, though.

@sgunasekar

Even with the modification, it has been running for over an hour now. By the way, do you have a Docker image with all the flash-attn packages compiled?

@tridao

tridao commented Mar 1, 2023

We have a Dockerfile here.

dropout_layer_norm isn't that important; if you're using our training script, you can disable it with model.config.fused_dropout_add_ln=False.
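
If the extension isn't built at all, a common fallback pattern (a sketch, not necessarily the exact flash-attn code) is to probe for the compiled module and only enable the fused path when it imports:

# Sketch only: enable the fused path only if the extension is importable.
try:
    import dropout_layer_norm  # compiled CUDA extension
    HAS_FUSED_LN = True
except ImportError:
    HAS_FUSED_LN = False

# e.g., set fused_dropout_add_ln=HAS_FUSED_LN in your model config.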

@syorami

syorami commented Mar 6, 2023

I'm facing the same issue. It gets stuck while installing dropout_layer_norm.

@sgunasekar

I found that it is faster with CUDA 11.8 but gets stuck on lower versions; not sure why, though. Another hack is to go through the list of .cu files in the setup and only build the fwd/bwd libs for the dimensions you care about (a sketch follows).
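
For illustration, a hypothetical sketch of that trimming in csrc/layer_norm/setup.py; the ln_bwd_*.cu naming and the ln_api.cpp entry are assumptions (only ln_fwd_2560.cu is confirmed by the build log further down):

# Hypothetical sketch: build only the kernels for the hidden sizes you use.
wanted_dims = [1024, 2048]   # hidden sizes you actually care about
sources = ["ln_api.cpp"]     # assumed C++ API/dispatch file
for d in wanted_dims:
    sources.append(f"ln_fwd_{d}.cu")
    sources.append(f"ln_bwd_{d}.cu")
# ...then pass `sources` to the CUDAExtension in setup().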

@shijie-wu

dropout_layer_norm isn't that important

@tridao out of curiosity, could you share some insight on the performance impact of fused_dropout_add_ln?

@tridao

tridao commented Apr 5, 2023

@shijie-wu I've seen performance improvement on the order of 1-4% with fused_dropout_add_ln, depending on model size.

@shijie-wu

Is there any rule of thumb you could share on the relationship between the improvement and model size?

@tridao

tridao commented Apr 5, 2023

Larger models will get less improvement, since layer norm (and dropout & residual) will take less time relative to matrix multiply in the MLP and attention.

You can profile your model to see what fraction of time is spent on layer norm etc.
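
For example, a minimal profiling sketch with torch.profiler; model and batch are placeholders for your own module and inputs:

# Minimal sketch with torch.profiler; `model` and `batch` are placeholders.
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out = model(batch)
    out.sum().backward()

# Compare LayerNorm/dropout kernel time against GEMM time in the table.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=25))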

@shijie-wu

On an unrelated note, what is the typical speed-up with fused_dense_lib in your experience?

@tridao

tridao commented Apr 5, 2023

I don't remember exactly, maybe on the order of 2-7%. Larger models will get less improvement, for the same reason above.
You can profile your model to get the exact numbers.

@shijie-wu

thanks for sharing!

@qmdnls

qmdnls commented Jun 15, 2023

To add another data point: I am seeing exactly the same issue when compiling via python setup.py install; it eventually hangs (?) at the following point:

[46/57] /apps/cuda/RHEL8/cuda_11.6.0/bin/nvcc  -I/home/bjorn/flash-attention/csrc/layer_norm -I/home/bjorn/Megatron-LM/.env/lib/python3.8/site-packages/torch/include -I/home/bjorn/Megatron-LM/.env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/bjorn/Megatron-LM/.env/lib/python3.8/site-packages/torch/include/TH -I/home/bjorn/Megatron-LM/.env/lib/python3.8/site-packages/torch/include/THC -I/apps/cuda/RHEL8/cuda_11.6.0/include -I/home/bjorn/Megatron-LM/.env/include -c -c /home/bjorn/flash-attention/csrc/layer_norm/ln_fwd_2560.cu -o /home/bjorn/flash-attention/csrc/layer_norm/build/temp.linux-x86_64-cpython-38/ln_fwd_2560.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=dropout_layer_norm -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14

@tridao

tridao commented Jun 15, 2023

I've seen reports (#268) saying that using the Nvidia PyTorch docker container makes compiling much faster. That's the docker I use.

@AaronWatson2975

I can confirm I was seeing the same result: stuck on the exact same line, [46/57] /apps/cuda/RHEL8/cuda_11.6.0/bin/nvcc, for hours with no end in sight. This was when testing with a vanilla Runpod PyTorch v1 container; I could do everything else, but I'd always get stuck on that line. It would still happen even after I installed ninja (I couldn't get past the flash-attn install without ninja, or it would take so long I never let it finish). This was with 128 vCPUs, and I also noticed my CPU usage never exceeded 20% and the build never seemed to finish (I waited 3 hours on one run).

After switching to the container mentioned above, nvcr.io/nvidia/pytorch:22.12-py3, not only did the setup go faster, but the line it was hanging on (46/57) took no longer than any other step. I also noted that my CPU usage stayed around 100% during the build of layer_norm.

Not sure if this is helpful for anyone, but I thought I'd share.
