installing dropout_layer_norm #131

Open
sgunasekar opened this issue Mar 1, 2023 · 21 comments

@sgunasekar

I am trying to install dropout_layer_norm to use fused_dropout_add_ln, but the pip installation has been running for over an hour (not done yet, but no errors either). Is this normal?

@tridao

tridao commented Mar 1, 2023

It takes about 5-6 minutes to install if ninja is available to parallelize the build (ninja should come with PyTorch, but you can check with ninja --version and use htop to check whether many CPU cores are being used).

@sgunasekar

Thanks for the very quick response. I do have ninja installed, so I'm not sure where the issue is. The specs are below.

ninja --version
1.11.1.git.kitware.jobserver-1

gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

The installation of the main package works fine, though.

@tridao

tridao commented Mar 1, 2023

Are multiple CPU cores being used to compile?
You can check with htop.

@sgunasekar

Yup, all 24 cores seem to be fully occupied.
I do NOT have the same issue with the other packages in the csrc directory -- e.g., rotary or fused_dense_lib.

@sgunasekar

When I run with pip install -v, it also appears to be stuck processing the ln_fwd_xxx.cu and ln_bwd_xxx.cu files -- it's still progressing, but very slowly.

@tridao

tridao commented Mar 1, 2023

dropout_layer_norm uses a lot of templating so compilation can be slow, but it takes about 5-6 mins in my case (and the main package takes 4-5 mins for similar reasons).
I'm not sure why it would take much longer to compile.

You can comment out these 2 lines if you're only targeting Ampere; that should roughly halve the compilation time.
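
For reference, a hypothetical sketch of what those arch flags look like in csrc/layer_norm/setup.py; the cc_flag variable name is an assumption, but the -gencode values match the nvcc command shown later in this thread:

# Hypothetical sketch of the arch flags in csrc/layer_norm/setup.py.
# The cc_flag name is assumed; the -gencode values match the build log below.
cc_flag = []
cc_flag.append("-gencode")                    # comment out these two lines
cc_flag.append("arch=compute_70,code=sm_70")  # if you only target Ampere (sm_80)
cc_flag.append("-gencode")
cc_flag.append("arch=compute_80,code=sm_80")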

@sgunasekar

Thanks, I'll try this. One of the warnings I get is this, but it's a little beyond me to know whether it's an issue, though.

@sgunasekar

Even with the modification, it has been running for over an hour now. By the way, do you have a Docker image with all the flash-attn packages compiled?

@tridao

tridao commented Mar 1, 2023

We have a Dockerfile here.

dropout_layer_norm isn't that important; if you're using our training script, you can disable it with model.config.fused_dropout_add_ln=False.
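
If the extension isn't built at all, a common fallback pattern (a sketch, not necessarily the exact flash-attn code) is to probe for the compiled module and only enable the fused path when it imports:

# Sketch only: enable the fused path only if the extension is importable.
try:
    import dropout_layer_norm  # compiled CUDA extension
    HAS_FUSED_LN = True
except ImportError:
    HAS_FUSED_LN = False

# e.g., set fused_dropout_add_ln=HAS_FUSED_LN in your model config.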

@syorami

syorami commented Mar 6, 2023

I'm facing the same issue. It gets stuck while installing dropout_layer_norm.

@sgunasekar

I found that it is faster with CUDA 11.8 but gets stuck on lower versions; not sure why, though. Another hack is to go through the list of .cu files in the setup and only build the fwd/bwd libs for the dimensions you care about (a sketch follows).
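
For illustration, a hypothetical sketch of that trimming in csrc/layer_norm/setup.py; the ln_bwd_*.cu naming and the ln_api.cpp entry are assumptions (only ln_fwd_2560.cu is confirmed by the build log further down):

# Hypothetical sketch: build only the kernels for the hidden sizes you use.
wanted_dims = [1024, 2048]   # hidden sizes you actually care about
sources = ["ln_api.cpp"]     # assumed C++ API/dispatch file
for d in wanted_dims:
    sources.append(f"ln_fwd_{d}.cu")
    sources.append(f"ln_bwd_{d}.cu")
# ...then pass `sources` to the CUDAExtension in setup().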

@shijie-wu

dropout_layer_norm isn't that important

@tridao out of curiosity, could you share some insight on the performance impact of fused_dropout_add_ln?

@tridao

tridao commented Apr 5, 2023

@shijie-wu I've seen performance improvement on the order of 1-4% with fused_dropout_add_ln, depending on model size.

@shijie-wu

Is there any rule of thumb you could share on the relationship between the improvement and model size?

@tridao

tridao commented Apr 5, 2023

Larger models will get less improvement, since layer norm (and dropout & residual) will take less time relative to matrix multiply in the MLP and attention.

You can profile your model to see what fraction of time is spent on layer norm etc.
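
For example, a minimal profiling sketch with torch.profiler; model and batch are placeholders for your own module and inputs:

# Minimal sketch with torch.profiler; `model` and `batch` are placeholders.
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out = model(batch)
    out.sum().backward()

# Compare LayerNorm/dropout kernel time against GEMM time in the table.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=25))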

@shijie-wu

On an unrelated note, what is the typical speed-up with fused_dense_lib in your experience?

@tridao

tridao commented Apr 5, 2023

I don't remember exactly, maybe on the order of 2-7%. Larger models will get less improvement, for the same reason above.
You can profile your model to get the exact numbers.

@shijie-wu

thanks for sharing!

@qmdnls

qmdnls commented Jun 15, 2023

To add another data point: I am seeing exactly the same issue when compiling via python setup.py install; it eventually hangs (?) at the following point:

[46/57] /apps/cuda/RHEL8/cuda_11.6.0/bin/nvcc  -I/home/bjorn/flash-attention/csrc/layer_norm -I/home/bjorn/Megatron-LM/.env/lib/python3.8/site-packages/torch/include -I/home/bjorn/Megatron-LM/.env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/bjorn/Megatron-LM/.env/lib/python3.8/site-packages/torch/include/TH -I/home/bjorn/Megatron-LM/.env/lib/python3.8/site-packages/torch/include/THC -I/apps/cuda/RHEL8/cuda_11.6.0/include -I/home/bjorn/Megatron-LM/.env/include -c -c /home/bjorn/flash-attention/csrc/layer_norm/ln_fwd_2560.cu -o /home/bjorn/flash-attention/csrc/layer_norm/build/temp.linux-x86_64-cpython-38/ln_fwd_2560.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=dropout_layer_norm -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14

@tridao

tridao commented Jun 15, 2023

I've seen reports (#268) saying that using the Nvidia PyTorch docker container makes compiling much faster. That's the docker I use.

@AaronWatson2975

I can confirm I was seeing the same result: stuck on the exact same line, [46/57] /apps/cuda/RHEL8/cuda_11.6.0/bin/nvcc, for hours with no end in sight. This was when testing with a vanilla Runpod PyTorch v1 container; I could do everything else, but I'd always get stuck on that line. It would still happen even after I installed ninja (I couldn't get past the flash-attn install without ninja, or it would take so long I never let it finish). This was with 128 vCPUs, and I also noticed my CPU usage never exceeded 20% and the build never seemed to finish (I waited 3 hours on one run).

After switching to the container mentioned above, nvcr.io/nvidia/pytorch:22.12-py3, not only did the setup go faster, but the line it was hanging on (46/57) took no longer than any other step. I also noted that my CPU usage stayed around 100% during the build of layer_norm.

Not sure if this is helpful for anyone, but I thought I'd share.
