FusedLayerNorm vs torch.nn.LayerNorm #449
I'm trying that gist with 10,000 iterations on a V100 and
And it's even more if I convert the input to
I put it in a gist: https://gist.github.com/bryant1410/d88a42a4b1a3c2989a1db6c79f07e045
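For readers who can't open the gist, a minimal sketch of this kind of comparison follows. It is not the gist's exact code: the input shape, the iteration count, and the `benchmark` helper are illustrative assumptions, and it assumes apex is installed with its CUDA extensions.

```python
# Hedged sketch of a FusedLayerNorm vs. torch.nn.LayerNorm timing loop;
# not the gist's exact code. Shape and iteration count are illustrative.
import time

import torch
from apex.normalization import FusedLayerNorm

def benchmark(norm, x, iters=10000):
    with torch.no_grad():
        for _ in range(10):  # warm-up so one-time kernel setup is excluded
            norm(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            norm(x)
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return time.perf_counter() - start

x = torch.randn(64, 128, 1024, device="cuda")  # illustrative (batch, seq, hidden)
for cls in (torch.nn.LayerNorm, FusedLayerNorm):
    norm = cls(x.size(-1)).cuda()  # normalize over the last dimension
    print(f"{cls.__name__}: {benchmark(norm, x):.3f}s")
```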
@bryant1410 Thanks for reporting!
This is what I get on the DGX station (Tesla V100 GPU) with the master-py3-devel docker:
With pytorch and apex master compiled from source:
Pull request pytorch/pytorch#26201 ("upstream" means layernorm implementation ported from APEX):
Btw, I ran mine with the commit
Is the fused LN heavily optimized for transformer-like applications? Because I do get a big speedup for standard NLP representations.
@ngoyal2707 Which upstream version are you using? The layernorm in upstream has been improved a lot recently.
I observe the same issue as @ngoyal2707 on PyTorch 1.5: torch.nn.LayerNorm is slower than apex.FusedLayerNorm for shapes typical in NLP models. For example, (512, 16, 1024) with normalization over the last dimension is slower with torch.nn.LayerNorm.
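In terms of the hypothetical `benchmark` helper sketched earlier in the thread, that configuration would be exercised roughly like this; reading (512, 16, 1024) as (seq, batch, hidden) is my assumption:

```python
import torch
from apex.normalization import FusedLayerNorm
# `benchmark` is the hypothetical helper from the sketch earlier in the thread.

x = torch.randn(512, 16, 1024, device="cuda")  # shape from the comment above
for cls in (torch.nn.LayerNorm, FusedLayerNorm):
    norm = cls(1024).cuda()  # normalized_shape = the last dimension only
    print(f"{cls.__name__}: {benchmark(norm, x):.3f}s")
```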
I also see a performance boost using the FusedLayerNorm for our NLP-based transformer.
I just replaced every LayerNorm with the apex version in a model from the Transformers library (RoBERTa-based), on a real dataset with an average sequence length of 200 tokens. So in a basically real-life setup, I can't measure any difference. I have also run the benchmark, and on the same machine I get:
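As a point of reference, replacing every LayerNorm in a Transformers model can be done with a recursive module swap. This is a hedged sketch, not the commenter's code; the `swap_layernorm` helper and the `roberta-base` checkpoint are illustrative:

```python
# Hedged sketch: recursively replace nn.LayerNorm modules with apex's
# FusedLayerNorm in a Hugging Face Transformers model.
import torch
from apex.normalization import FusedLayerNorm
from transformers import RobertaModel

def swap_layernorm(module: torch.nn.Module) -> None:
    for name, child in module.named_children():
        if isinstance(child, torch.nn.LayerNorm):
            fused = FusedLayerNorm(child.normalized_shape, eps=child.eps)
            fused.load_state_dict(child.state_dict())  # keep trained weight/bias
            setattr(module, name, fused)
        else:
            swap_layernorm(child)  # recurse into submodules

model = RobertaModel.from_pretrained("roberta-base").cuda()
swap_layernorm(model)
```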
@vgoklani is it a custom transformer or from an OSS library?
I ran the gist with shape (32, 128, 768), which is common in Transformers, on a V100 with CUDA 10. Here is what I got:
After changing the sequence length to 256:
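Using the same hypothetical `benchmark` helper, those two runs differ only in the sequence dimension; a batch-first layout is my assumption:

```python
import torch
from apex.normalization import FusedLayerNorm
# `benchmark` is the hypothetical helper from the sketch earlier in the thread.

for seq_len in (128, 256):  # the two sequence lengths from the comment
    x = torch.randn(32, seq_len, 768, device="cuda")
    for cls in (torch.nn.LayerNorm, FusedLayerNorm):
        norm = cls(768).cuda()
        print(f"seq={seq_len} {cls.__name__}: {benchmark(norm, x):.3f}s")
```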
@pommedeterresautee I suggest you provide your device and CUDA version; that would be more helpful.
@hitvoice A 2080 Ti, and apex from the master branch as of my previous message, so June 16th.
I ran the gist provided by @ptrblck on a 2080 Ti GPU; the following is the result:
What's the advantage of using FusedLayerNorm over torch.nn.LayerNorm? I'm running into an issue with using TorchScript, and I'm wondering if I can replace the former with the latter. The deeper question is: is the apex version of layer norm significantly optimized over the standard PyTorch version, or is it simply a legacy of when PyTorch did not have a built-in layer norm function?
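On the TorchScript point: torch.nn.LayerNorm scripts cleanly with torch.jit.script, which a minimal check like the following illustrates (an illustrative sketch, not code from the issue):

```python
# Minimal check that torch.nn.LayerNorm works under torch.jit.script,
# which is the direction of replacement the question asks about.
import torch

class Block(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = torch.nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x)

scripted = torch.jit.script(Block(768))
out = scripted(torch.randn(2, 4, 768))
print(out.shape)  # torch.Size([2, 4, 768])
```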