It doesn't support the latest RTX 40-series card #15
My RTX 4090 under the NVIDIA Docker container has the same issue.
I am not sure whether removing this assertion will work. I think that right now only GH100 has 9.x compute capability.
It didn't work; I tried.
CUDA 12 was released this week. Does PyTorch need to be updated to CUDA 12 before this library can work with RTX GPUs? https://developer.nvidia.com/blog/cuda-toolkit-12-0-released-for-general-availability/ Filed issue: pytorch/pytorch#90988
I managed to work around PyTorch while building TransformerEngine with CUDA 12 (using
As the error explains, you would need a GPU with compute capability 9.0, while your 4090 uses compute capability 8.9.
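The capability gate that produces this assertion can be sketched roughly as follows (the function name and messages are illustrative, not Transformer Engine's actual code; the Ada branch reflects the Transformer Engine 0.7 / cuBLAS 12.1.3 support that comes up later in this thread):

```python
def fp8_supported(major: int, minor: int) -> tuple[bool, str]:
    """Sketch of an FP8 capability check (names and messages hypothetical).

    Early Transformer Engine releases required compute capability >= 9.0
    (Hopper, e.g. H100). Release 0.7 together with cuBLAS 12.1.3 also
    accepts 8.9 (Ada, e.g. RTX 4090 / AD102).
    """
    if (major, minor) >= (9, 0):
        return True, "Hopper-class FP8 support"
    if (major, minor) == (8, 9):
        return True, "Ada FP8 support (Transformer Engine >= 0.7, cuBLAS >= 12.1.3)"
    return False, "Device compute capability 8.9 or 9.x required for FP8 execution"

# An RTX 4090 reports capability (8, 9); on a real system you would query
# this with torch.cuda.get_device_capability().
ok, reason = fp8_supported(8, 9)
```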
@ptrblck Thanks! Sorry, I'm a noob with CUDA. Why does it require compute capability 9.x? RTX 4090 has fp8 cores:
and the CUDA 12.0 announcement says that it supports Lovelace architecture:
This might be out of scope for this repo, but what's the intended way to invoke FP8 ops if not via the same cuBLASLt functions that are used in this repo?
Hi all, first of all, I'm really sorry for the prolonged silence on this issue; I did not want to communicate anything before getting full alignment internally.
@ptrendx With CUDA 12.1 now released, is Ada FP8 support there? I mean, seeing that it is Q1 and you mentioned Q2 as the estimated support for Ada FP8, has it been delayed to CUDA 12.2?
Hi @oscarbg. In order to support FP8 on Ada, two things need to happen:
Thanks @ptrendx. Quick question: will Ada FP8 support include both E4M3 and E5M2?
Yes, it supports both types (including mixing them, e.g. performing a matrix multiply where one input is E4M3 and the other is E5M2), same as Hopper.
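For context on the two formats, the trade-off can be shown with a small pure-Python sketch of their largest finite values (the bit layouts follow the published FP8 formats: E4M3 trades range for precision, E5M2 the reverse, which is why E5M2 is often used for gradients):

```python
def max_finite_e4m3() -> float:
    # E4M3: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
    # The top exponent is reused for normal numbers; only the all-ones
    # encoding is NaN, so the largest finite value is
    # 1.110 (binary) * 2^8 = 1.75 * 256 = 448.
    return (1 + 6 / 8) * 2.0 ** 8

def max_finite_e5m2() -> float:
    # E5M2: 1 sign bit, 5 exponent bits (bias 15), 2 mantissa bits.
    # The top exponent is reserved for inf/NaN (IEEE-style), so the
    # largest finite value is 1.11 (binary) * 2^15 = 1.75 * 32768 = 57344.
    return (1 + 3 / 4) * 2.0 ** 15
```

So E5M2 covers a much wider dynamic range, while E4M3 spends the extra bit on mantissa precision.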
@ptrendx thank you!
Today CUDA Toolkit 12.1 Update 1 was released. It contains cuBLAS 12.1.3.1, which enables FP8 kernels for Ada. With this version of cuBLAS, together with Transformer Engine 0.7 (which added Ada to the compilation targets), FP8 computation is now supported on Ada. Let us know if you encounter any issues with it.
When I tried to install TransformerEngine again with an RTX 4090 card, I got the following error:
Does Transformer Engine rely on flash-attention? I checked here, and it says FP8 support in flash-attn is not ready yet. Update: I have tested by installing the latest PyTorch from the master branch, and I installed both flash-attn and Transformer Engine successfully. I have also run some basic tests, as mentioned on the TransformerEngine home page, and confirmed that it really works! I'll test more cases; if there are no problems, I will close the issue. Thank you very much!
I ran a simple benchmark using the basic MNIST example with optional FP8. I found that with and without the flag
@hxssgaa Did you build torch with CUDA 12.1 without issues? I've heard that it is still considered in development.
FP8 is for Transformer models, not for a simple task like MNIST that does not use a Transformer.
Yes, I built it successfully from the master branch of PyTorch without any issues, although I'm not sure whether there would be performance issues, since I haven't properly benchmarked it. I will also run some Transformer benchmarks later.
I tried a simple benchmark of a Transformer layer with and without FP8; the code I used is from here, with some modifications since a helper file is missing. The code below is without FP8 and with the attention layer from Transformer Engine; it achieves an average of
The code below is with FP8 and with the attention layer from Transformer Engine; it is actually slower
I'm not sure whether this is because PyTorch was built from the master branch rather than a stable version, even though I built it successfully. I will check the issue again once PyTorch has released official CUDA 12.1 support.
Hi, we are aware of the issue with quickstart_utils.py not being served properly by the website and are working on a fix. You can get it from here: https://github.com/NVIDIA/TransformerEngine/blob/main/docs/examples/quickstart_utils.py As for the benchmarking questions:
Please note that CUDA synchronization itself takes a small amount of time, so timing with CUDA events is preferable for the most accurate measurement; with 100 iterations, though, that effect should be really small.
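The warmup-then-average benchmarking pattern can be sketched like this (a CPU-side sketch using `time.perf_counter`; on a GPU, kernel launches are asynchronous, so you would call `torch.cuda.synchronize()` before reading the clock, or, as suggested above, time with `torch.cuda.Event` for the most accurate numbers):

```python
import time

def benchmark(fn, warmup: int = 10, iters: int = 100) -> float:
    """Return the average runtime of fn() in milliseconds.

    A warmup phase runs first so that one-time costs (caching, JIT,
    autotuning) do not pollute the timed iterations.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) * 1000.0 / iters
```

For GPU code, the same structure applies, with `torch.cuda.Event(enable_timing=True)` records around the timed loop instead of a wall clock.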
Thanks for the detailed clarification and for providing the script. I revised my script accordingly, reran it, and confirmed that enabling FP8 reduces the Transformer inference time from 91.50 ms to 65.02 ms on an RTX 4090. I used the script below:
Note that I still encounter an error when using flash attention, as mentioned here, but I think it's a PyTorch problem. With this experiment I can confirm that FP8 support on AD102 is indeed working, so I will close the issue. Thanks to the team!
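For reference, the timings reported above work out to roughly a 1.4x speedup from enabling FP8:

```python
# Numbers reported earlier in this thread for an RTX 4090.
baseline_ms = 91.50  # Transformer inference time without FP8
fp8_ms = 65.02       # same workload with FP8 enabled

speedup = baseline_ms / fp8_ms  # ~1.41x
```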
@ptrendx, do you know if this updated cuBLAS will be in the PyTorch NGC container v23.04? If I try to manually update it, I'm sure I'll mess it up 😅
@nbroad1881 Yes, the 23.04 container has everything you need :-).
In a conda env, I did: Then I did the pip install of Transformer Engine but got the error below. Indeed, cublas_v2.h is NOT in the env's include directory, but I still have the file in /miniconda3/pkgs/libcublas-dev-12.1.3.1-0/include. Am I missing something, or do I need to change the CMake configuration because of the conda env?
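One direction worth trying for this kind of "header in the pkgs cache but not found by the build" problem is to point the build at the active conda environment's include and lib directories (a sketch only; whether the headers are actually linked into `$CONDA_PREFIX/include` depends on how the cuBLAS package was installed, and the exact variables honored depend on the build system):

```shell
# Assumption: the CUDA/cuBLAS headers and libs live under the active
# conda env prefix. Export the standard search-path variables so the
# compiler and linker can find cublas_v2.h and libcublas.
export CUDA_HOME="$CONDA_PREFIX"
export CPATH="$CONDA_PREFIX/include${CPATH:+:$CPATH}"
export LIBRARY_PATH="$CONDA_PREFIX/lib${LIBRARY_PATH:+:$LIBRARY_PATH}"
export LD_LIBRARY_PATH="$CONDA_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

If the header only exists under `/miniconda3/pkgs/...` and not under the env prefix, reinstalling the dev package into the env (rather than relying on the package cache) is likely the cleaner fix.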
@vince62s, you should use the PyTorch NGC container version 23.04. It makes the process much, much easier.
That doesn't really help when you want a repo to rely on this. EDIT: this seems to work
Still can't get this to work.
Now when I try to run the following example from the documentation:
I get an error. I'm running this on an RTX 4090. What am I missing?
@AnubhabB
Hi, FP8 should be supported on the RTX 40-series as well, since it is based on the AD102 architecture, which has FP8 capabilities. However, running TransformerEngine on an RTX 4090 results in the error "AssertionError: Device compute capability 9.x required for FP8 execution.", so it is unable to take advantage of FP8.