
Stable Diffusion Meta AITemplate with >= 200% performance increase #1625

AmericanPresidentJimmyCarter opened this issue Oct 4, 2022 · 18 comments

AmericanPresidentJimmyCarter commented Oct 4, 2022

AITemplate from Meta claims a 200% or more speedup in image generation.

Presently it is only available for the diffusers library: https://github.com/facebookincubator/AITemplate/tree/main/examples/05_stable_diffusion

[Benchmark screenshot: PT = PyTorch, AIT = AITemplate implementation]

AmericanPresidentJimmyCarter commented Oct 4, 2022

Looking into what is needed for this to work:

We need to isolate all portions of the model used for sampling from ldm/taming and create torch-like AIT versions of them to be transpiled into C++.

https://facebookincubator.github.io/AITemplate/tutorial/how_to_infer_pt.html

A good example is the port of the attention module:
https://github.com/facebookincubator/AITemplate/blob/main/examples/05_stable_diffusion/modeling/attention.py

Then we run the compile.py script to build the library, and then inference proceeds as normal.
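To make that workflow concrete, here is a minimal sketch based on the linked tutorial and example. The `SimpleBlock` module, shapes, and names are illustrative stand-ins, not actual ldm/taming code:

```python
# Minimal sketch of the AIT porting workflow (assumes AITemplate is installed
# and an NVIDIA GPU is available). SimpleBlock stands in for a real ldm module.
from aitemplate.compiler import compile_model
from aitemplate.frontend import Tensor, nn
from aitemplate.testing import detect_target


class SimpleBlock(nn.Module):
    # Torch-like AIT module: same structure as the PyTorch original, but
    # built from aitemplate.frontend ops so it can be codegen'd to C++/CUDA.
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.proj(x)


# 1. Declare a symbolic input (AIT kernels target fp16 tensor cores).
x = Tensor(shape=[1, 64], dtype="float16", name="input0", is_input=True)

# 2. Trace the graph by calling forward on the symbolic tensor.
model = SimpleBlock(64)
model.name_parameter_tensor()  # give parameters stable names for weight binding
y = model.forward(x)
y._attrs["is_output"] = True
y._attrs["name"] = "output0"

# 3. Compile -- this is the step compile.py performs for the full pipeline:
#    codegen to C++/CUDA, profile the kernels, and build a shared library.
module = compile_model(y, detect_target(), "./tmp", "simple_block")
```

The real port would then bind the PyTorch weights to the compiled module by name and run inference via `module.run_with_tensors`.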

@JustMaier (Contributor)
I wonder if this will run into the same challenges as implementing this other performance enhancement:
#576

In the case of that issue, if I understand correctly, the improved method could only run under Linux, and there wasn't a clear way to cross-compile for Windows, so collaborators were largely stuck waiting for upstream changes. Any idea if this is going to be compatible with Windows?

I think another potential challenge is AIT's hardware requirements (Ampere, etc.). Certainly if this could be implemented it could be behind a cmd opt, but would that mean there may need to be multiple versions of the core items you mentioned?

@AmericanPresidentJimmyCarter (Author)
> I think another potential challenge is AIT's hardware requirements (Ampere, etc.). Certainly if this could be implemented it could be behind a cmd opt, but would that mean there may need to be multiple versions of the core items you mentioned?

You would need to duplicate the code in the AIT syntax and add a flag to turn it on for the different hardware, yes. It's very frustrating that there is no way to easily reuse the existing torch code.
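For illustration, the gating would presumably have to look something like the sketch below. The flag name and loader functions are hypothetical, not existing webui code; the point is only that every AIT-accelerated component needs a second copy plus a switch:

```python
# Hypothetical opt-in gate for an AIT code path. Nothing here is real webui
# code; it only illustrates the "duplicate the module, then flag it" cost.
import argparse


def load_torch_unet():
    # Placeholder for the existing torch/ldm model-loading path.
    raise NotImplementedError("existing PyTorch path")


def load_ait_unet():
    # Placeholder for a second, hand-maintained AIT port of the same model.
    raise NotImplementedError("AIT port; requires an Ampere-class NVIDIA GPU")


parser = argparse.ArgumentParser()
parser.add_argument(
    "--use-aitemplate",  # hypothetical cmd opt, off by default
    action="store_true",
    help="use AITemplate-compiled modules (NVIDIA Ampere or newer only)",
)
args, _ = parser.parse_known_args()

# Every ported component (UNet, VAE, CLIP, ...) would need a fork like this.
load_unet = load_ait_unet if args.use_aitemplate else load_torch_unet
```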

0xdevalias commented Oct 31, 2022

I don't think this repo currently uses diffusers, but I stumbled upon this PR:

It has some comments about how it could potentially also make use of AITemplate in a future PR:

@0xdevalias

AITemplate + xformers combination just dropped:

Done: facebookincubator/AITemplate#74

Originally posted by @antinucleon in facebookincubator/AITemplate#13 (comment)


Sync to v0.1.1 version

Impact on current examples:

  • Stable Diffusion: A100-40GB / CUDA 11.6, 50 steps (ms)

Batch 1

| Module | AIT v0.1 | AIT v0.1.1 | v0.1.1 Speedup |
| --- | --- | --- | --- |
| CLIP | 0.87 | 0.87 | 1X |
| UNet | 22.47 | 18.11 | 1.24X |
| VAE | 37.43 | 20.14 | 1.85X |
| Sum of Three | 1161.8 | 926.51 | 1.25X |
| Pipeline | 1282.98 | 1013 | 1.26X |

(CLIP, UNet, and VAE are per-call times in ms; "Sum of Three" counts 50 UNet steps plus one CLIP and one VAE call.)

v0.1: 42.45 it/s, v0.1.1: 53.30 it/s

Batch 16

| Module | v0.1 | v0.1.1 | Speedup |
| --- | --- | --- | --- |
| Pipeline | 14931.95 | 11064.81 | 1.34X |

  • BERT: CUDA long-sequence performance will be significantly boosted by the new mem_eff_attention codegen
  • VIT: CUDA large-resolution performance will be significantly boosted by the new mem_eff_attention codegen

@YourFriendlyNeighborhoodMONKE

A friend of mine was able to get his RTX 4090 inference speed from 25-28 it/s to the 61-64 it/s range

He said it was a rather painstaking process to get AITemplate to work, with a lot of errors along the way that he had to solve
... But the results speak for themselves!

Would be very interesting if someone investigated this further and figured out a way to port it to the webui

Here are his specs:
RTX 4090 FE (stock settings), WSL, CUDA 11.6, latest AITemplate (13-11-2022), Intel 12700KF, Windows 11 22H2

Here's a screenshot he provided me: [screenshot]

@bbecausereasonss

Even 25-28 it/s is insane. I'm lucky to get 6-8 on my 4090.

Maximus-CZ commented Nov 14, 2022 via email

0xdevalias commented Nov 14, 2022

> A friend of mine was able to get his RTX 4090 inference speed from 25-28 it/s to the 61-64 it/s range
>
> He said it was a rather painstaking process to get AITemplate to work, with a lot of errors along the way that he had to solve

@YourFriendlyNeighborhoodMONKE Curious if your friend took notes along the way about how they managed to get things going, what issues they ran into, how they solved them, etc. Would be awesome knowledge to have shared here to make others' lives easier, and/or to bootstrap getting it running for this repo!

YourFriendlyNeighborhoodMONKE commented Nov 14, 2022

> Even 25-28 it/s is insane. I'm lucky to get 6-8 on my 4090.

I got about 7.5-8.5 out of the box, which is actually really bad for a 4090, but yeah, it's because AFAIK there's still no support for Lovelace in PyTorch and other areas as well; 3090s are probably beating those numbers out of the box

The easiest optimization you could do would be the cuDNN one, which is just replacing some .dll files in /venv/lib/site-packages/torch/lib/. You can find the files by searching "4090 cudnn" in a discussion thread here and also on r/StableDiffusion

xformers is fairly simple and straightforward too, as auto's webui already supports it out of the box without compiling; all you really need to do is put --xformers into webui.bat after %PYTHON% launch.py %* to get it installed
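For reference, the edited launch line would then read something like `%PYTHON% launch.py %* --xformers` (assuming the stock webui.bat of the time; the usual alternative is adding the flag to COMMANDLINE_ARGS in webui-user.bat).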

I got a little under 20 it/s after those two, which isn't as high as some are able to get, but I'm happy enough and will just wait for better 4090 support and for things like AIT to become available for easy Windows installation or be included in the webui

Remember to back up everything before attempting!

YourFriendlyNeighborhoodMONKE commented Nov 14, 2022

>> A friend of mine was able to get his RTX 4090 inference speed from 25-28 it/s to the 61-64 it/s range
>> He said it was a rather painstaking process to get AITemplate to work, with a lot of errors along the way that he had to solve
>
> @YourFriendlyNeighborhoodMONKE Curious if your friend took notes along the way about how they managed to get things going, what issues they ran into, how they solved them, etc. Would be awesome knowledge to have shared here to make others' lives easier, and/or to bootstrap getting it running for this repo!

I understand, but I doubt he has anything, because the way I understood it, it took him a couple of days of struggle, and he seems pretty advanced in these areas as well. These kinds of things at this stage tend to come with quite varied errors to deal with, which are hardware/software-configuration specific too

I'll ask anyway!

Btw, I saw hlky's comment stating that "at Stable Horde there are about 40 workers that can test AIT on various GPU's"
... So at least there's some interest out there in gathering testing data!

@0xdevalias

Another potential performance gain issue:

@0xdevalias

>> It is just for inference, so it won't be helpful for training. I also tested it; it's good for inference but also takes a really long time to compile.
>
> Just FYI, the compilation time with the latest open-source version has been improved a lot since our first release. In our experience, it can be 4X faster for models where the computation-intensive ops are mostly GEMM-family ops. We've made similar improvements for Conv ops in our internal version, which will be synced to the open-source repo later. Stay tuned. Thanks.

Originally posted by @chenyang78 in facebookincubator/AITemplate#102 (comment)

@mezotaken added the enhancement (New feature or request) label on Jan 12, 2023
@Boom-Hacker

Why improve only the NVIDIA cards? To kill AMD?

@Boom-Hacker

> Why improve only the NVIDIA cards? To kill AMD?

Why improve only the NVIDIA cards? Do you want to kill AMD?

@Boom-Hacker

Every prompt change needs a rebuild; it takes over 2 minutes.

bigmover commented Feb 6, 2024

Btw, hi guys! I'm a newbie to the Stable Diffusion webui. I don't know whether AITemplate is available in it. Any plans to support it?
