
Stable Diffusion Meta AITemplate with >= 200% performance increase #1625

AmericanPresidentJimmyCarter opened this issue Oct 4, 2022 · 18 comments

AmericanPresidentJimmyCarter commented Oct 4, 2022

AITemplate from Meta claims a 200% or more speedup in image generation.

Presently it is only available for the diffusers library: https://github.com/facebookincubator/AITemplate/tree/main/examples/05_stable_diffusion

[Benchmark screenshot: PT = PyTorch, AIT = AITemplate implementation]

AmericanPresidentJimmyCarter commented Oct 4, 2022

Looking into what is needed for this to work:

We need to isolate all portions of the model used for sampling from ldm/taming and create torch-like AIT versions of them to be transpiled into C++.

https://facebookincubator.github.io/AITemplate/tutorial/how_to_infer_pt.html

A good example is the port of the attention module:
https://github.com/facebookincubator/AITemplate/blob/main/examples/05_stable_diffusion/modeling/attention.py

Then we run the compile.py script to build the library, and then inference proceeds as normal.
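To make that workflow concrete, here is a minimal sketch based on the linked tutorial and example. The `SimpleBlock` module, shapes, and names are illustrative stand-ins, not actual ldm/taming code:

```python
# Minimal sketch of the AIT porting workflow (assumes AITemplate is installed
# and an NVIDIA GPU is available). SimpleBlock stands in for a real ldm module.
from aitemplate.compiler import compile_model
from aitemplate.frontend import Tensor, nn
from aitemplate.testing import detect_target


class SimpleBlock(nn.Module):
    # Torch-like AIT module: same structure as the PyTorch original, but
    # built from aitemplate.frontend ops so it can be codegen'd to C++/CUDA.
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.proj(x)


# 1. Declare a symbolic input (AIT kernels target fp16 tensor cores).
x = Tensor(shape=[1, 64], dtype="float16", name="input0", is_input=True)

# 2. Trace the graph by calling forward on the symbolic tensor.
model = SimpleBlock(64)
model.name_parameter_tensor()  # give parameters stable names for weight binding
y = model.forward(x)
y._attrs["is_output"] = True
y._attrs["name"] = "output0"

# 3. Compile -- this is the step compile.py performs for the full pipeline:
#    codegen to C++/CUDA, profile the kernels, and build a shared library.
module = compile_model(y, detect_target(), "./tmp", "simple_block")
```

The real port would then bind the PyTorch weights to the compiled module by name and run inference via `module.run_with_tensors`.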

@JustMaier (Contributor)
I wonder if this will run into the same challenges as implementing this other performance enhancement:
#576

In the case of that issue, if I understand correctly, the improved method could only run under Linux, and there wasn't a clear way to cross-compile for Windows, so collaborators were largely stuck waiting for upstream changes. Any idea if this is going to be compatible with Windows?

I think another potential challenge is AIT's hardware requirements (Ampere, etc.). Certainly if this could be implemented it could be behind a cmd opt, but would that mean there may need to be multiple versions of the core items you mentioned?

@AmericanPresidentJimmyCarter (Author)
> I think another potential challenge is AIT's hardware requirements (Ampere, etc.). Certainly if this could be implemented it could be behind a cmd opt, but would that mean there may need to be multiple versions of the core items you mentioned?

You would need to duplicate the code in the AIT syntax and add a flag to turn it on for the different hardware, yes. It's very frustrating that there is no way to easily reuse the existing torch code.
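For illustration, the gating would presumably have to look something like the sketch below. The flag name and loader functions are hypothetical, not existing webui code; the point is only that every AIT-accelerated component needs a second copy plus a switch:

```python
# Hypothetical opt-in gate for an AIT code path. Nothing here is real webui
# code; it only illustrates the "duplicate the module, then flag it" cost.
import argparse


def load_torch_unet():
    # Placeholder for the existing torch/ldm model-loading path.
    raise NotImplementedError("existing PyTorch path")


def load_ait_unet():
    # Placeholder for a second, hand-maintained AIT port of the same model.
    raise NotImplementedError("AIT port; requires an Ampere-class NVIDIA GPU")


parser = argparse.ArgumentParser()
parser.add_argument(
    "--use-aitemplate",  # hypothetical cmd opt, off by default
    action="store_true",
    help="use AITemplate-compiled modules (NVIDIA Ampere or newer only)",
)
args, _ = parser.parse_known_args()

# Every ported component (UNet, VAE, CLIP, ...) would need a fork like this.
load_unet = load_ait_unet if args.use_aitemplate else load_torch_unet
```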

0xdevalias commented Oct 31, 2022

I don't think this repo currently uses diffusers, but I stumbled upon this PR:

It has some comments about how it could potentially also make use of AITemplate in a future PR:

@0xdevalias

AITemplate + xformers combination just dropped:

Done: facebookincubator/AITemplate#74

Originally posted by @antinucleon in facebookincubator/AITemplate#13 (comment)


Sync to v0.1.1 version

Impact on current examples:

  • Stable Diffusion: A100-40GB / CUDA 11.6, 50 steps (ms)

Batch 1

| Module | AIT v0.1 | AIT v0.1.1 | v0.1.1 Speedup |
| --- | --- | --- | --- |
| CLIP | 0.87 | 0.87 | 1X |
| UNet | 22.47 | 18.11 | 1.24X |
| VAE | 37.43 | 20.14 | 1.85X |
| Sum of Three | 1161.8 | 926.51 | 1.25X |
| Pipeline | 1282.98 | 1013 | 1.26X |

(CLIP, UNet, and VAE are per-call times in ms; "Sum of Three" counts 50 UNet steps plus one CLIP and one VAE call.)

v0.1: 42.45 it/s, v0.1.1: 53.30 it/s

Batch 16

| Module | v0.1 | v0.1.1 | Speedup |
| --- | --- | --- | --- |
| Pipeline | 14931.95 | 11064.81 | 1.34X |

  • BERT: CUDA long-sequence performance will be significantly boosted by the new mem_eff_attention codegen
  • VIT: CUDA large-resolution performance will be significantly boosted by the new mem_eff_attention codegen

@YourFriendlyNeighborhoodMONKE

A friend of mine was able to get his RTX 4090 inference speed from 25-28 it/s to the 61-64 it/s range

He said it was a rather painstaking process to get AITemplate to work, with a lot of errors along the way that he had to solve
... But the results speak for themselves!

Would be very interesting if someone investigated this further and figured out a way to port it to the webui

Here are his specs:
RTX 4090 FE (stock settings), WSL, CUDA 11.6, latest AITemplate (13-11-2022), Intel 12700KF, Windows 11 22H2

Here's a screenshot he provided me: [screenshot]

@bbecausereasonss

Even 25-28 it/s is insane. I'm lucky to get 6-8 on my 4090.

Maximus-CZ commented Nov 14, 2022 via email

0xdevalias commented Nov 14, 2022

> A friend of mine was able to get his RTX 4090 inference speed from 25-28 it/s to the 61-64 it/s range
>
> He said it was a rather painstaking process to get AITemplate to work, with a lot of errors along the way that he had to solve

@YourFriendlyNeighborhoodMONKE Curious if your friend took notes along the way about how they managed to get things going, what issues they ran into, how they solved them, etc. Would be awesome knowledge to have shared here to make others' lives easier, and/or to bootstrap getting it running for this repo!

YourFriendlyNeighborhoodMONKE commented Nov 14, 2022

> Even 25-28 it/s is insane. I'm lucky to get 6-8 on my 4090.

I got about 7.5-8.5 out of the box, which is actually really bad for a 4090, but yeah, it's because AFAIK there's still no support for Lovelace in PyTorch and other areas as well; 3090s are probably beating those numbers out of the box

The easiest optimization you could do would be the cuDNN one, which is just replacing some .dll files in /venv/lib/site-packages/torch/lib/. You can find the files by searching "4090 cudnn" in a discussion thread here and also on r/StableDiffusion

xformers is fairly simple and straightforward too, as auto's webui already supports it out of the box without compiling; all you really need to do is put --xformers into webui.bat after %PYTHON% launch.py %* to get it installed
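For reference, the edited launch line would then read something like `%PYTHON% launch.py %* --xformers` (assuming the stock webui.bat of the time; the usual alternative is adding the flag to COMMANDLINE_ARGS in webui-user.bat).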

I got a little under 20 it/s after those two, which isn't as high as some are able to get, but I'm happy enough and will just wait for better 4090 support and for things like AIT to become available for easy Windows installation or be included in the webui

Remember to back up everything before attempting!

YourFriendlyNeighborhoodMONKE commented Nov 14, 2022

>> A friend of mine was able to get his RTX 4090 inference speed from 25-28 it/s to the 61-64 it/s range
>> He said it was a rather painstaking process to get AITemplate to work, with a lot of errors along the way that he had to solve
>
> @YourFriendlyNeighborhoodMONKE Curious if your friend took notes along the way about how they managed to get things going, what issues they ran into, how they solved them, etc. Would be awesome knowledge to have shared here to make others' lives easier, and/or to bootstrap getting it running for this repo!

I understand, but I doubt he has anything, because the way I understood it, it took him a couple of days of struggle, and he seems pretty advanced in these areas as well. These kinds of things at this stage tend to come with quite varied errors to deal with, which are hardware/software-configuration specific too

I'll ask anyway!

Btw, I saw hlky's comment stating that "at Stable Horde there are about 40 workers that can test AIT on various GPU's"
... So at least there's some interest out there in gathering testing data!

@0xdevalias

Another potential performance gain issue:

@0xdevalias

>> It is just for inference, so it won't be helpful for training. I also tested it; it's good for inference but also takes a really long time to compile.
>
> Just FYI, the compilation time with the latest open-source version has been improved a lot since our first release. In our experience, it can be 4X faster for models where the computation-intensive ops are mostly GEMM-family ops. We've made similar improvements for Conv ops in our internal version, which will be synced to the open-source repo later. Stay tuned. Thanks.

Originally posted by @chenyang78 in facebookincubator/AITemplate#102 (comment)

@mezotaken added the enhancement (New feature or request) label on Jan 12, 2023
@Boom-Hacker

Why improve only the NVIDIA cards? To kill AMD?

@Boom-Hacker

> Why improve only the NVIDIA cards? To kill AMD?

Why improve only the NVIDIA cards? Do you want to kill AMD?

@Boom-Hacker

Every prompt change needs a rebuild; it takes over 2 minutes.

bigmover commented Feb 6, 2024

Btw, hi guys! I'm a newbie to the Stable Diffusion webui. I don't know whether AITemplate is available in it. Any plans to support it?
