
Implement Deepcache Optimization #14210

Draft · wants to merge 5 commits into base: dev

Conversation

@aria1th (Collaborator) commented Dec 5, 2023

Description

DeepCache - yet another optimization.

For adjacent timesteps, the result of each layer can be considered "almost the same" in some cases, so we can simply cache and reuse it.

Note: this is most beneficial when we have many steps, such as in a DDIM setup. It won't produce a dramatic improvement for few-step inference, especially LCM.

The implementation was adapted from the gist snippet and patched for compatibility as well.
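A minimal sketch of the caching idea, for illustration only (not the PR's actual code; every name here is made up): the cheap shallow layers run on every step, while the deep features are recomputed only every few steps and reused in between.

```python
class DeepCacheSketch:
    def __init__(self, refresh_interval=5):
        self.refresh_interval = refresh_interval  # recompute deep features every N steps
        self.cached_deep = None                   # last computed deep-feature tensor

    def forward(self, x, step, shallow_in, deep_blocks, shallow_out):
        h = shallow_in(x)                              # shallow layers always run
        if self.cached_deep is None or step % self.refresh_interval == 0:
            self.cached_deep = deep_blocks(h)          # full pass: refresh the cache
        return shallow_out(h, self.cached_deep)        # otherwise reuse the cached features
```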

Speed benchmark with SD 1.5 models (more results will be added):

- Vanilla: 512x704, 23 steps, DPM++ SDE Karras sampler, 2x hires with Anime6B, 5-sample inference - 2.67 it/s
- HyperTile (all): 3.74 it/s
- DeepCache: 3.02 it/s
- DeepCache + HyperTile: 4.59 it/s

Compatibility

The optimization is compatible with ControlNet, at least (2.6 it/s at 512x680 with 2x hires, vs. 2.0 it/s without). With both DeepCache and HyperTile we can achieve 4.7 it/s - yes, it is faster, because the hires pass reuses the whole cache.

Should be tested

We can currently change the checkpoint with Refiner / Hires. fix. Should the cache be invalidated then, or should we just reuse it?

Screenshots/videos:

[image]

Works with HyperTile too.

Checklist:

@gel-crabs (Contributor)

To test this on SDXL, go to forward_timestep_embed_patch.py and replace "ldm" with "sgm"
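For reference, a hedged sketch of what that swap amounts to (the exact import paths in forward_timestep_embed_patch.py may differ; this only illustrates the ldm → sgm package substitution):

```python
# SD 1.x / 2.x U-Net classes live under the "ldm" package:
# from ldm.modules.diffusionmodules.openaimodel import TimestepEmbedSequential

# For SDXL, the equivalent classes come from the "sgm" package instead
# (path assumed from the generative-models repository layout):
from sgm.modules.diffusionmodules.openaimodel import TimestepEmbedSequential
```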

@FurkanGozukara

Sounds nice.

HyperTile didn't bring any speedup on SDXL.

How about this one?

@gel-crabs (Contributor) commented Dec 5, 2023

> Sounds nice.
>
> HyperTile didn't bring any speedup on SDXL.
>
> How about this one?

Enormous speed boost - around 2-3x faster when it kicks in. However, I'm currently unable to get good-quality results with it; I think the forward timestep embed patch might need to be further adapted to the SGM version, though I'm not sure.

@aria1th (Collaborator, Author) commented Dec 5, 2023

@gel-crabs I will do more testing within 18 hours, but I guess this should work (as they share the same structure).
@FurkanGozukara The XL code was released 5 hours ago, but I will only have a chance to implement this within a day... not immediately. The code seems to be very large, though...

@aria1th (Collaborator, Author) commented Dec 5, 2023

@gel-crabs I guess we might have to adjust the indexes of the in/out blocks; the XL U-Net is deeper, so using the shallow parts too early would cache 'noisy' semantic information.

Note: the current implementation is quite different from the original paper - it follows the gist snippet... and it is more suitable for frequently used samplers.

@gel-crabs (Contributor)

> @gel-crabs I will do more testing within 18 hours, but I guess this should work (as they share the same structure). @FurkanGozukara The XL code was released 5 hours ago, but I will only have a chance to implement this within a day... not immediately. The code seems to be very large, though...

I adapted it to use the SGM code and the results are exactly the same, so it doesn't need to be further adapted to SGM. I'm going to do some testing with the in/out blocks and see how it goes.

@aria1th (Collaborator, Author) commented Dec 6, 2023

Temporary update: I think the implementation should be modified to follow the original paper again.

The original paper says we should sample the values at nearby steps, not on a duration basis.

Although we can then only optimize the final steps, for SDXL I don't think the current approach is accurate... so this should be fixed again.
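A rough sketch of the two refresh policies being contrasted here (hypothetical helpers, not the PR's code): the paper-style policy refreshes the cache every N sampling steps, while the duration-style variant refreshes once the diffusion timestep has drifted far enough from the cached one.

```python
def needs_refresh_step_based(step: int, interval: int) -> bool:
    # Paper-style: recompute the deep features every `interval` sampling steps.
    return step % interval == 0

def needs_refresh_duration_based(timestep: float, cached_timestep: float, span: float) -> bool:
    # Duration-style: recompute once the diffusion timestep has moved more than
    # `span` away from the timestep at which the cache was last filled.
    return abs(cached_timestep - timestep) >= span
```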

[image]

[image]

Block indexes: 0, 0, 0
[image]

@aria1th aria1th marked this pull request as draft December 6, 2023 02:39
@gel-crabs (Contributor) commented Dec 6, 2023

Alright, I think I've found the correct blocks for SDXL:
[image]
So pretty much just the Cache In Block Indexes changed to 8 and 7.

There is still quality loss; the contrast is noticeably higher, which I've found is caused by the mid-block cache.

@aria1th (Collaborator, Author) commented Dec 6, 2023

768x768 test

- HyperTile only: 7.86 it/s
- Index 0, 0, 0: cache rate 27.23%, 8.03 it/s
- Index 8, 8, 8: cache rate 27.23%, 8.61 it/s
- Index 0, 0, 5: cache rate 42.37%, 10.8 it/s
- Index 0, 0, 6: cache rate 45.4%, 11.1 it/s
- Index 0, 0, 8: cache rate 51.45%, 11.51 it/s
- Index 0, 0, 8 + cache out start timestep 600: cache rate 46.2%, 10.42 it/s
- Index 0, 0, 8 + cache out start timestep 600 + interval 50: cache rate 34.9%, 9.18 it/s

@gel-crabs I think we can use 0, 0, 8 for most cases.

@VainF commented Dec 6, 2023

Very interesting results. Thanks for your effort @aria1th! If you need any assistance, please feel free to reach out to us at any time.

@FurkanGozukara

The cache looks like it degrades quality significantly? @aria1th

Also, HyperTile looks like it does not degrade quality, right?

@aria1th (Collaborator, Author) commented Dec 6, 2023

@FurkanGozukara Yes, quality is degraded with XL-type models - it requires more experiments, or maybe a re-implementation. That did not happen with 1.5-type models, though.

@gel-crabs (Contributor)

> @FurkanGozukara Yes, quality is degraded with XL-type models - it requires more experiments, or maybe a re-implementation. That did not happen with 1.5-type models, though.

I have a feeling it has something to do with the extra IN/MID/OUT blocks in SDXL. For instance in SD 1.5 IN710 corresponds to a layer, while in SDXL the equivalent is IN710-719 (so 10 blocks compared to 1).

The Elements tab in the SuperMerger extension is really good for showing this information. The middle block has 9 extra blocks in SDXL as well, so I'm betting it has something to do with that.

@gel-crabs (Contributor) commented Dec 6, 2023

Oops, didn't see the new update. MUCH less quality loss than before. I'm going to keep testing and see what I can find.

So these are the settings, right?

In block index: 0
In block index 2: 0
Out block index: 8

@gel-crabs (Contributor)

Sorry for the spam, results and another question:

So with these settings on SDXL:

In block index: 8
In block index 2: 8
Out block index: 0
All starts set to 800, plus timestep refresh set to 50

I get next to no quality loss (even an upgrade!); however, the speedup is smaller, pretty much equivalent to a second HyperTile. So my question is: does the block cache index have any effect on the blocks before or after it? For instance, if the out block index is set to 8, does it cache the ones before it as well?

I ask this because there is another output block with the same resolution, which could be cached in addition to the output block cached already. I've gotten similarly high quality (and faster) results with in-blocks set to 7 and 8, which are the same resolution on SDXL.

If it gels with DeepCache I think a second Cache Out Block Index could result in a further speedup.

@aria1th (Collaborator, Author) commented Dec 6, 2023

@gel-crabs I fixed some explanations - for the "in" types, the setting applies from the index onward, so -1 means everything is cached.
For the "out" types, it applies up to the index, so 9 means everything is cached.
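A hypothetical sketch of those index semantics (the names are illustrative, not the PR's actual variables):

```python
def cache_in_block(block_idx: int, cache_in_index: int) -> bool:
    # "in" blocks: caching applies after the configured index,
    # so cache_in_index = -1 caches every input block.
    return block_idx > cache_in_index

def cache_out_block(block_idx: int, cache_out_index: int) -> bool:
    # "out" blocks: caching applies before the configured index,
    # so a large enough value (e.g. 9) caches every output block.
    return block_idx < cache_out_index
```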

The timestep is a rather important setting - if we use 1000, it means we never refresh the cache once we have it.

This holds for 1.5-type models - which means they seem to already know what to draw at the first cache point (!). This somehow explains a few more things too... anyway.

However, XL models seem to have a problem with this - they have to refresh the cache frequently; they are very dynamic with it.

Unfortunately, refreshing the cache directly increases the cache failure rate, and thus reduces the performance gain...

I'll test with mid blocks too.

@aria1th (Collaborator, Author) commented Dec 6, 2023

I should also explain why quality gets degraded even when we cache less than everything - it's about input-output mismatching.

To summarize, the caches come in corresponding pairs (as U-Net blocks).

In other words, if we increase the input block id level, we have to decrease the output block id level.

(Images will be attached for further reference.)
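A simplified sketch of why the in/out indexes should mirror each other (hypothetical, not the PR's code): each output block consumes the skip connection saved by a matching input block, so caching one side of a pair without the other feeds mismatched features into the corresponding block.

```python
def unet_forward_sketch(x, input_blocks, middle_block, output_blocks):
    skips = []
    h = x
    for block in input_blocks:
        h = block(h)
        skips.append(h)            # saved for the paired output block
    h = middle_block(h)
    for block in output_blocks:
        h = block(h, skips.pop())  # output block i pairs with input block N-1-i
    return h
```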

However, I guess I should use a more recent implementation - or convert it from the diffusers pipeline... I'll be able to do this in about 12-24 hours.
https://gist.github.com/laksjdjf/435c512bc19636e9c9af4ee7bea9eb86

@aria1th (Collaborator, Author) commented Dec 7, 2023

New implementation - should be tested though
https://github.com/aria1th/sd-webui-deepcache-standalone

SD 1.5

512x704 test, with caching disabled for the initial 40% of steps.

Steps: 23, Sampler: DPM++ SDE Karras, CFG scale: 8, Seed: 3335110679, Size: 512x704, Model hash: 8c838299ab, VAE hash: 79e225b92f, VAE: blessed2.vae.pt, Denoising strength: 0.5, Hypertile U-Net: True, Hypertile U-Net max depth: 2, Hypertile U-Net max tile size: 64, Hypertile U-Net swap size: 12, Hypertile VAE: True, Hypertile VAE swap size: 2, Hires upscale: 2, Hires upscaler: R-ESRGAN 4x+ Anime6B, Version: v1.7.0-RC-16-geb2b1679

- Enabled, reusing cache for HR steps: 5.68 it/s
- Enabled: 4.66 it/s
- Vanilla with HyperTile: 2.21 it/s
- Vanilla without HyperTile: 1.21 it/s
- Vanilla with DeepCache only: 2.83 it/s

SD XL:

1girl
Negative prompt: easynegative, nsfw
Steps: 23, Sampler: DPM++ SDE Karras, CFG scale: 8, Seed: 3335110679, Size: 768x768, Model hash: 9a0157cad2, VAE hash: 235745af8d, VAE: sdxl_vae(1).safetensors, Denoising strength: 0.5, Hypertile U-Net: True, Hypertile U-Net max depth: 2, Hypertile U-Net max tile size: 64, Hypertile U-Net swap size: 12, Hypertile VAE: True, Hypertile VAE swap size: 2, Hires upscale: 2, Hires upscaler: R-ESRGAN 4x+ Anime6B, Version: v1.7.0-RC-16-geb2b1679

- DeepCache + HR + HyperTile: 2.65 it/s, 16.41 GB (fp16)
- Without optimization: 1.47 it/s

maybe... some invalid interrupt method?

Commits:
- move to paper implementation
- fix descriptions, KeyError
- handle sgm for XL
- fix ruff, change default for out_block
- Implement Deepcache Optimization
@aria1th (Collaborator, Author) commented Dec 7, 2023

@gel-crabs Now it should work for both models!

@aria1th aria1th marked this pull request as ready for review December 7, 2023 16:54
@gel-crabs (Contributor) commented Dec 7, 2023

> @gel-crabs Now it should work for both models!

Yeah, it works great! What Cache Resnet level did you use for SDXL?

(Also, what is your Hypertile VAE max tile size?)

Oh yeah, and another thing: I'm getting this in the console.

[image]

But yeah, the speedup here is absolutely immense. Do not miss out on this.

@aria1th (Collaborator, Author) commented Dec 7, 2023

@gel-crabs Resnet level 0, which is the maximum, as intended - the VAE max tile size was set to 128, swap size 6.
The logs are removed!

@gel-crabs (Contributor)

> @gel-crabs Resnet level 0, which is the maximum, as intended - the VAE max tile size was set to 128, swap size 6. The logs are removed!

Ahh, thank you! One more thing, perhaps another step percentage for HR fix?

Also, this literally halves the time it takes to generate an image. And it barely even changes the image at all. Thank you so much for your work.

@aria1th (Collaborator, Author) commented Dec 7, 2023

@gel-crabs HR fix will use 100% cache (if the option is enabled; also, the success/failure rate reporting now requires rework - some counts are per step and some are per function call...).
But I guess it has to be checked with ControlNet / other extensions too.

@gel-crabs (Contributor)

> @gel-crabs HR fix will use 100% cache (if the option is enabled; also, the success/failure rate reporting now requires rework - some counts are per step and some are per function call...). But I guess it has to be checked with ControlNet / other extensions too.

Dang, I just checked with ControlNet and it makes the image go full orange. Dynamic Thresholding works perfectly though.

@aria1th aria1th marked this pull request as draft December 7, 2023 18:22
@aria1th (Collaborator, Author) commented Dec 7, 2023

https://github.com/Mikubill/sd-webui-controlnet/blob/main/scripts/hook.py#L425
Okay, this explains why we have a bunch more big code...

@aria1th (Collaborator, Author) commented Dec 8, 2023

https://github.com/aria1th/sd-webui-controlnet/tree/maybe-deepcache-wont-work

I was trying various implementations, including the diffusers pipeline, and I guess it does not work well with ControlNet...

horseee/DeepCache#4

ControlNet obviously handles timestep-dependent embeddings, which changes the output of the U-Net drastically.

Thus, this is the expected output:

[image]

Compared to this:
[image]

Also, I had to patch the ControlNet extension; somehow the hook override was not working when I supplied the patched function in-place - even though it executed correctly, it completely ignored ControlNet.

Thus, at this point, I will just continue to release this as an extension - unless someone comes up with great compatible code, you should only use it without ControlNet 😢

@gel-crabs (Contributor)

Aww man, that sucks. This is seriously a game changer. :(

Also, it doesn't appear to work with FreeU. The HR fix only speeds up after the original step percentage, I assume because it doesn't cache the steps before the step percentage.

@aria1th (Collaborator, Author) commented Dec 8, 2023

@gel-crabs Yeah, most of the U-Net forward-hijacking functions won't work with this; it assumes that the effects of nearby steps are similar.

Some more academic notes:

- DDIM works well with this: its hidden states change smoothly, so we can reuse nearby values.
- LCM won't even work with this.
- Some schedulers change drastically in the initial steps, so we can safely disable caching for those steps - yes, that's what you see as a parameter (a rough sketch follows below).
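A minimal sketch of that parameter (a hypothetical helper with an assumed default, not the extension's actual option): caching only becomes active after an initial fraction of the sampling schedule, where the latents are still changing quickly.

```python
def caching_enabled(step: int, total_steps: int, disable_fraction: float = 0.4) -> bool:
    # Skip caching while the sampler is still making large updates to the latents.
    return step >= int(total_steps * disable_fraction)
```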

It means that whenever the U-Net outputs have to change, the caching will mess things up.

But I guess this can be kind of useful for training - we can force the model to denoise under the cache assumption?
(Meanwhile, HyperTile is already useful for training.)

@bigmover commented Mar 4, 2024

> @gel-crabs Yeah, most of the U-Net forward-hijacking functions won't work with this; it assumes that the effects of nearby steps are similar.
>
> Some more academic notes: DDIM works well with this: its hidden states change smoothly, so we can reuse nearby values. LCM won't even work with this. Some schedulers change drastically in the initial steps, so we can safely disable caching for those steps - yes, that's what you see as a parameter.
>
> It means that whenever the U-Net outputs have to change, the caching will mess things up.
>
> But I guess this can be kind of useful for training - we can force the model to denoise under the cache assumption? (Meanwhile, HyperTile is already useful for training.)

Is DeepCache available in the WebUI now? How can we use it?

@aria1th (Collaborator, Author) commented Mar 4, 2024

@bigmover https://github.com/aria1th/sd-webui-deepcache-standalone
Please use the extension, and note that it can't be used with ControlNet / some other specific U-Net hijacking extensions.

@bigmover commented Apr 16, 2024

> @bigmover https://github.com/aria1th/sd-webui-deepcache-standalone Please use the extension, and note that it can't be used with ControlNet / some other specific U-Net hijacking extensions.

I appreciate your hard and awesome work! I'd like to know whether ControlNet can now be used together with DeepCache, or whether there is any plan to develop that.
