
[Bug]: Slow .safetensors loading #5893

Closed
1 task done
freecoderwaifu opened this issue Dec 20, 2022 · 19 comments · Fixed by #5907 or huggingface/safetensors#140
Labels
bug-report Report of a bug, yet to be confirmed

Comments


freecoderwaifu commented Dec 20, 2022

Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

What happened?

System specs: 5800X3D, RTX 3080 12GB, 64GB DDR4-3600, Windows 11 22H2 22621.963, latest Nvidia driver (527.56), latest BIOS update.

This is most likely an issue specific to my PC, but I've seen a couple of comments about it on other sites. .safetensors models load significantly slower than the same model in .ckpt: around 2-3 minutes for a safetensors, compared to less than 10 seconds for the .ckpt.

However, safetensors do load fast when doing merges, both with the built-in merger and with merge extensions. I think this is because merges mostly load into RAM rather than fully into VRAM, but switching to either of the models used in a merge after the merge is done also makes them load instantly.

Troubleshooting I've tried so far:
- Launching with and without optimizations
- Fresh UI reinstall
- Deleted and rebuilt venv
- set SAFETENSORS_FAST_GPU=1 and without the set parameter
- Python 3.10.6, 3.10.8 and 3.10.9
- No extensions
- No antivirus
- No GPU undervolt
- Setting python.exe to Max Performance in the Nvidia panel
- Different browsers
- Browser HW acceleration on and off

The video card works as it should for everything else.

The most notable thing I've noticed is that when loading a .ckpt, both python.exe and System show reads in Resource Manager. When loading a .safetensors, only python.exe shows reads.

.ckpt loading:
[Screenshot: Resource Manager reads during .ckpt loading]
safetensors loading:
[Screenshot: Resource Manager reads during .safetensors loading]

Steps to reproduce the problem

  1. Go to Stable Diffusion checkpoint dropdown
  2. Load safetensors model
  3. It takes 2-3 minutes to load, compared to the same model in .ckpt

What should have happened?

  1. Go to Stable Diffusion checkpoint dropdown
  2. Load safetensors model
  3. It should load in less than 10 seconds, like the same model in .ckpt does.

Commit where the problem happens

685f963

What platforms do you use to access UI ?

Windows

What browsers do you use to access the UI ?

Brave

Command Line Arguments

Optimized .bat:

--xformers --deepdanbooru --gradio-img2img-tool color-sketch --gradio-inpaint-tool=color-sketch --opt-split-attention --precision autocast --opt-channelslast --api

Non optimized:
--deepdanbooru 

Both show the same issue.

Additional information, context and logs

No response

@freecoderwaifu freecoderwaifu added the bug-report Report of a bug, yet to be confirmed label Dec 20, 2022
@aliencaocao (Contributor)

I have the same observation; I thought it was supposed to load faster, as advertised.
cc @Narsil


Narsil commented Dec 21, 2022

Hi thanks for the ping, this definitely shouldn't happen.

Do you mind sharing more details? I need to try to reproduce this, but so far I have failed (I must confess I'm only using cloud Windows, since I don't own a Windows machine anymore).

they mostly only load into RAM and not fully into VRAM

Loading on CPU is always going to be faster than loading onto GPU, yes.
However, it's very surprising that the read speeds are so different for the two files.

Could you maybe isolate the issue?

import datetime

import torch
from safetensors.torch import load_file

# torch.load takes map_location (not device) to place tensors on the GPU.
start = datetime.datetime.now()
weights = torch.load("riffusion-model-v1.ckpt", map_location="cuda:0")
print(f"Loaded PT in {datetime.datetime.now() - start}")

# safetensors takes a device argument directly.
start = datetime.datetime.now()
weights = load_file("v1-5-pruned-emaonly.safetensors", device="cuda:0")
print(f"Loaded SF in {datetime.datetime.now() - start}")

If SF is indeed slower, we can also try this: https://gist.github.com/Narsil/3edeec2669a5e94e4707aa0f901d2282 to check whether the slowness is actually in the lib or not (it needs a bit of adaptation to load the checkpoint you want onto GPU).

And report if you see the same thing (just trying to take webui out of the equation if possible).

In the meantime, I will keep trying to reproduce.

NB: The two files are not the same size, it seems; the SF one is smaller, so it should be faster to load.


Narsil commented Dec 21, 2022

Loading weights [3aafa6fe] from C:\Users\Administrator\src\stable-diffusion-webui\models\Stable-diffusion\riffusion-model-v1.ckpt
Loaded C:\Users\Administrator\src\stable-diffusion-webui\models\Stable-diffusion\riffusion-model-v1.ckpt in 0:01:15.861647
Applying cross attention optimization (Doggettx).
Weights loaded. in 0:01:19.734654
Loading weights [d7049739] from C:\Users\Administrator\src\stable-diffusion-webui\models\Stable-diffusion\v1-5-pruned-emaonly.safetensors
Loaded C:\Users\Administrator\src\stable-diffusion-webui\models\Stable-diffusion\v1-5-pruned-emaonly.safetensors in 0:00:00.308988
Applying cross attention optimization (Doggettx).
Weights loaded. in 0:00:44.164628

Here is what I get on a brand new Windows (Server 2022) machine with CUDA support.
I merely added two timings within sd-webui: one just after loading the file, and one at the final "Weights loaded." message.

So it is indeed spending a lot of time doing something AFTER loading the weights, rather than loading the correct weights for SD-1.5 directly in their proper form.

I'll investigate a bit more to see what's going on for SD-1.5.


Narsil commented Dec 21, 2022

OK, I was able to investigate. It seems map_location or shared.weight_load_location resolves to None, meaning the tensors are loaded on CPU (which is extremely fast) but are then sent to VRAM with load_state_dict (which is painfully slow, and I don't know why).

https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/master/modules/sd_models.py#L170

I changed that line and forced device="cuda:0" and then I got those results:

Loaded C:\Users\Administrator\src\stable-diffusion-webui\models\Stable-diffusion\v1-5-pruned-emaonly.safetensors in 0:00:05.154992
Using map_location None
Loaded state dict 0:00:02.660000
Model half 0:00:02.677993
Model first stage to 0:00:02.696012
Applying cross attention optimization (Doggettx).
HIGHJACK stuff in 0:00:08.760007
callback stuff in 0:00:08.760007
load on device cuda in 0:00:09.488031
Weights loaded. in 0:00:09.488031

So the load time is slower (it loads directly on GPU) but it seems faster overall (because there's no need to reallocate on GPU afterwards).

Could that explain the issue you're having?

As for a fix, I'm not sure what the proper fix is here.
IMO the device= should always be set explicitly (both for PT and SF) instead of using None (the device is stored in the pickle files themselves, but that's not really portable: if some weights were saved on GPU, they can no longer be loaded on a machine without a GPU).
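
A minimal sketch of that idea, with illustrative filenames (this is a sketch, not the actual webui code):

import torch
from safetensors.torch import load_file

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# map_location overrides whatever device was recorded when the pickle was
# saved, so a checkpoint saved on GPU still loads on a CPU-only machine.
pt_state = torch.load("model.ckpt", map_location=device)

# safetensors stores no device information at all; the caller always chooses.
sf_state = load_file("model.safetensors", device=device)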

Narsil added a commit to Narsil/stable-diffusion-webui that referenced this issue Dec 21, 2022

Narsil commented Dec 21, 2022

Created #5907, which hopefully fixes it. If you could try it out and confirm, that would be nice. (I'm not convinced it's the "best" fix, though.)

@aliencaocao (Contributor)

@Narsil You are right that this is more of a web UI issue. I ran your test script, and safetensors loads 3x faster than PyTorch, from an HDD.
Using your PR, I did not notice any difference in load timings. The difference is less than 0.5 seconds (out of 38 seconds of loading; I am loading 3 different models on startup, so it's much longer).


Narsil commented Dec 21, 2022

Using your PR, i did not notice any difference in load timings.

Do you need help debugging? The other thing I would look at is fp16 vs fp32: safetensors will load whatever is on disk, but if you then call .half() afterwards, that also incurs a copy (PT should do the same work, though if the two files happen to store different dtypes on disk, differences could occur).
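
For illustration, a quick way to see that dtype copy (synthetic tensors, not the actual checkpoints):

import torch

fp32 = torch.randn(1024, 1024, dtype=torch.float32)  # as if loaded from an fp32 file
fp16 = torch.randn(1024, 1024, dtype=torch.float16)  # as if loaded from an fp16 file

print(fp32.half() is fp32)  # False: a new fp16 tensor is allocated (extra copy)
print(fp16.half() is fp16)  # True: already fp16, so .half() returns the same tensor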

I ran your test script and safe tensor loads 3x faster than pytorch, from HDD.

Hm, have you made sure to load similar checkpoints? riffusion is 14 GB while sd-1.5 is ~5 GB. (The scripts load the two files you showed in your issue for the sake of similarity, but it's not necessarily a fair comparison :) )


aliencaocao commented Dec 21, 2022

I loaded the animefull model for your comparison script: the exact same model, just converted from PyTorch to safetensors.

I have fp16 models for everything and half precision turned on; they should be the same.


Narsil commented Dec 21, 2022

I have fp16 model for everything and half turned on, they should be the same

Can you try turning it off? Just as a sanity check?

What I could do is publish the version with all timings enabled somewhere; that should provide some insight into what's going on for you.

@aliencaocao (Contributor)

Using --no-half nearly doubled the loading time for me.


Narsil commented Dec 21, 2022

It's odd that it doubles (the overall timings). One of the two dtypes should match what's on disk and so load without a copy. Unless both versions somehow create copies, in which case it's expected (since fp32 is twice the size of fp16)...

Could you try this:

#5913

and report the timings? (They will show in the shell.)


freecoderwaifu commented Dec 21, 2022

Tried the script, and the safetensors model loaded in 35s and showed reads in Resource Manager equivalent to .ckpt loading. Tried the PR, but it unfortunately still shows the same symptoms with and without launch args, at least for me, and reads never go above 20 MB/s.

Another thing I had tried before is using safetensors converted with the Checkpoint Merger vs. converted with the script in the original PR vs. safetensors downloaded directly from the model provider, but all of them load slowly. Another was using --disable-safe-unpickle, since someone suggested it on another site, but that also didn't show improvements. Additional things from the Windows side, just in case: HAGS on vs. off, manually setting the OpenGL GPU in the Nvidia panel, and Exploit Protection on vs. off for python.exe, but those also had no impact. The antivirus also doesn't show any reading done on the files.

It's definitely something with the UI, since it only happens when loading a .safetensors from the Checkpoint selector; the standalone script provided above does make them load faster, and whatever the Checkpoint Merger and extensions like the one below do to load models also bypasses the slow load.

https://github.com/bbc-mc/sdweb-merge-block-weighted-gui


Narsil commented Dec 21, 2022

@freecoderwaifu can you provide the timings from the branch I created? It might provide insight into WHAT exactly is slow.


freecoderwaifu commented Dec 21, 2022

Whoops, my bad. I had manually copy-pasted the changes before; now I git-pulled the branch (#5913), but I still couldn't find the timings in the console. I also can't find the timings in the UI: the built-in log shows blank after loading a model but works fine for everything else (though that could possibly be due to some extension, I don't know).

[Screenshot: console output]

I had also only tried the safetensors load script before; when I tried the ckpt script, I got an error.

[Screenshot: ckpt script error]

For the torch version, this is what the Dreambooth extension reports:
[Screenshot: torch version reported by the Dreambooth extension]

But slow safetensors loading also happens on a fresh install with no extensions installed.

Overall I've noticed that all .ckpt load much faster than I remember them loading before, especially compared to when I started using the UI about 2 months ago. Most 2-4 GB .ckpt files load in less than 10s. Even the biggest .ckpt, Riffusion at 14 GB, which I used as a quick example, loads in less than a minute; it's only safetensors that seem to be affected.


freecoderwaifu commented Dec 22, 2022

I tried a fresh install in a new folder to test whether having models categorized in subfolders would somehow cause this, and I only copied a handful of models to the new folder. They all loaded instantly, but then I realized it was only because they were already cached in RAM from the copy. Resetting the GPU driver and clearing memory with RAMMap, or just rebooting, reverts back to the slow speeds.

I also tried installing the UI on a different HDD. I got higher reads in Resource Manager when loading the safetensors: it loaded 5 seconds faster using the script, and through the UI it loaded maybe 10-15s faster than on the other HDD, but it was still noticeably slower than the same .ckpt.

[Screenshot: Resource Manager reads on the other HDD]

One last thing I tried, just in case, was Resizable BAR on vs. off, but it had no effect.


Narsil commented Dec 23, 2022

Do you mind checking out the PR again? I added even more prints. (What we need is to understand where the time is spent during this whole process.)


Narsil commented Dec 23, 2022

Okay, I think I found something: huggingface/safetensors#140

Basically, SAFETENSORS_FAST_GPU=1 was having no effect on Windows (well, maybe even a bad effect sometimes, since it was spending too much time looking for a symbol it couldn't find; more info in the PR).
I'm not sure if it affected everyone, or only a subcategory of users.

Nevertheless, loading on GPU was indeed slower with safetensors than with pickle when both loaded onto the GPU.
I was confused by this comment:

I ran your test script and safe tensor loads 3x faster than pytorch, from HDD.

I kind of stumbled upon it by realising that if I forgot to set SAFETENSORS_FAST_GPU, the timings would be the same...

All in all, the most important factor is still being able to load the weights directly where they are going to be used (here, the GPU), both for PT and SF (see my initial PR, which I will start cleaning up).
But there was indeed a bug in safetensors on Windows that caused the fast GPU path not to work (and it ended up being slower as a consequence).

Edit: I must note that the speedups I'm seeing on Windows are much lower than what I get on Linux. It could be linked to the CUDA driver.
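
For anyone wanting to check whether the fast path makes a difference on their machine, here is a rough timing sketch (the filename is illustrative; the variable should be set before loading):

import datetime
import os

os.environ["SAFETENSORS_FAST_GPU"] = "1"  # remove to compare against the default path

from safetensors.torch import load_file

start = datetime.datetime.now()
weights = load_file("model.safetensors", device="cuda:0")
print(f"Loaded SF in {datetime.datetime.now() - start}")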

@freecoderwaifu (Author)

Thank you, I pulled the updated PR and I can see the timers now. safetensors also load much faster with it, with 2 GB files loading in only 14-15s now. The speed improvement does seem to be from the PR alone; nothing else has changed on my system.

SD 1.5 .ckpt:
[Screenshot: SD 1.5 ckpt load timings]

SD 1.5 safetensors:
[Screenshot: SD 1.5 safetensors load timings]

Additional safetensors test:
[Screenshot: additional safetensors load timings]

2GB safetensors:
[Screenshot: 2GB safetensors load timings]


aliencaocao commented Dec 24, 2022

"Checking for accelerate"
Python 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]
Commit hash: bf0c4c131920b1cff1c1754eddbfe326baa6cc15
Installing requirements for scikit_learn

Launching Web UI with arguments: --force-enable-xformers --listen --api --disable-safe-unpickle --opt-channelslast --enable-insecure-extension-access --theme dark
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading weights [eb01a8dc] from E:\PyCharmProjects\stable-diffusion-webui\models\Stable-diffusion\Anything-V3.0-pruned.safetensors
Loaded E:\PyCharmProjects\stable-diffusion-webui\models\Stable-diffusion\Anything-V3.0-pruned.safetensors in 0:00:31.845676
Using map_location None
Using shared.weight_load_location None
Using device cuda:0
Read state dict 0:00:31.846677
Loaded state dict 0:00:32.684437
Moved to channel last 0:00:33.069787
Model half 0:00:33.578248
Model first stage to 0:00:33.580250
Loading VAE weights from: E:\PyCharmProjects\stable-diffusion-webui\models\VAE\Anything-V3.0.vae.pt
Applying xformers cross attention optimization.
Model loaded.
Loaded a total of 50 textual inversion embeddings.
...

It does indeed seem to speed things up for me.
This is loading from an HDD.

Narsil added a commit to huggingface/safetensors that referenced this issue Feb 27, 2023
default).

This is now only a deprecation notice.

The reason for this:
- Moving to `0.3` for the alignment modification allowing for more
  change.
- Not specifying the default has been a real performance hurt:
  AUTOMATIC1111/stable-diffusion-webui#5893
  https://github.com/huggingface/diffusers/blob/e5810e686ea4ac499e325c2961808c8972dee039/src/diffusers/models/modeling_utils.py#L103

  This should only affect from disk -> CPU/GPU since this is where the
  location is modified. When loading from bytes, the location is already
  CPU so it's natural to use CPU (no alloc).

- Giving 2 "minor" versions before dropping support, this should allow
  users to have time to move.