
[Bug]: Slow .safetensors loading #5893

Closed
1 task done
freecoderwaifu opened this issue Dec 20, 2022 · 19 comments · Fixed by #5907 or huggingface/safetensors#140
Labels
bug-report Report of a bug, yet to be confirmed

Comments


freecoderwaifu commented Dec 20, 2022

Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

What happened?

System specs: 5800X3D, RTX 3080 12GB, 64GB DDR4-3600, Windows 11 22H2 22621.963, latest Nvidia driver (527.56), latest BIOS update.

This is most likely an issue specific to my PC, but I've seen a couple of comments about it on other sites. .safetensors models load significantly slower than the same model in .ckpt: around 2-3 minutes for a safetensors, compared to less than 10 seconds for the .ckpt.

However, safetensors do load fast when doing merges, both with the built-in merger and with merge extensions. I think this is because merges mostly load into RAM rather than fully into VRAM, but switching to either of the models used in a merge after the merge is done also makes them load instantly.

Troubleshooting I've tried so far:
- Launching with and without optimizations
- Fresh UI reinstall
- Deleted and rebuilt venv
- set SAFETENSORS_FAST_GPU=1 and without the set parameter
- Python 3.10.6, 3.10.8 and 3.10.9
- No extensions
- No antivirus
- No GPU undervolt
- Setting python.exe to Max Performance in the Nvidia panel
- Different browsers
- Browser HW acceleration on and off

The video card works as it should for everything else.

The most notable thing I've noticed is that when loading a .ckpt, both python.exe and System show reads in Resource Manager. When loading a .safetensors, only python.exe shows reads.

.ckpt loading:
[Screenshot: Resource Manager reads during .ckpt loading]
safetensors loading:
[Screenshot: Resource Manager reads during .safetensors loading]

Steps to reproduce the problem

  1. Go to Stable Diffusion checkpoint dropdown
  2. Load safetensors model
  3. It takes 2-3 minutes to load, compared to the same model in .ckpt

What should have happened?

  1. Go to Stable Diffusion checkpoint dropdown
  2. Load safetensors model
  3. It should load in less than 10 seconds, like the same model in .ckpt does.

Commit where the problem happens

685f963

What platforms do you use to access UI ?

Windows

What browsers do you use to access the UI ?

Brave

Command Line Arguments

Optimized .bat:

--xformers --deepdanbooru --gradio-img2img-tool color-sketch --gradio-inpaint-tool=color-sketch --opt-split-attention --precision autocast --opt-channelslast --api

Non optimized:
--deepdanbooru 

Both show the same issue.

Additional information, context and logs

No response

@freecoderwaifu freecoderwaifu added the bug-report Report of a bug, yet to be confirmed label Dec 20, 2022
@aliencaocao (Contributor)

I have the same observation; I thought it was supposed to load faster, as advertised.
cc @Narsil


Narsil commented Dec 21, 2022

Hi thanks for the ping, this definitely shouldn't happen.

Do you mind sharing more details? I need to try to reproduce this, but so far I have failed (I must confess I'm only using cloud Windows, since I don't own a Windows machine anymore).

they mostly only load into RAM and not fully into VRAM

Loading on CPU is always going to be faster than loading onto GPU, yes.
However, it's very surprising that the read speeds are so different for the two files.

Could you maybe isolate the issue?

import datetime

import torch
from safetensors.torch import load_file

# torch.load takes map_location (not device) to place tensors on the GPU.
start = datetime.datetime.now()
weights = torch.load("riffusion-model-v1.ckpt", map_location="cuda:0")
print(f"Loaded PT in {datetime.datetime.now() - start}")

# safetensors takes a device argument directly.
start = datetime.datetime.now()
weights = load_file("v1-5-pruned-emaonly.safetensors", device="cuda:0")
print(f"Loaded SF in {datetime.datetime.now() - start}")

If SF is indeed slower, we can also try this: https://gist.github.com/Narsil/3edeec2669a5e94e4707aa0f901d2282 to check whether the slowness is actually in the lib or not (it needs a bit of adaptation to load the checkpoint you want onto GPU).

And report if you see the same thing (just trying to take webui out of the equation if possible).

In the meantime, I will keep trying to reproduce.

NB: The two files are not the same size, it seems; the SF one is smaller, so it should be faster to load.


Narsil commented Dec 21, 2022

Loading weights [3aafa6fe] from C:\Users\Administrator\src\stable-diffusion-webui\models\Stable-diffusion\riffusion-model-v1.ckpt
Loaded C:\Users\Administrator\src\stable-diffusion-webui\models\Stable-diffusion\riffusion-model-v1.ckpt in 0:01:15.861647
Applying cross attention optimization (Doggettx).
Weights loaded. in 0:01:19.734654
Loading weights [d7049739] from C:\Users\Administrator\src\stable-diffusion-webui\models\Stable-diffusion\v1-5-pruned-emaonly.safetensors
Loaded C:\Users\Administrator\src\stable-diffusion-webui\models\Stable-diffusion\v1-5-pruned-emaonly.safetensors in 0:00:00.308988
Applying cross attention optimization (Doggettx).
Weights loaded. in 0:00:44.164628

Here is what I get on a brand new Windows (Server 2022) machine with CUDA support.
I merely added two timings within sd-webui: one just after loading the file, and one at the final "Weights loaded." message.

So it is indeed spending a lot of time doing something AFTER loading the weights, rather than loading the correct weights for SD-1.5 directly in their proper form.

I'll investigate a bit more to see what's going on for SD-1.5.


Narsil commented Dec 21, 2022

OK, I was able to investigate. It seems map_location or shared.weight_load_location resolves to None, meaning the tensors are loaded on CPU (which is extremely fast) but are then sent to VRAM with load_state_dict (which is painfully slow, and I don't know why).

https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/master/modules/sd_models.py#L170

I changed that line and forced device="cuda:0" and then I got those results:

Loaded C:\Users\Administrator\src\stable-diffusion-webui\models\Stable-diffusion\v1-5-pruned-emaonly.safetensors in 0:00:05.154992
Using map_location None
Loaded state dict 0:00:02.660000
Model half 0:00:02.677993
Model first stage to 0:00:02.696012
Applying cross attention optimization (Doggettx).
HIGHJACK stuff in 0:00:08.760007
callback stuff in 0:00:08.760007
load on device cuda in 0:00:09.488031
Weights loaded. in 0:00:09.488031

So the load time is slower (it loads directly on GPU) but it seems faster overall (because there's no need to reallocate on GPU afterwards).

Could that explain the issue you're having?

As for a fix, I'm not sure what the proper fix is here.
IMO the device= should always be set explicitly (both for PT and SF) instead of using None (the device is stored in the pickle files themselves, but that's not really portable: if some weights were saved on GPU, they can no longer be loaded on a machine without a GPU).
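
A minimal sketch of that idea, with illustrative filenames (this is a sketch, not the actual webui code):

import torch
from safetensors.torch import load_file

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# map_location overrides whatever device was recorded when the pickle was
# saved, so a checkpoint saved on GPU still loads on a CPU-only machine.
pt_state = torch.load("model.ckpt", map_location=device)

# safetensors stores no device information at all; the caller always chooses.
sf_state = load_file("model.safetensors", device=device)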

Narsil added a commit to Narsil/stable-diffusion-webui that referenced this issue Dec 21, 2022

Narsil commented Dec 21, 2022

Created #5907, which hopefully fixes it. If you could try it out and confirm, that would be nice. (I'm not convinced it's the "best" fix, though.)

@aliencaocao (Contributor)

@Narsil You are right that this is more of a web UI issue. I ran your test script, and safetensors loads 3x faster than PyTorch, from an HDD.
Using your PR, I did not notice any difference in load timings. The difference is less than 0.5 seconds (out of 38 seconds of loading; I am loading 3 different models on startup, so it's much longer).


Narsil commented Dec 21, 2022

Using your PR, i did not notice any difference in load timings.

Do you need help debugging? The other thing I would look at is fp16 vs fp32: safetensors will load whatever is on disk, but if you then call .half() afterwards, that also incurs a copy (PT should do the same work, though if the two files happen to store different dtypes on disk, differences could occur).
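
For illustration, a quick way to see that dtype copy (synthetic tensors, not the actual checkpoints):

import torch

fp32 = torch.randn(1024, 1024, dtype=torch.float32)  # as if loaded from an fp32 file
fp16 = torch.randn(1024, 1024, dtype=torch.float16)  # as if loaded from an fp16 file

print(fp32.half() is fp32)  # False: a new fp16 tensor is allocated (extra copy)
print(fp16.half() is fp16)  # True: already fp16, so .half() returns the same tensor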

I ran your test script and safe tensor loads 3x faster than pytorch, from HDD.

Hm, have you made sure to load similar checkpoints? riffusion is 14 GB while sd-1.5 is ~5 GB. (The scripts load the two files you showed in your issue for the sake of similarity, but it's not necessarily a fair comparison :) )


aliencaocao commented Dec 21, 2022

I loaded the animefull model for your comparison script: the exact same model, just converted from PyTorch to safetensors.

I have fp16 models for everything and half precision turned on; they should be the same.


Narsil commented Dec 21, 2022

I have fp16 model for everything and half turned on, they should be the same

Can you try turning it off? Just as a sanity check?

What I could do is publish the version with all timings enabled somewhere; that should provide some insight into what's going on for you.

@aliencaocao (Contributor)

Using --no-half nearly doubled the loading time for me.


Narsil commented Dec 21, 2022

It's odd that it doubles (the overall timings). One of the two dtypes should match what's on disk and so load without a copy. Unless both versions somehow create copies, in which case it's expected (since fp32 is twice the size of fp16)...

Could you try this:

#5913

and report the timings? (They will show in the shell.)


freecoderwaifu commented Dec 21, 2022

Tried the script, and the safetensors model loaded in 35s and showed reads in Resource Manager equivalent to .ckpt loading. Tried the PR, but it unfortunately still shows the same symptoms with and without launch args, at least for me, and reads never go above 20 MB/s.

Another thing I had tried before is using safetensors converted with the Checkpoint Merger vs. converted with the script in the original PR vs. safetensors downloaded directly from the model provider, but all of them load slowly. Another was using --disable-safe-unpickle, since someone suggested it on another site, but that also didn't show improvements. Additional things from the Windows side, just in case: HAGS on vs. off, manually setting the OpenGL GPU in the Nvidia panel, and Exploit Protection on vs. off for python.exe, but those also had no impact. The antivirus also doesn't show any reading done on the files.

It's definitely something with the UI, since it only happens when loading a .safetensors from the Checkpoint selector; the standalone script provided above does make them load faster, and whatever the Checkpoint Merger and extensions like the one below do to load models also bypasses the slow load.

https://github.com/bbc-mc/sdweb-merge-block-weighted-gui


Narsil commented Dec 21, 2022

@freecoderwaifu can you provide the timings from the branch I created? It might provide insight into WHAT exactly is slow.


freecoderwaifu commented Dec 21, 2022

Whoops, my bad. I had manually copy-pasted the changes before; now I git-pulled the branch (#5913), but I still couldn't find the timings in the console. I also can't find the timings in the UI: the built-in log shows blank after loading a model but works fine for everything else (though that could possibly be due to some extension, I don't know).

[Screenshot: console output]

I had also only tried the safetensors load script before; when I tried the ckpt script, I got an error.

[Screenshot: ckpt script error]

For the torch version, this is what the Dreambooth extension reports:
[Screenshot: torch version reported by the Dreambooth extension]

But slow safetensors loading also happens on a fresh install with no extensions installed.

Overall I've noticed that all .ckpt load much faster than I remember them loading before, especially compared to when I started using the UI about 2 months ago. Most 2-4 GB .ckpt files load in less than 10s. Even the biggest .ckpt, Riffusion at 14 GB, which I used as a quick example, loads in less than a minute; it's only safetensors that seem to be affected.


freecoderwaifu commented Dec 22, 2022

I tried a fresh install in a new folder to test whether having models categorized in subfolders would somehow cause this, and I only copied a handful of models to the new folder. They all loaded instantly, but then I realized it was only because they were already cached in RAM from the copy. Resetting the GPU driver and clearing memory with RAMMap, or just rebooting, reverts back to the slow speeds.

I also tried installing the UI on a different HDD. I got higher reads in Resource Manager when loading the safetensors: it loaded 5 seconds faster using the script, and through the UI it loaded maybe 10-15s faster than on the other HDD, but it was still noticeably slower than the same .ckpt.

[Screenshot: Resource Manager reads on the other HDD]

One last thing I tried, just in case, was Resizable BAR on vs. off, but it had no effect.


Narsil commented Dec 23, 2022

Do you mind checking out the PR again? I added even more prints. (What we need is to understand where the time is spent during this whole process.)


Narsil commented Dec 23, 2022

Okay, I think I found something: huggingface/safetensors#140

Basically, SAFETENSORS_FAST_GPU=1 was having no effect on Windows (well, maybe even a bad effect sometimes, since it was spending too much time looking for a symbol it couldn't find; more info in the PR).
I'm not sure if it affected everyone, or only a subcategory of users.

Nevertheless, loading on GPU was indeed slower with safetensors than with pickle when both loaded onto the GPU.
I was confused by this comment:

I ran your test script and safe tensor loads 3x faster than pytorch, from HDD.

I kind of stumbled upon it by realising that if I forgot to set SAFETENSORS_FAST_GPU, the timings would be the same...

All in all, the most important factor is still being able to load the weights directly where they are going to be used (here, the GPU), both for PT and SF (see my initial PR, which I will start cleaning up).
But there was indeed a bug in safetensors on Windows that caused the fast GPU path not to work (and it ended up being slower as a consequence).

Edit: I must note that the speedups I'm seeing on Windows are much lower than what I get on Linux. It could be linked to the CUDA driver.
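
For anyone wanting to check whether the fast path makes a difference on their machine, here is a rough timing sketch (the filename is illustrative; the variable should be set before loading):

import datetime
import os

os.environ["SAFETENSORS_FAST_GPU"] = "1"  # remove to compare against the default path

from safetensors.torch import load_file

start = datetime.datetime.now()
weights = load_file("model.safetensors", device="cuda:0")
print(f"Loaded SF in {datetime.datetime.now() - start}")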

@freecoderwaifu (Author)

Thank you, I pulled the updated PR and I can see the timers now. safetensors also load much faster with it, with 2 GB files loading in only 14-15s now. The speed improvement does seem to be from the PR alone; nothing else has changed on my system.

SD 1.5 .ckpt:
[Screenshot: SD 1.5 ckpt load timings]

SD 1.5 safetensors:
[Screenshot: SD 1.5 safetensors load timings]

Additional safetensors test:
[Screenshot: additional safetensors load timings]

2GB safetensors:
[Screenshot: 2GB safetensors load timings]


aliencaocao commented Dec 24, 2022

"Checking for accelerate"
Python 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]
Commit hash: bf0c4c131920b1cff1c1754eddbfe326baa6cc15
Installing requirements for scikit_learn

Launching Web UI with arguments: --force-enable-xformers --listen --api --disable-safe-unpickle --opt-channelslast --enable-insecure-extension-access --theme dark
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading weights [eb01a8dc] from E:\PyCharmProjects\stable-diffusion-webui\models\Stable-diffusion\Anything-V3.0-pruned.safetensors
Loaded E:\PyCharmProjects\stable-diffusion-webui\models\Stable-diffusion\Anything-V3.0-pruned.safetensors in 0:00:31.845676
Using map_location None
Using shared.weight_load_location None
Using device cuda:0
Read state dict 0:00:31.846677
Loaded state dict 0:00:32.684437
Moved to channel last 0:00:33.069787
Model half 0:00:33.578248
Model first stage to 0:00:33.580250
Loading VAE weights from: E:\PyCharmProjects\stable-diffusion-webui\models\VAE\Anything-V3.0.vae.pt
Applying xformers cross attention optimization.
Model loaded.
Loaded a total of 50 textual inversion embeddings.
...

It does indeed seem to speed things up for me.
This is loading from an HDD.

Narsil added a commit to huggingface/safetensors that referenced this issue Feb 27, 2023
default).

This is now only a deprecation notice.

The reason for this:
- Moving to `0.3` for the alignment modification allowing for more
  change.
- Not specifying the default has been a real performance hurt:
  AUTOMATIC1111/stable-diffusion-webui#5893
  https://github.com/huggingface/diffusers/blob/e5810e686ea4ac499e325c2961808c8972dee039/src/diffusers/models/modeling_utils.py#L103

  This should only affect from disk -> CPU/GPU since this is where the
  location is modified. When loading from bytes, the location is already
  CPU so it's natural to use CPU (no alloc).

- Giving 2 "minor" versions before dropping support, this should allow
  users to have time to move.