[Bug]: Slow .safetensors loading #5893
I have the same observation; I thought it was supposed to load faster, as advertised.
Hi, thanks for the ping; this definitely shouldn't happen. Do you mind sharing:
I need to try and reproduce, but so far I have failed (I must confess I'm only using cloud Windows since I don't own a Windows machine anymore).
Loading on CPU is always going to be faster than loading onto GPU, yes. Could you isolate the issue, maybe?

```python
import torch
import datetime

start = datetime.datetime.now()
weights = torch.load("riffusion-model-v1.ckpt", map_location="cuda:0")
print(f"Loaded PT in {datetime.datetime.now() - start}")
```

```python
from safetensors.torch import load_file
import datetime

start = datetime.datetime.now()
weights = load_file("v1-5-pruned-emaonly.safetensors", device="cuda:0")
print(f"Loaded SF in {datetime.datetime.now() - start}")
```

If SF is indeed slower, we can also try this: https://gist.github.com/Narsil/3edeec2669a5e94e4707aa0f901d2282 to check whether the slowness is really in the lib or not (it needs a bit of adaptation to load the checkpoint you want onto GPU). And report if you see the same things (just trying to rule out the webui if possible). In the meantime, I will keep trying to reproduce.

NB: The two files are not the same size, it seems, but the SF one is smaller, so it should be faster to load.
Here is what I get on a brand-new Windows (Server 2022) machine with CUDA support. So it's indeed spending a bunch of time doing something AFTER loading the weights, instead of loading the correct weights for SD-1.5 directly in the proper form. I'll investigate a bit more to see what's going on for SD-1.5.
OK, I was able to investigate. The culprit seems to be https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/master/modules/sd_models.py#L170. I changed that line and forced the load onto the GPU directly.
So the raw load time is slower (it loads directly onto the GPU), but the overall process seems faster (because there's no need to reallocate on the GPU afterwards). Could that explain the issue you're having? As for a fix, I'm not sure what the proper one is here.
Created #5907, which hopefully fixes it. If you could try it out and confirm, that would be nice. (I'm not convinced it's the "best" fix, though.)
@Narsil You are right that this is more of a web UI issue. I ran your test script and safetensors loads 3x faster than PyTorch, from an HDD.
Do you need help debugging? The other thing I might have looked at is
Hm, have you made sure to load similar checkpoints? riffusion is 14 GB, while sd-1.5 is ~5 GB. (The scripts load the two files you showed in your issue for the sake of similarity, but it's not necessarily a fair comparison. :) )
I loaded the animefull model for your comparison script: the exact same model, just converted from PyTorch to safetensors. I have fp16 models for everything and half turned on; they should be the same.
Can you try turning it off? Just for a sanity check? What I could do is publish the version with all timings enabled somewhere; that should provide some insight into what's going on for you.
Using
It's odd that the overall timings double. One of them should be the correct value, i.e. a load without a copy. Unless both versions somehow create copies, in which case it's normal (since f32 is twice the size of f16)... Could you try that and report the timings? (They will show in the shell.)
Tried the script, and the safetensors model loaded in 35 s and showed reads equivalent to ckpt loading in Resource Manager. Tried the PR, but unfortunately it still shows the same symptoms with and without launch args, at least for me, and reads never go above 20 MB/s.

Other things I had tried before: safetensors converted using the Checkpoint Merger vs. converted with the script in the original PR vs. safetensors downloaded directly from the model provider, but all of them load slowly. Another was using --disable-safe-unpickle, since someone suggested it on another site, but that also didn't show improvements.

Additional stuff, just in case, from the Windows side: HAGS on vs. off, manually setting the OpenGL GPU in the Nvidia panel, and Exploit Protection on vs. off for python.exe, but none of that had any impact either. The antivirus also doesn't show any reading done on the files.

It's definitely something in the UI, since it happens only when loading a .safetensors from the Checkpoint selector; the standalone script provided above does make them load faster, and whatever the Checkpoint Merger and extensions like it do to load models also bypasses the slow load.
@freecoderwaifu Can you provide the timings from the branch I created? It might provide insight as to WHAT exactly is slow.
Whoops, my bad. I had manually copy-pasted the changes before; now I git-pulled the branch (#5913) but still couldn't find the timings in the console. I also can't find the timings in the UI; the inbuilt log shows blank after loading a model but works fine for everything else (though that could possibly be due to some extension, dunno). I had only tried the safetensors load script too; when I tried the ckpt script I got an error. For the torch version, this is what the Dreambooth extension pulls up.

But slow safetensors loading also happens on a fresh install with no extensions installed. Overall, I've noticed all .ckpt load much faster than I remember them loading before, especially compared to when I started using the UI about 2 months ago. Most 2-4 GB .ckpt load in less than 10 s. Even the biggest .ckpt, Riffusion at 14 GB, which I used as a quick example, loads in less than a minute; it's only safetensors that seem to be impacted.
I tried a fresh install in a new folder to test, just in case having models categorized in subfolders would somehow cause this, and I only copied a handful of models to the new folder. They all loaded instantly, but then I realized that's only because they were already cached in RAM from the copy: resetting the GPU driver and clearing memory with RAMMap, or just rebooting, reverts back to slow speeds.

I also tried installing the UI on a different HDD. I got higher reads in Resource Manager when loading the safetensors; it loaded 5 seconds faster using the script, and through the UI it loaded maybe 10-15 s faster than on the other HDD, but still noticeably slower than the same .ckpt. One last thing I tried, just in case, was Resizable BAR on vs. off, but that had no effect.
Do you mind checking out the PR again? I added even more prints. (What we need is to understand where the time is spent during this whole process.)
Okay, I think I found something: huggingface/safetensors#140. Nevertheless, loading on GPU was indeed slower on
I kind of stumbled upon it, by realising that if I forgot to set SAFETENSORS_FAST_GPU, the timings would be the same... All in all, the most important factor is still being able to load the weights directly where they are going to be used (here, the GPU), both for PT and SF (see my initial PR, which I will start cleaning up).

Edit: I must note that the speedups I'm seeing on Windows are much lower than what I get on Linux. Could be linked to the CUDA driver.
Thank you, I pulled the updated PR and I can see the timers now. safetensors also load much faster with it, with 2 GB files loading in only 14-15 s now. The speed improvements do seem to come from the PR alone; nothing else has changed on my system.

(Attached timings: SD 1.5 safetensors; additional safetensors test; 2 GB safetensors.)
It does indeed seem to speed things up for me.
default). This is now only a deprecation notice. The reasons for this:

- Moving to `0.3` for the alignment modification, allowing for more change.
- Not specifying the default has been a real performance hurt: AUTOMATIC1111/stable-diffusion-webui#5893 https://github.com/huggingface/diffusers/blob/e5810e686ea4ac499e325c2961808c8972dee039/src/diffusers/models/modeling_utils.py#L103 This should only affect disk -> CPU/GPU, since that is where the location is modified. When loading from bytes, the location is already CPU, so it's natural to use CPU (no alloc).
- Giving 2 "minor" versions before dropping support; this should give users time to move.
Is there an existing issue for this?
What happened?
System specs: 5800X3D, RTX 3080 12GB, 64GB DDR4-3600, Windows 11 22H2 22621.963, latest Nvidia driver (527.56), latest BIOS update.
This is most likely a specific issue related only to my PC, but I've seen a couple of comments about it on other sites. safetensors files load significantly slower than the same model in .ckpt: around 2-3 minutes to load a .safetensors, compared to less than 10 seconds for the .ckpt.
However, safetensors do load fast when doing merges, both with the inbuilt merger and with merge extensions. I think this is because they mostly only load into RAM and not fully into VRAM, but switching to either of the models used for a merge after the merge is done also makes them load instantly.
Troubleshooting I've tried so far:
-Launching with optimizations and without optimizations
-Fresh UI reinstall
-Deleted and rebuilt venv
-`set SAFETENSORS_FAST_GPU=1` and without the set parameter
-Python 3.10.6, 3.10.8, and 3.10.9
-No extensions
-No antivirus
-No GPU undervolt
-Setting python.exe to Max Performance in the Nvidia panel
-Different browsers
-Browser HW acceleration on and off
The video card works as it should for everything else.
The most notable thing I've noticed is that when loading a .ckpt, both python.exe and System show reads in Resource Manager. When loading a .safetensors, only python.exe shows reads.
ckpt loading:
safetensors loading:
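One way to put numbers on the Resource Manager observation above is to time a plain sequential read of the same file with no ML library involved, which separates raw disk throughput from library overhead. A minimal stdlib-only sketch (the helper name is made up for illustration):

```python
import time

def read_throughput_mb_s(path, chunk_size=16 * 1024 * 1024):
    """Sequentially read `path` in chunks and return throughput in MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / (1024 * 1024) / max(elapsed, 1e-9)
```

If the plain read of the .safetensors file is fast but the UI load is slow, the bottleneck is above the disk; if the plain read is already stuck around 20 MB/s, the problem sits below the library (filesystem, caching, antivirus, or driver).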
Steps to reproduce the problem
What should have happened?
Commit where the problem happens
685f963
What platforms do you use to access UI ?
Windows
What browsers do you use to access the UI ?
Brave
Command Line Arguments
Additional information, context and logs
No response