Dreambooth #2002
Also, @AUTOMATIC1111, if you could check your reddit, I sent you a PM. |
Naive question, but what does this PR allow users to do? Have you found a way to separate the Dreambooth “changes” and apply them on top of other CKPTs? Or is this to create Dreambooth models via the webui? |
It should do all the things. First, you point it at an existing checkpoint, even a custom one. Then it'll extract the diffusion models for that checkpoint and set up a working directory for training. Once set up, you tell it where your training images are, your input prompt, and your "classification" prompt. Set the number of training steps, and let it rip. I don't have the progress bar, "intermediary images", or "save a checkpoint every N steps" bits added yet, but in theory, it should work to train. I can get it to throw an OOM error, which is what I'd expect since I'm not forcing it to run on my CPU yet. BUT, once done, it should then take the Dreambooth-generated files and merge them into the selected checkpoint, saving it alongside the others. Since I'm getting OOM errors and can't use it yet, I can't verify I have the "build a new checkpoint" parts right, but if there is a bug/mistake there, it should be fairly trivial to fix. |
I'll try and see if I can get it working with a 3090 and get some of the missing features in. Will edit this comment just in case I don't get anywhere before Tues. Notes for myself:
File "/home/unknown/Development/stable-diffusion-webui/modules/dreambooth/dreambooth.py", line 386, in train
if not global_step % self.save_data_every:
ZeroDivisionError: integer division or modulo by zero
Traceback (most recent call last):
File "/home/unknown/Development/stable-diffusion-webui/modules/ui.py", line 188, in f
res = list(func(*args, **kwargs))
TypeError: 'NoneType' object is not iterable |
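The `ZeroDivisionError` in the traceback above comes from evaluating `global_step % self.save_data_every` while the interval is 0. A minimal sketch of a guard follows. `should_save` is a hypothetical helper, not the PR's code, and it assumes that an interval of 0 or `None` means "periodic saving is disabled". Unlike the original check, it also skips step 0.

```python
from typing import Optional


def should_save(global_step: int, save_every: Optional[int]) -> bool:
    """Return True when a periodic save is due at this step.

    Assumption (not from the PR): an interval of 0, None, or a negative
    number means periodic saving is turned off, so no modulo is attempted.
    """
    if not save_every or save_every < 0:
        return False
    # Skip step 0 so nothing is written before any training has happened.
    return global_step > 0 and global_step % save_every == 0
```

In the training loop this would replace the bare modulo test, for example `if should_save(global_step, self.save_data_every): ...`.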
To work on a 3090 with 12GB you need to use deepspeed.
This is from pinkred's comment on the diffusers patch (huggingface/diffusers#735). Note that TTL also had to do explicit casts rather than relying on auto to ensure that everything stayed 16-bit. |
In hindsight it might be better to just have diffusers as an optional dependency in repositories/, like xformers is, instead of redistributing 2 py files from it in the repo. |
I'm only using one file from the HD repo, and it's pretty heavily modified, so not really re-distributed... |
Yeah, sorry, but this doesn't work for a bunch of people. Exactly why is uncertain, but it's OOM on my 3080 10GB with 64GB of RAM. (The TTL implementation is supposed to run at 8GB per his account.) |
Perhaps you could integrate these changes; they apparently allow it to run on 8GB:
https://www.reddit.com/r/StableDiffusion/comments/xzbc2h/guide_for_dreambooth_with_8gb_vram_under_windows/?utm_source=share&utm_medium=ios_app&utm_name=iossmf
|
Weird, it's almost like I mentioned in my initial commit that I currently can't get this version to run due to OOM errors, which is specifically why I'm asking for help with the "accelerate launch" commands needed to make it run under 8GB. :P |
"accelerate config" is literally how I have the stand-alone version running, on windows, on 8GB right now. It's why I chose the base diffusers repo, and it's what I'm asking @AUTOMATIC1111 or anybody else for a bit of help with. ;) |
I will try the manual method today and then poke at things to see if I can figure something out once I get things running manually. I have close to zero Python experience, so not much hope, but who knows. |
OK... I see what you are talking about.. the issue is that the activation can't be done using the python script... and this is what is causing the issue. Just for a test... what if activation was done before starting webui? Would that solve this issue? |
What do you mean by "activation"? It would either be up to the user to run "accelerate config" to set the required params (or maybe do it with a script, launch.py, etc.). The bit I need to understand is how I can run "accelerate launch" from within the UI, versus from the command-line as it's documented. I think it's possible, but I haven't tested yet. |
I see. On my side I am stuck trying to make it work manually... until I can do that even the UI won't work. I have done all the installation and config but when I try to run things I get:
|
Add notebook launcher for training start, use the --medvram and --lowvram flags to hijack the launcher's torch_cuda_available method and pass "False" if set to force training only on CPU.
Give the latest commit I just made a try. Be sure to set --medvram in the COMMANDLINE_ARGS of your launch script, or set it however you prefer. I wired in the "notebook_launcher" function from Accelerate, and then forced it to run only on CPU if medvram or lowvram is set. I haven't verified that it trains myself yet, but my indicator of early success has been how long the "caching latents" portion takes. If it goes fast, it's gonna OOM. If it's running slow (as it is now), then training will run after that call. |
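The rule described above (train on CPU whenever --medvram or --lowvram is set) can be sketched as a small decision function. This shows only the selection logic, not the Accelerate launcher hijack itself. `pick_training_device` is a hypothetical name, and the sketch assumes the parsed command-line options expose boolean `medvram` and `lowvram` attributes.

```python
from argparse import Namespace


def pick_training_device(cmd_opts: Namespace, cuda_available: bool) -> str:
    """Choose the device string to train on.

    Assumption (hypothetical, for illustration): cmd_opts carries boolean
    `medvram` / `lowvram` attributes; either one forces CPU training.
    """
    low_memory = getattr(cmd_opts, "medvram", False) or getattr(cmd_opts, "lowvram", False)
    if low_memory or not cuda_available:
        return "cpu"
    return "cuda"
```

For example, `pick_training_device(Namespace(medvram=True, lowvram=False), cuda_available=True)` selects the CPU even though a GPU is present, which matches the slow-but-working behaviour reported in this thread.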
The good news: I can make it work on an 8GB GPU now, from the UI. The bad news: It's abysmally slow, seemingly more so than when I run it manually. I suspect there are other things that can be done to make it faster...but I'll need to futz with it more. Also, still no progress bar, no way to interrupt/resume training, and no preview in the UI. But, hey, it will run. Progress! |
The latest version fails as soon as I hit train with:
|
Yeah, my bad. Dumb coding error. Fixed already, do another pull. |
Hummm... when using --medvram I get:
I guess this is not supposed to be like that... |
Hummm... to pass the --medvram I need to use python launch --medvram... and this is what gives the cuda error. I usually just run bash webui.sh to launch webui but that one does not pass parameters... |
This is happening on a fresh install of this PR:
Also this happens when I try to train: So apparently the interface fails to get argument inputs
|
Revert changes to launch/requirements. Add class batch size. Use "regular" script from stable-diffusion again.
With the last commit, using the default settings & no command-line arguments on rtx3090:
When I run with I have plenty of free VRAM:
|
These commits solved it. Thanks! Windows 10 (native, not WSL) & 3090; training takes around 15-22GB of VRAM depending on settings and how big my training dataset is. |
v. 3.8 of Gradio lets us use a dictionary of keys/blocks as an input, versus one big list that has to be constantly updated, meaning we can use **kwargs for functions. :D
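The benefit described above can be illustrated without Gradio itself. In the sketch below, a dictionary of named UI values is unpacked into a keyword-only function, so adding or reordering fields does not break the call. `start_training` and its parameters are hypothetical, and the sketch makes no claim about how Gradio keys the dictionary it passes.

```python
def start_training(*, model_name: str, steps: int = 1000,
                   learning_rate: float = 5e-6, **extra) -> str:
    """Hypothetical training entry point that takes keyword arguments only.

    Unknown keys land in `extra`, so a new UI field does not break the call.
    """
    return f"{model_name}: {steps} steps at lr={learning_rate}"


# Values as they might arrive from the UI, keyed by name rather than position.
ui_values = {"model_name": "my_model", "steps": 500, "new_ui_field": True}
result = start_training(**ui_values)
```

With one big positional list, every added input shifts the arguments after it. With names, only the functions that care about the new field need to change.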
Add option to load previous training params after first-time training (resume). Clean up UI, add tabbed interface, Move stuff around so it's easier to work with. Add cancellation support for the class image generation phase, better UI messages. Fix up image generation for UI, hook class generation to UI. Update/cleanup requirements.
modules/dreambooth/dreambooth.py
Outdated
os.makedirs(self.class_data_dir)

self.logging_dir = os.path.join(self.output_dir, "logging")
self.pretrained_model_path = os.path.join(model_dir, "stable-diffusion-v1-5") |
Does every model require its own copy of stable-diffusion-v1-5? Can it be downloaded just once to models/dreambooth instead?
Technically, no model requires a copy. The files are extracted from the target checkpoint when you create a new Dreambooth model. They will reside on disk until training is completed and the data folder for that model is deleted manually.
The only file that actually gets downloaded is the config file needed to load the model.
modules/dreambooth/dreambooth.py
Outdated
try:
print(f"Saving to {self.output_dir}")
pipeline.save_pretrained(self.output_dir)
save_checkpoint(self.total_steps + global_step, self.src, |
IMO it's a good idea to allow save_checkpoint to receive a pipeline (instead of a path on disk) to save on I/O & RAM
Good point. I can implement this, just need to decide how to handle the "save_checkpoint" call that runs in "start_training".
It's really not necessary if I update the logic a bit at the "check for save" part to ensure we're saving on the last iteration of training. Maybe I'm already doing this...
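One way to picture the reviewer's suggestion is a small helper that accepts either an in-memory pipeline or a path, and only touches the disk when it is given a path. This is a sketch under assumptions, not the PR's implementation: `resolve_pipeline` and `load_from_disk` are hypothetical names, and the loader is injected so the example does not depend on diffusers.

```python
from typing import Any, Callable, Union


def resolve_pipeline(source: Union[str, Any],
                     load_from_disk: Callable[[str], Any]) -> Any:
    """Return a pipeline object, loading it only when `source` is a path.

    Hypothetical helper: a caller that already holds the trained pipeline
    passes it straight through, avoiding a second load from disk.
    """
    if isinstance(source, str):
        return load_from_disk(source)
    return source
```

A `save_checkpoint` built on this could be called with the live pipeline at the end of training and with a directory path when resuming from saved data.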
I've installed the webui on WSL / Debian and I'm able to use Dreambooth with the shivam repo, but when I use the webui I keep getting errors. I tried all 3 types of scheduler: Error completing request |
Train->advanced->scheduler |
Pass a pipeline to "save_checkpoint", versus making one twice. Update logic for saving preview/checkpoints. Cleanup extraneous print messages. Move paths for /logging and main config (may break previous trainings, sorry) Add UI Updates/status when creating new DB Model.
I'm not sure what you mean? |
|
Possible memory / storage saving through dehydrated models? |
I'll have to review the code, but I highly doubt it can be removed or shared. When you create a "new" DreamBooth model, it's taking data from the checkpoint you selected and extracting it into "diffusers" format. This is what lives in the /Stable-Diffusion_v1.5 folder. It's not always the same checkpoint data, I'm just using the same folder name. I say I'll have to review the code because it might be possible to delete this after saving the first bit of training data - but I need to review the method used to convert the data back to .ckpt format and ensure it doesn't need the original folder for anything. Which, I think it does at the moment - but I also found a new method that doesn't rely on this folder, so I could potentially do away with it. On the flip side - once you've trained a model, you can delete the folder in models/dreambooth/MODEL NAME. It's just there in case you want to resume training a model. |
Fix saving checkpoint data so ALL the saved checkpoint data is encoded, not just the unet? Ditch the somewhat hacky conversion script in favor of the script directly from huggingface. Don't create or use a "stable-diffusion-v-1-5" folder, just extract to /working and work from there.
Config reload from UI was broken because of additional "name" value. Re-arrange UI (again). Save VAE/text encoder when saving model.
Closing, opening new PR to squash commits and make it clean. |
Add basic UI implementation and stuff to unpack a selected checkpoint and then use it with Dreambooth.
There's also code to re-merge the output with said selected checkpoint, but I can't currently test with my potato because I don't know how to incorporate the necessary "accelerate launch" command to make it only run on GPU.
@AUTOMATIC1111 - Need help with this bit. It's useless to me if I can't get the accelerate launch stuff to work so I can force it just to my GPU, unless you know some other magick to make it work with 8GB.