Pooling VRAM #8
Comments
Looking into that!
My idea was the following: I figured I need to do something like this, where
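A minimal sketch of that kind of per-component placement, assuming the huggingface diffusers StableDiffusionPipeline (the model id and device layout here are illustrative, not the actual code from this thread):

```python
# Hypothetical sketch: give each heavy pipeline component its own GPU.
# Running the pipeline afterwards still requires moving intermediate tensors
# between devices at each component boundary, which a plain `pipe(...)` call
# won't do on its own.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# Illustrative assignment of components to devices.
assignment = {
    "text_encoder": torch.device("cuda:0"),
    "unet": torch.device("cuda:1"),
    "vae": torch.device("cuda:2"),
}
for name, device in assignment.items():
    getattr(pipe, name).to(device)
```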
Thank you so much for this insight! It sounds like an OS scheduling problem, hmm... I found a few resources (but I don't understand them fully):
Thanks for your help, I see your point, that would definitely come in handy, but atm I'm not too "scared" by the scale of the problem to turn to GPU computing. I think getting to
If you're using the huggingface diffusers library, would using huggingface accelerate work? Or is that only for training models, and not executing them?
Yep, during training you have to keep weight updates synchronized, so it makes sense to use a framework. I'll go with the brute-force solution for now; I'll keep you posted.
Okay, I can generate the possible combinations of components-to-GPUs assignments; it works well (in terms of speed) if we cut down the number of assignments at each step from a theoretical max of N^4 to something like a random sample of 2 of them (ikr 😕). This is a greedy approach, so we give up on optimality, but I believe it's a fair trade-off. This is probably overkill as analysis goes, since I doubt it will be used to generate images on a cluster of 128 A100s, but perhaps it can turn out to be useful for some other project by simply scaling up the random search I've done here.
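For illustration, a loose sketch of that greedy sampled search; the component sizes, GPU capacities, and scoring are made up, only the "sample 2 candidates per step instead of scoring all of them" idea mirrors the description above:

```python
import random

# Hypothetical component memory costs (GB) and free VRAM per GPU (GB).
components = {"text_encoder": 0.5, "unet": 3.4, "vae": 1.2}
gpu_free = {0: 8.0, 1: 8.0, 2: 16.0}

# Greedy: place components one by one; at each step sample only 2 candidate
# GPUs instead of evaluating every possibility, trading optimality for speed.
assignment = {}
for name, cost in sorted(components.items(), key=lambda kv: -kv[1]):
    candidates = random.sample(list(gpu_free), k=min(2, len(gpu_free)))
    best = max(candidates, key=lambda g: gpu_free[g])  # most free VRAM wins
    assignment[name] = best
    gpu_free[best] -= cost

print(assignment)
```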
👏 👏 🥳 WOOO!!! I can't express in plaintext how exciting this is, even if not optimal! This is a great first step that takes skill to pull off. The broader community may be able to help optimize from here. 128 A100s? Not yet, but perhaps if someone makes a job distributor or some kind of kubernetes/distributed scheduler integration for stablediffusion... (looking at myself, maybe)
@NickLucche Would this help at all? https://cundy.me/post/blog_post_running_gpt_j_on_several_smaller_gpus/
What about setups with NVLink? Does it make it easier to pool memory, or is it the same thing?
NVLink looks like a cool idea, but I'm not sure whether it supports finding the best assignments for multiple model parts. I should look into that.
Wait, so does this fork of yours make any dual-GPU setup behave like NVLink?
No, not really; this is high-level code (pytorch level, not nvidia firmware) that's specific to this stable diffusion model. It should support splitting multiple models. I know it may sound confusing, but it's really just Data + Model Parallel.
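Roughly like this, as a toy sketch (the prompts and device grouping are made up for illustration):

```python
# Hypothetical sketch of "Data + Model Parallel" for image generation:
# the batch of prompts is split across replicas (data parallel), and each
# replica spreads its model components over its group of GPUs (model parallel).
prompts = ["a cat", "a dog", "a ship", "a tree"]
device_groups = [["cuda:0", "cuda:1"], ["cuda:2", "cuda:3"]]

# Data parallel: round-robin the prompts across the replicas.
chunks = [prompts[i::len(device_groups)] for i in range(len(device_groups))]
for group, chunk in zip(device_groups, chunks):
    # Model parallel would happen inside each replica, as sketched earlier.
    print(group, "->", chunk)
```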
I'm just an artist, it's definitely confusing to me lol. I found this guy talking about multi GPU: https://youtu.be/hBKcL8fNZ18?list=PLzSRtos7-PQRCskmdrgtMYIt_bKEbMPfD&t=436 No clue if it's helpful at all
No worries, thanks for your help. I'll try to make it so that you don't have to worry about how it runs under the hood; hopefully it'll simply work!
I'm willing to help test things on my hardware pool if you want some help :)
I have a somewhat stable build that can be tested with: I am expecting some bugs here and there, so please report the logs/errors that appear in the console! The current build has some limitations when
I'm excited already! 😄 waiting for the downloads to finish....
@NickLucche does it matter which noise scheduler is used?
This is so exciting! I generated FOUR 512x512 images in the time it used to take me to generate ONE 512x512 image (on a P100). Now to try 14 images...
I think I found a bug! 😄 Hardware environment:
When trying to generate 14 images with the following parameters:
the first GPU fails because it only has 8GB VRAM, which is fine, whatever.
Is there a way to solve that? Perhaps scaling what is scheduled to fit on a per-card basis? (if VRAM amounts differ by card -- which other cards do, e.g. the P100 is 16GB VRAM and the P40 is 24GB VRAM)
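One way such per-card scaling could look, as a rough sketch (this assumes CUDA devices are present; `torch.cuda.mem_get_info` reports free/total VRAM per card, everything else is illustrative):

```python
import torch

# Hypothetical sketch: split a batch of images across GPUs in proportion
# to each card's currently free VRAM, so a 24GB card gets more work than
# an 8GB card instead of every card receiving the same share.
n_images = 14
free = [torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())]
shares = [int(n_images * f / sum(free)) for f in free]
# Hand any rounding remainder to the cards with the most free memory.
for idx in sorted(range(len(free)), key=lambda j: -free[j])[: n_images - sum(shares)]:
    shares[idx] += 1
print(shares)  # images scheduled per GPU
```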
Thanks a lot for testing that out so promptly! Nice setup btw 😮
No, you can choose any of the available ones; it shouldn't noticeably affect speed.
Yeah, unfortunately that is how it is supposed to work atm; the small GPU can be a bottleneck for the whole system if included among the
Anyway, does discarding the small device (by setting
Same parameters as the last test, this time
UNDER A MINUTE WITH PNDM! That's 3.57 sec/image! Trying it with DDIM:
How would the second choice handle smaller GPUs in the pool? Does the work need to be split evenly between the GPUs (problematic if the GPUs are not evenly sized)?
Sorry for the late reply, thanks a lot for testing out the build and reporting the inference time too, that is super useful!
I was thinking about filling the biggest GPUs first, and placing only the lightest component on the small one.
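A rough sketch of that heuristic (all sizes here are made up; only the ordering idea mirrors the comment):

```python
# Hypothetical sketch of "fill the biggest GPUs first": order cards by free
# VRAM and components by weight, so the heaviest parts land on the largest
# cards and only the lightest component ends up on the small one.
gpus = sorted([("P40", 24.0), ("P100", 16.0), ("small", 8.0)], key=lambda g: -g[1])
components = sorted(
    [("unet", 3.4), ("vae", 1.2), ("text_encoder", 0.5)], key=lambda c: -c[1]
)
placement = {
    name: gpus[min(i, len(gpus) - 1)][0] for i, (name, _) in enumerate(components)
}
print(placement)  # e.g. unet -> P40, vae -> P100, text_encoder -> small
```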
Then I'll merge the results into the master branch and update the "stable" version.
Yeah, that looks weird; are you getting the same gibberish results with the single-model version (e.g.
No worries, glad to help 😄
Yes please! 😃
I'll try that and report back! Thank you again, so much, for your work on this.
Okay, I've added the fp32 support and polished up the code a bit. I'll need to verify that everything that was working before this change is still okay, then I'll merge this into the master branch.
Awesome! Is there anything specific I need to do to use fp32 mode? Thank you so much
The good old
@NickLucche what were the next steps after this?
Could you re-test the latest image with
@NickLucche I'm getting an
Also, is there a way to mount a cache folder for pip to download its packages to? Downloading 4GB+ every time I
I also noticed that GPUs 0 and 1 are used (some... note below: the
More errors: tried on a pair of smaller cards:
Thanks a lot for testing!
Good point, I'll add a section in the Readme; it relates to #10
I'll look into that too, perhaps some leftover hanging processes...?
This looks like a driver error; we can open another issue for that with info on the specs of the cards.
Thanks for the link! I'll add a volume for that. Do you want me to open issues for "leftover hanging processes" and "Unable to find valid cuDNN algorithm"? (perhaps a missing python dependency?)
Also, tried running again - another error: I think missing dependencies?
Yeah, you're definitely missing some drivers for the card you're trying to use. I suggest you first try to install cuDNN and run some example code on the new GPUs; this "hello world" container from nvidia may help with that
Looks fine to me: The cards I'm trying to test are a 3070 and a 3070 Ti
Hey, just found this thread! Great-looking stuff!
It's loading a lot of models, 17 in fact. Might that be the culprit? Anyways, if I can participate in testing or help in any way, I'm here to do so :)
That makes 2 of us! Oh no :(
Hi @huotarih, thanks a lot for reporting this bug! I'll also ask you to test the fixed version, if that's ok with you :)
Good point. Currently I'm only taking 60% of the free memory of the GPU to instantiate the model(s); that is because generating one or more images requires a substantial amount of free memory, which is only occupied when you actually send the input to the network. 60% is a conservative threshold, as memory use varies with the requested image output; I am still unsure how to properly explain that to the user.
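As a sketch, that kind of budget computation might look like this (the 60% fraction mirrors the comment above; the helper name and everything else are illustrative):

```python
import torch

SAFETY_FRACTION = 0.6  # conservative share of free VRAM used for model weights

def model_budget_bytes(device_index: int) -> int:
    # Keep 40% of free VRAM as headroom for the activations that are only
    # allocated once an input is actually sent through the network.
    free, _total = torch.cuda.mem_get_info(device_index)
    return int(free * SAFETY_FRACTION)

budgets = [model_budget_bytes(i) for i in range(torch.cuda.device_count())]
print([round(b / 2**30, 2) for b in budgets])  # budget in GiB per card
```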
Is there a way to get this working on Automatic1111? A single-image generation job on multiple GPUs at once?
Yes, but it needs to be a separate contribution to the Automatic1111 repo.
I am closing this issue as it has been stale for a while. Inference on multiple GPUs is implemented here (readme); I will come back to supporting heterogeneous GPU setups when I have more time and resources (e.g. a test environment with different GPUs, for one). All PRs are welcome :)
What's the state of this in 2024? Any plans on getting this to work with ComfyUI?
Originally posted by @NickLucche in #5 (comment)
I would like to be able to pool resources (VRAM) from the multiple cards I have installed into one pool. For example,
I have 4x NVIDIA P100 cards installed. I want to combine them all (16GB VRAM each) into 64GB of VRAM so that complicated or high-resolution images don't overrun the 16GB per-card limit.
This also would be useful for people with multiple 4GB VRAM consumer/hobbyist cards to reach workable amounts of VRAM without buying enterprise GPUs.