
Pooling VRAM #8

Closed
mchaker opened this issue Sep 3, 2022 · 52 comments
@mchaker

mchaker commented Sep 3, 2022

Great thanks, at this point I think we can close this issue and make another one for the pooled VRAM!

Originally posted by @NickLucche in #5 (comment)

I would like to be able to pool resources (VRAM) from the multiple cards I have installed into one pool. For example,

I have 4x NVIDIA P100 cards installed. I want to combine them all (16GB VRAM each) into one 64GB pool, so that complicated or high-resolution images aren't constrained by the 16GB limit of a single card.

This also would be useful for people with multiple 4GB VRAM consumer/hobbyist cards to reach workable amounts of VRAM without buying enterprise GPUs.

@NickLucche NickLucche added enhancement New feature or request help wanted Extra attention is needed labels Sep 5, 2022
@NickLucche
Owner

Looking into that!

@NickLucche
Owner

NickLucche commented Sep 6, 2022

My idea was the following:
the model gets split into 4 main components: $\text{unet}_e$, $\text{unet}_d$, $\text{text-embedding}$ and $\text{vae}$.
These 4 components must be distributed over $N$ GPUs, and possibly replicated more than once, so that you can run multiple models split over multiple devices (this is like combining data and model parallelism).

I figured I need to do something like this
$x_1 \text{unet}_e + y_1 \text{unet}_d + z_1 \text{text-embedding} + k_1 \text{vae} \leq G_1$.
...
$x_N \text{unet}_e + y_N \text{unet}_d + z_N \text{text-embedding} + k_N \text{vae} \leq G_N$.

with $\sum_i x_i = \sum_i y_i = \sum_i z_i = \sum_i k_i$,
$x, y, z, k \geq 0$,
$x, y, z, k \in \mathbb{Z}^N$

where $G_i$ is the memory capacity of GPU $i$, and each component symbol (e.g. $\text{unet}_e$) stands for the memory required by that component.
This looks like an ILP problem to me. Unfortunately, I don't know how to solve it.
The alternative is to use some other greedy approach (start placing components on GPUs where there's enough memory and see where it goes) or brute force, generating all possible combinations.
This may be overkill for the purposes of this project, so I'll think about it some more and come up with a feasible idea.
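For reference, that formulation can be handed almost verbatim to an off-the-shelf solver. Here is a minimal sketch with PuLP, using made-up component sizes and GPU capacities (this is not code from this repo):

```python
# Hypothetical sketch, not part of this repo: the assignment above expressed as an ILP
# and handed to PuLP/CBC. Component sizes and free GPU memory are made-up placeholders.
import pulp

components = {"unet_e": 1600, "unet_d": 1600, "text_embedding": 500, "vae": 350}  # MB, made up
gpu_free_mb = [16000, 16000, 8000]                                                # MB, made up

prob = pulp.LpProblem("component_assignment", pulp.LpMaximize)

# x[i][c] = number of copies of component c placed on GPU i
x = pulp.LpVariable.dicts("x", (range(len(gpu_free_mb)), list(components)), lowBound=0, cat="Integer")
# number of complete models (one copy of every component) we manage to place
n_models = pulp.LpVariable("n_models", lowBound=0, cat="Integer")

# per-GPU capacity constraint: components placed on GPU i must fit in G_i
for i, cap in enumerate(gpu_free_mb):
    prob += pulp.lpSum(x[i][c] * components[c] for c in components) <= cap

# every component must be replicated the same number of times, i.e. only complete models count
for c in components:
    prob += pulp.lpSum(x[i][c] for i in range(len(gpu_free_mb))) == n_models

prob += n_models  # objective: fit as many complete models as possible
prob.solve(pulp.PULP_CBC_CMD(msg=0))

print("complete models placed:", int(n_models.value()))
for i in range(len(gpu_free_mb)):
    print(f"GPU {i}:", {c: int(x[i][c].value()) for c in components})
```

Maximizing the number of complete models placed is just one possible objective; the capacity constraints mirror the inequalities above.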

@mchaker
Author

mchaker commented Sep 6, 2022

Thank you so much for this insight! It sounds like an OS scheduling problem, hmm...
Would it be possible to solve the ILP problems/optimization problem using GPUs?
They seem ideal for the task.
(I don't mean to be facetious -- I mean that given we are already working with GPUs, perhaps we can use them as part of a startup calculation, then unload the calculation once results are found, and load in the stablediffusion models etc?)

I found a few resources (but I don't understand them fully):

@NickLucche
Owner

Thanks for your help, I see your point and that would definitely come in handy, but atm the scale of the problem doesn't scare me enough to turn to GPU computing: I think $N=128$ GPUs is about the biggest use-case we can have, which is only $N \times 4$ variables.
I'm mostly concerned with figuring out whether my formulation is correct, or whether there's something much simpler that can be used to do this, perhaps with reference to some similar work..?

@mchaker
Author

mchaker commented Sep 6, 2022

If you're using the huggingface diffusers library, would using huggingface accelerate work?

Or is that only for training models, and not executing them?

@NickLucche
Owner

Yep, during training you have to keep weight updates synchronized, so it makes sense to use a framework there.

I'll go with the brute force solution for now, I'll keep you posted.

@NickLucche
Owner

Okay, I can generate the possible component-to-GPU assignments. It works well (in terms of speed) if we cut down the number of assignments considered at each step from a theoretical max of N^4 to something like a random sample of 2 of them (ikr 😕). This is a greedy approach, so we give up on optimality, but I believe it's a fair trade-off.
Furthermore, the max number of models that can be split is limited not only by the amount of combined available VRAM, but also by the number of processes that must handle them (e.g. I took n_cpus*2).

This is probably overkill since I doubt it will be used to generate images on a cluster of 128 A100s, but it might turn out to be useful for some other projects by simply scaling up the random search I've done here.
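If it helps to picture it, the step-wise search is roughly this (illustrative sketch only, with placeholder component sizes; not the actual code in the branch):

```python
# Illustrative sketch of the search described above, not the actual code in the repo.
# At each step we try to place one more full model; instead of enumerating all N^4 ways of
# assigning its 4 components to N GPUs, we only look at a small random sample of candidates.
import os
import random

COMPONENT_MB = {"unet_e": 1600, "unet_d": 1600, "text_embedding": 500, "vae": 350}  # made-up sizes

def place_models(free_mb, max_models, samples_per_step=2, seed=0):
    rng = random.Random(seed)
    free = list(free_mb)      # remaining free memory per GPU
    placements = []           # one {component: gpu_index} dict per placed model
    for _ in range(max_models):
        chosen = None
        for _ in range(samples_per_step):
            candidate = {c: rng.randrange(len(free)) for c in COMPONENT_MB}
            needed = {}
            for comp, gpu in candidate.items():
                needed[gpu] = needed.get(gpu, 0) + COMPONENT_MB[comp]
            if all(free[gpu] >= mb for gpu, mb in needed.items()):
                chosen = candidate
                break
        if chosen is None:    # greedy: give up as soon as a step fails to find a fit
            break
        for comp, gpu in chosen.items():
            free[gpu] -= COMPONENT_MB[comp]
        placements.append(chosen)
    return placements

# cap the number of models by the number of worker processes, e.g. n_cpus * 2
print(place_models([16000, 16000, 8000], max_models=(os.cpu_count() or 1) * 2))
```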

@mchaker
Author

mchaker commented Sep 7, 2022

Okay, I can generate the possible component-to-GPU assignments. It works well (in terms of speed) if we cut down the number of assignments considered at each step from a theoretical max of N^4 to something like a random sample of 2 of them (ikr 😕). This is a greedy approach, so we give up on optimality, but I believe it's a fair trade-off. Furthermore, the max number of models that can be split is limited not only by the amount of combined available VRAM, but also by the number of processes that must handle them (e.g. I took n_cpus*2).

This is probably overkill since I doubt it will be used to generate images on a cluster of 128 A100s, but it might turn out to be useful for some other projects by simply scaling up the random search I've done here.

👏 👏 🥳 WOOO!!! I can't express in plaintext how exciting this is, even if not optimal!

This is a great first step that takes skill to pull off. The broader community may be able to help optimize from here.

128 A100? Not yet, but perhaps if someone makes a job distributor or some kind of kubernetes/distributed scheduler integration for stablediffusion... (looking at myself, maybe)

@mchaker
Author

mchaker commented Sep 7, 2022

@NickLucche Would this help at all?

https://cundy.me/post/blog_post_running_gpt_j_on_several_smaller_gpus/

@aeon3

aeon3 commented Sep 9, 2022

What about setups with NVLink? Does it make it easier to pool memory, or is it the same thing?

@NickLucche
Owner

NVLink looks like a cool idea, but I'm not sure whether it helps with finding the best assignment of the multiple model parts. I should look into that.
Anyway, I only need to add a few minor things here https://github.com/NickLucche/stable-diffusion-nvidia-docker/tree/model-parallel before testing this approach. Should have updates over the weekend.

@aeon3

aeon3 commented Sep 9, 2022

NVLink looks like a cool idea, but I'm not sure whether it helps with finding the best assignment of the multiple model parts. I should look into that. Anyway, I only need to add a few minor things here https://github.com/NickLucche/stable-diffusion-nvidia-docker/tree/model-parallel before testing this approach. Should have updates over the weekend.

Wait, so does this fork of yours make any dual GPU setup behave like Nvlink?
And is there any benefit in running this fork with nvlink compared to any other SD forks that do not have your special multi gpu code?

@NickLucche
Owner

No, not really: this is high-level code (PyTorch level, not NVIDIA firmware) that's specific to this Stable Diffusion model.
It tries to find an optimal way to distribute the (predefined, fixed) model components across multiple GPUs and takes care of moving tensors from one GPU to the next one.

It should support splitting multiple models. I know it may sound confusing, but it's really just Data+Model Parallel.
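To give an idea of the tensor-moving part, each component can be wrapped so that its inputs are pulled onto its own GPU before the forward pass, roughly like this (simplified sketch, not the exact code in the repo):

```python
# Rough illustration of the idea (simplified, not the repo's exact code): wrap each component
# so that incoming tensors are moved onto that component's GPU right before its forward pass.
# Consecutive components can then live on different devices.
import torch
import torch.nn as nn

class ToDeviceWrapper(nn.Module):
    def __init__(self, layer: nn.Module, device: str):
        super().__init__()
        self.layer = layer.to(device)
        self.device = device

    def forward(self, x: torch.Tensor, *args, **kwargs):
        # pull the input onto this component's device, then run the wrapped layer there
        return self.layer(x.to(self.device), *args, **kwargs)

# hypothetical usage: a pipeline split over two GPUs
# stage_a = ToDeviceWrapper(unet_encoder, "cuda:0")
# stage_b = ToDeviceWrapper(unet_decoder, "cuda:1")
# out = stage_b(stage_a(latents))
```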

@aeon3

aeon3 commented Sep 9, 2022

I'm just an artist, it's definitely confusing to me lol. I found this guy talking about multi-GPU: https://youtu.be/hBKcL8fNZ18?list=PLzSRtos7-PQRCskmdrgtMYIt_bKEbMPfD&t=436

No clue if it's helpful at all

@NickLucche
Owner

No worries, thanks for your help! I'll try to make it so that you don't have to worry about how it runs under the hood; hopefully it'll simply work!

@mchaker
Author

mchaker commented Sep 9, 2022

I'm willing to help test things on my hardware pool if you want some help :)

@NickLucche
Owner

I'm willing to help test things on my hardware pool if you want some help :)
I was counting on it, really appreciate your help! 🙏🏻

I have a somewhat stable build that can be tested with:
docker run --name stable-diffusion --gpus all -it -e DEVICES=all -e MODEL_PARALLEL=1 -e TOKEN=<YOUR_TOKEN> -p 7860:7860 nicklucche/stable-diffusion:multi-gpu

I am expecting some bugs here and there, so please report the logs/error that appear in the console!

Current build has some limitations when MODEL_PARALLEL=1 is set (everything else should work as usual when MODEL_PARALLEL is not set):

  • fp16 only, fp32 not supported (yet; I'll add it asap, it's no big deal)
  • nsfw filter is turned off (by default) and can't be turned on
  • single-GPU multiple models is not (yet) supported (so you need at least 2 GPUs to try this version)
  • Maximum GPU memory that the model(s) will take is set to 60% of the free memory; the rest is left for inference. The thing is that as the image size increases the process takes up more memory, so it might crash at higher resolutions (a rough sketch of how this budget could be computed is shown after this list)
  • current version tries to pack as many models as it can on the specified devices; I am thinking about a simpler mode, which the user can turn on, in which we only spread a single model over multiple GPUs
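For reference, the 60% budget mentioned in the fourth point boils down to something like this (rough sketch, assuming a recent PyTorch that exposes torch.cuda.mem_get_info; the actual code may differ):

```python
# Rough sketch of a "use at most 60% of the currently free VRAM" budget per device
# (assumes a recent PyTorch exposing torch.cuda.mem_get_info; actual code may differ).
import torch

def model_budget_mb(device_index: int, fraction: float = 0.6) -> int:
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device_index)
    return int(free_bytes * fraction / 2**20)

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"cuda:{i}: ~{model_budget_mb(i)} MB reserved for model weights")
```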

@NickLucche NickLucche self-assigned this Sep 10, 2022
@mchaker
Author

mchaker commented Sep 10, 2022

Loading model..
Creating and moving model to cuda:3 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:2 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:5 (Tesla P40)..
Creating and moving model to cuda:0 (NVIDIA GeForce RTX 3070 Ti)..
Creating and moving model to cuda:1 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:4 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:6 (Tesla P40)..

I'm excited already! 😄 waiting for the downloads to finish....

@mchaker
Author

mchaker commented Sep 10, 2022

@NickLucche does it matter which noise scheduler is used?

@mchaker
Author

mchaker commented Sep 10, 2022

This is so exciting!

I generated FOUR 512x512 images in the time it used to take me to generate ONE 512x512 image (on a P100)

Now to try 14 images...

51it [00:07,  6.79it/s]
51it [00:18,  2.80it/s]
51it [00:18,  2.79it/s]
51it [00:18,  2.79it/s]

@mchaker
Author

mchaker commented Sep 10, 2022

I think I found a bug! 😄

Hardware environment:

Loading model..
Creating and moving model to cuda:3 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:2 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:5 (Tesla P40)..
Creating and moving model to cuda:0 (NVIDIA GeForce RTX 3070 Ti)..
Creating and moving model to cuda:1 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:4 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:6 (Tesla P40)..

When trying to generate 14 images with the following parameters:

prompt: "Multiple nvidia Tesla GPUs"
number of images: 14
steps: 50
height: 512
width: 512
guidance scale: 7.5
seed: 0/default
NSFW filter unchecked
noise scheduler: PNDM

the first GPU fails because it only has 8GB VRAM, which is fine, whatever.
However, the main bug is that when the first GPU fails, it blocks the rest of the render request from completing -- the other GPUs finish their work, but the first failed GPU process just sits there in an error state... (see the 0it [00:00, ?it/s] at the beginning) and no images appear in the gradio UI (since the job does not complete)

0it [00:00, ?it/s]
Process Process-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/app/utils.py", line 80, in cuda_inference_process
    images: List[Image.Image] = model(prompts, **kwargs)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 137, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/models/unet_2d_condition.py", line 151, in forward
    hidden_states=sample, temb=emb, encoder_hidden_states=encoder_hidden_states
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/models/unet_blocks.py", line 505, in forward
    hidden_states = attn(hidden_states, context=encoder_hidden_states)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/models/attention.py", line 168, in forward
    x = block(x, context=context)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/models/attention.py", line 196, in forward
    x = self.attn1(self.norm1(x)) + x
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/models/attention.py", line 254, in forward
    attn = sim.softmax(dim=-1)
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 7.80 GiB total capacity; 3.10 GiB already allocated; 1.25 GiB free; 5.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
51it [00:32,  1.58it/s]
51it [00:32,  1.58it/s]
51it [00:32,  1.58it/s]
51it [00:32,  1.56it/s]
51it [00:35,  1.45it/s]
51it [00:35,  1.44it/s]

Is there a way to solve that? Perhaps by scaling what is scheduled to fit each card, if VRAM amounts differ by card (which mine do, e.g. the P100 has 16GB VRAM and the P40 has 24GB)?
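To illustrate what I mean by per-card scaling, something like this hypothetical helper (just to convey the idea, not code from the repo):

```python
# Hypothetical helper, not something the repo does today: split a batch of images across
# devices in proportion to their free VRAM, so a 24GB P40 gets more work than a 16GB P100.
def split_by_vram(n_images, free_mb):
    total = sum(free_mb)
    shares = [n_images * mb // total for mb in free_mb]     # proportional share, floored
    leftover = n_images - sum(shares)
    # hand the remainder to the largest cards first
    for i in sorted(range(len(free_mb)), key=lambda i: -free_mb[i])[:leftover]:
        shares[i] += 1
    return shares

# e.g. 14 images over 4x P100 (16GB) and 2x P40 (24GB)
print(split_by_vram(14, [16000] * 4 + [24000] * 2))  # -> [2, 2, 2, 2, 3, 3]
```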

@NickLucche
Owner

Thanks a lot for testing that out so promptly! Nice setup btw 😮

@NickLucche does it matter which noise scheduler is used?

No, you can choose any of the available ones; it shouldn't affect speed noticeably.

However, the main bug is that when the first GPU fails, it blocks the rest of the render request from completing

Yeah, unfortunately that is how it is supposed to work atm: the small GPU can be a bottleneck for the whole system if it's included among the devices. It's not trivial, but I could:

  • move components away from the small GPU "on-the-fly", but that would make generating images painfully slow
  • introduce a bias/preference toward big GPUs when searching for the assignment: this is less trivial to implement but would be by far the best choice; anyway, I can't guarantee that it would work for any amount of images

Anyway, does discarding the small device (by setting -e DEVICES=1,2...) solve your issue in generating 14 images?

@mchaker
Author

mchaker commented Sep 10, 2022

Same parameters as the last test, this time -e DEVICES=1,2,3,4,5,6 when initially starting the container with docker run ...: success!

51it [00:33,  1.54it/s]
51it [00:33,  1.54it/s]
51it [00:33,  1.54it/s]
51it [00:35,  1.43it/s]
51it [00:47,  1.07it/s]
51it [00:50,  1.02it/s]

UNDER A MINUTE WITH PNDM! That's 3.57 sec/image!

Trying it with DDIM:
Doggettx optimization: approximately 56 (P100) to 65 (P40) sec/image
NickLucche parallelism: approximately 5.29 sec/image 🚀

My favorite image generated from this test, lol: TEDLA
image

@mchaker
Author

mchaker commented Sep 10, 2022

Yeah, unfortunately that is how it is supposed to work atm: the small GPU can be a bottleneck for the whole system if it's included among the devices. It's not trivial, but I could:

  • move components away from the small GPU "on-the-fly", but that would make generating images painfully slow
  • introduce a bias/preference toward big GPUs when searching for the assignment: this is less trivial to implement but would be by far the best choice; anyway, I can't guarantee that it would work for any amount of images

How would the second choice handle smaller GPUs in the pool? Does the work need to be split evenly between the GPUs (problematic if the GPUs are not evenly sized)?

@mchaker
Author

mchaker commented Sep 10, 2022

Interesting... PNDM generated GPUs

but DDIM with GPU-parallelism generated.... things like these 🤔 :

image
image

@NickLucche
Owner

UNDER A MINUTE WITH PNDM! That's 3.57 sec/image!

Trying it with DDIM:
Doggettx optimization: approximately 56 (P100) to 65 (P40) sec/image
NickLucche parallelism: approximately 5.29 sec/image 🚀

Sorry for the late reply, thanks a lot for testing out the build and reporting the inference time too, that is super useful!

How would the second choice handle smaller GPUs in the pool? Does the work need to be split evenly between the GPUs (problematic if the GPUs are not evenly sized)?

I was thinking about filling the biggest GPUs first, and placing only the lightest component on the small one.
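Conceptually, something like this (a sketch of the bias idea only, with made-up sizes; not what's in the branch):

```python
# Sketch of the "bias toward big GPUs" idea (made-up component sizes, not the branch code):
# place the heaviest components first, always on the device with the most free memory left,
# so the small card only ever receives the lightest leftover pieces.
COMPONENT_MB = {"unet_e": 1600, "unet_d": 1600, "text_embedding": 500, "vae": 350}

def biased_assignment(free_mb):
    free = dict(enumerate(free_mb))
    assignment = {}
    for name, size in sorted(COMPONENT_MB.items(), key=lambda kv: -kv[1]):
        gpu = max(free, key=free.get)          # biggest remaining GPU first
        if free[gpu] < size:
            return None                        # this model copy doesn't fit anywhere
        assignment[name] = gpu
        free[gpu] -= size
    return assignment

print(biased_assignment([16000, 16000, 8000]))  # the 8GB card is only used as a last resort
```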
Atm tho, I am thinking about adding these features in the upcoming days:

  • fp32 support
  • simpler mode for users that have multiple small GPUs and want to run the model, as originally planned
  • nsfw filter (if it doesn't take up too much time)

Then I'll merge the results into the master branch and update the "stable" version.
We can have other issues to handle other bugs/enhancements.

but DDIM with GPU-parallelism generated... things like these 🤔

Yeah that looks weird, are you getting the same gibberish results with the single-model version (e.g -e DEVICES=1) when switching sampler?

@mchaker
Author

mchaker commented Sep 11, 2022

Sorry for the late reply, thanks a lot for testing out the build and reporting the inference time too, that is super useful!

No worries, glad to help 😄

I was thinking about filling the biggest GPUs first, and placing only the lightest component on the small one. Atm tho, I am thinking about adding these features in the upcoming days:

  • fp32 support
  • simpler mode for users that have multiple small GPUs and want to run the model, as originally planned
  • nsfw filter (if it doesn't take up too much time)

Then I'll merge the results into the master branch and update the "stable" version. We can have other issues to handle other bugs/enhancements.

Yes please! 😃

Yeah that looks weird, are you getting the same gibberish results with the single-model version (e.g -e DEVICES=1) when switching sampler?

I'll try that and report back! Thank you again, so much, for your work on this.

@NickLucche
Owner

Okay, I've added fp32 support and polished up the code a bit. I'll need to check that everything that was working before this change still works, then I'll merge this into the master branch.

@mchaker
Author

mchaker commented Sep 14, 2022

Awesome! Is there anything specific I need to do to use fp32 mode?

Thank you so much

@NickLucche
Owner

The good old -e FP16=0 option should do! Closing this issue now, ahead of the merge.

@mchaker
Author

mchaker commented Sep 17, 2022

@NickLucche what were the next steps after this?
What issues did you want me to make?
What things do you still want me to test? :)

@NickLucche
Owner

Could you re-test the latest image with -e MODEL_PARALLEL=1? I just want to make sure it's working properly on multiple devices without hanging.
Make sure you pull the latest image rather than the one in your cache by adding --pull always to the docker run command. Thanks a lot!

@mchaker
Author

mchaker commented Sep 18, 2022

@NickLucche I'm getting an AssertionError :(

latest: Pulling from nicklucche/stable-diffusion
Digest: sha256:199901bbb2a85da90ff91aecd1ccea899f7f8b8c0b407506740594dee4f280ab
Status: Image is up to date for nicklucche/stable-diffusion:latest
Loading model..
Looking for a valid assignment in which to split model parts to device(s): [2, 3, 4, 5]
Free GPU memory (per device):  [3504, 6365, 6359, 6532]
Search has found that 5 model(s) can be split over 4 device(s)!
Assignments: [{0: 0, 1: 1, 2: 1, 3: 0}, {0: 0, 1: 1, 2: 1, 3: 0}, {0: 0, 1: 1, 2: 1, 3: 0}, {0: 0, 1: 1, 2: 1, 3: 0}, {0: 0, 1: 1, 2: 1, 3: 0}]
Model parallel worker component assignment: {0: 0, 1: 1, 2: 1, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 1, 2: 1, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 1, 2: 1, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 1, 2: 1, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 1, 2: 1, 3: 0}
Creating and moving model parts to respective devices..
Downloading: 100%|████████████████████████████████████████████████████| 1.34k/1.34k [00:00<00:00, 492kB/s]
Downloading: 100%|████████████████████████████████████████████████████| 14.9k/14.9k [00:00<00:00, 206kB/s]
Downloading: 100%|████████████████████████████████████████████████████████| 342/342 [00:00<00:00, 351kB/s]
Downloading: 100%|████████████████████████████████████████████████████████| 543/543 [00:00<00:00, 206kB/s]
Downloading: 100%|███████████████████████████████████████████████████| 4.56k/4.56k [00:00<00:00, 3.82MB/s]
Downloading: 100%|███████████████████████████████████████████████████| 1.22G/1.22G [07:13<00:00, 2.81MB/s]
Downloading: 100%|████████████████████████████████████████████████████████| 209/209 [00:00<00:00, 426kB/s]
Downloading: 100%|████████████████████████████████████████████████████████| 592/592 [00:00<00:00, 328kB/s]
Downloading: 100%|█████████████████████████████████████████████████████| 492M/492M [00:06<00:00, 73.8MB/s]
Downloading: 100%|█████████████████████████████████████████████████████| 525k/525k [00:00<00:00, 1.32MB/s]
Downloading: 100%|████████████████████████████████████████████████████████| 472/472 [00:00<00:00, 475kB/s]
Downloading: 100%|████████████████████████████████████████████████████████| 806/806 [00:00<00:00, 792kB/s]
Downloading: 100%|███████████████████████████████████████████████████| 1.06M/1.06M [00:00<00:00, 2.18MB/s]
Downloading: 100%|████████████████████████████████████████████████████████| 743/743 [00:00<00:00, 687kB/s]
Downloading: 100%|███████████████████████████████████████████████████| 3.44G/3.44G [02:11<00:00, 26.2MB/s]
Downloading: 100%|████████████████████████████████████████████████████| 71.2k/71.2k [00:00<00:00, 443kB/s]
Downloading: 100%|████████████████████████████████████████████████████████| 522/522 [00:00<00:00, 209kB/s]
Downloading: 100%|█████████████████████████████████████████████████████| 335M/335M [00:04<00:00, 71.9MB/s]
Traceback (most recent call last):
  File "server.py", line 9, in <module>
    from main import inference, MP as model_parallel
  File "/app/main.py", line 55, in <module>
    n_procs, devices, model_parallel_assignment=model_ass, **kwargs
  File "/app/parallel.py", line 149, in from_pretrained
    assert d
AssertionError

@mchaker
Author

mchaker commented Sep 18, 2022

Also, is there a way to mount a cache folder for pip to download its packages to? Downloading 4GB+ every time I docker run is... slow 😅

@mchaker
Author

mchaker commented Sep 18, 2022

I also noticed that GPUs 0 and 1 are used (some conda python process) even though I specified GPUs 2, 3, 4, 5?

(note below: the /usr/bin/python3 processes on GPUs 0, 1, 6, 7 are expected from another application... and the 4.5-5GB python3 processes on GPUs 2, 3, 4, 5 are expected from another application.)

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11169      C   /usr/bin/python3                 6543MiB |
|    0   N/A  N/A     11170      C   /usr/bin/python3                 5245MiB |
|    0   N/A  N/A    458266      C   /opt/conda/bin/python3           2981MiB |
|    0   N/A  N/A    458278      C   /opt/conda/bin/python3           2981MiB |
|    1   N/A  N/A     11175      C   /usr/bin/python3                  897MiB |
|    1   N/A  N/A     11178      C   /usr/bin/python3                  897MiB |
|    1   N/A  N/A     11179      C   /usr/bin/python3                  897MiB |
|    1   N/A  N/A    458266      C   /opt/conda/bin/python3           2311MiB |
|    1   N/A  N/A    458278      C   /opt/conda/bin/python3           2311MiB |
|    2   N/A  N/A    403173      C   python3                          5105MiB |
|    2   N/A  N/A    458115      C   python3                           565MiB |
|    3   N/A  N/A    164703      C   python3                          5115MiB |
|    3   N/A  N/A    458115      C   python3                           565MiB |
|    4   N/A  N/A    164949      C   python3                          4827MiB |
|    4   N/A  N/A    458115      C   python3                           565MiB |
|    5   N/A  N/A    165170      C   python3                          4827MiB |
|    5   N/A  N/A    458115      C   python3                           565MiB |
|    6   N/A  N/A     11171      C   /usr/bin/python3                 1607MiB |
|    7   N/A  N/A     11172      C   /usr/bin/python3                 1615MiB |
+-----------------------------------------------------------------------------+

@mchaker
Author

mchaker commented Sep 19, 2022

More errors: tried on a pair of smaller cards:

To create a public link, set `share=True` in `launch()`.
0it [00:01, ?it/s]
Process Process-2:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/app/parallel.py", line 90, in cuda_inference_process
    images: List[Image.Image] = model(prompts, **kwargs)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 137, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/models/unet_2d_condition.py", line 143, in forward
    sample = self.conv_in(sample)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/utils.py", line 103, in forward
    y = self.layer(x.to(self.device), *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
0it [00:01, ?it/s]
Process Process-3:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/app/parallel.py", line 90, in cuda_inference_process
    images: List[Image.Image] = model(prompts, **kwargs)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 137, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/models/unet_2d_condition.py", line 143, in forward
    sample = self.conv_in(sample)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/utils.py", line 103, in forward
    y = self.layer(x.to(self.device), *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
0it [00:01, ?it/s]
Process Process-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/app/parallel.py", line 90, in cuda_inference_process
    images: List[Image.Image] = model(prompts, **kwargs)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 137, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/models/unet_2d_condition.py", line 143, in forward
    sample = self.conv_in(sample)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/utils.py", line 103, in forward
    y = self.layer(x.to(self.device), *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
0it [00:01, ?it/s]
Process Process-4:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/app/parallel.py", line 90, in cuda_inference_process
    images: List[Image.Image] = model(prompts, **kwargs)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 137, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/models/unet_2d_condition.py", line 143, in forward
    sample = self.conv_in(sample)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/utils.py", line 103, in forward
    y = self.layer(x.to(self.device), *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

@NickLucche
Owner

Thanks a lot for testing!
I'll re-open the issue until we fix this feature; I'll get a multi-GPU AWS instance so I can test that too.

Also, is there a way to mount a cache folder for pip to download its packages to? Downloading 4GB+ every time I docker run is... slow

Good point, I'll add a section in the Readme; it relates to #10

I also noticed that GPUs 0 and 1 are used (some conda python process) even though I specified GPUs 2, 3, 4, 5?

I'll look into that too, perhaps some leftover hanging processes..?

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

This looks like a driver error; we can open another issue for that, with info on the specs of the cards.

@NickLucche NickLucche reopened this Sep 19, 2022
@mchaker
Author

mchaker commented Sep 19, 2022

Thanks for the link! I'll add a volume for /root/.cache/.
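Probably something along these lines (untested; this assumes the container keeps its download cache under /root/.cache):

docker run --pull always --name stable-diffusion --gpus all -it -v $HOME/sd-cache:/root/.cache -e DEVICES=2,3,4,5 -e MODEL_PARALLEL=1 -e TOKEN=<YOUR_TOKEN> -p 7860:7860 nicklucche/stable-diffusion:multi-gpu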

Do you want me to open issues for "leftover hanging processes" and "Unable to find valid cuDNN algorithm"?

(perhaps a missing python dependency?)

@mchaker
Author

mchaker commented Sep 19, 2022

Also, tried running again - another error:
CUBLAS this time

I think missing dependencies?

Running on local URL:  http://localhost:7860/

To create a public link, set `share=True` in `launch()`.
51it [00:05,  8.74it/s]
Attempting to cast a BatchFeature to type None. This is not supported.
Process Process-4:
Process Process-3:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/app/parallel.py", line 90, in cuda_inference_process
    images: List[Image.Image] = model(prompts, **kwargs)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 82, in __call__
    text_embeddings = self.text_encoder(text_input.input_ids.to(self.device))[0]
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/parallel.py", line 90, in cuda_inference_process
    images: List[Image.Image] = model(prompts, **kwargs)["sample"]
  File "/app/utils.py", line 103, in forward
    y = self.layer(x.to(self.device), *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 82, in __call__
    text_embeddings = self.text_encoder(text_input.input_ids.to(self.device))[0]
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 734, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/utils.py", line 103, in forward
    y = self.layer(x.to(self.device), *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 655, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 734, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 582, in forward
    output_attentions=output_attentions,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 655, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 325, in forward
    output_attentions=output_attentions,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 582, in forward
    output_attentions=output_attentions,
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 210, in forward
    query_states = self.q_proj(hidden_states) * self.scale
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 325, in forward
    output_attentions=output_attentions,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 210, in forward
    query_states = self.q_proj(hidden_states) * self.scale
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

@NickLucche
Owner

I think missing dependencies?

Yeah, you're definitely missing some drivers for the cards you're trying to use. I suggest you first install cuDNN and run some example code on the new GPUs; this "hello world" container from nvidia may help with that:

docker run --rm --gpus <GPU_NUMBER_HERE> nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

@mchaker
Author

mchaker commented Sep 20, 2022

Looks fine to me:

The cards I'm trying to test are a 3070 and 3070 Ti

$ docker run --rm --gpus '"device=6,7"' nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
Unable to find image 'nvidia/cuda:11.0.3-base-ubuntu20.04' locally
11.0.3-base-ubuntu20.04: Pulling from nvidia/cuda
d7bfe07ed847: Already exists 
75eccf561042: Pull complete 
191419884744: Pull complete 
a17a942db7e1: Pull complete 
16156c70987f: Pull complete 
Digest: sha256:57455121f3393b7ed9e5a0bc2b046f57ee7187ea9ec562a7d17bf8c97174040d
Status: Downloaded newer image for nvidia/cuda:11.0.3-base-ubuntu20.04
Tue Sep 20 13:19:37 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:8B:00.0 Off |                  N/A |
|  0%   40C    P0    46W / 240W |      0MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:C1:00.0 Off |                  N/A |
| 45%   33C    P0    70W / 310W |      0MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@huotarih

huotarih commented Oct 1, 2022

Hey, just found this thread! Great looking stuff!
I put up a g4dn.12xlarge instance with 4 T4's, tried a command but ended up with AssertionError :/

[ec2-user@ip ~]$ docker run --name stable-diffusion --gpus all -it -e DEVICES=0,1,2,3 -e MODEL_PARALLEL=1 -e TOKEN=token -p 7860:7860 nicklucche/stable-diffusion:multi-gpu
Loading model..
Looking for a valid assignment in which to split model parts to device(s): [0, 1, 2, 3]
Free GPU memory (per device): [8665, 8665, 8665, 8665]
Search has found that 17 model(s) can be split over 4 device(s)!
Assignments: [{0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, ...] (17 identical entries)
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
(the two lines above repeat once per worker, 17 times in total)
Downloading: 100% 1.34k/1.34k [00:00<00:00, 739kB/s]
Downloading: 100% 12.5k/12.5k [00:00<00:00, 12.9MB/s]
Downloading: 100% 342/342 [00:00<00:00, 182kB/s]
Downloading: 100% 543/543 [00:00<00:00, 307kB/s]
Downloading: 100% 4.63k/4.63k [00:00<00:00, 2.48MB/s]
Downloading: 100% 608M/608M [00:07<00:00, 77.8MB/s]
Downloading: 100% 209/209 [00:00<00:00, 117kB/s]
Downloading: 100% 209/209 [00:00<00:00, 122kB/s]
Downloading: 100% 572/572 [00:00<00:00, 317kB/s]
Downloading: 100% 246M/246M [00:03<00:00, 72.5MB/s]
Downloading: 100% 525k/525k [00:00<00:00, 58.8MB/s]
Downloading: 100% 472/472 [00:00<00:00, 563kB/s]
Downloading: 100% 788/788 [00:00<00:00, 1.07MB/s]
Downloading: 100% 1.06M/1.06M [00:00<00:00, 62.3MB/s]
Downloading: 100% 772/772 [00:00<00:00, 1.07MB/s]
Downloading: 100% 1.72G/1.72G [00:22<00:00, 75.2MB/s]
Downloading: 100% 71.2k/71.2k [00:00<00:00, 37.7MB/s]
Downloading: 100% 550/550 [00:00<00:00, 300kB/s]
Downloading: 100% 167M/167M [00:02<00:00, 74.1MB/s]
Traceback (most recent call last):
  File "server.py", line 9, in <module>
    from main import inference, MP as model_parallel
  File "/app/main.py", line 55, in <module>
    n_procs, devices, model_parallel_assignment=model_ass, **kwargs
  File "/app/parallel.py", line 149, in from_pretrained
    assert d
AssertionError

It's loading a lot of models, 17 in fact. Might that be the culprit?

Anyways, if I can participate in testing or help in any way, I'm here to do so :)
Also wondering why it says only 8665MB of free memory when nvidia-smi told me I had 15360MiB per GPU free just before that.

@mchaker
Author

mchaker commented Oct 2, 2022

That makes 2 of us! Oh no :(

@NickLucche
Owner

Hi @huotarih , thanks a lot for reporting this bug!
I have some trouble developing on a multi-GPU system, as I'd also need to get something set up on the cloud, but I'll look into that asap!
Would you mind opening a separate issue for this bug?

I'll also ask you to re-test the fixed version, if that's ok with you :)

Also wondering why it says only 8665MB of free memory when nvidia-smi told me I had 15360MiB per GPU free just before that.

Good point: currently I only take 60% of the GPU's free memory to instantiate the model(s). That is because generating one or more images requires a substantial amount of free memory, which is only occupied when you actually send the input to the network. 60% is a conservative threshold, as the memory needed varies with the requested image output; I am still unsure how to properly explain that to the user.

@rbychn

rbychn commented May 24, 2023

Is there a way to get this working on Automatic1111? A single image generation job on multiple GPUs at once?

@NickLucche
Owner

Yes, but it needs to be a separate contribution to the Automatic1111 repo.

@MiTereKun

Recently, NVLink also appeared in our cloud; it doesn’t work out of the box
image
We are waiting for implementation

@NickLucche
Owner

I am closing this issue as it has been stale for a while. Inference on multiple GPUs is implemented here (readme); I will come back to supporting heterogeneous GPU setups when I have more time and resources (e.g. a test environment with different GPUs, for one). All PRs are welcome :)

@rbychn

rbychn commented May 4, 2024

What's the state of this in 2024? Any plans on getting this to work with ComfyUI?
