
Pooling VRAM #8

Closed
mchaker opened this issue Sep 3, 2022 · 52 comments
@mchaker

mchaker commented Sep 3, 2022

Great thanks, at this point I think we can close this issue and make another one for the pooled VRAM!

Originally posted by @NickLucche in #5 (comment)

I would like to be able to pool resources (VRAM) from the multiple cards I have installed into one pool. For example,

I have 4x NVIDIA P100 cards installed. I want to combine them all (16GB VRAM each) into one 64GB pool, so that complicated or high-resolution images aren't constrained by the 16GB limit of a single card.

This also would be useful for people with multiple 4GB VRAM consumer/hobbyist cards to reach workable amounts of VRAM without buying enterprise GPUs.

@NickLucche NickLucche added enhancement New feature or request help wanted Extra attention is needed labels Sep 5, 2022
@NickLucche
Owner

Looking into that!

@NickLucche
Owner

NickLucche commented Sep 6, 2022

My idea was the following:
the model gets split into 4 main components: $\text{unet}_e$, $\text{unet}_d$, $\text{text-embedding}$ and $\text{vae}$.
These 4 components must be distributed over $N$ GPUs, and possibly replicated more than once, so that you can run multiple models split over multiple devices (this is like combining data and model parallelism).

I figured I need to do something like this
$x_1 \text{unet}_e + y_1 \text{unet}_d + z_1 \text{text-embedding} + k_1 \text{vae} \leq G_1$.
...
$x_N \text{unet}_e + y_N \text{unet}_d + z_N \text{text-embedding} + k_N \text{vae} \leq G_N$.

with $\sum_i x_i = \sum_i y_i = \sum_i z_i = \sum_i k_i$,
$x, y, z, k \geq 0$,
$x, y, z, k \in \mathbb{Z}^N$

where $G_i$ is the memory capacity of GPU $i$, and each component symbol (e.g. $\text{unet}_e$) stands for the memory required by that component.
This looks like an ILP problem to me. Unfortunately, I don't know how to solve it.
The alternative is to use some other greedy approach (start placing components on GPUs where there's enough memory and see where it goes) or brute force, generating all possible combinations.
This may be overkill for the purposes of this project, so I'll think about it some more and come up with a feasible idea.
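For reference, that formulation can be handed almost verbatim to an off-the-shelf solver. Here is a minimal sketch with PuLP, using made-up component sizes and GPU capacities (this is not code from this repo):

```python
# Hypothetical sketch, not part of this repo: the assignment above expressed as an ILP
# and handed to PuLP/CBC. Component sizes and free GPU memory are made-up placeholders.
import pulp

components = {"unet_e": 1600, "unet_d": 1600, "text_embedding": 500, "vae": 350}  # MB, made up
gpu_free_mb = [16000, 16000, 8000]                                                # MB, made up

prob = pulp.LpProblem("component_assignment", pulp.LpMaximize)

# x[i][c] = number of copies of component c placed on GPU i
x = pulp.LpVariable.dicts("x", (range(len(gpu_free_mb)), list(components)), lowBound=0, cat="Integer")
# number of complete models (one copy of every component) we manage to place
n_models = pulp.LpVariable("n_models", lowBound=0, cat="Integer")

# per-GPU capacity constraint: components placed on GPU i must fit in G_i
for i, cap in enumerate(gpu_free_mb):
    prob += pulp.lpSum(x[i][c] * components[c] for c in components) <= cap

# every component must be replicated the same number of times, i.e. only complete models count
for c in components:
    prob += pulp.lpSum(x[i][c] for i in range(len(gpu_free_mb))) == n_models

prob += n_models  # objective: fit as many complete models as possible
prob.solve(pulp.PULP_CBC_CMD(msg=0))

print("complete models placed:", int(n_models.value()))
for i in range(len(gpu_free_mb)):
    print(f"GPU {i}:", {c: int(x[i][c].value()) for c in components})
```

Maximizing the number of complete models placed is just one possible objective; the capacity constraints mirror the inequalities above.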

@mchaker
Author

mchaker commented Sep 6, 2022

Thank you so much for this insight! It sounds like an OS scheduling problem, hmm...
Would it be possible to solve the ILP problems/optimization problem using GPUs?
They seem ideal for the task.
(I don't mean to be facetious -- I mean that given we are already working with GPUs, perhaps we can use them as part of a startup calculation, then unload the calculation once results are found, and load in the stablediffusion models etc?)

I found a few resources (but I don't understand them fully):

@NickLucche
Owner

Thanks for your help, I see your point and that would definitely come in handy, but atm the scale of the problem doesn't scare me enough to turn to GPU computing: I think $N=128$ GPUs is about the biggest use-case we can have, which is only $N \times 4$ variables.
I'm mostly concerned with figuring out whether my formulation is correct, or whether there's something much simpler that can be used to do this, perhaps with reference to some similar work..?

@mchaker
Author

mchaker commented Sep 6, 2022

If you're using the huggingface diffusers library, would using huggingface accelerate work?

Or is that only for training models, and not executing them?

@NickLucche
Owner

Yep, during training you have to keep weight updates synchronized, so it makes sense to use a framework there.

I'll go with the brute force solution for now, I'll keep you posted.

@NickLucche
Owner

Okay, I can generate the possible component-to-GPU assignments. It works well (in terms of speed) if we cut down the number of assignments considered at each step from a theoretical max of N^4 to something like a random sample of 2 of them (ikr 😕). This is a greedy approach, so we give up on optimality, but I believe it's a fair trade-off.
Furthermore, the max number of models that can be split is limited not only by the amount of combined available VRAM, but also by the number of processes that must handle them (e.g. I took n_cpus*2).

This is probably overkill since I doubt it will be used to generate images on a cluster of 128 A100s, but it might turn out to be useful for some other projects by simply scaling up the random search I've done here.
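If it helps to picture it, the step-wise search is roughly this (illustrative sketch only, with placeholder component sizes; not the actual code in the branch):

```python
# Illustrative sketch of the search described above, not the actual code in the repo.
# At each step we try to place one more full model; instead of enumerating all N^4 ways of
# assigning its 4 components to N GPUs, we only look at a small random sample of candidates.
import os
import random

COMPONENT_MB = {"unet_e": 1600, "unet_d": 1600, "text_embedding": 500, "vae": 350}  # made-up sizes

def place_models(free_mb, max_models, samples_per_step=2, seed=0):
    rng = random.Random(seed)
    free = list(free_mb)      # remaining free memory per GPU
    placements = []           # one {component: gpu_index} dict per placed model
    for _ in range(max_models):
        chosen = None
        for _ in range(samples_per_step):
            candidate = {c: rng.randrange(len(free)) for c in COMPONENT_MB}
            needed = {}
            for comp, gpu in candidate.items():
                needed[gpu] = needed.get(gpu, 0) + COMPONENT_MB[comp]
            if all(free[gpu] >= mb for gpu, mb in needed.items()):
                chosen = candidate
                break
        if chosen is None:    # greedy: give up as soon as a step fails to find a fit
            break
        for comp, gpu in chosen.items():
            free[gpu] -= COMPONENT_MB[comp]
        placements.append(chosen)
    return placements

# cap the number of models by the number of worker processes, e.g. n_cpus * 2
print(place_models([16000, 16000, 8000], max_models=(os.cpu_count() or 1) * 2))
```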

@mchaker
Author

mchaker commented Sep 7, 2022

Okay, I can generate the possible component-to-GPU assignments. It works well (in terms of speed) if we cut down the number of assignments considered at each step from a theoretical max of N^4 to something like a random sample of 2 of them (ikr 😕). This is a greedy approach, so we give up on optimality, but I believe it's a fair trade-off. Furthermore, the max number of models that can be split is limited not only by the amount of combined available VRAM, but also by the number of processes that must handle them (e.g. I took n_cpus*2).

This is probably overkill since I doubt it will be used to generate images on a cluster of 128 A100s, but it might turn out to be useful for some other projects by simply scaling up the random search I've done here.

👏 👏 🥳 WOOO!!! I can't express in plaintext how exciting this is, even if not optimal!

This is a great first step that takes skill to pull off. The broader community may be able to help optimize from here.

128 A100? Not yet, but perhaps if someone makes a job distributor or some kind of kubernetes/distributed scheduler integration for stablediffusion... (looking at myself, maybe)

@mchaker
Author

mchaker commented Sep 7, 2022

@NickLucche Would this help at all?

https://cundy.me/post/blog_post_running_gpt_j_on_several_smaller_gpus/

@aeon3

aeon3 commented Sep 9, 2022

What about setups with NVLink? Does it make it easier to pool memory, or is it the same thing?

@NickLucche
Owner

NVLink looks like a cool idea, but I'm not sure whether it helps with finding the best assignment of the multiple model parts. I should look into that.
Anyway, I only need to add a few minor things here https://github.com/NickLucche/stable-diffusion-nvidia-docker/tree/model-parallel before testing this approach. Should have updates over the weekend.

@aeon3

aeon3 commented Sep 9, 2022

NVLink looks like a cool idea, but I'm not sure whether it helps with finding the best assignment of the multiple model parts. I should look into that. Anyway, I only need to add a few minor things here https://github.com/NickLucche/stable-diffusion-nvidia-docker/tree/model-parallel before testing this approach. Should have updates over the weekend.

Wait, so does this fork of yours make any dual GPU setup behave like Nvlink?
And is there any benefit in running this fork with nvlink compared to any other SD forks that do not have your special multi gpu code?

@NickLucche
Owner

No, not really: this is high-level code (PyTorch level, not NVIDIA firmware) that's specific to this Stable Diffusion model.
It tries to find an optimal way to distribute the (predefined, fixed) model components across multiple GPUs and takes care of moving tensors from one GPU to the next one.

It should support splitting multiple models. I know it may sound confusing, but it's really just Data+Model Parallel.
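To give an idea of the tensor-moving part, each component can be wrapped so that its inputs are pulled onto its own GPU before the forward pass, roughly like this (simplified sketch, not the exact code in the repo):

```python
# Rough illustration of the idea (simplified, not the repo's exact code): wrap each component
# so that incoming tensors are moved onto that component's GPU right before its forward pass.
# Consecutive components can then live on different devices.
import torch
import torch.nn as nn

class ToDeviceWrapper(nn.Module):
    def __init__(self, layer: nn.Module, device: str):
        super().__init__()
        self.layer = layer.to(device)
        self.device = device

    def forward(self, x: torch.Tensor, *args, **kwargs):
        # pull the input onto this component's device, then run the wrapped layer there
        return self.layer(x.to(self.device), *args, **kwargs)

# hypothetical usage: a pipeline split over two GPUs
# stage_a = ToDeviceWrapper(unet_encoder, "cuda:0")
# stage_b = ToDeviceWrapper(unet_decoder, "cuda:1")
# out = stage_b(stage_a(latents))
```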

@aeon3

aeon3 commented Sep 9, 2022

I'm just an artist, it's definitely confusing to me lol. I found this guy talking about multi-GPU: https://youtu.be/hBKcL8fNZ18?list=PLzSRtos7-PQRCskmdrgtMYIt_bKEbMPfD&t=436

No clue if it's helpful at all

@NickLucche
Owner

No worries, thanks for your help! I'll try to make it so that you don't have to worry about how it runs under the hood; hopefully it'll simply work!

@mchaker
Author

mchaker commented Sep 9, 2022

I'm willing to help test things on my hardware pool if you want some help :)

@NickLucche
Owner

I'm willing to help test things on my hardware pool if you want some help :)
I was counting on it, really appreciate your help! 🙏🏻

I have a somewhat stable build that can be tested with:
docker run --name stable-diffusion --gpus all -it -e DEVICES=all -e MODEL_PARALLEL=1 -e TOKEN=<YOUR_TOKEN> -p 7860:7860 nicklucche/stable-diffusion:multi-gpu

I am expecting some bugs here and there, so please report the logs/error that appear in the console!

Current build has some limitations when MODEL_PARALLEL=1 is set (everything else should work as usual when MODEL_PARALLEL is not set):

  • fp16 only, fp32 not supported (yet; I'll add it asap, it's no big deal)
  • nsfw filter is turned off (by default) and can't be turned on
  • single-GPU multiple models is not (yet) supported (so you need at least 2 GPUs to try this version)
  • Maximum GPU memory that the model(s) will take is set to 60% of the free memory; the rest is left for inference. The thing is that as the image size increases the process takes up more memory, so it might crash at higher resolutions (a rough sketch of how this budget could be computed is shown after this list)
  • current version tries to pack as many models as it can on the specified devices; I am thinking about a simpler mode, which the user can turn on, in which we only spread a single model over multiple GPUs
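For reference, the 60% budget mentioned in the fourth point boils down to something like this (rough sketch, assuming a recent PyTorch that exposes torch.cuda.mem_get_info; the actual code may differ):

```python
# Rough sketch of a "use at most 60% of the currently free VRAM" budget per device
# (assumes a recent PyTorch exposing torch.cuda.mem_get_info; actual code may differ).
import torch

def model_budget_mb(device_index: int, fraction: float = 0.6) -> int:
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device_index)
    return int(free_bytes * fraction / 2**20)

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"cuda:{i}: ~{model_budget_mb(i)} MB reserved for model weights")
```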

@NickLucche NickLucche self-assigned this Sep 10, 2022
@mchaker
Author

mchaker commented Sep 10, 2022

Loading model..
Creating and moving model to cuda:3 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:2 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:5 (Tesla P40)..
Creating and moving model to cuda:0 (NVIDIA GeForce RTX 3070 Ti)..
Creating and moving model to cuda:1 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:4 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:6 (Tesla P40)..

I'm excited already! 😄 waiting for the downloads to finish....

@mchaker
Author

mchaker commented Sep 10, 2022

@NickLucche does it matter which noise scheduler is used?

@mchaker
Author

mchaker commented Sep 10, 2022

This is so exciting!

I generated FOUR 512x512 images in the time it used to take me to generate ONE 512x512 image (on a P100)

Now to try 14 images...

51it [00:07,  6.79it/s]
51it [00:18,  2.80it/s]
51it [00:18,  2.79it/s]
51it [00:18,  2.79it/s]

@mchaker
Author

mchaker commented Sep 10, 2022

I think I found a bug! 😄

Hardware environment:

Loading model..
Creating and moving model to cuda:3 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:2 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:5 (Tesla P40)..
Creating and moving model to cuda:0 (NVIDIA GeForce RTX 3070 Ti)..
Creating and moving model to cuda:1 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:4 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:6 (Tesla P40)..

When trying to generate 14 images with the following parameters:

prompt: "Multiple nvidia Tesla GPUs"
number of images: 14
steps: 50
height: 512
width: 512
guidance scale: 7.5
seed: 0/default
NSFW filter unchecked
noise scheduler: PNDM

the first GPU fails because it only has 8GB VRAM, which is fine, whatever.
However, the main bug is that when the first GPU fails, it blocks the rest of the render request from completing -- the other GPUs finish their work, but the first failed GPU process just sits there in an error state... (see the 0it [00:00, ?it/s] at the beginning) and no images appear in the gradio UI (since the job does not complete)

0it [00:00, ?it/s]
Process Process-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/app/utils.py", line 80, in cuda_inference_process
    images: List[Image.Image] = model(prompts, **kwargs)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 137, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/models/unet_2d_condition.py", line 151, in forward
    hidden_states=sample, temb=emb, encoder_hidden_states=encoder_hidden_states
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/models/unet_blocks.py", line 505, in forward
    hidden_states = attn(hidden_states, context=encoder_hidden_states)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/models/attention.py", line 168, in forward
    x = block(x, context=context)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/models/attention.py", line 196, in forward
    x = self.attn1(self.norm1(x)) + x
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/models/attention.py", line 254, in forward
    attn = sim.softmax(dim=-1)
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 7.80 GiB total capacity; 3.10 GiB already allocated; 1.25 GiB free; 5.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
51it [00:32,  1.58it/s]
51it [00:32,  1.58it/s]
51it [00:32,  1.58it/s]
51it [00:32,  1.56it/s]
51it [00:35,  1.45it/s]
51it [00:35,  1.44it/s]

Is there a way to solve that? Perhaps by scaling what is scheduled to fit each card, if VRAM amounts differ by card (which mine do, e.g. the P100 has 16GB VRAM and the P40 has 24GB)?
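To illustrate what I mean by per-card scaling, something like this hypothetical helper (just to convey the idea, not code from the repo):

```python
# Hypothetical helper, not something the repo does today: split a batch of images across
# devices in proportion to their free VRAM, so a 24GB P40 gets more work than a 16GB P100.
def split_by_vram(n_images, free_mb):
    total = sum(free_mb)
    shares = [n_images * mb // total for mb in free_mb]     # proportional share, floored
    leftover = n_images - sum(shares)
    # hand the remainder to the largest cards first
    for i in sorted(range(len(free_mb)), key=lambda i: -free_mb[i])[:leftover]:
        shares[i] += 1
    return shares

# e.g. 14 images over 4x P100 (16GB) and 2x P40 (24GB)
print(split_by_vram(14, [16000] * 4 + [24000] * 2))  # -> [2, 2, 2, 2, 3, 3]
```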

@NickLucche
Owner

Thanks a lot for testing that out so promptly! Nice setup btw 😮

@NickLucche does it matter which noise scheduler is used?

No, you can choose any of the available ones; it shouldn't affect speed noticeably.

However, the main bug is that when the first GPU fails, it blocks the rest of the render request from completing

Yeah, unfortunately that is how it is supposed to work atm: the small GPU can be a bottleneck for the whole system if it's included among the devices. It's not trivial, but I could:

  • move components away from the small GPU "on-the-fly", but that would make generating images painfully slow
  • introduce a bias/preference toward big GPUs when searching for the assignment: this is less trivial to implement but would be by far the best choice; anyway, I can't guarantee that it would work for any amount of images

Anyway, does discarding the small device (by setting -e DEVICES=1,2...) solve your issue in generating 14 images?

@mchaker
Author

mchaker commented Sep 10, 2022

Same parameters as the last test, this time -e DEVICES=1,2,3,4,5,6 when initially starting the container with docker run ...: success!

51it [00:33,  1.54it/s]
51it [00:33,  1.54it/s]
51it [00:33,  1.54it/s]
51it [00:35,  1.43it/s]
51it [00:47,  1.07it/s]
51it [00:50,  1.02it/s]

UNDER A MINUTE WITH PNDM! That's 3.57 sec/image!

Trying it with DDIM:
Doggettx optimization: approximately 56 (P100) to 65 (P40) sec/image
NickLucche parallelism: approximately 5.29 sec/image 🚀

My favorite image generated from this test, lol: TEDLA
image

@mchaker
Author

mchaker commented Sep 10, 2022

Yeah, unfortunately that is how it is supposed to work atm: the small GPU can be a bottleneck for the whole system if it's included among the devices. It's not trivial, but I could:

  • move components away from the small GPU "on-the-fly", but that would make generating images painfully slow
  • introduce a bias/preference toward big GPUs when searching for the assignment: this is less trivial to implement but would be by far the best choice; anyway, I can't guarantee that it would work for any amount of images

How would the second choice handle smaller GPUs in the pool? Does the work need to be split evenly between the GPUs (problematic if the GPUs are not evenly sized)?

@mchaker
Author

mchaker commented Sep 10, 2022

Interesting... PNDM generated GPUs

but DDIM with GPU-parallelism generated.... things like these 🤔 :

image
image

@NickLucche
Owner

UNDER A MINUTE WITH PNDM! That's 3.57 sec/image!

Trying it with DDIM:
Doggettx optimization: approximately 56 (P100) to 65 (P40) sec/image
NickLucche parallelism: approximately 5.29 sec/image 🚀

Sorry for the late reply, thanks a lot for testing out the build and reporting the inference time too, that is super useful!

How would the second choice handle smaller GPUs in the pool? Does the work need to be split evenly between the GPUs (problematic if the GPUs are not evenly sized)?

I was thinking about filling the biggest GPUs first, and placing only the lightest component on the small one.
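Conceptually, something like this (a sketch of the bias idea only, with made-up sizes; not what's in the branch):

```python
# Sketch of the "bias toward big GPUs" idea (made-up component sizes, not the branch code):
# place the heaviest components first, always on the device with the most free memory left,
# so the small card only ever receives the lightest leftover pieces.
COMPONENT_MB = {"unet_e": 1600, "unet_d": 1600, "text_embedding": 500, "vae": 350}

def biased_assignment(free_mb):
    free = dict(enumerate(free_mb))
    assignment = {}
    for name, size in sorted(COMPONENT_MB.items(), key=lambda kv: -kv[1]):
        gpu = max(free, key=free.get)          # biggest remaining GPU first
        if free[gpu] < size:
            return None                        # this model copy doesn't fit anywhere
        assignment[name] = gpu
        free[gpu] -= size
    return assignment

print(biased_assignment([16000, 16000, 8000]))  # the 8GB card is only used as a last resort
```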
Atm tho, I am thinking about adding these features in the upcoming days:

  • fp32 support
  • simpler mode for users that have multiple small GPUs and want to run the model, as originally planned
  • nsfw filter (if it doesn't take up too much time)

Then I'll merge the results into the master branch and update the "stable" version.
We can have other issues to handle other bugs/enhancements.

but DDIM with GPU-parallelism generated... things like these 🤔

Yeah that looks weird, are you getting the same gibberish results with the single-model version (e.g -e DEVICES=1) when switching sampler?

@mchaker
Author

mchaker commented Sep 11, 2022

Sorry for the late reply, thanks a lot for testing out the build and reporting the inference time too, that is super useful!

No worries, glad to help 😄

I was thinking about filling the biggest GPUs first, and placing only the lightest component on the small one. Atm tho, I am thinking about adding these features in the upcoming days:

  • fp32 support
  • simpler mode for users that have multiple small GPUs and want to run the model, as originally planned
  • nsfw filter (if it doesn't take up too much time)

Then I'll merge the results into the master branch and update the "stable" version. We can have other issues to handle other bugs/enhancements.

Yes please! 😃

Yeah that looks weird, are you getting the same gibberish results with the single-model version (e.g -e DEVICES=1) when switching sampler?

I'll try that and report back! Thank you again, so much, for your work on this.

@NickLucche
Owner

Okay, I've added fp32 support and polished up the code a bit. I'll need to check that everything that was working before this change still works, then I'll merge this into the master branch.

@mchaker
Author

mchaker commented Sep 14, 2022

Awesome! Is there anything specific I need to do to use fp32 mode?

Thank you so much

@NickLucche
Owner

The good old -e FP16=0 option should do! Closing this issue now, ahead of the merge.

@mchaker
Author

mchaker commented Sep 17, 2022

@NickLucche what were the next steps after this?
What issues did you want me to make?
What things do you still want me to test? :)

@NickLucche
Owner

Could you re-test the latest image with -e MODEL_PARALLEL=1? I just want to make sure it's working properly on multiple devices without hanging.
Make sure you pull the latest image rather than the one in your cache by adding --pull always to the docker run command. Thanks a lot!

@mchaker
Author

mchaker commented Sep 18, 2022

@NickLucche I'm getting an AssertionError :(

latest: Pulling from nicklucche/stable-diffusion
Digest: sha256:199901bbb2a85da90ff91aecd1ccea899f7f8b8c0b407506740594dee4f280ab
Status: Image is up to date for nicklucche/stable-diffusion:latest
Loading model..
Looking for a valid assignment in which to split model parts to device(s): [2, 3, 4, 5]
Free GPU memory (per device):  [3504, 6365, 6359, 6532]
Search has found that 5 model(s) can be split over 4 device(s)!
Assignments: [{0: 0, 1: 1, 2: 1, 3: 0}, {0: 0, 1: 1, 2: 1, 3: 0}, {0: 0, 1: 1, 2: 1, 3: 0}, {0: 0, 1: 1, 2: 1, 3: 0}, {0: 0, 1: 1, 2: 1, 3: 0}]
Model parallel worker component assignment: {0: 0, 1: 1, 2: 1, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 1, 2: 1, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 1, 2: 1, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 1, 2: 1, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 1, 2: 1, 3: 0}
Creating and moving model parts to respective devices..
Downloading: 100%|████████████████████████████████████████████████████| 1.34k/1.34k [00:00<00:00, 492kB/s]
Downloading: 100%|████████████████████████████████████████████████████| 14.9k/14.9k [00:00<00:00, 206kB/s]
Downloading: 100%|████████████████████████████████████████████████████████| 342/342 [00:00<00:00, 351kB/s]
Downloading: 100%|████████████████████████████████████████████████████████| 543/543 [00:00<00:00, 206kB/s]
Downloading: 100%|███████████████████████████████████████████████████| 4.56k/4.56k [00:00<00:00, 3.82MB/s]
Downloading: 100%|███████████████████████████████████████████████████| 1.22G/1.22G [07:13<00:00, 2.81MB/s]
Downloading: 100%|████████████████████████████████████████████████████████| 209/209 [00:00<00:00, 426kB/s]
Downloading: 100%|████████████████████████████████████████████████████████| 592/592 [00:00<00:00, 328kB/s]
Downloading: 100%|█████████████████████████████████████████████████████| 492M/492M [00:06<00:00, 73.8MB/s]
Downloading: 100%|█████████████████████████████████████████████████████| 525k/525k [00:00<00:00, 1.32MB/s]
Downloading: 100%|████████████████████████████████████████████████████████| 472/472 [00:00<00:00, 475kB/s]
Downloading: 100%|████████████████████████████████████████████████████████| 806/806 [00:00<00:00, 792kB/s]
Downloading: 100%|███████████████████████████████████████████████████| 1.06M/1.06M [00:00<00:00, 2.18MB/s]
Downloading: 100%|████████████████████████████████████████████████████████| 743/743 [00:00<00:00, 687kB/s]
Downloading: 100%|███████████████████████████████████████████████████| 3.44G/3.44G [02:11<00:00, 26.2MB/s]
Downloading: 100%|████████████████████████████████████████████████████| 71.2k/71.2k [00:00<00:00, 443kB/s]
Downloading: 100%|████████████████████████████████████████████████████████| 522/522 [00:00<00:00, 209kB/s]
Downloading: 100%|█████████████████████████████████████████████████████| 335M/335M [00:04<00:00, 71.9MB/s]
Traceback (most recent call last):
  File "server.py", line 9, in <module>
    from main import inference, MP as model_parallel
  File "/app/main.py", line 55, in <module>
    n_procs, devices, model_parallel_assignment=model_ass, **kwargs
  File "/app/parallel.py", line 149, in from_pretrained
    assert d
AssertionError

@mchaker
Author

mchaker commented Sep 18, 2022

Also, is there a way to mount a cache folder for pip to download its packages to? Downloading 4GB+ every time I docker run is... slow 😅

@mchaker
Author

mchaker commented Sep 18, 2022

I also noticed that GPUs 0 and 1 are used (some conda python process) even though I specified GPUs 2, 3, 4, 5?

(note below: the /usr/bin/python3 processes on GPUs 0, 1, 6, 7 are expected from another application... and the 4.5-5GB python3 processes on GPUs 2, 3, 4, 5 are expected from another application.)

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11169      C   /usr/bin/python3                 6543MiB |
|    0   N/A  N/A     11170      C   /usr/bin/python3                 5245MiB |
|    0   N/A  N/A    458266      C   /opt/conda/bin/python3           2981MiB |
|    0   N/A  N/A    458278      C   /opt/conda/bin/python3           2981MiB |
|    1   N/A  N/A     11175      C   /usr/bin/python3                  897MiB |
|    1   N/A  N/A     11178      C   /usr/bin/python3                  897MiB |
|    1   N/A  N/A     11179      C   /usr/bin/python3                  897MiB |
|    1   N/A  N/A    458266      C   /opt/conda/bin/python3           2311MiB |
|    1   N/A  N/A    458278      C   /opt/conda/bin/python3           2311MiB |
|    2   N/A  N/A    403173      C   python3                          5105MiB |
|    2   N/A  N/A    458115      C   python3                           565MiB |
|    3   N/A  N/A    164703      C   python3                          5115MiB |
|    3   N/A  N/A    458115      C   python3                           565MiB |
|    4   N/A  N/A    164949      C   python3                          4827MiB |
|    4   N/A  N/A    458115      C   python3                           565MiB |
|    5   N/A  N/A    165170      C   python3                          4827MiB |
|    5   N/A  N/A    458115      C   python3                           565MiB |
|    6   N/A  N/A     11171      C   /usr/bin/python3                 1607MiB |
|    7   N/A  N/A     11172      C   /usr/bin/python3                 1615MiB |
+-----------------------------------------------------------------------------+

@mchaker
Author

mchaker commented Sep 19, 2022

More errors: tried on a pair of smaller cards:

To create a public link, set `share=True` in `launch()`.
0it [00:01, ?it/s]
Process Process-2:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/app/parallel.py", line 90, in cuda_inference_process
    images: List[Image.Image] = model(prompts, **kwargs)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 137, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/models/unet_2d_condition.py", line 143, in forward
    sample = self.conv_in(sample)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/utils.py", line 103, in forward
    y = self.layer(x.to(self.device), *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
0it [00:01, ?it/s]
Process Process-3:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/app/parallel.py", line 90, in cuda_inference_process
    images: List[Image.Image] = model(prompts, **kwargs)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 137, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/models/unet_2d_condition.py", line 143, in forward
    sample = self.conv_in(sample)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/utils.py", line 103, in forward
    y = self.layer(x.to(self.device), *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
0it [00:01, ?it/s]
Process Process-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/app/parallel.py", line 90, in cuda_inference_process
    images: List[Image.Image] = model(prompts, **kwargs)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 137, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/models/unet_2d_condition.py", line 143, in forward
    sample = self.conv_in(sample)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/utils.py", line 103, in forward
    y = self.layer(x.to(self.device), *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
0it [00:01, ?it/s]
Process Process-4:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/app/parallel.py", line 90, in cuda_inference_process
    images: List[Image.Image] = model(prompts, **kwargs)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 137, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/models/unet_2d_condition.py", line 143, in forward
    sample = self.conv_in(sample)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/utils.py", line 103, in forward
    y = self.layer(x.to(self.device), *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

@NickLucche
Owner

Thanks a lot for testing!
I'll re-open the issue until we fix this feature; I'll get a multi-GPU AWS instance so I can test that too.

Also, is there a way to mount a cache folder for pip to download its packages to? Downloading 4GB+ every time I docker run is... slow

Good point, I'll add a section in the Readme; it relates to #10

I also noticed that GPUs 0 and 1 are used (some conda python process) even though I specified GPUs 2, 3, 4, 5?

I'll look into that too, perhaps some leftover hanging processes..?

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

This looks like a driver error; we can open another issue for that, with info on the specs of the cards.

@NickLucche NickLucche reopened this Sep 19, 2022
@mchaker
Author

mchaker commented Sep 19, 2022

Thanks for the link! I'll add a volume for /root/.cache/.
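Probably something along these lines (untested; this assumes the container keeps its download cache under /root/.cache):

docker run --pull always --name stable-diffusion --gpus all -it -v $HOME/sd-cache:/root/.cache -e DEVICES=2,3,4,5 -e MODEL_PARALLEL=1 -e TOKEN=<YOUR_TOKEN> -p 7860:7860 nicklucche/stable-diffusion:multi-gpu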

Do you want me to open issues for "leftover hanging processes" and "Unable to find valid cuDNN algorithm"?

(perhaps a missing python dependency?)

@mchaker
Author

mchaker commented Sep 19, 2022

Also, tried running again - another error:
CUBLAS this time

I think missing dependencies?

Running on local URL:  http://localhost:7860/

To create a public link, set `share=True` in `launch()`.
51it [00:05,  8.74it/s]
Attempting to cast a BatchFeature to type None. This is not supported.
Process Process-4:
Process Process-3:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/app/parallel.py", line 90, in cuda_inference_process
    images: List[Image.Image] = model(prompts, **kwargs)["sample"]
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 82, in __call__
    text_embeddings = self.text_encoder(text_input.input_ids.to(self.device))[0]
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/parallel.py", line 90, in cuda_inference_process
    images: List[Image.Image] = model(prompts, **kwargs)["sample"]
  File "/app/utils.py", line 103, in forward
    y = self.layer(x.to(self.device), *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 82, in __call__
    text_embeddings = self.text_encoder(text_input.input_ids.to(self.device))[0]
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 734, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/utils.py", line 103, in forward
    y = self.layer(x.to(self.device), *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 655, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 734, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 582, in forward
    output_attentions=output_attentions,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 655, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 325, in forward
    output_attentions=output_attentions,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 582, in forward
    output_attentions=output_attentions,
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 210, in forward
    query_states = self.q_proj(hidden_states) * self.scale
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 325, in forward
    output_attentions=output_attentions,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 210, in forward
    query_states = self.q_proj(hidden_states) * self.scale
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

@NickLucche
Owner

I think missing dependencies?

Yeah, you're definitely missing some drivers for the cards you're trying to use. I suggest you first install cuDNN and run some example code on the new GPUs; this "hello world" container from nvidia may help with that:

docker run --rm --gpus <GPU_NUMBER_HERE> nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

@mchaker
Author

mchaker commented Sep 20, 2022

Looks fine to me:

The cards I'm trying to test are a 3070 and 3070 Ti

$ docker run --rm --gpus '"device=6,7"' nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
Unable to find image 'nvidia/cuda:11.0.3-base-ubuntu20.04' locally
11.0.3-base-ubuntu20.04: Pulling from nvidia/cuda
d7bfe07ed847: Already exists 
75eccf561042: Pull complete 
191419884744: Pull complete 
a17a942db7e1: Pull complete 
16156c70987f: Pull complete 
Digest: sha256:57455121f3393b7ed9e5a0bc2b046f57ee7187ea9ec562a7d17bf8c97174040d
Status: Downloaded newer image for nvidia/cuda:11.0.3-base-ubuntu20.04
Tue Sep 20 13:19:37 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:8B:00.0 Off |                  N/A |
|  0%   40C    P0    46W / 240W |      0MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:C1:00.0 Off |                  N/A |
| 45%   33C    P0    70W / 310W |      0MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@huotarih

huotarih commented Oct 1, 2022

Hey, just found this thread! Great looking stuff!
I put up a g4dn.12xlarge instance with 4 T4's, tried a command but ended up with AssertionError :/

[ec2-user@ip ~]$ docker run --name stable-diffusion --gpus all -it -e DEVICES=0,1,2,3 -e MODEL_PARALLEL=1 -e TOKEN=token -p 7860:7860 nicklucche/stable-diffusion:multi-gpu
Loading model..
Looking for a valid assignment in which to split model parts to device(s): [0, 1, 2, 3]
Free GPU memory (per device): [8665, 8665, 8665, 8665]
Search has found that 17 model(s) can be split over 4 device(s)!
Assignments: [{0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, ...] (17 identical entries)
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
(the two lines above repeat once per worker, 17 times in total)
Downloading: 100% 1.34k/1.34k [00:00<00:00, 739kB/s]
Downloading: 100% 12.5k/12.5k [00:00<00:00, 12.9MB/s]
Downloading: 100% 342/342 [00:00<00:00, 182kB/s]
Downloading: 100% 543/543 [00:00<00:00, 307kB/s]
Downloading: 100% 4.63k/4.63k [00:00<00:00, 2.48MB/s]
Downloading: 100% 608M/608M [00:07<00:00, 77.8MB/s]
Downloading: 100% 209/209 [00:00<00:00, 117kB/s]
Downloading: 100% 209/209 [00:00<00:00, 122kB/s]
Downloading: 100% 572/572 [00:00<00:00, 317kB/s]
Downloading: 100% 246M/246M [00:03<00:00, 72.5MB/s]
Downloading: 100% 525k/525k [00:00<00:00, 58.8MB/s]
Downloading: 100% 472/472 [00:00<00:00, 563kB/s]
Downloading: 100% 788/788 [00:00<00:00, 1.07MB/s]
Downloading: 100% 1.06M/1.06M [00:00<00:00, 62.3MB/s]
Downloading: 100% 772/772 [00:00<00:00, 1.07MB/s]
Downloading: 100% 1.72G/1.72G [00:22<00:00, 75.2MB/s]
Downloading: 100% 71.2k/71.2k [00:00<00:00, 37.7MB/s]
Downloading: 100% 550/550 [00:00<00:00, 300kB/s]
Downloading: 100% 167M/167M [00:02<00:00, 74.1MB/s]
Traceback (most recent call last):
  File "server.py", line 9, in <module>
    from main import inference, MP as model_parallel
  File "/app/main.py", line 55, in <module>
    n_procs, devices, model_parallel_assignment=model_ass, **kwargs
  File "/app/parallel.py", line 149, in from_pretrained
    assert d
AssertionError

It's loading a lot of models, 17 in fact. Might that be the culprit?

Anyways, if I can participate in testing or help in any way, I'm here to do so :)
Also wondering why it says only 8665MB of free memory when nvidia-smi told me I had 15360MiB per GPU free just before that.

@mchaker
Author

mchaker commented Oct 2, 2022

That makes 2 of us! Oh no :(

@NickLucche
Owner

Hi @huotarih , thanks a lot for reporting this bug!
I have some trouble developing on a multi-GPU system, as I'd also need to get something set up on the cloud, but I'll look into that asap!
Would you mind opening a separate issue for this bug?

I'll also ask you to re-test the fixed version, if that's ok with you :)

Also wondering why it says only 8665MB of free memory when nvidia-smi told me I had 15360MiB per GPU free just before that.

Good point: currently I only take 60% of the GPU's free memory to instantiate the model(s). That is because generating one or more images requires a substantial amount of free memory, which is only occupied when you actually send the input to the network. 60% is a conservative threshold, as the memory needed varies with the requested image output; I am still unsure how to properly explain that to the user.

@rbychn

rbychn commented May 24, 2023

Is there a way to get this working on Automatic1111? A single image generation job on multiple GPUs at once?

@NickLucche
Owner

Yes, but it needs to be a separate contribution to the Automatic1111 repo.

@MiTereKun

Recently, NVLink also appeared in our cloud; it doesn’t work out of the box
image
We are waiting for implementation

@NickLucche
Owner

I am closing this issue as it has been stale for a while. Inference on multiple GPUs is implemented here (readme); I will come back to supporting heterogeneous GPU setups when I have more time and resources (e.g. a test environment with different GPUs, for one). All PRs are welcome :)

@rbychn

rbychn commented May 4, 2024

What's the state of this in 2024? Any plans on getting this to work with ComfyUI?
