Multi-GPU is broken #53

Closed · Shakahs opened this issue Oct 6, 2022 · 6 comments

Shakahs commented Oct 6, 2022

2x3090 instance on Runpod, using the Runpod notebook on their Stable Diffusion image. Training on GPU 0 alone works fine, but I cannot train on GPUs 0 and 1 together, nor on GPU 1 by itself.

Here is what happens when I am training on GPU 0 and try to start a separate training run on GPU 1. It seems GPU 0 is hardcoded somewhere.

!python "main.py" \
 --base configs/stable-diffusion/v1-finetune_unfrozen.yaml \
 -t \
 --actual_resume "model.ckpt" \
 --reg_data_root "{reg_data_root}" \
 -n "{project_name}" \
 --gpus 1, \
 --data_root "/workspace/Dreambooth-Stable-Diffusion/MS" \
 --max_training_steps {max_training_steps} \
 --class_word "{class_word}" \
 --token "{token}" \
 --no-test

.....

Traceback (most recent call last):
  File "main.py", line 665, in <module>
    model = load_model_from_config(config, opt.actual_resume)
  File "main.py", line 42, in load_model_from_config
    model.cuda()
  File "/venv/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 138, in cuda
    return super().cuda(device=device)
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.70 GiB total capacity; 0 bytes already allocated; 13.56 MiB free; 0 bytes reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 883, in <module>
    if trainer.global_rank == 0:
NameError: name 'trainer' is not defined
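
The second traceback is a side effect of the first: the OOM fires before trainer is ever assigned, so the cleanup code that runs during exception handling trips over the unbound name. A minimal, self-contained sketch of that failure pattern (hypothetical names, not the repo's actual code):

def load_model():
    # stands in for load_model_from_config(), which raises the CUDA OOM
    raise RuntimeError("CUDA out of memory")

try:
    model = load_model()              # raises before `trainer` is bound
    trainer = object()                # never reached
finally:
    if trainer.global_rank == 0:      # NameError: name 'trainer' is not defined
        pass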

jimtalksdata commented Oct 16, 2022

Something like this would fix it, no? Pass in gpuinfo when the function is called from main.py.

def load_model_from_config(config, gpuinfo, ckpt, verbose=False):
    print(f"Loading model from {ckpt}")
    pl_sd = torch.load(ckpt, map_location="cpu")  # load the checkpoint on CPU first
    sd = pl_sd["state_dict"]
    config.model.params.ckpt_path = ckpt
    model = instantiate_from_config(config.model)
    m, u = model.load_state_dict(sd, strict=False)
    if len(m) > 0 and verbose:
        print("missing keys:")
        print(m)
    if len(u) > 0 and verbose:
        print("unexpected keys:")
        print(u)

    # gpuinfo arrives as the Lightning-style --gpus string (e.g. "1,"),
    # so strip the trailing comma to get the device index
    device = torch.device("cuda:" + str(gpuinfo).rstrip(",")) if torch.cuda.is_available() else torch.device("cpu")
    model.to(device)
    model.eval()
    return model

model.cuda() with no device argument always tries to use GPU 0 (the default current device).
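
The call site in main.py would then need to forward the parsed --gpus value. A one-line sketch, assuming opt.gpus still holds the Lightning-style "1," string from the command line:

# in main.py, replacing the old two-argument call
model = load_model_from_config(config, opt.gpus, opt.actual_resume)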

@swcrazyfan

Is multi-GPU supposed to be supported?


0xdevalias commented Nov 7, 2022

Potentially related (NameError: name 'trainer' is not defined):


I went to line 896 in main.py and changed "trainer" to "Trainer", and now it's working.

Originally posted by @Pegaxsus in #86 (comment)

@0xdevalias

Something like this would fix it, no? Pass in gpuinfo when the function is called from main.py.

The following seems to give a solution:

model.cuda() by default will send your model to the "current device", which can be set with torch.cuda.set_device(device).

An alternative way to send the model to a specific device is model.to(torch.device('cuda:0')).

This, of course, is subject to the device visibility specified in the environment variable CUDA_VISIBLE_DEVICES.
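
For illustration, a minimal sketch of device masking (the variable must be set before torch initializes CUDA):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only physical GPU 1

import torch
print(torch.cuda.device_count())    # 1 -- only the masked-in GPU is visible
print(torch.cuda.current_device())  # 0 -- physical GPU 1 now appears as cuda:0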

You can check GPU usage with nvidia-smi. Also, nvtop is very nice for this.

The standard way in PyTorch to train a model on multiple GPUs is nn.DataParallel, which copies the model to each GPU and, during training, splits the batch among them and combines the individual outputs.
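
A minimal sketch of that pattern, using a stand-in nn.Linear rather than the actual Stable Diffusion model:

import torch
import torch.nn as nn

model = nn.Linear(512, 512)            # stand-in for the real model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)     # replicate across all visible GPUs
model = model.cuda()                   # primary replica lives on cuda:0

x = torch.randn(8, 512).cuda()
y = model(x)                           # batch split across GPUs, outputs gathered on cuda:0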

@mprenditore

Following in the hope this gets supported :)

@djbielejeski (Collaborator)

No plans to support this; PRs are welcome, though, if you can figure it out.
