Multi-GPU is broken #53

Closed · Shakahs opened this issue Oct 6, 2022 · 6 comments

Shakahs commented Oct 6, 2022

2x3090 instance on Runpod, using the Runpod notebook on their Stable Diffusion image. Training on GPU 0 alone works fine, but I cannot train on GPUs 0 and 1 together, nor on GPU 1 by itself.

Here is what happens when I am training on GPU 0 and try to start a separate training run on GPU 1. It seems GPU 0 is hardcoded somewhere.

!python "main.py" \
 --base configs/stable-diffusion/v1-finetune_unfrozen.yaml \
 -t \
 --actual_resume "model.ckpt" \
 --reg_data_root "{reg_data_root}" \
 -n "{project_name}" \
 --gpus 1, \
 --data_root "/workspace/Dreambooth-Stable-Diffusion/MS" \
 --max_training_steps {max_training_steps} \
 --class_word "{class_word}" \
 --token "{token}" \
 --no-test

.....

Traceback (most recent call last):
  File "main.py", line 665, in <module>
    model = load_model_from_config(config, opt.actual_resume)
  File "main.py", line 42, in load_model_from_config
    model.cuda()
  File "/venv/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 138, in cuda
    return super().cuda(device=device)
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.70 GiB total capacity; 0 bytes already allocated; 13.56 MiB free; 0 bytes reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 883, in <module>
    if trainer.global_rank == 0:
NameError: name 'trainer' is not defined
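
The second traceback is a side effect of the first: the OOM fires before trainer is ever assigned, so the cleanup code that runs during exception handling trips over the unbound name. A minimal, self-contained sketch of that failure pattern (hypothetical names, not the repo's actual code):

def load_model():
    # stands in for load_model_from_config(), which raises the CUDA OOM
    raise RuntimeError("CUDA out of memory")

try:
    model = load_model()              # raises before `trainer` is bound
    trainer = object()                # never reached
finally:
    if trainer.global_rank == 0:      # NameError: name 'trainer' is not defined
        pass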

jimtalksdata commented Oct 16, 2022

Something like this would fix it, no? Pass in gpuinfo when the function is called from main.py.

def load_model_from_config(config, gpuinfo, ckpt, verbose=False):
    print(f"Loading model from {ckpt}")
    pl_sd = torch.load(ckpt, map_location="cpu")  # load the checkpoint on CPU first
    sd = pl_sd["state_dict"]
    config.model.params.ckpt_path = ckpt
    model = instantiate_from_config(config.model)
    m, u = model.load_state_dict(sd, strict=False)
    if len(m) > 0 and verbose:
        print("missing keys:")
        print(m)
    if len(u) > 0 and verbose:
        print("unexpected keys:")
        print(u)

    # gpuinfo arrives as the Lightning-style --gpus string (e.g. "1,"),
    # so strip the trailing comma to get the device index
    device = torch.device("cuda:" + str(gpuinfo).rstrip(",")) if torch.cuda.is_available() else torch.device("cpu")
    model.to(device)
    model.eval()
    return model

model.cuda() with no device argument always tries to use GPU 0 (the default current device).
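
The call site in main.py would then need to forward the parsed --gpus value. A one-line sketch, assuming opt.gpus still holds the Lightning-style "1," string from the command line:

# in main.py, replacing the old two-argument call
model = load_model_from_config(config, opt.gpus, opt.actual_resume)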

@swcrazyfan

Is multi-GPU supposed to be supported?


0xdevalias commented Nov 7, 2022

Potentially related (NameError: name 'trainer' is not defined):


I went to line 896 in main.py and changed "trainer" to "Trainer", and now it's working.

Originally posted by @Pegaxsus in #86 (comment)

@0xdevalias

Something like this would fix it, no? Pass in gpuinfo when the function is called from main.py.

The following seems to give a solution:

model.cuda() by default will send your model to the "current device", which can be set with torch.cuda.set_device(device).

An alternative way to send the model to a specific device is model.to(torch.device('cuda:0')).

This, of course, is subject to the device visibility specified in the environment variable CUDA_VISIBLE_DEVICES.
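
For illustration, a minimal sketch of device masking (the variable must be set before torch initializes CUDA):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only physical GPU 1

import torch
print(torch.cuda.device_count())    # 1 -- only the masked-in GPU is visible
print(torch.cuda.current_device())  # 0 -- physical GPU 1 now appears as cuda:0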

You can check GPU usage with nvidia-smi. Also, nvtop is very nice for this.

The standard way in PyTorch to train a model on multiple GPUs is nn.DataParallel, which copies the model to each GPU and, during training, splits the batch among them and combines the individual outputs.
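
A minimal sketch of that pattern, using a stand-in nn.Linear rather than the actual Stable Diffusion model:

import torch
import torch.nn as nn

model = nn.Linear(512, 512)            # stand-in for the real model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)     # replicate across all visible GPUs
model = model.cuda()                   # primary replica lives on cuda:0

x = torch.randn(8, 512).cuda()
y = model(x)                           # batch split across GPUs, outputs gathered on cuda:0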

@mprenditore

Following in the hope this gets supported :)

@djbielejeski (Collaborator)

No plans to support this; PRs are welcome, though, if you can figure it out.
