[Bug]: torch.cuda.OutOfMemoryError: HIP out of memory. When training embeddings #6460

Open
elen07zz opened this issue Jan 7, 2023 · 11 comments
Labels: bug-report (Report of a bug, yet to be confirmed)

@elen07zz commented Jan 7, 2023

Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

What happened?

I'm trying to train an embedding, but I'm getting this error.
I'm running the webui with these settings:
python3 launch.py --precision full --no-half --opt-split-attention

100%|█████████████████████████████████████████| 616/616 [01:20<00:00, 7.67it/s]
0%| | 0/3000 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/user/stable-diffusion-webui/modules/textual_inversion/textual_inversion.py", line 395, in train_embedding
scaler.scale(loss).backward()
File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/init.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/init.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 512.00 MiB (GPU 0; 9.98 GiB total capacity; 8.51 GiB already allocated; 742.00 MiB free; 9.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF

Steps to reproduce the problem

  1. I get this error when I run the webui with python3 launch.py --precision full --no-half --opt-split-attention.
  2. If I instead run it with python3 launch.py --precision full --no-half --opt-split-attention --medvram,
  3. I receive this error instead: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!

0%| | 0/3000 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/akairax/stable-diffusion-webui/modules/textual_inversion/textual_inversion.py", line 395, in train_embedding
scaler.scale(loss).backward()
File "/home/akairax/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/home/akairax/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/init.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument weight in method wrapper__native_layer_norm_backward)

What should have happened?

Training should just run without errors.

Commit where the problem happens

874b975

What platforms do you use to access the UI?

Linux

What browsers do you use to access the UI?

Mozilla Firefox

Command Line Arguments

python3 launch.py --precision full --no-half --opt-split-attention
python3 launch.py --precision full --no-half --opt-split-attention --medvram

Additional information, context and logs

Running Ubuntu 22.04.

@leohu1 commented Jan 7, 2023

I think this is because your GPU memory is too low.

@elen07zz (Author) commented Jan 8, 2023

I think this is because your GPU memory is too low.

What is the minimum I need, even with optimizations enabled?

@HiroseKoichi commented Jan 14, 2023

Try this:

For AMD
PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --precision full --no-half --opt-sub-quad-attention

For Nvidia
PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --xformers

In my experience, --opt-sub-quad-attention is the best VRAM optimization for AMD cards and --xformers is the best for Nvidia, so don't use --medvram or --lowvram unless neither of those works for you. Also, don't combine them (e.g. '--opt-sub-quad-attention --medvram' or '--xformers --lowvram'); in my testing that increased VRAM usage and made image generation slower. Only use one VRAM optimization at a time.

I'm also getting the 'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!' error, but it doesn't affect training in any way. It just means you won't be able to see the preview images being generated in the webui; you can still view them by going to /stable-diffusion-webui/textual_inversion/.
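
If you'd rather not type the environment variable every time, here is a minimal sketch of what it could look like in webui-user.sh (assuming the stock AUTOMATIC1111 launcher scripts, which read that file; adjust the flags and numbers for your card):

#!/bin/bash
# webui-user.sh -- sketch only, not tested on every setup.
# Allocator tuning from the commands above; use PYTORCH_CUDA_ALLOC_CONF instead on Nvidia.
export PYTORCH_HIP_ALLOC_CONF="garbage_collection_threshold:0.9,max_split_size_mb:512"
# Pick ONE attention optimization: --opt-sub-quad-attention (AMD) or --xformers (Nvidia).
export COMMANDLINE_ARGS="--precision full --no-half --opt-sub-quad-attention"

Then launch with ./webui.sh as usual.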

@4xxFallacy

For Nvidia PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --xformers

I'm having the same issue. Where would I set this?

@HiroseKoichi commented Jan 24, 2023

For Windows
Put --xformers into the command-line arguments section of your webui-user.bat, then open the webui directory in cmd and run this command:
PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 webui-user.bat

For Linux
Open a terminal in the Web-ui directory and run the command:
PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --xformers

That said, I recommend switching to a container. I started using one (with podman instead of docker) a little less than a week ago, and I no longer have this issue when training.

Also, I forgot to mention that you'll want to check 'enable cross attention optimizations when training' in the settings; this reduces VRAM usage during training by a lot.
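
To double-check that the allocator variable is actually being picked up, and to see how much VRAM is free, something like this from inside the activated venv should work (plain PyTorch calls, so treat it as a sketch rather than a webui feature):

python -c "import os, torch; print('alloc conf:', os.environ.get('PYTORCH_CUDA_ALLOC_CONF')); print('free/total MiB:', [v // 2**20 for v in torch.cuda.mem_get_info()])"

On AMD/ROCm check PYTORCH_HIP_ALLOC_CONF instead; torch.cuda.mem_get_info() should work on ROCm builds as well.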

@4xxFallacy

Thanks! I managed to add them manually, directly to webui.bat. I think (extreme emphasis on the think) that adding it there sets the PyTorch environment variable for the venv during its activation. Although I'm sure xformers is now doing its job and I'm able to train, I'm not sure setting the PyTorch variable the way I did actually works. Also, because I'm on Windows and nvidia-smi won't actually show VRAM usage for my 3080, I only know how well it's running when it dies and throws errors my way, which is not great.

I'd try the docker container to avoid issues, but I've fought with containers in the past over virtualization problems and such.

Thanks again!

@tnginako commented Jan 30, 2023

PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --precision full --no-half --opt-sub-quad-attention

Hi, I'm adding this just for future reference: I'm using a 6750 XT GPU and this solved my HIP out-of-memory problem when generating large images (1024x1536 via hires. fix; I added --opt-sub-quad-attention to the launch arguments). However, since this GPU is not officially supported, HSA_OVERRIDE_GFX_VERSION=10.3.0 should also be set in order to avoid a Segmentation fault (core dumped) error. (Just in case someone else gets the same error; I'm using Linux Mint.)

Taken from a rentry troubleshooting page.

Segmentation fault (core dumped) "${python_cmd}" launch.py

You tried to force an incompatible binary with your gpu via the HSA_OVERRIDE_GFX_VERSION environment variable. Unset it via set -e HSA_OVERRIDE_GFX_VERSION and retry the command.
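
Putting that together, a hedged example of the full launch line for an RDNA 2 card like the 6750 XT (drop the override if your GPU is natively supported):

HSA_OVERRIDE_GFX_VERSION=10.3.0 PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --precision full --no-half --opt-sub-quad-attention

Note that 'set -e HSA_OVERRIDE_GFX_VERSION' in the quote above is fish-shell syntax; in bash the equivalent is 'unset HSA_OVERRIDE_GFX_VERSION'.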

@MrLavender

Looking at your crash log, you have 10 GB of VRAM, so I'm guessing it's an RX 6700?

Try the new --upcast-sampling feature, which allows fp16 on AMD ROCm. Also use --opt-sub-quad-attention, because other cross-attention layer optimizations may cause problems with --upcast-sampling.

python3 launch.py --upcast-sampling --opt-sub-quad-attention

In Settings -> Training enable "Move VAE and CLIP to RAM when training if possible" and "Use cross attention optimizations while training".

If using a SD 2.x model enable Settings -> Stable Diffusion -> "Upcast cross attention layer to float32".

With the above setup I'm able to train embeddings on an RX 5500 XT 8GB (for SD 1.5 models anyway; I haven't tried any 2.x training).
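
If you still hit fragmentation errors with this setup, it should also be possible to combine it with the allocator tuning suggested earlier in the thread (I haven't tested this exact combination, so take it as a sketch):

PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python3 launch.py --upcast-sampling --opt-sub-quad-attention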

@mashiq3 commented Mar 10, 2023

Try this:

For AMD PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --precision full --no-half --opt-sub-quad-attention

For Nvidia PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --xformers

In my experience, --opt-sub-quad-attention is the best VRAM optimization for AMD cards and --xformers is the best for Nvidia, so don't use --medvram or --lowvram unless neither of those works for you. Also, don't combine them (e.g. '--opt-sub-quad-attention --medvram' or '--xformers --lowvram'); in my testing that increased VRAM usage and made image generation slower. Only use one VRAM optimization at a time.

I'm also getting the 'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!' error, but it doesn't affect training in any way. It just means you won't be able to see the preview images being generated in the webui; you can still view them by going to /stable-diffusion-webui/textual_inversion/.

I need help doing this. Can we do a screen share?

@Yama-K commented Jul 28, 2023

PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --precision full --no-half --opt-sub-quad-attention

That results in an unstable system for me; adding --opt-sub-quad-attention to the launch args by itself fixes the problem. Thank you.

@YabbaYabbaYabba commented Aug 21, 2023

Looking at your crash log, you have 10 GB of VRAM, so I'm guessing it's an RX 6700?

Try the new --upcast-sampling feature, which allows fp16 on AMD ROCm. Also use --opt-sub-quad-attention, because other cross-attention layer optimizations may cause problems with --upcast-sampling.

python3 launch.py --upcast-sampling --opt-sub-quad-attention

In Settings -> Training enable "Move VAE and CLIP to RAM when training if possible" and "Use cross attention optimizations while training".

If using a SD 2.x model enable Settings -> Stable Diffusion -> "Upcast cross attention layer to float32".

With the above setup I'm able to train embeddings on an RX 5500 XT 8GB (for SD 1.5 models anyway; I haven't tried any 2.x training).

Just wanted to say thank you so much! I was not able to run SDXL in A1111 on my AMD 6700 XT at all, but after your suggestion it's running fantastically: no more out-of-memory errors, and it's faster than before, now at 3.74 s/it. A game changer, at least for me.
