[Bug]: torch.cuda.OutOfMemoryError: HIP out of memory. When training embeddings #6460

Open
elen07zz opened this issue Jan 7, 2023 · 11 comments
Labels: bug-report (Report of a bug, yet to be confirmed)

@elen07zz commented Jan 7, 2023

Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

What happened?

I'm trying to train an embedding, but I'm getting this error.
I'm running the webui with these settings:
python3 launch.py --precision full --no-half --opt-split-attention

100%|█████████████████████████████████████████| 616/616 [01:20<00:00, 7.67it/s]
0%| | 0/3000 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/user/stable-diffusion-webui/modules/textual_inversion/textual_inversion.py", line 395, in train_embedding
scaler.scale(loss).backward()
File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/init.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/user/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/init.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 512.00 MiB (GPU 0; 9.98 GiB total capacity; 8.51 GiB already allocated; 742.00 MiB free; 9.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF

Steps to reproduce the problem

  1. I get this error when I run the webui with python3 launch.py --precision full --no-half --opt-split-attention.
  2. If I instead run it with python3 launch.py --precision full --no-half --opt-split-attention --medvram,
  3. I receive this error instead: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!

0%| | 0/3000 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/akairax/stable-diffusion-webui/modules/textual_inversion/textual_inversion.py", line 395, in train_embedding
scaler.scale(loss).backward()
File "/home/akairax/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/home/akairax/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/init.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument weight in method wrapper__native_layer_norm_backward)

What should have happened?

Training should just run without errors.

Commit where the problem happens

874b975

What platforms do you use to access the UI?

Linux

What browsers do you use to access the UI?

Mozilla Firefox

Command Line Arguments

python3 launch.py --precision full --no-half --opt-split-attention
python3 launch.py --precision full --no-half --opt-split-attention --medvram

Additional information, context and logs

Running Ubuntu 22.04.

@leohu1 commented Jan 7, 2023

I think this is because your GPU memory is too low.

@elen07zz (Author) commented Jan 8, 2023

I think this is because your GPU memory is too low.

What is the minimum I need, even with optimizations enabled?

@HiroseKoichi commented Jan 14, 2023

Try this:

For AMD
PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --precision full --no-half --opt-sub-quad-attention

For Nvidia
PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --xformers

In my experience, --opt-sub-quad-attention is the best VRAM optimization for AMD cards and --xformers is the best for Nvidia, so don't use --medvram or --lowvram unless neither of those works for you. Also, don't combine them (e.g. '--opt-sub-quad-attention --medvram' or '--xformers --lowvram'); in my testing that increased VRAM usage and made image generation slower. Only use one VRAM optimization at a time.

I'm also getting the 'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!' error, but it doesn't affect training in any way. It just means you won't be able to see the preview images being generated in the webui; you can still view them by going to /stable-diffusion-webui/textual_inversion/.
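
If you'd rather not type the environment variable every time, here is a minimal sketch of what it could look like in webui-user.sh (assuming the stock AUTOMATIC1111 launcher scripts, which read that file; adjust the flags and numbers for your card):

#!/bin/bash
# webui-user.sh -- sketch only, not tested on every setup.
# Allocator tuning from the commands above; use PYTORCH_CUDA_ALLOC_CONF instead on Nvidia.
export PYTORCH_HIP_ALLOC_CONF="garbage_collection_threshold:0.9,max_split_size_mb:512"
# Pick ONE attention optimization: --opt-sub-quad-attention (AMD) or --xformers (Nvidia).
export COMMANDLINE_ARGS="--precision full --no-half --opt-sub-quad-attention"

Then launch with ./webui.sh as usual.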

@4xxFallacy

For Nvidia PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --xformers

I'm having the same issue. Where would I set this?

@HiroseKoichi commented Jan 24, 2023

For Windows
Put --xformers into the command-line arguments section of your webui-user.bat, then open the webui directory in cmd and run this command:
PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 webui-user.bat

For Linux
Open a terminal in the Web-ui directory and run the command:
PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --xformers

That said, I recommend switching to a container. I started using one (with podman instead of docker) a little less than a week ago, and I no longer have this issue when training.

Also, I forgot to mention that you'll want to check 'enable cross attention optimizations when training' in the settings; this reduces VRAM usage during training by a lot.
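
To double-check that the allocator variable is actually being picked up, and to see how much VRAM is free, something like this from inside the activated venv should work (plain PyTorch calls, so treat it as a sketch rather than a webui feature):

python -c "import os, torch; print('alloc conf:', os.environ.get('PYTORCH_CUDA_ALLOC_CONF')); print('free/total MiB:', [v // 2**20 for v in torch.cuda.mem_get_info()])"

On AMD/ROCm check PYTORCH_HIP_ALLOC_CONF instead; torch.cuda.mem_get_info() should work on ROCm builds as well.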

@4xxFallacy

Thanks! I managed to add them manually, directly to webui.bat. I think (extreme emphasis on the think) that adding it there sets the PyTorch environment variable for the venv during its activation. Although I'm sure xformers is now doing its job and I'm able to train, I'm not sure setting the PyTorch variable the way I did actually works. Also, because I'm on Windows and nvidia-smi won't actually show VRAM usage for my 3080, I only know how well it's running when it dies and throws errors my way, which is not great.

I'd try the docker container to avoid issues, but I've fought with containers in the past over virtualization problems and such.

Thanks again!

@tnginako commented Jan 30, 2023

PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --precision full --no-half --opt-sub-quad-attention

Hi, I'm adding this just for future reference: I'm using a 6750 XT GPU and this solved my HIP out-of-memory problem when generating large images (1024x1536 via hires. fix; I added --opt-sub-quad-attention to the launch arguments). However, since this GPU is not officially supported, HSA_OVERRIDE_GFX_VERSION=10.3.0 should also be set in order to avoid a Segmentation fault (core dumped) error. (Just in case someone else gets the same error; I'm using Linux Mint.)

Taken from a rentry troubleshooting page.

Segmentation fault (core dumped) "${python_cmd}" launch.py

You tried to force an incompatible binary with your gpu via the HSA_OVERRIDE_GFX_VERSION environment variable. Unset it via set -e HSA_OVERRIDE_GFX_VERSION and retry the command.
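
Putting that together, a hedged example of the full launch line for an RDNA 2 card like the 6750 XT (drop the override if your GPU is natively supported):

HSA_OVERRIDE_GFX_VERSION=10.3.0 PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --precision full --no-half --opt-sub-quad-attention

Note that 'set -e HSA_OVERRIDE_GFX_VERSION' in the quote above is fish-shell syntax; in bash the equivalent is 'unset HSA_OVERRIDE_GFX_VERSION'.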

@MrLavender

Looking at your crash log, you have 10 GB of VRAM, so I'm guessing it's an RX 6700?

Try the new --upcast-sampling feature, which allows fp16 on AMD ROCm. Also use --opt-sub-quad-attention, because other cross-attention layer optimizations may cause problems with --upcast-sampling.

python3 launch.py --upcast-sampling --opt-sub-quad-attention

In Settings -> Training enable "Move VAE and CLIP to RAM when training if possible" and "Use cross attention optimizations while training".

If using a SD 2.x model enable Settings -> Stable Diffusion -> "Upcast cross attention layer to float32".

With the above setup I'm able to train embeddings on an RX 5500 XT 8GB (for SD 1.5 models anyway; I haven't tried any 2.x training).
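
If you still hit fragmentation errors with this setup, it should also be possible to combine it with the allocator tuning suggested earlier in the thread (I haven't tested this exact combination, so take it as a sketch):

PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python3 launch.py --upcast-sampling --opt-sub-quad-attention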

@mashiq3 commented Mar 10, 2023

Try this:

For AMD PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --precision full --no-half --opt-sub-quad-attention

For Nvidia PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --xformers

In my experience, --opt-sub-quad-attention is the best VRAM optimization for AMD cards and --xformers is the best for Nvidia, so don't use --medvram or --lowvram unless neither of those works for you. Also, don't combine them (e.g. '--opt-sub-quad-attention --medvram' or '--xformers --lowvram'); in my testing that increased VRAM usage and made image generation slower. Only use one VRAM optimization at a time.

I'm also getting the 'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!' error, but it doesn't affect training in any way. It just means you won't be able to see the preview images being generated in the webui; you can still view them by going to /stable-diffusion-webui/textual_inversion/.

I need help doing this. Can we do a screen share?

@Yama-K commented Jul 28, 2023

PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.9,max_split_size_mb:512 python launch.py --precision full --no-half --opt-sub-quad-attention

That results in an unstable system for me; adding --opt-sub-quad-attention to the launch args by itself fixes the problem. Thank you.

@YabbaYabbaYabba commented Aug 21, 2023

Looking at your crash log, you have 10 GB of VRAM, so I'm guessing it's an RX 6700?

Try the new --upcast-sampling feature, which allows fp16 on AMD ROCm. Also use --opt-sub-quad-attention, because other cross-attention layer optimizations may cause problems with --upcast-sampling.

python3 launch.py --upcast-sampling --opt-sub-quad-attention

In Settings -> Training enable "Move VAE and CLIP to RAM when training if possible" and "Use cross attention optimizations while training".

If using a SD 2.x model enable Settings -> Stable Diffusion -> "Upcast cross attention layer to float32".

With the above setup I'm able to train embeddings on an RX 5500 XT 8GB (for SD 1.5 models anyway; I haven't tried any 2.x training).

Just wanted to say thank you so much! I was not able to run SDXL in A1111 on my AMD 6700 XT at all, but after your suggestion it's running fantastically: no more out-of-memory errors, and it's faster than before, now at 3.74 s/it. A game changer, at least for me.
