
[Bug]: Seg Fault with ROCM 7900 XT #14763

Open
curvedinf opened this issue Jan 25, 2024 · 7 comments
Labels
bug-report Report of a bug, yet to be confirmed

Comments

@curvedinf

Checklist

  • The issue exists after disabling all extensions
  • The issue exists on a clean installation of webui
  • The issue is caused by an extension, but I believe it is caused by a bug in the webui
  • The issue exists in the current version of the webui
  • The issue has not been reported before recently
  • The issue has been reported before but has not been fixed yet

What happened?

A fresh install on Ubuntu 22.04 goes well. However, when running webui, shortly after the HTTP server starts up (and launches a browser window that successfully loads from the server), the server crashes with the error below. I have tested this with ROCm 5.7, 6.0, and 6.0.1. I am running text-generation-webui successfully on the same ROCm device (so I don't think it's an overall system config issue), and the device is detected properly. I previously had a 6700 XT installed that ran stable-diffusion-webui well, but the new 7900 XT does not.

Steps to reproduce the problem

  1. Run ./webui.sh

What should have happened?

WebUI should start up normally and load a model.

What browsers do you use to access the UI ?

No response

Sysinfo

sysinfo-2024-01-25-23-06.json

Console logs

$ ./webui.sh 

################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye), Fedora 34+ and openSUSE Leap 15.4 or newer.
################################################################

################################################################
Running on xxx user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
Create and activate python venv
################################################################

################################################################
Launching launch.py...
################################################################
Using TCMalloc: libtcmalloc_minimal.so.4
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
Version: v1.7.0
Commit hash: cf2772fab0af5573da775e7437e6acdca424f26e
Launching Web UI with arguments: 
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
Style database not found: /home/chase/Projects/stable-diffusion-webui/styles.csv
Loading weights [aeb7e9e689] from /home/chase/Projects/stable-diffusion-webui/models/Stable-diffusion/juggernautXL_v8Rundiffusion.safetensors
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 6.0s (prepare environment: 1.7s, import torch: 1.4s, import gradio: 0.5s, setup paths: 1.1s, other imports: 0.3s, load scripts: 0.3s, create ui: 0.3s, gradio launch: 0.3s).
Creating model from config: /home/chase/Projects/stable-diffusion-webui/repositories/generative-models/configs/inference/sd_xl_base.yaml
Calculating sha256 for /home/chase/Projects/stable-diffusion-webui/models/Stable-diffusion/realvisxlV30Turbo_v30TurboBakedvae.safetensors: cfab6aec061f4905db12c40dc43534a26b84d0a5c0085c428729fe36e3dc056c
Loading weights [cfab6aec06] from /home/chase/Projects/stable-diffusion-webui/models/Stable-diffusion/realvisxlV30Turbo_v30TurboBakedvae.safetensors
Creating model from config: /home/chase/Projects/stable-diffusion-webui/repositories/generative-models/configs/inference/sd_xl_base.yaml
changing setting sd_model_checkpoint to realvisxlV30Turbo_v30TurboBakedvae.safetensors: RuntimeError
Traceback (most recent call last):
  File "/home/chase/Projects/stable-diffusion-webui/modules/options.py", line 146, in set
    option.onchange()
  File "/home/chase/Projects/stable-diffusion-webui/modules/call_queue.py", line 13, in f
    res = func(*args, **kwargs)
  File "/home/chase/Projects/stable-diffusion-webui/modules/initialize_util.py", line 174, in <lambda>
    shared.opts.onchange("sd_model_checkpoint", wrap_queued_call(lambda: sd_models.reload_model_weights()), call=False)
  File "/home/chase/Projects/stable-diffusion-webui/modules/sd_models.py", line 783, in reload_model_weights
    load_model(checkpoint_info, already_loaded_state_dict=state_dict)
  File "/home/chase/Projects/stable-diffusion-webui/modules/sd_models.py", line 658, in load_model
    load_model_weights(sd_model, checkpoint_info, state_dict, timer)
  File "/home/chase/Projects/stable-diffusion-webui/modules/sd_models.py", line 375, in load_model_weights
    model.load_state_dict(state_dict, strict=False)
  File "/home/chase/Projects/stable-diffusion-webui/modules/sd_disable_initialization.py", line 223, in <lambda>
    module_load_state_dict = self.replace(torch.nn.Module, 'load_state_dict', lambda *args, **kwargs: load_state_dict(module_load_state_dict, *args, **kwargs))
  File "/home/chase/Projects/stable-diffusion-webui/modules/sd_disable_initialization.py", line 219, in load_state_dict
    state_dict = {k: v.to(device="meta", dtype=v.dtype) for k, v in state_dict.items()}
  File "/home/chase/Projects/stable-diffusion-webui/modules/sd_disable_initialization.py", line 219, in <dictcomp>
    state_dict = {k: v.to(device="meta", dtype=v.dtype) for k, v in state_dict.items()}
RuntimeError: dictionary changed size during iteration

./webui.sh: line 256:  5100 Segmentation fault      (core dumped) "${python_cmd}" -u "${LAUNCH_SCRIPT}" "$@"

Additional information

No response

curvedinf added the bug-report label Jan 25, 2024
@ashirviskas

Your torch ROCm version seems to be quite old; try updating to at least 5.7.

I have a 7900 XTX, though I still haven't managed to get it working.

@DGdev91
Contributor

DGdev91 commented Feb 1, 2024

First of all, use the latest ROCm version. I suggest using the official amdgpu installer tool and following the official instructions: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html

Since you just changed your GPU, try deleting your venv folder so all the packages are downloaded again. Also, make sure there isn't any customization in webui-user.sh (for example, the HSA_OVERRIDE_GFX_VERSION flag; you don't need it anymore). A sketch of the installer steps follows below.
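
Roughly, once the amdgpu-install tool from the linked quick-start page is set up, it comes down to something like this (a sketch only; exact package names and repo URLs change per release, so follow the linked page for the current ones):

    # Install the ROCm runtime/HIP stack via AMD's installer tool.
    sudo amdgpu-install --usecase=rocm
    # Make sure your user can access the GPU device nodes, then log out and back in.
    sudo usermod -aG render,video "$USER"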

@curvedinf
Author

curvedinf commented Feb 10, 2024

I have solved this by updating the venv's torch and torchvision versions to the latest nightlies. I am also running the latest ROCm driver (6.0.2). There are multiple ways to update the versions, but the way I elected to do it was to edit my webui.sh script, replacing line 161 with the following:

        export TORCH_COMMAND="pip install torch==2.3.0.dev20240210+rocm6.0 torchvision==0.18.0.dev20240210+rocm6.0 --index-url https://download.pytorch.org/whl/nightly/rocm6.0"

Then I deleted the venv directory and ran webui.sh.
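
An equivalent approach that avoids editing webui.sh itself, sketched here on the assumption you're on the stock layout (where webui.sh sources webui-user.sh, which already ships a commented-out TORCH_COMMAND line), is to put the same override there so it survives updates to webui.sh:

    # webui-user.sh -- sketch; same pin as above, kept out of webui.sh
    export TORCH_COMMAND="pip install torch==2.3.0.dev20240210+rocm6.0 torchvision==0.18.0.dev20240210+rocm6.0 --index-url https://download.pytorch.org/whl/nightly/rocm6.0"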

Just to reiterate, I have stable-diffusion-webui working with my 7900 XT with little effort. The maintainers should be able to get this working by updating the install script only.

If new torch or ROCm versions become available, you can view the available torch versions on the torch pip index: https://download.pytorch.org/whl/nightly/rocm6.0

(You can also replace rocm6.0 in that URL with a newer or older ROCm version to match your driver.)
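
For example, one way to list what that index currently serves (pip's index subcommand is still marked experimental, so treat this as a sketch; --pre is needed because nightlies are pre-releases):

    pip index versions torch --index-url https://download.pytorch.org/whl/nightly/rocm6.0 --pre
    pip index versions torchvision --index-url https://download.pytorch.org/whl/nightly/rocm6.0 --pre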

@ronidee

ronidee commented Apr 10, 2024

I'm on RX 7800 XT, ROCm 6.0.2.60002-115~22.04, Ubuntu 23.10, torch 2.3.0.dev20240210+rocm6.0, torchvision 0.18.0.dev20240210+rocm6.0 and also got a seg fault. However, in my case the reason was that I set the gfx version environment variable to 10.3.0.

So using HSA_OVERRIDE_GFX_VERSION=11.0.0 instead got it working for me.
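
Concretely, assuming the stock setup where webui.sh sources webui-user.sh, the working override looks like this (sketch; exporting it inline before ./webui.sh works too):

    # webui-user.sh
    # RDNA3 cards (7900 XT/XTX = gfx1100, 7800 XT = gfx1101) are GFX 11.x parts;
    # forcing 10.3.0 makes the runtime load RDNA2 kernels and can segfault.
    export HSA_OVERRIDE_GFX_VERSION=11.0.0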

@DGdev91
Contributor

DGdev91 commented Apr 10, 2024

> I'm on RX 7800 XT, ROCm 6.0.2.60002-115~22.04, Ubuntu 23.10, torch 2.3.0.dev20240210+rocm6.0, torchvision 0.18.0.dev20240210+rocm6.0 and also got a seg fault. However, in my case the reason was that I set the gfx version environment variable to 10.3.0.
>
> So using HSA_OVERRIDE_GFX_VERSION=11.0.0 instead got it working for me.

For the 7900 XT and 7900 XTX, the HSA_OVERRIDE_GFX_VERSION flag isn't needed at all.
Not sure about other 7000-series GPUs; you can try removing it and see if it still works (a quick check is sketched below).
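
One way to check what your card actually reports, assuming the ROCm runtime (which ships rocminfo) is installed:

    # Prints the GFX target the runtime sees, e.g. gfx1100 for a 7900 XT/XTX.
    rocminfo | grep -i "gfx"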

@ronidee

ronidee commented Apr 12, 2024

Hey @DGdev91, thanks for your reply :-) I already tried that, as I read your previous comment as well, and it doesn't work. It causes the following error: RuntimeError: HIP error: invalid device function.

I only used 10.3.0 because it was often recommended and I didn't understand its meaning. As far as I understand now, 11.0.0 is the closest officially supported version to my card, right?

I still have a memory leak, after a couple of runs my 32GB RAM is full so I have to restart the program. But that's off-topic and I will search for related issues.

Update: I fixed the memory leak by omitting the --medvram flag. Now RAM stays the same and doesn't fill up over time.
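
For anyone else hitting this, the flag lives in COMMANDLINE_ARGS in the stock webui-user.sh (a sketch; your setup may pass it differently):

    # Before (RAM filled up over repeated runs on this setup):
    #export COMMANDLINE_ARGS="--medvram"
    # After: drop --medvram so weights aren't shuffled between RAM and VRAM.
    export COMMANDLINE_ARGS=""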

@DGdev91
Contributor

DGdev91 commented Apr 12, 2024

> Hey @DGdev91, thanks for your reply :-) I already tried that, as I read your previous comment as well, and it doesn't work. It causes the following error: RuntimeError: HIP error: invalid device function.
>
> I only used 10.3.0 because it was often recommended and I didn't understand its meaning. As far as I understand now, 11.0.0 is the closest officially supported version to my card, right?
>
> I still have a memory leak; after a couple of runs my 32GB RAM is full, so I have to restart the program. But that's off-topic and I will search for related issues.

Well, good to know then.
Yes, most likely in your case 11.0.0 is the closest supported version, so just keep it like that.

There's also a patch which was merged some weeks ago that should, in theory, make the default config build the Tensile libraries for many "not fully supported" archs, and should make that flag unnecessary in the next ROCm release.

But for now, just keep that.
