
[Bug]: Seg Fault with ROCM 7900 XT #14763

Open
curvedinf opened this issue Jan 25, 2024 · 7 comments
Labels
bug-report Report of a bug, yet to be confirmed

Comments

@curvedinf

Checklist

  • The issue exists after disabling all extensions
  • The issue exists on a clean installation of webui
  • The issue is caused by an extension, but I believe it is caused by a bug in the webui
  • The issue exists in the current version of the webui
  • The issue has not been reported before recently
  • The issue has been reported before but has not been fixed yet

What happened?

A fresh install on Ubuntu 22.04 goes well. However, when running webui, shortly after the HTTP server starts up (and launches a browser window that successfully loads from the server), the server crashes with the error below. I have tested this with ROCm 5.7, 6.0, and 6.0.1. I am running text-generation-webui successfully on the same ROCm device (so I don't think it's an overall system config issue), and the device is detected properly. I previously had a 6700 XT installed that ran stable-diffusion-webui well, but the new 7900 XT does not.

Steps to reproduce the problem

  1. Run ./webui.sh

What should have happened?

WebUI should start up normally and load a model.

What browsers do you use to access the UI ?

No response

Sysinfo

sysinfo-2024-01-25-23-06.json

Console logs

$ ./webui.sh 

################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye), Fedora 34+ and openSUSE Leap 15.4 or newer.
################################################################

################################################################
Running on xxx user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
Create and activate python venv
################################################################

################################################################
Launching launch.py...
################################################################
Using TCMalloc: libtcmalloc_minimal.so.4
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
Version: v1.7.0
Commit hash: cf2772fab0af5573da775e7437e6acdca424f26e
Launching Web UI with arguments: 
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
Style database not found: /home/chase/Projects/stable-diffusion-webui/styles.csv
Loading weights [aeb7e9e689] from /home/chase/Projects/stable-diffusion-webui/models/Stable-diffusion/juggernautXL_v8Rundiffusion.safetensors
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 6.0s (prepare environment: 1.7s, import torch: 1.4s, import gradio: 0.5s, setup paths: 1.1s, other imports: 0.3s, load scripts: 0.3s, create ui: 0.3s, gradio launch: 0.3s).
Creating model from config: /home/chase/Projects/stable-diffusion-webui/repositories/generative-models/configs/inference/sd_xl_base.yaml
Calculating sha256 for /home/chase/Projects/stable-diffusion-webui/models/Stable-diffusion/realvisxlV30Turbo_v30TurboBakedvae.safetensors: cfab6aec061f4905db12c40dc43534a26b84d0a5c0085c428729fe36e3dc056c
Loading weights [cfab6aec06] from /home/chase/Projects/stable-diffusion-webui/models/Stable-diffusion/realvisxlV30Turbo_v30TurboBakedvae.safetensors
Creating model from config: /home/chase/Projects/stable-diffusion-webui/repositories/generative-models/configs/inference/sd_xl_base.yaml
changing setting sd_model_checkpoint to realvisxlV30Turbo_v30TurboBakedvae.safetensors: RuntimeError
Traceback (most recent call last):
  File "/home/chase/Projects/stable-diffusion-webui/modules/options.py", line 146, in set
    option.onchange()
  File "/home/chase/Projects/stable-diffusion-webui/modules/call_queue.py", line 13, in f
    res = func(*args, **kwargs)
  File "/home/chase/Projects/stable-diffusion-webui/modules/initialize_util.py", line 174, in <lambda>
    shared.opts.onchange("sd_model_checkpoint", wrap_queued_call(lambda: sd_models.reload_model_weights()), call=False)
  File "/home/chase/Projects/stable-diffusion-webui/modules/sd_models.py", line 783, in reload_model_weights
    load_model(checkpoint_info, already_loaded_state_dict=state_dict)
  File "/home/chase/Projects/stable-diffusion-webui/modules/sd_models.py", line 658, in load_model
    load_model_weights(sd_model, checkpoint_info, state_dict, timer)
  File "/home/chase/Projects/stable-diffusion-webui/modules/sd_models.py", line 375, in load_model_weights
    model.load_state_dict(state_dict, strict=False)
  File "/home/chase/Projects/stable-diffusion-webui/modules/sd_disable_initialization.py", line 223, in <lambda>
    module_load_state_dict = self.replace(torch.nn.Module, 'load_state_dict', lambda *args, **kwargs: load_state_dict(module_load_state_dict, *args, **kwargs))
  File "/home/chase/Projects/stable-diffusion-webui/modules/sd_disable_initialization.py", line 219, in load_state_dict
    state_dict = {k: v.to(device="meta", dtype=v.dtype) for k, v in state_dict.items()}
  File "/home/chase/Projects/stable-diffusion-webui/modules/sd_disable_initialization.py", line 219, in <dictcomp>
    state_dict = {k: v.to(device="meta", dtype=v.dtype) for k, v in state_dict.items()}
RuntimeError: dictionary changed size during iteration

./webui.sh: line 256:  5100 Segmentation fault      (core dumped) "${python_cmd}" -u "${LAUNCH_SCRIPT}" "$@"

Additional information

No response

curvedinf added the bug-report label Jan 25, 2024
@ashirviskas

Your torch ROCm version seems to be quite old; try updating to at least 5.7.

I have a 7900 XTX, though I still haven't managed to get it working.

@DGdev91
Contributor

DGdev91 commented Feb 1, 2024

First of all, use the latest ROCm version. I suggest using the official amdgpu installer tool and following the official instructions: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html

Since you just changed your GPU, try deleting your venv folder so all the packages are downloaded again. Also, make sure there isn't any customization in webui-user.sh (for example, the HSA_OVERRIDE_GFX_VERSION flag; you don't need it anymore). A sketch of the installer steps follows below.
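
Roughly, once the amdgpu-install tool from the linked quick-start page is set up, it comes down to something like this (a sketch only; exact package names and repo URLs change per release, so follow the linked page for the current ones):

    # Install the ROCm runtime/HIP stack via AMD's installer tool.
    sudo amdgpu-install --usecase=rocm
    # Make sure your user can access the GPU device nodes, then log out and back in.
    sudo usermod -aG render,video "$USER"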

@curvedinf
Author

curvedinf commented Feb 10, 2024

I have solved this by updating the venv's torch and torchvision versions to the latest nightlies. I am also running the latest ROCm driver (6.0.2). There are multiple ways to update the versions, but the way I elected to do it was to edit my webui.sh script, replacing line 161 with the following:

        export TORCH_COMMAND="pip install torch==2.3.0.dev20240210+rocm6.0 torchvision==0.18.0.dev20240210+rocm6.0 --index-url https://download.pytorch.org/whl/nightly/rocm6.0"

Then I deleted the venv directory and ran webui.sh.
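
An equivalent approach that avoids editing webui.sh itself, sketched here on the assumption you're on the stock layout (where webui.sh sources webui-user.sh, which already ships a commented-out TORCH_COMMAND line), is to put the same override there so it survives updates to webui.sh:

    # webui-user.sh -- sketch; same pin as above, kept out of webui.sh
    export TORCH_COMMAND="pip install torch==2.3.0.dev20240210+rocm6.0 torchvision==0.18.0.dev20240210+rocm6.0 --index-url https://download.pytorch.org/whl/nightly/rocm6.0"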

Just to reiterate, I have stable-diffusion-webui working with my 7900 XT with little effort. The maintainers should be able to get this working by updating the install script only.

If new torch or ROCm versions become available, you can view the available torch versions on the torch pip index: https://download.pytorch.org/whl/nightly/rocm6.0

(You can also replace rocm6.0 in that URL with a newer or older ROCm version to match your driver.)
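
For example, one way to list what that index currently serves (pip's index subcommand is still marked experimental, so treat this as a sketch; --pre is needed because nightlies are pre-releases):

    pip index versions torch --index-url https://download.pytorch.org/whl/nightly/rocm6.0 --pre
    pip index versions torchvision --index-url https://download.pytorch.org/whl/nightly/rocm6.0 --pre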

@ronidee

ronidee commented Apr 10, 2024

I'm on RX 7800 XT, ROCm 6.0.2.60002-115~22.04, Ubuntu 23.10, torch 2.3.0.dev20240210+rocm6.0, torchvision 0.18.0.dev20240210+rocm6.0 and also got a seg fault. However, in my case the reason was that I set the gfx version environment variable to 10.3.0.

So using HSA_OVERRIDE_GFX_VERSION=11.0.0 instead got it working for me.
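
Concretely, assuming the stock setup where webui.sh sources webui-user.sh, the working override looks like this (sketch; exporting it inline before ./webui.sh works too):

    # webui-user.sh
    # RDNA3 cards (7900 XT/XTX = gfx1100, 7800 XT = gfx1101) are GFX 11.x parts;
    # forcing 10.3.0 makes the runtime load RDNA2 kernels and can segfault.
    export HSA_OVERRIDE_GFX_VERSION=11.0.0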

@DGdev91
Contributor

DGdev91 commented Apr 10, 2024

> I'm on RX 7800 XT, ROCm 6.0.2.60002-115~22.04, Ubuntu 23.10, torch 2.3.0.dev20240210+rocm6.0, torchvision 0.18.0.dev20240210+rocm6.0 and also got a seg fault. However, in my case the reason was that I set the gfx version environment variable to 10.3.0.
>
> So using HSA_OVERRIDE_GFX_VERSION=11.0.0 instead got it working for me.

For the 7900 XT and 7900 XTX, the HSA_OVERRIDE_GFX_VERSION flag isn't needed at all.
Not sure about other 7000-series GPUs; you can try removing it and see if it still works (a quick check is sketched below).
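
One way to check what your card actually reports, assuming the ROCm runtime (which ships rocminfo) is installed:

    # Prints the GFX target the runtime sees, e.g. gfx1100 for a 7900 XT/XTX.
    rocminfo | grep -i "gfx"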

@ronidee

ronidee commented Apr 12, 2024

Hey @DGdev91, thanks for your reply :-) I already tried that, as I read your previous comment as well, and it doesn't work. It causes the following error: RuntimeError: HIP error: invalid device function.

I only used 10.3.0 because it was often recommended and I didn't understand its meaning. As far as I understand now, 11.0.0 is the closest officially supported version to my card, right?

I still have a memory leak, after a couple of runs my 32GB RAM is full so I have to restart the program. But that's off-topic and I will search for related issues.

Update: I fixed the memory leak by omitting the --medvram flag. Now RAM stays the same and doesn't fill up over time.
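
For anyone else hitting this, the flag lives in COMMANDLINE_ARGS in the stock webui-user.sh (a sketch; your setup may pass it differently):

    # Before (RAM filled up over repeated runs on this setup):
    #export COMMANDLINE_ARGS="--medvram"
    # After: drop --medvram so weights aren't shuffled between RAM and VRAM.
    export COMMANDLINE_ARGS=""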

@DGdev91
Contributor

DGdev91 commented Apr 12, 2024

> Hey @DGdev91, thanks for your reply :-) I already tried that, as I read your previous comment as well, and it doesn't work. It causes the following error: RuntimeError: HIP error: invalid device function.
>
> I only used 10.3.0 because it was often recommended and I didn't understand its meaning. As far as I understand now, 11.0.0 is the closest officially supported version to my card, right?
>
> I still have a memory leak; after a couple of runs my 32GB RAM is full, so I have to restart the program. But that's off-topic and I will search for related issues.

Well, good to know then.
Yes, most likely in your case 11.0.0 is the closest supported version, so just keep it like that.

There's also a patch which was merged some weeks ago that should, in theory, make the default config build the Tensile libraries for many "not fully supported" archs, and should make that flag unnecessary in the next ROCm release.

But for now, just keep that.
