
[help-with-local-system]: Segmentation fault RX580 #12376

Closed
1 task done
faengal opened this issue Aug 6, 2023 · 16 comments
Labels
asking-for-help-with-local-system-issues This issue is asking for help related to local system; please offer assistance platform:amd Issues that apply to AMD manufactured cards

Comments

@faengal

faengal commented Aug 6, 2023

Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

What happened?

It's my second time installing SD for an RX 580. I've been following these instructions: #4870 (comment)

It worked flawlessly on my first system, which runs the Manjaro distro, but I can't get it to work on another machine with the same GPU running Arch. It throws a segmentation fault before the webui even finishes starting up.

Does anyone have any idea of what I could try, or what's causing this problem?

Steps to reproduce the problem

  1. Install Arch Linux through archinstall with the KDE desktop preset
  2. Follow this guide to install ROCm and SD: [Bug]: Segmentation fault when running ROCm with RX 580 8GB #4870 (comment)
  3. Run webui.sh and get a segfault at startup

What should have happened?

The webui starts up properly without a segmentation fault, just like on my Manjaro setup.

Version or Commit where the problem happens

68f336b

What Python version are you running on ?

Python 3.11.x (or above; not supported yet)

What platforms do you use to access the UI ?

Linux

What device are you running WebUI on?

AMD GPUs (RX 5000 below)

Cross attention optimization

Automatic

What browsers do you use to access the UI ?

Mozilla Firefox

Command Line Arguments

--medvram

List of extensions

None

Console logs

################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye)
################################################################

################################################################
Running on faengal user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
Create and activate python venv
################################################################

################################################################
Launching launch.py...
################################################################
Using TCMalloc: libtcmalloc_minimal.so.4
Python 3.11.3 (main, Jun  5 2023, 09:32:32) [GCC 13.1.1 20230429]
Version: v1.5.1
Commit hash: 68f336bd994bed5442ad95bad6b6ad5564a5409a
Launching Web UI with arguments:  --medvram
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
./webui.sh: line 254: 12172 Segmentation fault      (core dumped) "${python_cmd}" "${LAUNCH_SCRIPT}" "$@"

Additional information

Both systems are running it on Python 3.11.3.
Both systems seem unable to detect the GPU if I run it on Python 3.10.x.

Arch hardware: Xeon E5-2650 v3, RX 580 2048SP 8 GB VRAM, 32 GB RAM.
Manjaro hardware: Ryzen 3 1200, RX 580 2048SP 8 GB VRAM, 16 GB RAM.

This is my journalctl log:
https://justpaste.it/5cd49
coredump.txt

manjaro $pacman -Q | grep rocm:

python-pytorch-opt-rocm 2.0.1-1
rocm-clang-ocl 5.4.3-1
rocm-cmake 5.4.3-1
rocm-core 5.4.3-4
rocm-device-libs 5.4.3-1
rocm-hip-libraries 5.4.3-2
rocm-hip-runtime 5.4.3-2
rocm-hip-sdk 5.4.3-2
rocm-language-runtime 5.4.3-2
rocm-llvm 5.4.3-1
rocm-opencl-runtime 5.4.3-1
rocm-smi-lib 5.4.3-1
rocminfo 5.4.3-1

manjaro $pacman -Q | grep torchvision:
python-torchvision 0.15.1-2

arch $pacman -Q | grep rocm:

python-pytorch-opt-rocm 2.0.1-7
rocm-clang-ocl 5.6.0-1
rocm-cmake 5.6.0-1
rocm-core 5.6.0-1
rocm-device-libs 5.6.0-1
rocm-hip-libraries 5.6.0-1
rocm-hip-runtime 5.6.0-1
rocm-hip-sdk 5.6.0-1
rocm-language-runtime 5.6.0-1
rocm-llvm 5.6.0-1
rocm-opencl-runtime 5.6.0-1
rocm-smi-lib 5.6.0-1
rocminfo 5.6.0-1

arch $pacman -Q | grep torchvision:
python-torchvision 0.15.2-1

I tried matching the package versions, but that just produced a different error about the GPU not being detected.
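Since the two `pacman -Q` dumps above are plain, sorted name/version lines, the mismatches can be pulled out mechanically. A sketch, using a small subset of the listings above as stand-ins for the full dumps:

```shell
# Lines unique to either file are version mismatches between the two systems.
cat > /tmp/manjaro-pkgs.txt <<'EOF'
python-pytorch-opt-rocm 2.0.1-1
rocm-core 5.4.3-4
rocm-llvm 5.4.3-1
EOF
cat > /tmp/arch-pkgs.txt <<'EOF'
python-pytorch-opt-rocm 2.0.1-7
rocm-core 5.6.0-1
rocm-llvm 5.6.0-1
EOF
# comm -3 suppresses lines common to both files (inputs must be sorted)
comm -3 /tmp/manjaro-pkgs.txt /tmp/arch-pkgs.txt
```

Every line shows up in the output here, which matches the observation that the whole ROCm stack moved from 5.4.3 to 5.6.0 between the two installs.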

@faengal faengal added the bug-report Report of a bug, yet to be confirmed label Aug 6, 2023
@catboxanon catboxanon added asking-for-help-with-local-system-issues This issue is asking for help related to local system; please offer assistance platform:amd Issues that apply to AMD manufactured cards and removed bug-report Report of a bug, yet to be confirmed labels Aug 6, 2023
@viebrix

viebrix commented Aug 9, 2023

This error comes with ROCm 5.6; try using ROCm 5.5.0. I'm using Linux Mint 21.2, but I had the same error with an RX 580.
See also:
xuhuisheng/rocm-gfx803#27 (comment)

Edit: 5.5.3 also does not work. Only 5.5.0 did.

@faengal
Author

faengal commented Aug 10, 2023

This error comes with ROCm 5.6; try using ROCm 5.5.0. I'm using Linux Mint 21.2, but I had the same error with an RX 580. See also: xuhuisheng/rocm-gfx803#27 (comment)

So I guess I'll have to attempt downgrading it somehow, or switch to a Debian-based distro.
Thank you for the warning.

@faengal faengal closed this as completed Aug 10, 2023
@faengal faengal reopened this Aug 10, 2023
@faengal faengal closed this as not planned Aug 10, 2023
@weengjeowed

Following the same instructions as OP, downgrading python-pytorch-opt-rocm to any non-current version triggers the "Torch is not able to use GPU" runtime error instead of the segfault, which appears on the current version for me as well.

@faengal faengal reopened this Aug 11, 2023
@viebrix

viebrix commented Aug 11, 2023

I'll post my currently installed versions; maybe this helps further:

pip list | grep torch
open-clip-torch           2.20.0
pytorch-lightning         1.9.4
pytorch-triton-rocm       2.0.2
torch                     2.0.0
torchdiffeq               0.2.3
torchmetrics              1.0.1
torchsde                  0.2.5
torchvision               0.15.1

python version:

python3 --version
Python 3.10.12

rocm versions:

dpkg -l | grep rocm
ii  rocm-clang-ocl                             0.5.0.50500-63~22.04                       amd64        OpenCL compilation with clang compiler.
ii  rocm-cmake                                 0.8.1.50500-63~22.04                       amd64        rocm-cmake built using CMake
ii  rocm-core                                  5.5.0.50500-63~22.04                       amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-dbgapi                                0.70.0.50500-63~22.04                      amd64        Library to provide AMD GPU debugger API
ii  rocm-debug-agent                           2.0.3.50500-63~22.04                       amd64        Radeon Open Compute Debug Agent (ROCdebug-agent)
ii  rocm-developer-tools                       5.5.0.50500-63~22.04                       amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-device-libs                           1.0.0.50500-63~22.04                       amd64        Radeon Open Compute - device libraries
ii  rocm-gdb                                   12.1.50500-63~22.04                        amd64        ROCgdb
ii  rocm-hip-libraries                         5.5.0.50500-63~22.04                       amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-hip-runtime                           5.5.0.50500-63~22.04                       amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-hip-runtime-dev                       5.5.0.50500-63~22.04                       amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-hip-sdk                               5.5.0.50500-63~22.04                       amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-language-runtime                      5.5.0.50500-63~22.04                       amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-llvm                                  16.0.0.23144.50500-63~22.04                amd64        ROCm compiler
ii  rocm-ml-libraries                          5.5.0.50500-63~22.04                       amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-ml-sdk                                5.5.0.50500-63~22.04                       amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-ocl-icd                               2.0.0.50500-63~22.04                       amd64        opencl built using CMake
ii  rocm-opencl                                2.0.0.50500-63~22.04                       amd64        opencl built using CMake
ii  rocm-opencl-dev                            2.0.0.50500-63~22.04                       amd64        opencl built using CMake
ii  rocm-opencl-runtime                        5.5.0.50500-63~22.04                       amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-opencl-sdk                            5.5.0.50500-63~22.04                       amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-openmp-sdk                            5.5.0.50500-63~22.04                       amd64        Radeon Open Compute (ROCm) OpenMP Software development Kit.
ii  rocm-smi-lib                               5.0.0.50500-63~22.04                       amd64        AMD System Management libraries
ii  rocm-utils                                 5.5.0.50500-63~22.04                       amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocminfo                                   1.0.0.50500-63~22.04                       amd64        Radeon Open Compute (ROCm) Runtime rocminfo tool

Commandlineargs in webui-user.sh:
export COMMANDLINE_ARGS="--no-half-vae --disable-nan-check --opt-split-attention --medvram"

And you have to set these system variables (or set them every time before starting the webui):

sudo -i
# After sudo -i the shell runs as root, so the redirections below work;
# a plain `sudo echo ... >> /etc/environment` would not, because the
# redirection is performed by the unprivileged shell, not by sudo.
echo ROC_ENABLE_PRE_VEGA=1 >> /etc/environment
echo HSA_OVERRIDE_GFX_VERSION=8.0.3 >> /etc/environment
# Reboot after this

On Ubuntu / Linux Mint you also need:

sudo usermod -aG video $LOGNAME
sudo usermod -aG render $LOGNAME
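A missing override is a common cause of the gfx803 segfault, so it is worth sanity-checking that the variables are actually visible to the shell that launches webui.sh. A sketch (the values are exported inline here to stand in for what /etc/environment provides after a reboot):

```shell
# Stand-in for what /etc/environment provides after a reboot
export ROC_ENABLE_PRE_VEGA=1
export HSA_OVERRIDE_GFX_VERSION=8.0.3

# Any child process (webui.sh and the python it launches) inherits these
env | grep -E '^(ROC_ENABLE_PRE_VEGA|HSA_OVERRIDE_GFX_VERSION)='
```

If the grep prints nothing in the shell you launch from, the overrides never reach ROCm and the runtime falls back to rejecting the pre-Vega card.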

@faengal
Author

faengal commented Aug 11, 2023

@viebrix
The Ubuntu setup is quite different from Arch's. @weengjeowed When I previously tried downgrading my setup I may have missed a bunch of important packages, so I'm trying some things out and will report back whether I succeed or not.

@faengal
Author

faengal commented Aug 11, 2023

I tried downgrading more packages that didn't match my working Manjaro system, but it didn't work out. Still getting "Torch is not able to use GPU".
It's my first time trying to downgrade anything on Arch, so I probably did something wrong.

https://justpaste.it/9nbym

@faengal
Author

faengal commented Sep 2, 2023

I may have been missing some libraries for the amdgpu driver. I did a fresh manual Arch install and installed:
mesa lib32-mesa xf86-video-amdgpu vulkan-radeon lib32-vulkan-radeon libva-mesa-driver lib32-libva-mesa-driver mesa-vdpau lib32-mesa-vdpau

Now, instead of crashing before startup even finishes, the webui starts but then crashes right afterwards, before I can do anything.

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 8.7s (prepare environment: 2.1s, import torch: 3.0s, import gradio: 0.8s, setup paths: 0.7s, other imports: 0.6s, setup codeformer: 0.3s, load scripts: 0.4s, create ui: 0.5s, gradio launch: 0.4s).
Creating model from config: /home/faengal/programs/stable-diffusion-webui/configs/v1-inference.yaml
/usr/include/c++/13.1.1/bits/stl_vector.h:1125: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = hip::DeviceFunc*; _Alloc = std::allocator<hip::DeviceFunc*>; reference = hip::DeviceFunc*&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.
./webui.sh: line 255:  4501 Aborted                 (core dumped) "${python_cmd}" -u "${LAUNCH_SCRIPT}" "$@"

@t0mj

t0mj commented Sep 6, 2023

I'm having exactly the same error on Garuda (Arch/KDE).

It starts the webui and then crashes with
./webui.sh: line 255: 785702 Segmentation fault (core dumped) "${python_cmd}" -u "${LAUNCH_SCRIPT}" "$@"

I have an AMD 9700xtx

@viebrix

viebrix commented Sep 6, 2023

Do you also have a second GPU, maybe an integrated one?
Here is a solution:
#11712

Here is another source and solution for the segmentation error:
#6420

@t0mj

t0mj commented Sep 7, 2023

Do you also have a second GPU, maybe an integrated one? Here is a solution: #11712

Here is another source and solution for the segmentation error: #6420

I do have integrated graphics as well. I managed to get it working on CPU last night by downgrading to Python 3.10 with pyenv, forcing torch to rocm5.4.2, and using
TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.4.2' python launch.py --precision full --no-half --skip-torch-cuda-test

No luck getting my graphics card selected/working though :(

@Ultra119

Using ROCm version 5.5.0 fixed the segfault for me (RX 580):

  1. Re-initialize your venv
  2. Set: TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.5.0'
  3. Run: python3 launch.py --precision full --no-half --opt-sub-quad-attention --lowvram --disable-nan-check --skip-torch-cuda-test

In this case webui.sh does not need to be touched.
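For context on step 2: launch.py honors a TORCH_COMMAND environment variable and runs it in place of its default torch install command, which is how the rocm5.5.0 wheel index gets used instead of the default CUDA build. A sketch of wiring it up for one launch:

```shell
# TORCH_COMMAND overrides the pip command launch.py uses to install torch.
export TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.5.0'

# The actual launch (step 3 above) would follow in the same shell, e.g.:
#   python3 launch.py --precision full --no-half --opt-sub-quad-attention \
#     --lowvram --disable-nan-check --skip-torch-cuda-test
echo "torch will be installed with: $TORCH_COMMAND"
```

Because the override only matters at the moment launch.py (re)installs torch, re-initializing the venv first, as in step 1, is what forces the reinstall to actually happen.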

@riku-ati

riku-ati commented Sep 21, 2023

I was having the same issue where the UI would run and then crash.
After some digging I finally got it to work.

The commands that were useful:
source activate (this depends on your venv)
pip3 list | grep torch (or pip)

Pay attention to your installed ROCm version. The important part for me was "rocm5.4.2", even though I had tried to install the nightly torch build for 5.6; that install had reported that everything was fine.

Activate your venv again if you have to, then:
pip3 uninstall torch torchaudio torchvision

Once that completes:
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.6
This may change based on the version you want to run. 5.7 support isn't out yet.

You also have to ensure that you have 5.6 installed; remove any other version of rocm<x.x.x>:
sudo apt autoremove rocm-core
sudo apt install rocm-hip-sdk5.6.1

pip3 list | grep torch
torch 2.2.0.dev20230920+rocm5.6

Hopefully this helps you resolve your issue! :)

The overall point is: use the grep torch command to ensure the installed torch matches your ROCm version. A mismatch will cause issues and might not work (this applies even to ROCm 5.2).

Sorry if the post isn't written properly, posting for the first time.
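The version-matching point above can be checked mechanically: the ROCm release a torch wheel was built against is encoded in the local version suffix after the `+`, and should be a prefix of the installed ROCm version. A sketch with the version strings from this comment (the comparison itself is illustrative, not part of any tool):

```shell
# Example values from the comment above
torch_ver="2.2.0.dev20230920+rocm5.6"   # from: pip3 list | grep torch
rocm_ver="5.6.1"                        # from: rocm-hip-sdk5.6.1

# Strip everything up to and including "+rocm" to get the wheel's ROCm build
torch_rocm="${torch_ver##*+rocm}"       # -> 5.6
case "$rocm_ver" in
  "$torch_rocm"*) echo "ok: torch rocm$torch_rocm matches system $rocm_ver" ;;
  *)              echo "mismatch: torch rocm$torch_rocm vs system $rocm_ver" ;;
esac
```

A torch wheel tagged rocm5.4.2 on a system running ROCm 5.6 would fail this check, which is exactly the mismatch described above.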

@hkpD

hkpD commented Sep 22, 2023

(quoting riku-ati's comment above in full)

hey, thank you

@weengjeowed

For those still following this using Arch Linux like in the OP, the cause is this issue affecting python-pytorch-rocm and python-pytorch-opt-rocm.

@faengal
Author

faengal commented Nov 13, 2023

Seems like there's an update of python-pytorch-opt-rocm/python-pytorch-rocm to 2.0.1-11, but I haven't had a chance to check whether it's fixed yet.

@faengal
Author

faengal commented Nov 18, 2023

It's working properly with python-pytorch-opt-rocm 2.1.0-1 and python-torchvision 0.15.2-1.

@faengal faengal closed this as completed Nov 18, 2023