Regression in rocm 5.3 and newer for gfx1010 #2527

DGdev91 opened this issue Oct 5, 2023 · 51 comments
@DGdev91

DGdev91 commented Oct 5, 2023

Since PyTorch 2 was officially released, I haven't been able to run it on my 5700 XT, while I was previously able to use PyTorch 1.13.1 just fine by setting "export HSA_OVERRIDE_GFX_VERSION=10.3.0".
Many people are reporting the same issue on the 5000 series, for example:
AUTOMATIC1111/stable-diffusion-webui#6420

--precision full and --no-half are also needed because the card apparently can't use fp16 on Linux/ROCm, as already reported in #1857

I also read about the PCIe atomics requirement, following this issue: pytorch/pytorch#103973
...but that doesn't seem to be my case. The command "grep flags /sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties" returns:

/sys/class/kfd/kfd/topology/nodes/0/io_links/0/properties:flags 3
/sys/class/kfd/kfd/topology/nodes/1/io_links/0/properties:flags 1

Also, I tried to compile PyTorch using the new "-mprintf-kind=buffered" flag, but it didn't change anything.

Finally, I recently found out that PyTorch 2 works just fine on gfx1010 if it's compiled with ROCm 5.2, as suggested here: pytorch/pytorch#106728

@langyuxf

langyuxf commented Oct 6, 2023

What's your motivation to use newer ROCm? Expect better performance?

@DGdev91
Author

DGdev91 commented Oct 6, 2023

What's your motivation to use newer ROCm? Expect better performance?

Well, for example, to be able to use the official PyTorch builds instead of old nightlies or compiling from source.

@kode54

kode54 commented Nov 6, 2023

Also, PyTorch deleted their rocm5.2 repo, so all that's available now is the broken 5.6.

Edit: Never mind, I missed that the relevant repositories are specific to Python 3.10.

@kmsedu

kmsedu commented Nov 16, 2023

@hongxiayang

Good to hear you made it work, though with the old version. First, I am not sure whether your problem is related to PCIe atomics or just to the gfx arch (1010 is not in the list of compiled targets for the recent wheels). I would hope the 5.7 wheels we build will work if the problem is related to atomics. If you don't have an atomics issue, then we should discuss it in a different issue.

This is not an atomics issue for gfx1010 users:

λ ~/ grep flags /sys/class/kfd/kfd/topology/nodes/*/io_links/0/properties
flags 1

For gfx1010 users, my understanding is that there is no official target. We have had to use the HSA_OVERRIDE_GFX_VERSION hack to allow ROCm to function at all. Ever since the release of ROCm 5.3, some change in the memory access code for the gfx1030 arch has prevented us from using this hack, due to out-of-bounds memory errors.

pytorch/pytorch#103973 caught my eye (and likely that of other users with similar consumer GPUs) because it's the first issue stating that functionality was lost beyond torch 1.13 + ROCm 5.2 that has been looked at thoroughly.

In addition, the (greatly appreciated) work of @jeffdaily was the first step we've seen with regards to bumping up the usable version of PyTorch for us.

I understand that the work required to isolate and undo whatever memory access changes took place between 5.2 and 5.3 is probably more than it's worth, considering AMD's stance on maintaining compatibility for older GPUs, as well as the risk of breaking the actually supported gfx1030 GPUs. Therefore we've been left to fend for ourselves a little.

That's how the issue got a little hijacked; apologies for the intrusion.

I would say that this issue would be the appropriate place for any continued conversation on the matter. 👍

@hongxiayang
Collaborator

ok, we will tackle this issue next @kmsedu @DGdev91

@hongxiayang
Collaborator

Have you tried to simulate gfx906, like:

export HSA_OVERRIDE_GFX_VERSION=9.0.6
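
For a quick check before running a full workload, a minimal snippet along these lines (a sketch, assuming a ROCm build of torch is installed) should show whether a trivial kernel even launches under the override:

import torch

# ROCm builds expose the GPU through the torch.cuda API.
print(torch.cuda.is_available())     # True if the runtime accepted the device/override
print(torch.cuda.get_arch_list())    # gfx targets the installed wheel was compiled for

x = torch.ones(1024, device="cuda")  # trivial elementwise kernel launch
print((x + x).sum().item())          # expect 2048.0 if the kernel actually ran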

@hongxiayang hongxiayang self-assigned this Nov 17, 2023
@kmsedu

kmsedu commented Nov 17, 2023

@hongxiayang

Have you tried to simulate gfx906, like:

export HSA_OVERRIDE_GFX_VERSION=9.0.6

Results below:

(57venv) λ ~/ai/stable-diffusion-webui/ master* python --version
Python 3.10.6

(57venv) λ ~/ai/stable-diffusion-webui/ master* pip list | grep rocm
torch                     2.2.0.dev20231114+rocm5.7
torchvision               0.17.0+rocm5.7

(57venv) λ ~/ai/stable-diffusion-webui/ master* HSA_OVERRIDE_GFX_VERSION=9.0.6 python mnist_main.py --dry-run
use_cuda=True arg no_cuda=False cuda available=True
[1]    141921 segmentation fault (core dumped)  HSA_OVERRIDE_GFX_VERSION=9.0.6 python mnist_main.py --dry-run

Stack trace from coredump here:

                Found module linux-vdso.so.1 with build-id: aa98f5cb7cb88a767d1a384eb7b00d363d9d711e
                Found module libfribidi.so with build-id: 6e075a666e1da8ffdb948d734e75d82b1b6dc0fb
                Found module librt.so.1 with build-id: 1e261495981090dca22c9006c3218baead278c7a
                Found module ld-linux-x86-64.so.2 with build-id: 9718d3757f00d2366056830aae09698dbd35e32c
                Found module libc.so.6 with build-id: a43bfc8428df6623cd498c9c0caeb91aec9be4f9
                Found module libm.so.6 with build-id: d2c7d1fdefc7a876b6017c090ccd55fb21e8d77f
                Found module libutil.so.1 with build-id: 24f02e478ddf82435d8c5e0d7eb96f8338f2670b
                Found module libdl.so.2 with build-id: 8ab13ce8a1e6a9b18a844da65688e882f3eb132d
                Found module libpthread.so.0 with build-id: 81f46d553e2f7c999e43c3eede73a822bc8d5d93
                Stack trace of thread 141921:
                #0  0x00007f4fb826614e n/a (/home/kms/ai/stable-diffusion-webui/57venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so + 0x26614e)

@hongxiayang
Collaborator

hongxiayang commented Nov 17, 2023

Thanks for trying. The next thing we can try is to build PyTorch with ROCm from source. Since you don't have the PCIe atomics issue, we will use the official PyTorch repository. Here are the instructions:

  1. In another terminal, start a new container of rocm/pytorch:latest-base with the parameters from the instructions above:
sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 8G -u root rocm/pytorch:latest-base

  2. Clone PyTorch inside your Docker container and build it:

git clone --recursive https://github.com/pytorch/pytorch.git
cd pytorch
python tools/amd_build/build_amd.py 
PYTORCH_ROCM_ARCH=gfx1010 python setup.py develop

  3. Run the test again inside the pytorch folder.

@kmsedu

kmsedu commented Nov 17, 2023

I have compiled torch 2.2.0a0+git6849d75 and torchvision 0.17.0a0+4433680.
Below is the output:

root@mainPC:/home/pytorch# pip list | grep torch
torch                    2.2.0a0+git6849d75 /opt/conda/envs/py_3.9/lib/python3.9/site-packages
torchvision              0.17.0a0+4433680   /home/vision

root@mainPC:/home/pytorch# python examples/mnist/main.py --dry-run
Segmentation fault (core dumped)

root@mainPC:/home/pytorch# python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_arch_list())"
True
['gfx1010']

If a stack trace could be of use, please let me know and I'll figure out how to set up coredumpctl in the docker container.
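
In the meantime, Python's built-in faulthandler at least shows the Python-side location of the crash without needing coredumpctl (it won't show the native HIP frames); a minimal sketch:

import faulthandler
faulthandler.enable()                          # dump the Python traceback on SIGSEGV

import torch
import torch.nn.functional as F

x = torch.randn(8, 1, 28, 28, device="cuda")   # MNIST-shaped dummy batch
w = torch.randn(32, 1, 3, 3, device="cuda")
y = F.conv2d(x, w)                             # convolutions go through MIOpen on ROCm
torch.cuda.synchronize()                       # force execution so the fault surfaces here
print(y.shape)

The same thing works for the full mnist example by running it with python -X faulthandler.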

@DGdev91
Author

DGdev91 commented Dec 9, 2023

Any news on this? Do you need more info?

@DGdev91
Author

DGdev91 commented Dec 18, 2023

I also tried with the new ROCm 6.0; it doesn't really seem to change much. It works fine with an old nightly build of PyTorch 2.0 compiled on ROCm 5.2, but crashes on the latest --pre PyTorch:
Memory access fault by GPU node-1 (Agent handle: 0x968d080) on address 0x7fa860641000. Reason: Page not present or supervisor privilege.
This happens both with HSA_OVERRIDE_GFX_VERSION=10.3.0 and without it, but the override is probably still needed, since the 5.2 build doesn't work without it.
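
A tiny script like the sketch below should be enough to check this kind of fault outside of the webui, nothing Stable Diffusion-specific. AMD_SERIALIZE_KERNEL and AMD_LOG_LEVEL are HIP debug variables (as far as I understand them) that make it easier to see which kernel actually faults:

import os
# Set the HIP debug knobs before torch loads the runtime.
os.environ.setdefault("AMD_SERIALIZE_KERNEL", "3")  # serialize kernel launches
os.environ.setdefault("AMD_LOG_LEVEL", "3")         # verbose per-call API logging

import torch

a = torch.randn(512, 512, device="cuda")
b = a @ a                    # GEMM is dispatched to rocBLAS, which is built per gfx arch
torch.cuda.synchronize()     # surface the asynchronous fault at this line
print(b.abs().mean().item())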

@theron29

theron29 commented Jan 11, 2024

Memory access fault by GPU node-1 (Agent handle: 0x968d080) on address 0x7fa860641000. Reason: Page not present or supervisor privilege.
This happens both with HSA_OVERRIDE_GFX_VERSION=10.3.0 and without it.

I made this work with HSA_OVERRIDE_GFX_VERSION=9.4.0 (and I had to find this out purely by trial & error....).
It is not fully stable, but it worked with SD within at least two separate system boots...

@DGdev91
Author

DGdev91 commented Jan 26, 2024

Memory access fault by GPU node-1 (Agent handle: 0x968d080) on address 0x7fa860641000. Reason: Page not present or supervisor privilege.
This happens both with HSA_OVERRIDE_GFX_VERSION=10.3.0 and without it.

I made this work with HSA_OVERRIDE_GFX_VERSION=9.4.0 (and I had to find this out purely by trial & error....). It is not fully stable, but it worked with SD within at least two separate system boots...

I changed the GPU in my machine and can't verify that. @kmsedu, can you?

Also, are you sure you were actually running the latest PyTorch version? In AUTOMATIC1111's webui there's a workaround that makes it default to an older PyTorch version compiled on ROCm 5.2 for older Navi cards like the RX 5700 XT (I know that well because... I wrote that workaround).

@theron29

Also, are you sure you were actually running the latest PyTorch version? In AUTOMATIC1111's webui there's a workaround that makes it default to an older PyTorch version compiled on ROCm 5.2 for older Navi cards like the RX 5700 XT (I know that well because... I wrote that workaround).

Hey there. No, I'm not running the latest version of PyTorch.
And yes, you are correct, I've run this with AUTOMATIC1111's webui with (your 🙏 ❤️) workaround in place. With HSA_OVERRIDE_GFX_VERSION=9.4.0 set, the setup automatically downgraded PyTorch to the version built with ROCm 5.2.

@hongxiayang
Collaborator

Since gfx1010 is not in the supported gfx target list (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-gpus), the latest versions may not work for your GPU.

@DGdev91
Author

DGdev91 commented Jan 28, 2024

Hey there. No, I'm not running the latest version of PyTorch. And yes, you are correct, I've run this with AUTOMATIC1111's webui with (your 🙏 ❤️) workaround in place. With HSA_OVERRIDE_GFX_VERSION=9.4.0 set, the setup automatically downgraded PyTorch to the version built with ROCm 5.2.

OK, then you have the same problem. We know that anything compiled with ROCm 5.2 or older works just fine on that card.
If you try to force a newer version in webui_user.sh, it's probably not going to work.
Also, HSA_OVERRIDE_GFX_VERSION=10.3.0 is usually what's used for the override.
AUTOMATIC1111's webui should force it automatically for older GPUs, so maybe you were actually using that.

Since gfx1010 is not in the supported gfx target list (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-gpus), the latest versions may not work for your GPU.

I know. But it's still weird that a GPU which worked perfectly fine with PyTorch compiled for an older ROCm version + HSA_OVERRIDE_GFX_VERSION=10.3.0 (even if not officially supported) suddenly stops working with everything compiled on something newer.
I also tried to pick up an old Docker image with ROCm 5.2, but it doesn't seem able to compile it.

This is also true for other software which relies on ROCm, like llama.cpp with hipBLAS support.

@cgmb
Collaborator

cgmb commented Jan 28, 2024

I know. But it's still weird that a GPU which worked perfectly fine with PyTorch compiled for an older ROCm version + HSA_OVERRIDE_GFX_VERSION=10.3.0 (even if not officially supported) suddenly stops working with everything compiled on something newer.

When you use HSA_OVERRIDE_GFX_VERSION=10.3.0, you are telling the ROCm runtime to pretend that your RDNA 1 GPU is an RDNA 2 GPU. The weird thing is that it ever worked, not that it stopped working.

This is also true for other software which relies on ROCm, like llama.cpp with hipBLAS support.

If you're on Debian 13 or Ubuntu 23.10 or later, use libhipblas-dev. The OS-provided packages have gfx1010 enabled.

@DGdev91
Author

DGdev91 commented Jan 28, 2024

If you're on Debian 13 or Ubuntu 23.10 or later, use libhipblas-dev. The OS-provided packages have gfx1010 enabled.

Isn't that just the same as installing the ROCm stack (or at least part of it)? It depends on rocBLAS, which depends on HIP, which depends on the HSA runtime, and so on.
Also, as I already wrote, the problem isn't the ROCm version installed, but the one used to compile the software. And compiling with the old version usually isn't straightforward or even possible.

@cgmb
Collaborator

cgmb commented Jan 29, 2024

Isn't that just the same as installing the ROCm stack (or at least part of it)? It depends on rocBLAS, which depends on HIP, which depends on the HSA runtime, and so on.

HIP works fine on gfx1010. It's mainly just that the math libraries in AMD's binary packages are (mostly) not built for that architecture.

When I packaged rocBLAS for Debian, I specified for it to be built for gfx1010. I also packaged the test suites for rocBLAS and hipBLAS, and ran them on both the RX 5700 XT, and Radeon W5700. All tests passed.

Of course, nobody has packaged MIOpen for Debian yet, so while the OS packages should be sufficient for llama-cpp, they are not sufficient yet for something like PyTorch.
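
A rough way to see which architectures a given rocBLAS installation was actually built for is to look at the per-architecture Tensile files it ships (a sketch; adjust the path to wherever your distribution installs the rocBLAS data files — AMD's packages use /opt/rocm):

import re
from pathlib import Path

# AMD's packages keep the Tensile data here; distro packages may use a different prefix.
lib_dir = Path("/opt/rocm/lib/rocblas/library")

archs = sorted({m.group(0) for p in lib_dir.iterdir()
                if (m := re.search(r"gfx[0-9a-f]+", p.name))})
print(archs)  # if gfx1010 is missing, rocBLAS likely has no tuned GEMM kernels for this GPU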

@DGdev91
Author

DGdev91 commented Jan 29, 2024

Isn't that just the same as installing the ROCm stack (or at least part of it)? It depends on rocBLAS, which depends on HIP, which depends on the HSA runtime, and so on.

HIP works fine on gfx1010. It's mainly just that the math libraries in AMD's binary packages are (mostly) not built for that architecture.

When I packaged rocBLAS for Debian, I specified for it to be built for gfx1010. I also packaged the test suites for rocBLAS and hipBLAS, and ran them on both the RX 5700 XT, and Radeon W5700. All tests passed.

Of course, nobody has packaged MIOpen for Debian yet, so while the OS packages should be sufficient for llama-cpp, they are not sufficient yet for something like PyTorch.

Last time I tried, I had a memory access error (just like with the newer PyTorch versions) when trying to load a model in llama.cpp with both hipBLAS and CLBlast offloading, while the latter worked fine on Windows. I had the same problem on both Arch Linux and Ubuntu.
But it could easily be a totally different issue.

@DGdev91
Author

DGdev91 commented Jan 29, 2024

Since gfx1010 is not in the supported gfx target list (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-gpus), the latest versions may not work for your GPU.

When you use HSA_OVERRIDE_GFX_VERSION=10.3.0, you are telling the ROCm runtime to pretend that your RDNA 1 GPU is an RDNA 2 GPU. The weird thing is that it ever worked, not that it stopped working.

Anyway, I'm perfectly aware of that. You are right, that card was never supposed to work on ROCm in the first place, and it works with some older PyTorch builds only thanks to a workaround.
That's also true for every consumer-grade AMD GPU other than the 7900 XT and 7900 XTX, which many users are still using thanks to the same workaround.

But after this reply on November 17th:

ok, we will tackle this issue next @kmsedu @DGdev91

I was expecting to see something on this matter anyway.

@cgmb
Collaborator

cgmb commented Jan 29, 2024

Last time I tried, I had a memory access error [...] when trying to load a model in llama.cpp with both hipBLAS and CLBlast offloading, while the latter worked fine on Windows. I had the same problem on both Arch Linux and Ubuntu.

To be clear, on Ubuntu were you using libhipblas-dev (which installs to /usr/lib/x86_64-linux-gnu) or were you using hipblas-dev (which installs to /opt/rocm/lib)? If you were using libhipblas-dev, I'm very interested in learning more. Could you provide some instructions on how to reproduce the problem?

That's also true for every consumer-grade AMD GPU other than the 7900 XT and 7900 XTX, which many users are still using thanks to the same workaround.

Using HSA_OVERRIDE_GFX_VERSION=10.3.0 on RDNA 2 GPUs is fundamentally different from using it on RDNA 1 GPUs. All RDNA 2 GPUs use the exact same instructions, but there's a bunch of differences between the instructions used on RDNA 1 and RDNA 2 GPUs. The only way to undo this 'regression' with HSA_OVERRIDE_GFX_VERSION would be to change LLVM so that the compiler only uses instructions available on RDNA 1, even when asked to compile for RDNA 2. That's not going to happen.

A better path to getting gfx1010 enabled in PyTorch would be to build the ROCm math and AI libraries for gfx1010 (or gfx10.1-generic). That is probably not going to happen in AMD's official packages, but there are other groups building and distributing ROCm packages. I can't speak for other distributions, but I expect to have it enabled later this year on Debian. With that said, my work with Debian is strictly volunteer work (on top of my full-time job), so don't expect it to happen quickly.

@DGdev91
Author

DGdev91 commented Jan 29, 2024

Last time I tried, I had a memory access error [...] when trying to load a model in llama.cpp with both hipBLAS and CLBlast offloading, while the latter worked fine on Windows. I had the same problem on both Arch Linux and Ubuntu.

To be clear, on Ubuntu were you using libhipblas-dev (which installs to /usr/lib/x86_64-linux-gnu) or were you using hipblas-dev (which installs to /opt/rocm/lib)? If you were using libhipblas-dev, I'm very interested in learning more. Could you provide some instructions on how to reproduce the problem?

That's also true for every consumer-grade AMD GPU other than the 7900 XT and 7900 XTX, which many users are still using thanks to the same workaround.

Using HSA_OVERRIDE_GFX_VERSION=10.3.0 on RDNA 2 GPUs is fundamentally different from using it on RDNA 1 GPUs. All RDNA 2 GPUs use the exact same instructions, but there's a bunch of differences between the instructions used on RDNA 1 and RDNA 2 GPUs. The only way to undo this 'regression' with HSA_OVERRIDE_GFX_VERSION would be to change LLVM so that the compiler only uses instructions available on RDNA 1, even when asked to compile for RDNA 2. That's not going to happen.

A better path to getting gfx1010 enabled in PyTorch would be to build the ROCm math and AI libraries for gfx1010 (or gfx10.1-generic). That is probably not going to happen in AMD's official packages, but there are other groups building and distributing ROCm packages. I can't speak for other distributions, but I expect to have it enabled later this year on Debian. With that said, my work with Debian is strictly volunteer work (on top of my full-time job), so don't expect it to happen quickly.

OK, now it's clearer.
I can confirm I used hipblas-dev.

I was also thinking that the HSA override flag was needed for rocBLAS too, because I couldn't use it natively on gfx1010, since the libraries for 1010 were missing from the official packages.

I also just found this PR, merged just 5 days ago, which makes life a bit simpler for compiling the Tensile libraries for 1010:
ROCm/Tensile#1862

@Zakhrov

Zakhrov commented Apr 17, 2024

Now that ROCm 6.1 is out, I tried it with the latest PyTorch nightly (which is still built with ROCm 6.0), and this is the error I get when trying to run ComfyUI:

:3:hip_platform.cpp         :211 : 14629861941 us: [pid:31113 tid:0x7fa5a37fe700] __hipPopCallConfiguration: Returned hipSuccess : 
:3:hip_module.cpp           :668 : 14629861947 us: [pid:31113 tid:0x7fa5a37fe700]  hipLaunchKernel ( 0x7fa9706d5550, {57,1,1}, {256,1,1}, 0x7fa5a37fc9a0, 0, stream:<null> ) 
:3:hip_module.cpp           :669 : 14629861952 us: [pid:31113 tid:0x7fa5a37fe700] hipLaunchKernel: Returned hipErrorInvalidDeviceFunction : 
:3:hip_error.cpp            :35  : 14629861955 us: [pid:31113 tid:0x7fa5a37fe700]  hipGetLastError (  ) 
:3:hip_error.cpp            :35  : 14629861957 us: [pid:31113 tid:0x7fa5a37fe700]  hipGetLastError (  ) 
:3:hip_device_runtime.cpp   :652 : 14629866892 us: [pid:31113 tid:0x7fa5a37fe700]  hipSetDevice ( 0 ) 
:3:hip_device_runtime.cpp   :656 : 14629866904 us: [pid:31113 tid:0x7fa5a37fe700] hipSetDevice: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :652 : 14629866934 us: [pid:31113 tid:0x7fa5a37fe700]  hipSetDevice ( 0 ) 
:3:hip_device_runtime.cpp   :656 : 14629866937 us: [pid:31113 tid:0x7fa5a37fe700] hipSetDevice: Returned hipSuccess : 
!!! Exception during processing !!!
Traceback (most recent call last):
  File "/home/aaron/Projects/personal/ComfyUI/execution.py", line 151, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaron/Projects/personal/ComfyUI/execution.py", line 81, in get_output_data
    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaron/Projects/personal/ComfyUI/execution.py", line 74, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaron/Projects/personal/ComfyUI/nodes.py", line 1378, in sample
    return common_ksampler(model, noise_seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise, disable_noise=disable_noise, start_step=start_at_step, last_step=end_at_step, force_full_denoise=force_full_denoise)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaron/Projects/personal/ComfyUI/nodes.py", line 1314, in common_ksampler
    samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaron/Projects/personal/ComfyUI/comfy/sample.py", line 37, in sample
    samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaron/Projects/personal/ComfyUI/comfy/samplers.py", line 755, in sample
    return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaron/Projects/personal/ComfyUI/comfy/samplers.py", line 657, in sample
    return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaron/Projects/personal/ComfyUI/comfy/samplers.py", line 644, in sample
    output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaron/Projects/personal/ComfyUI/comfy/samplers.py", line 616, in inner_sample
    if latent_image is not None and torch.count_nonzero(latent_image) > 0: #Don't shift the empty latent image.
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: HIP error: invalid device function
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.


Prompt executed in 6.49 seconds

Setting HSA_OVERRIDE_GFX_VERSION to 10.3.0 still doesn't work; it just maxes out the GPU clock and graphics pipeline but doesn't actually do anything.

@Zakhrov

Zakhrov commented Apr 24, 2024

PyTorch wasn't building because of ROCm/aotriton#18, so I hacked it out and made PyTorch compile without aotriton. This actually works, and I'm able to run ComfyUI!

Notes:

  • rocBLAS must be built with gfx1010 Tensile libraries and installed into ROCM_PATH; the SLES package does not include them
  • Because of my dirty hacks, PyTorch does not have flash or memory-efficient attention, so --use-pytorch-cross-attention can lead to HIP out-of-memory errors; sub-quadratic attention seems to work better
  • Forcing fp16 works and gives proper images
  • Forcing fp32 also works
  • You don't need HSA_OVERRIDE_GFX_VERSION anymore; it works with plain python main.py
  • Performance is slower than PyTorch built against ROCm 5.2 (although this could be because of thermal throttling) but seems to be much more stable (no random GPU resets so far, despite both the CPU and GPU running near the thermal throttle limit)

I'm trying again with aotriton patched with the fix @xinyazhang suggested to see if that helps with easier/more straightforward building and possibly with performance.

@Zakhrov

Zakhrov commented Apr 24, 2024

OK after a lot of trial and error I've managed to get a consistent set of steps.

Steps:

  • Build and install rocBLAS so the Tensile library is built for gfx1010 (you can skip this if your distribution already builds it; the official SLES package doesn't). Use cmake -B build -S . -DCMAKE_INSTALL_PREFIX=/opt/rocm-6.1.0 -DAMDGPU_TARGETS="gfx1010"
  • Patch the comgr library with the code from here: GZGavinZhao/ROCm-CompilerSupport@3419d51
  • Build comgr and install it to ROCM_PATH
  • Build and install rocSPARSE for the gfx1010 target
  • Start the build process for pytorch from source against your system rocm. I used ROCM_PATH=/opt/rocm USE_ROCM=1 PYTORCH_ROCM_ARCH=gfx1010 python3 setup.py develop
  • Abort the PyTorch build process once it completes configuring aotriton, and then patch aotriton with this: ROCm/aotriton@0873896. Changing the commit hash in pytorch/External/aotriton.cmake didn't work for me, so I manually edited the file
  • Start the build process again. This can take an insanely long time, thanks to aotriton building Triton from source and compiling a whole bunch of HSA code objects for the MI200 and MI300X. Disabling USE_FLASH_ATTENTION will still cause it to be built, so you might as well keep it enabled. Stopping and restarting the build process will cause all of those HSA code objects / HIP kernels to recompile, so be prepared for PyTorch to take over an hour to build
  • Once pytorch has been successfully built and installed, build and install torchvision from source. This won't take as long, but it will need to be done every time you update the pytorch build
  • With PyTorch and torchvision built and installed, you can run ComfyUI or any other PyTorch workload without needing the HSA_OVERRIDE_GFX_VERSION variable. Also, gfx1010 will now work in both fp16 and fp32 modes, and it should be a little more stable, without random SDMA queue timeouts and GPU resets. A quick post-build sanity check is sketched right after this list
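
A minimal post-build check (just a sketch) that confirms the locally built wheel is the one in use and exercises a small forward/backward pass in both fp16 and fp32:

import torch
import torch.nn as nn

print(torch.__version__, torch.version.hip)   # confirm the locally built ROCm wheel is active
print(torch.cuda.get_arch_list())             # should list gfx1010

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).cuda()

for dtype in (torch.float16, torch.float32):  # the fp16 and fp32 modes mentioned above
    x = torch.randn(32, 128, device="cuda", dtype=dtype)
    loss = model.to(dtype)(x).float().sum()
    loss.backward()                           # the backward pass exercises additional kernels
    torch.cuda.synchronize()
    print(dtype, loss.item())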

Caveats:

  • PyTorch's SDP backend for cross-attention does not work and fails with the following error (a sketch for checking the SDP backends at runtime follows this list):
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 14.00 MiB. GPU 0 has a total capacity of 5.98 GiB of which 6.00 MiB is free. Of the allocated memory 5.63 GiB is allocated by PyTorch, and 190.05 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

  • The expandable memory segments setting that PyTorch suggests also does not work and fails with this error:
UserWarning: expandable_segments not supported on this platform (Triggered internally at ../c10/hip/HIPAllocatorConfig.h:29.)

  • These steps are not something you can easily add to an automated build system or script; as of the time of writing, they require you to manually edit files in the PyTorch build directory after a build has partially completed. Furthermore, this only targets gfx1010
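
For the first caveat, a small check like this (just a sketch) shows which SDP backends the build enables and forces the plain math path, which helps separate a backend-specific failure from plain memory pressure:

import torch
import torch.nn.functional as F

# Which fused SDP backends this build claims to support:
print("flash:", torch.backends.cuda.flash_sdp_enabled())
print("mem_efficient:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math:", torch.backends.cuda.math_sdp_enabled())

# Force the plain math backend and retry a small attention call.
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)

q = torch.randn(1, 8, 64, 64, device="cuda", dtype=torch.float16)
out = F.scaled_dot_product_attention(q, q, q)
torch.cuda.synchronize()
print(out.shape)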

The key takeaway from this exercise is that I would probably have been better off had I not nuked my Windows install and just used DirectML. Needless to say, it was both educational and frustrating. Anyway, hopefully this helps others until AMD releases ROCm 6.2 with the fallback libraries in the official rocBLAS packages and the Docker images used to build PyTorch, and PyTorch themselves build wheels against those.

@daniandtheweb

daniandtheweb commented Apr 24, 2024

That's amazing, could you share the pytorch files you've built? I'm trying to build on Debian Unstable, which is compatible with gfx1010 by default using the rocm libraries, but I'm unable to build correctly (I've probably messed something up in the setup process of the pytorch packages).

@Zakhrov

Zakhrov commented Apr 24, 2024

That's amazing, could you share the pytorch files you've built? I'm trying to build on Debian Unstable, which is compatible with gfx1010 by default using the rocm libraries, but I'm unable to build correctly (I've probably messed something up in the setup process of the pytorch packages).

I'll try to build a wheel from my pytorch setup

@Zakhrov

Zakhrov commented Apr 24, 2024

@daniandtheweb here you go: https://drive.google.com/file/d/1Y2kQ3bnoihs892tHOpXHkvfMQJH_gYa9/view?usp=drive_link
I'm not entirely sure if it will work for you. See my updated caveats

@Zakhrov

Zakhrov commented Apr 24, 2024

OK, more instability.
Trying to run ComfyUI with the dpmpp_2m_sde_gpu sampler and the karras scheduler triggers this error:

:0:rocdevice.cpp            :2881: 33809159190 us: [pid:7301  tid:0x7fb3acfff700] Callback: Queue 0x7fb168400000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29

I suppose I'll have to build torchsde manually for that to work

@Zakhrov

Zakhrov commented Apr 24, 2024

OK, I can confirm that the _gpu and sde samplers do not work with these workarounds. Perhaps there is more stuff I'm overlooking.

@daniandtheweb

@Zakhrov try taking a look at this: https://lists.debian.org/debian-ai/2024/02/msg00164.html

@hongxiayang
Collaborator

As you might be aware from this documentation (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-gpus), gfx1010 is not among the supported gfx architectures, and therefore the behavior is undefined. You can close this issue.

@Zakhrov

Zakhrov commented Apr 24, 2024

As you might be aware from this documentation (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-gpus), gfx1010 is not among the supported gfx architectures, and therefore the behavior is undefined. You can close this issue.

I'm well aware of the supported gfx architectures. What we are following is the advice given here: #1735 (comment)

@Zakhrov

Zakhrov commented Apr 25, 2024

Update: building rocSPARSE and comgr from source with @GZGavinZhao's patch available here: GZGavinZhao/ROCm-CompilerSupport@3419d51 got the SDE ksamplers to work in ComfyUI. Performance is still slower than with ROCm 5.2 (probably because of missing MIOpen and composable_kernel).

@GZGavinZhao

GZGavinZhao commented Apr 25, 2024

Who pinged me

Ok in all seriousness, this issue should be resolved if ROCm/Tensile#1897 is cherry-picked into a release. I'll open an issue there (ROCm/Tensile#1916). @Zakhrov if you really want to test gfx1010 support, you can try on Solus (the distro that I'm a maintainer for) with Docker. Note that the docker image is experimental/community-maintained so this shouldn't be used for anything serious, just for testing purposes:

# I personally use podman because I don't need to deal with sudo permission issues,
# but if you're more comfortable with Docker, replace `podman` with `sudo docker`
podman run -it --device=/dev/kfd --device=/dev/dri --group-add=video --group-add=render --group-add=nobody --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined silkeh/solus:devel bash

Inside the container, run the following to install PyTorch with ROCm support:

sudo eopkg ur && sudo eopkg up --ignore-comar -y # similar to sudo apt-get update && sudo apt-get upgrade
sudo eopkg it --ignore-comar pytorch python-torchaudio python-torchvision python-torchtext rocm-info -y

Now you can proceed to run ComfyUI as normal. We didn't test against ComfyUI, but we did test against Fooocus, so the process should be similar. I assume you would need to create a venv to run ComfyUI. The only change you have to make is that when creating the venv (e.g. python3 -m venv <rest-of-your-arguments>), you should add the flag --system-site-packages. With Fooocus, it looks something like this:

# Before everything, check that ROCm is not completely broken
rocminfo
# Assume you're already inside the Fooocus directory 
python3 -m venv venv --system-site-packages
source venv/bin/activate
# You may get warnings about dependencies failing to be uninstalled/updated, but in practice we haven't found this to be problematic
pip3 install -r requirements.txt 
# Now run Fooocus!

If you're familiar with Nix, after NixOS/nixpkgs#298388 is merged you should also be able to use gfx101* GPUs with Nix. With PyTorch in particular this may be a bit difficult due to how Nix handles Python packages, but things that are just binary executables, like llama-cpp, should work seamlessly with Nix.

Regarding composable_kernel, gfx1010 is not supported. At Solus, we have a patch that enables some sort of support for it, so you may get a little performance boost, though nobody on the Solus team has a GPU to verify this claim 😅

@Zakhrov

Zakhrov commented Apr 25, 2024

Ok in all seriousness, this issue should be resolved if ROCm/Tensile#1897 is cherry-picked into a release. I'll open an issue there. @Zakhrov if you really want to test gfx1010 support, you can try on Solus (the distro that I'm a maintainer for) with Docker.

I'm currently on openSUSE 15.5 (which is compatible with packages built for SLES 15 SP5) and I run rocm on bare metal for the most part. Your patch works fine for everything except torchsde workloads - that needs comgr and rocsparse as well.

Regarding composable_kernel, gfx1010 is not supported. At Solus, we have a patch that enables some sort of support for it, so you may get a little performance boost, though nobody on the Solus team has a GPU to verify this claim 😅

That's what I manually patched into composable_kernel as well. It just finished building; now I have to try building MIOpen. I'll report back once I manage to build that.

@GZGavinZhao

GZGavinZhao commented Apr 25, 2024

I'm currently on openSUSE 15.5 (which is compatible with packages built for SLES 15 SP5) and I run rocm on bare metal for the most part. Your patch works fine for everything except torchsde workloads - that needs comgr and rocsparse as well.

I'm surprised that you need to rebuild rocSPARSE. comgr is needed because of https://lists.debian.org/debian-ai/2024/02/msg00178.html. In short, ROCm 6.0 changed the behavior such that device code unbundling is handled by comgr (which was not patched), whereas previously the unbundling was done by clr (which was patched). The fix is to simply switch the behavior back to avoid patching comgr. I'm not sure why rocSPARSE is affected by this.

That's what I manually patched into composable_kernel as well. It just finished building; now I have to try building MIOpen. I'll report back once I manage to build that.

I assume that you know you can pass the CMake flag -DGPU_TARGETS=<your-gpu-target> to only build for your own GPU. This can save a ton of time.

@GZGavinZhao

Wait @Zakhrov are you not on gfx1010? The comgr patch is only for allowing gfx1011, gfx1012 to run code compiled against gfx1010. If you're already on gfx1010, the comgr patch shouldn't be needed.

@Zakhrov

Zakhrov commented Apr 25, 2024

Wait @Zakhrov are you not on gfx1010? The comgr patch is only for allowing gfx1011, gfx1012 to run code compiled against gfx1010. If you're already on gfx1010, the comgr patch shouldn't be needed.

I have a Radeon RX 5600M which shows as gfx1010 in rocminfo. Maybe I built rocBLAS wrong the first time around

@GZGavinZhao

GZGavinZhao commented Apr 25, 2024

Good news: ROCm/Tensile#1897 will be included in the ROCm 6.2 release, judging from the release-staging/rocm-rel-6.2 branch at ROCm/Tensile. It seems like no additional ROCm 6.1 releases are planned, so this means gfx101* GPUs will likely be fixed in the next (minor) ROCm release.

@cgmb
Collaborator

cgmb commented Apr 25, 2024

Note that all the system packages on Ubuntu 24.04 have gfx1010 enabled. However, to use PyTorch you still need MIOpen. Once miopen and pytorch-rocm are packaged for Debian, I will port them to gfx1010 and set up a PPA for Ubuntu 24.04. With that said, performance will probably be quite poor as rocBLAS depends heavily on tuned assembly kernels for optimal performance (and nobody has done tuning for RDNA 1).
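
To put a rough number on that, a simple GEMM throughput check like this sketch can be compared against the card's theoretical peak (the script itself is generic; the rocBLAS dispatch happens inside the matmul):

import time
import torch

def gemm_tflops(n=2048, iters=20, dtype=torch.float32):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b                                   # each GEMM is 2*n^3 floating-point operations
    torch.cuda.synchronize()
    per_call = (time.perf_counter() - start) / iters
    return 2 * n**3 / per_call / 1e12

print(f"{gemm_tflops():.2f} TFLOP/s")           # compare against the GPU's theoretical peak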

@ppanchad-amd

@DGdev91 Has your issue been resolved? If so, please close the ticket. Thanks!

@DGdev91
Author

DGdev91 commented May 14, 2024

@DGdev91 Has your issue been resolved? If so, please close the ticket. Thanks!

I can't really test it anymore, as I changed the GPU some months ago (I now have a 7900 XT, which is working fine).

But according to @GZGavinZhao the fix will be included in the 6.2 release, so I was waiting for someone to test that version as soon as it's released before closing.

@kchousos

Does anyone have an ETA for ROCm 6.2?

@waheedi

waheedi commented Jul 18, 2024

That's what I manually patched into composable_kernel as well. It just finished building; now I have to try building MIOpen. I'll report back once I manage to build that.

@Zakhrov So were you able to build it for gfx1010? I could not build MIOpen. Do I need ck to be already built?

Start the build process again. This can take an insanely long time, thanks to aotriton building Triton from source and compiling a whole bunch of HSA code objects for the MI200 and MI300X. Disabling USE_FLASH_ATTENTION will still cause it to be built, so you might as well keep it enabled. Stopping and restarting the build process will cause all of those HSA code objects / HIP kernels to recompile, so be prepared for PyTorch to take over an hour to build

There is already a PR open for that part: pytorch/pytorch#125230 (comment)

@waheedi

waheedi commented Jul 25, 2024

That's what I manually patched into composable_kernel as well. It just finished building; now I have to try building MIOpen. I'll report back once I manage to build that.

@Zakhrov So were you able to build it for gfx1010? I could not build MIOpen. Do I need ck to be already built?

Start the build process again. This can take an insanely long time, thanks to aotriton building Triton from source and compiling a whole bunch of HSA code objects for the MI200 and MI300X. Disabling USE_FLASH_ATTENTION will still cause it to be built, so you might as well keep it enabled. Stopping and restarting the build process will cause all of those HSA code objects / HIP kernels to recompile, so be prepared for PyTorch to take over an hour to build

There is already a PR open for that part: pytorch/pytorch#125230 (comment)

OK, I actually managed to build the whole stack for gfx1010, with the latest PyTorch and all of the ROCm develop branches.

So I thought I would post an update on that. Thanks a lot anyway.

@veyn3141

@waheedi Amazing, would it be possible to upload the pytorch wheels somewhere?

@waheedi

waheedi commented Jul 26, 2024

@waheedi Amazing, would it be possible to upload the pytorch wheels somewhere?

@veyn3141 Amazing what, man? :) The wheel on its own is not going to help, as it would be bundled with some libraries that you won't have with a standard ROCm installation, so I don't think it would be of any help.

But I also hit a blocker on the last 200 tasks of the torch build, and right now I'm a bit stuck. #3445
