segfault during build of rocFFT on Fedora #422
Comments
Hmm.. I'm not super familiar with this component, but it looks like hipRTC is trying to call a null function pointer in libamd_comgr.so. Is libamd_comgr.so present in the default library search path when this is run? Is it loaded, according to gdb's |
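The first question above (whether libamd_comgr.so can be found on the library search path) can be checked outside of a debugger. The following is a minimal sketch using Python's standard ctypes module. The helper names `locate_comgr` and `can_load` are made up for illustration, and `ctypes.util.find_library` only approximates what the dynamic loader does, so a negative result here is a hint rather than proof.

```python
import ctypes
import ctypes.util
from typing import Optional


def locate_comgr(name: str = "amd_comgr") -> Optional[str]:
    """Return the library name/path resolved for lib<name>, or None if not found.

    Assumption: the library is installed as libamd_comgr.so, as named in the
    comment above; find_library expects the name without "lib" and ".so".
    """
    return ctypes.util.find_library(name)


def can_load(library: Optional[str]) -> bool:
    """Try to dlopen the library and report failure instead of raising."""
    if library is None:
        return False
    try:
        ctypes.CDLL(library)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    found = locate_comgr()
    print("resolved:", found)
    print("loadable:", can_load(found))
```

If the library resolves and loads here but hipRTC still crashes, the search path is probably not the problem and the cause is more likely a mismatch between components.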
@cgmb FYI |
@evetsso, if it helps, these are complete steps to reproduce the issue in a docker image:
You can then get a bit closer to the problem with:
It can also help to install debug packages:
There's too much that has been optimized out for me to understand what is wrong and I'm not very familiar with hiprtc. |
Just as a data point, Debian is using rocFFT from 5.5.1, built with HIP from 5.2.3 and LLVM 15. It has not encountered this problem, so I suspect this is related to a change in a low-level component like HIP, comgr, or clang.
Interesting.. I just tried @cgmb's repro steps (i.e. with the latest rawhide as of today) and get different behaviour now that ROCm 5.6 has been released:
So hipRTC in ROCm 5.6 is no longer trying to follow a null function pointer at least. Again, I'm not really familiar with hipRTC, but I also don't see any obvious changes in the clr commit log between ROCm 5.5 and 5.6 that would explain this. This new error seems to be Fedora-specific, and caused by https://src.fedoraproject.org/rpms/rocclr/blob/rawhide/f/0001-add-uint64_t-variant-for-__ffsll.patch - ideally the reason for that patch has gone away with ROCm 5.6 and it can just be reverted. But I'm not familiar enough with the original problem to be sure. |
Hi @tflink, do you have any update? Has the behaviour changed for you with ROCm 5.6? |
Apologies for the delay. I've been changing how I get notifications and managed to miss a few from GitHub. I see the same behavior you do when I build 5.6.0. I started patching out the kernel cache during the build when I learned that's how the Debian package handles the issue. I rebuilt rocclr without the patch you linked to and, while the runtime_error goes away, the build just hangs after If I patch out the kernel cache parts in the same way that the Debian patch does, rocFFT builds and doesn't hang at the end of the build process. Is the best way to help you reproduce issues with container images?
Also, I spoke with the author of the uint64 patch you linked to (@trixrt) and as far as we know, it's still needed for blender. |
I do not encounter this issue on Debian. rocfft 5.5.0-1 built successfully and did not include that patch. I only patched it out of the build because I had not yet figured out where the cache should be installed according to Debian policy, and it seemed wasteful to build the cache if it wasn't going to be installed. |
Good to know, thanks |
I'm getting the same issue with hanging at the end of the rocfft build on RHEL 9.2 using the AMD supplied packages for deps. Am I just being impatient? I left this build alone for longer, though. |
The end of the build does take a very long time, as that's when we're compiling all of the kernels that we want to distribute with the library, for all enabled architectures. The optional ROCFFT_BUILD_KERNEL_CACHE_PATH CMake parameter that was the subject of #430 can help somewhat, by defining a place where these kernels are persistently stored between builds so that they can be reused. But it doesn't reduce the initial amount of work that the build would want to do. rocFFT will be perfectly functional if the cache is not built or distributed with the library. But that means that all kernels will need to be compiled at runtime, since rocFFT will not find any that are distributed with it. Creation of FFT plans will take longer, but the plans will still work. A middle ground might be to remove architectures from AMDGPU_TARGETS. Fewer architectures in that list means fewer kernels that need to be compiled at the end of the build. rocFFT will still work on an architecture that is not in this list, because it will still compile kernels on-demand at runtime. |
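As a rough illustration of the two options described above (a persistent kernel cache directory and a shorter AMDGPU_TARGETS list), here is a sketch that assembles a CMake configure command. Only the `AMDGPU_TARGETS` and `ROCFFT_BUILD_KERNEL_CACHE_PATH` parameter names come from this thread. The function name, the build directory, the cache path and the architecture names in the usage example are placeholders, and the sketch only builds the argument list rather than running CMake.

```python
from typing import List, Optional, Sequence


def rocfft_cmake_args(
    source_dir: str,
    gpu_targets: Sequence[str],
    kernel_cache_path: Optional[str] = None,
    build_dir: str = "build",  # hypothetical build directory
) -> List[str]:
    """Build a cmake configure command line for a reduced-target rocFFT build."""
    if not gpu_targets:
        raise ValueError("at least one GPU target is required")
    args = [
        "cmake",
        "-S", source_dir,
        "-B", build_dir,
        # CMake lists are semicolon-separated; passing the arguments as a
        # list (not a shell string) avoids having to quote the semicolons.
        "-DAMDGPU_TARGETS=" + ";".join(gpu_targets),
    ]
    if kernel_cache_path is not None:
        # Optional: reuse compiled kernels between builds.
        args.append("-DROCFFT_BUILD_KERNEL_CACHE_PATH=" + kernel_cache_path)
    return args


if __name__ == "__main__":
    # Example values only; pick the architectures you actually need.
    print(rocfft_cmake_args(".", ["gfx90a", "gfx1030"], "/tmp/rocfft-kernel-cache"))
```

The resulting list can be handed to `subprocess.run` once the values are adjusted for the local setup. Fewer entries in `gpu_targets` should shorten the final kernel compilation step, at the cost of runtime compilation on the architectures left out.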
Yeah, I didn't kill the build on RHEL and it turns out that it took about 1.5 hours to do that last part. I'll redo the Fedora build locally and let it run. Unless something strange happens, it looks like we're going to have to deal with that patch and find some sort of fix.
After reverting that patch, rocFFT builds and works fine on this system. It turns out that the Thanks for the help with debugging this. |
What is the expected behavior
What actually happens
How to reproduce
Environment
Tagging @Mystro256 as he has worked on packaging many of the dependencies here.
The Issue
Important note: I'm working to package ROCm in Fedora so I'm building upon the bits we already have packaged instead of using AMD's prebuilt packages or building everything in /opt/rocm. Before I get to the packaging part, I'm just trying to build rocFFT on my local machine. As part of this, I'm doing the following
The output I get on screen ends with
Using system gdb (not ROCgdb) to get the backtrace:
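For readers who want to capture a similar backtrace, one common approach is to run the crashing command under gdb in batch mode. The sketch below only assembles the gdb command line. The function name and the example command are hypothetical, since the exact invocation used in this report is not shown here. The `--batch`, `-ex` and `--args` options are standard gdb options.

```python
from typing import List, Sequence


def gdb_backtrace_argv(command: Sequence[str], gdb: str = "gdb") -> List[str]:
    """Return an argv that runs `command` under gdb and prints a backtrace.

    In batch mode gdb runs the program, stops when it receives SIGSEGV,
    executes the remaining -ex commands, and then exits.
    """
    if not command:
        raise ValueError("command must not be empty")
    return [gdb, "--batch", "-ex", "run", "-ex", "backtrace", "--args", *command]


if __name__ == "__main__":
    # Placeholder command; substitute the build step that actually crashes.
    print(gdb_backtrace_argv(["./crashing-build-step", "--example-flag"]))
```

Installing the matching debug packages first, as suggested in the comments, generally makes the resulting frames more readable.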