rocFFT Test Suite Fails #439
I would expect zero failing tests. Debian is still on rocFFT from ROCm 5.5.1 and it was built in release mode, but that version had 289081 passing tests and 587 skipped tests on gfx906. |
Thanks for confirming the expected test results. I wanted to make sure that they were expected to all pass before digging into why they're failing. I'll close the issue and re-file if I find something that isn't self-inflicted. |
Can you provide a couple of examples of failing tests, and any messages that accompany them? |
I copied a few of the first failures into a file. I captured all stdout and stderr from the test suite with normal output, and I think I have some output with higher verbosity as well. |
All of the test cases in your sample are half-precision, so it would be interesting to see if any half-precision test cases passed for you. All of the half-precision test cases have `half` in the test name. If none did, then it sounds plausible that something's wrong with half-precision arithmetic on either the host side or the device side. To troubleshoot that, I would isolate a small test case to see what's going on:

```
rocfft-test-d --gtest_filter=pow2_1D_half/accuracy_test.vs_fftw/complex_forward_len_2_half_ip_batch_1_istride_1_CI_ostride_1_CI_idist_2_odist_2_ioffset_0_0_ooffset_0_0 --verbose 4
```

This will show the actual numbers that we got from both rocFFT and the reference implementation (FFTW). I would imagine one of them looks totally bogus, so once we know which one, we can dig further. |
It looks like all of the test failures have half in the test name. Additionally, I'm noticing this at the end of the test runner output that doesn't seem right:
I ran the test you suggested. Here is the output |
Hmm, it's hard to be sure, but I think something is wrong with half-precision arithmetic on the host as opposed to the device. The way the tests are structured, we're generating FFT input data on the GPU device. The test copies that input data to the host. rocFFT does an FFT on the device copy while FFTW works on the host copy. Then we copy the GPU output back to the host to compare them. What's most suspicious to me is the input data that rocFFT is working from. Assuming that's all OK, I think the next step would be to dig in with a debugger (e.g. rocgdb) - in particular, right after the hipMemcpy around accuracy_test.h:1556 is done, if we can print the copied buffer we can see whether the host-side data already looks wrong. Is this package (with the debug build) available for download? I wouldn't mind trying to repro this in a Docker container. |
FYI I've reproduced this problem in a rawhide Docker container. Something seems to be wrong with optimization and the compiler runtime. With this test program (float16.cpp):

```cpp
#include <iostream>
#include <algorithm>
#include <vector>

int main()
{
    const size_t count = 4;
    std::vector<_Float16> f16(count);
    std::vector<float>    f32(count);

    for(size_t i = 0; i < count; ++i)
        f16[i] = static_cast<_Float16>(0.123) * static_cast<_Float16>(i);

    std::copy_n(f16.begin(), count, f32.begin());

    for(size_t i = 0; i < count; ++i)
    {
        std::cout << "loop f16=" << static_cast<float>(f16[i]) << " f32=" << f32[i] << std::endl;
    }

    _Float16 one_f16 = 0.123;
    float    one_f32 = one_f16;
    std::cout << "one f32=" << one_f32 << std::endl;

    return 0;
}
```

Build it several ways:
-O3 produces sensible output, but the other versions show varying levels of incorrect output. Even converting a single _Float16 to float comes out wrong. Setting HIPCC_VERBOSE=7 in the environment will show the command line that is ultimately passed down to clang++. I also note that ROCm/rocBLAS#1350 seems suspiciously similar, but in rocBLAS instead. I'm not familiar enough with how these options are supposed to interact to know what the fix is. It's quite possible that this has already been fixed upstream (in LLVM, I imagine). I'll ask some colleagues to see if they can shed some light on this problem. |
It might be related to the fp16 ABI change: llvm/llvm-project#56854 |
Thank you for doing that. It was taking me a while to write an example because as it turns out, I was approaching the problem backwards. On the bright side, I learned a lot about FP16 and std::format in C++23. Since I had it set up, I ran your code in a Debian 12 amd64 VM and the console output is consistent for all optimization levels.
I'm not clear on what information you're asking for. Current Fedora Rawhide has llvm 16.0.6. The cmake command used when it's built (newlines inserted for readability) is:
The full build logs for llvm-16.0.6 and the patches used for that build are available. On a semi-related note, llvm17 will be landing in Fedora Rawhide before much longer. llvm16 will continue to be available for a while yet, though. |
I think you may need to add -DLLVM_ENABLE_RUNTIMES="compiler-rt" to your cmake command line. Otherwise it may not build compiler-rt, and then clang may use the system compiler-rt, which may cause an ABI mismatch. |
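For illustration, the suggested flag would slot into an LLVM build invocation roughly like this — a sketch only; the other flags shown are placeholders, not Fedora's actual configuration:

```
cmake -G Ninja -S llvm -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang" \
  -DLLVM_ENABLE_RUNTIMES="compiler-rt" \
  ...
```

With LLVM_ENABLE_RUNTIMES, compiler-rt is compiled by the just-built clang, so the runtime library and the compiler agree on details such as the _Float16 ABI.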
OK, sounds like a place to start. I'll make some test builds and we'll see if that fixes things. |
After a bit more research, it turns out that compiler-rt is built as a separate package in Fedora, and both packages are on llvm 16.0.6 (latest released compiler-rt build logs) on Fedora Rawhide, which my test system is running. Is there a way to test for an ABI mismatch other than building a monolithic llvm and potentially breaking a bunch of other stuff? |
According to that bug, the ABI changed in Clang 15. So it seems like Clang 16 and newer should be unaffected. @tflink Are the tests linking against anything that was built with Clang 14? |
Not that I know of, no. AFAIK, everything built with Clang in Fedora was rebuilt with Clang 16 in the F39 mass rebuild last month and my system was updated to Rawhide as of at least 2023-08-14 (and probably more recent but I don't recall off the top of my head) when I submitted this issue. |
Maybe use -O0 -save-temps -v to compile and link that test program. Then we can check the dumped .s file to see what ABI the caller is using, and use objdump -d on the executable to see what ABI the called function is using. Then we can investigate why the callee has the wrong ABI. |
@tflink Can you post the full build log from when you built rocFFT? |
I've done that and uploaded everything, including the cpp file, the output binary, and all the bits generated by hipcc. In particular: |
I didn't save it when I built rocFFT last but I can rebuild and save the output. Are there any particular options that you'd like to see? I assume |
@tflink I actually just wanted to see the build commands used to build the test, and it looks like you posted that in the previous comment. |
I don't know if this is useful, but I did the same test program compile on a Debian 12 VM that is NOT showing this issue. I'm the wrong person to ask about the build details, but this Debian 12 system appears to have llvm and clang 14.0 available from the Debian repos but
|
Would you be able to test in a Fedora 37 container? Fedora 37 has clang 15.0.6 just like debian 12. |
I will try to do so, but the hipcc package is pretty new and was never built in F37. As I understand it, quite a bit of work went into getting it packageable, so I'm not terribly optimistic that it'll work, but we'll find out :) |
@tflink OK, I wouldn't go to a lot of trouble to get it built then. I thought it would be easy to try. |
hipcc won't be needed. You could take the test from #439 (comment), which is just a plain CPU program, and then compile it with clang++ directly. |
I attempted to run the tests from the earlier comment on an F37 VM (clang 15.0.7) and got the following output:
Assuming that I've understood the non-hipcc reproducer correctly, the same issue doesn't show up in F37 and clang 15. |
I'm going to try rebuilding hipcc on F39 against clang/llvm 15 instead of 16 |
This is going to take more effort than I anticipated, if it's even possible. I'll update again if I'm able to make it work. |
This looks like a bug in clang 16:
|
I've narrowed this down further, and it looks like there is a bug in Fedora's compiler-rt package. |
You could add the clang flags -mf16c and -D__STDC_WANT_IEC_60559_TYPES_EXT__=1 to see if they help the compiler, but the former may be the default on all your clang versions. |
@tflink Can you install this build and see if it fixes the issue for you:
|
@tstellar I'll do a full rocFFT build and test cycle and report back either later today or tomorrow since that build+test cycle takes almost 6 hours on my machine. In the meantime, the fix looks good for the example posted above. On my F40 machine with LLVM16, I'm now seeing the same output that I was seeing on F37 with LLVM15.
|
@tstellar I rebuilt rocFFT locally with the compiler-rt build you linked and the tests passed; I don't see a single failure in the test output that I captured. Additionally, the half-precision stats printed at the end of the test suite look far less suspicious to me.
Unless someone sees something that I don't, I think this can be closed. Thank you all for your help in getting this resolved. |
@tstellar Do you have a link to the compiler-rt change that fixes this problem? ROCm/rocBLAS#1350 seems to be hitting the same problem, but for Gentoo and I'm wondering how they can get this fix. |
As this issue appears to be caused and fixed upstream, I'm closing it now. Thanks, everyone, for your input. |
Hello, is this patch submitted to llvm-project upstream? |
@littlewu2508 It's not an upstream bug. It was a bug in the Fedora builds. |
Sorry, I don't quite understand -- the patch is on upstream's CMakeLists.txt, so I would think the bug is from upstream. Also, this patch fixes Gentoo's problem (ROCm/rocBLAS#1350), so I think it is not specific to one distro. Upstreaming the patch would also make things easier for distro packagers. |
@littlewu2508 Upstream does not support the build configurations that Fedora is using. I'm guessing this is probably also true for Gentoo. So, they won't accept a patch like this. |
Thanks for the explanation! I'll contact the Gentoo maintainer @mgorny to help look at this, since this is a major regression with ROCm libraries. @mgorny Can you please take a look at https://src.fedoraproject.org/fork/tstellar/rpms/compiler-rt/blob/0459cbc5d9eb15f1ad51d74707b4988049183708/f/0001-compiler-rt-Fix-FLOAT16-feature-detection.patch and apply it to the compiler-rt package? The Gentoo ROCm-5.6 stack is having the same problem, and I haven't bumped the version because of this. It at least affects rocBLAS, as ROCm/rocBLAS#1350 reported. |
CMAKE_TRY_COMPILE_TARGET_TYPE defaults to EXECUTABLE, which causes any feature detection code snippet without a main function to fail, so we need to make sure it gets explicitly set to STATIC_LIBRARY.

Bug: ROCm/rocFFT#439
Bug: ROCm/rocBLAS#1350
Bug: https://bugs.gentoo.org/916069
Closes: #69842
Reviewed by: thesamesam, mgorny
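In CMake terms, the fix amounts to building the feature-detection probe as a static library so that it is never linked and therefore doesn't need a main function. A sketch of the relevant fragment — illustrative only, not the exact Fedora patch, and the COMPILER_RT_HAS_FLOAT16 variable name is an assumption:

```
# try_compile() defaults to CMAKE_TRY_COMPILE_TARGET_TYPE=EXECUTABLE, which
# links the probe and so fails for snippets that lack main().  Building the
# probe as a static library skips the link step, letting the _Float16
# detection succeed.
set(CMAKE_TRY_COMPILE_TARGET_TYPE STATIC_LIBRARY)
check_cxx_source_compiles("_Float16 f(_Float16 x) { return x; }"
                          COMPILER_RT_HAS_FLOAT16)
set(CMAKE_TRY_COMPILE_TARGET_TYPE EXECUTABLE)
```

Without the override, the probe fails to link, FLOAT16 support is reported missing, and compiler-rt's builtins fall back to an incompatible representation — producing the host-side half-precision ABI mismatch seen in this issue.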
Apologies for opening an issue about this but I couldn't find another way to ask this question.
I'm working to build and package rocFFT in Fedora, and while the previous build issue has been resolved (#422), I'm now trying to run the test client and I'm unclear on what the expected results are, or whether the test client is expected to have 0 FAILED results.
What is the expected behavior
What actually happens
172385 tests from 50 test suites ran in 3.8 hours
How to reproduce
rocfft-test-d
Environment