-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wrong asm file included in ARMv8 build #1854
Comments
The introduction of the scnrm2_thunderx2t99.c for ARMV8 is/was part of PR #1821 (the one on which you commented two weeks ago). The A57 target kept its earlier implementation as nrm2.S |
Yes, I think that was leaked into ARMv8 by mistake. I'm only beginning to understand the OpenBLAS build system, so I can't judge well what goes where. |
Unfortunately I am not quite getting your remark about "massive regressions ... when building for A57". Is this to say that performance (and/or correctness) has taken a hit compared to earlier versions of OpenBLAS when building for TARGET=CORTEXA57, or that the ARMV8 build of OpenBLAS misbehaves on A57 hardware ? |
Sorry, that was a bit terse. Building with ARMv8 is much faster on a lot of machines because it includes a heavily optimised (2-lane vectorised and 16--way unrolled) TX2 version, which makes the A57 (1-lane VFP 1-way) 10x "slower". However, that code works well on the heavier machines, but could trash the cache on smaller machines, where a more conservative VFP code is more suitable. A SIMD vectorised code would still be good, even for the smallest Arm64, but not unrolled like that for ThunderX2s, which are massive cores. I think we should write NEON code for all Arm64 routines, to make ARMv8 more suitable as the base standard, but it needs to be step-wise and suitable to all Arm64 cores, not just massive ones. Ultimately, the bug is to include ThunderX2-specific files into ARMv8 generic builds, even if that provides some benefits to some machines. |
In view of that, is it still advisable to alias A53 to thunderx2-"tainted" ARMV8 rather than A57, or should that change be delayed until ARMV8 is more generic again ? |
There's more than a few problems here, unfortunately. The background:
So, I would:
In the future, we should start adding more NEON code to ARMv8, especially for use with fast-math. This will benefit all arches, so will improve speed across the board. (*) Compiling for TX2 on another core raises illegal instruction because of the v8.1 issue on the threading library, not OpenBLAS's routines, but that could very well happen if we used them in the asm files. |
PS: I'm more than glad to help defining the boundaries, changing code and testing the changes, I just want to make sure this sentiment is shared more widely before I start digging up old manuals. :) |
So in essence your somewhat alarming "wrong asm" from the title of the issue boils down to "the scnrm2_thunderx2t99.c kernel may be a poor choice on some low-spec ARMV8 compatible hardware ? And CortexA57 would benefit from dropping its old vfp implementation in favor of ARMV8 NEON once that has been made suitably generic (again) ? I suspect the current ARM64 kernels have been written with some specific hardware samples in mind - be that HPC or some SoC in the smartphone(s) of the developer. Moving to a more systematic implementation will have its advantages, but I would hate to sacrifice speed on some widely available consumer hardware just because a formal spec for the minimal implementation of the "same" core does not guarantee some capability. Is there any established and generally applicable API for determining cache sizes on these sub-arches ? |
I did not mean to be alarming, and what I said was "wrong asm file", which is accurate. Using the TX2 kernel is not just a poor choice in low-spec hardware, it's poor in any hardware but TX2, including some other widely available and beefy servers.
Not at all. The current VFP implementation is good and works everywhere. A NEON implementation would be better (faster) but equally correct. For double precision, the main difference is two lanes instead of one, so the gains are not massive and may not warrant a speedy re-write.
Yes, they were developed by one vendor, in detriment to all other vendors. I'm just making sure we keep those changes to the one vendor that wrote them. Furthermore, I have ran the OpenBLAS micro-benchmarks on their own hardware and have found mixed results, comparing to A57 builds. I'm fine if they want to experiment with their hardware, but I'm responsible for all Arm HPC OSS support and I need to limit the damage to their own stuff.
As I said, this is not an accurate assessment. Furthermore, TX2s implement a different standard (ARMv8.1) and can have instructions that other machines will throw Illegal Instruction.
I'm working on it, but the general answer is: no. Kernels as new as 4.18 still do not have all support from most machines. HPC on Arm is a new field and it takes some time to get the little kinks out. The support is going our as HWCAP, via /sys, which is where cpuid gets its info, so once it's out, it's good. But it's not out yet. But that's not enough. In OpenHPC (and Debian's HPC and RedHat's builds, etc), we need to distribute packages and we cannot change the package depending on which host machine they're built. We need to make sure every time we build the packages, native or cross, on this or that server, we get exactly the same package in the end. We can only do that by forcing the TARGET. The DYNAMIC_ARCH is interesting and would be a better way to build packages, but the machines still don't have good HWCAP info and it'd be harder to select the best choice then. I'm also working with OpenHPC to get optimised builds packages, or Spack/Easybuild, but that's a solution for a different problem, and shouldn't factor the base decisions for architecture support. |
Alright, so do we want to revert the nrm2 part of #1821, or do you advocate reverting the entire PR #1821 (where I understood you had agreed with @ashwinyes that further enhancements could go on top of that change) ? (And I have no idea if he is doing this work as an employee of that specific vendor or in his free time. Indeed HPC on ARM is new enough that I am at least as concerned about using OpenBLAS in some function or other on a mid-range smartphone. Recent "experience" with the TLS rewrite patches has proven that OpenBLAS has crept into so many non-specialist applications that it is now easy to break things for people who do not even know what BLAS is) |
Reverting the whole thing would be bad, I think. I agree with the general idea, I just assumed all of that would go exclusively into TX2, not ARMV8. I didn't understand the include system well enough to have picked that up. My idea was just to revisit the include system and remove any thunderx files from the armv8 / A57 builds.
I didn't mean it was intentional, just naturally biased by the hardware he tested on. I test on a larger set, so I get to see the discrepancies. :)
Indeed! I always took BLAS to be HPC only, but I guess nowadays with deep learning going mainstream, even mobile phones are now running inference models. And for that, plain old NEON is good enough (and widely available), as it has 8 lanes of half-precision float support. |
Agree
Defining ARMV8 in all builds might break the builds/functionality as the build uses this define in params.h and other places. Better would be define something in the lines HAVE_VFP, HAVE_ASIMD etc. |
I'd go further and say that we should't use any core-specific defines and have them all as features (we do that in LLVM, for example). It's much easier to add new cores, just set the correct defines and you have good support, right off the bat. But that means removing ifdefs for ARMV8 and cores in favour of feature defines. If we want to separate code from Intel to Arm32 from Arm64, for example, we can use the existing architecture flags that all compilers support: |
Regarding nrm2.s, znrm2.S, dznrm2_thunderx2t99_fast.c, dznrm2_thunderx2t99.c, and scnrm2_thunderx2t99.c
If a generic nrm2 implementation is required, then I would suggest to have single threaded assembly versions of scnrm2_thunderx2t99.c and dznrm2_thunderx2t99.c. |
I would not suggest to remove all core specific defines. Some tunings might not be derivable from the features alone. |
You mean
They are. They use FP instructions and FP registers, ex:
which is ok, because all ARMv8 arches have FP.
These don't perform well on other big Arm servers, so should be restricted to TX2.
Is it not used at all, or only used with fast-math? If not used at all, I don't see why it's in tree.
We absolutely need generic versions. Core specific code should never be the default. How were those implemented before your patch? |
The only case using THUNDERX2T99 define I found is for swap, which is also defined for ARMV8 and it just sets/unsets another macro, SMP. That could very easily be defining SMP in the first place. |
I was not saying about ASIMD NEON Scalar FP instructions . I was saying VFP. I thought VFP was the classic floating point unit in 32-bit arm. TX2 does not support 32-bit execution. See kernel/arm/nrm2_vfp*.S for VFP implementations of nrm2. May we both meant the same, just a misunderstanding here and there.
There was no arm64 specific implementations before my patches (except an initial xgene1 sgemm kernel) . Everything used the generic C kernels from kernel/arm directory. The aim of my work was to optimize for VULCAN (then). Since we didn't have silicon at that time, it was initially written and tested on CortexA57. Later when silicon became available, the focus was only on improving for THUNDERX2T99. I have not focused on generic versions. It will be very easy to implement generic versions based on the TX2 versions. I am ok to any changes for arm64 provided that it does not affect TX2 performance.
Please see param.h also .
I kept it in tree in case if someone wants to use it. This version will give correct answers for most cases. |
Ah, welcome to Arm naming hell. :) You are right, VFP doesn't "exist" in AArch64, though the FP support is very similar. :)
Right, then we should go back using them, and improve generic ARMv8 support in separate.
Indeed, it will. Other sub-arches are negatively affected by your changes, so a generic NEON implementation (to be done later) would be better.
Oh, right. Those are core-centric, you're right.
I think it's dangerous to leave code that could provide wrong answers for some cases lying around, especially without any warning. |
There are also implementations which are functionally correct (See 4x4/4x8 dgemm kernels, 8x8/4x4 sgemm kernels etc) but not used anymore. I am okay to deleting all such files and cleaning up. |
Perhaps we could create a separate (sub)directory for unused (including unusable) implementations that should be kept for reference (or whatever). The same issue exists on x86_64 (at least). |
That'd be nice, yes. It is helpful to have those implementations as base for further work, even if unused, but it needs to be marked somehow. A sub directory with work-in-progress implementations would do that. |
Agree. @rengolin I hope you would be half way thru some of the patches that you are working on. I will leave this clean up to you so that you you don't have to rebase again in case I do it. |
@ashwinyes I hadn't started, but don't worry, I'll do the clean up, as it's easier for me to test on multiple vendors. If anyone has a patch ready to move those out, should be fine. If it happens after I start, don't worry, I can rebase whenever files move. |
I'll suggest "unused" rather than "work-in-progress" for the name of the "attic", but I will not move anything unless told. |
"attic" sounds nice, too. :) |
I would suggest "archived" :) @rengolin It would be good to test the routines once with https://github.com/xianyi/BLAS-Tester and see the regressions. |
@ashwinyes will do, thanks! |
This has been fixed in PR #1876 |
I did some performance analysis between ARMv8 and CORTEXA57 builds on a number of architectures (A53, A57, A72, QDF and ThunderX2) and NRM2 has massive regressions across the board when building for A57 instead of ARMv8. Upon a closer look, it seems ARMv8 build is selecting the
scnrm2_thunderx2t99.c
file (which has actual asm code as #defines). This must have happened after the recent series of patches to introduce ThunderX2 code.Looking at the file itself, there's nothing in there that is actually specific to ThunderX2. All of that is is plain NEON code, while the A57 one is plain VFP code.
The results are a bit mixed: the NEON TX2 code (in ARMv8) generates a lot more
infs
than the the VFP A57, so that may be a reason for instability? I am not using fast-math, an there seems to be a special file for fast math for double precision complex,dznrm2_thunderx2t99_fast.c
, but not single.I think this deserves a thorough investigation on which files are included in which builds and potentially make sure plain NEON code is available to all arm64 variants, not just TX2.
The text was updated successfully, but these errors were encountered: