
Wrong asm file included in ARMv8 build #1854

Closed
rengolin opened this issue Nov 5, 2018 · 29 comments

@rengolin
Contributor

rengolin commented Nov 5, 2018

I did some performance analysis between ARMv8 and CORTEXA57 builds on a number of architectures (A53, A57, A72, QDF and ThunderX2), and NRM2 has massive regressions across the board when building for A57 instead of ARMv8. Upon closer look, it seems the ARMv8 build is selecting the scnrm2_thunderx2t99.c file (which has actual asm code as #defines). This must have happened after the recent series of patches to introduce ThunderX2 code.

Looking at the file itself, there's nothing in there that is actually specific to ThunderX2. All of it is plain NEON code, while the A57 one is plain VFP code.

The results are a bit mixed: the NEON TX2 code (in ARMv8) generates a lot more infs than the VFP A57 one, so that may be a reason for instability? I am not using fast-math, and there seems to be a special fast-math file for double precision complex, dznrm2_thunderx2t99_fast.c, but not for single.

I think this deserves a thorough investigation into which files are included in which builds, and we should potentially make sure plain NEON code is available to all arm64 variants, not just TX2.

@martin-frbg
Collaborator

The introduction of scnrm2_thunderx2t99.c for ARMV8 is/was part of PR #1821 (the one on which you commented two weeks ago). The A57 target kept its earlier implementation as nrm2.S.

@rengolin
Contributor Author

rengolin commented Nov 5, 2018

Yes, I think that was leaked into ARMv8 by mistake. I'm only beginning to understand the OpenBLAS build system, so I can't judge well what goes where.

@martin-frbg
Collaborator

Unfortunately I am not quite getting your remark about "massive regressions ... when building for A57". Is this to say that performance (and/or correctness) has taken a hit compared to earlier versions of OpenBLAS when building for TARGET=CORTEXA57, or that the ARMV8 build of OpenBLAS misbehaves on A57 hardware?

@rengolin
Contributor Author

rengolin commented Nov 5, 2018

Sorry, that was a bit terse. The ARMv8 build is much faster on a lot of machines because it includes a heavily optimised (2-lane vectorised and 16-way unrolled) TX2 version, which makes the A57 build (1-lane VFP, 1-way) look 10x "slower".

However, while that code works well on the heavier machines, it could thrash the cache on smaller machines, where more conservative VFP code is more suitable. SIMD-vectorised code would still be good even for the smallest Arm64 cores, but not unrolled like the ThunderX2 version, since those are massive cores.

I think we should write NEON code for all Arm64 routines, to make ARMv8 more suitable as the base standard, but it needs to be done step-wise and be suitable for all Arm64 cores, not just the massive ones.

Ultimately, the bug is the inclusion of ThunderX2-specific files in generic ARMv8 builds, even if that provides some benefit to some machines.

@martin-frbg
Collaborator

In view of that, is it still advisable to alias A53 to thunderx2-"tainted" ARMV8 rather than A57, or should that change be delayed until ARMV8 is more generic again?

@rengolin
Contributor Author

rengolin commented Nov 5, 2018

There are more than a few problems here, unfortunately.

The background:

  • A53 and A57 are different cores, but they implement the exact same instruction set and extensions, and have similar cache structures (they can be used together as a single macro-core of escalating frequency in mobile chips). Compiler optimisations done for one are very often valid for the other and vice-versa.
  • Arm vanilla cores (A53, A57, A72, A73) are all generic designs. They mandate the ISA and extensions, but each vendor can change the cache structure and potentially add extensions. For this reason, having only an A57 entry doesn't make a lot of sense.
  • ThunderX is a vendor-specific A53 with a different cache structure, so it may warrant its own category.
  • ThunderX2 is Vulcan, but "Vulcan" is dead, so I'll just merge them into the same category or get rid of Vulcan entirely.
  • Vulcan/ThunderX2 has mandatory new extensions (ARMv8.1), and new chips will come with even more (v8.2, .3, .4, ...). Those could have their own class (*), but they must still implement ARMv8.0, so backwards compatibility applies.

So, I would:

  1. Call A53+A57+A72+A73 ARMv8 and not have any information about caches other than what's mandated by the manuals (there may be some minimal guarantees, like associativity and minimal sizes), adding VFP and NEON to the defines.
  2. Make sure all ARMv8 cores define ARMV8 as well as their core-specific define, if any, so that any #ifdef based on ARMV8 can be applied to all sub-arches.
  3. Change all #ifdef CORTEXA57 to ARMV8 (see the sketch after this list), but leave the other core defines as they are, so that they apply only to those cores, not the generic ones.
  4. Get any non-vanilla assembly implementation out of ARMv8.
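
A minimal sketch of what points 2 and 3 could look like (the macro names follow existing OpenBLAS conventions, but the exact wiring is illustrative, not the current code):

  /* Illustrative only: every AArch64 core target keeps its own define
     but also gets the ARMV8 umbrella define, so generic #ifdefs apply
     to all sub-arches. */
  #if defined(CORTEXA53) || defined(CORTEXA57) || defined(CORTEXA72) || \
      defined(CORTEXA73) || defined(THUNDERX2T99)
  #ifndef ARMV8
  #define ARMV8 1
  #endif
  #endif

  /* Point 3: a kernel guard that used to read "#ifdef CORTEXA57"
     becomes generic: */
  #ifdef ARMV8
  /* generic ARMv8 code path, valid on every sub-arch */
  #endif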

In the future, we should start adding more NEON code to ARMv8, especially for use with fast-math. This will benefit all arches, so will improve speed across the board.

(*) Compiling for TX2 and running on another core raises Illegal Instruction because of the v8.1 issue in the threading library, not in OpenBLAS's routines, but the same could very well happen if we used those instructions in the asm files.

@rengolin
Contributor Author

rengolin commented Nov 5, 2018

PS: I'm more than glad to help define the boundaries, change the code and test the changes; I just want to make sure this sentiment is shared more widely before I start digging up old manuals. :)

@martin-frbg
Collaborator

So in essence your somewhat alarming "wrong asm" from the title of the issue boils down to "the scnrm2_thunderx2t99.c kernel may be a poor choice on some low-spec ARMV8-compatible hardware"? And CortexA57 would benefit from dropping its old VFP implementation in favor of ARMV8 NEON once that has been made suitably generic (again)?

I suspect the current ARM64 kernels have been written with some specific hardware samples in mind - be that HPC or some SoC in the smartphone(s) of the developer. Moving to a more systematic implementation will have its advantages, but I would hate to sacrifice speed on some widely available consumer hardware just because a formal spec for the minimal implementation of the "same" core does not guarantee some capability. Is there any established and generally applicable API for determining cache sizes on these sub-arches?

@rengolin
Contributor Author

rengolin commented Nov 6, 2018

So in essence your somewhat alarming "wrong asm" from the title of the issue boils down to "the scnrm2_thunderx2t99.c kernel may be a poor choice on some low-spec ARMV8-compatible hardware"?

I did not mean to be alarming, and what I said was "wrong asm file", which is accurate. Using the TX2 kernel is not just a poor choice on low-spec hardware, it's poor on any hardware but TX2, including some other widely available and beefy servers.

And CortexA57 would benefit from dropping its old VFP implementation in favor of ARMV8 NEON once that has been made suitably generic (again)?

Not at all. The current VFP implementation is good and works everywhere. A NEON implementation would be better (faster) but equally correct. For double precision, the main difference is two lanes instead of one, so the gains are not massive and may not warrant a speedy rewrite.

I suspect the current ARM64 kernels have been written with some specific hardware samples in mind - be that HPC or some SoC in the smartphone(s) of the developer.

Yes, they were developed by one vendor, to the detriment of all other vendors. I'm just making sure we keep those changes to the one vendor that wrote them. Furthermore, I have run the OpenBLAS micro-benchmarks on their own hardware and found mixed results compared to A57 builds.

I'm fine if they want to experiment with their hardware, but I'm responsible for all Arm HPC OSS support and I need to limit the damage to their own stuff.

Moving to a more systematic implementation will have its advantages, but I would hate to sacrifice speed on some widely available consumer hardware just because a formal spec for the minimal implementation of the "same" core does not guarantee some capability.

As I said, this is not an accurate assessment. Furthermore, TX2s implement a different standard (ARMv8.1) and can have instructions on which other machines will throw Illegal Instruction.

Is there any established and generally applicable API for determining cache sizes on these sub-arches?

I'm working on it, but the general answer is: no.

Kernels as new as 4.18 still do not have full support for most machines. HPC on Arm is a new field and it takes some time to get the little kinks out. The support is going out as HWCAP, via /sys, which is where cpuid gets its info, so once it's out, it's good. But it's not out yet.
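
For what it's worth, reading the HWCAP bits that do exist today is straightforward (a minimal sketch, assuming AArch64 Linux with glibc; cache geometry is a separate problem):

  #include <stdio.h>
  #include <sys/auxv.h>   /* getauxval, AT_HWCAP */
  #include <asm/hwcap.h>  /* HWCAP_FP, HWCAP_ASIMD (AArch64 Linux) */

  int main(void)
  {
      unsigned long hwcap = getauxval(AT_HWCAP);

      /* Both are mandatory on ARMv8.0, but the kernel reports them anyway. */
      printf("fp:    %s\n", (hwcap & HWCAP_FP)    ? "yes" : "no");
      printf("asimd: %s\n", (hwcap & HWCAP_ASIMD) ? "yes" : "no");
      return 0;
  }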

But that's not enough. In OpenHPC (and Debian's HPC and RedHat's builds, etc.), we need to distribute packages, and we cannot change the package depending on which host machine it's built on. We need to make sure that every time we build the packages, native or cross, on this or that server, we get exactly the same package in the end. We can only do that by forcing the TARGET.

DYNAMIC_ARCH is interesting and would be a better way to build packages, but the machines still don't have good HWCAP info, so it would be harder to select the best kernel that way.

I'm also working with OpenHPC to get optimised build packages (or Spack/EasyBuild), but that's a solution to a different problem, and it shouldn't factor into the base decisions for architecture support.

@martin-frbg
Collaborator

Alright, so do we want to revert the nrm2 part of #1821, or do you advocate reverting the entire PR #1821 (where I understood you had agreed with @ashwinyes that further enhancements could go on top of that change)? (And I have no idea if he is doing this work as an employee of that specific vendor or in his free time. Indeed HPC on ARM is new enough that I am at least as concerned about using OpenBLAS in some function or other on a mid-range smartphone. Recent "experience" with the TLS rewrite patches has proven that OpenBLAS has crept into so many non-specialist applications that it is now easy to break things for people who do not even know what BLAS is)

@rengolin
Contributor Author

rengolin commented Nov 6, 2018

Alright, so do we want to revert the nrm2 part of #1821, or do you advocate reverting the entire PR #1821 (where I understood you had agreed with @ashwinyes that further enhancements could go on top of that change)?

Reverting the whole thing would be bad, I think. I agree with the general idea; I just assumed all of that would go exclusively into TX2, not ARMV8. I didn't understand the include system well enough to have picked that up.

My idea was just to revisit the include system and remove any thunderx files from the armv8 / A57 builds.

(And I have no idea if he is doing this work as an employee of that specific vendor or in his free time.

I didn't mean it was intentional, just naturally biased by the hardware he tested on. I test on a larger set, so I get to see the discrepancies. :)

Indeed HPC on ARM is new enough that I am at least as concerned about using OpenBLAS in some function or other on a mid-range smartphone. Recent "experience" with the TLS rewrite patches has proven that OpenBLAS has crept into so many non-specialist applications that it is now easy to break things for people who do not even know what BLAS is)

Indeed! I always took BLAS to be HPC only, but I guess nowadays with deep learning going mainstream, even mobile phones are now running inference models. And for that, plain old NEON is good enough (and widely available), as it has 8 lanes of half-precision float support.

@ashwinyes
Contributor

ashwinyes commented Nov 6, 2018

Call A53+A57+A72+A73 ARMv8 and not have any information about caches other than what's mandated by the manuals (there may be some minimal guarantees, like associativity and minimal sizes), adding VFP and NEON to the defines.

Agree

Make sure all ARMv8 cores define ARMV8 as well as their core-specific define, if any, so that any #ifdef based on ARMV8 can be applied to all sub-arches.

Defining ARMV8 in all builds might break the builds/functionality, as the build uses this define in param.h and other places. Better would be to define something along the lines of HAVE_VFP, HAVE_ASIMD, etc.

@rengolin
Contributor Author

rengolin commented Nov 6, 2018

Defining ARMV8 in all builds might break the builds/functionality, as the build uses this define in param.h and other places. Better would be to define something along the lines of HAVE_VFP, HAVE_ASIMD, etc.

I'd go further and say that we shouldn't use any core-specific defines and should have them all as features (we do that in LLVM, for example).

It makes it much easier to add new cores: just set the correct defines and you get good support right off the bat.

But that means removing the ifdefs for ARMV8 and specific cores in favour of feature defines.

If we want to separate Intel from Arm32 from Arm64 code, for example, we can use the existing architecture flags that all compilers support:

https://sourceforge.net/p/predef/wiki/Architectures/
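
A hypothetical sketch of that approach, combining the HAVE_* feature defines proposed above (they don't exist in OpenBLAS today) with the compilers' standard predefined architecture macros:

  /* Hypothetical: derive feature defines from the predefined
     architecture macros, then dispatch on features, not core names. */
  #if defined(__aarch64__)
  #define HAVE_ASIMD 1    /* Advanced SIMD is mandatory on AArch64 */
  #elif defined(__arm__)
  #if defined(__ARM_NEON)
  #define HAVE_NEON 1
  #endif
  #define HAVE_VFP 1      /* assumed for the 32-bit targets we build */
  #endif

  #if defined(HAVE_ASIMD) || defined(HAVE_NEON)
  /* pick a vectorised kernel */
  #else
  /* fall back to the scalar C kernel */
  #endif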

@ashwinyes
Contributor

Regarding nrm2.s, znrm2.S, dznrm2_thunderx2t99_fast.c, dznrm2_thunderx2t99.c, and scnrm2_thunderx2t99.c

  • nrm2.s and znrm2.s are scalar implementations of kernel/arm/nrm2.c and kernel/arm/znrm2.c. These are not VFP versions.
  • dznrm2_thunderx2t99.c is a vectorized version of kernel/arm/znrm2.c. It is also threaded.
  • scnrm2_thunderx2t99.c does not use the approach in kernel/arm/nrm2.c. It does a simple squared addition and finally takes a square root to compute the nrm2. But due to the loss of precision in this approach, all the single-precision values are first promoted to double precision, and the final answer is converted back to single precision. Since division operations are avoided in this approach, it performs much better on TX2. It is also threaded.
  • dznrm2_thunderx2t99_fast.c uses the same approach as in scnrm2_thunderx2t99.c. But since double precision cannot be promoted to any higher precision, this yields wrong results in certain test cases. So it is not used.

If a generic nrm2 implementation is required, then I would suggest single-threaded assembly versions of scnrm2_thunderx2t99.c and dznrm2_thunderx2t99.c.
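
For clarity, the scnrm2 approach described above reduces to something like this scalar sketch (illustration only, not the vectorised, threaded kernel). It also shows why the same trick cannot save dznrm2: there is no wider type to promote doubles to, so with inputs around 1e200 the squares overflow to inf, which is exactly the failure mode of the _fast variant.

  #include <math.h>

  /* Scalar sketch of the promoted-accumulation scnrm2: square and sum
     in double precision, take one square root at the end, convert back
     to single. No per-element divisions, unlike the classic scale/ssq
     algorithm. x points to n complex single-precision values. */
  static float scnrm2_promoted(int n, const float *x, int incx)
  {
      double ssq = 0.0;
      for (int i = 0; i < n; i++) {
          double re = (double)x[0];
          double im = (double)x[1];
          ssq += re * re + im * im;
          x += 2 * incx;   /* complex stride: two floats per element */
      }
      return (float)sqrt(ssq);
  }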

@ashwinyes
Contributor

I'd go further and say that we shouldn't use any core-specific defines and should have them all as features (we do that in LLVM, for example).

I would not suggest removing all core-specific defines. Some tunings might not be derivable from the features alone.

@rengolin
Contributor Author

rengolin commented Nov 6, 2018

You mean arm64/nrm2.s.

  • nrm2.s and znrm2.s are scalar implementations of kernel/arm/nrm2.c and kernel/arm/znrm2.c. These are not VFP versions.

They are. They use FP instructions and FP registers, e.g.:

  // rescaling step of the classic scale/ssq update, when |x_i| (in s4)
  // becomes the new running scale:
  fdiv  s2, SCALE, s4      // scale / |x_i|
  fmul  s2, s2, s2         // (scale / |x_i|)^2
  fmul  s3, SSQ, s2        // ssq * (scale / |x_i|)^2
  fadd  SSQ, REGONE, s3    // ssq = 1 + ssq * (scale / |x_i|)^2

which is ok, because all ARMv8 arches have FP.

  • dznrm2_thunderx2t99.c is a vectorized version of kernel/arm/znrm2.c. It is also threaded.
  • scnrm2_thunderx2t99.c does not use the approach in kernel/arm/nrm2.c. It does a simple squared addition and finally takes a square root to compute the nrm2. But due to the loss of precision in this approach, all the single-precision values are first promoted to double precision, and the final answer is converted back to single precision. Since division operations are avoided in this approach, it performs much better on TX2. It is also threaded.

These don't perform well on other big Arm servers, so should be restricted to TX2.

  • dznrm2_thunderx2t99_fast.c uses the same approach as in scnrm2_thunderx2t99.c. But since double precision cannot be promoted to any higher precision, this yields wrong results in certain test cases. So it is not used.

Is it not used at all, or only used with fast-math? If not used at all, I don't see why it's in tree.

If a generic nrm2 implementation is required, then I would suggest single-threaded assembly versions of scnrm2_thunderx2t99.c and dznrm2_thunderx2t99.c.

We absolutely need generic versions. Core specific code should never be the default. How were those implemented before your patch?

@rengolin
Contributor Author

rengolin commented Nov 6, 2018

I'd go further and say that we shouldn't use any core-specific defines and should have them all as features (we do that in LLVM, for example).

I would not suggest removing all core-specific defines. Some tunings might not be derivable from the features alone.

The only case using the THUNDERX2T99 define I found is for swap, where it is also defined for ARMV8 and just sets/unsets another macro, SMP.

That could very easily just define SMP in the first place.

@ashwinyes
Contributor

ashwinyes commented Nov 6, 2018

which is ok, because all ARMv8 arches have FP.

I was not talking about ASIMD/NEON scalar FP instructions; I was talking about VFP. I thought VFP was the classic floating-point unit in 32-bit Arm. TX2 does not support 32-bit execution. See kernel/arm/nrm2_vfp*.S for VFP implementations of nrm2. Maybe we both meant the same thing, just a misunderstanding here and there.

We absolutely need generic versions. Core specific code should never be the default. How were those implemented before your patch?

There were no arm64-specific implementations before my patches (except an initial xgene1 sgemm kernel). Everything used the generic C kernels from the kernel/arm directory.

The aim of my work was to optimize for VULCAN (at the time). Since we didn't have silicon then, the code was initially written and tested on CortexA57. Later, when silicon became available, the focus was only on improving THUNDERX2T99.

I have not focused on generic versions. It will be very easy to implement generic versions based on the TX2 versions. I am OK with any changes for arm64 provided they do not affect TX2 performance.

The only case using the THUNDERX2T99 define I found is for swap, where it is also defined for ARMV8 and just sets/unsets another macro, SMP.

Please also see param.h.

Is it not used at all, or only used with fast-math? If not used at all, I don't see why it's in tree.

I kept it in the tree in case someone wants to use it. This version will give correct answers for most cases, and it will be 2-3 times faster than the existing version.

@rengolin
Contributor Author

rengolin commented Nov 6, 2018

I was not talking about ASIMD/NEON scalar FP instructions; I was talking about VFP. I thought VFP was the classic floating-point unit in 32-bit Arm. TX2 does not support 32-bit execution. See kernel/arm/nrm2_vfp*.S for VFP implementations of nrm2. Maybe we both meant the same thing, just a misunderstanding here and there.

Ah, welcome to Arm naming hell. :)

You are right, VFP doesn't "exist" in AArch64, though the FP support is very similar. :)

There were no arm64-specific implementations before my patches (except an initial xgene1 sgemm kernel). Everything used the generic C kernels from the kernel/arm directory.

Right, then we should go back to using them, and improve generic ARMv8 support separately.

I have not focused on generic versions. It will be very easy to implement generic versions based on the TX2 versions. I am OK with any changes for arm64 provided they do not affect TX2 performance.

Indeed, it will. Other sub-arches are negatively affected by your changes, so a generic NEON implementation (to be done later) would be better.

Please also see param.h.

Oh, right. Those are core-centric, you're right.

I kept it in the tree in case someone wants to use it. This version will give correct answers for most cases, and it will be 2-3 times faster than the existing version.

I think it's dangerous to leave code that could provide wrong answers for some cases lying around, especially without any warning.

@ashwinyes
Contributor

I think it's dangerous to leave code that could provide wrong answers for some cases lying around, especially without any warning.

There are also implementations which are functionally correct (see the 4x4/4x8 dgemm kernels, 8x8/4x4 sgemm kernels, etc.) but not used anymore. I am okay with deleting all such files and cleaning up.

@martin-frbg
Collaborator

Perhaps we could create a separate (sub)directory for unused (including unusable) implementations that should be kept for reference (or whatever). The same issue exists on x86_64 (at least).

@rengolin
Contributor Author

rengolin commented Nov 6, 2018

That'd be nice, yes.

It is helpful to have those implementations as a base for further work, even if unused, but they need to be marked somehow. A subdirectory of work-in-progress implementations would do that.

@ashwinyes
Contributor

Perhaps we could create a separate (sub)directory for unused (including unusable) implementations that should be kept for reference (or whatever). The same issue exists on x86_64 (at least).

Agree.

@rengolin I expect you may already be halfway through some of the patches you are working on. I will leave this clean-up to you so that you don't have to rebase again in case I do it.

@rengolin
Contributor Author

rengolin commented Nov 6, 2018

@ashwinyes I hadn't started, but don't worry, I'll do the clean-up, as it's easier for me to test on multiple vendors' hardware.

If anyone has a patch ready to move those files out, that should be fine. If it happens after I start, don't worry, I can rebase whenever files move.

@martin-frbg
Collaborator

I'd suggest "unused" rather than "work-in-progress" for the name of the "attic", but I will not move anything unless told to.

@rengolin
Contributor Author

rengolin commented Nov 6, 2018

"attic" sounds nice, too. :)

@ashwinyes
Contributor

I would suggest "archived" :)

@rengolin It would be good to test the routines once with https://github.com/xianyi/BLAS-Tester and see the regressions.

@rengolin
Contributor Author

rengolin commented Nov 6, 2018

@ashwinyes will do, thanks!

@rengolin
Contributor Author

This has been fixed in PR #1876.
