Added scaffolding for Oryon arch as in Snapdragon X Elite #5537
base: develop
Conversation
Using NEOVERSEN1 kernels for now, with cache info taken from official specs.
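For context, most of what the cache figures feed is the GEMM blocking in param.h; here is a rough sketch of the shape of such an entry. The ORYON macro and every number below are illustrative placeholders, not the actual values in this patch:

```c
/* Hypothetical param.h entry for an ORYON target. All values are
 * illustrative placeholders, not the numbers from this patch. */
#if defined(ORYON)
/* Microkernel tile shape (reused from the Neoverse N1 kernels). */
#define SGEMM_DEFAULT_UNROLL_M 16
#define SGEMM_DEFAULT_UNROLL_N  4
#define DGEMM_DEFAULT_UNROLL_M  8
#define DGEMM_DEFAULT_UNROLL_N  4
/* P/Q/R blocking: these are what the L1/L2 sizes ultimately inform. */
#define SGEMM_DEFAULT_P 128
#define SGEMM_DEFAULT_Q 352
#define SGEMM_DEFAULT_R 4096
#define DGEMM_DEFAULT_P 160
#define DGEMM_DEFAULT_Q 128
#define DGEMM_DEFAULT_R 4096
#endif
```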
|
Thanks. Do you get markedly better performance with this change, compared to the default approach in 0.3.30 of autodetecting this CPU as a regular NEOVERSEN1? I would prefer to avoid the code and library size explosion from adding any and all arm64 design variants, so unless the exact model-specific cost tables make a serious difference to the compiler output, I'd like to avoid mere duplication. |
|
I need to do some benchmarking, so I'll report back on that. I have to imagine the significant difference in cache layout here is going to do something. |
|
Extremely unscientific run-through [benchmark charts]: stock Rblas.dll; OpenBLAS 0.3.30.dev; OpenBLAS NEOVERSEN1 kernels with Oryon cache sizes. |
|
Gonna be completely honest here: I can't quite tell. Looks like there are some sizes for which it performs better and some for which it is worse. Any recs for drilling down a bit deeper? edit: just saw the OPENBLAS_LOOPS setting, bear with me. |
|
[benchmark charts: 0.3.30.dev vs. Oryon-modded cache sizes] |
|
I think there's definitely something here, judging by the decent improvement at certain matrix sizes, but this isn't it yet, judging by the degraded performance at other sizes. May be worth having it as a full clone of Neoverse N1 (i.e. removing the cache changes I made here) pending further investigation. |
|
...I had an idea. This is an 8-wide chip; Neoverse N1 is 5-wide. I wonder what happens if I run the VORTEX target (which is 7-wide and should be otherwise compatible), because I get the feeling the optimization here isn't so much in the cache definitions as it is in the kernels. |
|
Scratch that, it would do nothing, as there's no difference. |
|
Yes, right now VORTEX is also just ARMV8 with a bunch of NEOVERSEN1 kernels on top. Without dedicated kernels, I think the easiest fix would be to put the proper L1 and L2 cache sizes in cpuid_arm64.c when we're on Windows, to guide the block sizes for GEMM etc. |
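Concretely, get_cpuconfig() in cpuid_arm64.c prints the #defines that getarch writes into config.h. A sketch of what an Oryon branch might look like, shown as a standalone function for clarity (in the real file it would be one more case in the existing switch); CPU identifiers here are hypothetical, and the cache figures (192 KB L1I, 96 KB L1D per core, 12 MB L2 per cluster) are taken from the hwcooling analysis linked in the PR description, so double-check them:

```c
/* Sketch of the Oryon branch get_cpuconfig() in cpuid_arm64.c could
 * grow. Cache figures come from the hwcooling write-up and should be
 * verified before use. */
#include <stdio.h>

static void print_oryon_config(void) {
  printf("#define ORYON\n");
  printf("#define L1_CODE_SIZE 196608\n");  /* 192 KB L1I */
  printf("#define L1_DATA_SIZE 98304\n");   /*  96 KB L1D */
  printf("#define L1_DATA_LINESIZE 64\n");
  printf("#define L2_SIZE 12582912\n");     /* 12 MB shared per cluster */
  printf("#define L2_LINESIZE 64\n");
  printf("#define DTB_DEFAULT_ENTRIES 64\n");
  printf("#define DTB_SIZE 4096\n");
}
```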
|
Yeah, and even if there is optimization to be had here (and there almost certainly is), I don't even know that the cache sizes are an improvement. |
|
Probably needs larger loop counts to get more stable benchmark results. I do have an Oryon system on loan from Qualcomm; it's just that I'm away from it at the moment, but I'll try to run some experiments myself when I have more time for OpenBLAS again, hopefully soon.
Oh, I can absolutely just run them on my laptop. How large are we talking? |
|
I'd guess a hundred loops instead of ten should help.
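For reference, the drivers in benchmark/ read the loop count from the OPENBLAS_LOOPS environment variable and take a from/to/step size range on the command line; something like the following, where the size range and thread count are just an example:

```
OPENBLAS_LOOPS=100 OPENBLAS_NUM_THREADS=12 ./benchmark/dgemm.goto 256 4096 256
```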
Will report back. With bonus ArmPL for comparison. |
|
So, it turns out the issue was mostly that running BLAS on 12 cores well exceeds the cooling capacity of my laptop. Fixed that one. Anyway: seems that there's a thousand to a few thousand MFlops difference in favor of the cache-tuned build at all sizes, which is more what I would have expected. Funnily enough, ArmPL seems to be on par with the N1 build and similarly behind the tuned build. Guess that does make sense; they did optimize for their own cores. Do we know if QC has an optimized implementation? [benchmark charts: N1, Oryon, ArmPL for comparison] |
|
Hmm. I'm still not that convinced. Looks like there is still a lot of noise in the data, and where there does appear to be an improvement from using the correct cache sizes, it is around 2 percent at most? |
By noise do you mean the fluctuating MFlops as size increases? That's actually fairly reproducible. And yes, around 2%. I think the bottleneck here isn't so much cache locality as it is the difference in execution width (5-wide vs. 8-wide). edit: looking at the block diagrams, it appears the correct way of looking at it is 2 NEON/FP units on the N1 vs. 4 on Oryon.
|
Hi @theAeon, out of curiosity, are you going to add/modify/optimize any kernels for this arch in the future? |
|
Unfortunately this is not exactly my strong suit, so while I will take a look, I am... not expecting to, no. |
|
I can add a small hack to the CPU detection code to put the correct cache sizes in the config file, as that bit of performance gain is low-hanging (if fairly small) fruit. But frankly I expect the upcoming X2 Elite CPU with its SVE+SME capability to be a markedly more attractive platform for any kind of numerical workload, and it should be quite adequately covered by the ARMV9SME target already. |
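For the Windows case, rather than hardcoding values, the cache sizes could in principle be discovered at getarch time via the Win32 cache-topology API. A self-contained sketch, not code from this PR, of roughly what that lookup looks like; the output mimics the #defines getarch emits:

```c
/* Sketch: query L1D/L2 cache sizes on Windows via the cache-topology
 * API, then print them in getarch's #define format. Illustration only,
 * not code from this PR. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  DWORD len = 0;
  /* First call fails with ERROR_INSUFFICIENT_BUFFER and sets len. */
  GetLogicalProcessorInformationEx(RelationCache, NULL, &len);
  SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *buf = malloc(len);
  if (!buf || !GetLogicalProcessorInformationEx(RelationCache, buf, &len))
    return 1;

  DWORD l1d = 0, l2 = 0;
  for (char *p = (char *)buf; p < (char *)buf + len;) {
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *info = (void *)p;
    if (info->Relationship == RelationCache) {
      CACHE_RELATIONSHIP *c = &info->Cache;
      /* L1 may be reported as data or unified depending on the core. */
      if (c->Level == 1 && c->Type != CacheInstruction) l1d = c->CacheSize;
      if (c->Level == 2)                                l2  = c->CacheSize;
    }
    p += info->Size;  /* records are variable-length */
  }
  printf("#define L1_DATA_SIZE %lu\n", (unsigned long)l1d);
  printf("#define L2_SIZE %lu\n", (unsigned long)l2);
  free(buf);
  return 0;
}
```

The API reports one record per cache per processor group, so a real integration would deduplicate; for picking GEMM block sizes the first L1D/L2 hit is enough.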
|
That sounds like the way to go. |
All I know is that this builds and works fine with clangarm64 on my laptop. Unsure about performance improvement, but certainly no performance regression.
I am not an assembly wizard, so this still uses the Neoverse kernels. I imagine there is much optimization to be had. Feel free to edit if I missed a spot.
See https://www.hwcooling.net/en/oryon-arm-core-in-snapdragon-x-cpus-architecture-analysis/ for the cache reference.