@theAeon
Contributor

@theAeon theAeon commented Nov 18, 2025

All I know is that this builds and works fine with clangarm64 on my laptop. Unsure about performance improvement, but certainly no performance regression.

I am not an assembly wizard, so this still uses the neoverse kernels. I imagine there is much optimization to be had. Feel free to edit if I missed a spot.

https://www.hwcooling.net/en/oryon-arm-core-in-snapdragon-x-cpus-architecture-analysis/ for cache reference

Using NeoverseN1 Kernels for now with cache info
taken from official specs.
@martin-frbg
Collaborator

Thanks - do you get markedly better performance with this change, compared to the default approach in 0.3.30 of autodetecting this cpu as a regular NEOVERSEN1? I would prefer to avoid the code and library size explosion from adding any and all arm64 design variants, so unless the exact model-specific cost tables make a serious difference to the compiler output I'd like to avoid mere duplication.

@theAeon
Contributor Author

theAeon commented Nov 18, 2025

I need to do some benchmarking, so I'll report back on that. I have to imagine the significant difference in cache layout here is going to do something.

@theAeon
Contributor Author

theAeon commented Nov 18, 2025

Extremely unscientific runthrough:

Stock Rblas.dll:

& "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\deig.R && & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dgemm.R && & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dsolve.R
From 128 To 2048 Step=128 Loops=1
      SIZE             Flops                   Time
           128x128 :    1863.67 MFlops   0.030000 sec
           256x256 :    8945.61 MFlops   0.050000 sec
           384x384 :   10063.81 MFlops   0.150000 sec
           512x512 :    7613.29 MFlops   0.470000 sec
           640x640 :   10277.59 MFlops   0.680000 sec
           768x768 :   11286.52 MFlops   1.070000 sec
           896x896 :   11911.28 MFlops   1.610000 sec
         1024x1024 :   10641.62 MFlops   2.690000 sec
         1152x1152 :   12426.35 MFlops   3.280000 sec
         1280x1280 :   12424.46 MFlops   4.500000 sec
         1408x1408 :   12677.39 MFlops   5.870000 sec
         1536x1536 :   11379.58 MFlops   8.490000 sec
         1664x1664 :   11822.37 MFlops  10.390000 sec
         1792x1792 :   12099.15 MFlops  12.680000 sec
         1920x1920 :   12672.70 MFlops  14.890000 sec
         2048x2048 :   11780.23 MFlops  19.440000 sec
From 128 To 2048 Step=128 Loops=1
      SIZE             Flops                   Time
           128x128 :     209.72 MFlops   0.020000 sec
           256x256 :        Inf MFlops   0.000000 sec
           384x384 :        Inf MFlops   0.000000 sec
           512x512 :   13421.77 MFlops   0.020000 sec
           640x640 :   10485.76 MFlops   0.050000 sec
           768x768 :   12942.42 MFlops   0.070000 sec
           896x896 :   11988.72 MFlops   0.120000 sec
         1024x1024 :   11302.55 MFlops   0.190000 sec
         1152x1152 :   10920.17 MFlops   0.280000 sec
         1280x1280 :   10754.63 MFlops   0.390000 sec
         1408x1408 :   10150.22 MFlops   0.550000 sec
         1536x1536 :    9663.68 MFlops   0.750000 sec
         1664x1664 :    9803.07 MFlops   0.940000 sec
         1792x1792 :   10185.11 MFlops   1.130000 sec
         1920x1920 :   10039.56 MFlops   1.410000 sec
         2048x2048 :   10475.53 MFlops   1.640000 sec
From 128 To 2048 Step=128 Loops=1
      SIZE             Flops                   Time
           128x128 :        Inf MFlops   0.000000 sec
           256x256 :    2236.96 MFlops   0.020000 sec
           384x384 :    5033.16 MFlops   0.030000 sec
           512x512 :    5113.06 MFlops   0.070000 sec
           640x640 :    4993.22 MFlops   0.140000 sec
           768x768 :    5252.00 MFlops   0.230000 sec
           896x896 :    5184.31 MFlops   0.370000 sec
         1024x1024 :    4936.74 MFlops   0.580000 sec
         1152x1152 :    5226.75 MFlops   0.780000 sec
         1280x1280 :    5038.20 MFlops   1.110000 sec
         1408x1408 :    5133.44 MFlops   1.450000 sec
         1536x1536 :    4831.84 MFlops   2.000000 sec
         1664x1664 :    4818.24 MFlops   2.550000 sec
         1792x1792 :    4721.71 MFlops   3.250000 sec
         1920x1920 :    4683.47 MFlops   4.030000 sec
         2048x2048 :    4379.83 MFlops   5.230000 sec

OpenBLAS 0.3.30.dev:

➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\deig.R && & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dgemm.R && & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dsolve.R
From 128 To 2048 Step=128 Loops=1
      SIZE             Flops                   Time
           128x128 :    2795.50 MFlops   0.020000 sec
           256x256 :    7454.68 MFlops   0.060000 sec
           384x384 :   13723.38 MFlops   0.110000 sec
           512x512 :    8132.37 MFlops   0.440000 sec
           640x640 :   19413.22 MFlops   0.360000 sec
           768x768 :   20468.77 MFlops   0.590000 sec
           896x896 :   27010.08 MFlops   0.710000 sec
         1024x1024 :   17039.26 MFlops   1.680000 sec
         1152x1152 :   38451.36 MFlops   1.060000 sec
         1280x1280 :   36071.01 MFlops   1.550000 sec
         1408x1408 :   41806.91 MFlops   1.780000 sec
         1536x1536 :   28249.30 MFlops   3.420000 sec
         1664x1664 :   46528.19 MFlops   2.640000 sec
         1792x1792 :   44858.84 MFlops   3.420000 sec
         1920x1920 :   51556.42 MFlops   3.660000 sec
         2048x2048 :   32073.90 MFlops   7.140000 sec
From 128 To 2048 Step=128 Loops=1
      SIZE             Flops                   Time
           128x128 :        Inf MFlops   0.000000 sec
           256x256 :        Inf MFlops   0.000000 sec
           384x384 :    5662.31 MFlops   0.020000 sec
           512x512 :        Inf MFlops   0.000000 sec
           640x640 :        Inf MFlops   0.000000 sec
           768x768 :   45298.48 MFlops   0.020000 sec
           896x896 :        Inf MFlops   0.000000 sec
         1024x1024 :        Inf MFlops   0.000000 sec
         1152x1152 :  152882.38 MFlops   0.020000 sec
         1280x1280 :  419430.40 MFlops   0.010000 sec
         1408x1408 :  279130.93 MFlops   0.020000 sec
         1536x1536 :  241591.91 MFlops   0.030000 sec
         1664x1664 :  230372.15 MFlops   0.040000 sec
         1792x1792 :  287729.25 MFlops   0.040000 sec
         1920x1920 :  283115.52 MFlops   0.050000 sec
         2048x2048 :  343597.38 MFlops   0.050000 sec
From 128 To 2048 Step=128 Loops=1
      SIZE             Flops                   Time
           128x128 :        Inf MFlops   0.000000 sec
           256x256 :    2236.96 MFlops   0.020000 sec
           384x384 :        Inf MFlops   0.000000 sec
           512x512 :   17895.70 MFlops   0.020000 sec
           640x640 :   34952.53 MFlops   0.020000 sec
           768x768 :  120795.96 MFlops   0.010000 sec
           896x896 :   95909.75 MFlops   0.020000 sec
         1024x1024 :   57266.23 MFlops   0.050000 sec
         1152x1152 :   58240.91 MFlops   0.070000 sec
         1280x1280 :  111848.11 MFlops   0.050000 sec
         1408x1408 :   93043.64 MFlops   0.080000 sec
         1536x1536 :   87851.60 MFlops   0.110000 sec
         1664x1664 :  122865.15 MFlops   0.100000 sec
         1792x1792 :  127879.67 MFlops   0.120000 sec
         1920x1920 :  134816.91 MFlops   0.140000 sec
         2048x2048 :   95443.72 MFlops   0.240000 sec

OpenBLAS NeoverseN1 kernel w/ oryon cache sizes:

➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\deig.R && & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dgemm.R && & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dsolve.R
From 128 To 2048 Step=128 Loops=1
      SIZE             Flops                   Time
           128x128 :        Inf MFlops   0.000000 sec
           256x256 :   14909.35 MFlops   0.030000 sec
           384x384 :    6038.29 MFlops   0.250000 sec
           512x512 :    7454.68 MFlops   0.480000 sec
           640x640 :   17471.90 MFlops   0.400000 sec
           768x768 :   22785.99 MFlops   0.530000 sec
           896x896 :   29964.30 MFlops   0.640000 sec
         1024x1024 :   16643.00 MFlops   1.720000 sec
         1152x1152 :   35442.12 MFlops   1.150000 sec
         1280x1280 :   33279.80 MFlops   1.680000 sec
         1408x1408 :   38758.49 MFlops   1.920000 sec
         1536x1536 :   26182.28 MFlops   3.690000 sec
         1664x1664 :   46705.11 MFlops   2.630000 sec
         1792x1792 :   44085.41 MFlops   3.480000 sec
         1920x1920 :   51415.94 MFlops   3.670000 sec
         2048x2048 :   31242.52 MFlops   7.330000 sec
From 128 To 2048 Step=128 Loops=1
      SIZE             Flops                   Time
           128x128 :        Inf MFlops   0.000000 sec
           256x256 :        Inf MFlops   0.000000 sec
           384x384 :   11324.62 MFlops   0.010000 sec
           512x512 :        Inf MFlops   0.000000 sec
           640x640 :   26214.40 MFlops   0.020000 sec
           768x768 :        Inf MFlops   0.000000 sec
           896x896 :        Inf MFlops   0.000000 sec
         1024x1024 :  107374.18 MFlops   0.020000 sec
         1152x1152 :  152882.38 MFlops   0.020000 sec
         1280x1280 :  209715.20 MFlops   0.020000 sec
         1408x1408 :  279130.93 MFlops   0.020000 sec
         1536x1536 :  241591.91 MFlops   0.030000 sec
         1664x1664 :  307162.86 MFlops   0.030000 sec
         1792x1792 :  383639.01 MFlops   0.030000 sec
         1920x1920 :  471859.20 MFlops   0.030000 sec
         2048x2048 :  245426.70 MFlops   0.070000 sec
From 128 To 2048 Step=128 Loops=1
      SIZE             Flops                   Time
           128x128 :        Inf MFlops   0.000000 sec
           256x256 :        Inf MFlops   0.000000 sec
           384x384 :        Inf MFlops   0.000000 sec
           512x512 :        Inf MFlops   0.000000 sec
           640x640 :        Inf MFlops   0.000000 sec
           768x768 :  120795.96 MFlops   0.010000 sec
           896x896 :   95909.75 MFlops   0.020000 sec
         1024x1024 :   57266.23 MFlops   0.050000 sec
         1152x1152 :   81537.27 MFlops   0.050000 sec
         1280x1280 :   93206.76 MFlops   0.060000 sec
         1408x1408 :  124058.19 MFlops   0.060000 sec
         1536x1536 :   80530.64 MFlops   0.120000 sec
         1664x1664 :  122865.15 MFlops   0.100000 sec
         1792x1792 :  109611.14 MFlops   0.140000 sec
         1920x1920 :  125829.12 MFlops   0.150000 sec
         2048x2048 :   88101.89 MFlops   0.260000 sec

@theAeon
Contributor Author

theAeon commented Nov 18, 2025

Gonna be completely honest here: I can't quite tell. Looks like there are some sizes for which it performs better and some for which it is worse.

Any recs for drilling down a bit deeper?

edit: just saw the openblas_loops setting, bear with

@theAeon
Contributor Author

theAeon commented Nov 18, 2025

0.3.30.dev

➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\deig.R
From 128 To 2048 Step=128 Loops=10
      SIZE             Flops                   Time
           128x128 :    6988.76 MFlops   0.080000 sec
           256x256 :    9516.61 MFlops   0.470000 sec
           384x384 :   12902.32 MFlops   1.170000 sec
           512x512 :    8580.92 MFlops   4.170000 sec
           640x640 :   21503.87 MFlops   3.250000 sec
           768x768 :   25159.53 MFlops   4.800000 sec
           896x896 :   31697.78 MFlops   6.050000 sec
         1024x1024 :   18209.90 MFlops  15.720000 sec
         1152x1152 :   40475.12 MFlops  10.070000 sec
         1280x1280 :   36187.75 MFlops  15.450000 sec
         1408x1408 :   39269.82 MFlops  18.950000 sec
         1536x1536 :   26629.71 MFlops  36.280000 sec
         1664x1664 :   46178.36 MFlops  26.600000 sec
         1792x1792 :   45282.54 MFlops  33.880000 sec
         1920x1920 :   51683.51 MFlops  36.510000 sec
         2048x2048 :   31657.13 MFlops  72.340000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dgemm.R
From 128 To 2048 Step=128 Loops=10
      SIZE             Flops                   Time
           128x128 :    1048.58 MFlops   0.040000 sec
           256x256 :        Inf MFlops   0.000000 sec
           384x384 :        Inf MFlops   0.000000 sec
           512x512 :  268435.46 MFlops   0.010000 sec
           640x640 :  524288.00 MFlops   0.010000 sec
           768x768 :  181193.93 MFlops   0.050000 sec
           896x896 :  239774.38 MFlops   0.060000 sec
         1024x1024 :  214748.36 MFlops   0.100000 sec
         1152x1152 :  277967.97 MFlops   0.110000 sec
         1280x1280 :  299593.14 MFlops   0.140000 sec
         1408x1408 :  293822.03 MFlops   0.190000 sec
         1536x1536 :  301989.89 MFlops   0.240000 sec
         1664x1664 :  307162.86 MFlops   0.300000 sec
         1792x1792 :  287729.25 MFlops   0.400000 sec
         1920x1920 :  307734.26 MFlops   0.460000 sec
         2048x2048 :  330382.10 MFlops   0.520000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dsolve.R
From 128 To 2048 Step=128 Loops=10
      SIZE             Flops                   Time
           128x128 :    5592.41 MFlops   0.010000 sec
           256x256 :   44739.24 MFlops   0.010000 sec
           384x384 :   50331.65 MFlops   0.030000 sec
           512x512 :   39768.22 MFlops   0.090000 sec
           640x640 :   63550.06 MFlops   0.110000 sec
           768x768 :   71056.44 MFlops   0.170000 sec
           896x896 :   79924.79 MFlops   0.240000 sec
         1024x1024 :   60921.52 MFlops   0.470000 sec
         1152x1152 :   86741.78 MFlops   0.470000 sec
         1280x1280 :   91678.78 MFlops   0.610000 sec
         1408x1408 :   99246.55 MFlops   0.750000 sec
         1536x1536 :   82595.52 MFlops   1.170000 sec
         1664x1664 :  110689.32 MFlops   1.110000 sec
         1792x1792 :  117141.68 MFlops   1.310000 sec
         1920x1920 :  124173.47 MFlops   1.520000 sec
         2048x2048 :   98311.13 MFlops   2.330000 sec

Oryon-modded cache size

➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\deig.R
From 128 To 2048 Step=128 Loops=10
      SIZE             Flops                   Time
           128x128 :    6988.76 MFlops   0.080000 sec
           256x256 :   10909.28 MFlops   0.410000 sec
           384x384 :   12792.98 MFlops   1.180000 sec
           512x512 :    8813.41 MFlops   4.060000 sec
           640x640 :   19967.88 MFlops   3.500000 sec
           768x768 :   23449.66 MFlops   5.150000 sec
           896x896 :   30058.24 MFlops   6.380000 sec
         1024x1024 :   18026.42 MFlops  15.880000 sec
         1152x1152 :   35878.91 MFlops  11.360000 sec
         1280x1280 :   34049.98 MFlops  16.420000 sec
         1408x1408 :   29910.09 MFlops  24.880000 sec
         1536x1536 :   28052.44 MFlops  34.440000 sec
         1664x1664 :   48455.40 MFlops  25.350000 sec
         1792x1792 :   47118.32 MFlops  32.560000 sec
         1920x1920 :   52989.75 MFlops  35.610000 sec
         2048x2048 :   32345.71 MFlops  70.800000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dgemm.R
From 128 To 2048 Step=128 Loops=10
      SIZE             Flops                   Time
           128x128 :        Inf MFlops   0.000000 sec
           256x256 :        Inf MFlops   0.000000 sec
           384x384 :   37748.74 MFlops   0.030000 sec
           512x512 :  134217.73 MFlops   0.020000 sec
           640x640 :  174762.67 MFlops   0.030000 sec
           768x768 :  301989.89 MFlops   0.030000 sec
           896x896 :  205520.90 MFlops   0.070000 sec
         1024x1024 :  238609.29 MFlops   0.090000 sec
         1152x1152 :  277967.97 MFlops   0.110000 sec
         1280x1280 :  322638.77 MFlops   0.130000 sec
         1408x1408 :  310145.48 MFlops   0.180000 sec
         1536x1536 :  315119.88 MFlops   0.230000 sec
         1664x1664 :  317754.69 MFlops   0.290000 sec
         1792x1792 :  280711.47 MFlops   0.410000 sec
         1920x1920 :  314572.80 MFlops   0.450000 sec
         2048x2048 :  272696.34 MFlops   0.630000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dsolve.R
From 128 To 2048 Step=128 Loops=10
      SIZE             Flops                   Time
           128x128 :        Inf MFlops   0.000000 sec
           256x256 :   22369.62 MFlops   0.020000 sec
           384x384 :   50331.65 MFlops   0.030000 sec
           512x512 :   35791.39 MFlops   0.100000 sec
           640x640 :   63550.06 MFlops   0.110000 sec
           768x768 :   71056.44 MFlops   0.170000 sec
           896x896 :   79924.79 MFlops   0.240000 sec
         1024x1024 :   60921.52 MFlops   0.470000 sec
         1152x1152 :   90596.97 MFlops   0.450000 sec
         1280x1280 :   94786.53 MFlops   0.590000 sec
         1408x1408 :  103381.83 MFlops   0.720000 sec
         1536x1536 :   84031.97 MFlops   1.150000 sec
         1664x1664 :  120456.02 MFlops   1.020000 sec
         1792x1792 :  118042.77 MFlops   1.300000 sec
         1920x1920 :  123361.88 MFlops   1.530000 sec
         2048x2048 :   93495.89 MFlops   2.450000 sec

@theAeon
Contributor Author

theAeon commented Nov 18, 2025

I think there's definitely something here, judging by the decent improvement at certain matrix sizes, but this change is not it, judging by the degraded performance at other sizes.

May be worth having it as a full clone of Neoverse N1 (i.e., removing the cache changes I made here) pending further investigation.

@theAeon
Contributor Author

theAeon commented Nov 18, 2025

.....I had an idea.

This is an 8-wide chip, neoverse is 5-wide.

I wonder what happens if I run the VORTEX target (which is 7-wide and should be otherwise compatible).

Because I get the feeling the optimization here isn't so much in the cache definitions as much as it's in the kernels.

@theAeon
Contributor Author

theAeon commented Nov 18, 2025

Scratch that, it would do nothing, as there's no difference.

@martin-frbg
Collaborator

Yes, right now VORTEX is also just ARMV8 with a bunch of NEOVERSEN1 kernels on top. Without dedicated kernels, I think the easiest fix would be to put the proper L1 and L2 cache sizes in cpuid_arm64.c when we're on Windows, to guide the block sizes for GEMM etc.
Unless the cost tables etc. requested by -mcpu=oryon have a dramatic influence on compilation - but I don't expect that, given that it should mainly affect the generic C parts of the (setup and interfacing) code.
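
To illustrate why the detected cache sizes matter for GEMM blocking, here is a minimal Python sketch of one generic heuristic: pick the largest square block such that three tiles of doubles fit in half of a cache level. This is not how OpenBLAS derives its block parameters; `square_block_size`, the "three tiles in half the cache" rule, and the two cache sizes are all hypothetical placeholders chosen for illustration, not values taken from this PR.

```python
import math


def square_block_size(cache_bytes, elem_bytes=8, tiles=3, fraction=0.5, multiple=8):
    """Largest block edge, rounded down to `multiple`, such that `tiles`
    square blocks of `elem_bytes`-sized elements fit in `fraction` of the cache.

    Hypothetical heuristic for illustration only, not OpenBLAS's actual logic.
    """
    usable = int(cache_bytes * fraction)             # bytes we allow the tiles to use
    elems_per_tile = usable // (tiles * elem_bytes)  # elements available per tile
    edge = math.isqrt(elems_per_tile)                # edge of a square tile
    return max(multiple, edge - edge % multiple)     # align to the kernel unroll


KIB = 1024
MIB = 1024 * KIB

small_l2 = 512 * KIB  # placeholder: a modest per-core L2
large_l2 = 12 * MIB   # placeholder: a large shared L2

print(square_block_size(small_l2))
print(square_block_size(large_l2))
```

With these placeholder sizes the larger cache allows a block edge several times longer, which is the kind of difference that correct cache detection would feed into the blocking choice.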

@theAeon
Contributor Author

theAeon commented Nov 18, 2025

Yeah, and even if there is optimization to be had here (and there almost certainly is), I don't even know that the cache sizes are an improvement.

@martin-frbg
Collaborator

Probably needs larger loops to get more stable benchmark results. I do have an Oryon system on loan from Qualcomm, it's just that I'm away from it at the moment but I'll try to run some experiments myself when I have more time for OpenBLAS again - hopefully soon
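
A rough Python sketch of why more loops should stabilize the numbers. It assumes the dgemm script counts 2·n³ flops per product and that the timer ticks in 0.01 s steps. Both are inferred from the figures posted above (the formula reproduces the reported values, and every time is a multiple of 0.01), not read from the benchmark source. The helper names are hypothetical.

```python
def dgemm_mflops(n, loops, seconds):
    """MFlops for `loops` products of two n-by-n matrices (2*n**3 flops each)."""
    if seconds == 0.0:
        return float("inf")  # the timer did not advance: shows up as "Inf MFlops"
    return 2.0 * n**3 * loops / seconds / 1e6


def tick_uncertainty(seconds, tick=0.01):
    """Relative error caused by one timer tick (the 0.01 s tick is an assumption)."""
    return tick / seconds


print(round(dgemm_mflops(128, 100, 0.02), 2))  # 128x128, Loops=100, 0.02 s
print(round(dgemm_mflops(512, 1, 0.02), 2))    # 512x512, Loops=1, 0.02 s
print(dgemm_mflops(256, 1, 0.0))               # zero elapsed time
print(f"{tick_uncertainty(0.02):.0%} vs {tick_uncertainty(0.55):.1%}")
```

Under these assumptions, a single 512x512 product measured at 0.02 s carries an uncertainty of about half its value, while a run long enough to take 0.55 s is good to a couple of percent.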

@theAeon
Contributor Author

theAeon commented Nov 18, 2025

> Probably needs larger loops to get more stable benchmark results. I do have an Oryon system on loan from Qualcomm, it's just that I'm away from it at the moment but I'll try to run some experiments myself when I have more time for OpenBLAS again - hopefully soon

Oh, I can absolutely just run them on my laptop. How large are we talking?

@martin-frbg
Collaborator

I'd guess a hundred instead of ten should help

@theAeon
Contributor Author

theAeon commented Nov 18, 2025

> I'd guess a hundred instead of ten should help

Will report back. With bonus ArmPL for comparison.

@theAeon
Contributor Author

theAeon commented Nov 19, 2025

So, it turns out the issue was mostly that running BLAS on 12 cores well exceeds the heat capacity of my laptop. Fixed that one. Anyway:

Seems that there's a thousand to a few thousand megaflops difference in favor of the cache-tuned build at all sizes, which is more what I would have expected. Funnily enough, ArmPL seems to be on par with the n1 build and similarly behind the tuned build. Guess that does make sense, they did optimize for their own cores. Do we know if QC has an optimized implementation?

N1

➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\deig.R
From 128 To 2048 Step=128 Loops=100
      SIZE             Flops                   Time
           128x128 :   10165.47 MFlops   0.550000 sec
           256x256 :   17540.41 MFlops   2.550000 sec
           384x384 :   25328.39 MFlops   5.960000 sec
           512x512 :   13662.64 MFlops  26.190000 sec
           640x640 :   31047.35 MFlops  22.510000 sec
           768x768 :   33140.99 MFlops  36.440000 sec
           896x896 :   40733.12 MFlops  47.080000 sec
         1024x1024 :   27487.96 MFlops 104.140000 sec
         1152x1152 :   46153.82 MFlops  88.310000 sec
         1280x1280 :   45388.92 MFlops 123.180000 sec
         1408x1408 :   48692.21 MFlops 152.830000 sec
         1536x1536 :   37665.73 MFlops 256.500000 sec
         1664x1664 :   51015.21 MFlops 240.780000 sec
         1792x1792 :   49826.97 MFlops 307.900000 sec
         1920x1920 :   51755.81 MFlops 364.590000 sec
         2048x2048 :   38978.70 MFlops 587.520000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dgemm.R
From 128 To 2048 Step=128 Loops=100
      SIZE             Flops                   Time
           128x128 :   20971.52 MFlops   0.020000 sec
           256x256 :   41943.04 MFlops   0.080000 sec
           384x384 :   47185.92 MFlops   0.240000 sec
           512x512 :   48806.45 MFlops   0.550000 sec
           640x640 :   49932.19 MFlops   1.050000 sec
           768x768 :   52067.22 MFlops   1.740000 sec
           896x896 :   51749.87 MFlops   2.780000 sec
         1024x1024 :   50174.85 MFlops   4.280000 sec
         1152x1152 :   50539.63 MFlops   6.050000 sec
         1280x1280 :   50655.85 MFlops   8.280000 sec
         1408x1408 :   50613.04 MFlops  11.030000 sec
         1536x1536 :   50719.09 MFlops  14.290000 sec
         1664x1664 :   50967.29 MFlops  18.080000 sec
         1792x1792 :   50835.56 MFlops  22.640000 sec
         1920x1920 :   51159.29 MFlops  27.670000 sec
         2048x2048 :   50858.11 MFlops  33.780000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dsolve.R
From 128 To 2048 Step=128 Loops=100
      SIZE             Flops                   Time
           128x128 :   11184.81 MFlops   0.050000 sec
           256x256 :   26317.20 MFlops   0.170000 sec
           384x384 :   30198.99 MFlops   0.500000 sec
           512x512 :   25205.21 MFlops   1.420000 sec
           640x640 :   35848.75 MFlops   1.950000 sec
           768x768 :   36940.66 MFlops   3.270000 sec
           896x896 :   38595.47 MFlops   4.970000 sec
         1024x1024 :   32135.93 MFlops   8.910000 sec
         1152x1152 :   39542.81 MFlops  10.310000 sec
         1280x1280 :   39550.25 MFlops  14.140000 sec
         1408x1408 :   40741.61 MFlops  18.270000 sec
         1536x1536 :   36315.96 MFlops  26.610000 sec
         1664x1664 :   41663.32 MFlops  29.490000 sec
         1792x1792 :   41373.85 MFlops  37.090000 sec
         1920x1920 :   42490.70 MFlops  44.420000 sec
         2048x2048 :   38126.65 MFlops  60.080000 sec

Oryon

➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\deig.R
From 128 To 2048 Step=128 Loops=100
      SIZE             Flops                   Time
           128x128 :   10751.94 MFlops   0.520000 sec
           256x256 :   19617.57 MFlops   2.280000 sec
           384x384 :   27101.83 MFlops   5.570000 sec
           512x512 :   14878.36 MFlops  24.050000 sec
           640x640 :   31019.79 MFlops  22.530000 sec
           768x768 :   32568.97 MFlops  37.080000 sec
           896x896 :   40672.65 MFlops  47.150000 sec
         1024x1024 :   28059.16 MFlops 102.020000 sec
         1152x1152 :   45749.74 MFlops  89.090000 sec
         1280x1280 :   44817.69 MFlops 124.750000 sec
         1408x1408 :   48803.98 MFlops 152.480000 sec
         1536x1536 :   37978.15 MFlops 254.390000 sec
         1664x1664 :   50716.11 MFlops 242.200000 sec
         1792x1792 :   49471.88 MFlops 310.110000 sec
         1920x1920 :   52463.78 MFlops 359.670000 sec
         2048x2048 :   39061.81 MFlops 586.270000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dgemm.R
From 128 To 2048 Step=128 Loops=100
      SIZE             Flops                   Time
           128x128 :   20971.52 MFlops   0.020000 sec
           256x256 :   41943.04 MFlops   0.080000 sec
           384x384 :   53926.77 MFlops   0.210000 sec
           512x512 :   45497.53 MFlops   0.590000 sec
           640x640 :   48998.88 MFlops   1.070000 sec
           768x768 :   53607.67 MFlops   1.690000 sec
           896x896 :   54084.45 MFlops   2.660000 sec
         1024x1024 :   52377.65 MFlops   4.100000 sec
         1152x1152 :   52627.33 MFlops   5.810000 sec
         1280x1280 :   52626.15 MFlops   7.970000 sec
         1408x1408 :   52369.78 MFlops  10.660000 sec
         1536x1536 :   51769.70 MFlops  14.000000 sec
         1664x1664 :   51914.85 MFlops  17.750000 sec
         1792x1792 :   51518.22 MFlops  22.340000 sec
         1920x1920 :   51569.31 MFlops  27.450000 sec
         2048x2048 :   50410.41 MFlops  34.080000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dsolve.R
From 128 To 2048 Step=128 Loops=100
      SIZE             Flops                   Time
           128x128 :   18641.35 MFlops   0.030000 sec
           256x256 :   26317.20 MFlops   0.170000 sec
           384x384 :   32126.58 MFlops   0.470000 sec
           512x512 :   25935.79 MFlops   1.380000 sec
           640x640 :   38199.49 MFlops   1.830000 sec
           768x768 :   39092.54 MFlops   3.090000 sec
           896x896 :   40899.68 MFlops   4.690000 sec
         1024x1024 :   33255.65 MFlops   8.610000 sec
         1152x1152 :   40768.63 MFlops  10.000000 sec
         1280x1280 :   40554.06 MFlops  13.790000 sec
         1408x1408 :   41583.75 MFlops  17.900000 sec
         1536x1536 :   37011.40 MFlops  26.110000 sec
         1664x1664 :   42178.22 MFlops  29.130000 sec
         1792x1792 :   41859.14 MFlops  36.660000 sec
         1920x1920 :   42385.74 MFlops  44.530000 sec
         2048x2048 :   38107.62 MFlops  60.110000 sec

ArmPL for comparison

➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\deig.R
From 128 To 2048 Step=128 Loops=100
      SIZE             Flops                   Time
           128x128 :   10549.07 MFlops   0.530000 sec
           256x256 :   19114.55 MFlops   2.340000 sec
           384x384 :   26253.43 MFlops   5.750000 sec
           512x512 :   14971.73 MFlops  23.900000 sec
           640x640 :   31102.62 MFlops  22.470000 sec
           768x768 :   29455.06 MFlops  41.000000 sec
           896x896 :   35925.73 MFlops  53.380000 sec
         1024x1024 :   24104.04 MFlops 118.760000 sec
         1152x1152 :   40918.02 MFlops  99.610000 sec
         1280x1280 :   42426.83 MFlops 131.780000 sec
         1408x1408 :   48951.66 MFlops 152.020000 sec
         1536x1536 :   40124.85 MFlops 240.780000 sec
         1664x1664 :   55783.12 MFlops 220.200000 sec
         1792x1792 :   53676.17 MFlops 285.820000 sec
         1920x1920 :   51701.92 MFlops 364.970000 sec
         2048x2048 :   38894.62 MFlops 588.790000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dgemm.R
From 128 To 2048 Step=128 Loops=100
      SIZE             Flops                   Time
           128x128 :        Inf MFlops   0.000000 sec
           256x256 :   41943.04 MFlops   0.080000 sec
           384x384 :   47185.92 MFlops   0.240000 sec
           512x512 :   51622.20 MFlops   0.520000 sec
           640x640 :   54050.31 MFlops   0.970000 sec
           768x768 :   54576.49 MFlops   1.660000 sec
           896x896 :   52697.67 MFlops   2.730000 sec
         1024x1024 :   51871.59 MFlops   4.140000 sec
         1152x1152 :   52900.48 MFlops   5.780000 sec
         1280x1280 :   53294.84 MFlops   7.870000 sec
         1408x1408 :   54624.45 MFlops  10.220000 sec
         1536x1536 :   53449.54 MFlops  13.560000 sec
         1664x1664 :   55477.94 MFlops  16.610000 sec
         1792x1792 :   55279.40 MFlops  20.820000 sec
         1920x1920 :   55951.68 MFlops  25.300000 sec
         2048x2048 :   54730.39 MFlops  31.390000 sec
➜ & "C:\Program Files\R-aarch64\R-4.5.2\bin\Rscript.exe" .\benchmark\scripts\R\dsolve.R
From 128 To 2048 Step=128 Loops=100
      SIZE             Flops                   Time
           128x128 :   11184.81 MFlops   0.050000 sec
           256x256 :   23546.97 MFlops   0.190000 sec
           384x384 :   32126.58 MFlops   0.470000 sec
           512x512 :   23091.22 MFlops   1.550000 sec
           640x640 :   38199.49 MFlops   1.830000 sec
           768x768 :   36828.04 MFlops   3.280000 sec
           896x896 :   41973.63 MFlops   4.570000 sec
         1024x1024 :   28519.04 MFlops  10.040000 sec
         1152x1152 :   41856.91 MFlops   9.740000 sec
         1280x1280 :   39217.43 MFlops  14.260000 sec
         1408x1408 :   42803.29 MFlops  17.390000 sec
         1536x1536 :   34686.56 MFlops  27.860000 sec
         1664x1664 :   43050.16 MFlops  28.540000 sec
         1792x1792 :   40921.49 MFlops  37.500000 sec
         1920x1920 :   43792.04 MFlops  43.100000 sec
         2048x2048 :   32144.95 MFlops  71.260000 sec

@martin-frbg
Collaborator

Hmm. I'm still not that convinced - looks like there is still a lot of noise in the data, and where it looks like there is an improvement from using the correct cache sizes, it is around 2 percent at most ?
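
One way to put a number on the difference is to compute the relative change per matrix size. The Python sketch below uses three rows copied from the Loops=100 dgemm runs posted above (N1 build vs. the cache-tuned "Oryon" build). The row selection and the `percent_change` helper are illustrative choices; a fuller comparison would cover every size and repeat the runs.

```python
# (matrix size, N1 MFlops, Oryon-cache MFlops), copied from the
# Loops=100 dgemm output earlier in this thread.
rows = [
    (1024, 50174.85, 52377.65),
    (1536, 50719.09, 51769.70),
    (2048, 50858.11, 50410.41),
]


def percent_change(baseline, candidate):
    """Relative change of `candidate` over `baseline`, in percent."""
    return (candidate - baseline) / baseline * 100.0


for size, n1, oryon in rows:
    print(f"{size}x{size}: {percent_change(n1, oryon):+.2f}%")
```

A positive value means the cache-tuned build was faster at that size, a negative value means it was slower.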

@theAeon
Contributor Author

theAeon commented Nov 19, 2025

> Hmm. I'm still not that convinced - looks like there is still a lot of noise in the data, and where it looks like there is an improvement from using the correct cache sizes, it is around 2 percent at most ?

By noise do you mean the fluctuating MFlops as size increases? That's actually fairly reproducible.

And yes, around 2%. I think the bottleneck here isn't so much cache locality as much as it is the difference in execution pipeline size (5-wide vs 8-wide).

edit: looking at the block diagrams it appears the correct way of looking at it is 2 NEON/FP units on the N1 and 4 on oryon

@abhishek-iitmadras
Contributor

Hi @theAeon

Out of curiosity, are you going to add, modify, or optimize any kernels for this arch in the future?

@theAeon
Contributor Author

theAeon commented Nov 20, 2025

Unfortunately this is not exactly my strong suit, so while I will take a look, I am...not expecting to, no.

@martin-frbg
Collaborator

I can add a small hack to the cpu detection code to put the correct cache sizes in the config file, as that bit of performance gain is low-hanging (if fairly small) fruit. But frankly I expect the upcoming X2 Elite cpu with its SVE+SME capability to be a markedly more attractive platform for any kind of numerical workload, and it should be quite adequately covered by the ARMV9SME target already.

@theAeon
Contributor Author

theAeon commented Nov 20, 2025

That sounds like the way to go.
