
Accelerate SVE128 SBGEMM/BGEMM#5667

Merged
martin-frbg merged 1 commit into OpenMathLib:develop from fadara01:accelerate_sve128_sbgemm
Mar 6, 2026
Conversation

@fadara01
Contributor

@fadara01 fadara01 commented Mar 5, 2026

This accelerates SBGEMM/BGEMM by extending the existing 8x4 kernel to an 8x8 kernel (unrolling N by 8).

Not sure if it's a good idea to delete the previous 8x4 kernel?

Here are the speedups on a single Neoverse-V2 core (SVE128) compared to the previous state:

  M=N=K=64: SBGEMM 1.164x (16.42%), BGEMM 1.133x (13.30%)
  M=N=K=128: SBGEMM 1.220x (22.02%), BGEMM 1.186x (18.56%)
  M=N=K=256: SBGEMM 1.241x (24.08%), BGEMM 1.235x (23.54%)
  M=N=K=512: SBGEMM 1.240x (23.95%), BGEMM 1.227x (22.75%)
  M=N=K=1024: SBGEMM 1.251x (25.11%), BGEMM 1.232x (23.23%)
  M=N=K=2048: SBGEMM 1.235x (23.47%), BGEMM 1.246x (24.64%)

and here are the speedups for the same benchmark on Neoverse-N2:

  M=N=K=64: SBGEMM 1.019x (1.93%), BGEMM 1.055x (5.49%)
  M=N=K=128: SBGEMM 1.030x (3.02%), BGEMM 1.053x (5.31%)
  M=N=K=256: SBGEMM 1.129x (12.90%), BGEMM 1.121x (12.06%)
  M=N=K=512: SBGEMM 1.143x (14.28%), BGEMM 1.132x (13.25%)
  M=N=K=1024: SBGEMM 1.144x (14.41%), BGEMM 1.137x (13.69%)

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
@fadara01
Contributor Author

fadara01 commented Mar 5, 2026

Hi @martin-frbg - could you please have a look?

(This currently copies the 8x4 kernel and extends it to 8x8 - please let me know if it's a good idea to remove the 8x4 kernel)

@martin-frbg
Collaborator

Looks good to me, thanks. And I still think it makes sense to leave older kernels around, even if nothing uses them at the moment: they might still turn out to be adequate (or at least provide inspiration for kernels) for other CPUs or data types that do not benefit from the latest enhancement or unrolling pattern.

@martin-frbg martin-frbg added this to the 0.3.32 milestone Mar 5, 2026
@fadara01
Contributor Author

fadara01 commented Mar 5, 2026

Thanks for reviewing!
It would be great if we could get this merged before the next OpenBLAS release, so we can pick it up in PyTorch.

@martin-frbg martin-frbg merged commit 3726265 into OpenMathLib:develop Mar 6, 2026
100 of 102 checks passed