
Accelerate SVE128 SBGEMM/BGEMM#5667

Merged
martin-frbg merged 1 commit into OpenMathLib:develop from fadara01:accelerate_sve128_sbgemm
Mar 6, 2026
Conversation

@fadara01
Contributor

@fadara01 fadara01 commented Mar 5, 2026

This accelerates SBGEMM/BGEMM by extending the existing 8x4 kernel to an 8x8 kernel (unrolling N by 8).

Not sure if it's a good idea to delete the previous 8x4 kernel?

Here are the speedups on a single Neoverse-V2 core (SVE128) compared to the previous state:

  M=N=K=64: SBGEMM 1.164x (16.42%), BGEMM 1.133x (13.30%)
  M=N=K=128: SBGEMM 1.220x (22.02%), BGEMM 1.186x (18.56%)
  M=N=K=256: SBGEMM 1.241x (24.08%), BGEMM 1.235x (23.54%)
  M=N=K=512: SBGEMM 1.240x (23.95%), BGEMM 1.227x (22.75%)
  M=N=K=1024: SBGEMM 1.251x (25.11%), BGEMM 1.232x (23.23%)
  M=N=K=2048: SBGEMM 1.235x (23.47%), BGEMM 1.246x (24.64%)

and here are the speedups for the same benchmark on Neoverse-N2:

  M=N=K=64: SBGEMM 1.019x (1.93%), BGEMM 1.055x (5.49%)
  M=N=K=128: SBGEMM 1.030x (3.02%), BGEMM 1.053x (5.31%)
  M=N=K=256: SBGEMM 1.129x (12.90%), BGEMM 1.121x (12.06%)
  M=N=K=512: SBGEMM 1.143x (14.28%), BGEMM 1.132x (13.25%)
  M=N=K=1024: SBGEMM 1.144x (14.41%), BGEMM 1.137x (13.69%)

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
@fadara01
Contributor Author

fadara01 commented Mar 5, 2026

Hi @martin-frbg - could you please have a look?

(This currently copies the 8x4 kernel and extends it to 8x8 - please let me know if it's a good idea to remove the 8x4 kernel)

@martin-frbg
Collaborator

Looks good to me, thanks. And I still think it makes sense to leave older kernels around, even if nothing uses them at the moment: they might still turn out to be adequate (or at least provide inspiration for kernels) for other CPUs or data types that do not benefit from the latest enhancement or unrolling pattern.

@martin-frbg martin-frbg added this to the 0.3.32 milestone Mar 5, 2026
@fadara01
Contributor Author

fadara01 commented Mar 5, 2026

Thanks for reviewing!
It would be great if we could get this merged before the next OpenBLAS release, so we can pick it up in PyTorch.

@martin-frbg martin-frbg merged commit 3726265 into OpenMathLib:develop Mar 6, 2026
100 of 102 checks passed