@hideaki-motoki
Contributor

Resolves #5553.
The [SD]GEMM_DEFAULT_[PQR] parameters have been tuned to improve [SD]GEMM performance in a multi-process evaluation that uses all cores of the A64FX. The left and center figures show the resulting improvement. In this pull request, performance is compared between OpenBLAS v0.3.30 and the modified version (labeled "update"). I also confirmed that performance improves in the single-process evaluation, shown in the right figure.
[Figure 1]
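
As background on what these parameters control, here is a minimal sketch of a cache-blocked matrix multiply, written in Python for brevity. It assumes the common convention that P, Q, and R are the block sizes along the M, K, and N dimensions respectively. The function name `blocked_gemm` and the loop order are illustrative only and are not OpenBLAS's actual kernel code.

```python
def blocked_gemm(a, b, m, n, k, p, q, r):
    """Return C = A * B for row-major lists of lists, using loop blocking.

    p, q, r are hypothetical block sizes along M, K, and N (assumed to
    play the role of the GEMM P/Q/R parameters discussed above).
    """
    c = [[0.0] * n for _ in range(m)]
    for js in range(0, n, r):              # panel of columns of B and C (width R)
        j_end = min(js + r, n)
        for ls in range(0, k, q):          # slice of the inner dimension (depth Q)
            l_end = min(ls + q, k)
            for i0 in range(0, m, p):      # panel of rows of A and C (height P)
                i_end = min(i0 + p, m)
                # Multiply one P x Q block of A by one Q x R block of B.
                for i in range(i0, i_end):
                    row_a, row_c = a[i], c[i]
                    for l in range(ls, l_end):
                        a_il, row_b = row_a[l], b[l]
                        for j in range(js, j_end):
                            row_c[j] += a_il * row_b[j]
    return c
```

Any choice of block sizes gives the same result. Only the memory access pattern changes, which is why tuning them affects performance without affecting correctness.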
While performance improves in most Level 3 BLAS kernels, it degrades in the triangular-matrix kernels (TRMM and TRSM), for the same reason described in issue #4742.
[Figure 2]
The figures above show the performance change in GEMM, TRMM, and TRSM.
To quantify the performance degradation in TRMM and TRSM, I evaluated the performance ratio relative to v0.3.30 for sizes up to 5,000 and summarized the results in the table below.

| kernel | update/v0.3.30 | (update/v0.3.30) - 1 |
| --- | --- | --- |
| dgemm.nn | 1.0846 | +0.0846 |
| dtrmm.n | 0.9268 | -0.0732 |
| dtrsm.n | 0.9398 | -0.0602 |

This indicates that, although TRMM and TRSM lose some performance, there are benefits to fine-tuning the [SD]GEMM_DEFAULT_[PQR] parameters for A64FX.
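
To read the ratios as percentages, the short sketch below converts each update/v0.3.30 ratio from the table into a relative change. The dictionary and the helper name `relative_change_percent` are illustrative; only the three ratio values come from the table.

```python
# Performance ratios (update / v0.3.30) taken from the table above.
ratios = {"dgemm.nn": 1.0846, "dtrmm.n": 0.9268, "dtrsm.n": 0.9398}


def relative_change_percent(ratio):
    """Convert a performance ratio into a signed percentage change."""
    return (ratio - 1.0) * 100.0


for kernel, ratio in ratios.items():
    print(f"{kernel}: {relative_change_percent(ratio):+.2f}%")
# dgemm.nn: +8.46%
# dtrmm.n: -7.32%
# dtrsm.n: -6.02%
```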

@abhishek-iitmadras
Contributor

abhishek-iitmadras commented Nov 29, 2025

Hi @hideaki-motoki -san

Overall LGTM.

For the performance comparisons between v0.3.30 and this updated version, could you clarify the build configuration?
In particular, was OpenBLAS built with USE_OPENMP=1 or USE_OPENMP=0?
This will help interpret the multi-process results on A64FX.

@hideaki-motoki
Contributor Author

hideaki-motoki commented Dec 1, 2025

Hi, @abhishek-iitmadras -san.
Thank you for reviewing the results.

> For the performance comparisons between v0.3.30 and this updated version, could you clarify the build configuration?
> In particular, was OpenBLAS built with USE_OPENMP=1 or USE_OPENMP=0?

It was built with USE_OPENMP=1 as follows:
make DYNAMIC_ARCH=1 USE_THREAD=1 USE_OPENMP=1 NUM_THREADS=256

Linked issue: Setting optimized [SD]GEMM_DEFAULT_[PQR] parameters for A64FX