Setting optimized [SD]GEMM_DEFAULT_[PQR] parameters for A64FX
#5554
Resolves #5553.


The `[SD]GEMM_DEFAULT_[PQR]` parameters have been tuned to improve `[SD]GEMM` performance in a multi-process evaluation that uses all cores of `A64FX`. The resulting `[SD]GEMM` improvement is shown in the left and center figures. In this pull request, performance is compared between OpenBLAS v0.3.30 and the modified version (labeled "update"). I also confirmed that performance improves in a single-process evaluation, as shown in the right figure.

While performance improves in most Level 3 BLAS kernels, it degrades in the kernels for triangular matrices (`TRMM` and `TRSM`), for the same reason described in Issue #4742.

The figures above show the performance change in `GEMM`, `TRMM`, and `TRSM`.

To understand the extent of the performance degradation in `TRMM` and `TRSM`, I evaluated the performance ratio relative to v0.3.30 for sizes up to 5,000 and summarized the results in the table below.

This indicates that, although part of the `TRMM` and `TRSM` performance decreases, there are benefits to fine-tuning the `[SD]GEMM_DEFAULT_[PQR]` parameters for `A64FX`.