Multi-threaded DGEMM becomes less efficient on many-core CPUs #4644

yamazakimitsufumi · 2024-04-15T10:29:54Z

OpenBLAS DGEMM is highly efficient on a single thread, reaching over 90% of peak performance with 1 thread on Graviton3E, but efficiency drops to about 73% when DGEMM runs with 64 threads.
It is well known that sustaining high efficiency in multi-threaded execution is becoming difficult on recent many-core CPUs, even when high-performance kernels are available for single-threaded execution.

I am considering adjusting the shape of the submatrix handled by each thread by modifying the 2D thread distribution.
I would appreciate any suggestions you may have.
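
To make the idea of a 2D thread distribution concrete, here is a minimal sketch of one way to choose a thread grid for an M x N output matrix. It is not the OpenBLAS implementation; the function names `thread_grid` and `block_shape` and the cost model are assumptions for illustration only. The cost model rests on the observation that a thread computing a `block_m` x `block_n` block of C with inner dimension K reads about `K * (block_m + block_n)` elements of A and B, so for a fixed block area a smaller `block_m + block_n` means more arithmetic per element loaded.

```python
import math


def thread_grid(m, n, threads):
    """Return (rows, cols) with rows * cols == threads.

    Hypothetical heuristic: among all factorizations of the thread count,
    pick the one that minimizes block_m + block_n, i.e. the per-thread
    block of C that is closest to square.
    """
    best = None
    for rows in range(1, threads + 1):
        if threads % rows:
            continue  # only exact factorizations use every thread
        cols = threads // rows
        block_m = math.ceil(m / rows)
        block_n = math.ceil(n / cols)
        cost = block_m + block_n  # proportional to A and B traffic per thread
        if best is None or cost < best[0]:
            best = (cost, rows, cols)
    return best[1], best[2]


def block_shape(m, n, grid):
    """Return the (block_m, block_n) size of the largest per-thread block."""
    rows, cols = grid
    return math.ceil(m / rows), math.ceil(n / cols)


if __name__ == "__main__":
    for m, n in [(4096, 4096), (16384, 1024)]:
        grid = thread_grid(m, n, 64)
        print(m, n, grid, block_shape(m, n, grid))
```

For a square matrix and 64 threads this heuristic selects an 8 x 8 grid, while a tall matrix shifts more threads to the row dimension. A real implementation would also need to account for kernel unroll factors and cache blocking, which this sketch ignores.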

brada4 · 2024-04-15T23:55:17Z

8 MB matrices are probably an order of magnitude smaller than the caches.
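
As a rough illustration of the sizes involved, the sketch below computes the memory footprint of a double-precision matrix and the share of work each thread receives. It assumes square 1024 x 1024 matrices, which is one shape that gives 8 MiB per matrix; the thread does not state the actual dimensions, and the helper names are hypothetical.

```python
def matrix_bytes(rows, cols, elem_size=8):
    """Footprint in bytes of a dense matrix (8 bytes per double)."""
    return rows * cols * elem_size


def gemm_flops(m, n, k):
    """Floating-point operations for C = A * B: one multiply and one add per term."""
    return 2 * m * n * k


if __name__ == "__main__":
    n = 1024        # assumed dimension, not taken from the thread
    threads = 64
    print("bytes per matrix:", matrix_bytes(n, n))
    print("total flops:", gemm_flops(n, n, n))
    print("flops per thread:", gemm_flops(n, n, n) // threads)
```

With so little work per thread at this size, fixed costs such as thread startup and synchronization take up a larger fraction of the run time than they do in the single-threaded case.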

yamazakimitsufumi mentioned this issue on Apr 18, 2024 in #4655, "Expanding the scope of 2D thread distribution to improve multi-threaded DGEMM performance" (merged).
