2D thread distribution for multi-threaded GEMMs #1320

timmoon10 · 2017-10-04T20:12:46Z

This is a refinement on #1316 that attempts to maximize thread utilization when a matrix dimension is small. In particular, all threads will be active if M is small and N is large (or vice versa).
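As a rough illustration of the idea (not the code in this patch), a 2D distribution can be sketched as choosing a grid of `tm x tn` threads that keeps as many threads busy as possible. Everything named here is hypothetical: `choose_grid`, `grid_t`, and the `min_m` / `min_n` parameters (the smallest number of rows or columns worth giving one thread) are assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical helper, not taken from OpenBLAS: a grid of tm x tn threads. */
typedef struct { int tm; int tn; } grid_t;

/* Ceiling division for positive operands. */
static long ceil_div(long a, long b) { return (a + b - 1) / b; }

/*
 * Pick the thread grid with the largest tm * tn such that
 *   tm * tn <= nthreads,
 *   tm <= ceil(m / min_m), and tn <= ceil(n / min_n).
 * min_m and min_n are assumed tuning parameters for this sketch.
 */
static grid_t choose_grid(long m, long n, int nthreads, long min_m, long min_n) {
    assert(m >= 1 && n >= 1 && nthreads >= 1 && min_m >= 1 && min_n >= 1);
    long max_tm = ceil_div(m, min_m);
    long max_tn = ceil_div(n, min_n);
    grid_t best = {1, 1};
    for (int tm = 1; tm <= nthreads && tm <= max_tm; tm++) {
        long tn = nthreads / tm;        /* threads left for the N dimension */
        if (tn > max_tn) tn = max_tn;   /* not enough columns to split further */
        if ((long)tm * tn > (long)best.tm * best.tn) {
            best.tm = tm;
            best.tn = (int)tn;
        }
    }
    return best;
}
```

With M = 8, N = 100000, 48 threads, and `min_m = min_n = 4`, a split along M alone could use at most 2 threads, while this grid search settles on 1 x 48. Swapping M and N gives 24 x 2, so all 48 threads stay active in both cases.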

Below are some scaling experiments on a 24-core system with hyper-threading (2 x Intel Xeon E5-2695v2, gcc 4.9.3, TARGET=SANDYBRIDGE USE_OPENMP=1 NUM_THREADS=48 INTERFACE64=0). "Unpatched" refers to 00c42dc, "Patch v1" to a89d671, and "Patch v2" to 30486a3.

Varying N: (plot not preserved)

Varying M: (plot not preserved)

Allows maximum use of available cores if one of M and N is small and the other is large.

martin-frbg · 2017-10-04T21:10:31Z

Impressive. (Though it is somewhat intriguing to see the first sgemm test "lose" a quarter of its just-gained (v1) performance again at N=130, and also at N=65 and N=32/33.) I wonder if it would make sense to add special cases for such hardware-specific sweet spots at some point?

timmoon10 · 2017-10-04T21:57:25Z

For v1, the number of threads is N/4. It seems we get the best performance when the number of threads is a power of 2 and we take a hit when we cross that boundary. For v2, we use all available threads from the get-go, so we don't see this effect.
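The arithmetic behind this can be sketched as follows. The thread only says the v1 thread count is N/4; the rounding up and the cap at the configured maximum are assumptions made for this example, and `v1_thread_count` and `is_power_of_two` are hypothetical names, not OpenBLAS functions.

```c
#include <assert.h>

/*
 * Assumed v1 rule for this sketch: one thread per 4 columns of N,
 * rounded up, capped at max_threads. The rounding is a guess.
 */
static int v1_thread_count(long n, int max_threads) {
    assert(n >= 1 && max_threads >= 1);
    long t = (n + 3) / 4;
    return t > max_threads ? max_threads : (int)t;
}

/* True if x is a positive power of two. */
static int is_power_of_two(int x) {
    return x > 0 && (x & (x - 1)) == 0;
}
```

Under these assumptions, N=128 maps to 32 threads (a power of 2) while N=130 maps to 33, and N=64 versus N=65 gives 16 versus 17. That would line up with the dips observed just past those sizes.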

timmoon10 · 2017-10-04T21:59:49Z

We could probably add some hardware-dependent macros to determine how we partition the computation. Right now, I use SWITCH_RATIO to determine partitions in the M and N dimensions, but it may be worthwhile experimenting with separate values.

brada4 · 2017-10-04T23:10:51Z

If you look at early marketing diagrams, it is not n^2 but 4-core clusters with somewhat more adjacent / shareable caches.

Tim Moon added 3 commits October 3, 2017 13:43

Use 2D thread distribution for small GEMMs.

860dcfc

Allows maximum use of available cores if one of M and N is small and the other is large.

Cleaning up and documenting multi-threaded GEMM code.

9de52b4

Reduce number of data partitions in n.

30486a3

martin-frbg merged commit db72ad8 into OpenMathLib:develop Oct 8, 2017

martin-frbg mentioned this pull request Nov 16, 2017

Significant performance increase for gemm, but not uniform (v0.30.0 develop) #1360

Closed

martin-frbg mentioned this pull request Feb 16, 2018

performance on AMD Ryzen and Threadripper #1461

Open

timmoon10 mentioned this pull request Feb 23, 2018

Minor performance optimization and bugfixes LLNL/lbann#233

Merged

martin-frbg mentioned this pull request May 4, 2018

A problem about relation between "openblas_set_num_threads" and CPU waste and time cost? #1544

Closed

martin-frbg mentioned this pull request May 26, 2018

Above average kernel times causing slow performance #1560

Open

martin-frbg mentioned this pull request Aug 11, 2019

Some thoughts for improving Haswell sgemm performance #2210

Closed

yamazakimitsufumi mentioned this pull request Apr 15, 2024

Multi-threaded DGEMM becomes less efficient on many-core CPUs #4644

Open
