
[GEMM] Enhance serial implementation #21

mratsim opened this issue Jan 30, 2019 · 1 comment
mratsim commented Jan 30, 2019

With #20, the parallel schedule seems to scale perfectly on many cores:

$ OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 ./build/gemm_f32_serial
Warmup: 0.9036 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    230.400 GFLOP/s
Theoretical peak multi:         4147.200 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

OpenBLAS benchmark
Collected 10 samples in 1.238 seconds
Average time: 123.713 ms
Stddev  time: 0.444 ms
Min     time: 123.335 ms
Max     time: 124.890 ms
Perf:         114.425 GFLOP/s

Laser production implementation
Collected 10 samples in 1.465 seconds
Average time: 146.392 ms
Stddev  time: 0.644 ms
Min     time: 146.006 ms
Max     time: 147.802 ms
Perf:         96.697 GFLOP/s
Mean Relative Error compared to OpenBLAS: 1.243059557509696e-07

------------------------------------------------------------

$  ./build/gemm_f32_omp
Warmup: 0.9021 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    230.400 GFLOP/s
Theoretical peak multi:         4147.200 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

OpenBLAS benchmark
Collected 10 samples in 0.079 seconds
Average time: 7.739 ms
Stddev  time: 4.368 ms
Min     time: 6.020 ms
Max     time: 20.097 ms
Perf:         1829.200 GFLOP/s

Laser production implementation
Collected 10 samples in 0.083 seconds
Average time: 8.126 ms
Stddev  time: 4.777 ms
Min     time: 6.241 ms
Max     time: 21.632 ms
Perf:         1742.123 GFLOP/s
Mean Relative Error compared to OpenBLAS: 0.01456451416015625

That matches near-linear scaling: 96.7 GFLOP/s × 18 cores ≈ 1740 GFLOP/s on my machine.

However, the single-threaded implementation still quite often falls below OpenBLAS.

Causes:

  1. To fix regressions in Improve gemm threading #20, interleaving the load of the next A micro-panel with the computation on the current A micro-panel had to be removed; it is currently commented out: https://github.com/numforge/laser/blob/ebb01ad40f30d495f0f4b02ef1ff49c3f54230cd/laser/primitives/matrix_multiplication/gemm_ukernel_generator.nim#L237-L242
     It should be reintroduced.

  2. mc and kc should be tuned according to the available L1 and L2 cache sizes and the TLB.

mratsim commented Jan 30, 2019

Note that with the new AVX-512 embedded broadcast, you do not need explicit broadcast instructions, which saves registers.
Unfortunately there is no way to ensure the compiler uses it; GCC fails to do so before GCC 9:

See https://colfaxresearch.com/skl-avx512/#sec-2-8 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63351

This paper also mentions using 10–14 FMA chains on Broadwell and a minimum of 8 on Skylake-X.
