
[GEMM] Enhance serial implementation #21

mratsim opened this issue Jan 30, 2019 · 1 comment
mratsim commented Jan 30, 2019

With #20, the parallel schedule seems to scale perfectly on many cores:

$ OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 ./build/gemm_f32_serial
Warmup: 0.9036 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    230.400 GFLOP/s
Theoretical peak multi:         4147.200 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

OpenBLAS benchmark
Collected 10 samples in 1.238 seconds
Average time: 123.713 ms
Stddev  time: 0.444 ms
Min     time: 123.335 ms
Max     time: 124.890 ms
Perf:         114.425 GFLOP/s

Laser production implementation
Collected 10 samples in 1.465 seconds
Average time: 146.392 ms
Stddev  time: 0.644 ms
Min     time: 146.006 ms
Max     time: 147.802 ms
Perf:         96.697 GFLOP/s
Mean Relative Error compared to OpenBLAS: 1.243059557509696e-07

------------------------------------------------------------

$  ./build/gemm_f32_omp
Warmup: 0.9021 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    230.400 GFLOP/s
Theoretical peak multi:         4147.200 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

OpenBLAS benchmark
Collected 10 samples in 0.079 seconds
Average time: 7.739 ms
Stddev  time: 4.368 ms
Min     time: 6.020 ms
Max     time: 20.097 ms
Perf:         1829.200 GFLOP/s

Laser production implementation
Collected 10 samples in 0.083 seconds
Average time: 8.126 ms
Stddev  time: 4.777 ms
Min     time: 6.241 ms
Max     time: 21.632 ms
Perf:         1742.123 GFLOP/s
Mean Relative Error compared to OpenBLAS: 0.01456451416015625

That matches near-linear scaling: 96.7 GFLOP/s × 18 cores ≈ 1740 GFLOP/s on my machine.

However, the single-threaded implementation still quite often falls below OpenBLAS.

Causes:

  1. To fix regressions in Improve gemm threading #20, interleaving the load of the next A micro-panel with the computation on the current A micro-panel had to be removed; it is currently commented out: https://github.com/numforge/laser/blob/ebb01ad40f30d495f0f4b02ef1ff49c3f54230cd/laser/primitives/matrix_multiplication/gemm_ukernel_generator.nim#L237-L242
     It should be reintroduced.

  2. mc and kc should be tuned according to the available L1 and L2 cache sizes and the TLB.

mratsim commented Jan 30, 2019

Note that with the new AVX-512 embedded broadcast, you do not need explicit broadcast instructions, which saves registers.
Unfortunately there is no way to ensure the compiler uses it; GCC fails to do so before GCC 9:

See https://colfaxresearch.com/skl-avx512/#sec-2-8 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63351

This paper also mentions using 10–14 FMA chains on Broadwell and a minimum of 8 on Skylake-X.
