Mysterious 2x perf regression on GEMM #40

Open
mratsim opened this issue Oct 24, 2019 · 2 comments

Comments

@mratsim
Owner

mratsim commented Oct 24, 2019

With no code or hardware change at all, there is a 2x perf regression in Laser's GEMM after a month. OpenBLAS is also a bit slower (with no package update):

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

OpenBLAS benchmark
Collected 10 samples in 0.101 seconds
Average time: 9.440 ms
Stddev  time: 0.141 ms
Min     time: 9.315 ms
Max     time: 9.733 ms
Perf:         1499.508 GFLOP/s

Laser production implementation
Collected 10 samples in 0.146 seconds
Average time: 14.000 ms
Stddev  time: 25.706 ms
Min     time: 5.839 ms
Max     time: 87.161 ms
Perf:         1011.102 GFLOP/s

PyTorch Glow: libjit matmul implementation (with AVX+FMA)
Collected 10 samples in 2.041 seconds
Average time: 204.123 ms
Stddev  time: 0.763 ms
Min     time: 203.362 ms
Max     time: 205.862 ms
Perf:         69.349 GFLOP/s

MKL-DNN reference GEMM benchmark
Collected 10 samples in 0.351 seconds
Average time: 34.305 ms
Stddev  time: 5.588 ms
Min     time: 30.013 ms
Max     time: 49.684 ms
Perf:         412.645 GFLOP/s

MKL-DNN JIT AVX benchmark
Collected 10 samples in 0.130 seconds
Average time: 11.230 ms
Stddev  time: 8.353 ms
Min     time: 7.725 ms
Max     time: 34.426 ms
Perf:         1260.573 GFLOP/s

MKL-DNN JIT AVX512 benchmark
Collected 10 samples in 0.083 seconds
Average time: 7.716 ms
Stddev  time: 7.932 ms
Min     time: 4.601 ms
Max     time: 30.078 ms
Perf:         1834.643 GFLOP/s
Mean Relative Error compared to vendor BLAS: 3.045843413929106e-06
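
The header figures in the output above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the benchmark's own code: the function names are hypothetical, and it assumes the byte count covers only the two float32 input matrices, which is the reading that matches the printed 29.491 MB.

```python
def gemm_stats(m, n, k, bytes_per_elem=4):
    """Return (flop, input_bytes, arithmetic_intensity) for C = A @ B."""
    # One multiply and one add per term of each inner product.
    flop = 2 * m * n * k
    # Assumption: only the inputs A (m x k) and B (k x n) are counted.
    input_bytes = (m * k + k * n) * bytes_per_elem
    return flop, input_bytes, flop / input_bytes


def gflops(flop, time_ms):
    """Convert an operation count and a time in milliseconds to GFLOP/s."""
    return flop / (time_ms * 1e-3) / 1e9


flop, input_bytes, intensity = gemm_stats(1920, 1920, 1920)
print(f"Required number of operations: {flop / 1e6:.3f} millions")
print(f"Required bytes: {input_bytes / 1e6:.3f} MB")
print(f"Arithmetic intensity: {intensity:.3f} FLOP/byte")
# The printed average (9.440 ms) is rounded, so this lands close to,
# but not exactly on, the reported 1499.508 GFLOP/s.
print(f"OpenBLAS from average time: {gflops(flop, 9.440):.1f} GFLOP/s")
```

Note that the "Perf" lines are derived from the average time, so a single slow sample lowers the reported GFLOP/s considerably.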

I suspect an issue with GCC's OpenMP runtime (libgomp). (MKL-DNN is linked to Intel OpenMP.)
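
One way to check whether a benchmark binary pulls in more than one OpenMP runtime is to look at its dynamically linked libraries. The sketch below is an assumption-laden helper, not part of Laser: the function names are hypothetical, it relies on the usual runtime library names (`libgomp` for GCC, `libiomp5` for Intel, `libomp` for LLVM), and `ldd` will not show libraries loaded later through `dlopen`. Finding two runtimes would not by itself prove they cause the slowdown.

```python
import subprocess

# Common OpenMP runtime library stems and the toolchain they belong to.
OPENMP_RUNTIMES = {
    "libgomp": "GNU (GCC)",
    "libiomp5": "Intel",
    "libomp": "LLVM",
}


def linked_libraries(binary):
    """Return the library names that `ldd` reports for `binary` (Linux only)."""
    out = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=True
    ).stdout
    # Each ldd line starts with the library name or path.
    return [line.split()[0] for line in out.splitlines() if line.strip()]


def openmp_runtimes(lib_names):
    """Return the sorted list of OpenMP runtimes found among `lib_names`."""
    found = set()
    for name in lib_names:
        base = name.rsplit("/", 1)[-1]   # drop any directory part
        stem = base.split(".so")[0]      # "libgomp.so.1" -> "libgomp"
        if stem in OPENMP_RUNTIMES:
            found.add(OPENMP_RUNTIMES[stem])
    return sorted(found)


# Hypothetical library list for illustration, not taken from the real binary.
print(openmp_runtimes(["libgomp.so.1", "libiomp5.so", "libc.so.6"]))
```

On the actual machine, `openmp_runtimes(linked_libraries("build/gemm_bench_float32"))` would report what the benchmark executable links against.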

@mratsim
Owner Author

mratsim commented Oct 24, 2019

But running Laser alone actually gives much better performance:

$  nim cpp -r -d:release -d:openmp -d:danger --outdir:build benchmarks/gemm/gemm_bench_float32.nim
Hint: used config file '/home/beta/.choosenim/toolchains/nim-1.0.2/config/nim.cfg' [Conf]
Hint: used config file '/home/beta/Programming/Nim/laser/nim.cfg' [Conf]
Hint: operation successful (340 lines compiled; 0.025 sec total; 5.754MiB peakmem; Dangerous Release Build) [SuccessX]
Hint: /home/beta/Programming/Nim/laser/build/gemm_bench_float32  [Exec]

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

Laser production implementation
Collected 10 samples in 0.076 seconds
Average time: 6.928 ms
Stddev  time: 3.038 ms
Min     time: 5.896 ms
Max     time: 15.573 ms
Perf:         2043.146 GFLOP/s
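
The minimum time here (5.896 ms) is almost the same as in the first run (5.839 ms), so the difference is in the average. The first run's Laser statistics (average 14.000 ms, stddev 25.706 ms, max 87.161 ms) are consistent with a single slow sample among ten. The sketch below illustrates this; the individual samples are a hypothetical reconstruction, because the benchmark prints only summary statistics.

```python
import statistics

# Assumption: nine near-identical samples plus the reported maximum.
typical_ms = 5.871   # hypothetical steady-state value, near the reported min (5.839 ms)
outlier_ms = 87.161  # reported max of the first Laser run
samples = [typical_ms] * 9 + [outlier_ms]

mean_ms = statistics.mean(samples)
stddev_ms = statistics.stdev(samples)   # sample standard deviation (n - 1)
median_ms = statistics.median(samples)

print(f"mean   = {mean_ms:.3f} ms")    # compare with the reported 14.000 ms
print(f"stddev = {stddev_ms:.3f} ms")  # compare with the reported 25.706 ms
print(f"median = {median_ms:.3f} ms")  # unaffected by the single outlier
```

If this reading is right, the apparent 2x regression comes from one very slow iteration rather than from every iteration being slower.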

@mratsim
Owner Author

mratsim commented Oct 24, 2019

And changing the benchmark order can slow down OpenBLAS as well:

$  nim cpp -r -d:release -d:openmp -d:danger --outdir:build benchmarks/gemm/gemm_bench_float32.nim
Hint: used config file '/home/beta/.choosenim/toolchains/nim-1.0.2/config/nim.cfg' [Conf]
Hint: used config file '/home/beta/Programming/Nim/laser/nim.cfg' [Conf]
Hint: operation successful (340 lines compiled; 0.025 sec total; 5.754MiB peakmem; Dangerous Release Build) [SuccessX]
Hint: /home/beta/Programming/Nim/laser/build/gemm_bench_float32  [Exec]

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

Laser production implementation
Collected 10 samples in 0.071 seconds
Average time: 6.416 ms
Stddev  time: 1.526 ms
Min     time: 5.861 ms
Max     time: 10.753 ms
Perf:         2206.263 GFLOP/s

OpenBLAS benchmark
Collected 10 samples in 0.151 seconds
Average time: 14.448 ms
Stddev  time: 10.255 ms
Min     time: 9.415 ms
Max     time: 37.410 ms
Perf:         979.779 GFLOP/s
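
Comparing throughput computed from the minimum time with throughput computed from the average makes the pattern in the three outputs easier to see. This is a sketch using only the times printed above; the labels and the `gflops` helper are hypothetical names.

```python
FLOP = 2 * 1920 ** 3  # 14155.776 million operations, as printed in the header

# (label, min_ms, avg_ms) copied from the three benchmark outputs above.
runs = [
    ("Laser, run after OpenBLAS", 5.839, 14.000),
    ("Laser, run alone", 5.896, 6.928),
    ("Laser, run before OpenBLAS", 5.861, 6.416),
    ("OpenBLAS, run first", 9.315, 9.440),
    ("OpenBLAS, run after Laser", 9.415, 14.448),
]


def gflops(time_ms):
    """GFLOP/s for one 1920x1920x1920 GEMM that took `time_ms` milliseconds."""
    return FLOP / (time_ms * 1e-3) / 1e9


for label, min_ms, avg_ms in runs:
    print(f"{label:28s} best: {gflops(min_ms):7.1f} GFLOP/s"
          f"   average: {gflops(avg_ms):7.1f} GFLOP/s")
```

The best-case figures barely move between runs for either library, while the averages drop sharply for whichever implementation runs second. That points at occasional slow iterations after switching libraries rather than at slower kernels.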
