cc @Laurae2
Benchmarking on a dual Xeon Gold 6154 vs MKL:
```
Warmup: 0.9943 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 2304, N: 2304)
B matrix shape: (M: 2304, N: 2304)
Output shape: (M: 2304, N: 2304)
Required number of operations: 24461.181 millions
Required bytes: 42.467 MB
Arithmetic intensity: 576.000 FLOP/byte
Theoretical peak single-core: 118.400 GFLOP/s
Theoretical peak multi: 4262.400 GFLOP/s
```
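For reference, a minimal sketch of how these numbers are derived (assuming FP32, counting only the A and B inputs as mandatory traffic, and a 1.85 GHz AVX-512 clock inferred from the quoted per-core peak; the dual 6154 gives 36 cores total):

```c
#include <stdio.h>

int main(void) {
    const long M = 2304, N = 2304, K = 2304;

    /* One multiply + one add per inner-product step: 2*M*N*K FLOPs. */
    double flops = 2.0 * M * N * K;                          /* ~24461.181e6 */

    /* Counting only the A and B inputs as traffic (FP32) matches
       the 42.467 MB figure above. */
    double bytes = (double)(M * K + K * N) * sizeof(float);  /* ~42.467 MB */

    double intensity = flops / bytes;                        /* 576 FLOP/byte */

    /* Assumed peak: 2 AVX-512 FMA units * 16 fp32 lanes * 2 FLOP
       = 64 FLOP/cycle, at an assumed 1.85 GHz AVX-512 clock. */
    double peak_core  = 64.0 * 1.85e9;                       /* 118.4 GFLOP/s */
    double peak_multi = peak_core * 36.0;                    /* 4262.4 GFLOP/s */

    printf("FLOPs: %.3f millions\n", flops / 1e6);
    printf("Bytes: %.3f MB\n", bytes / 1e6);
    printf("Intensity: %.3f FLOP/byte\n", intensity);
    printf("Peak single-core: %.1f GFLOP/s\n", peak_core / 1e9);
    printf("Peak multi: %.1f GFLOP/s\n", peak_multi / 1e9);
    return 0;
}
```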
Make sure to not bench Apple Accelerate or the default Linux BLAS.
```
Intel MKL benchmark
Collected 100 samples in 0.658 seconds
Average time: 6.211 ms
Stddev time: 2.274 ms
Min time: 5.648 ms
Max time: 28.398 ms
Perf: 3938.203 GFLOP/s

Display output[0] to make sure it's not optimized away
566.68505859375
```
```
Laser production implementation
Collected 100 samples in 4.067 seconds
Average time: 40.303 ms
Stddev time: 12.542 ms
Min time: 35.367 ms
Max time: 121.945 ms
Perf: 606.927 GFLOP/s

Display output[0] to make sure it's not optimized away
566.68505859375
```
```
PyTorch Glow: libjit matmul implementation
Collected 100 samples in 36.837 seconds
Average time: 368.372 ms
Stddev time: 3.071 ms
Min time: 362.655 ms
Max time: 380.193 ms
Perf: 66.403 GFLOP/s

Display output[0] to make sure it's not optimized away
566.6849975585938
```
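The "display output[0]" lines guard against dead-code elimination: if the result is never observed, the compiler may delete the warmup or even the timed computation entirely. A minimal sketch of the pattern (hypothetical stand-in `gemm`, not the actual harness):

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-in kernel; the real benchmark calls MKL / Laser / Glow here. */
static void gemm(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
}

int main(void) {
    const int n = 256;
    float *a = malloc(n * n * sizeof *a);
    float *b = malloc(n * n * sizeof *b);
    float *c = malloc(n * n * sizeof *c);
    for (int i = 0; i < n * n; i++) { a[i] = 1.0f; b[i] = 0.5f; }

    /* Warmup: print a value derived from the result so the compiler
       cannot prove the call is dead and remove it. */
    gemm(a, b, c, n);
    printf("Warmup result: %f\n", (double)c[0]);

    /* ... timed samples would run here ... */

    /* Same trick after the timed runs: observing output[0] keeps the
       whole computation live without paying for a full checksum. */
    printf("output[0] = %f\n", (double)c[0]);

    free(a); free(b); free(c);
    return 0;
}
```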
According to paper [2], *Anatomy of High-Performance Many-Threaded Matrix Multiplication*, Smith et al., parallelism should be done around the `jc` loop (dimension `nc`). Note that `nc` is often 4096, so we might need another distribution scheme; see the sketch below.
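For context, a sketch of the BLIS-style five-loop nest from the paper, with the outermost `jc` loop as the parallelization point. Packing and the micro-kernel are elided, and the blocking constants are illustrative, not tuned values:

```c
#include <omp.h>

/* Illustrative blocking parameters. With NC = 4096 and N = 2304 the
   jc loop has a single iteration, which is exactly why parallelizing
   it alone cannot feed 36 cores on this problem size. */
enum { NC = 4096, KC = 512, MC = 128 };

static int min(int a, int b) { return a < b ? a : b; }

/* BLIS-style five-loop GEMM skeleton: C += A*B, row-major, no packing,
   scalar "micro-kernel" for brevity. Assumes C is zero-initialized. */
void gemm_blis_sketch(int M, int N, int K,
                      const float *A, const float *B, float *C) {
    /* Loop 5 (jc): partitions N in chunks of NC -- the level the
       paper suggests parallelizing. Disjoint C columns, so no races. */
    #pragma omp parallel for
    for (int jc = 0; jc < N; jc += NC) {
        int nc = min(NC, N - jc);
        /* Loop 4 (pc): partitions K in chunks of KC; the B panel
           would be packed here. */
        for (int pc = 0; pc < K; pc += KC) {
            int kc = min(KC, K - pc);
            /* Loop 3 (ic): partitions M in chunks of MC; the A block
               would be packed here. */
            for (int ic = 0; ic < M; ic += MC) {
                int mc = min(MC, M - ic);
                /* Loops 2/1 (jr/ir) + micro-kernel, collapsed to
                   scalar code for the sketch. */
                for (int jr = 0; jr < nc; jr++)
                    for (int ir = 0; ir < mc; ir++) {
                        float acc = 0.0f;
                        for (int k = 0; k < kc; k++)
                            acc += A[(ic + ir) * K + pc + k]
                                 * B[(pc + k) * N + jc + jr];
                        C[(ic + ir) * N + jc + jr] += acc;
                    }
            }
        }
    }
}
```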