
Benchmark example using Intel MKL (for history) #10

Open · Laurae2 opened this issue Dec 26, 2018 · 1 comment

Laurae2 commented Dec 26, 2018

(This issue is kept for history and for potential later improvements to Laser, especially AVX-512 and dual-port AVX-512.)

After chatting for hours with @mratsim to benchmark Laser on a 72-thread machine and to get a working MKL setup, here is an example benchmark using Intel MKL. We assume multiple MKL installations and use a specific version, stored in /opt/intel/compilers_and_libraries_2019.0.117, with the following settings:

[image: MKL installation settings]

We also assume you do not have a Nim installation; if you do, you know which lines to skip.

Change the number of threads right at the beginning (OMP_NUM_THREADS). We are using commit 990e59f:

source /opt/intel/mkl/bin/mklvars.sh intel64
export OMP_NUM_THREADS=1

curl https://nim-lang.org/choosenim/init.sh -sSf | sh
git clone --recursive git://github.com/numforge/laser
cd laser
git checkout 990e59f
git submodule init
git submodule update

Before compiling, change https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/benchmarks/blas.nim#L5 to the following (change the MKL folders if needed):

  const blas = "libmkl_intel_ilp64.so"
  {.passC: "-I'/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/include' -L'/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin'".}
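For context, the blas constant feeds Nim's {.dynlib.} pragma on the BLAS imports in that file, which is also why the compile command below overrides libmkl_intel_ilp64. Here is a minimal, hypothetical sketch of the pattern (not the actual declarations in blas.nim; assumes an ILP64 MKL and that the MKL libraries are on the loader path):

  # Hypothetical sketch, not the real benchmarks/blas.nim: shows how a
  # Nim dynlib import resolves symbols in the library named by `blas`.
  const blas = "libmkl_intel_ilp64.so"

  type MklInt = int64    # ILP64 build: MKL integer arguments are 64-bit

  # CBLAS layout/transpose enums remain plain C ints even under ILP64.
  proc cblas_sgemm(layout, transA, transB: cint,
                   m, n, k: MklInt, alpha: float32,
                   a: ptr float32, lda: MklInt,
                   b: ptr float32, ldb: MklInt, beta: float32,
                   c: ptr float32, ldc: MklInt)
    {.dynlib: blas, importc: "cblas_sgemm".}

  var
    a = [1.0'f32, 2, 3, 4]    # 2x2 row-major
    b = [1.0'f32, 0, 0, 1]    # 2x2 identity
    c: array[4, float32]

  # 101 = CblasRowMajor, 111 = CblasNoTrans
  cblas_sgemm(101, 111, 111, 2, 2, 2, 1.0, a[0].addr, 2,
              b[0].addr, 2, 0.0, c[0].addr, 2)
  echo c                      # expect [1.0, 2.0, 3.0, 4.0]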

Change the following here https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/benchmarks/gemm/gemm_bench_float32.nim#L53-L55 to:

  M     = 2304
  K     = 2304
  N     = 2304

Tune the following to your liking; here I used my dual Xeon Gold 6154 and set 100 repeated computations:

  NbSamples = 100    # This might stress the allocator when packing if the matrices are big
  CpuGhz = 3.7      # Assuming no turbo
  NumCpuCores = 36
  CpuFlopCycle = 32 # AVX2: 2 FMA/cycle x 8 floats x (1 add + 1 mul) = 32 FLOP/cycle

For CpuFlopCycle, you need to check which instructions the kernel implements here:

https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/laser/primitives/matrix_multiplication/gemm_ukernel_avx_fma.nim#L10-L23
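For reference, these constants are only used to compute the theoretical peak that the benchmark prints. A quick sketch of the arithmetic (names mirror the constants above; this is not the benchmark's own code):

  # Theoretical peak = clock (GHz) x FLOP/cycle x number of cores.
  const
    CpuGhz = 3.7
    NumCpuCores = 36
    CpuFlopCycle = 32   # AVX2 kernel: 2 FMA/cycle x 8 floats x (1 add + 1 mul)

  let peakSingle = CpuGhz * float(CpuFlopCycle)      # GFLOP/s on one core
  let peakMulti  = peakSingle * float(NumCpuCores)   # GFLOP/s on all cores

  echo peakSingle   # 118.4
  echo peakMulti    # 4262.4, i.e. ~4.26 TFLOP/s

An AVX-512 kernel would double CpuFlopCycle to 64 on this CPU (two AVX-512 FMA units per core), albeit at a lower AVX-512 clock.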

Also, tune this to your preference: https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/laser/primitives/matrix_multiplication/gemm_tiling.nim#L234-L235 (again, I tuned for my dual Xeon Gold 6154):

  result.mc = min(768 div T.sizeof, M)
  result.kc = min(4096 div T.sizeof, K)
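For context (my reading of the BLIS-style blocking Laser uses, not something spelled out in gemm_tiling.nim): mc and kc size the packed block of A so it stays cache-resident during the inner kernel. With float32, T.sizeof == 4, so the tuned values above evaluate to:

  # How the tuned tile sizes above evaluate for float32 (T.sizeof == 4).
  let
    M = 2304                  # matrix dimensions used in this benchmark
    K = 2304
    mc = min(768 div 4, M)    # = 192 rows of A per packed block
    kc = min(4096 div 4, K)   # = 1024 depth (columns of A) per packed block
  echo (mc: mc, kc: kc)       # (mc: 192, kc: 1024)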

And now you can compile and run with MKL (change the MKL folders if needed). The --dynlibOverride:libmkl_intel_ilp64 flag tells Nim not to load the library at runtime and to resolve the MKL symbols at link time instead, via the --passL flags:

mkdir build
LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin nim cpp --dynlibOverride:libmkl_intel_ilp64 --passL:"/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_intel_ilp64.a -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl" --passC:"-D_GNU_SOURCE -L$MKLROOT/lib/intel64_lin -DMKL_ILP64 -m64" -r -d:release -d:openmp -o:build/bench_gemm benchmarks/gemm/gemm_bench_float32.nim

On a dual Xeon Gold 6154 setup (36 physical cores / 72 logical cores, 3.7 GHz all-core turbo), you should get the following:

Tool           FLOPS
Intel MKL      4 TFLOPS
Laser          600 GFLOPS
PyTorch Glow   60 GFLOPS

As you can see, MKL comes close to the maximum possible theoretical performance (~4.26 TFLOP/s, as computed above):

[image: measured performance against the theoretical peak]

Laurae2 commented Mar 24, 2019

Newer results:

cd Downloads/Nim
rm -rf laser
source /opt/intel/mkl/bin/mklvars.sh intel64
export OMP_NUM_THREADS=1
git clone --recursive git://github.com/numforge/laser
cd laser
git checkout dbfb31d
git submodule init
git submodule update

Apply the same edits as in the first comment (note that blas.nim has moved to benchmarks/third_party/):

gedit benchmarks/third_party/blas.nim
gedit benchmarks/gemm/gemm_bench_float32.nim
gedit laser/primitives/matrix_multiplication/gemm_tiling.nim

export OMP_NUM_THREADS=72
rm -rf build
mkdir build
LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin nim cpp --dynlibOverride:libmkl_intel_ilp64 --passL:"/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_intel_ilp64.a -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl" --passC:"-D_GNU_SOURCE -L$MKLROOT/lib/intel64_lin -DMKL_ILP64 -m64" -r -d:release -d:openmp -o:build/bench_gemm benchmarks/gemm/gemm_bench_float32.nim

Results (the "OpenBLAS benchmark" entry below is actually MKL, since the BLAS library is overridden with MKL):

Hint: /home/laurae/Downloads/Nim/laser/build/bench_gemm  [Exec]

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    179.200 GFLOP/s
Theoretical peak multi:         6451.200 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

OpenBLAS benchmark
Collected 10000 samples in 52.753 seconds
Average time: 3.492 ms
Stddev  time: 0.736 ms
Min     time: 3.162 ms
Max     time: 31.532 ms
Perf:         4053.552 GFLOP/s

Laser production implementation
Collected 10000 samples in 131.145 seconds
Average time: 11.152 ms
Stddev  time: 8.307 ms
Min     time: 7.360 ms
Max     time: 132.827 ms
Perf:         1269.368 GFLOP/s

PyTorch Glow: libjit matmul implementation (with AVX+FMA)
Collected 10000 samples in 2277.353 seconds
Average time: 227.735 ms
Stddev  time: 4.743 ms
Min     time: 224.159 ms
Max     time: 249.707 ms
Perf:         62.159 GFLOP/s

MKL-DNN reference GEMM benchmark
Collected 10000 samples in 268.515 seconds
Average time: 24.684 ms
Stddev  time: 6.915 ms
Min     time: 21.277 ms
Max     time: 86.476 ms
Perf:         573.477 GFLOP/s

MKL-DNN JIT AVX benchmark
Collected 10000 samples in 89.331 seconds
Average time: 6.755 ms
Stddev  time: 4.773 ms
Min     time: 5.215 ms
Max     time: 77.110 ms
Perf:         2095.728 GFLOP/s

MKL-DNN JIT AVX512 benchmark
Collected 10000 samples in 61.314 seconds
Average time: 4.260 ms
Stddev  time: 4.065 ms
Min     time: 3.071 ms
Max     time: 60.712 ms
Perf:         3322.757 GFLOP/s
Mean Relative Error compared to vendor BLAS: 4.056792022311129e-06
