Description
(This issue is kept for history and for potential later improvements to Laser, especially AVX-512 and dual-port AVX-512.)
After chatting for hours with @mratsim to figure out how to benchmark Laser on a 72-thread machine and to get a working MKL setup, here is an example benchmark using Intel MKL. We assume multiple MKL installations are present and use the specific version stored in /opt/intel/compilers_and_libraries_2019.0.117.
We also assume you do not have any Nim installation; if you do, you know which lines to skip.
Change the number of threads right at the beginning (OMP_NUM_THREADS). We are using commit 990e59f, with the following settings:
source /opt/intel/mkl/bin/mklvars.sh intel64
export OMP_NUM_THREADS=1
curl https://nim-lang.org/choosenim/init.sh -sSf | sh
git clone --recursive https://github.com/numforge/laser
cd laser
git checkout 990e59f
git submodule init
git submodule update
Before compiling, change the lines at https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/benchmarks/blas.nim#L5 to the following (change the MKL folders if needed):
const blas = "libmkl_intel_ilp64.so"
{.passC: "-I'/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/include' -L'/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin'".}
Change the matrix dimensions at https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/benchmarks/gemm/gemm_bench_float32.nim#L53-L55 to:
M = 2304
K = 2304
N = 2304
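To get a feel for how much work one sample represents, here is a minimal sketch that counts the floating-point operations of a single matrix product, using the usual 2·M·K·N convention (K multiplies and K adds per output element). The function name `gemm_flops` is made up for this illustration and is not part of Laser.

```shell
#!/bin/sh
# Hypothetical helper: floating-point operations in one C = A*B product,
# where A is MxK and B is KxN. Each of the M*N outputs needs K multiplies
# and K adds, hence the factor 2.
gemm_flops() {
  m=$1; k=$2; n=$3
  echo $(( 2 * m * k * n ))
}

# Sizes used in this benchmark
gemm_flops 2304 2304 2304
```

With M = K = N = 2304 this comes to roughly 24.5 GFLOP per product, so a library running at 4 TFLOPS finishes one sample in about 6 ms.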
Tune the following to your liking. Here I used my dual Xeon Gold 6154 and set 100 repeated computations:
NbSamples = 100 # This might stress the allocator when packing if the matrices are big
CpuGhz = 3.7 # All-core turbo on this machine
NumCpuCores = 36
CpuFlopCycle = 32 # AVX2: 2xFMA/cycle = 2x8x2 - 2 x 8 floats x (1 add + 1 mul)
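The arithmetic in the `CpuFlopCycle` comment generalizes to other vector widths, which matters for the AVX-512 cases mentioned at the top. The sketch below only restates that formula (FMA units × SIMD lanes × 2 operations per FMA); the function name `flop_per_cycle` is made up, and how many FMA ports a given CPU really has must be checked against its specification.

```shell
#!/bin/sh
# Hypothetical helper: flop per cycle per core
#   = FMA units x SIMD lanes x 2 (one FMA is one multiply + one add)
flop_per_cycle() {
  fma_units=$1; vector_bits=$2; element_bits=$3
  echo $(( fma_units * (vector_bits / element_bits) * 2 ))
}

flop_per_cycle 2 256 32   # AVX2, float32, two FMA ports
flop_per_cycle 2 512 32   # AVX-512, float32, two FMA ports (dual port)
flop_per_cycle 1 512 32   # AVX-512, float32, a single FMA port
```

The first call reproduces the value 32 used above. A dual-port AVX-512 kernel would double it to 64, while a single-port AVX-512 part gains nothing over AVX2 by this measure.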
For the CpuFlopCycle, you need to check the implemented instructions here:
Also tune the tile sizes at https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/laser/primitives/matrix_multiplication/gemm_tiling.nim#L234-L235 to your preference (again, I tuned for my dual Xeon Gold 6154):
result.mc = min(768 div T.sizeof, M)
result.kc = min(4096 div T.sizeof, K)
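To see what these two lines evaluate to for float32, the sketch below replays the same integer arithmetic in shell. It assumes the usual BLIS-style meaning of `mc` and `kc` (an mc × kc block of A is packed at a time), which is an interpretation rather than something stated here. `min_int`, `tile_sizes` and `packed_A_bytes` are names invented for the illustration.

```shell
#!/bin/sh
# Hypothetical helpers mirroring:
#   mc = min(768 div sizeof(T), M)
#   kc = min(4096 div sizeof(T), K)
min_int() {
  if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

tile_sizes() {
  elem_bytes=$1; m=$2; k=$3
  mc=$(min_int $(( 768 / elem_bytes )) "$m")
  kc=$(min_int $(( 4096 / elem_bytes )) "$k")
  # Assumed: one packed block of A holds mc x kc elements
  echo "mc=$mc kc=$kc packed_A_bytes=$(( mc * kc * elem_bytes ))"
}

tile_sizes 4 2304 2304   # float32 with the benchmark sizes
```

For float32 this gives mc = 192 and kc = 1024, so one such block takes 768 KiB. When retuning for another CPU, comparing that figure with the per-core cache size is a reasonable starting point.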
And now you can compile with MKL (change the MKL folders if needed):
mkdir build
LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin nim cpp --dynlibOverride:libmkl_intel_ilp64 --passL:"/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_intel_ilp64.a -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl" --passC:"-D_GNU_SOURCE -L$MKLROOT/lib/intel64_lin -DMKL_ILP64 -m64" -r -d:release -d:openmp -o:build/bench_gemm benchmarks/gemm/gemm_bench_float32.nim
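If the link step fails, it may help to first check that the MKL folder contains what the command above refers to. The following is a small sketch, not part of Laser or MKL. `check_mkl_root` is a made-up helper, and the file list is inferred from the libraries named in the compile command, assuming the shared-library variants sit next to the static one.

```shell
#!/bin/sh
# Hypothetical helper: report which paths referenced by the compile
# command are missing under a given MKL root. Returns 0 if all exist.
check_mkl_root() {
  root=$1
  missing=0
  for f in include \
           lib/intel64_lin/libmkl_intel_ilp64.a \
           lib/intel64_lin/libmkl_intel_ilp64.so \
           lib/intel64_lin/libmkl_gnu_thread.so \
           lib/intel64_lin/libmkl_core.so; do
    if [ ! -e "$root/$f" ]; then
      echo "missing: $root/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

check_mkl_root /opt/intel/compilers_and_libraries_2019.0.117/linux/mkl \
  || echo "adjust the MKL folders before compiling"
```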
On a dual Xeon Gold 6154 setup (36 physical cores / 72 logical cores, 3.7 GHz all-core turbo), you should get the following:
Tool | FLOPS |
---|---|
Intel MKL | 4 TFLOPS |
Laser | 600 GFLOPS |
PyTorch Glow | 60 GFLOPS |
As you can see, we are nearly reaching the maximum possible theoretical performance:
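For reference, the theoretical peak implied by the settings above (NumCpuCores = 36, CpuGhz = 3.7, CpuFlopCycle = 32) can be worked out as cores × GHz × flop/cycle. The sketch below does this in integer shell arithmetic, using MHz to avoid floating point, and compares the rounded figures from the table against it. `peak_gflops` and `efficiency_pct` are illustrative names, and the percentages are truncated, not rounded.

```shell
#!/bin/sh
# Hypothetical helpers. Peak in GFLOPS = cores x GHz x flop/cycle;
# the clock is passed in MHz so everything stays in integers.
peak_gflops() {
  cores=$1; mhz=$2; flop_cycle=$3
  echo $(( cores * mhz * flop_cycle / 1000 ))
}

# Share of the peak reached by a measured throughput, in percent (truncated).
efficiency_pct() {
  measured_gflops=$1; peak=$2
  echo $(( 100 * measured_gflops / peak ))
}

peak=$(peak_gflops 36 3700 32)
echo "peak: $peak GFLOPS"

# Rounded values from the table, in GFLOPS
for entry in "Intel MKL:4000" "Laser:600" "PyTorch Glow:60"; do
  echo "${entry%%:*}: $(efficiency_pct "${entry##*:}" "$peak")% of peak"
done
```

Under these assumptions the peak is about 4.26 TFLOPS, so the 4 TFLOPS measured for Intel MKL sits at roughly 93% of it.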