
Benchmark example using Intel MKL (for history) #10

Open
@Laurae2

Description


(This issue is kept for history and potential future improvements to Laser, especially AVX-512 and dual-port AVX-512.)

After chatting for hours with @mratsim to benchmark Laser on a 72-thread machine and getting a working MKL setup, here is an example benchmark using Intel MKL. We assume multiple MKL installations, and use a specific version stored in /opt/intel/compilers_and_libraries_2019.0.117 with the following settings:

(screenshot: MKL installation settings)

We also assume you do not have a Nim installation; if you do, you know which lines to skip.

Change the number of threads (OMP_NUM_THREADS) right at the beginning. We are using commit 990e59f.

source /opt/intel/mkl/bin/mklvars.sh intel64
export OMP_NUM_THREADS=1

curl https://nim-lang.org/choosenim/init.sh -sSf | sh
git clone --recursive git://github.com/numforge/laser
cd laser
git checkout 990e59f
git submodule init
git submodule update

Before compiling, change the following lines in https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/benchmarks/blas.nim#L5 (adjust the MKL folders if needed):

  const blas = "libmkl_intel_ilp64.so"
  {.passC: "-I'/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/include' -L'/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin'".}

Change the matrix dimensions here https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/benchmarks/gemm/gemm_bench_float32.nim#L53-L55 to:

  M     = 2304
  K     = 2304
  N     = 2304
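At these dimensions, each matrix multiplication performs 2 × M × N × K floating-point operations (one multiply and one add per multiply-accumulate). As a quick sanity check, not part of the benchmark itself:

```python
# Flop count of one M x K by K x N GEMM (mul + add per MAC).
M = K = N = 2304
flops = 2 * M * N * K
print(f"{flops / 1e9:.2f} GFLOP per GEMM")  # 24.46 GFLOP per GEMM
```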

Tune the following to your liking; here I used my dual Xeon Gold 6154 and 100 repeated computations:

  NbSamples = 100    # This might stress the allocator when packing if the matrices are big
  CpuGhz = 3.7       # Assuming no turbo
  NumCpuCores = 36
  CpuFlopCycle = 32  # AVX2: 2 FMA/cycle = 2 x 8 floats x (1 add + 1 mul) = 32
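These three hardware parameters imply a theoretical single-precision peak, which is what the benchmark compares against. A quick sketch of the arithmetic, using the values above:

```python
# Theoretical fp32 peak implied by the benchmark parameters.
cpu_ghz = 3.7        # all-core frequency, no turbo assumed
num_cpu_cores = 36   # physical cores (dual Xeon Gold 6154)
cpu_flop_cycle = 32  # 2 FMA/cycle x 8 fp32 lanes x (1 add + 1 mul)

peak_gflops = cpu_ghz * num_cpu_cores * cpu_flop_cycle
print(f"Theoretical peak: {peak_gflops:.1f} GFLOPS")  # Theoretical peak: 4262.4 GFLOPS
```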

For the CpuFlopCycle, you need to check the implemented instructions here:

https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/laser/primitives/matrix_multiplication/gemm_ukernel_avx_fma.nim#L10-L23

Also, tune this to your preference https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/laser/primitives/matrix_multiplication/gemm_tiling.nim#L234-L235 (I tuned again for my Dual Xeon Gold 6154):

  result.mc = min(768 div T.sizeof, M)
  result.kc = min(4096 div T.sizeof, K)
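For float32 (T.sizeof == 4) and the 2304-dimension matrices above, these settings work out to the following tile sizes (a quick check, assuming `div` is integer division as in Nim):

```python
# Tile sizes produced by the tuned mc/kc expressions for float32.
sizeof_t = 4   # float32
M = K = 2304   # matrix dimensions used above

mc = min(768 // sizeof_t, M)    # rows of A packed per tile
kc = min(4096 // sizeof_t, K)   # depth of the packed panels
print(mc, kc)  # 192 1024
```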

And now you can compile with MKL (change the MKL folders if needed):

mkdir build
LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin nim cpp --dynlibOverride:libmkl_intel_ilp64 --passL:"/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_intel_ilp64.a -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl" --passC:"-D_GNU_SOURCE -L$MKLROOT/lib/intel64_lin -DMKL_ILP64 -m64" -r -d:release -d:openmp -o:build/bench_gemm benchmarks/gemm/gemm_bench_float32.nim

On a Dual Xeon 6154 setup (36 physical cores / 72 logical cores, 3.7 GHz all turbo), you should get the following:

| Tool         | FLOPS      |
| ------------ | ---------- |
| Intel MKL    | 4 TFLOPS   |
| Laser        | 600 GFLOPS |
| PyTorch Glow | 60 GFLOPS  |
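The MKL figure can be related back to the theoretical peak implied by the parameters above. A rough sketch, taking the measured ~4 TFLOPS from the table:

```python
# Fraction of theoretical fp32 peak achieved by MKL on the dual Xeon Gold 6154.
peak_gflops = 3.7 * 36 * 32     # GHz x cores x flop/cycle = 4262.4
mkl_gflops = 4000               # ~4 TFLOPS measured above

efficiency = mkl_gflops / peak_gflops
print(f"{efficiency:.0%} of peak")  # 94% of peak
```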

As you can see, Intel MKL nearly reaches the maximum possible theoretical performance.

(screenshot: benchmark output)
