Benchmark example using Intel MKL (for history)

(This issue is kept for history and for potential later improvements to Laser, especially AVX-512 and dual-port AVX-512.)

After chatting for hours with @mratsim to work out how to benchmark Laser on a 72-thread machine and to get a working MKL setup, here is an example benchmark using Intel MKL. We assume there are multiple MKL installations and use the specific version stored in `/opt/intel/compilers_and_libraries_2019.0.117`, with the following settings:

![image](https://user-images.githubusercontent.com/9083669/50443791-2c0c4980-0905-11e9-8b71-f5107b0d2d63.png)

We also assume you do not have Nim installed; if you do, you know which lines to skip.

Set the number of threads right at the beginning (`OMP_NUM_THREADS`). We are using commit 990e59f.

```sh
source /opt/intel/mkl/bin/mklvars.sh intel64
export OMP_NUM_THREADS=1

curl https://nim-lang.org/choosenim/init.sh -sSf | sh
git clone --recursive https://github.com/numforge/laser
cd laser
git checkout 990e59f
git submodule init
git submodule update
```
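Since the later steps depend on specific MKL files, it can help to check that they exist before compiling. The following is a minimal sketch, not part of Laser: `check_mkl` is a hypothetical helper, and the file list is simply the set of libraries named in the compile command further down. It assumes the MKL root is the directory that contains `include/` and `lib/intel64_lin/`.

```sh
# Hypothetical helper (not part of Laser or MKL): report which of the MKL
# files used later in this guide exist under the given MKL root.
check_mkl() {
  root="$1"
  missing=0
  for f in include \
           lib/intel64_lin/libmkl_intel_ilp64.so \
           lib/intel64_lin/libmkl_intel_ilp64.a \
           lib/intel64_lin/libmkl_gnu_thread.so \
           lib/intel64_lin/libmkl_core.so
  do
    if [ -e "$root/$f" ]; then
      echo "ok       $f"
    else
      echo "MISSING  $f"
      missing=1
    fi
  done
  return "$missing"
}

# MKLROOT is exported by mklvars.sh; the fallback is the path used in this guide.
check_mkl "${MKLROOT:-/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl}" \
  || echo "Adjust the MKL folders in the steps below before compiling."
```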

Before compiling, change the lines at https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/benchmarks/blas.nim#L5 to the following (adjust the MKL folders if needed):

```
  const blas = "libmkl_intel_ilp64.so"
  {.passC: "-I'/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/include' -L'/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin'".}
```

Change the matrix dimensions at https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/benchmarks/gemm/gemm_bench_float32.nim#L53-L55 to:

```
  M     = 2304
  K     = 2304
  N     = 2304
```
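To get a feel for what these dimensions mean, here is a small sketch of the work and memory involved in one multiplication. It assumes the usual GEMM operation count of 2·M·N·K (one multiply and one add per inner-product term) and 4-byte `float32` elements; `gemm_flops` and `gemm_bytes` are hypothetical helper names, not Laser functions.

```sh
# Hypothetical helpers (not part of Laser). Arguments are given as M K N,
# in the same order as the constants above.

# FLOP count of C = A*B with A (M x K) and B (K x N): 2*M*N*K.
gemm_flops() { echo $(( 2 * $1 * $2 * $3 )); }

# Bytes held by A (M x K), B (K x N) and C (M x N); $4 is the element size.
gemm_bytes() { echo $(( $4 * ($1 * $2 + $2 * $3 + $1 * $3) )); }

gemm_flops 2304 2304 2304      # prints 24461180928 (about 24.5 GFLOP)
gemm_bytes 2304 2304 2304 4    # prints 63700992 (20.25 MiB per matrix)
```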

Tune the following to your liking. Here I used my dual Xeon Gold 6154 and 100 repeated computations:

```
  NbSamples = 100    # This might stress the allocator when packing if the matrices are big
  CpuGhz = 3.7      # All-core turbo frequency on this machine
  NumCpuCores = 36
  CpuFlopCycle = 32 # AVX2: 2xFMA/cycle = 2x8x2 - 2 x 8 floats x (1 add + 1 mul)
```

For `CpuFlopCycle`, check which instructions are implemented here:

https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/laser/primitives/matrix_multiplication/gemm_ukernel_avx_fma.nim#L10-L23
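The three hardware constants above combine into a theoretical peak: frequency × cores × FLOP per cycle per core. A minimal sketch of that arithmetic follows (`peak_gflops` is a hypothetical helper, and `awk` is used only because the shell has no floating-point arithmetic):

```sh
# Hypothetical helper: theoretical peak in GFLOPS from
#   $1 = frequency in GHz, $2 = number of cores, $3 = FLOP per cycle per core.
peak_gflops() {
  awk -v ghz="$1" -v cores="$2" -v fpc="$3" \
    'BEGIN { printf "%.1f\n", ghz * cores * fpc }'
}

# Values from the settings above: 3.7 GHz, 36 cores, 32 FLOP/cycle (AVX2 + FMA).
peak_gflops 3.7 36 32    # prints 4262.4
```

With these settings the reference peak is therefore about 4.26 TFLOPS. If you change `CpuFlopCycle` for a different instruction set, the peak scales accordingly.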

Also tune the tile sizes at https://github.com/numforge/laser/blob/990e59fffe50779cdef33aa0b8f22da19e1eb328/laser/primitives/matrix_multiplication/gemm_tiling.nim#L234-L235 to your machine (again, I tuned them for my dual Xeon Gold 6154):

```
  result.mc = min(768 div T.sizeof, M)
  result.kc = min(4096 div T.sizeof, K)
```
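To see what these two expressions evaluate to for the `float32` benchmark, here is a small sketch that mirrors them with integer arithmetic. It assumes `T.sizeof` is 4 bytes and, for the last line, a BLIS-style scheme in which an `mc` × `kc` block of A is packed at a time; `min` is a hypothetical helper.

```sh
# Hypothetical helper: integer minimum, standing in for Nim's min().
min() {
  if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

elem_size=4          # sizeof(float32)
M=2304; K=2304       # dimensions used in this benchmark

mc=$(min $((768 / elem_size)) "$M")
kc=$(min $((4096 / elem_size)) "$K")

echo "mc=$mc kc=$kc"                                    # prints mc=192 kc=1024
echo "packed A block: $((mc * kc * elem_size)) bytes"    # prints packed A block: 786432 bytes
```

The packed block comes to 768 KiB, which is the number to compare against your per-core cache sizes when choosing other values.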

And now you can compile with MKL (change the MKL folders if needed):

```sh
mkdir build
LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin \
nim cpp \
  --dynlibOverride:libmkl_intel_ilp64 \
  --passL:"/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_intel_ilp64.a -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl" \
  --passC:"-D_GNU_SOURCE -L$MKLROOT/lib/intel64_lin -DMKL_ILP64 -m64" \
  -r -d:release -d:openmp \
  -o:build/bench_gemm \
  benchmarks/gemm/gemm_bench_float32.nim
```
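Once the binary exists, you can rerun it with different thread counts without recompiling. The following is a sketch under assumptions: `sweep_threads` is a hypothetical helper, the thread counts are only examples for a 36-core / 72-thread machine, and the MKL library folder still needs to be on `LD_LIBRARY_PATH` as in the compile step.

```sh
# Hypothetical helper: run a command once per thread count.
#   $1   = space-separated list of thread counts
#   rest = the command to run (here the benchmark binary)
sweep_threads() {
  counts="$1"; shift
  for t in $counts; do
    echo "== OMP_NUM_THREADS=$t =="
    OMP_NUM_THREADS="$t" "$@"
  done
}

# Only run the sweep if the benchmark has been built.
if [ -x build/bench_gemm ]; then
  export LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin
  sweep_threads "1 18 36 72" build/bench_gemm
fi
```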

On a dual Xeon Gold 6154 setup (36 physical cores / 72 logical cores, 3.7 GHz all-core turbo), you should get roughly the following:

| Tool | FLOPS |
| :--- | ---: |
| Intel MKL | 4 TFLOPS |
| Laser | 600 GFLOPS |
| PyTorch Glow | 60 GFLOPS |

As you can see, Intel MKL comes close to the theoretical peak for these settings:

![image](https://user-images.githubusercontent.com/9083669/50443870-8ad1c300-0905-11e9-86d8-5f178ec29d42.png)
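To put the table in perspective, each result can be expressed as a fraction of the theoretical peak. This sketch assumes the peak implied by the settings used here (3.7 GHz × 36 cores × 32 FLOP/cycle = 4262.4 GFLOPS) and takes the rounded values from the table, so the percentages are only indicative; `efficiency` is a hypothetical helper.

```sh
# Hypothetical helper: measured GFLOPS ($1) as a percentage of peak GFLOPS ($2).
efficiency() {
  awk -v got="$1" -v peak="$2" 'BEGIN { printf "%.1f\n", 100 * got / peak }'
}

peak=4262.4   # 3.7 GHz x 36 cores x 32 FLOP/cycle, in GFLOPS

# Rounded values from the table above, in GFLOPS.
for entry in "Intel MKL:4000" "Laser:600" "PyTorch Glow:60"; do
  name=${entry%%:*}
  gflops=${entry##*:}
  echo "$name: $(efficiency "$gflops" "$peak")% of peak"
done
# prints:
#   Intel MKL: 93.8% of peak
#   Laser: 14.1% of peak
#   PyTorch Glow: 1.4% of peak
```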

