Add ZEN support #1133

Merged (1 commit, Mar 24, 2017)
Conversation

@steckdenis (Contributor) commented Mar 19, 2017

This patch adds the following features:

  • ZEN target for static builds
  • Zen/Ryzen auto-detection (CPUID base family Fh plus extended family 8, i.e. family 17h; see the decoding sketch below) so that a plain "make" compiles for Zen
  • Dynamic architecture support for Zen

The Zen target is currently heavily based on Haswell (Excavator param.h tuning, Haswell kernels). I tried to tune OpenBLAS for Zen but started getting incorrect results, so this patch does not do any Zen-specific tuning.
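
For reference, the family numbers used in the auto-detection come from CPUID leaf 1: Zen reports base family Fh and extended family 8, which AMD's rules add up to family 17h. A minimal stand-alone sketch of that decoding (assuming GCC/Clang's <cpuid.h>; this is not the cpuid.c code of the patch itself):

#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;                                  /* CPUID leaf 1 not available */
    unsigned int base_family = (eax >> 8) & 0xF;   /* bits 11:8  */
    unsigned int ext_family  = (eax >> 20) & 0xFF; /* bits 27:20 */
    unsigned int family = base_family;
    if (base_family == 0xF)                        /* AMD rule: add extended family when base is Fh */
        family += ext_family;
    /* On Zen/Ryzen this prints: family 0x17 (base 0xF, extended 0x8) */
    printf("family 0x%X (base 0x%X, extended 0x%X)\n", family, base_family, ext_family);
    return 0;
}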

If you are interested, here is what I have observed by trying to optimize several parameters from param.h using blind brute-force:

  • SNUMOPT=8 and DNUMOPT=4 seem to work best (I tested 8/8, 16/8 and 16/16)
  • GEMM_DEFAULT_ALIGN=0x1fffUL is a couple percent faster than any other alignment value.
  • SYMV_P=8 is faster than 4 or 16
  • SWITCH_RATIO=4 works well (any other value decreases performance)
  • The sgemm and zgemm kernels like having their N and M parameters set to 4 (for both kernels). zlinpack goes from ~15 GFLOPS to ~22 GFLOPS when doing so, but I start to get incorrect results. I have seen that KERNEL.ZEN should be changed when N and M are changed, but I don't know what the valid combinations are.
  • GEMM_DEFAULT_OFFSET_A=256 and GEMM_DEFAULT_OFFSET_B=1024 was the fastest combination for the zlinpack.goto benchmark
  • Regarding benchmarks, zlinpack is quite sensitive to the parameters and works best with 8 threads. slinpack really wants only one thread and seems memory-bound: whatever I do with the parameters, performance doesn't change in any meaningful way.

Please note that I have tested all of the above values without really knowing what they mean, so some of them may not make any sense; a purely illustrative param.h fragment collecting them follows below.
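
For illustration only (these are the brute-forced numbers from this comment, not values the patch actually commits):

/* Illustrative only: brute-forced values from the notes above, not part of this patch */
#define SNUMOPT                8
#define DNUMOPT                4
#define GEMM_DEFAULT_ALIGN     0x1fffUL
#define SYMV_P                 8
#define SWITCH_RATIO           4
#define GEMM_DEFAULT_OFFSET_A  256
#define GEMM_DEFAULT_OFFSET_B  1024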

As a remaining problem, OpenBLAS detects 16 cores while my Ryzen CPU has 8 cores and 16 threads. Manually forcing OMP_NUM_THREADS to 8 leads to quite a nice performance boost as the threads stop competing for cache and memory accesses.

If you want SSH access to a Ryzen 1700 machine (that has a public IP address), we can arrange that.

@brada4 (Contributor) commented Mar 20, 2017

Regarding the second-to-last paragraph: is it slower with both threads active, or is there just no gain over a single thread?
On Linux it would be something like:

OPENBLAS_NUM_THREADS=1 /usr/bin/time taskset 0x1 (random benchmark)
OPENBLAS_NUM_THREADS=2 /usr/bin/time taskset 0x3 (same benchmark)

@steckdenis (Contributor, Author)

Performance depends on the number of threads in quite a complex way. Here are detailed timing results for zlinpack (last row, "200"):

OPENBLAS_NUM_THREADS=1 taskset 0x1 ./zlinpack.goto => 10.4 GFLOPS
OPENBLAS_NUM_THREADS=2 taskset 0x5 ./zlinpack.goto => 11.7 GFLOPS
OPENBLAS_NUM_THREADS=3 taskset 0x15 ./zlinpack.goto => 15.2 GFLOPS
OPENBLAS_NUM_THREADS=4 taskset 0x55 ./zlinpack.goto => 18.8 GFLOPS
OPENBLAS_NUM_THREADS=5 taskset 0x155 ./zlinpack.goto => 19.1 GFLOPS*
OPENBLAS_NUM_THREADS=6 taskset 0x555 ./zlinpack.goto => 19.4 GFLOPS
OPENBLAS_NUM_THREADS=7 taskset 0x1555 ./zlinpack.goto => 18.8 GFLOPS
OPENBLAS_NUM_THREADS=8 taskset 0x5555 ./zlinpack.goto => 18.0 GFLOPS (with high variance)
OPENBLAS_NUM_THREADS=9 taskset 0x5557 ./zlinpack.goto => 12.6 GFLOPS
OPENBLAS_NUM_THREADS=9 ./zlinpack.goto => 12.3 GFLOPS
OPENBLAS_NUM_THREADS=10 taskset 0x555F ./zlinpack.goto => 12.9 GFLOPS*
OPENBLAS_NUM_THREADS=11 taskset 0x557F ./zlinpack.goto => 12.7 GFLOPS
OPENBLAS_NUM_THREADS=16 taskset 0xFFFF ./zlinpack.goto => 10.6 GFLOPS (with high variance)

  (*) 0.004 GFLOPS for sizes 20 to 150 (the exact sizes affected vary from run to run)! About 50% of the CPU time is spent in inner_advanced_thread when this performance hit occurs; the slowdown happens in the "Decompose" phase of the benchmark.

We see that going above 8 threads starts to use SMT siblings and more or less kills performance. Odd thread counts also seem to exhibit strange behavior.

Here are the results for slinpack:

OPENBLAS_NUM_THREADS=1 taskset 0x1 ./slinpack.goto => 13.5 GFLOPS
OPENBLAS_NUM_THREADS=2 taskset 0x5 ./slinpack.goto => 13.7 GFLOPS
OPENBLAS_NUM_THREADS=3 taskset 0x15 ./slinpack.goto => 15.1 GFLOPS
OPENBLAS_NUM_THREADS=4 taskset 0x55 ./slinpack.goto => 16.3 GFLOPS
OPENBLAS_NUM_THREADS=5 taskset 0x155 ./slinpack.goto => 13.2 GFLOPS*
OPENBLAS_NUM_THREADS=6 taskset 0x555 ./slinpack.goto => 12.0 GFLOPS
OPENBLAS_NUM_THREADS=7 taskset 0x1555 ./slinpack.goto => 11.4 GFLOPS
OPENBLAS_NUM_THREADS=8 taskset 0x5555 ./slinpack.goto => 10.5 GFLOPS
OPENBLAS_NUM_THREADS=9 taskset 0x5557 ./slinpack.goto => 7.7 GFLOPS
OPENBLAS_NUM_THREADS=10 ./slinpack.goto => 6.7 GFLOPS
OPENBLAS_NUM_THREADS=16 ./slinpack.goto => 5.6 GFLOPS

Tests were run on an AMD Ryzen 7 1700 at stock clock speeds (3.0 GHz base, 3.2 GHz all-core boost; I cannot tell on Linux whether boost was enabled), on an MSI B350 Tomahawk board with 2x 4 GB DDR4-2600 RAM.

@brada4 (Contributor) commented Mar 20, 2017

Something like this: the first hyperthread alone vs. both running, for a second or ten, i.e. whether there is a gain or a loss from concurrent use of the same core (as you can see from my Ivy Bridge laptop i3, the result is not a regression):

>OPENBLAS_NUM_THREADS=1 taskset 0x1 ./slinpack.goto 5000 5000 1
From : 5000  To : 5000 Step =   1
   SIZE       Residual     Decompose            Solve           Total
   5000 :   6.198883e-03    35147.91 MFlops    4225.47 MFlops   34994.34 MFlops
>OPENBLAS_NUM_THREADS=1 taskset 0x2 ./slinpack.goto 5000 5000 1
From : 5000  To : 5000 Step =   1
   SIZE       Residual     Decompose            Solve           Total
   5000 :   4.708052e-03    35225.97 MFlops    4372.92 MFlops   35077.56 MFlops
>OPENBLAS_NUM_THREADS=2 taskset 0x3 ./slinpack.goto 5000 5000 1
From : 5000  To : 5000 Step =   1
   SIZE       Residual     Decompose            Solve           Total
   5000 :   2.069950e-03    40787.35 MFlops    4014.45 MFlops   40564.54 MFlops

@steckdenis (Contributor, Author) commented Mar 20, 2017

OK, I understand now (in my runs above I purposefully avoided putting two threads on the same physical core):

> OPENBLAS_NUM_THREADS=1 taskset 0x1 ./slinpack.goto 5000 5000 1
From : 5000  To : 5000 Step =   1
   SIZE       Residual     Decompose            Solve           Total
   5000 :   4.219890e-03    33108.05 MFlops    6485.93 MFlops   33026.76 MFlops
> OPENBLAS_NUM_THREADS=1 taskset 0x2 ./slinpack.goto 5000 5000 1
From : 5000  To : 5000 Step =   1
   SIZE       Residual     Decompose            Solve           Total
   5000 :   4.902124e-03    32844.99 MFlops    6450.78 MFlops   32764.61 MFlops
> OPENBLAS_NUM_THREADS=2 taskset 0x3 ./slinpack.goto 5000 5000 1
From : 5000  To : 5000 Step =   1
   SIZE       Residual     Decompose            Solve           Total
   5000 :   1.134348e-02    25569.05 MFlops    3279.55 MFlops   25465.27 MFlops

By the way, I'm very impressed by your laptop CPU. For comparison, with 8 threads on my Ryzen I get 195343.00 MFlops, so scaling seems to work well; with all 16 threads used I get 157717.47 MFlops.

@brada4 (Contributor) commented Mar 20, 2017

Indeed, your numbers back your point that two threads per core is a loss.
Can you run something like lstopo, to check whether the kernel correctly recognizes the topology:
lstopo --of console

@steckdenis (Contributor, Author)

Here it is:

Machine (7997MB)
  Socket L#0
    L3 L#0 (8192KB)
      L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (64KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#1)
      L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (64KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#3)
      L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (64KB) + Core L#2
        PU L#4 (P#4)
        PU L#5 (P#5)
      L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (64KB) + Core L#3
        PU L#6 (P#6)
        PU L#7 (P#7)
    L3 L#1 (8192KB)
      L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (64KB) + Core L#4
        PU L#8 (P#8)
        PU L#9 (P#9)
      L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (64KB) + Core L#5
        PU L#10 (P#10)
        PU L#11 (P#11)
      L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (64KB) + Core L#6
        PU L#12 (P#12)
        PU L#13 (P#13)
      L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (64KB) + Core L#7
        PU L#14 (P#14)
        PU L#15 (P#15)
  HostBridge L#0
    PCIBridge
      PCI 1022:43b7
        Block L#0 "sda"
      PCIBridge
        PCIBridge
          PCI 10ec:8168
            Net L#1 "eth0"
    PCIBridge
      PCI 1002:6779
        GPU L#2 "renderD128"
        GPU L#3 "card0"
        GPU L#4 "controlD64"
    PCIBridge
      PCI 1022:7901

@brada4 (Contributor) commented Mar 21, 2017

Looks reasonable, most likely absolutely correct.

martin-frbg mentioned this pull request on Mar 22, 2017
@martin-frbg (Collaborator)

Thanks for the patch; it looks good to me. I had only held back on committing to give more senior team members a chance to comment.
OpenBLAS detecting 16 cores is normal (though not always desirable), I think, for a system capable of "hyperthreading" (MAX_CPU_NUMBER gets set from NUM_THREADS in the build system). I have not been able to find any whitepaper on optimizing for Ryzen yet, only some rather dubious claims of "avoid AVX" or "avoid software prefetch". Perhaps it would make sense to copy your implementation notes to the wiki so that they do not get buried here.
