ZEN kernels perform worse, give wrong results, compared to HASWELL kernels on Zen/Ryzen #1147

gcp · 2017-04-10T15:44:46Z

My application extensively uses SGEMM kernels with sizes:

M=128, N=361, K=1152
M=32, N=361, K=288

(This is an im2col+SGEMM combo for DCNN computation)
Single-threaded (application itself is multithreaded) with OPENBLAS_CORETYPE=Haswell

1000 predictions in 37.00 seconds -> 27 p/s
1000 evaluations in 4.29 seconds -> 233 p/s

Static build for Zen (see previous issue, dynamic dispatch is broken):
1000 predictions in 40.23 seconds -> 24 p/s
1000 evaluations in 4.50 seconds -> 222 p/s

So performance tanks about 5% to 20%.

So, "Zen" support in OpenBLAS actually worsens performance on Zen.

brada4 · 2017-04-10T15:48:40Z

Which part of combo is _gemm ?

gcp · 2017-04-10T15:50:27Z

They both use im2col and sgemm alternations, the used matrix sizes are in the second paragraph. Isolating the sgemm kernels will likely indicate a much larger performance drop.

brada4 · 2017-04-10T16:04:43Z

Can you record performance (perf record -- 'command') and show the report (you can vary different OPENBLAS_CORETYPE -s)
If im2col is from octave then it uses no BLAS at all, it is really important to isolate it out....

gcp · 2017-04-10T16:14:02Z

This is custom written C++ code. I have a standalone tester at https://github.com/gcp/sgemm

I'll produce a run on the Ryzen with the ZEN kernels. The ones for Haswell and comparison to MKL are already there: https://github.com/gcp/sgemm/blob/master/results/sgemm.txt

gcp · 2017-04-10T16:31:44Z

OPENBLAS_CORETYPE=Haswell (dynamic)

Running SGEMM with M=128, N=361, K=1152, alpha=1.000000, lda=1152, ldb=361, beta=0.000000, ldc=361
13.082 FLOP/clk

TARGET=ZEN (static)
Running SGEMM with M=128, N=361, K=1152, alpha=1.000000, lda=1152, ldb=361, beta=0.000000, ldc=361
11.566 FLOP/clk

About 13% worse.

martin-frbg · 2017-04-10T17:16:55Z

@steckdenis any idea ? Could be different matrix sizes than you used for your performance tests perhaps ?

brada4 · 2017-04-10T17:36:59Z

@gcp I see you fixed cpuid...
can you try excavator core too ? and one thread vs all vs half of all threads?

gcp · 2017-04-10T17:42:56Z

I'm not sure Excavator can work correctly as Ryzen does not have FMA4, only FMA3.

gcp · 2017-04-10T17:49:24Z

BLAS Core: Excavator
Running SGEMM with M=128, N=361, K=1152, alpha=1.000000, lda=1152, ldb=361, beta=0.000000, ldc=361
11.743 FLOP/clk

Maybe the "Excavator param.h tuning" wasn't a good idea for Zen, as it seems the default Haswell values are just better here. I don't think Zen has much in common with Excavator anyway...

gcp · 2017-04-10T17:52:43Z

By the way, if I increase the thread count:

export OPENBLAS_NUM_THREADS=8

BLAS Core: Zen
Running SGEMM with M=128, N=361, K=1152, alpha=1.000000, lda=1152, ldb=361, beta=0.000000, ldc=361
4387428 3867300 4269600 4361508 4261752 4263048 4265460 4298256 4250124 4253616 4372416: med 4265460. 24.959 FLOP/clk
1.051e+01/1.133e+01=9.275e-01. 6.426e+01 at [ 47, 8] 1.6712797165e+01 vs -4.7552200317e+01 FAIL !!!
1.062e+01/1.134e+01=9.365e-01. 6.271e+01 at [ 20,243] 2.5226888657e+01 vs -3.7482173920e+01 FAIL !!!
1.043e+01/1.128e+01=9.242e-01. 6.465e+01 at [ 4,217] -3.6501312256e+01 vs 2.8144220352e+01 FAIL !!!
1.047e+01/1.133e+01=9.246e-01. 6.424e+01 at [ 93, 22] 3.9190971375e+01 vs -2.5052385330e+01 FAIL !!!
1.049e+01/1.130e+01=9.278e-01. 6.287e+01 at [103, 18] -3.9003196716e+01 vs 2.3862531662e+01 FAIL !!!
1.062e+01/1.135e+01=9.360e-01. 8.162e+01 at [127,173] -3.5125022888e+01 vs 4.6492496490e+01 FAIL !!!
1.052e+01/1.134e+01=9.277e-01. 5.987e+01 at [117,170] -2.8317489624e+01 vs 3.1555690765e+01 FAIL !!!
1.051e+01/1.128e+01=9.316e-01. 6.313e+01 at [ 68, 96] 5.1148719788e+01 vs -1.1981718063e+01 FAIL !!!
1.050e+01/1.136e+01=9.246e-01. 6.462e+01 at [116,216] -2.9091817856e+01 vs 3.5525985718e+01 FAIL !!!
1.056e+01/1.129e+01=9.350e-01. 7.081e+01 at [ 85, 24] 3.0425312042e+01 vs -4.0386077881e+01 FAIL !!!
1.055e+01/1.134e+01=9.302e-01. 7.643e+01 at [117, 43] 3.8741443634e+01 vs -3.7692474365e+01 FAIL !!!

Note the "FAIL!!!" warnings.

The Zen kernels don't even give the right result.

brada4 · 2017-04-10T18:25:50Z

Your CPU clocks between 3.0 and 3.7Ghz, in conjunction with constand+nonstop tsc+rdtsc that makes the dozen percent variation.
Can you disable turbo boost /sys/devices/system/cpu/cpufreq/boost for measurements?
(using 2 accurate kernels, 1 thread, whole data moved is few megs in this case, 8 CPUs may hurt if engaged)

gcp · 2017-04-10T18:46:25Z

My CPU is locked to a constant 3.6GHz, exactly to avoid this problem. The measurement takes the median of 11 passes. This is not a measurement problem. The performance degradation is clearly and repeatedly measurable.

To say nothing of the kernel not even producing correct results when using multiple threads...

brada4 · 2017-04-10T19:05:24Z

Official spec says 3.0GHz. Can you retry please (With HASWELL and EXCAVATOR sgemm) as nobody else has such new CPU?

fshi98 · 2017-04-10T19:53:59Z

If build with " make TARGET=ZEN" , there are build errors:
zherk_k.c:(.text+0x44c): undefined reference to zgemm_incopy' zherk_k.c:(.text+0x824): undefined reference to zgemm_incopy'
collect2: error: ld returned 1 exit status
Makefile:165: recipe for target 'zblat3' failed
make[1]: *** [zblat3] Error 1
make[1]: *** Waiting for unfinished jobs....
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./dblat2 < ./dblat2.dat
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./cblat2 < ./cblat2.dat
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./zblat2 < ./zblat2.dat
rm -f ?BLAT2.SUMM

However, if build without "TARGET=ZEN" , it will be ok.

fshi98 · 2017-04-10T19:56:30Z

Other problem for AMD Ryzen cpu is: if build with "# NO_AVX2 = 1", the output will be wrong (although there are no problem to build and run the code). Uncomment " NO_AVX2 = 1", the results are correct

gcp · 2017-04-10T20:00:37Z

Official spec says 3.0GHz.

I know what they say. I repeat: this was tested at a fixed clock of 3.6GHz, exactly to avoid measurement problems due to boosting.

Can you retry please (With HASWELL and EXCAVATOR sgemm) as nobody else has such new CPU?

Those results are already included above. I did not check correctness of the Excavator kernel with multiple threads, but Haswell works correctly and is faster.

fshi98 · 2017-04-10T20:06:08Z

for haswell, there is no any problem.
if build with "# NO_AVX2 = 1" or " NO_AVX2 = 1", the results are same

gcp · 2017-04-10T20:08:08Z

Thanks @fshi98, that confirms what I'm seeing: the Zen kernels are broken and give erroneous results.

fshi98 · 2017-04-10T20:11:03Z

@gcp exactly. If not build with AVX2, it runs ok and gives correct results. I think the problem is with ZEN AVX2 support

gcp · 2017-04-10T20:12:25Z

Running with Excavator kernels also produces correct (but slower) results. It's only OPENBLAS_CORETYPE=ZEN that produces wrong results.

fshi98 · 2017-04-10T20:14:06Z

speed wise, if build HASWELL with AVX2, ZEN and i7-4790 almost same, but if not build them (run from ubuntu apt-get, the ZEN 1800 is around 15% faster. I am testing the code for CAFFE

brada4 · 2017-04-10T22:41:26Z

@gcp - please test with stable hpt, or with wall clock. Permanenet turbo mode os not possible on any cpu with any cooling.

brada4 · 2017-04-10T22:43:27Z

@fshi98 - if you run make clean between builds, they have better chance to succeed.

fshi98 · 2017-04-10T22:50:05Z

@brada4 Thanks. I did make clean for couple of times, and also tried fresh git clone, but without luck if build with " make TARGET=ZEN". Will try more.

gcp · 2017-04-11T07:13:18Z

please test with stable hpt, or with wall clock. Permanenet turbo mode os not possible on any cpu with any cooling.

The first result I posted here (combined kernel) was with wall clock. Permanently locking the clockspeed is perfectly possible with many mainboards. For the THIRD time: this is a reproducible regression regardless of measurement method, with and without turbo, with wallclock measurement, and with rdtsc+IPC/FLOPS clock measurement. Stop this line of arguing, you're wasting both of our time and I'm tired of it.

The kernel gives WRONG RESULTS anyway, so who cares about performance. Right now OpenBLAS is outright broken on Zen.

brada4 · 2017-04-11T07:40:28Z

I dont doubt (in)accuracy of 'zen' kernel.
I just want to know if current Zen can be replaced by Haswell (or Excavator with FMA4 replaced with one used by Haswell) kernels right away, but you posted timing measurements within the error caused by turbo frequency.

martin-frbg · 2017-04-11T09:49:25Z

@brada4 I do not think we are at a point yet where small variations in benchmark numbers are likely to make any difference.
The current Zen support is preliminary and some caveats can be found in its PR thread #1133 , I committed it to get it more exposure as no serious concerns were voiced in the week since it was posted, but did not expect it to perform that poorly.

brada4 · 2017-04-11T09:58:55Z

@martin-frbg It could be haswell-copy IMO (it mostly is), just that %% for it is not counted in good weather...

steckdenis · 2017-04-11T10:08:14Z

Hi,

This issue is very intriguing. I added Zen support by making Zen use the exact same kernels as Haswell. However, I also observed lower performance than TARGET=HASWELL, without being able to explain it. It may be the case that some kernels are somewhat generic, and contain a couple of #ifdefs on the CPU type, that I may have missed.

Regarding incorrect results, I got them only when I tried to fine-tune param.h for Zen, but there may still be a problem that I did not see.

All in all, I once thought that the fastest way of supporting Ryzen would be to detect it as HASWELL (not as a different ZEN CPU), and use everything Haswell. However, given recent results that, for instance, say that movntp* instructions are extremely slow on Ryzen, I still prefer the "correct" route of detecting Ryzen as Zen and maybe implement new kernels for it.

By the way, the reason I submitted that patch is that I ran a Numpy program on my openSUSE 42.2 distro, with a dynamic OpenBLAS, and my CPU was not detected at all. OpenBLAS used generic and very slow kernels. Proper Zen detection sped up my program by almost 3x.

martin-frbg · 2017-04-11T10:26:00Z

@steckdenis I wonder if you could reproduce gcp's SGEMM failures with the standalone tester he posted above (and also if/why building with "make TARGET=ZEN" appears to be broken now as reported by fshi98) ? I do not think yesterday's patches for the runtime dynamic detection could be at fault, but perhaps your PR may not have included all of what you changed and tested locally ?

steckdenis · 2017-04-11T11:08:03Z

Allright, I tested everything and I have good news. It seems that my param.h, copied from Excavator, was responsible for the inaccuracies, compile errors and bad performance. The problem was hidden in my original patch because I forgot to increase an "i" somewhere (see one of the latest "fix Zen" commits).

I now use the complete Haswell param.h defines for Zen (see fix.txt). This allows me to compile OpenBLAS with TARGET=ZEN, and the sgemm test suite seems to work (see results.txt). Moreover, on my Ryzen 1700 overclocked to 3.7 Ghz (constant) with 4x 4GB DDR4 2400 Mhz RAM, OpenBLAS does 14.423 flops/cycle (default "./sgemm", so I think this is multithreaded), which is above any of the hacks.

martin-frbg · 2017-04-11T11:26:28Z

Sounds great. Could you try this with "OPENBLAS_NUM_THREADS=8" as well please (as that appears to be what triggered most of the failures above), and create a new PR if all looks good ?

steckdenis · 2017-04-11T11:36:35Z

It also seems to be working with 8 threads. I'll just wait for a confirmation from @gcp that my results.txt file actually indicates success.

el-oso · 2017-04-16T20:24:25Z

Hmm... I actually ran into a similar issue. I was running T-SNE with Torch and I got nan's using OpenBLAS, I recompiled it using TARGET=HASWELL and NO_AVX2=1 and now nan's are gone. I am not sure about the speed though. I could not recompile it using TARGET=ZEN. I got the same errors as gcp. Applied his patch and seems working as well.

Will give more info as I work it out.

brada4 · 2017-04-16T21:37:59Z

Just post last core from /proc/cpuinfo , you can be trusted errors are same :-)

gcp · 2017-04-18T10:43:48Z

Reverting to Haswell param.h (the fix.txt) works for me. Performance is good again, results are correct.

@steckdenis I added your patch to my branch and will issue a pull request from there.

steckdenis · 2017-04-18T10:46:39Z

Wonderful! Yes, feel free to issue a pull request from your branch.

martin-frbg · 2017-04-18T11:35:01Z

Merged, thanks for your work all of you. I guess we are now back to something like "support AMD Ryzen through Haswell kernels" - which is not a bad thing at all IMHO.

gcp · 2017-04-18T16:25:05Z

One small cleanup remaining:

getarch.c in FORCE_ZEN has LIBNAME and CORENAME = excavator

gcp · 2017-04-19T16:39:26Z

I'm going to close this issue because I think we fixed all known problems reported here.

gcp · 2017-05-02T09:33:38Z

I'm not sure Excavator can work correctly as Ryzen does not have FMA4, only FMA3.

For future reference: it turns out that Ryzen does support FMA4, it just does not document this through CPUID flags. That's why the Excavator kernels didn't crash with SIGILL.

brada4 · 2017-05-02T10:33:48Z

Haswell is better choice, some virtualisation may pedantically mask off-cpuid instructions out.

gcp changed the title ~~ZEN kernels perform worse than HASWELL kernels on Zen/Ryzen~~ ZEN kernels perform worse, give wrong results, compared to HASWELL kernels on Zen/Ryzen Apr 10, 2017

gcp mentioned this issue Apr 18, 2017

Revert Zen param.h to Haswell values (instead of Excavator). #1157

Merged

martin-frbg mentioned this issue Apr 19, 2017

Make FORCE_ZEN option in getarch.c actually set target names to ZEN #1158

Merged

gcp closed this as completed Apr 19, 2017

martin-frbg mentioned this issue Dec 6, 2017

symv broken on development version for ZEN #1388

Closed

martin-frbg mentioned this issue Feb 13, 2018

thread safety in openblas 0.2.20 #1425

Closed

tkswe88 mentioned this issue Feb 14, 2018

performance on AMD Ryzen and Threadripper #1461

Open

ZEN kernels perform worse, give wrong results, compared to HASWELL kernels on Zen/Ryzen #1147

ZEN kernels perform worse, give wrong results, compared to HASWELL kernels on Zen/Ryzen #1147

Comments

gcp commented Apr 10, 2017

brada4 commented Apr 10, 2017

gcp commented Apr 10, 2017

brada4 commented Apr 10, 2017

gcp commented Apr 10, 2017

gcp commented Apr 10, 2017

martin-frbg commented Apr 10, 2017

brada4 commented Apr 10, 2017

gcp commented Apr 10, 2017

gcp commented Apr 10, 2017

gcp commented Apr 10, 2017 • edited Loading

brada4 commented Apr 10, 2017 • edited Loading

gcp commented Apr 10, 2017 • edited Loading

brada4 commented Apr 10, 2017

fshi98 commented Apr 10, 2017

fshi98 commented Apr 10, 2017

gcp commented Apr 10, 2017 • edited Loading

fshi98 commented Apr 10, 2017

gcp commented Apr 10, 2017

fshi98 commented Apr 10, 2017

gcp commented Apr 10, 2017

fshi98 commented Apr 10, 2017

brada4 commented Apr 10, 2017

brada4 commented Apr 10, 2017

fshi98 commented Apr 10, 2017

gcp commented Apr 11, 2017 • edited Loading

brada4 commented Apr 11, 2017

martin-frbg commented Apr 11, 2017

brada4 commented Apr 11, 2017

steckdenis commented Apr 11, 2017

martin-frbg commented Apr 11, 2017

steckdenis commented Apr 11, 2017

martin-frbg commented Apr 11, 2017

steckdenis commented Apr 11, 2017

el-oso commented Apr 16, 2017 • edited Loading

brada4 commented Apr 16, 2017

gcp commented Apr 18, 2017

steckdenis commented Apr 18, 2017

martin-frbg commented Apr 18, 2017

gcp commented Apr 18, 2017

gcp commented Apr 19, 2017 • edited Loading

gcp commented May 2, 2017

brada4 commented May 2, 2017

gcp commented Apr 10, 2017 •

edited

Loading

brada4 commented Apr 10, 2017 •

edited

Loading

gcp commented Apr 10, 2017 •

edited

Loading

gcp commented Apr 10, 2017 •

edited

Loading

gcp commented Apr 11, 2017 •

edited

Loading

el-oso commented Apr 16, 2017 •

edited

Loading

gcp commented Apr 19, 2017 •

edited

Loading