Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZEN kernels perform worse, give wrong results, compared to HASWELL kernels on Zen/Ryzen #1147

Closed
gcp opened this issue Apr 10, 2017 · 42 comments

Comments

@gcp
Copy link
Contributor

gcp commented Apr 10, 2017

My application extensively uses SGEMM kernels with sizes:

M=128, N=361, K=1152
M=32, N=361, K=288

(This is an im2col+SGEMM combo for DCNN computation)
Single-threaded (application itself is multithreaded) with OPENBLAS_CORETYPE=Haswell

1000 predictions in 37.00 seconds -> 27 p/s
1000 evaluations in 4.29 seconds -> 233 p/s

Static build for Zen (see previous issue, dynamic dispatch is broken):
1000 predictions in 40.23 seconds -> 24 p/s
1000 evaluations in 4.50 seconds -> 222 p/s

So performance tanks about 5% to 20%.

So, "Zen" support in OpenBLAS actually worsens performance on Zen.

@brada4
Copy link
Contributor

brada4 commented Apr 10, 2017

Which part of combo is _gemm ?

@gcp
Copy link
Contributor Author

gcp commented Apr 10, 2017

They both use im2col and sgemm alternations, the used matrix sizes are in the second paragraph. Isolating the sgemm kernels will likely indicate a much larger performance drop.

@brada4
Copy link
Contributor

brada4 commented Apr 10, 2017

Can you record performance (perf record -- 'command') and show the report (you can vary different OPENBLAS_CORETYPE -s)
If im2col is from octave then it uses no BLAS at all, it is really important to isolate it out....

@gcp
Copy link
Contributor Author

gcp commented Apr 10, 2017

This is custom written C++ code. I have a standalone tester at https://github.com/gcp/sgemm

I'll produce a run on the Ryzen with the ZEN kernels. The ones for Haswell and comparison to MKL are already there: https://github.com/gcp/sgemm/blob/master/results/sgemm.txt

@gcp
Copy link
Contributor Author

gcp commented Apr 10, 2017

OPENBLAS_CORETYPE=Haswell (dynamic)

Running SGEMM with M=128, N=361, K=1152, alpha=1.000000, lda=1152, ldb=361, beta=0.000000, ldc=361
13.082 FLOP/clk

TARGET=ZEN (static)
Running SGEMM with M=128, N=361, K=1152, alpha=1.000000, lda=1152, ldb=361, beta=0.000000, ldc=361
11.566 FLOP/clk

About 13% worse.

@martin-frbg
Copy link
Collaborator

@steckdenis any idea ? Could be different matrix sizes than you used for your performance tests perhaps ?

@brada4
Copy link
Contributor

brada4 commented Apr 10, 2017

@gcp I see you fixed cpuid...
can you try excavator core too ? and one thread vs all vs half of all threads?

@gcp
Copy link
Contributor Author

gcp commented Apr 10, 2017

I'm not sure Excavator can work correctly as Ryzen does not have FMA4, only FMA3.

@gcp
Copy link
Contributor Author

gcp commented Apr 10, 2017

BLAS Core: Excavator
Running SGEMM with M=128, N=361, K=1152, alpha=1.000000, lda=1152, ldb=361, beta=0.000000, ldc=361
11.743 FLOP/clk

Maybe the "Excavator param.h tuning" wasn't a good idea for Zen, as it seems the default Haswell values are just better here. I don't think Zen has much in common with Excavator anyway...

@gcp
Copy link
Contributor Author

gcp commented Apr 10, 2017

By the way, if I increase the thread count:

export OPENBLAS_NUM_THREADS=8

BLAS Core: Zen
Running SGEMM with M=128, N=361, K=1152, alpha=1.000000, lda=1152, ldb=361, beta=0.000000, ldc=361
4387428 3867300 4269600 4361508 4261752 4263048 4265460 4298256 4250124 4253616 4372416: med 4265460. 24.959 FLOP/clk
1.051e+01/1.133e+01=9.275e-01. 6.426e+01 at [ 47, 8] 1.6712797165e+01 vs -4.7552200317e+01 FAIL !!!
1.062e+01/1.134e+01=9.365e-01. 6.271e+01 at [ 20,243] 2.5226888657e+01 vs -3.7482173920e+01 FAIL !!!
1.043e+01/1.128e+01=9.242e-01. 6.465e+01 at [ 4,217] -3.6501312256e+01 vs 2.8144220352e+01 FAIL !!!
1.047e+01/1.133e+01=9.246e-01. 6.424e+01 at [ 93, 22] 3.9190971375e+01 vs -2.5052385330e+01 FAIL !!!
1.049e+01/1.130e+01=9.278e-01. 6.287e+01 at [103, 18] -3.9003196716e+01 vs 2.3862531662e+01 FAIL !!!
1.062e+01/1.135e+01=9.360e-01. 8.162e+01 at [127,173] -3.5125022888e+01 vs 4.6492496490e+01 FAIL !!!
1.052e+01/1.134e+01=9.277e-01. 5.987e+01 at [117,170] -2.8317489624e+01 vs 3.1555690765e+01 FAIL !!!
1.051e+01/1.128e+01=9.316e-01. 6.313e+01 at [ 68, 96] 5.1148719788e+01 vs -1.1981718063e+01 FAIL !!!
1.050e+01/1.136e+01=9.246e-01. 6.462e+01 at [116,216] -2.9091817856e+01 vs 3.5525985718e+01 FAIL !!!
1.056e+01/1.129e+01=9.350e-01. 7.081e+01 at [ 85, 24] 3.0425312042e+01 vs -4.0386077881e+01 FAIL !!!
1.055e+01/1.134e+01=9.302e-01. 7.643e+01 at [117, 43] 3.8741443634e+01 vs -3.7692474365e+01 FAIL !!!

Note the "FAIL!!!" warnings.

The Zen kernels don't even give the right result.

@gcp gcp changed the title ZEN kernels perform worse than HASWELL kernels on Zen/Ryzen ZEN kernels perform worse, give wrong results, compared to HASWELL kernels on Zen/Ryzen Apr 10, 2017
@brada4
Copy link
Contributor

brada4 commented Apr 10, 2017

Your CPU clocks between 3.0 and 3.7Ghz, in conjunction with constand+nonstop tsc+rdtsc that makes the dozen percent variation.
Can you disable turbo boost /sys/devices/system/cpu/cpufreq/boost for measurements?
(using 2 accurate kernels, 1 thread, whole data moved is few megs in this case, 8 CPUs may hurt if engaged)

@gcp
Copy link
Contributor Author

gcp commented Apr 10, 2017

My CPU is locked to a constant 3.6GHz, exactly to avoid this problem. The measurement takes the median of 11 passes. This is not a measurement problem. The performance degradation is clearly and repeatedly measurable.

To say nothing of the kernel not even producing correct results when using multiple threads...

@brada4
Copy link
Contributor

brada4 commented Apr 10, 2017

Official spec says 3.0GHz. Can you retry please (With HASWELL and EXCAVATOR sgemm) as nobody else has such new CPU?

@fshi98
Copy link

fshi98 commented Apr 10, 2017

If build with " make TARGET=ZEN" , there are build errors:
zherk_k.c:(.text+0x44c): undefined reference to zgemm_incopy' zherk_k.c:(.text+0x824): undefined reference to zgemm_incopy'
collect2: error: ld returned 1 exit status
Makefile:165: recipe for target 'zblat3' failed
make[1]: *** [zblat3] Error 1
make[1]: *** Waiting for unfinished jobs....
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./dblat2 < ./dblat2.dat
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./cblat2 < ./cblat2.dat
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./zblat2 < ./zblat2.dat
rm -f ?BLAT2.SUMM

However, if build without "TARGET=ZEN" , it will be ok.

@fshi98
Copy link

fshi98 commented Apr 10, 2017

Other problem for AMD Ryzen cpu is: if build with "# NO_AVX2 = 1", the output will be wrong (although there are no problem to build and run the code). Uncomment " NO_AVX2 = 1", the results are correct

@gcp
Copy link
Contributor Author

gcp commented Apr 10, 2017

Official spec says 3.0GHz.

I know what they say. I repeat: this was tested at a fixed clock of 3.6GHz, exactly to avoid measurement problems due to boosting.

Can you retry please (With HASWELL and EXCAVATOR sgemm) as nobody else has such new CPU?

Those results are already included above. I did not check correctness of the Excavator kernel with multiple threads, but Haswell works correctly and is faster.

@fshi98
Copy link

fshi98 commented Apr 10, 2017

for haswell, there is no any problem.
if build with "# NO_AVX2 = 1" or " NO_AVX2 = 1", the results are same

@gcp
Copy link
Contributor Author

gcp commented Apr 10, 2017

Thanks @fshi98, that confirms what I'm seeing: the Zen kernels are broken and give erroneous results.

@fshi98
Copy link

fshi98 commented Apr 10, 2017

@gcp exactly. If not build with AVX2, it runs ok and gives correct results. I think the problem is with ZEN AVX2 support

@gcp
Copy link
Contributor Author

gcp commented Apr 10, 2017

Running with Excavator kernels also produces correct (but slower) results. It's only OPENBLAS_CORETYPE=ZEN that produces wrong results.

@fshi98
Copy link

fshi98 commented Apr 10, 2017

speed wise, if build HASWELL with AVX2, ZEN and i7-4790 almost same, but if not build them (run from ubuntu apt-get, the ZEN 1800 is around 15% faster. I am testing the code for CAFFE

@brada4
Copy link
Contributor

brada4 commented Apr 10, 2017

@gcp - please test with stable hpt, or with wall clock. Permanenet turbo mode os not possible on any cpu with any cooling.

@brada4
Copy link
Contributor

brada4 commented Apr 10, 2017

@fshi98 - if you run make clean between builds, they have better chance to succeed.

@fshi98
Copy link

fshi98 commented Apr 10, 2017

@brada4 Thanks. I did make clean for couple of times, and also tried fresh git clone, but without luck if build with " make TARGET=ZEN". Will try more.

@gcp
Copy link
Contributor Author

gcp commented Apr 11, 2017

please test with stable hpt, or with wall clock. Permanenet turbo mode os not possible on any cpu with any cooling.

The first result I posted here (combined kernel) was with wall clock. Permanently locking the clockspeed is perfectly possible with many mainboards. For the THIRD time: this is a reproducible regression regardless of measurement method, with and without turbo, with wallclock measurement, and with rdtsc+IPC/FLOPS clock measurement. Stop this line of arguing, you're wasting both of our time and I'm tired of it.

The kernel gives WRONG RESULTS anyway, so who cares about performance. Right now OpenBLAS is outright broken on Zen.

@brada4
Copy link
Contributor

brada4 commented Apr 11, 2017

I dont doubt (in)accuracy of 'zen' kernel.
I just want to know if current Zen can be replaced by Haswell (or Excavator with FMA4 replaced with one used by Haswell) kernels right away, but you posted timing measurements within the error caused by turbo frequency.

@martin-frbg
Copy link
Collaborator

@brada4 I do not think we are at a point yet where small variations in benchmark numbers are likely to make any difference.
The current Zen support is preliminary and some caveats can be found in its PR thread #1133 , I committed it to get it more exposure as no serious concerns were voiced in the week since it was posted, but did not expect it to perform that poorly.

@brada4
Copy link
Contributor

brada4 commented Apr 11, 2017

@martin-frbg It could be haswell-copy IMO (it mostly is), just that %% for it is not counted in good weather...

@steckdenis
Copy link
Contributor

Hi,

This issue is very intriguing. I added Zen support by making Zen use the exact same kernels as Haswell. However, I also observed lower performance than TARGET=HASWELL, without being able to explain it. It may be the case that some kernels are somewhat generic, and contain a couple of #ifdefs on the CPU type, that I may have missed.

Regarding incorrect results, I got them only when I tried to fine-tune param.h for Zen, but there may still be a problem that I did not see.

All in all, I once thought that the fastest way of supporting Ryzen would be to detect it as HASWELL (not as a different ZEN CPU), and use everything Haswell. However, given recent results that, for instance, say that movntp* instructions are extremely slow on Ryzen, I still prefer the "correct" route of detecting Ryzen as Zen and maybe implement new kernels for it.

By the way, the reason I submitted that patch is that I ran a Numpy program on my openSUSE 42.2 distro, with a dynamic OpenBLAS, and my CPU was not detected at all. OpenBLAS used generic and very slow kernels. Proper Zen detection sped up my program by almost 3x.

@martin-frbg
Copy link
Collaborator

@steckdenis I wonder if you could reproduce gcp's SGEMM failures with the standalone tester he posted above (and also if/why building with "make TARGET=ZEN" appears to be broken now as reported by fshi98) ? I do not think yesterday's patches for the runtime dynamic detection could be at fault, but perhaps your PR may not have included all of what you changed and tested locally ?

@steckdenis
Copy link
Contributor

Allright, I tested everything and I have good news. It seems that my param.h, copied from Excavator, was responsible for the inaccuracies, compile errors and bad performance. The problem was hidden in my original patch because I forgot to increase an "i" somewhere (see one of the latest "fix Zen" commits).

I now use the complete Haswell param.h defines for Zen (see fix.txt). This allows me to compile OpenBLAS with TARGET=ZEN, and the sgemm test suite seems to work (see results.txt). Moreover, on my Ryzen 1700 overclocked to 3.7 Ghz (constant) with 4x 4GB DDR4 2400 Mhz RAM, OpenBLAS does 14.423 flops/cycle (default "./sgemm", so I think this is multithreaded), which is above any of the hacks.

@martin-frbg
Copy link
Collaborator

Sounds great. Could you try this with "OPENBLAS_NUM_THREADS=8" as well please (as that appears to be what triggered most of the failures above), and create a new PR if all looks good ?

@steckdenis
Copy link
Contributor

It also seems to be working with 8 threads. I'll just wait for a confirmation from @gcp that my results.txt file actually indicates success.

@el-oso
Copy link

el-oso commented Apr 16, 2017

Hmm... I actually ran into a similar issue. I was running T-SNE with Torch and I got nan's using OpenBLAS, I recompiled it using TARGET=HASWELL and NO_AVX2=1 and now nan's are gone. I am not sure about the speed though. I could not recompile it using TARGET=ZEN. I got the same errors as gcp. Applied his patch and seems working as well.

Will give more info as I work it out.

@brada4
Copy link
Contributor

brada4 commented Apr 16, 2017

Just post last core from /proc/cpuinfo , you can be trusted errors are same :-)

@gcp
Copy link
Contributor Author

gcp commented Apr 18, 2017

Reverting to Haswell param.h (the fix.txt) works for me. Performance is good again, results are correct.

@steckdenis I added your patch to my branch and will issue a pull request from there.

@steckdenis
Copy link
Contributor

Wonderful! Yes, feel free to issue a pull request from your branch.

@martin-frbg
Copy link
Collaborator

Merged, thanks for your work all of you. I guess we are now back to something like "support AMD Ryzen through Haswell kernels" - which is not a bad thing at all IMHO.

@gcp
Copy link
Contributor Author

gcp commented Apr 18, 2017

One small cleanup remaining:

getarch.c in FORCE_ZEN has LIBNAME and CORENAME = excavator

@gcp
Copy link
Contributor Author

gcp commented Apr 19, 2017

I'm going to close this issue because I think we fixed all known problems reported here.

@gcp gcp closed this as completed Apr 19, 2017
@gcp
Copy link
Contributor Author

gcp commented May 2, 2017

I'm not sure Excavator can work correctly as Ryzen does not have FMA4, only FMA3.

For future reference: it turns out that Ryzen does support FMA4, it just does not document this through CPUID flags. That's why the Excavator kernels didn't crash with SIGILL.

@brada4
Copy link
Contributor

brada4 commented May 2, 2017

Haswell is better choice, some virtualisation may pedantically mask off-cpuid instructions out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants