ZEN kernels perform worse, give wrong results, compared to HASWELL kernels on Zen/Ryzen #1147
Comments
Which part of combo is _gemm ? |
They both use im2col and sgemm alternations, the used matrix sizes are in the second paragraph. Isolating the sgemm kernels will likely indicate a much larger performance drop. |
Can you record performance (perf record -- 'command') and show the report? (You can try different OPENBLAS_CORETYPE values.) |
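A minimal sketch of how the same command could be run under several core types. The helper name `run_with_coretype` and the stand-in probe command are hypothetical; `OPENBLAS_CORETYPE` and `OPENBLAS_NUM_THREADS` are the environment variables mentioned in this thread. As far as I know, `OPENBLAS_CORETYPE` only has an effect on a dynamic-arch build of OpenBLAS.

```python
import os
import subprocess
import sys


def run_with_coretype(cmd, coretype, num_threads=None):
    """Run cmd with OPENBLAS_CORETYPE (and optionally OPENBLAS_NUM_THREADS) set."""
    env = dict(os.environ)
    env["OPENBLAS_CORETYPE"] = coretype
    if num_threads is not None:
        env["OPENBLAS_NUM_THREADS"] = str(num_threads)
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in command that only echoes the variable back.
    # Replace it with the real benchmark binary, e.g. ["./sgemm"].
    probe = [sys.executable, "-c",
             "import os; print(os.environ['OPENBLAS_CORETYPE'])"]
    for core in ("Haswell", "Zen", "Excavator"):
        result = run_with_coretype(probe, core)
        print(core, "->", result.stdout.strip())
```

Wrapping the benchmark this way keeps the comparison limited to the kernel selection, since everything else in the environment stays identical between runs.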
This is custom-written C++ code. I have a standalone tester at https://github.com/gcp/sgemm. I'll produce a run on the Ryzen with the ZEN kernels. The ones for Haswell and the comparison to MKL are already there: https://github.com/gcp/sgemm/blob/master/results/sgemm.txt |
OPENBLAS_CORETYPE=Haswell (dynamic) Running SGEMM with M=128, N=361, K=1152, alpha=1.000000, lda=1152, ldb=361, beta=0.000000, ldc=361 TARGET=ZEN (static) About 13% worse. |
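For readers unfamiliar with the parameters in that run (lda=1152 equals K, ldb=361 and ldc=361 equal N), here is a rough reference sketch of a row-major, non-transposed SGEMM. It assumes row-major storage, which matches those leading dimensions but is not stated explicitly in the thread. The function name `sgemm_ref` is hypothetical, and Python floats are double precision, so the values will not be bit-identical to a real single-precision SGEMM.

```python
def sgemm_ref(M, N, K, alpha, A, lda, B, ldb, beta, C, ldc):
    """Compute C = alpha * A @ B + beta * C on flat row-major lists.

    A is M x K with row stride lda, B is K x N with row stride ldb,
    C is M x N with row stride ldc.
    """
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i * lda + k] * B[k * ldb + j]
            idx = i * ldc + j
            # BLAS convention: with beta == 0 the old contents of C are not read.
            if beta == 0.0:
                C[idx] = alpha * acc
            else:
                C[idx] = alpha * acc + beta * C[idx]
    return C


if __name__ == "__main__":
    # Tiny 2x3 times 3x2 example instead of the 128x1152 times 1152x361 case.
    A = [1.0, 2.0, 3.0,
         4.0, 5.0, 6.0]
    B = [1.0, 0.0,
         0.0, 1.0,
         1.0, 1.0]
    C = [0.0] * 4
    print(sgemm_ref(2, 2, 3, 1.0, A, 3, B, 2, 0.0, C, 2))
```

A slow reference like this is only useful for checking results; it says nothing about performance.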
@steckdenis any idea ? Could be different matrix sizes than you used for your performance tests perhaps ? |
@gcp I see you fixed cpuid... |
I'm not sure Excavator can work correctly as Ryzen does not have FMA4, only FMA3. |
BLAS Core: Excavator Maybe the "Excavator param.h tuning" wasn't a good idea for Zen, as it seems the default Haswell values are just better here. I don't think Zen has much in common with Excavator anyway... |
By the way, if I increase the thread count: export OPENBLAS_NUM_THREADS=8 BLAS Core: Zen Note the "FAIL!!!" warnings. The Zen kernels don't even give the right result. |
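A correctness check of the kind that produces such "FAIL!!!" warnings could look roughly like the sketch below. The actual criterion used by the standalone tester is not shown in this thread; the function name `count_mismatches` and the tolerances are assumptions chosen for illustration.

```python
import math


def count_mismatches(result, reference, rel_tol=1e-4, abs_tol=1e-5):
    """Count elements of result that differ from reference beyond a tolerance.

    NaN in the result always counts as a mismatch.
    """
    if len(result) != len(reference):
        raise ValueError("result and reference must have the same length")
    bad = 0
    for got, want in zip(result, reference):
        if math.isnan(got) or not math.isclose(
                got, want, rel_tol=rel_tol, abs_tol=abs_tol):
            bad += 1
    return bad


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0]
    suspect = [1.0, 2.5, float("nan")]
    bad = count_mismatches(suspect, reference)
    print("FAIL!!!" if bad else "OK", f"({bad} mismatching elements)")
```

A tolerance is needed because different kernels sum in a different order, so small floating-point differences are expected even when both are correct.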
Your CPU clocks between 3.0 and 3.7 GHz; in conjunction with constant+nonstop TSC and rdtsc, that accounts for the dozen-percent variation. |
My CPU is locked to a constant 3.6GHz, exactly to avoid this problem. The measurement takes the median of 11 passes. This is not a measurement problem. The performance degradation is clearly and repeatedly measurable. To say nothing of the kernel not even producing correct results when using multiple threads... |
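The "median of 11 passes" approach can be sketched as follows. This is not the code of the standalone tester; `median_wall_time` and the dummy workload are hypothetical, and wall-clock time from `time.perf_counter` is used here instead of rdtsc.

```python
import statistics
import time


def median_wall_time(fn, passes=11):
    """Call fn `passes` times and return the median wall-clock duration in seconds."""
    samples = []
    for _ in range(passes):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Dummy workload standing in for one SGEMM call.
    workload = lambda: sum(i * i for i in range(100_000))
    print(f"median of 11 passes: {median_wall_time(workload):.6f} s")
```

Taking the median rather than the mean makes the figure less sensitive to a few slow passes caused by scheduling or frequency changes.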
Official spec says 3.0GHz. Can you retry please (with HASWELL and EXCAVATOR sgemm), as nobody else has such a new CPU? |
If I build with "make TARGET=ZEN", there are build errors. However, building without "TARGET=ZEN" works fine. |
Another problem on the AMD Ryzen CPU: if built with "# NO_AVX2 = 1" (the line left commented out, so AVX2 is enabled), the output is wrong, although the code builds and runs without problems. With "NO_AVX2 = 1" uncommented, the results are correct. |
I know what they say. I repeat: this was tested at a fixed clock of 3.6GHz, exactly to avoid measurement problems due to boosting.
Those results are already included above. I did not check correctness of the Excavator kernel with multiple threads, but Haswell works correctly and is faster. |
For Haswell, there is no problem at all. |
Thanks @fshi98, that confirms what I'm seeing: the Zen kernels are broken and give erroneous results. |
@gcp Exactly. If it is not built with AVX2, it runs OK and gives correct results. I think the problem is with the ZEN AVX2 support. |
Running with Excavator kernels also produces correct (but slower) results. It's only OPENBLAS_CORETYPE=ZEN that produces wrong results. |
Speed-wise, if built as HASWELL with AVX2, the Zen and the i7-4790 are almost the same, but when not building it myself (running the package from Ubuntu apt-get), the Zen 1800 is around 15% faster. I am testing the code for Caffe. |
@gcp - please test with stable hpt, or with wall clock. Permanent turbo mode is not possible on any CPU with any cooling. |
@fshi98 - if you run |
@brada4 Thanks. I ran make clean a couple of times and also tried a fresh git clone, but without luck when building with "make TARGET=ZEN". Will try more. |
The first result I posted here (combined kernel) was with wall clock. Permanently locking the clockspeed is perfectly possible with many mainboards. For the THIRD time: this is a reproducible regression regardless of measurement method, with and without turbo, with wallclock measurement, and with rdtsc+IPC/FLOPS clock measurement. Stop this line of arguing, you're wasting both of our time and I'm tired of it. The kernel gives WRONG RESULTS anyway, so who cares about performance. Right now OpenBLAS is outright broken on Zen. |
I don't doubt the (in)accuracy of the 'zen' kernel. |
@brada4 I do not think we are at a point yet where small variations in benchmark numbers are likely to make any difference. |
@martin-frbg It could be haswell-copy IMO (it mostly is), just that %% for it is not counted in good weather... |
Hi, This issue is very intriguing. I added Zen support by making Zen use the exact same kernels as Haswell. However, I also observed lower performance than TARGET=HASWELL, without being able to explain it. It may be the case that some kernels are somewhat generic, and contain a couple of #ifdefs on the CPU type, that I may have missed. Regarding incorrect results, I got them only when I tried to fine-tune param.h for Zen, but there may still be a problem that I did not see. All in all, I once thought that the fastest way of supporting Ryzen would be to detect it as HASWELL (not as a different ZEN CPU), and use everything Haswell. However, given recent results that, for instance, say that movntp* instructions are extremely slow on Ryzen, I still prefer the "correct" route of detecting Ryzen as Zen and maybe implement new kernels for it. By the way, the reason I submitted that patch is that I ran a Numpy program on my openSUSE 42.2 distro, with a dynamic OpenBLAS, and my CPU was not detected at all. OpenBLAS used generic and very slow kernels. Proper Zen detection sped up my program by almost 3x. |
@steckdenis I wonder if you could reproduce gcp's SGEMM failures with the standalone tester he posted above (and also if/why building with "make TARGET=ZEN" appears to be broken now as reported by fshi98) ? I do not think yesterday's patches for the runtime dynamic detection could be at fault, but perhaps your PR may not have included all of what you changed and tested locally ? |
All right, I tested everything and I have good news. It seems that my param.h, copied from Excavator, was responsible for the inaccuracies, compile errors and bad performance. The problem was hidden in my original patch because I forgot to increase an "i" somewhere (see one of the latest "fix Zen" commits). I now use the complete Haswell param.h defines for Zen (see fix.txt). This allows me to compile OpenBLAS with TARGET=ZEN, and the sgemm test suite seems to work (see results.txt). Moreover, on my Ryzen 1700 overclocked to 3.7 GHz (constant) with 4x 4GB DDR4 2400 MHz RAM, OpenBLAS does 14.423 flops/cycle (default "./sgemm", so I think this is multithreaded), which is above any of the hacks. |
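For context, a flops/cycle figure like the one above is usually derived from the operation count of the matrix product, the measured time and the clock frequency. The sketch below assumes the common convention of 2*M*N*K floating-point operations per SGEMM; how the standalone tester computes its figure exactly is not shown in this thread, and the timing in the example is a made-up placeholder.

```python
def flops_per_cycle(M, N, K, seconds, clock_hz):
    """Floating-point operations per clock cycle for one M x K by K x N product.

    Counts one multiply and one add per inner-product term (2 * M * N * K).
    """
    flops = 2.0 * M * N * K
    return flops / (seconds * clock_hz)


if __name__ == "__main__":
    # Matrix size from this issue; the 2.0 ms duration is a placeholder, not a measurement.
    value = flops_per_cycle(128, 361, 1152, seconds=2.0e-3, clock_hz=3.7e9)
    print(f"{value:.3f} flops/cycle")
```

Because the result depends directly on the assumed clock frequency, it is only meaningful when the clock is fixed, as it was in the runs reported here.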
Sounds great. Could you try this with "OPENBLAS_NUM_THREADS=8" as well please (as that appears to be what triggered most of the failures above), and create a new PR if all looks good ? |
It also seems to be working with 8 threads. I'll just wait for a confirmation from @gcp that my results.txt file actually indicates success. |
Hmm... I actually ran into a similar issue. I was running t-SNE with Torch and got NaNs using OpenBLAS. I recompiled it using TARGET=HASWELL and NO_AVX2=1, and now the NaNs are gone. I am not sure about the speed, though. I could not recompile it using TARGET=ZEN; I got the same errors as gcp. I applied his patch and it seems to work as well. Will give more info as I work it out. |
Just post the last core's entry from /proc/cpuinfo; we can trust that the errors are the same :-) |
Reverting to Haswell param.h (the fix.txt) works for me. Performance is good again, results are correct. @steckdenis I added your patch to my branch and will issue a pull request from there. |
Wonderful! Yes, feel free to issue a pull request from your branch. |
Merged, thanks for your work all of you. I guess we are now back to something like "support AMD Ryzen through Haswell kernels" - which is not a bad thing at all IMHO. |
One small cleanup remaining: getarch.c in FORCE_ZEN has LIBNAME and CORENAME = excavator |
I'm going to close this issue because I think we fixed all known problems reported here. |
For future reference: it turns out that Ryzen does support FMA4, it just does not document this through CPUID flags. That's why the Excavator kernels didn't crash with SIGILL. |
Haswell is the better choice; some virtualisation layers may pedantically mask out instructions that are not advertised through CPUID. |
My application extensively uses SGEMM kernels with sizes:
M=128, N=361, K=1152
M=32, N=361, K=288
(This is an im2col+SGEMM combo for DCNN computation)
Single-threaded (application itself is multithreaded) with OPENBLAS_CORETYPE=Haswell
1000 predictions in 37.00 seconds -> 27 p/s
1000 evaluations in 4.29 seconds -> 233 p/s
Static build for Zen (see previous issue, dynamic dispatch is broken):
1000 predictions in 40.23 seconds -> 24 p/s
1000 evaluations in 4.50 seconds -> 222 p/s
So performance tanks about 5% to 20%.
So, "Zen" support in OpenBLAS actually worsens performance on Zen.
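As a quick check of the arithmetic, the two pairs of timings quoted above can be turned into throughput and relative slowdown like this. The helper name `slowdown_percent` is hypothetical; the numbers are the ones from this report.

```python
def slowdown_percent(baseline_seconds, candidate_seconds):
    """Extra time taken by the candidate relative to the baseline, in percent."""
    return (candidate_seconds / baseline_seconds - 1.0) * 100.0


if __name__ == "__main__":
    # (Haswell kernels, static Zen build) wall times in seconds for 1000 items.
    runs = {
        "predictions": (37.00, 40.23),
        "evaluations": (4.29, 4.50),
    }
    for name, (haswell, zen) in runs.items():
        print(f"{name}: {1000 / haswell:.1f} -> {1000 / zen:.1f} per second, "
              f"{slowdown_percent(haswell, zen):.1f}% more time")
```

These two application-level runs mix SGEMM with other work such as im2col, so an isolated SGEMM benchmark can show a larger gap.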