Above average kernel times causing slow performance #1560
Comments
What CPU, which version of OpenBLAS, what matrix size(s) ? Does limiting OPENBLAS_NUM_THREADS further (even to just 2) improve performance ? |
I've tested with various CPU families (mostly Nehalem, but also AMD Ryzen/Threadripper). The above-mentioned behavior (too much time spent in kernel calls) occurs on every tested system. In fact OPENBLAS_NUM_THREADS=1 performs better than any other number of threads; the problem size is 200x200 or 300x300. When running with only one thread the CPU time is 100% green; with two or more threads the first thread is 100% green while the others are mostly red (kernel time), and overall performance is worse. I would like to understand why this happens and how to avoid it (while using multiple threads). Maybe an OpenMP build of OpenBLAS has better parallelism in this particular case? I can provide a quick example/Fortran source later... |
kernel time on the "other" threads is probably spent in sched_yield() - either waiting on a lock, or simply waiting for something to do. Which version of OpenBLAS are you using - 0.2.20 or a snapshot of the develop branch ? (The latter has a changed GEMM multithreading which may help) |
One can experiment with the YIELDING macro. No idea why, but sched_yield there spins the CPU in the kernel to 100%, while a noop there leaves the CPUs nearly idle at no penalty to the overall time. |
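For reference, the experiment described above boils down to redefining the YIELDING macro in OpenBLAS's common.h. A sketch (the exact default definition varies by platform and version, and the nop sequence here is only illustrative):

```c
#if 0
/* default (x86) definition: every pass through the busy-wait loop makes a
   sched_yield() syscall, which is what shows up as kernel ("red") time */
#define YIELDING   sched_yield()
#else
/* the "noop" experiment: burn a few nop instructions instead, so the wait
   stays entirely in user space */
#define YIELDING   __asm__ __volatile__ ("nop;nop;nop;nop;nop;nop;nop;nop;\n")
#endif
```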
Issue #900, but previous experiments there have been quite inconclusive. I would not exclude the possibility that processes spending their time in YIELDING (whatever its implementation) are just a symptom and not the issue itself. |
I suspect sched_yield became a CPU hog at some point, but what it hogs would otherwise go unused... |
I wonder if it is same observation as #1544 |
Observation from #1544 is not quite clear yet, and ARMV7 already has |
I don't know if the "problem" is related to sched_yield(), I'm afraid I don't have the right tools to check... So instead I provide the example code below so the experts here can profile/debug :-) |
Thank you for the sample.
|
This may in part be a LAPACK issue, recent LAPACK includes an alternative, OpenMP-parallelized version of ZHEEV called ZHEEV_2STAGE that may show better performance (have not gotten around to trying with your example yet, sorry). On the BLAS side, it seems interface/zaxpy.c did not receive the same ("temporary") fix for inefficient multithreading of small problem sizes as interface/axpy.c did (7 years ago, for issue #27). Not sure yet if that is related either... |
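For context, the earlier fix in interface/axpy.c that zaxpy.c apparently missed is essentially a size cutoff below which only one thread is used. A simplified sketch (the threshold value and surrounding code are from memory and may differ from the actual source):

```c
#ifdef SMP
  /* small-vector fallback: below this size the cost of waking and
     synchronizing the worker threads outweighs any parallel speedup,
     so run the AXPY on a single thread */
  if (incx == 0 || incy == 0 || n <= 10000)
    nthreads = 1;
  else
    nthreads = num_cpu_avail(1);   /* number of threads OpenBLAS may use */
#endif
```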
According to |
The problem is that threads that are spun up but doing nothing are not accounted for in perf; they show up as yielding instead. |
Preliminary - changing sched_yield to nop does not directly affect running time, but gets rid of busy waiting that would drive cpu temperature (possibly leading to thermal throttling on poorly designed hardware). Dropping zaxpy to single threading is the only change that leads to a small speedup, while changing the thresholds for multithreading in zgemm, zhemv only reduces performance. As noted above, the majority of the time is spent in unoptimized LAPACK zlasr - for which MKL probably uses a better algorithm than the reference implementation. Also most of the lock/unlock cycles spent in the testcase appear to be from libc's random() used to fill the input matrix. (I ran the testcase 1000 times in a loop to get somewhat better data, but still the ratio between times spent in setup and actual calculation is a bit poor - which probably also explains the huge overhead from creating threads that are hardly needed afterwards.) Probably need to rewrite the testcase first when I find time for this again. |
http://www.cs.utexas.edu/users/flame/pubs/flawn60.pdf contains a discussion of the fundamental reasons for the low performance of the zlasr function, and of alternative implementations. |
In view of the discussion in #1614, you could try if uncommenting the |
You should see some speedup and much less overhead with a current "develop" snapshot now (see #1624). Unfortunately this does not change the low performance of ZLASR itself, and I have now found that the new ZHEEV_2STAGE implementation I suggested earlier does not yet support the JOBZ=V case, i.e. computation of eigenvectors. (The reason for this is not clear to me, the code seems to be in place but is prevented from being called) |
Quoting @fenrus75 from #1614:
I ended up debugging the same thing on a 24 core Opteron today and came to the same conclusion. Could the THREAD_TIMEOUT maybe be made much smaller? There is probably little harm in spinning a few microseconds. I tried
which drastically reduced the number of CPU cycles spent on a simple test case:
It's probably possible to tune this better, but that simple change would be a good start if it shows no regressions in other tests. |
Good point - note that THREAD_TIMEOUT can already be overridden in Makefile.rule, so there is no need to hack the actual code (as long as you are building with |
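For example, something along these lines in Makefile.rule (a sketch; if I read blas_server.c correctly the value is used as a power-of-two spin count with a default around 28, so a smaller exponent makes idle workers go to sleep much sooner):

```make
# Makefile.rule (sketch): let a waiting worker thread spin for only
# 2^16 iterations instead of the default ~2^28 before it goes to sleep.
THREAD_TIMEOUT = 16
```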
Could you share the test case? Being slower with SMP is a regression on its own. Another is that sched_yield (a.k.a. the YIELDING macro) is used in a busy loop; some nanosleep could do better instead. |
Since this issue still affects many libraries and software packages that rely on OpenBLAS, I've created a minimal example file showing the issue. Now that Intel oneAPI is easily available for Linux/WSL2, you can compare the two subroutines ( |
Could you check with threaded and non-threaded OpenBLAS |
Just for reference, timings for current |
Relevant perf tool report for gcc+openblas:
Relevant perf tool report for intel+mkl:
Relevant perf tool report for gcc+openblas (single-threading):
|
Report for AMD Ryzen 5 5600G running only ZHEEV
Relevant perf tool report for gcc+openblas:
Relevant perf tool report for gcc+openblas-st (single-threading with
Relevant perf tool report for intel+mkl:
|
Looks like MKL may "simply" be using a different implementation of ZHEEV that avoids the expensive call to ZLASR. (Remember that almost all the LAPACK in OpenBLAS is a direct copy of https://github.com/Reference-LAPACK/lapack a.k.a "netlib" - unfortunately nothing there has changed w.r.t the implementation status of ZHEEV_2STAGE compared to my above comment from 2018) |
That's really unfortunate! Maybe we should report elsewhere @martin-frbg (any netlib lapack forum?). |
See link for their GitHub issue tracker. The old forum is archived at https://icl.utk.edu/lapack-forum/ and has since been replaced by a Google group at https://groups.google.com/a/icl.utk.edu/g/lapack |
It would probably make sense to rerun your test with pure "netlib" LAPACK and BLAS before reporting there, though |
gfortran using unoptimized (and single-threaded) Reference-LAPACK (and associated BLAS) on same AMD hardware: |
Ok, since netlib's LAPACK exhibits the same behavior, it's really not a bug in OpenBLAS but a performance issue with LAPACK's ZLASR. It seems that the only consistently performant routines for the complex Hermitian eigenproblem (when requesting all eigenpairs) in LAPACK/OpenBLAS are those using Relatively Robust Representations, which means only ?HEEVR (CHEEVR/ZHEEVR)... at least until the work on the 2stage codes is completed. Additionally, do yourself a favor and explicitly set the driver:

```python
from scipy import linalg as LA
...
w, v = LA.eigh(mat, driver='evr')
```
|
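A self-contained version of that suggestion (the size and the random Hermitian test matrix are just placeholders for the reproducer discussed above):

```python
import numpy as np
from scipy import linalg as LA

n = 2000
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
mat = a + a.conj().T                 # Hermitian test matrix

# driver='evr' selects ?HEEVR (Relatively Robust Representations),
# avoiding the slow ZLASR-based code path discussed above.
w, v = LA.eigh(mat, driver='evr')
```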
Agreed. Unfortunately there is no sign of ongoing work on the 2stage codes since their inclusion. The FLAME group paper I linked above four years ago at least sketches a much faster implementation of what zlasr does, but actually coding it looks non-trivial |
Created a LAPACK ticket to inquire about implementation status. |
Hi!
While comparing OpenBLAS performance with Intel MKL I've noticed that (at least in my particular case: a real or Hermitian eigenvalue problem, e.g. ZHEEV) OpenBLAS spends much more kernel time (red bars in htop) than Intel MKL, and maybe this is why it is so slow (three to five times slower, depending on matrix size) compared to MKL. Does anybody know what is causing so much kernel thread time and how to avoid it? I've already limited OPENBLAS_NUM_THREADS to 4 or 8... TIA.
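For reference, one way to apply such a thread limit from Python before the BLAS library is loaded (a sketch; the variable can equally well be exported in the shell before running a Fortran test program):

```python
import os

# must be set before numpy/scipy load OpenBLAS
os.environ["OPENBLAS_NUM_THREADS"] = "4"

import numpy as np   # imported only after the thread limit is in place
```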