Above average kernel times causing slow performance #1560
Comments
What CPU, which version of OpenBLAS, what matrix size(s) ? Does limiting OPENBLAS_NUM_THREADS further (even to just 2) improve performance ? |
I've tested with various CPU families (mostly Nehalem, but also AMD Ryzen/Threadripper). The above-mentioned behavior (too much time spent in kernel calls) occurs on every tested system. In fact OPENBLAS_NUM_THREADS=1 performs better than any other number of threads; the problem size is 200x200 or 300x300. When running with only one thread the CPU time is 100% green; with two or more threads the first thread is 100% green while the others are mostly red (kernel time), and overall performance is worse. I would like to understand why this happens and how to avoid it (while using multiple threads). Maybe an OpenMP build of OpenBLAS has better parallelism in this particular case? I can provide a quick example/Fortran source later... |
kernel time on the "other" threads is probably spent in sched_yield() - either waiting on a lock, or simply waiting for something to do. Which version of OpenBLAS are you using - 0.2.20 or a snapshot of the develop branch ? (The latter has a changed GEMM multithreading which may help) |
One can experiment with the YIELDING macro. No idea why, but sched_yield there spins the CPU in the kernel to 100%, while a noop there leaves the CPUs nearly idle at no penalty to the overall time. |
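For reference, the experiment described above boils down to redefining the YIELDING macro in OpenBLAS's common.h. A sketch (the exact default definition varies by platform and version, and the nop sequence here is only illustrative):

```c
#if 0
/* default (x86) definition: every pass through the busy-wait loop makes a
   sched_yield() syscall, which is what shows up as kernel ("red") time */
#define YIELDING   sched_yield()
#else
/* the "noop" experiment: burn a few nop instructions instead, so the wait
   stays entirely in user space */
#define YIELDING   __asm__ __volatile__ ("nop;nop;nop;nop;nop;nop;nop;nop;\n")
#endif
```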
Issue #900, but previous experiments there have been quite inconclusive. I would not exclude the possibility that processes spending their time in YIELDING (whatever its implementation) are just a symptom and not the issue itself. |
I suspect sched_yield became a CPU hog at some point, but what it hogs would otherwise go unused... |
I wonder if it is same observation as #1544 |
Observation from #1544 is not quite clear yet, and ARMV7 already has |
I don't know if the "problem" is related to sched_yield(), I'm afraid I don't have the right tools to check... So instead I provide the example code below so the experts here can profile/debug :-) |
Thank you for the sample.
|
This may in part be a LAPACK issue, recent LAPACK includes an alternative, OpenMP-parallelized version of ZHEEV called ZHEEV_2STAGE that may show better performance (have not gotten around to trying with your example yet, sorry). On the BLAS side, it seems interface/zaxpy.c did not receive the same ("temporary") fix for inefficient multithreading of small problem sizes as interface/axpy.c did (7 years ago, for issue #27). Not sure yet if that is related either... |
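For context, the earlier fix in interface/axpy.c that zaxpy.c apparently missed is essentially a size cutoff below which only one thread is used. A simplified sketch (the threshold value and surrounding code are from memory and may differ from the actual source):

```c
#ifdef SMP
  /* small-vector fallback: below this size the cost of waking and
     synchronizing the worker threads outweighs any parallel speedup,
     so run the AXPY on a single thread */
  if (incx == 0 || incy == 0 || n <= 10000)
    nthreads = 1;
  else
    nthreads = num_cpu_avail(1);   /* number of threads OpenBLAS may use */
#endif
```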
According to |
The problem is that threads that are spun up but doing nothing are not accounted for in perf; they show up as yielding instead. |
Preliminary - changing sched_yield to nop does not directly affect running time, but gets rid of busy waiting that would drive cpu temperature (possibly leading to thermal throttling on poorly designed hardware). Dropping zaxpy to single threading is the only change that leads to a small speedup, while changing the thresholds for multithreading in zgemm, zhemv only reduces performance. As noted above, the majority of the time is spent in unoptimized LAPACK zlasr - for which MKL probably uses a better algorithm than the reference implementation. Also most of the lock/unlock cycles spent in the testcase appear to be from libc's random() used to fill the input matrix. (I ran the testcase 1000 times in a loop to get somewhat better data, but still the ratio between times spent in setup and actual calculation is a bit poor - which probably also explains the huge overhead from creating threads that are hardly needed afterwards.) Probably need to rewrite the testcase first when I find time for this again. |
http://www.cs.utexas.edu/users/flame/pubs/flawn60.pdf contains a discussion of the fundamental reasons for the low performance of the zlasr function, and of alternative implementations. |
In view of the discussion in #1614, you could try if uncommenting the |
You should see some speedup and much less overhead with a current "develop" snapshot now (see #1624). Unfortunately this does not change the low performance of ZLASR itself, and I have now found that the new ZHEEV_2STAGE implementation I suggested earlier does not yet support the JOBZ=V case, i.e. computation of eigenvectors. (The reason for this is not clear to me, the code seems to be in place but is prevented from being called) |
Quoting @fenrus75 from #1614:
I ended up debugging the same thing on a 24 core Opteron today and came to the same conclusion. Could the THREAD_TIMEOUT maybe be made much smaller? There is probably little harm in spinning a few microseconds. I tried
which drastically reduced the number of CPU cycles spent on a simple test case:
It's probably possible to tune this better, but that simple change would be a good start if it shows no regressions in other tests. |
Good point - note that THREAD_TIMEOUT can already be overridden in Makefile.rule, so there is no need to hack the actual code (as long as you are building with |
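For example, something along these lines in Makefile.rule (a sketch; if I read blas_server.c correctly the value is used as a power-of-two spin count with a default around 28, so a smaller exponent makes idle workers go to sleep much sooner):

```make
# Makefile.rule (sketch): let a waiting worker thread spin for only
# 2^16 iterations instead of the default ~2^28 before it goes to sleep.
THREAD_TIMEOUT = 16
```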
Could you share the test case? Being slower with SMP is a regression on its own. Another is that sched_yield (a.k.a. the YIELDING macro) is used in a busy loop; some nanosleep could do better instead. |
Since this issue still affects many libraries and software packages that rely on OpenBLAS, I've created a minimal example file showing the issue. Now that Intel oneAPI is easily available for Linux/WSL2, you can compare the two subroutines ( |
Could you check with threaded and non-threaded OpenBLAS |
Just for reference, timings for current |
Relevant perf tool report for gcc+openblas:
Relevant perf tool report for intel+mkl:
Relevant perf tool report for gcc+openblas (single-threading):
|
Report for AMD Ryzen 5 5600G running only ZHEEV
Relevant perf tool report for gcc+openblas:
Relevant perf tool report for gcc+openblas-st (single-threading with
Relevant perf tool report for intel+mkl:
|
Looks like MKL may "simply" be using a different implementation of ZHEEV that avoids the expensive call to ZLASR. (Remember that almost all the LAPACK in OpenBLAS is a direct copy of https://github.com/Reference-LAPACK/lapack a.k.a "netlib" - unfortunately nothing there has changed w.r.t the implementation status of ZHEEV_2STAGE compared to my above comment from 2018) |
That's really unfortunate! Maybe we should report elsewhere @martin-frbg (any netlib lapack forum?). |
See link for their GitHub issue tracker. The old forum is archived at https://icl.utk.edu/lapack-forum/ and has since been replaced by a Google group at https://groups.google.com/a/icl.utk.edu/g/lapack |
It would probably make sense to rerun your test with pure "netlib" LAPACK and BLAS before reporting there, though |
gfortran using unoptimized (and single-threaded) Reference-LAPACK (and associated BLAS) on same AMD hardware: |
Ok, since netlib's LAPACK exhibits the same behavior, it's really not a bug in OpenBLAS but a performance issue with LAPACK's ZLASR. It seems that the only consistently performant routines for the complex Hermitian eigenproblem (when requesting all eigenpairs) in LAPACK/OpenBLAS are those using Relatively Robust Representations, which means only ?HEEVR (CHEEVR/ZHEEVR)... at least until the work on the 2stage codes is completed. Additionally, do yourself a favor and explicitly set the driver:

```python
from scipy import linalg as LA
...
w, v = LA.eigh(mat, driver='evr')
```
|
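A self-contained version of that suggestion (the size and the random Hermitian test matrix are just placeholders for the reproducer discussed above):

```python
import numpy as np
from scipy import linalg as LA

n = 2000
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
mat = a + a.conj().T                 # Hermitian test matrix

# driver='evr' selects ?HEEVR (Relatively Robust Representations),
# avoiding the slow ZLASR-based code path discussed above.
w, v = LA.eigh(mat, driver='evr')
```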
Agreed. Unfortunately there is no sign of ongoing work on the 2stage codes since their inclusion. The FLAME group paper I linked above four years ago at least sketches a much faster implementation of what zlasr does, but actually coding it looks non-trivial |
Created a LAPACK ticket to inquire about implementation status. |
Hi!
While comparing OpenBLAS performance with Intel MKL I've noticed that (at least in my particular case: a real or Hermitian eigenvalue problem, e.g. ZHEEV) OpenBLAS spends much more kernel time (red bars in htop) than Intel MKL, and maybe this is why it is so slow (three to five times slower, depending on matrix size) compared to MKL. Does anybody know what is causing so much kernel thread time and how to avoid it? I've already limited OPENBLAS_NUM_THREADS to 4 or 8... TIA.
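For reference, one way to apply such a thread limit from Python before the BLAS library is loaded (a sketch; the variable can equally well be exported in the shell before running a Fortran test program):

```python
import os

# must be set before numpy/scipy load OpenBLAS
os.environ["OPENBLAS_NUM_THREADS"] = "4"

import numpy as np   # imported only after the thread limit is in place
```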