OpenBLAS becomes single-threaded #1820

tjoli · 2018-10-16T11:39:21Z

Hi everybody,

I am using standard lapack routines to diagonalize moderately large matrices with openBLAS.
The box is a xeon with 6 cores and hyperthreading. At runtime execution starts with 12 threads
by default and after some reproducible time goes down to one thread with miserable performance
but execution is without errors. I have observed the same behavior with OMP_NUM_THREADS=2,4,6
and using also OPENBLAS_NUM_THREADS. I tried to set OMP_DYNAMIC=FALSE with no change of behavior. It is a locally compiled version of openBLAS 0.3.3 with gcc/gfortran 7.3.0

any hints would be appreciated !

martin-frbg · 2018-10-16T11:51:21Z

Weird. Does matrix size change by the time you notice the number of active threads going down ? Many BLAS drivers now fall back to using a single thread when the overhead of multithreading is expected to exceed the gain from parallelizing, but this is always local to the individual function. Also most LAPACK functions are copied unmodified from the netlib reference implementation, so the LAPACK itself is not multithreaded.
Offhand I can only think of "something in your code" tying all threads to the same core via sched_setaffinity. Some minimal code sample that reproduces the problem would be great, but it would probably already help to know if you see this same behaviour with older versions (e.g 0.3.0, 0.2.20)

tjoli · 2018-10-16T12:53:50Z

In fact it happens during treatment of a single matrix. I noticed the same behavior with 0.2.20 hence I upgraded. The inference of single-thread behavior comes from looking at output of "top"
also I used repeatedly the same code on a local computer center and it gives full speed with expected performance.

added:
I checked that a standard openMP code runs all the time with the correct number of threads, by standard I mean without calling any openBLAS.

martin-frbg · 2018-10-16T13:38:03Z

So what are the dimensions of that particular matrix, and which function is it that appears to execute single-threaded ? (Could be a missed or misguided optimization in OpenBLAS, but if you already saw this with 0.2.20 at least it is not a recent regression.) I presume your "standard openMP code" will be using some other implementation of BLAS (like MKL or ATLAS), or is it avoiding BLAS completely ?

tjoli · 2018-10-16T14:26:16Z

I checked matrices of sizes 1800 and 3800.

The problem appears with lapack routine ZHEEV:
if I want all eigenvectors (so ZHEEV('V'....)) then it slows down to one thread while if I want eigenvalues only ZHEEV now with argument
ZHEEV('N"...) it runs full speed i.e. all threads remain active all the time. I understand that it is more costly to get the eigenvectors but why should this reduce the threads to unity ?

the standard openMP code I referred to uses no BLAS at all, just to be sure.

martin-frbg · 2018-10-16T14:48:07Z

For that one needs to look at what ZHEEV in netlib LAPACK does - at first glance it uses DSTERF if only eigenvalues are requested, and ZSTEQR when both eigenvalues and eigenvectors. I have not yet traced the BLAS calls (if any) in the two call trees. (Does your standard code avoid LAPACK calls as well, or which LAPACK does it link to when not using OpenBLAS ?)

tjoli · 2018-10-16T14:52:26Z

my "standard" code uses no LAPACK stuff at all. Just filling matrices, a dumb case as a test.
it is not a test of any LAPACK/BLAS stuff.

tjoli · 2018-10-16T15:09:38Z

Calling LAPACK DSTEQR leads to the same problem when I want all eigenvectors of a real matrix.
It turns single-threaded after a while. Tested on a order=5600 matrix

martin-frbg · 2018-10-16T15:45:19Z

I now remember that we have #1560 as a related issue, and from what I managed to find out there it was the call to ZLASR in the eigenvectors case that was the bottleneck.

brada4 · 2018-10-16T20:21:13Z

... which more or less comes from unnecessary parallelism for small samples given to blas functions ....
@tjoli can you run
perf record sample
perf report
With and without OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 set
And post the reports, esp. if any BLAS function (as opposed to lapack) consumes relatively lots more of time in parallel case.

tjoli · 2018-10-16T20:56:35Z

so perf record without any settings NUM_THREADS set

30.43%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zlasr_
19.90%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zhemv_U
17.16%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zgemm_kernel_l
16.65%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zgemm_incopy
 8.06%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zgemm_kernel_r
 5.22%  lapack.exe  [kernel.vmlinux]                [k] entry_SYSCALL_64_fastpath
 0.40%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zgemv_kernel_4x4
 0.21%  lapack.exe  libc-2.26.so                    [.] pthread_mutex_lock
 0.20%  lapack.exe  [kernel.vmlinux]                [k] native_irq_return_iret
 0.19%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zgemv_kernel_4x4
 0.18%  lapack.exe  libpthread-2.26.so              [.] __pthread_mutex_lock
 0.16%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zgemm_itcopy
 0.16%  lapack.exe  libpthread-2.26.so              [.] __pthread_mutex_unlock_usercnt
 0.10%  lapack.exe  libc-2.26.so                    [.] __sched_yield
 0.09%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] blas_thread_server
 0.08%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zswap_k
 0.06%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] dlartg_
 0.05%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zscal_k
 0.04%  lapack.exe  lapack.exe                      [.] __hamiltoniantorus_MOD_hamfermitorus

tjoli · 2018-10-16T21:08:39Z

now with OPENBLAS_NUM_THREADS=1 and OMP_NUM_THREADS=1

81.34%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zlasr_
 7.77%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zgemm_kernel_r
 5.46%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zhemv_U
 3.19%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zgemm_kernel_l
 0.38%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zgemm_incopy
 0.30%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zgemv_kernel_4x4
 0.18%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zgemv_kernel_4x4
 0.17%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] dlartg_
 0.15%  lapack.exe  [kernel.vmlinux]                [k] native_irq_return_iret
 0.07%  lapack.exe  lapack.exe                      [.] __hamiltoniantorus_MOD_hamfermitorus
 0.05%  lapack.exe  lapack.exe                      [.] __toolz_MOD_laguerre
 0.05%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zsteqr_
 0.05%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] lsame_
 0.04%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] __powidf2
 0.03%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zgemm_itcopy
 0.03%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] dlamch_
 0.03%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zaxpy_k
 0.03%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] ztrmm_kernel_RC
 0.03%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] ztrmm_kernel_RR
 0.03%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] ztrmm_kernel_RN
 0.03%  lapack.exe  lapack.exe                      [.] __toolz_MOD_search_8
 0.03%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zgemv_kernel_4x1
 0.03%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zlarfb_
 0.02%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zcopy_k
 0.02%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zswap_k
 0.02%  lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] zgemm_otcopy
 0.02%  lapack.exe  lapack.exe                      [.] __fockfermitorus_MOD_fermitorusspace

brada4 · 2018-10-17T07:02:03Z

zhemv has no ththreading threshold
@tjoli - do you see execution time halved with single thread?

martin-frbg · 2018-10-17T07:28:03Z

Execution times would complete the picture, but I doubt they will be markedly different (note that the first message has "goes down to one thread with miserable performance"). Once the program enters the non-parallelized netlib ZLASR all the work is done by a single core, no BLAS involved.

brada4 · 2018-10-17T14:04:16Z

I am into thinking zlasr work is same for both cases, just that it takes 30 or 80% depending on zhemv ?waste

tjoli · 2018-10-17T14:20:30Z

similar behavior for DSTEQR first without any settings:

28.86%  su2lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] dlasr_
22.70%  su2lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] dgemm_kernel
18.92%  su2lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] dgemm_incopy
17.56%  su2lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] dsymv_kernel_4x4
 7.62%  su2lapack.exe  [kernel.vmlinux]                [k] entry_SYSCALL_64_fastpath
 0.44%  su2lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] dgemm_itcopy
 0.40%  su2lapack.exe  libpthread-2.26.so              [.] __pthread_mutex_lock
 0.30%  su2lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] dgemv_kernel_4x4
 0.26%  su2lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] dlartg_
 0.26%  su2lapack.exe  libpthread-2.26.so              [.] __pthread_mutex_unlock_usercnt
 0.24%  su2lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] dgemv_t
 0.24%  su2lapack.exe  su2lapack.exe                   [.] __su2hamiltonianspherelong_MOD_hamfermisphere
 0.22%  su2lapack.exe  [kernel.vmlinux]                [k] native_irq_return_iret
 0.16%  su2lapack.exe  libopenblas_nehalemp-r0.3.3.so  [.] dsymv_U
 0.15%  su2lapack.exe  libc-2.26.so                    [.] __sched_yield

tjoli · 2018-10-17T14:21:53Z

now DSTEQR with OMP_NUM_THREADS=1 and OPENBLAS_NUM_THREADS=1

80.43% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dlasr_
8.16% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dgemm_kernel
5.84% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dsymv_kernel_4x4
0.77% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dlartg_
0.66% su2lapack.exe su2lapack.exe [.] su2hamiltonianspherelong_MOD_hamfermisphere
0.64% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dgemm_incopy
0.27% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dsteqr
0.24% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dgemv_t
0.22% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] lsame
0.20% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dgemv_kernel_4x4
0.16% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] _powidf2
0.16% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dlamch
0.13% su2lapack.exe [kernel.vmlinux] [k] native_irq_return_iret
0.13% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dsymv_U
0.10% su2lapack.exe su2lapack.exe [.] _toolz_MOD_search_8
0.09% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dlarfb
0.08% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] daxpy_kernel_8
0.08% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dcopy_k
0.07% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dtrmm_kernel_RN
0.07% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dgemm_itcopy
0.07% su2lapack.exe libm-2.26.so [.] 0x0000000000028a1c
0.06% su2lapack.exe libm-2.26.so [.] 0x0000000000028a94
0.06% su2lapack.exe libm-2.26.so [.] log

martin-frbg · 2018-10-17T14:23:29Z

@brada4 #1560 has a test case if you want to experiment.

tjoli · 2018-10-17T14:31:33Z

Timings DSTEQR:
omp and openblas threads=1 : 35s -> DLASR 80%
omp and openblas threads=6 : 56s -> DLASR 57%
omp and openblas threads=12 : 114s -> DLASR 30%

brada4 · 2018-10-17T15:06:24Z

@martin-frbg I will
@tjoli thanks, that just confirms initial suspicion. dlasr time derived from your data 28s : 32s: 34s

brada4 · 2018-10-17T15:15:45Z

@martin-frbg philosophically low threshold for input is page size, for output cache line , assuming all chunks are aligned, upper bound L3cache (i.e we do not add extra thread to split threshold more)

brada4 · 2018-10-17T21:45:25Z

@tjoli temp fix while I get better moderation of threading is to disable it altogether in interface/zhemv.c:

#ifdef SMP
  nthreads = num_cpu_avail();

becomes

#ifdef SMP
  nthreads = 1;//num_cpu_avail();

brada4 · 2018-10-19T10:29:36Z

Using benchmarks included
OPENBLAS_NUM_THREADS=X ./yhemv.goto 16...16000 16 and ^C when seems stable
zhemv
saturates 1 core around 128x128 sample
2 cores 320x320
chemv
480 and 480
2-core version is significantly slower for smaller samples
Sadly my 4 cores turns out to be hyperthreads and no gain in more of them.
Lets drill 1x1...512x512 into graph and do some guess where to cut.

brada4 · 2018-10-19T11:33:23Z

X=N Y=GFlops
2-thread gets above 1-thread below size of 2MB L3 cache (~1.8MB input) then it steadily saturates CPU above L3 cache size (i.e each CPU core 1is at 1MB, sort of having 2MB cache each)

@martin-frbg I would guess at "guarantee each core at least half of L3 chunk of data?"

martin-frbg · 2018-10-19T12:05:20Z

If it is L3 - interestingly nothing uses the L3_SIZE datum returned by getarch. Probably best to bind the calculation to GEMM_MULTITHREAD_THRESHOLD for consistency. Maybe even just naively copy the initial threshold from (say) zgemv.c and see where that leads you. (BTW I wonder what causes that extreme jitter in the 2-thread curves, c2 in particular)

brada4 · 2018-10-19T13:29:58Z

Jitter made me wonder too.
No cpu temperature, nothing anomalous anywhere.
Could be gemm threshold, or fixed megabyte as well, at least later is closer to the truth....

brada4 · 2018-10-19T13:59:50Z

The method and assumptions regarding underlying mechanism were quite simple.

determine when compute resource saturates iterating through viable _NUM_THREADS/taskset combinations.
./Xhemv.goto 32 3200 32 and see it stabilize
Go through as many independent cores as you have
take a number bit more than biggest it stabilizes
calculate OPENBLAS_LOOPS so that 1 1 1 gives steady timings, 200 in my case
iterate through taskeset/_LOOPS and put produce in favourite spreadsheet.

At least 1vs2vs3 cores is interesting to know if it is steps to take gradually or one, then all is viable.

brada4 · 2018-10-20T08:20:50Z

@tjoli - can you test? It is very rough and simple change.

tjoli · 2018-10-20T11:33:02Z

with OMP_NUM_THREADS=12

48.95%  lapack.exe  libopenblas_nehalemp-r0.3.4.dev.so  [.] zlasr_
23.26%  lapack.exe  libopenblas_nehalemp-r0.3.4.dev.so  [.] zhemv_U
10.75%  lapack.exe  libopenblas_nehalemp-r0.3.4.dev.so  [.] zgemm_kernel_r
 6.75%  lapack.exe  libopenblas_nehalemp-r0.3.4.dev.so  [.] zgemm_kernel_l
 5.52%  lapack.exe  [kernel.vmlinux]                    [k] entry_SYSCALL_64_fastpath
 1.74%  lapack.exe  libopenblas_nehalemp-r0.3.4.dev.so  [.] zgemm_incopy

the behavior is similar as seen from "top": goes single-threaded after few seconds.

same parameters as in post 4 days ago, the results were

30.43% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zlasr_
19.90% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zhemv_U
17.16% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zgemm_kernel_l
16.65% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zgemm_incopy
8.06% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zgemm_kernel_r
5.22% lapack.exe [kernel.vmlinux] [k] entry_SYSCALL_64_fastpath
0.40% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zgemv_kernel_4x4
0.21% lapack.exe libc-2.26.so [.] pthread_mutex_lock
0.20% lapack.exe [kernel.vmlinux] [k] native_irq_return_iret
0.19% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zgemv_kernel_4x4
0.18% lapack.exe libpthread-2.26.so [.] __pthread_mutex_lock
0.16% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zgemm_itcopy
0.16% lapack.exe libpthread-2.26.so [.] __pthread_mutex_unlock_usercnt
0.10% lapack.exe libc-2.26.so [.] __sched_yield

brada4 · 2018-10-20T11:57:47Z

zlasr_ is single-threaded
zhemv now is single-threaded for small inputs to avoid damage from excess threading.

Yes, the performance target is not to stress-test CPU cooling, but to give result faster.

Can you get 2 outputs with OPENBLAS_NUM_THREADS=2 and =3 - i have some doubts about assumptions I made with one single threshold measuring just once, thank you.

tjoli · 2018-10-20T12:22:01Z

so with openblas_num_threads=2
79.05% lapack.exe libopenblas_nehalemp-r0.3.4.dev.so [.] zlasr_
7.54% lapack.exe libopenblas_nehalemp-r0.3.4.dev.so [.] zgemm_kernel_r
7.48% lapack.exe libopenblas_nehalemp-r0.3.4.dev.so [.] zhemv_U
3.09% lapack.exe libopenblas_nehalemp-r0.3.4.dev.so [.] zgemm_kernel_l
0.42% lapack.exe libopenblas_nehalemp-r0.3.4.dev.so [.] zgemm_incopy
0.32% lapack.exe [kernel.vmlinux] [k] entry_SYSCALL_64_fastpath
0.30% lapack.exe [kernel.vmlinux] [k] native_irq_return_iret
0.30% lapack.exe libopenblas_nehalemp-r0.3.4.dev.so [.] zgemv_kernel_4x4
0.18% lapack.exe libopenblas_nehalemp-r0.3.4.dev.so [.] zgemv_kernel_4x4
0.16% lapack.exe libopenblas_nehalemp-r0.3.4.dev.so [.] dlartg_

tjoli · 2018-10-20T12:29:52Z

and with openblas_num_threads=3

76.30%  lapack.exe  libopenblas_nehalemp-r0.3.4.dev.so  [.] zlasr_
 9.67%  lapack.exe  libopenblas_nehalemp-r0.3.4.dev.so  [.] zhemv_U
 7.09%  lapack.exe  libopenblas_nehalemp-r0.3.4.dev.so  [.] zgemm_kernel_r
 3.10%  lapack.exe  libopenblas_nehalemp-r0.3.4.dev.so  [.] zgemm_kernel_l
 0.82%  lapack.exe  [kernel.vmlinux]                    [k] native_irq_return_iret
 0.72%  lapack.exe  [kernel.vmlinux]                    [k] entry_SYSCALL_64_fastpath
 0.49%  lapack.exe  libopenblas_nehalemp-r0.3.4.dev.so  [.] zgemm_incopy
 0.29%  lapack.exe  libopenblas_nehalemp-r0.3.4.dev.so  [.] zgemv_kernel_4x4
 0.18%  lapack.exe  libopenblas_nehalemp-r0.3.4.dev.so  [.] zgemv_kernel_4x4
 0.15%  lapack.exe  libopenblas_nehalemp-r0.3.4.dev.so  [.] dlartg_
 0.06%  lapack.exe  libopenblas_nehalemp-r0.3.4.dev.so  [.] zswap_k

brada4 · 2018-10-20T13:17:12Z

@martin-frbg it does not look as tragic as old one? Could it survive few releases in this form?
@tjoli there is still place for 5% improvement, but I think your initial concern is addressed?

brada4 · 2018-10-20T13:22:38Z

@tjoli _lasr group is a fancy memcpy, with like 3 FLOPS per memory read+write(each dozen flops worth), it will not gain much from parallelism. Maybe it can gain from some inlining of loops so that compiler vectorizes & prefetches better, but thats again just 5% as memory accesses need to be.

martin-frbg · 2018-10-20T13:51:42Z

@tjoli any changes to the running times with the patch ? Keeping zhemv from causing unnecessary overhead should make timings more similar between single- and multithreaded runs, but any more fundamental improvements would require a different algorithm than xSTEQR (same as in #1560, where I linked to a research paper from the developers of libflame. Not sure if/how that work maps to the current version of their library - which sports its own, non-LAPACK compatible API)

brada4 · 2018-10-21T14:07:15Z

@tjoli regarding hardware setup for performance:
if you say nehalem xeon 6core you probably get best results with HT off (boot parameter nosmt), as otherwise each hyperthread is clamped to 50% of execution resources (which matters for single thread part taking 50% time now)

If it is westmere then x86_energy_perf_policy (from cpupower or cpufreq package) may be used to set CPU to maximum power.

Also usual advice on "set BIOS to defaults" and "read vendor guide specific to your OS" apply.

And for electricity burned you probably can get modern i3 laptop that jumps circles around your server.

Delay _hemv threading in attempt to address #1820

tjoli · 2018-10-22T15:10:22Z

@martin-frbg I understand that a different algorithm is needed indeed. I am just too lazy and always jump onto LAPACK.

@brada4 I use HT because I have better performance with some custom MPI code but the gain is of course is really spectacular - it exists when going from 6 to 8 threads and vanishes beyond. And right at home my i5 is way faster...
shall I close the issue ?

brada4 · 2018-10-22T16:20:36Z

There is no code improvement expected in short-term.

zlasr performance - lots of work for negligible gain
major problem with zhemv is addressed leaving room for improvement that needs lots of measurement.

For your further experimentation:

If clang+flang is at hand they may generate different if not better LAPACK code.

Try to experiment with energy_perf_policy - it may allow 1st hyperthread to work at 100% of core while other sleeps (for ZLASR) while still permitting 50:50 split for your MPI code. x86_e_p_p not work with nehalem, but will work with westmere and better.

Depending on your needs you may find other function giving result of proper accuracy faster:
http://www.netlib.org/lapack/explore-html/d9/d8e/group__double_g_eeigen.html
(you are more than welcome to post 2 perf record/report outputs if something smells fish with threading - it is relatively easy to repair if defect is similar)

Also consider casting down to single percisions (loss of accuracy if that matters, but result comes in half time)

brada4 · 2018-10-23T09:16:52Z

If you qualify for free MKL it might be worth trying, while BLAS part is more or less on par, they have some algorithmic improvements in LAPACK side over reference LAPACK used by OpenBLAS.

martin-frbg · 2024-01-27T13:50:21Z

related ticket in Reference-LAPACK was Reference-LAPACK/lapack#710 (still need to get to implementing the suggested algorithm)

martin-frbg added a commit that referenced this issue Oct 21, 2018

Merge pull request #1825 from brada4/hemv

e3c262e

Delay _hemv threading in attempt to address #1820

martin-frbg added the LAPACK issue Deficiency in code imported from Reference-LAPACK label Jan 14, 2024

OpenBLAS becomes single-threaded #1820

OpenBLAS becomes single-threaded #1820

Comments

tjoli commented Oct 16, 2018

martin-frbg commented Oct 16, 2018

tjoli commented Oct 16, 2018 • edited Loading

martin-frbg commented Oct 16, 2018

tjoli commented Oct 16, 2018

martin-frbg commented Oct 16, 2018

tjoli commented Oct 16, 2018

tjoli commented Oct 16, 2018

martin-frbg commented Oct 16, 2018

brada4 commented Oct 16, 2018

tjoli commented Oct 16, 2018 • edited Loading

tjoli commented Oct 16, 2018 • edited Loading

brada4 commented Oct 17, 2018

martin-frbg commented Oct 17, 2018

brada4 commented Oct 17, 2018

tjoli commented Oct 17, 2018

tjoli commented Oct 17, 2018

martin-frbg commented Oct 17, 2018

tjoli commented Oct 17, 2018

brada4 commented Oct 17, 2018

brada4 commented Oct 17, 2018

brada4 commented Oct 17, 2018

brada4 commented Oct 19, 2018

brada4 commented Oct 19, 2018

martin-frbg commented Oct 19, 2018

brada4 commented Oct 19, 2018

brada4 commented Oct 19, 2018

brada4 commented Oct 20, 2018

tjoli commented Oct 20, 2018 • edited Loading

brada4 commented Oct 20, 2018

tjoli commented Oct 20, 2018

tjoli commented Oct 20, 2018

brada4 commented Oct 20, 2018

brada4 commented Oct 20, 2018

martin-frbg commented Oct 20, 2018 • edited Loading

brada4 commented Oct 21, 2018

tjoli commented Oct 22, 2018

brada4 commented Oct 22, 2018

brada4 commented Oct 23, 2018 • edited Loading

martin-frbg commented Jan 27, 2024

tjoli commented Oct 16, 2018 •

edited

Loading

tjoli commented Oct 16, 2018 •

edited

Loading

tjoli commented Oct 16, 2018 •

edited

Loading

tjoli commented Oct 20, 2018 •

edited

Loading

martin-frbg commented Oct 20, 2018 •

edited

Loading

brada4 commented Oct 23, 2018 •

edited

Loading