Relatively poor performance of ZHEEV (ZLASR) #710

martin-frbg · 2022-08-14T16:26:33Z

Some time ago in OpenMathLib/OpenBLAS#1560 it was observed that the performance of ZHEEV is markedly poorer than in MKL, and by now I am reasonably certain that OpenBLAS is not at fault. The primary problem appears to be the slowness of the ZLASR call, which interestingly does not appear to feature in perf traces of a corresponding MKL-based testcase at all.

Does anybody happen to work on this topic currently ? I know PR #83 added a 2stage implementation (apparently by someone from the Dongarra group) in late 2016, but this does "not yet" implement computation of the eigenvectors and there have been little functional changes since then.
The implementation of ZLASR on the other hand appears to predate LAPACK 3.2.1.

angsch · 2022-08-14T18:06:03Z

ZLASR applies one plane rotation at a time. A common technique is to apply loop blocking and apply several Givens rotations simultaneously.

For example, let's look at the first case realized in ZLASR, pivot == 'V', direct == 'F'. The code computes

With loop blocking by a factor of 2, this becomes

The product of the two plane rotations and the matrix is equivalent to

The products of the components of the plane rotations [for example, in the first row s(j+1)c(j) and s(j+1)s(j)] can be computed outside of the update loop and stored in a temporary variable. A Fortran code could look like

               DO 20 j = 1, m - 1, 2
                   ! Assume m-1 is even
                   tmp1 = s(j+1)*c(j)
                   tmp2 = s(j+1)*s(j)
                   tmp3 = c(j+1)*c(j)
                   tmp4 = c(j+1)*s(j)
                   stemp = s( j )
                      DO 10 i = 1, n
                         a2 = a( j+2, i )
                         a1 = a( j+1, i )
                         a0 = a( j, i )
                         a( j+2, i ) = c(j+1)*a2 - tmp1*a1 + tmp2*a0
                         a( j+1, i ) = s(j+1)*a2 + tmp3*a1 - tmp4*a0
                         a( j, i ) = s(j)*a1 + c(j)*a0
    10                CONTINUE
    20       CONTINUE

This saves traversals over the matrix and flops.

martin-frbg · 2022-10-07T08:26:38Z

Thanks for the detailed explanation - will try to create a replacement zlasr for OpenBLAS when I find the time. (Sorry for the very late response)

martin-frbg closed this as completed Oct 7, 2022

martin-frbg mentioned this issue Jan 27, 2024

OpenBLAS becomes single-threaded OpenMathLib/OpenBLAS#1820

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Relatively poor performance of ZHEEV (ZLASR) #710

Relatively poor performance of ZHEEV (ZLASR) #710

martin-frbg commented Aug 14, 2022

angsch commented Aug 14, 2022

martin-frbg commented Oct 7, 2022

Relatively poor performance of ZHEEV (ZLASR) #710

Relatively poor performance of ZHEEV (ZLASR) #710

Comments

martin-frbg commented Aug 14, 2022

angsch commented Aug 14, 2022

martin-frbg commented Oct 7, 2022