Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relatively poor performance of ZHEEV (ZLASR) #710

Closed
martin-frbg opened this issue Aug 14, 2022 · 2 comments
Closed

Relatively poor performance of ZHEEV (ZLASR) #710

martin-frbg opened this issue Aug 14, 2022 · 2 comments

Comments

@martin-frbg
Copy link
Collaborator

Some time ago in OpenMathLib/OpenBLAS#1560 it was observed that the performance of ZHEEV is markedly poorer than in MKL, and by now I am reasonably certain that OpenBLAS is not at fault. The primary problem appears to be the slowness of the ZLASR call, which interestingly does not appear to feature in perf traces of a corresponding MKL-based testcase at all.

Does anybody happen to work on this topic currently ? I know PR #83 added a 2stage implementation (apparently by someone from the Dongarra group) in late 2016, but this does "not yet" implement computation of the eigenvectors and there have been little functional changes since then.
The implementation of ZLASR on the other hand appears to predate LAPACK 3.2.1.

@angsch
Copy link
Collaborator

angsch commented Aug 14, 2022

ZLASR applies one plane rotation at a time. A common technique is to apply loop blocking and apply several Givens rotations simultaneously.

For example, let's look at the first case realized in ZLASR, pivot == 'V', direct == 'F'. The code computes
image

With loop blocking by a factor of 2, this becomes
image

The product of the two plane rotations and the matrix is equivalent to
image

The products of the components of the plane rotations [for example, in the first row s(j+1)c(j) and s(j+1)s(j)] can be computed outside of the update loop and stored in a temporary variable. A Fortran code could look like

               DO 20 j = 1, m - 1, 2
                   ! Assume m-1 is even
                   tmp1 = s(j+1)*c(j)
                   tmp2 = s(j+1)*s(j)
                   tmp3 = c(j+1)*c(j)
                   tmp4 = c(j+1)*s(j)
                   stemp = s( j )
                      DO 10 i = 1, n
                         a2 = a( j+2, i )
                         a1 = a( j+1, i )
                         a0 = a( j, i )
                         a( j+2, i ) = c(j+1)*a2 - tmp1*a1 + tmp2*a0
                         a( j+1, i ) = s(j+1)*a2 + tmp3*a1 - tmp4*a0
                         a( j, i ) = s(j)*a1 + c(j)*a0
    10                CONTINUE
    20       CONTINUE

This saves traversals over the matrix and flops.

@martin-frbg
Copy link
Collaborator Author

Thanks for the detailed explanation - will try to create a replacement zlasr for OpenBLAS when I find the time. (Sorry for the very late response)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants