
LOBPCG stability: Suggestion of improvements #246

Open
mfherbst opened this issue May 20, 2019 · 38 comments · May be fixed by #247

@mfherbst

We currently use the lobpcg solver from this Julia package in our DFTK.jl electronic-structure code. In our experience, the current Cholesky-QR-based LOBPCG implementation can become numerically unstable and sometimes even produces spurious eigenvalues.

As an illustrative example, run the following Julia code:

using IterativeSolvers, LinearAlgebra

A = [1.25, 1.5, 1.5, 1.25, 1.5, 1.25, 1.5, 0, 1.13, 1.13, 1.5, 1.13, 1.5, 1.5, 1.13]
res = lobpcg(Diagonal(A), false, 5)

In my experiments with this simple test problem, it fails roughly every second time with a PosDefException from the Cholesky factorization. In the remaining cases, it returns two approximations to the zero eigenvalue at res.λ[1] and res.λ[2]. In all such instances I further observed the first two eigenvectors to be non-orthogonal. That is to say,

dot(res.X[:, 1], res.X[:, 2])

returns a clearly non-zero answer, on the order of 1e-4.
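
For reference, a small script along the following lines (a sketch; it assumes the same A as above and relies on the randomized default initial guess, so each run differs) reproduces these failure statistics:

using IterativeSolvers, LinearAlgebra

A = [1.25, 1.5, 1.5, 1.25, 1.5, 1.25, 1.5, 0, 1.13, 1.13, 1.5, 1.13, 1.5, 1.5, 1.13]

# Count PosDefExceptions over repeated solves and flag runs where the
# first two Ritz vectors come out visibly non-orthogonal.
failures = count(1:100) do _
    try
        res = lobpcg(Diagonal(A), false, 5)
        abs(dot(res.X[:, 1], res.X[:, 2])) > 1e-8 && @warn "non-orthogonal eigenvectors"
        false
    catch err
        err isa PosDefException || rethrow()
        true
    end
end
@show failures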

A couple of strategies to improve the numerical stability of LOBPCG have been discussed in the literature, e.g. in https://arxiv.org/abs/1704.07458. As far as I can judge from reading through the code, the suggested basis truncation and selection strategies are not yet present, and it might be advantageous to look into implementing them.

@mfherbst mfherbst changed the title LOBPCG stability improvements LOBPCG stability: Suggestion of improvements May 20, 2019
@mfherbst
Author

mfherbst commented May 21, 2019

Note: The original scipy implementation does not suffer from this problem. For example,

from scipy.sparse.linalg import lobpcg
from scipy.linalg import orth
import numpy as np

A = np.array([1.25, 1.5, 1.5, 1.25, 1.5, 1.25, 1.5, 0, 1.13, 1.13, 1.5, 1.13, 1.5, 1.5, 1.13])
lobpcg(np.diag(A), orth(np.random.randn(len(A), 5)), largest=False)

returns the correct result every time I have tried it.

@mohamed82008
Member

mohamed82008 commented May 21, 2019

Seems like a duplicate of #223.

@mohamed82008
Member

I am traveling this week, but I will give it a look next week. Thanks for the issue.

@mfherbst
Author

mfherbst commented May 21, 2019

Thanks for looking into this. Indeed, this shows some similarity with #223, e.g. block size 5 is a magic number: 4 or 6 work much better with the A defined above.

I can assure you, however, that we see similarly frequent problems in our application code, where the matrices are much larger relative to the block size than in the example shown. I'll try to come up with a reproducible example for you.

@mfherbst
Author

I came up with a larger example that still illustrates the PosDefException problem; see this gist.

On my machine this has a success rate of about 97%. That is not good enough for our needs, since we run on the order of hundreds, potentially even thousands, of such eigensolves.

@mohamed82008
Member

OK, this is interesting. I managed to reproduce it with a specific seed. I will look into it.

@lobpcg

lobpcg commented Jun 4, 2019

@mohamed82008 see #246 (comment), but the scipy version also uses Cholesky, so there's nothing wrong with Cholesky in this example. It looks like there's just a bug in your Julia code, unrelated to your QR "fix". You may want to just compare your code against the latest scipy version rather than adding new QR.

@mohamed82008
Member

It is definitely possible that it is a bug in my code, but given the number of passing test cases, I am very curious where I might have gone wrong. I can do a close comparison when I get some time.

@lobpcg

lobpcg commented Jun 5, 2019

@mohamed82008 this may be a bug in the core LOBPCG algorithm after all, also found by @mfherbst in scipy; see scipy/scipy#10258 (comment)

Let me think how to fix it easily...

@mfherbst
Author

mfherbst commented Jun 5, 2019

@mohamed82008 (related to scipy/scipy#10258) I added a Julia version to the example repository mfherbst/issue_scipy_lobpcg_guess, for convenience and out of curiosity. As it turns out, and as suspected by @lobpcg, the Julia implementation of LOBPCG also fails on this example, in fact both with and without the guess vectors.

@mfherbst
Author

mfherbst commented Jun 5, 2019

The naive Julia implementation referenced in scipy/scipy#10258 actually does pretty well on the mfherbst/issue_scipy_lobpcg_guess example. Even converging down to a 1e-14 tolerance in a reasonable number of iterations (some 250 without a preconditioner and 100 with one) is possible.

@mohamed82008
Member

@mfherbst I think full orthogonalization of the basis with QR is an almost sure-fire way to make LOBPCG as stable as the QR algorithm (there is a loophole when using constraints). The only problem is the complexity the QR approach introduces in the form of additional matrix-matrix multiplications, e.g. https://gist.github.com/antoine-levitt/f8ac3637c5d5d778a448780172e63e28#file-lobpcg-jl-L30, which is why I suggested it as a final measure in #247 if nothing else works. The QR "fix" in #247 is actually a less extreme measure than full orthogonalization, as it only does matrix-matrix multiplications of P and/or R but not X, for example.

@lobpcg

lobpcg commented Jun 5, 2019

QR is also very bad for parallel execution, which is the actual reason I avoid it in the original LOBPCG algorithm, the version implemented in many parallel libraries.

@mohamed82008
Member

mohamed82008 commented Jun 5, 2019

I think the following approach has potential. Let U be the basis [X P R].

  1. Remove the components of U along the constraints matrix C.
  2. Form the Gram matrix gramB = U' * B * U.
  3. Find the "compact" eigenvalue decomposition of gramB, eliminating the nullspace basis: gramB = V_c * L_c * V_c'.
  4. Update U via U = U * V_c * L_c^(-1/2), so that U' * B * U == I.
  5. Efficiently update A*U and B*U by right-multiplying them by V_c * L_c^(-1/2).

Note that the number of columns of U, A*U and B*U can decrease in steps 4 and 5. The above guarantees U' * B * U == I without introducing new basis vectors that may conflict with the constraints matrix C.

Similarly, if using QR, I think we need to make sure not to include any additional basis vectors from Q that are not spanned by the matrix we are trying to orthogonalize, i.e. to take only the columns whose corresponding diagonal entry of R is non-zero. Such additional basis vectors can conflict with the constraint matrix C, which will backfire at the end.

The eigenvalue decomposition approach above avoids additional matrix-matrix multiplications involving A or B and is more parallelizable, so it should be more efficient without sacrificing stability.
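
A minimal Julia sketch of steps 2 to 5 (hypothetical code with my naming; step 1, the constraint projection, is omitted and the drop tolerance is illustrative):

using LinearAlgebra

# B-orthonormalize the basis U via an eigendecomposition of its Gram
# matrix, dropping (near-)nullspace directions instead of calling QR.
function svqb_ortho(U, AU, BU; tol=1e-12)
    gramB = Hermitian(U' * BU)             # step 2: gramB = U' * B * U
    L, V = eigen(gramB)                    # step 3: gramB = V * Diagonal(L) * V'
    keep = L .> tol * maximum(abs, L)      # eliminate the nullspace basis
    W = V[:, keep] * Diagonal(1 ./ sqrt.(L[keep]))  # W = V_c * L_c^(-1/2)
    return U * W, AU * W, BU * W           # steps 4-5: (U*W)' * B * (U*W) == I
end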

Comments? @lobpcg @antoine-levitt

@antoine-levitt

As soon as you discard information in this way, you will slow down convergence (essentially downgrading to LOBPG, i.e. without the P) for precisions below ~1e-8 (try it, you'll see), which matter for a number of applications.

@mohamed82008
Member

mohamed82008 commented Jun 5, 2019

Well, if the whole P matrix disappears due to zero eigenvalues, LOBPCG basically turns into a preconditioned gradient descent algorithm rather than a preconditioned conjugate gradient algorithm, so convergence can take a hit. But increasing the subspace dimension is not computationally trivial either, so perhaps the cost of the additional LOBPG iterations (without the P matrix, the "conjugate" dropped) can be compensated by the savings from not expanding the basis via QR, at least in the unconstrained case. Besides, if a good preconditioner is used, the difference in convergence speed in terms of iteration counts may be even less serious. I think it is worth a try.

@mohamed82008
Member

Perhaps both options can be provided once the constraint issue for QR is figured out, not that I am volunteering to implement both! If this turns out to be too much work, I may not have the time to do it any time soon; this is a somewhat busy period for me. But let's see.

@mohamed82008
Member

The constrained case for QR seems simple to handle. We just need to check that any additional column of Q we include in the basis, i.e. one whose corresponding diagonal element of R is 0, has a component orthogonal to C, and to orthogonalize it against C before adding it.
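
A hypothetical sketch of that check (my naming; it assumes the constraint block C is already B-orthonormal and that BC = B * C is available):

using LinearAlgebra

# Project a candidate basis vector q against the constraint block C and
# admit it only if a non-negligible component orthogonal to C remains.
function c_orthogonalize(q, C, BC; droptol=1e-8)
    q = q - C * (BC' * q)            # remove the components of q along C
    nq = norm(q)
    nq < droptol && return nothing   # no component orthogonal to C: discard q
    return q / nq
end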

@mohamed82008
Member

So it seems a method similar to the one proposed above was also proposed in https://arxiv.org/pdf/1704.07458.pdf under the name svqbDrop. I haven't read the paper in detail, but it seems to have interesting ideas.

@lobpcg

lobpcg commented Jun 6, 2019

So it seems a method similar to the one proposed above was also proposed in https://arxiv.org/pdf/1704.07458.pdf under the name svqbDrop. I haven't read the paper in detail, but it seems to have interesting ideas.

We discussed this paper already, e.g., #247 (comment)

@lobpcg

lobpcg commented Jun 6, 2019

Well, if the whole P matrix disappears due to zero eigenvalues, LOBPCG basically turns into a preconditioned gradient descent algorithm rather than a preconditioned conjugate gradient algorithm, so convergence can take a hit. But increasing the subspace dimension is not computationally trivial either, so perhaps the cost of the additional LOBPG iterations (without the P matrix, the "conjugate" dropped) can be compensated by the savings from not expanding the basis via QR, at least in the unconstrained case. Besides, if a good preconditioner is used, the difference in convergence speed in terms of iteration counts may be even less serious. I think it is worth a try.

Dropping P for the rest of the iterations is extreme (I do it in MATLAB under some conditions) and of course results in a dramatic slowdown, as already discussed in #247 (comment).

To fix the current test case, you need to drop P only on iterations where P is linearly dependent, as I now do in https://github.com/lobpcg/blopex/blob/master/blopex_tools/matlab/lobpcg/lobpcg.m, but not yet in scipy. Please see the updated scipy/scipy#10258 (comment).
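
A rough Julia sketch of that per-iteration drop (hypothetical naming; not the BLOPEX or scipy code):

using LinearAlgebra

# Build the trial subspace [X R P]; if the Gram matrix of the full block
# fails Cholesky, P is (nearly) linearly dependent, so drop it for this
# iteration only and try it again on the next one.
function trial_subspace(X, R, P)
    S = [X R P]
    F = cholesky(Hermitian(S' * S); check=false)
    issuccess(F) && return S
    return [X R]
end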

@lobpcg

lobpcg commented Jun 7, 2019

Let me ping an expert, @joseeroman, in case he has some advice.

@lobpcg

lobpcg commented Jun 8, 2019

@mohamed82008
Member

I have my doubts about this "fix" since it doesn't really change the overall basis. I would be very surprised if it actually fixed all of @mfherbst's test cases.

@lobpcg

lobpcg commented Jun 12, 2019

I have updated my comment: to fix the current test case, you need to drop P on iterations where P is linearly dependent, as I now do in https://github.com/lobpcg/blopex/blob/master/blopex_tools/matlab/lobpcg/lobpcg.m, but not yet in scipy. Please see scipy/scipy#10258 (comment).

@stevengj
Member

When I was implementing a related algorithm many years ago (https://www.osapublishing.org/oe/fulltext.cfm?uri=oe-8-3-173&id=63584), a cruder predecessor of LOBPCG (at the time I didn't realize you could perform the line minimizations by solving a Ritz problem), I found that it was sufficient to re-orthogonalize occasionally (not every step). Maybe you could estimate the condition number from the Cholesky factors (which can be done cheaply, I think?) in order to decide whether to re-orthogonalize?

In that work I used the polar decomposition A = QP. This can be computed relatively straightforwardly in parallel, since P = sqrt(A'A) is a small matrix: you just need a parallel matrix transpose and a reduction for the Gram matrix A'A. Maybe this is related to @mohamed82008's suggestion above.
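
For illustration, a minimal sketch of that polar-factor orthonormalization (hypothetical naming; X here plays the role of the tall matrix A above, to avoid clashing with the system matrix):

using LinearAlgebra

# Orthonormalize the columns of a tall matrix X via the polar
# decomposition X = Q * P: only the small Gram matrix X'X needs a dense
# factorization, so the parallel cost is one reduction plus one matmul.
function polar_ortho(X)
    P = sqrt(Hermitian(X' * X))   # Hermitian polar factor P = sqrt(X'X)
    return X / P                  # Q = X * inv(P) has orthonormal columns
end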

@antoine-levitt

Oh wow, that's an old thread. Since then we ended up writing our own implementation in https://github.com/JuliaMolSim/DFTK.jl/blob/master/src/eigen/lobpcg_hyper_impl.jl. It uses a number of tricks for maximal stability, we haven't been able to make it break yet (and not for lack of trying :-)), and it keeps the full convergence rate even when converging close to machine precision, which I have never been able to do with any other LOBPCG implementation. It also fixes a very tricky issue whereby locking degraded convergence, which took me a good while to figure out; I think no other implementation has that fix. I wanted to add some other tricks (like avoiding a full diagonalization of X'AX by using perturbation theory close to convergence, which gives quite a nice speedup in certain circumstances) and possibly make a paper out of it, but other things got in the way. If somebody is interested in picking this up again, I'd be happy to share.

The choice there was to use only Cholesky for orthogonalization, because it's the fastest in a parallel setting (it's like the polar decomposition, but Choleskys are faster than square roots). It's also very unstable, so we do two things: 1) we add something to the diagonal when the Cholesky fails, and 2) we re-orthogonalize when needed, indeed using the condition number of the Cholesky factors as you suggest, @stevengj (see https://github.com/JuliaMolSim/DFTK.jl/blob/master/src/eigen/lobpcg_hyper_impl.jl#L77).
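
A condensed sketch of those two tricks (hypothetical code in the spirit of lobpcg_hyper_impl.jl, not a copy of it; the shift growth and the 1e8 threshold are illustrative):

using LinearAlgebra

# Orthonormalize X in place via Cholesky-QR, shifting the diagonal when
# the factorization fails and re-orthogonalizing when the triangular
# factor is badly conditioned.
function chol_ortho!(X; maxpasses=3)
    for _ in 1:maxpasses
        G = Hermitian(X' * X)
        shift = zero(real(eltype(X)))
        F = cholesky(G; check=false)
        while !issuccess(F)
            # Trick 1: add something to the diagonal when Cholesky fails.
            shift = max(2 * shift, eps(real(eltype(X))) * norm(G))
            F = cholesky(G + shift * I; check=false)
        end
        rdiv!(X, F.U)   # X ← X / U, so that X'X ≈ I
        # Trick 2: take another orthogonalization pass only when the
        # Cholesky factor was badly conditioned.
        cond(F.U, 1) < 1e8 && break
    end
    return X
end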

@ViralBShah
Contributor

Would it be useful for others to bring your implementation into this package?

@mfherbst
Author

mfherbst commented Nov 18, 2020

As I said on the Discourse thread, our implementation at its current stage is not a complete drop-in replacement for the lobpcg of this package (e.g. generalised eigensolves are not tested very thoroughly, only smallest eigenvalues are implemented, etc.). So getting it to fully replace the existing implementation would be a bit of work. Coexistence might be a little easier short-term, I suppose.

Other than that, it would lower the maintenance burden on our side, so I would not mind helping to get it integrated elsewhere 😄. What do you think @antoine-levitt?

@lobpcg

lobpcg commented Nov 18, 2020

Since my last comment in this thread, I updated
https://github.com/scipy/scipy/blob/v1.5.4/scipy/sparse/linalg/eigen/lobpcg/lobpcg.py
a year ago to make it more stable; e.g., it now also runs in single precision to the full attainable accuracy. My tricks are different from those @antoine-levitt describes above, so stability and performance should also differ. It would be interesting for someone to compare, and maybe combine some of the tricks...

@antoine-levitt

antoine-levitt commented Nov 18, 2020

I already put it up at https://github.com/antoine-levitt/LOBPCG last year. It's the version that DFTK's is based on. Anybody is welcome to do what they want with it. It's definitely more stable than the one in IterativeSolvers, and therefore I would recommend it as a first attempt. Generalized eigenvalue problems are OK (definitely more stable than the implementation in IterativeSolvers) as long as B is not horribly badly conditioned; there's an alternative tweak that doesn't assume B to be relatively well conditioned, but it needs additional B multiplications for stability, so I didn't put it in by default.

I did not benchmark extensively outside of DFTK's operating point, so I can't say for sure that the implementation is unambiguously better than the one in IterativeSolvers regarding performance: there might be more matmuls or more memory usage in our version; to be checked.

LOBPCG, like most Krylov methods, is a very delicate algorithm that is fundamentally unstable, and implementations have to walk a fine line between stability and efficiency, so I think it's fine to have several implementations coexist. If someone does the work of comparing implementations and finds that one (possibly after tweaks) is unambiguously better than the others, then it might make sense to keep just one, or possibly two Pareto-optimal implementations, one geared more towards stability and one more towards performance. I can help, but I don't have time to actually do that non-trivial amount of work.

@mohamed82008
Member

Let's get a GSoC student on this.

@mohamed82008
Member

If there were a write-up somewhere describing the algorithm to be implemented, it would make this much easier for the student. Otherwise we would need a student more mathematically adept than the average GSoC participant to make sense of the arguments and scripts posted here. If a paper can come out of it, perhaps a graduate student might also be interested in volunteering and doing all the comparisons and benchmarks.

@lobpcg

lobpcg commented Nov 19, 2020

The problem is that there are several papers describing different competing tricks and, correspondingly, several different actual implementations of LOBPCG in various packages; see https://en.wikipedia.org/wiki/LOBPCG . In exact arithmetic, different implementations should produce exactly the same results, although at quite different computational costs and with various memory requirements. In double precision, and especially in single precision, implementations vary greatly in numerical stability and may simply fail depending on the input. There is no universally accepted "best" implementation. Arguably, the Python version https://github.com/scipy/scipy/blob/v1.5.4/scipy/sparse/linalg/eigen/lobpcg/lobpcg.py is the most used and the most tested by various users. If you just want to modify the existing Julia version to literally follow every detail of this Python version, it is a pure coding exercise, easy for a CS student: basically translating Python into Julia. If the goal is first to determine the "best" version to implement in Julia, it is a difficult, possibly open-ended, PhD-level math/CS research project.

@mohamed82008
Member

Either way, we can put it on the GSoC project list and depending on the level of the student, we can do good old scope creep.

@antoine-levitt

Agreed with Andrew here: the pure coding exercise is not very interesting, as one can just use PyCall to call scipy's version from Julia. The interesting part requires serious numerical skills.

@mfherbst
Author

A GSoC student for this would be pretty cool. Let me know if there is something I can do to help when it comes to it!

@ViralBShah
Contributor

I think the main thing to do is to list the project here and add yourself as a mentor (with any other potential co-mentors):

https://julialang.org/jsoc/projects/

I think students will soon start visiting this list, so it's best to put it up immediately.
