Fast, efficient implementation of SOC #4933
Conversation
|
Is it possible to move this optimization to NLPP batched code path where NLPP ratios of dets (without Jastrow) are batched? qmcpack/src/QMCWaveFunctions/SPOSet.h Line 205 in a82117f |
| W.makeMove(iel, deltaV_[iq]); | ||
| psi.getRowMSpinDecomposed(W, iel, up_row, dn_row); | ||
| RealType jratio = psi.evaluateJastrowRatio(W, iel); | ||
| W.rejectMove(iel); |
There was a problem hiding this comment.
To me, any NLPP code paths that require accept and reject are big no no. It can always be reformulated without need of accept/reject.
There was a problem hiding this comment.
Yea I agree, but I wanted to get it working first.
There was a problem hiding this comment.
Actually, on second thought we aren't doing anything costly like updating the inverse or anything like that by accepting/rejecting moves. the TWFwrapper being used here just grabs the jastrow ratio at each quadrature point and grabs the SPOSet values at the quadrature points. So I'm not sure there is much benefit from moving to using VirtualParticles
evaluateDetRatios is used to give you psi(q)/psi(r). But I need it in a slightly different form in order to remove the spin integration and get what I want. Will send you some notes to take a look |
rcclay
left a comment
There was a problem hiding this comment.
LGTM. This is a very straightforward generalization of what was already done in NonLocalECPotential for the fast forces. There are probably opportunities for optimization (better use of batching, multiwalker interfaces, etc.), but those should probably be left for a future, more targeted PR.
|
Test this please |
|
FYI, the more modern compilers are complaining e.g. |
| pAttrib.add(pbc, "pbc", {"yes", "no"}); | ||
| pAttrib.add(forces, "forces", {"no", "yes"}); | ||
| pAttrib.add(physicalSO, "physicalSO", {"yes", "no"}); | ||
| pAttrib.add(fast_so, "fastSO", {"yes", "no"}); |
There was a problem hiding this comment.
"fastSO" is hard to understand what is fast. Could you find a more understandable name?
There was a problem hiding this comment.
Switched to spin_integrator. Can choose between "exact" and "simpson" which will use the old simpsos rule to integrate
This should be fixed now. There was a type mismatch in my unit test |
|
Not sure I understand why linux (Clang14-NoMPI-UBSan-Real) is failing. Looks like test_parser, which is something I didn't touch |
|
Test this please |
ye-luo
left a comment
There was a problem hiding this comment.
Temporary blocking merge. Will review this PR in 2 days.
|
What's the status of this? |
|
A huge improvement and good to get in. Need to identify what is needed for batched code execution in followup work. |
|
Just checking on the status of this. I assume the rework is being done with the limited time available. Any gotchas? |
|
Just no time. I reworked it using Ye's approach on my end, and got the unit test passing for the SOECPotential. Just need to finish adapting the unit test for the changes to the SpinorSet. Will hopefully get to it today |
|
Sorry for taking so long, this should be up to date with Ye's recommendations |
|
There is a great deal of per walker data stashed in SpinorSet and the force wrapper. In fact none of this really looks like it took into account the batched design. These should really have been crowd scope memory resources i.e setup through resource sets (which will also allow use to track memory use much more easily). Evaluations should be over these whole blocks and not broken up by walker unless necessary. Certainly the end goal should be a single kernel launch to deal with the force calculations over a block of walkers. The way this is setup right now assuming a Wrapper object per walker and a SpinsorSet object per walker that a bunch of workspace or additional state can be stashed in really isn't great. Extracting pointers from each walker and its inheritance pile is not a good design and was just a hack to avoid refactoring too much at once. |
|
Thanks Cody. I am hopeful that this gets merged after the CI passes. The real build needs fixing and there are sanitizer failures. Peter: I think we should see where we end up with e.g. real world memory usage and performance for observables before tackling design issues here. |
| if (spin_integrator == "exact") | ||
| { | ||
| app_log() << " Using fast SOECP evaluation. Spin integration is exact" << std::endl; | ||
| apot->setFastEvaluation(true); |
There was a problem hiding this comment.
Do you need fast and slow code paths both active in a single object?
If not, Can this option be passed via constructor?
There was a problem hiding this comment.
will fix this next, want to make sure tests pass first.
There was a problem hiding this comment.
I fixed the complex build. When you remove this setting, please make sure
being reactive to the fast algorithm selection and set
total_knots_ properly.
There was a problem hiding this comment.
This looks to be taken care of now from what I can tell. Should fix the failing test as well
src/QMCHamiltonians/SOECPotential.h
Outdated
| ResourceHandle<SOECPotentialMultiWalkerResource> mw_res_handle_; | ||
|
|
||
| //flag to use fast evaluation | ||
| bool use_fast_evaluation_; |
There was a problem hiding this comment.
If we set its value via constructor, please add const.
|
Test this please |
|
Test this please |
|
Test this please |
Please review the developer documentation
on the wiki of this project that contains help and requirements.
Proposed changes
This PR fixes a longstanding problem with SOC. Traditionally, due to the nature of the TrialWaveFunction which doesn't really exploit the fact that we only really ever use Jastrows * Determinants (or multidets), I had to do the spin integration numerically. This was quite inefficient, since we were essentially doing 8x the work since the default number of spin quadrature points was 8.
The spin integration can be done exactly if you can exploit the fact that the wavefunction is a exp(J) * Det(spinors), which can be further simplified into the ratio jastrows * inverse transpose slater matrix * spinors. Since the spinors are phi_up * exp(is) + phi_dn * exp(-is) you can pull the spin integration inside and do those terms exactly. That removes the need for the spin quadrature all together.
This is done by utilizing the TWFFastForceWrapper which gives convenient access to the appropriate quantities. A simple test problem with 16 walkers, 1000 blocks, 1 thread on my laptop shows the expected speedup.
Traditional slow evaluation: 103.8466 s
New implementation: 12.3576 s
speedup: 8.4
Right now, this only works with legacy I think. Need to look at mw_ APIs in the future. Had to add some new features to the TWFFastDerivWrapper and the SpinorSet, but everything is unit tested.
What type(s) of changes does this code introduce?
Delete the items that do not apply
Does this introduce a breaking change?
What systems has this change been tested on?
M1 mac
Checklist
Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is
simply a reminder of what we are going to look for before merging your code.