Fast, efficient implementation of SOC #4933

Merged
prckent merged 27 commits into QMCPACK:develop from camelto2:fast_soc
Apr 12, 2024
@camelto2
Contributor

@camelto2 camelto2 commented Feb 20, 2024

Please review the developer documentation
on the wiki of this project that contains help and requirements.

Proposed changes

This PR fixes a longstanding inefficiency in the SOC evaluation. Because TrialWaveFunction does not exploit the fact that in practice we only ever use Jastrows * Determinants (or multideterminants), I previously had to do the spin integration numerically. This was quite inefficient: with the default of 8 spin quadrature points, we were essentially doing 8x the work.

The spin integration can be done exactly if you exploit the fact that the wavefunction has the form exp(J) * Det(spinors), so that the ratio simplifies to Jastrow ratio * inverse transpose Slater matrix * spinors. Since the spinors are phi_up * exp(is) + phi_dn * exp(-is), you can pull the spin integration inside and do those terms exactly. That removes the need for the spin quadrature altogether.
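The reduction described above can be illustrated with a toy model. The sketch below is not the PR's code: `ratio_at_spin`, `integrate_quadrature`, `integrate_exact`, and the kernel exp(-i s) are hypothetical stand-ins chosen for illustration. It assumes the determinant ratio for a single-electron move is a dot product between one column of the inverse Slater matrix and the new spinor row, so the spin dependence enters only through exp(is) and exp(-is) and the spin integral can be taken analytically.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;
using Vec     = std::vector<Complex>;

// Plain (unconjugated) dot product of two equally sized complex vectors.
Complex dot(const Vec& a, const Vec& b)
{
  assert(a.size() == b.size());
  Complex sum = 0.0;
  for (std::size_t j = 0; j < a.size(); ++j)
    sum += a[j] * b[j];
  return sum;
}

// Toy determinant ratio at spin s: the moved electron's row is
// up[j] * exp(i s) + dn[j] * exp(-i s), contracted with a column of the inverse matrix.
Complex ratio_at_spin(const Vec& minv_col, const Vec& up, const Vec& dn, double s)
{
  const Complex eis = std::polar(1.0, s);
  return dot(minv_col, up) * eis + dot(minv_col, dn) * std::conj(eis);
}

// Numerical route: (1/2pi) * integral over [0, 2pi) of exp(-i s) * ratio(s),
// using nquad uniformly spaced points. Costs one ratio evaluation per point.
Complex integrate_quadrature(const Vec& minv_col, const Vec& up, const Vec& dn, int nquad)
{
  const double two_pi = 2.0 * std::acos(-1.0);
  Complex sum = 0.0;
  for (int k = 0; k < nquad; ++k)
  {
    const double s = two_pi * k / nquad;
    sum += std::polar(1.0, -s) * ratio_at_spin(minv_col, up, dn, s);
  }
  return sum / static_cast<double>(nquad);
}

// Analytic route: exp(-i s) * (A exp(i s) + B exp(-i s)) averages to A,
// so only the up-channel contraction survives and no quadrature is needed.
Complex integrate_exact(const Vec& minv_col, const Vec& up)
{
  return dot(minv_col, up);
}
```

In this toy setting the quadrature route with 8 points performs eight ratio evaluations where the analytic route performs a single contraction, which is the kind of saving the description attributes to dropping the spin quadrature.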

This is done using the TWFFastDerivWrapper, which gives convenient access to the appropriate quantities. A simple test problem with 16 walkers, 1000 blocks, and 1 thread on my laptop shows the expected speedup.

Traditional slow evaluation: 103.8466 s
New implementation: 12.3576 s
speedup: 8.4

As far as I can tell, this currently only works with the legacy code path; the mw_ APIs need to be looked at in the future. I had to add some new features to the TWFFastDerivWrapper and the SpinorSet, but everything is unit tested.

What type(s) of changes does this code introduce?

Delete the items that do not apply

  • New feature

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

M1 mac

Checklist

Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is
simply a reminder of what we are going to look for before merging your code.

  • Yes. This PR is up to date with the current state of 'develop'
  • Yes. Code added or changed in the PR has been clang-formatted
  • Yes. This PR adds tests to cover any new code, or to catch a bug that is being fixed
  • Yes. Documentation has been added (if appropriate)

@camelto2 camelto2 requested review from rcclay and ye-luo and removed request for rcclay February 20, 2024 01:34
@ye-luo
Contributor

ye-luo commented Feb 20, 2024

Is it possible to move this optimization to NLPP batched code path where NLPP ratios of dets (without Jastrow) are batched?
I mean specialize

virtual void evaluateDetRatios(const VirtualParticleSet& VP,

W.makeMove(iel, deltaV_[iq]);
psi.getRowMSpinDecomposed(W, iel, up_row, dn_row);
RealType jratio = psi.evaluateJastrowRatio(W, iel);
W.rejectMove(iel);
Contributor

To me, any NLPP code paths that require accept and reject are a big no-no. They can always be reformulated without the need for accept/reject.

Contributor Author

Yeah, I agree, but I wanted to get it working first.

Contributor Author

Actually, on second thought, accepting/rejecting moves here does not do anything costly such as updating the inverse. The TWF wrapper used here just grabs the Jastrow ratio and the SPOSet values at each quadrature point, so I'm not sure there is much benefit in moving to VirtualParticles.
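The two styles under discussion can be contrasted in a toy setting. The sketch below uses hypothetical types and functions (`ToyParticleSet`, `toy_value`, `eval_with_move`, `eval_virtual`) and is not QMCPACK code. It only illustrates that evaluating at a trial position through a temporary move followed by a rejection gives the same number as evaluating at the trial position directly, while the second form never mutates the particle set.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a particle set: 1D positions plus a pending move.
struct ToyParticleSet
{
  std::vector<double> pos;
  std::size_t active = 0;
  double saved       = 0.0;

  // Tentatively displace particle iel, remembering its old position.
  void makeMove(std::size_t iel, double delta)
  {
    assert(iel < pos.size());
    active = iel;
    saved  = pos[iel];
    pos[iel] += delta;
  }

  // Undo the tentative move.
  void rejectMove() { pos[active] = saved; }
};

// Hypothetical single-particle quantity evaluated at position x.
double toy_value(double x) { return 1.0 / (1.0 + x * x); }

// Style 1: mutate the particle set, evaluate, then restore it.
double eval_with_move(ToyParticleSet& P, std::size_t iel, double delta)
{
  P.makeMove(iel, delta);
  const double v = toy_value(P.pos[iel]);
  P.rejectMove();
  return v;
}

// Style 2: evaluate at a virtual position; the particle set is read-only.
double eval_virtual(const ToyParticleSet& P, std::size_t iel, double delta)
{
  return toy_value(P.pos[iel] + delta);
}
```

Whether the second style pays off depends on how much work the move and rejection actually trigger, which is the point raised in the reply above.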

@camelto2
Contributor Author

Is it possible to move this optimization to NLPP batched code path where NLPP ratios of dets (without Jastrow) are batched? I mean specialize

virtual void evaluateDetRatios(const VirtualParticleSet& VP,

evaluateDetRatios is used to give you psi(q)/psi(r). But I need it in a slightly different form in order to remove the spin integration and get what I want. Will send you some notes to take a look

rcclay
rcclay previously approved these changes Feb 20, 2024
Contributor

@rcclay rcclay left a comment

LGTM. This is a very straightforward generalization of what was already done in NonLocalECPotential for the fast forces. There are probably opportunities for optimization (better use of batching, multiwalker interfaces, etc.), but those should probably be left for a future, more targeted PR.

@prckent
Contributor

prckent commented Feb 20, 2024

Test this please

@prckent
Contributor

prckent commented Feb 20, 2024

FYI, the more modern compilers are complaining e.g.

src/QMCHamiltonians/tests/test_SOECPotential.cpp:258:54: error: non-const lvalue reference to type 'Real' (aka 'float') cannot bind to a value of unrelated type 'Return_t' (aka 'double')
  testing::TestSOECPotential::evalFast(so_ecp, elec, value)***

pAttrib.add(pbc, "pbc", {"yes", "no"});
pAttrib.add(forces, "forces", {"no", "yes"});
pAttrib.add(physicalSO, "physicalSO", {"yes", "no"});
pAttrib.add(fast_so, "fastSO", {"yes", "no"});
Contributor

With "fastSO", it is hard to understand what is fast. Could you find a more understandable name?

Contributor Author

Switched to spin_integrator. You can choose between "exact" and "simpson"; the latter uses the old Simpson's rule integration.

@camelto2
Contributor Author

FYI, the more modern compilers are complaining e.g.

src/QMCHamiltonians/tests/test_SOECPotential.cpp:258:54: error: non-const lvalue reference to type 'Real' (aka 'float') cannot bind to a value of unrelated type 'Return_t' (aka 'double')
  testing::TestSOECPotential::evalFast(so_ecp, elec, value)***

This should be fixed now. There was a type mismatch in my unit test.
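For readers unfamiliar with this class of error, here is a minimal sketch of how it arises and one possible fix. The aliases and `eval_fast` are hypothetical and only mimic a mixed-precision build in which `Real` is `float` and `Return_t` is `double`; the actual change made in the unit test is not shown in this thread.

```cpp
#include <cassert>

// Hypothetical aliases mirroring a mixed-precision build.
using Real     = float;
using Return_t = double;

// Hypothetical helper that writes its result through a Real reference.
void eval_fast(Real& value) { value = 1.5f; }

// Broken: a double variable cannot bind to a non-const float& parameter.
//   Return_t value = 0.0;
//   eval_fast(value); // error: non-const lvalue reference to type 'Real' ...

// One possible fix: declare the local with the parameter's type, then widen.
Return_t eval_fast_as_return_t()
{
  Real value = 0.0f;
  eval_fast(value);
  // Widening float -> double is exact, so the round trip recovers the original value.
  const Return_t widened = static_cast<Return_t>(value);
  assert(static_cast<Real>(widened) == value);
  return widened;
}
```

In a full-precision build both aliases would be the same type and the commented-out call would compile, which may be why only some CI configurations reported the error.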

@camelto2
Contributor Author

Not sure I understand why linux (Clang14-NoMPI-UBSan-Real) is failing. Looks like test_parser, which is something I didn't touch

rcclay
rcclay previously approved these changes Feb 21, 2024
@ye-luo
Contributor

ye-luo commented Feb 21, 2024

Test this please

Contributor

@ye-luo ye-luo left a comment

Temporarily blocking merge. Will review this PR in 2 days.

@rcclay
Contributor

rcclay commented Feb 27, 2024

What's the status of this?

@prckent
Contributor

prckent commented Feb 27, 2024

A huge improvement and good to get in. Need to identify what is needed for batched code execution in followup work.

@prckent
Contributor

prckent commented Mar 26, 2024

Just checking on the status of this. I assume the rework is being done with the limited time available. Any gotchas?

@camelto2
Contributor Author

Just no time. I reworked it using Ye's approach on my end, and got the unit test passing for the SOECPotential. Just need to finish adapting the unit test for the changes to the SpinorSet. Will hopefully get to it today

@camelto2
Contributor Author

camelto2 commented Apr 4, 2024

Sorry for taking so long, this should be up to date with Ye's recommendations

@camelto2 camelto2 requested a review from ye-luo April 4, 2024 16:37
@PDoakORNL
Contributor

PDoakORNL commented Apr 4, 2024

There is a great deal of per-walker data stashed in SpinorSet and the force wrapper. In fact, none of this really looks like it took the batched design into account. These should really have been crowd-scope memory resources, i.e. set up through resource sets (which would also allow us to track memory use much more easily). Evaluations should be over these whole blocks and not broken up by walker unless necessary. Certainly the end goal should be a single kernel launch to deal with the force calculations over a block of walkers.

The way this is set up right now, assuming a wrapper object per walker and a SpinorSet object per walker in which a bunch of workspace or additional state can be stashed, really isn't great. Extracting pointers from each walker and its inheritance pile is not a good design and was just a hack to avoid refactoring too much at once.

@prckent
Contributor

prckent commented Apr 5, 2024

Thanks Cody. I am hopeful that this gets merged after the CI passes. The real build needs fixing and there are sanitizer failures.

Peter: I think we should see where we end up with e.g. real world memory usage and performance for observables before tackling design issues here.

if (spin_integrator == "exact")
{
app_log() << " Using fast SOECP evaluation. Spin integration is exact" << std::endl;
apot->setFastEvaluation(true);
Contributor

Do you need both the fast and slow code paths active in a single object?
If not, can this option be passed via the constructor?

Contributor Author

Will fix this next; I want to make sure the tests pass first.

Contributor

I fixed the complex build. When you remove this setting, please make sure

void SOECPComponent::resize_warrays(int n, int m, int s)

is reactive to the fast algorithm selection and sets total_knots_ properly.

Contributor Author

This looks to be taken care of now from what I can tell. Should fix the failing test as well

ResourceHandle<SOECPotentialMultiWalkerResource> mw_res_handle_;

//flag to use fast evaluation
bool use_fast_evaluation_;
Contributor

If we set its value via constructor, please add const.

Contributor Author

see above
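The two review suggestions in this thread (select the algorithm at construction time and make the flag const) could look like the sketch below. `ToySOECPotential`, `parse_spin_integrator`, and `usesFastEvaluation` are hypothetical names, not the PR's classes or methods; only the option strings "exact" and "simpson" and the member name `use_fast_evaluation_` come from the discussion.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Map the user-facing option string to the fast-evaluation flag.
bool parse_spin_integrator(const std::string& spin_integrator)
{
  // An empty string would indicate a missing default upstream.
  assert(!spin_integrator.empty());
  if (spin_integrator == "exact")
    return true;
  if (spin_integrator == "simpson")
    return false;
  throw std::invalid_argument("spin_integrator must be \"exact\" or \"simpson\", got: " + spin_integrator);
}

// Hypothetical potential whose evaluation mode is fixed for its lifetime.
class ToySOECPotential
{
public:
  explicit ToySOECPotential(bool use_fast_evaluation) : use_fast_evaluation_(use_fast_evaluation) {}

  bool usesFastEvaluation() const { return use_fast_evaluation_; }

private:
  // const: set once in the constructor, so no setter can flip it later.
  const bool use_fast_evaluation_;
};
```

Fixing the flag at construction means any sizing that depends on the chosen algorithm can be decided once, rather than having to react to a setter being called later.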

@ye-luo
Contributor

ye-luo commented Apr 10, 2024

Test this please

@ye-luo
Contributor

ye-luo commented Apr 10, 2024

Test this please

@ye-luo
Contributor

ye-luo commented Apr 11, 2024

Test this please

@ye-luo ye-luo enabled auto-merge April 12, 2024 00:30
@prckent prckent disabled auto-merge April 12, 2024 02:23
@prckent prckent merged commit 1cd71b5 into QMCPACK:develop Apr 12, 2024
@camelto2 camelto2 deleted the fast_soc branch April 12, 2024 19:44
5 participants