Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial LCAO vgl batched implementation using GEMM #4407

Merged
merged 31 commits into from
Jan 26, 2023

Conversation

amandadumi
Copy link
Contributor

Proposed changes

This pull request introduces progress towards a batched implementation for the LCAO code. A CPU-only GEMM is introduced for the AO->MO transformation for phi, dphi and d2phi. The API is left unchanged and instead a conversion between old data types and the OffloadVGLArray is done to allow the SPOSet base class to maintain compatibility with the spline code. A separate PR could work towards unifying the API for LCAO and splines.

What type(s) of changes does this code introduce?

  • New feature
  • Testing changes (e.g. new unit/integration/performance tests)

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

He and EtOH which which are incorporated as unit tests.

Checklist

  • Yes. This PR is up to date with current the current state of 'develop'
  • Yes. Code added or changed in the PR has been clang-formatted
  • Yes. This PR adds tests to cover any new code, or to catch a bug that is being fixed
  • Yes. Documentation has been added (if appropriate)

@ye-luo
Copy link
Contributor

ye-luo commented Jan 22, 2023

Test this please

Copy link
Contributor

@ye-luo ye-luo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look at my changes and let me know if you have questions.

//elec.update();
elec.R[0][0] = 0.0001;
elec.R[0][1] = 0.0;
elec.R[0][2] = 0.0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

elec.R[0] = {0.0001, 0.0, 0.0}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have better idea how to do this throughout the code base.

// auto elec2 = elec.makeClone();
sposet->evaluateVGL(elec, 0, psiref_0, dpsiref_0, d2psiref_0);

REQUIRE(std::real(psiref_0[0]) == Approx(-0.001664403313));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change all the value checks from REQUIRE to CHECK

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have better idea how to do this throughout the code base.

@ye-luo ye-luo changed the title Lcao vgl offload Intial LCAO vgl batched implementation using GEMM Jan 22, 2023
@ye-luo
Copy link
Contributor

ye-luo commented Jan 22, 2023

Test this please

Copy link
Contributor

@ye-luo ye-luo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I changed a lot in the code. So need non-ANL approval.

@amandadumi
Copy link
Contributor Author

Hello Ye, thank you for the review and changes. After taking a look, your changes make sense in terms of removing some of the intermediate objects and more readable/compact initialization of other objects.

Copy link
Contributor

@PDoakORNL PDoakORNL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a design issue that OffloadMWVGLArrays need to be multiwalker resources owned around the DiracDeterminantBatched level but are getting defined in method scopes per call.
see: phi_v_g_l in LCAOrbitalSet::mw_evaluateVGL and basis_mw in LCAOrbitalSet::mw_evaluateVGLImplGEMM
There are a bunch of small things here that make this hard to read and are adding to tech debt.

Requiring manipulation of an XMLTree to write a test is awful but beyond the scope of this PR.

using vgl_type = VectorSoaContainer<T, OHMMS_DIM + 2>;
using vgh_type = VectorSoaContainer<T, 10>;
using vghgh_type = VectorSoaContainer<T, 20>;
using value_type = T;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type and _type are redundant here the naming convention is types are leading capital mixed case so:
Value implies a type already.
Other types here would be properly named Value Vgl, Vgh, Vghgh

Copy link
Contributor

@ye-luo ye-luo Jan 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there SoaBasisSetBase::value_type being used. Need to do renaming outside this PR.

std::vector<ValueType>& ratios,
std::vector<GradType>& grads) const
{
assert(this == &spo_list.getLeader());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs test coverage.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add once I have the shared resource set up.

src/QMCWaveFunctions/tests/test_MO.cpp Outdated Show resolved Hide resolved
src/QMCWaveFunctions/tests/test_MO.cpp Show resolved Hide resolved
src/QMCWaveFunctions/tests/test_MO.cpp Show resolved Hide resolved
src/QMCWaveFunctions/tests/test_MO.cpp Outdated Show resolved Hide resolved
src/QMCWaveFunctions/LCAO/LCAOrbitalSet.h Outdated Show resolved Hide resolved
@@ -30,10 +30,11 @@ namespace qmcplusplus
struct LCAOrbitalSet : public SPOSet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a class

src/QMCWaveFunctions/LCAO/LCAOrbitalSet.cpp Show resolved Hide resolved
@prckent
Copy link
Contributor

prckent commented Jan 26, 2023

What are the thoughts on getting useful performance either with or without updating the old data structures? If the old structures have the wrong layout, any (re)mapping is an undesired cost that will presumably have to be dealt with "sometime".

@prckent
Copy link
Contributor

prckent commented Jan 26, 2023

Happy to merge but definitely interested in the answer to my above question.

I would not be surprised if changing the data layouts in the old code improved its performance. If changes like this are needed, please keep an eye on the carbon molecule performance tests.

@ye-luo
Copy link
Contributor

ye-luo commented Jan 26, 2023

Test this please

@ye-luo
Copy link
Contributor

ye-luo commented Jan 26, 2023

What are the thoughts on getting useful performance either with or without updating the old data structures? If the old structures have the wrong layout, any (re)mapping is an undesired cost that will presumably have to be dealt with "sometime

We are still exploring right now. The new layout makes sense only when walkers are batched I assume. Before this PR, we only have one type of implementation (splines) in the new layout. Now we have two. There will be refactoring in the use side as well. Before it becomes clear what are the most suitable APIs, I'd like to hold massive update to the rest of the code base.

@prckent prckent merged commit d324656 into QMCPACK:develop Jan 26, 2023
@ye-luo ye-luo changed the title Intial LCAO vgl batched implementation using GEMM Initial LCAO vgl batched implementation using GEMM Sep 29, 2023
ye-luo added a commit to ye-luo/qmcpack that referenced this pull request Sep 29, 2023
Addresses QMCPACK#4666.
mw_evaluate_notranspose and evaluate_notranspose don't produce bitwise identical results.
The problem started by QMCPACK#4407 which introduce a batched evaluation and caused bitwise matrix value checking failure. I printed diff and found the value is at the epsilon of float or double. So no concern.

The intention of the assertion was to catch any missing D2H transfer earlier and there is no need of recomputing. So the solution is removing the re-computation.
@ye-luo ye-luo mentioned this pull request Sep 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants