Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize ratio evaluation in 1RDM estimator #1672

Merged
merged 8 commits into from Jun 28, 2019

Conversation

@jtkrogel
Copy link
Contributor

commented Jun 27, 2019

This PR addresses #975.

I did a series of performance profiling runs by rejigging the NiO performance tests to include the 1RDM. These tests showed that a ~4x performance penalty is found for production system sizes when using 1 non-local sample per atom. Higher sampling densities would result in proportional cost increases (2x more samples leads to 2x slowdown). Below is a plot showing the BlockCPU time in DMC vs. number of atoms/samples relative to a run of the same size containing no estimator.

baseline_reltime_vs_atoms_samples

By inserting timers into the 1RDM eval, I was able to see the fraction of time spent on each sub-component. Components include ratio eval, spo eval (labeled "basis" below), and matrix products. As a function of system size, the ratio eval clearly dominates (approaches 100% of the overall time):

baseline_1rdm_fractions

With the performance improvements in this PR (solely due to Ye's evaluateRatiosAlltoOne function), the added cost over and estimator-free run drops to about 30% regardless of system size:

optimized_reltime_vs_atoms_samples

The balance between different evaluation components is now much closer with matrix products and spo evals becoming competitive with ratio evals as the leading source of efficiency loss:

optimized_1rdm_fractions

In production runs, the spo eval and matrix products are likely to become even more significant since the performance tests are limited to an artificially small basis that comprises just the occupied states. In a production setting, spo evals are likely to grow by 2-3x and matrix products possibly by 4-10x. The performance analysis will be redone in a more production setting once support has been added for independent spin up/down bases (which will further strain the spo eval). In any event, the ratio contribution seems to now be handled.

@qmc-robot

This comment has been minimized.

Copy link
Collaborator

commented Jun 27, 2019

Can one of the maintainers verify this patch?

@jtkrogel

This comment has been minimized.

Copy link
Contributor Author

commented Jun 27, 2019

Should have mentioned: the new code passes the existing 1RDM VMC tests for diamond.

@ye-luo

This comment has been minimized.

Copy link
Contributor

commented Jun 27, 2019

Could you include your timers as well?
You can find examples in #1664.

{
PosType& Rp = Pq.R[p];
for (int m = 0; m < samples; ++m, ++nm)
Matrix_t& P_nm = *Psi_nm[s];

This comment has been minimized.

Copy link
@ye-luo

ye-luo Jun 27, 2019

Contributor

Not necessary to fix now but keep in mind of replacing pointers with containers when you are refactoring old codes.

@jtkrogel

This comment has been minimized.

Copy link
Contributor Author

commented Jun 28, 2019

Sure, I will add the timers.

@ye-luo

This comment has been minimized.

Copy link
Contributor

commented Jun 28, 2019

I wrote a wiki page for adding timers https://github.com/QMCPACK/qmcpack/wiki/How-to-add-timers
Please update accordingly.

@ye-luo

This comment has been minimized.

Copy link
Contributor

commented Jun 28, 2019

use timer_level_fine only. Timers can nest even within the same level. medium is reserved for sections inside drivers.

@prckent

This comment has been minimized.

Copy link
Contributor

commented Jun 28, 2019

@ye-luo Please can your move the timers info to the manual and remove from the wiki. The how to add tests section can also be moved/merged. Reason: Trying to have a single place for development info, and this is currently the manual.

@jtkrogel

This comment has been minimized.

Copy link
Contributor Author

commented Jun 28, 2019

Is the use of enum + initializer function strictly necessary, or just use of timer_level_fine? I'm missing the usefulness of the former here.

@ye-luo

This comment has been minimized.

Copy link
Contributor

commented Jun 28, 2019

@prckent
@markdewing and I are still evolving the wiki page and then will move to the manual.

@ye-luo

This comment has been minimized.

Copy link
Contributor

commented Jun 28, 2019

@jtkrogel There is no initializer function. It is a definition of static const class member data.
Having an array of timers is better than a set of individual timers. Having enum is better than directly using integer indices.

Please do both the enum and timer_level_fine change. Also use scoped timer to replace start and stop embracing the whole function body.

@jtkrogel

This comment has been minimized.

Copy link
Contributor Author

commented Jun 28, 2019

@ye-luo The setup_timers function is what I meant by initializer function. IMHO, this (enum, name map, timer list) just adds extra code and layers of naming (which reduces readability) over individual timers. From this point of view, I see no advantage, but if it is the required pattern then I will follow it.

Your documentation does not specify the type definition of myTimers. What is it?

@markdewing

This comment has been minimized.

Copy link
Contributor

commented Jun 28, 2019

For the timers, it feels slightly neater to have a single member variable storing all the timers, rather than a number of member variables. This does come at the cost of a little bit more code and an extra enum.

@ye-luo

This comment has been minimized.

Copy link
Contributor

commented Jun 28, 2019

@jtkrogel I updated the wiki. Defining TimerNames to the class is only necessary when multiple constructors exist. In your case, it can be defined local to the constructor.

It is cleaner to have enum and a vector of timers.

@jtkrogel

This comment has been minimized.

Copy link
Contributor Author

commented Jun 28, 2019

@ye-luo Done.

src/QMCHamiltonians/DensityMatrices1B.cpp Outdated Show resolved Hide resolved
src/QMCHamiltonians/DensityMatrices1B.cpp Outdated Show resolved Hide resolved
src/QMCHamiltonians/DensityMatrices1B.cpp Outdated Show resolved Hide resolved
src/QMCHamiltonians/DensityMatrices1B.cpp Outdated Show resolved Hide resolved
src/QMCHamiltonians/DensityMatrices1B.cpp Outdated Show resolved Hide resolved
src/QMCHamiltonians/DensityMatrices1B.h Outdated Show resolved Hide resolved
@ye-luo

This comment has been minimized.

Copy link
Contributor

commented Jun 28, 2019

Okay to test

@ye-luo

ye-luo approved these changes Jun 28, 2019

@prckent

This comment has been minimized.

Copy link
Contributor

commented Jun 28, 2019

Really good to see. Dramatic speedup.

@ye-luo ye-luo merged commit 1055f9a into QMCPACK:develop Jun 28, 2019

3 checks passed

rhea-cpu
Details
rhea-cuda-experimental
Details
rhea-gpu
Details

@jtkrogel jtkrogel added the ECP label Aug 14, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.