
Diffusion Monte Carlo trace anomaly in batched QMCPACK code #4648

Open
djstaros opened this issue Jun 26, 2023 · 26 comments


djstaros commented Jun 26, 2023

Describe the bug
A Diffusion Monte Carlo calculation using the batched (real CPU build) version of QMCPACK 3.16.0 on ALCF's Polaris resulted in nearly periodic dips, on the order of 5 Ha, in the DMC energy trace. This issue is not present in an otherwise identical calculation using the legacy (real CPU build) version of QMCPACK 3.16.9 on NERSC's Perlmutter.

To Reproduce

  1. If necessary, install QMCPACK 3.16.0 on Polaris from here
  2. Run QMCPACK 3.16.0 on ALCF's Polaris using the attached input files in bug_inputs/dmc_3160_batched_polaris and the orbitals specified in the NOTE below.
  3. Plot the resulting energy trace with the command: qmca -t -q e *.s001.scalar.dat (a scripted alternative is sketched after the NOTE below).

NOTE: The QP2QMCPACK.h5 file containing the orbitals is provided via a public Google Drive link here due to its size (98 MB).
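
If qmca is not convenient, the trace can also be plotted directly from the scalar.dat file. A minimal sketch (not from the original report): it assumes the usual QMCPACK layout of one "#"-prefixed header line of column names followed by whitespace-separated rows, and the file name is a placeholder.

```python
# Plot the DMC LocalEnergy trace from a QMCPACK scalar.dat file.
import numpy as np
import matplotlib.pyplot as plt

fname = "QP2QMCPACK.s001.scalar.dat"  # placeholder file name

# Column names are taken from the file's own header line.
with open(fname) as f:
    header = f.readline().lstrip("#").split()
data = np.loadtxt(fname, skiprows=1)

energy = data[:, header.index("LocalEnergy")]

plt.plot(energy, lw=0.8)
plt.xlabel("block")
plt.ylabel("LocalEnergy (Ha)")
plt.show()
```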

Description of the system
An 8-atom primitive cell of ferromagnetic monolayer CrI3 with more than 20 Å of vacuum in the z-direction, with 38 spin-up electrons and 32 spin-down electrons. The calculation used ppn boundary conditions and 4,4,1 periodic images. The orbitals are a single determinant of natural orbitals (NOs) obtained from three iterative CIPSI -> NO cycles and represented in a GTO basis set.

Expected behavior
When reproducing the bug, the energy trace should show the same problem as in the attached image bug_inputs/trace_dmc_3160_batched_polaris.png. Compare this to the normal legacy-code behavior on Perlmutter (bug_inputs/trace_dmc_3169_legacy_perlmutter.png).

Inputs and photos of issue
bug_inputs.zip


ye-luo commented Jun 26, 2023

Could you run the batched driver with the CPU build on Polaris and report how it compares to your Perlmutter CPU legacy-driver runs?


prckent commented Jun 26, 2023

Please can you add a description of the system here.

@djstaros
Author

@ye-luo Do you mean batched driver with CPU build on Perlmutter?

@djstaros
Author

@prckent My apologies for missing this. The issue has been updated with the system details.


ye-luo commented Jun 26, 2023

> @ye-luo Do you mean batched driver with CPU build on Perlmutter?

I thought you were using the offload build on Polaris, but actually no. So please use the batched driver with the CPU build you have on Perlmutter.


ye-luo commented Jun 27, 2023

@djstaros the energy drops are very suspicious. Could you test two runs with the batched driver in a CPU build using the latest QMCPACK develop branch on Perlmutter? In one run, add the line below to the DMC block; in the other, leave it out:
<parameter name="crowd_serialize_walkers"> yes </parameter>
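
For context, a minimal sketch of where this parameter sits inside a DMC input block; the other parameter values shown are placeholders, not the settings used in these runs:

```xml
<qmc method="dmc" move="pbyp">
  <!-- flag under discussion; the values below are placeholders only -->
  <parameter name="crowd_serialize_walkers"> yes </parameter>
  <parameter name="warmupsteps"> 100  </parameter>
  <parameter name="timestep">    0.01 </parameter>
  <parameter name="steps">       20   </parameter>
  <parameter name="blocks">      200  </parameter>
</qmc>
```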


djstaros commented Jun 27, 2023

@ye-luo Sure, I will submit a run adding that flag, with the most updated develop branch.

In the meantime, here is the result of using the batched driver with the real CPU build on Perlmutter...

[image: eTrace_s001]

...the energy drops are seen again. For reference, this run had about 8,400 total walkers, whereas the original Polaris run had about 21,000 and the legacy Perlmutter run had about 15,000.

prckent added the bug label Jun 27, 2023

prckent commented Jun 27, 2023

Thanks, Dan, for the system description and the updates. The CPU result is very revealing: clearly there is a real and generic problem that can't be blamed on offloading to GPUs here. Since you didn't mention it, can we assume that everything looked OK on the VMC side, i.e., similar energies and variances?

@jtkrogel
Contributor

Hi Dan, can you please also upload the scalar.dat and dmc.dat files for the two runs here? This will help diagnose the issue further.

@djstaros
Author

@prckent Yes, the VMC trace behaves normally and the energies/variances were as expected for both the legacy and batched real CPU codes.

@djstaros
Author

@jtkrogel Good idea. I'm uploading here a zip with the scalar.dat and dmc.dat files from a.) the batched Polaris run, b.) the legacy Perlmutter run, and c.) the batched Perlmutter run.

Outputs
bug_outputs.zip

@djstaros
Author

@ye-luo No final verdict yet, but your serialized-walkers suggestion on Perlmutter seems well behaved so far: over 120 blocks ("samples" on the trace plots) are already done, with a running error bar of 1 mHa. Without this flag, the batched Perlmutter energies were already dipping by that number of blocks, so this is encouraging. I can upload the final results in about 1.5 hours (I'll update this comment with the new information).

@jtkrogel
Contributor

The issue is reproduced in the batched code on Perlmutter. See below:

Batched Polaris: [image: dmc_polaris_batched]

Legacy Perlmutter: [image: dmc_perlmutter_legacy]

Batched Perlmutter: [image: dmc_perlmutter_batched]


jtkrogel commented Jun 27, 2023

It's most likely that the issue relates to the population control algorithm (which differs inexplicably between the legacy and batched codes).

It is also possible, but less likely, that the local energy differs between the two. @djstaros, to rule this out, would you run VMC on Perlmutter with the number of walkers and blocks matching the DMC above? From this we can check whether there is any differing behavior in the energy components.
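
One way to make that comparison from the scalar.dat files, as a minimal sketch (file names and the number of equilibration blocks are placeholders; column names come from each file's own header; this ignores autocorrelation, so qmca remains the right tool for quoted error bars):

```python
# Compare mean energy components between two QMCPACK scalar.dat files,
# e.g. a legacy run vs. a batched run.
import numpy as np

def column_means(fname, equil=20):
    # Read column names from the "#"-prefixed header line, then average
    # each column after dropping the first `equil` blocks.
    with open(fname) as f:
        header = f.readline().lstrip("#").split()
    data = np.loadtxt(fname, skiprows=1)[equil:]
    return {name: data[:, i].mean() for i, name in enumerate(header)}

legacy  = column_means("legacy.s000.scalar.dat")   # placeholder file names
batched = column_means("batched.s000.scalar.dat")

# Print side-by-side means for every column present in both files.
for name in sorted(set(legacy) & set(batched)):
    diff = legacy[name] - batched[name]
    print(f"{name:>24s}  {legacy[name]:14.6f}  {batched[name]:14.6f}  diff {diff:+.6f}")
```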

@djstaros
Author

@jtkrogel The VMC/DMC energies from the legacy Perlmutter run and from the batched Perlmutter run using @ye-luo's suggested crowd_serialize_walkers=yes fix in the DMC block are as follows:

Legacy Perlmutter, real CPU build:

| Series | LocalEnergy (Ha) | Variance | Variance/|E| ratio |
|---|---|---|---|
| QP2QMCPACK series 0 (VMC) | -242.017522 +/- 0.003111 | 3.367914 +/- 0.036474 | 0.0139 |
| QP2QMCPACK series 1 (DMC) | -242.305246 +/- 0.000605 | 3.496860 +/- 0.014910 | 0.0144 |

Batched Perlmutter, real CPU build, with serialized walkers turned on for DMC only:

| Series | LocalEnergy (Ha) | Variance | Variance/|E| ratio |
|---|---|---|---|
| QP2QMCPACK series 0 (VMC) | -242.022616 +/- 0.000735 | 3.394342 +/- 0.014140 | 0.0140 |
| QP2QMCPACK series 1 (DMC) | -242.306972 +/- 0.000530 | 3.486293 +/- 0.005918 | 0.0144 |

As both the VMC and DMC energies are within error bars of one another, it seems that the local energy does not differ between legacy and batched/serialized.

@jtkrogel
Contributor

Thanks @djstaros. Please also post the scalar.dat and dmc.dat files for the runs described above.

@djstaros
Author

@jtkrogel The legacy Perlmutter dmc.dat and scalar.dat are already posted above in Outputs/b_dmc_perl_legacy. I am attaching the "serialized" batched Perlmutter dmc.dat and scalar.dat files here.

Third output
d_dmc_perl_batch_ser.zip

@jtkrogel
Contributor

Running with crowd_serialize_walkers=yes appears to fix the problem. It also introduces no slowdown in the CPU run.

@ye-luo what is the impact of this flag internally? Is there any reason "yes" is not the default for a CPU run?


ye-luo commented Jun 27, 2023

crowd_serialize_walkers=yes forces the batched driver to call single-walker APIs internally. On CPU, the performance difference should be minimal. My feeling is that there is a bug in the multi-walker specialized implementation.

@jtkrogel
Contributor

Yes, agreed. The multi-walker implementation needs looking into. Beyond that, I propose we need a PR to set the default to "yes" for CPU runs.

If there is no performance gain, I think a gain in safety is clearly a win that is well worth it. People will run into this problem by default if we don't.


ye-luo commented Jun 27, 2023

> Yes, agreed. The multi-walker implementation needs looking into. Beyond that, I propose we need a PR to set the default to "yes" for CPU runs.
>
> If there is no performance gain, I think a gain in safety is clearly a win that is well worth it. People will run into this problem by default if we don't.

There are actual benefits to not serializing walkers, although they depend on the actual simulation. I don't feel good about forcing walker serialization by default.

@jtkrogel
Contributor

How about we make it the default, at least until the other code path is fixed?

I see no disadvantage to this at all, and there is currently real research time being wasted due to this issue.


prckent commented Jun 27, 2023

Happily, experience says that bugs like this don't take long to fix once they are diagnosed.

Something to think about is what aspect of these runs is not captured by our current testing, e.g., we do have serialized vs. non-serialized DMC tests that were designed to catch this very issue.


prckent commented Jul 3, 2023

The population growth around the time of the anomaly indicates a stuck walker, or at least that we are able to generate a slow-moving walker with an anomalously low energy. The runs nearly recover, but clearly should not be used.
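
A quick way to check this against the posted dmc.dat files is to overlay the walker population on the energy trace. A minimal sketch (the file name is a placeholder, and the population column names tried here are assumptions that may need adjusting to the actual header):

```python
# Overlay walker population and local energy from a QMCPACK dmc.dat file to
# see whether population growth coincides with the energy dips.
import numpy as np
import matplotlib.pyplot as plt

fname = "QP2QMCPACK.s001.dmc.dat"  # placeholder file name

# Column names are read from the "#"-prefixed header line of the file.
with open(fname) as f:
    header = f.readline().lstrip("#").split()
data = np.loadtxt(fname, skiprows=1)

energy = data[:, header.index("LocalEnergy")]
pop_name = "NumOfWalkers" if "NumOfWalkers" in header else "Weight"
population = data[:, header.index(pop_name)]

# Two y-axes: energy on the left, population on the right.
fig, ax1 = plt.subplots()
ax1.plot(energy, color="tab:blue", lw=0.6)
ax1.set_xlabel("step")
ax1.set_ylabel("LocalEnergy (Ha)", color="tab:blue")
ax2 = ax1.twinx()
ax2.plot(population, color="tab:red", lw=0.6)
ax2.set_ylabel(pop_name, color="tab:red")
plt.show()
```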

One reassuring result is the agreement of the legacy and batched but serialized codes in DMC.

If we believe that the crowd_serialize_walkers=yes and =no runs use exactly the same move-generation algorithm, then either we have bad numerics that can generate an unlikely configuration in the =no (multi-walker API) case, or there is an occasional bug in the energy calculations in the multi-walker case.

Any disagreement or alternatives to this logic?


prckent commented May 8, 2024

Commenting to remind ourselves that this was never fully solved. Keeping crowd_serialize_walkers enabled is not a long-term strategy (for performance reasons) and could just be hiding the problem, given the limited statistics here...


ye-luo commented May 8, 2024

It seems to me this was caused by the broken T-move fixed in #4902.
@djstaros will you be able to rerun the reproducer using /soft/applications/qmcpack/develop-20240425 on Polaris?
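
For anyone retesting: the T-move behavior referenced above is controlled by the nonlocalmoves parameter in the DMC input block. A minimal sketch of toggling it (option names taken from the QMCPACK manual; the value used in the original runs is not stated in this thread):

```xml
<qmc method="dmc" move="pbyp">
  <!-- controls the T-move treatment of nonlocal pseudopotentials;
       "no" disables T-moves, "v1"/"v3" are the size-consistent variants -->
  <parameter name="nonlocalmoves"> v3 </parameter>
  <!-- other DMC parameters unchanged -->
</qmc>
```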
