
Diffusion Monte Carlo trace anomaly in batched QMCPACK code #4648

Open
djstaros opened this issue Jun 26, 2023 · 26 comments


djstaros commented Jun 26, 2023

Describe the bug
A Diffusion Monte Carlo calculation using the batched (real CPU build) version of QMCPACK 3.16.0 on ALCF's Polaris resulted in nearly periodic dips, on the order of 5 Ha, in the DMC energy trace. This issue is not present in an otherwise identical calculation using the legacy (real CPU build) version of QMCPACK 3.16.9 on NERSC's Perlmutter.

To Reproduce

  1. If necessary, install QMCPACK 3.16.0 on Polaris from here
  2. Run QMCPACK 3.16.0 on ALCF's Polaris using the attached input files in bug_inputs/dmc_3160_batched_polaris and the orbitals specified in the NOTE below.
  3. Plot the resulting energy trace with the command: qmca -t -q e *.s001.scalar.dat (a scripted alternative is sketched after the NOTE below).

NOTE: The QP2QMCPACK.h5 file containing the orbitals is provided via a public Google Drive link here due to its size (98 MB).
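
If qmca is not convenient, the trace can also be plotted directly from the scalar.dat file. A minimal sketch (not from the original report): it assumes the usual QMCPACK layout of one "#"-prefixed header line of column names followed by whitespace-separated rows, and the file name is a placeholder.

```python
# Plot the DMC LocalEnergy trace from a QMCPACK scalar.dat file.
import numpy as np
import matplotlib.pyplot as plt

fname = "QP2QMCPACK.s001.scalar.dat"  # placeholder file name

# Column names are taken from the file's own header line.
with open(fname) as f:
    header = f.readline().lstrip("#").split()
data = np.loadtxt(fname, skiprows=1)

energy = data[:, header.index("LocalEnergy")]

plt.plot(energy, lw=0.8)
plt.xlabel("block")
plt.ylabel("LocalEnergy (Ha)")
plt.show()
```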

Description of the system
An 8-atom primitive cell of ferromagnetic monolayer CrI3 with more than 20 Å of vacuum in the z-direction, with 38 spin-up electrons and 32 spin-down electrons. The calculation used ppn boundary conditions and 4,4,1 periodic images. The orbitals are a single determinant of natural orbitals (NOs) obtained from three iterative CIPSI -> NO cycles and represented in a GTO basis set.

Expected behavior
When reproducing the bug, the energy trace should show the same problem as in the attached image bug_inputs/trace_dmc_3160_batched_polaris.png. Compare this to the normal legacy-code behavior on Perlmutter (bug_inputs/trace_dmc_3169_legacy_perlmutter.png).

Inputs and photos of issue
bug_inputs.zip


ye-luo commented Jun 26, 2023

Could you run the batched driver with the CPU build on Polaris and report how it compares to your Perlmutter CPU legacy-driver runs?


prckent commented Jun 26, 2023

Please can you add a description of the system here.

@djstaros
Author

@ye-luo Do you mean batched driver with CPU build on Perlmutter?

@djstaros
Author

@prckent My apologies for missing this. The issue has been updated with the system details.


ye-luo commented Jun 26, 2023

> @ye-luo Do you mean batched driver with CPU build on Perlmutter?

I thought you were using the offload build on Polaris, but actually no. So please use the batched driver with the CPU build you have on Perlmutter.


ye-luo commented Jun 27, 2023

@djstaros the energy drops are very suspicious. Could you test two runs with the batched driver in a CPU build using the latest QMCPACK develop branch on Perlmutter? In one run, add the line below to the DMC block; in the other, leave it out:
<parameter name="crowd_serialize_walkers"> yes </parameter>
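
For context, a minimal sketch of where this parameter sits inside a DMC input block; the other parameter values shown are placeholders, not the settings used in these runs:

```xml
<qmc method="dmc" move="pbyp">
  <!-- flag under discussion; the values below are placeholders only -->
  <parameter name="crowd_serialize_walkers"> yes </parameter>
  <parameter name="warmupsteps"> 100  </parameter>
  <parameter name="timestep">    0.01 </parameter>
  <parameter name="steps">       20   </parameter>
  <parameter name="blocks">      200  </parameter>
</qmc>
```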


djstaros commented Jun 27, 2023

@ye-luo Sure, I will submit a run adding that flag, with the most updated develop branch.

In the meantime, here is the result of using the batched driver with the real CPU build on Perlmutter...

[image: eTrace_s001]

...the energy drops are seen again. For reference, this run had about 8,400 total walkers, whereas the original Polaris run had about 21,000 and the legacy Perlmutter run had about 15,000.

prckent added the bug label Jun 27, 2023

prckent commented Jun 27, 2023

Thanks, Dan, for the system description and the updates. The CPU result is very revealing: clearly there is a real and generic problem that can't be blamed on offloading to GPUs here. Since you didn't mention it, can we assume that everything looked OK on the VMC side, i.e., similar energies and variances?

@jtkrogel
Contributor

Hi Dan, can you please also upload the scalar.dat and dmc.dat files for the two runs here? This will help diagnose the issue further.

@djstaros
Author

@prckent Yes, the VMC trace behaves normally and the energies/variances were as expected for both the legacy and batched real CPU codes.

@djstaros
Author

@jtkrogel Good idea. I'm uploading here a zip with the scalar.dat and dmc.dat files from a.) the batched Polaris run, b.) the legacy Perlmutter run, and c.) the batched Perlmutter run.

Outputs
bug_outputs.zip

@djstaros
Author

@ye-luo No final verdict yet, but your serialized-walkers suggestion on Perlmutter seems well behaved so far: over 120 blocks ("samples" on the trace plots) are already done, with a running error bar of 1 mHa. Without this flag, the batched Perlmutter energies were already dipping by that number of blocks, so this is encouraging. I can upload the final results in about 1.5 hours (I'll update this comment with the new information).

@jtkrogel
Contributor

The issue is reproduced in the batched code on Perlmutter. See below:

Batched Polaris: [image: dmc_polaris_batched]

Legacy Perlmutter: [image: dmc_perlmutter_legacy]

Batched Perlmutter: [image: dmc_perlmutter_batched]


jtkrogel commented Jun 27, 2023

It's most likely that the issue relates to the population control algorithm (which differs inexplicably between the legacy and batched codes).

It is also possible, but less likely, that the local energy differs between the two. @djstaros, to rule this out, would you run VMC on Perlmutter with the number of walkers and blocks matching the DMC above? From this we can check whether there is any differing behavior in the energy components.
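
One way to make that comparison from the scalar.dat files, as a minimal sketch (file names and the number of equilibration blocks are placeholders; column names come from each file's own header; this ignores autocorrelation, so qmca remains the right tool for quoted error bars):

```python
# Compare mean energy components between two QMCPACK scalar.dat files,
# e.g. a legacy run vs. a batched run.
import numpy as np

def column_means(fname, equil=20):
    # Read column names from the "#"-prefixed header line, then average
    # each column after dropping the first `equil` blocks.
    with open(fname) as f:
        header = f.readline().lstrip("#").split()
    data = np.loadtxt(fname, skiprows=1)[equil:]
    return {name: data[:, i].mean() for i, name in enumerate(header)}

legacy  = column_means("legacy.s000.scalar.dat")   # placeholder file names
batched = column_means("batched.s000.scalar.dat")

# Print side-by-side means for every column present in both files.
for name in sorted(set(legacy) & set(batched)):
    diff = legacy[name] - batched[name]
    print(f"{name:>24s}  {legacy[name]:14.6f}  {batched[name]:14.6f}  diff {diff:+.6f}")
```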

@djstaros
Author

@jtkrogel The VMC/DMC energies from the legacy Perlmutter run and from the batched Perlmutter run using @ye-luo's suggested crowd_serialize_walkers=yes fix in the DMC block are as follows:

Legacy Perlmutter, real CPU build:

| Series | LocalEnergy (Ha) | Variance | Variance/|E| ratio |
|---|---|---|---|
| QP2QMCPACK series 0 (VMC) | -242.017522 +/- 0.003111 | 3.367914 +/- 0.036474 | 0.0139 |
| QP2QMCPACK series 1 (DMC) | -242.305246 +/- 0.000605 | 3.496860 +/- 0.014910 | 0.0144 |

Batched Perlmutter, real CPU build, with serialized walkers turned on for DMC only:

| Series | LocalEnergy (Ha) | Variance | Variance/|E| ratio |
|---|---|---|---|
| QP2QMCPACK series 0 (VMC) | -242.022616 +/- 0.000735 | 3.394342 +/- 0.014140 | 0.0140 |
| QP2QMCPACK series 1 (DMC) | -242.306972 +/- 0.000530 | 3.486293 +/- 0.005918 | 0.0144 |

As both the VMC and DMC energies are within error bars of one another, it seems that the local energy does not differ between legacy and batched/serialized.

@jtkrogel
Contributor

Thanks @djstaros. Please also post the scalar.dat and dmc.dat files for the runs described above.

@djstaros
Author

@jtkrogel The legacy Perlmutter dmc.dat and scalar.dat are already posted above in Outputs/b_dmc_perl_legacy. I am attaching the "serialized" batched Perlmutter dmc.dat and scalar.dat files here.

Third output
d_dmc_perl_batch_ser.zip

@jtkrogel
Contributor

Running with crowd_serialize_walkers=yes appears to fix the problem. It also introduces no slowdown in the CPU run.

@ye-luo what is the impact of this flag internally? Is there any reason "yes" is not the default for a CPU run?


ye-luo commented Jun 27, 2023

crowd_serialize_walkers=yes forces the batched driver to call single-walker APIs internally. On CPU, the performance difference should be minimal. My feeling is that there is a bug in the multi-walker specialized implementation.

@jtkrogel
Contributor

Yes, agreed. The multi-walker implementation needs looking into. Beyond that, I propose we need a PR to set the default to "yes" for CPU runs.

If there is no performance gain, I think a gain in safety is clearly a win that is well worth it. People will run into this problem by default if we don't.


ye-luo commented Jun 27, 2023

> Yes, agreed. The multi-walker implementation needs looking into. Beyond that, I propose we need a PR to set the default to "yes" for CPU runs.
>
> If there is no performance gain, I think a gain in safety is clearly a win that is well worth it. People will run into this problem by default if we don't.

There are actual benefits to not serializing walkers, although they depend on the actual simulation. I don't feel good about forcing walker serialization by default.

@jtkrogel
Contributor

How about we make it the default, at least until the other code path is fixed?

I see no disadvantage to this at all, and there is currently real research time being wasted due to this issue.


prckent commented Jun 27, 2023

Happily, experience says that bugs like this don't take long to fix once they are diagnosed.

Something to think about is what aspect of these runs is not captured by our current testing, e.g., we do have serialized vs. non-serialized DMC tests that were designed to catch this very issue.


prckent commented Jul 3, 2023

The population growth around the time of the anomaly indicates a stuck walker, or at least that we are able to generate a slow-moving walker with an anomalously low energy. The runs nearly recover, but clearly should not be used.
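
A quick way to check this against the posted dmc.dat files is to overlay the walker population on the energy trace. A minimal sketch (the file name is a placeholder, and the population column names tried here are assumptions that may need adjusting to the actual header):

```python
# Overlay walker population and local energy from a QMCPACK dmc.dat file to
# see whether population growth coincides with the energy dips.
import numpy as np
import matplotlib.pyplot as plt

fname = "QP2QMCPACK.s001.dmc.dat"  # placeholder file name

# Column names are read from the "#"-prefixed header line of the file.
with open(fname) as f:
    header = f.readline().lstrip("#").split()
data = np.loadtxt(fname, skiprows=1)

energy = data[:, header.index("LocalEnergy")]
pop_name = "NumOfWalkers" if "NumOfWalkers" in header else "Weight"
population = data[:, header.index(pop_name)]

# Two y-axes: energy on the left, population on the right.
fig, ax1 = plt.subplots()
ax1.plot(energy, color="tab:blue", lw=0.6)
ax1.set_xlabel("step")
ax1.set_ylabel("LocalEnergy (Ha)", color="tab:blue")
ax2 = ax1.twinx()
ax2.plot(population, color="tab:red", lw=0.6)
ax2.set_ylabel(pop_name, color="tab:red")
plt.show()
```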

One reassuring result is the agreement of the legacy and batched but serialized codes in DMC.

If we believe that the crowd_serialize_walkers=yes and =no runs use exactly the same move-generation algorithm, then either we have bad numerics that can generate an unlikely configuration in the =no (multi-walker API) case, or there is an occasional bug in the energy calculations in the multi-walker case.

Any disagreement or alternatives to this logic?


prckent commented May 8, 2024

Commenting to remind ourselves that this was never fully solved. Keeping crowd_serialize_walkers enabled is not a long-term strategy (for performance reasons) and could just be hiding the problem, given the limited statistics here...


ye-luo commented May 8, 2024

It seems to me this was caused by the broken T-move fixed in #4902.
@djstaros will you be able to rerun the reproducer using /soft/applications/qmcpack/develop-20240425 on Polaris?
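
For anyone retesting: the T-move behavior referenced above is controlled by the nonlocalmoves parameter in the DMC input block. A minimal sketch of toggling it (option names taken from the QMCPACK manual; the value used in the original runs is not stated in this thread):

```xml
<qmc method="dmc" move="pbyp">
  <!-- controls the T-move treatment of nonlocal pseudopotentials;
       "no" disables T-moves, "v1"/"v3" are the size-consistent variants -->
  <parameter name="nonlocalmoves"> v3 </parameter>
  <!-- other DMC parameters unchanged -->
</qmc>
```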
