Final DMC energy varies with equilibration strategy #2330

Closed · jtkrogel opened this issue Mar 5, 2020 · 81 comments

@jtkrogel jtkrogel commented Mar 5, 2020

The final DMC energy for a 20 atom cell of SrCoO3 differs when QMCPACK is run in two ways that should be equivalent:

  1. A single DMC run with a timestep of 0.002/Ha.
  2. DMC run first with a timestep of 0.02/Ha followed by 0.002/Ha.

In this case, a difference of 6.6(7) mHa/f.u. is observed in the final DMC energy using the CPU code on Theta (case 1) and CADES (cases 1 and 2), with Theta and CADES agreeing for case 1.

A plot of the 2-timestep data with differing numbers of equilibration blocks removed seems to show a very long, slow transient. For comparison, a full equilibration from VMC to the DMC 0.002/Ha equilibrium occurs within 100 blocks. Starting from the 0.02/Ha equilibrium, the new 0.002/Ha equilibrium is not reached even after 1000 blocks.

[figure: GS4_ediff — final DMC energy vs. number of equilibration blocks removed]

Given the rapid equilibration from the high-energy VMC state, it is unclear what mechanism could be responsible for the lack of equilibration from the lower-energy, and presumably closer, DMC 0.02/Ha state. More investigation of this effect is warranted.
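The equilibration scan described above (discard an increasing number of leading blocks and recompute the mean) can be sketched as follows. The data here are synthetic, the target value mimics the SCO energies quoted later in the thread, and the error is a naive standard error; a production analysis (e.g. with qmca) would also reblock for autocorrelation:

```python
import numpy as np

def equilibration_scan(block_energies, cuts):
    """Mean and naive standard error after discarding the first n blocks."""
    out = []
    for n in cuts:
        tail = block_energies[n:]
        mean = tail.mean()
        err = tail.std(ddof=1) / np.sqrt(len(tail))
        out.append((n, mean, err))
    return out

# Synthetic trace: a slow exponential transient decaying toward -449.35 Ha,
# mimicking the long tail seen when starting from the 0.02/Ha equilibrium.
rng = np.random.default_rng(0)
blocks = np.arange(2000)
energies = -449.35 + 0.05 * np.exp(-blocks / 500.0) + rng.normal(0.0, 0.01, blocks.size)

for n, mean, err in equilibration_scan(energies, [0, 100, 500, 1000]):
    print(f"discard {n:4d} blocks: {mean:.4f} +/- {err:.4f}")
```

If the scanned mean keeps drifting by several sigma as the cut increases, the run has not yet equilibrated.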

The code's inability to relax to the same minimum from different initial states casts doubt on the equilibria found by the standard single-shot calculations performed in production.

Issue identified by @mcbennet.

@jtkrogel jtkrogel added the bug label Mar 5, 2020

@prckent prckent commented Mar 5, 2020

Hoping to narrow this bug/feature: if you do a restart, does the problem persist?
T-Moves or not? If so, which flavor?

Also: to help planning debugging, how long does this example take to run?


@prckent prckent commented Mar 5, 2020

Attention @camelto2 . Until resolved, this issue would preclude a smarter equilibration strategy.


@camelto2 camelto2 commented Mar 5, 2020

> Attention @camelto2 . Until resolved, this issue would preclude a smarter equilibration strategy.

Agreed. I've been chatting with @mcbennet about this offline. @lshulen and @rcclay have pointed out a number of issues that could be the culprit similar to what you mention above.


@jtkrogel jtkrogel commented Mar 5, 2020

This is for T-moves version 0. Trying a restart is a good idea. Right now, we are performing reruns on CADES with slightly smaller errorbars for 20, 10, and 5 atom cells as initial runs gave ambiguous results for the smaller cells. This should tell us if the issue is reproducible and whether it appears for smaller systems as well.

The 20 atom runs take about a day on 8 Skylake nodes.


@jtkrogel jtkrogel commented Mar 5, 2020

@camelto2, @lshulen, @rcclay What are your suggestions for possible culprits for this issue?


@camelto2 camelto2 commented Mar 5, 2020

It is possibly worth mentioning that @aannabe has seen similar problems with an independent code (qwalk) using tmoves. The tmoves used there would be equivalent to tmoves v0 in qmcpack.


@rcclay rcclay commented Mar 5, 2020

The DMC propagator needs a trial energy E_trial and the variance of E_local for things like adjusting the population size and determining whether walkers are stepping into weird places. How these values are estimated and carried over between simulations could have measurable consequences, especially at these energy scales.
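Schematically (this is a textbook sketch, not QMCPACK's actual SimpleFixedNodeBranch implementation), the trial energy enters both the branching weight and the population-control feedback, so a biased or stale estimate has compounding effects:

```python
import math

def branching_weight(e_local_new, e_local_old, e_trial, tau):
    """Textbook DMC branching factor: exp(-tau * (mean(E_L) - E_T)).

    A systematically wrong E_T inflates or deflates all walker weights,
    which population control then has to fight against.
    """
    return math.exp(-tau * (0.5 * (e_local_new + e_local_old) - e_trial))

def update_trial_energy(e_best, n_walkers, n_target, tau, feedback=1.0):
    """Feedback update: nudge E_T to steer the population back to target."""
    return e_best - (feedback / tau) * math.log(n_walkers / n_target)
```

How e_best is accumulated and whether its history is carried over between QMC sections is exactly the kind of state passing in question here.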


@jtkrogel jtkrogel commented Mar 5, 2020

@camelto2, @aannabe Any data for methods other than T-moves v0 (i.e. is there any reason to believe this is local to T-moves)?


@jtkrogel jtkrogel commented Mar 5, 2020

@rcclay, but these choices should fade out rapidly and should be no worse than starting from VMC (at least in principle), right?


@rcclay rcclay commented Mar 5, 2020

I need to dig into SimpleFixedNodeBranch and see how precisely and how often we update this stuff.


@camelto2 camelto2 commented Mar 5, 2020

> @camelto2, @aannabe Any data for methods other than T-moves v0 (i.e. is there any reason to believe this is local to T-moves)?

Not that I am aware of. If nothing is obvious from the propagator as @rcclay mentioned above, it would be useful to check the dependence on tmoves/locality


@camelto2 camelto2 commented Mar 5, 2020

Another thing to investigate is whether or not you can recover the correct behavior using smaller jumps in the timestep. Rather than going from 0.02 to 0.002, try a few intermediate values to see if the problem is as drastic.


@jtkrogel jtkrogel commented Mar 5, 2020

A complicating factor is that at the entry point to each DMC sub-run, warmup is performed with stochastic reconfiguration. In this case (and in others), we have observed that the trial energy behaves erratically (rapid fluctuations on the scale of the VMC-DMC energy difference), before settling down immediately following the warmup.

If a large error was introduced to the trial energy at this point, and if the moving average of the trial energy kept this initial bias for a long time, it could affect the population dynamics for an extended period of time. I suppose that this would formally appear as a more pronounced (and transient) population control bias.
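As a toy illustration of this concern: if the running trial-energy estimate were a plain accumulating mean, a single large warmup-time outlier would decay away only as 1/N, consistent with a long-lived bias in the population dynamics (illustrative only; the actual averaging scheme lives in the branching code):

```python
def running_average(samples):
    """Plain accumulating mean; an early outlier's influence decays as 1/N."""
    avg, history = 0.0, []
    for i, s in enumerate(samples, start=1):
        avg += (s - avg) / i
        history.append(avg)
    return history

# One erratic warmup value (-440 Ha instead of ~-449.35 Ha) still biases
# the average by roughly 9 mHa after 1000 samples.
history = running_average([-440.0] + [-449.35] * 999)
print(history[-1])  # ~ -449.3407
```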


@jtkrogel jtkrogel commented Mar 5, 2020

@camelto2 I agree that this is a good thing to investigate. I had planned to move this direction once we have fully settled reproducibility. Restarts are another good thing to try early on.


@prckent prckent commented Mar 6, 2020

In view of the upcoming weekend it would be sensible to queue several independent series of runs that will tell us the most with the least amount of human effort. For example:

  • Repeat these tests with the locality approximation to include/exclude t-moves as a source of problems. This is a few line change to the inputs.
  • Repeat these tests using restarts and not multiple qmc sections. This is a bit more work to set up but will either exclude or implicate the passing of state information between QMC sections as a factor. Add an additional dmc run so that you have a restarted dmc following a normal dmc section and not one with equilibration. The two production sections should have the same results and statistical behavior.

Perhaps this has already been done, but I think it is worth fishing for significant problems in existing data. Unlikely to be successful but low effort:

  • Look at histograms of the existing 0.002 a.u. production data graphed above to see if by eye there are any outliers or changes in the distributions. They should be indistinguishable by eye independent of starting point.
  • Check the acceptance statistics of the two 0.002 a.u. production results graphed above. They should be indistinguishable.
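A quick way to make the histogram comparison quantitative rather than by-eye (the data below are synthetic stand-ins; in practice the inputs would be the per-block local energies of the two 0.002 a.u. runs):

```python
import numpy as np

def histogram_overlap(a, b, bins=50):
    """Overlap of two normalized histograms on a shared grid.

    Returns ~1.0 for identical binned distributions, ~0.0 for disjoint ones.
    """
    edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), bins + 1)
    ha, _ = np.histogram(a, bins=edges, density=True)
    hb, _ = np.histogram(b, bins=edges, density=True)
    return np.minimum(ha, hb).sum() * (edges[1] - edges[0])

rng = np.random.default_rng(1)
a = rng.normal(-449.35, 0.05, 5000)  # synthetic stand-in: run started from VMC
b = rng.normal(-449.35, 0.05, 5000)  # synthetic stand-in: run started from 0.02/Ha
print(f"overlap: {histogram_overlap(a, b):.2f}")  # near 1.0 if indistinguishable
```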


@jtkrogel jtkrogel commented Mar 6, 2020

Chandler is performing a series of runs. He's been able to reproduce the problem in the 10 atom cell, so we can now get conclusive results in 12 hours on 4 nodes.

Runs we have planned:

  • Restart runs w/ 0.002 a.u. starting from walker populations equilibrated to 0.02/Ha and "equilibrated" to 0.002/Ha following 0.02.
  • Runs with larger intermediate timestep: i.e. 0.04 -> 0.002 and 0.1 -> 0.002
  • A run with an identical intermediate timestep 0.002 -> 0.002
  • Running "adiabatically" 0.02->0.0199->0.0198...->0.002
  • Runs for LA and T-moves versions 1 and 3 (version 0 used so far)
  • Run QMCPACK v3.6.0 w/ T-moves v0 and 0.02->0.002 (the original case)
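For the "adiabatic" run, the ladder of timesteps can be generated with a small helper (illustrative only; each value would go into a successive dmc input section):

```python
def timestep_ladder(start, stop, step):
    """Descending list of timesteps from start to stop, inclusive."""
    n = round((start - stop) / step)
    return [round(start - i * step, 10) for i in range(n + 1)]

ladder = timestep_ladder(0.02, 0.002, 0.0001)
print(ladder[:3], "...", ladder[-1])  # [0.02, 0.0199, 0.0198] ... 0.002
```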

Additional observations:

  • The 0.002 acceptance ratio agrees to high precision across all runs
  • Differences in the energy sub-components reveal a pattern similar to "stuck walker" issues, i.e. a higher kinetic energy, a higher electron-electron energy, and a lower local electron-ion energy (more time spent in the core region). The non-change of acceptance ratio is, however, inconsistent with "stuck walker" behavior.


@Hyeondeok-Shin Hyeondeok-Shin commented Mar 6, 2020

@jtkrogel @prckent I'm writing up what I found today because it seems to be related to this issue.

These DMC results are for 2D GeSe, starting from 0.1, 0.05, and 0.01 a.u. time steps down to 0.005 a.u., with v3 T-moves.

I first thought the DMC total energy was converged well enough at 0.005 a.u., but when I restart with the stored walkers at the same final 0.005 a.u. time step, it gives a different equilibrium total energy.

[figure: Bug]

I think it is the same issue as here.


@mcbennet mcbennet commented Mar 6, 2020

Hyeondeok, interesting. The stored walkers that you use for the restart -- they were stored during the 0.005 a.u. timestep in the first run?


@Hyeondeok-Shin Hyeondeok-Shin commented Mar 6, 2020

@mcbennet Yes. They came from the first run, which used the same 0.005 a.u. time step as the restart run.


@prckent prckent commented Mar 6, 2020

@Hyeondeok-Shin Have you checked that VMC restarts correctly and consistently in this system? Clearly it would be very difficult for it not to, but we thought DMC was safe as well...


@Hyeondeok-Shin Hyeondeok-Shin commented Mar 6, 2020

@prckent I didn't check a VMC restart run, but I can quickly check it. I'll post here when the VMC check is done.


@Hyeondeok-Shin Hyeondeok-Shin commented Mar 7, 2020

I quickly did a restart test for VMC and it seems fine.

The VMC test was done with 400 total blocks versus 200 blocks + 200 restart blocks.

Total 400 blocks:

GeSe-S4-dmc series 1 -105.390069 +/- 0.001767 2.487108 +/- 0.081797 0.0236

First 200 + 200 restart blocks:

file series 2 -105.391064 +/- 0.001996 2.507822 +/- 0.072885 0.0238

They are equivalent, so restart works correctly for VMC (or a single QMC section).
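For reference, this equivalence claim can be made quantitative with the usual n-sigma test on two independent estimates (a sketch, using the two VMC energies above):

```python
def consistent(e1, err1, e2, err2, nsigma=3.0):
    """True if two independent estimates agree within nsigma combined errors."""
    return abs(e1 - e2) <= nsigma * (err1**2 + err2**2) ** 0.5

# The 400-block run vs. the 200 + 200 restart run:
print(consistent(-105.390069, 0.001767, -105.391064, 0.001996))  # True
```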


@prckent prckent commented Mar 9, 2020

@mcbennet @jtkrogel assuming that the weekend runs have not flagged a clear culprit, I have an additional request: compile the code without MPI and run on the smaller cell with OpenMP threads on one node only. Do not use the MPI code with one task for this test. The run should use exactly the same number of total walkers as the 4 node runs so that it can be directly compared. This will rule out any problems with either our MPI implementation or that on CADES.


@jtkrogel jtkrogel commented Mar 9, 2020

The weekend runs have not flagged a clear culprit, however we do have some new information:

  1. Running with a larger intermediate timestep magnifies the effect (much larger bias in the following small timestep run), making it easier to study the effect and/or look for it in other systems.
  2. The magnitude of the effect can vary greatly with different initial seeds (one run might show a large effect and an exact rerun with a different seed can show a much smaller effect).
  3. The effect is still present in various T-moves variants, but possibly to varying degrees (possibly conflated with 2 though). We are running now w/ a larger intermediate timestep to clarify.
  4. Restarts "work", i.e. the effect is fully preserved across restarts.

We can run without MPI. Chandler is currently trying to compile v3.6.0 as a historical reference. He will also soon see if the effect is visible in pseudopotential diamond; if so, we can more easily compare e.g. QMCPACK and QWalk in the near future (useful for isolating a code vs. algorithm issue).


@ye-luo ye-luo commented Mar 9, 2020

@jtkrogel could you provide the DMC input sections and the full printout of QMCPACK corresponding to your run with two time steps?


@mcbennet mcbennet commented Mar 9, 2020

@ye-luo attached here
equil_strategy.tar.gz


@ye-luo ye-luo commented Mar 9, 2020

@mcbennet Could you rerun the one_ts and two_ts with warmupSteps=1000 in every DMC section?


@mcbennet mcbennet commented Mar 9, 2020

Yes, I can have a look at that.


@prckent prckent commented Mar 9, 2020

Interesting results so far. Can you confirm that locality approximation shows the problem? I suggest focusing on it since all the different QMC codes have it implemented consistently.


@mcbennet mcbennet commented Mar 18, 2020

I see the following stochastic results for diamond with develop+modification mentioned above (DevMod).

[figure: Energies — diamond stochastic results]

Also, for SCO, the modification appears to have resolved the AoS discrepancy.

One TS Ref: -449.351323 +/- 0.001461

AoS LocalEnergy = -449.350831 +/- 0.001897
SoA LocalEnergy = -449.212763 +/- 0.001374

I am submitting a WIP PR, so that others might test my modification, scrutinize it, etc.


@jtkrogel jtkrogel commented Mar 18, 2020

Thanks Chandler. So with your fix, diamond now looks clean with AoS/SoA, but there remains an additional (and possibly unrelated) bug in SoA, as evidenced by your SCO results, correct?


@mcbennet mcbennet commented Mar 18, 2020

Yes, diamond looks clean to me. I am not yet sure, however, why my stochastic tests show failures for 1704 and develop in AoS while the deterministic tests pass.

Also, SCO looks clean as well for AoS but SoA is still off. I do not know the nature of the SoA error in SCO.


@jtkrogel jtkrogel commented Mar 18, 2020

Good to fix the manifesting issue, anyway. The remaining SoA issue was introduced prior to 3.8.0, right? I guess a return to the bisection search with SoA should be revealing.


@mcbennet mcbennet commented Mar 18, 2020

Follow up regarding deterministic tests.

@Hyeondeok-Shin , I modified your tests by increasing the number of VMC walkers to 20 and then removing the targetwalkers variables in the DMC sections, i.e.,

<qmc method="vmc" move="pbyp">
  <estimator name="LocalEnergy" hdf5="no"/>
  <parameter name="walkers">     20  </parameter>
  <parameter name="substeps">    1   </parameter>
  <parameter name="warmupSteps"> 1   </parameter>
  <parameter name="steps">       1   </parameter>
  <parameter name="blocks">      1   </parameter>
  <parameter name="timestep">    1.0 </parameter>
  <parameter name="usedrift">    no  </parameter>
</qmc>

<qmc method="dmc" move="pbyp" checkpoint="-1">
  <estimator name="LocalEnergy" hdf5="no"/>
  <parameter name="reconfiguration"> no  </parameter>
  <parameter name="warmupSteps">     3   </parameter>
  <parameter name="timestep">        0.3 </parameter>
  <parameter name="steps">           3   </parameter>
  <parameter name="blocks">          10  </parameter>
  <parameter name="nonlocalmoves">   yes </parameter>
</qmc>

<qmc method="dmc" move="pbyp" checkpoint="-1">
  <estimator name="LocalEnergy" hdf5="no"/>
  <parameter name="reconfiguration"> no    </parameter>
  <parameter name="warmupSteps">     3     </parameter>
  <parameter name="timestep">        0.002 </parameter>
  <parameter name="steps">           10    </parameter>
  <parameter name="blocks">          10    </parameter>
  <parameter name="nonlocalmoves">   yes   </parameter>
</qmc>

After doing so, I see that this deterministic test agrees with my stochastic tests:

Build    Layout  Real/Comp  Result
380      AoS     real       REF
380      AoS     comp       REF
380      SoA     real       REF
380      SoA     comp       REF
1704     AoS     real       FAIL
1704     AoS     comp       FAIL
1704     SoA     real       FAIL
1704     SoA     comp       FAIL
Dev      AoS     real       FAIL
Dev      AoS     comp       FAIL
Dev      SoA     real       FAIL
Dev      SoA     comp       FAIL
DevMod   AoS     real       PASS
DevMod   AoS     comp       PASS
DevMod   SoA     real       FAIL
DevMod   SoA     comp       FAIL


@jtkrogel jtkrogel commented Mar 18, 2020

This is an important update to the deterministic test. It's also consistent with the idea that only some trajectories (a subset of the phase space) are affected.


@mcbennet mcbennet commented Mar 18, 2020

More observations worth noting regarding the above deterministic test.

  1. For AoS, all versions agree for series 1 -- the intermediate timestep.
  2. For SoA, 1704 and 3.8.0 agree for series 1 while Dev and DevMod disagree with 3.8.0 for series 1.
  3. For SoA, Dev and DevMod agree with each other for series 1.


@prckent prckent commented Mar 18, 2020

Note that the VMC driver will round up the number of walkers to a multiple of the thread count. There is no way to change this because the existing VMC drivers require an equal number of walkers per thread; this makes sense from efficiency considerations, although a route to more flexibility would be ideal. Therefore, when specifying walkers, you need to be careful about the number of threads you use.


@mcbennet mcbennet commented Mar 18, 2020

For all of my tests of diamond 1x1x1, I have used 1 thread and 1 mpi task.


@prckent prckent commented Mar 24, 2020

Does the discussion above still reflect the current status? If anyone has an update, please post.


@mcbennet mcbennet commented Mar 25, 2020

Some discussion has continued within #2349. That WIP PR corresponds to the modification that fixes AoS (mentioned above in the present thread), though SoA remains broken.


@mcbennet mcbennet commented Mar 30, 2020

For comparison purposes, I am sharing a plot here analogous to the diamond 1x1x1 figure above, but for SCO. This again suggests an underlying issue with SoA, while AoS appears clean for version 3.8.0 and then shows a bias in #1704 and beyond. The suggested modification in #2349 again appears to remove the bias in AoS.

[figure: Energies_SCO]

UPDATE (4/8/2020): There was a mistake in the figure above. The SoA builds were all from the development branch. See corrected plot further down this thread.


@jtkrogel jtkrogel commented Mar 30, 2020

OK, so #2349 is a fix for the bug introduced in #1704 to AoS. SoA has a different bug that predates version 3.8.0.

Interesting that both bugs, introduced separately, affect the total energy in a similar way.

Would you mind running SoA with 3.6.0 and 3.7.0?


@mcbennet mcbennet commented Mar 30, 2020

Submitted earlier today. Will prepend to the plot when complete.


@mcbennet mcbennet commented Apr 2, 2020

Retracted comment.


@mcbennet mcbennet commented Apr 8, 2020

The following figure contains corrected SrCoO3 data. In my previous figure, all SoA builds corresponded to the development branch by mistake.

[figure: Energies_SCO — corrected SrCoO3 data]

From the statistical tests here, we see that a bug was introduced in #1704 and the bug remains in the current develop branch. #2369 fixes this bug. This test is consistent with the statistical test of diamond.


@mcbennet mcbennet commented Apr 9, 2020

Regarding the deterministic test, here I focus on the absolute errors in the first DMC series. I have taken Hyeondeok's test and varied two parameters independently: (1) the number of target walkers in the DMC and (2) the timestep in the initial DMC. The plots below show the series 1 errors in total energy of 1704 and develop (pulled 3/13) w.r.t. 3.8.0.

For all variations, it appears that 1704 is clean on this front (if 3.8.0 is assumed "correct"). For Develop, clearly there is a dependence on both timestep and number of target walkers chosen -- also, from the case where targetwalkers=2 it is clear that the absolute error is not necessarily monotonic with timestep.

Note: the targetwalkers=2 case is exactly Hyeondeok's test but with varied timestep.

Note: All runs below correspond to SoA builds

[figures: 2tw, 4tw, 8tw — series 1 errors for targetwalkers = 2, 4, 8]


@mcbennet mcbennet commented Apr 9, 2020

PR #2177 appears to be the first to show the discrepancy from 3.8.0 for the first DMC series.
PR #2173 went in just before and it looks clean.

[figure: 8tw_New]

@prckent prckent added this to the v3.9.2 milestone Apr 9, 2020
prckent added a commit that referenced this issue Apr 11, 2020
prckent added a commit that referenced this issue Apr 13, 2020

@prckent prckent commented Apr 13, 2020

After the additional fix of #2377, develop should now be cured of this important problem.

We can check the nightlies tomorrow afternoon for the results of the deterministic test, but I think we also need confirmation here that the problem is fixed in SrCoO3 and no other problems have crept in.


@ye-luo ye-luo commented Apr 13, 2020

@mcbennet could you verify that develop no longer shows divergent behaviour from v3.8.0? Thank you.


@mcbennet mcbennet commented Apr 13, 2020

While SrCoO3 runs, I ran my most recent deterministic tests with diamond. So far, both DMC series are consistent with 3.8.0.

[figures: 8tw_S1, 8tw_S2 — DMC series 1 and 2]


@mcbennet mcbennet commented Apr 15, 2020

The latest develop looks clean for this stochastic test.

[figure: Energies_SCO — latest develop]


@prckent prckent commented Apr 17, 2020

From the above, I think we can happily mark this as closed. Thanks all.

@prckent prckent closed this as completed Apr 17, 2020