Final DMC energy varies with equilibration strategy #2330

Closed · jtkrogel opened this issue Mar 5, 2020 · 81 comments

@jtkrogel jtkrogel commented Mar 5, 2020

The final DMC energy for a 20 atom cell of SrCoO3 differs when QMCPACK is run in two ways that should be equivalent:

  1. A single DMC run with a timestep of 0.002/Ha.
  2. DMC run first with a timestep of 0.02/Ha followed by 0.002/Ha.

In this case, a difference of 6.6(7) mHa/f.u. is observed in the final DMC energy using the CPU code on Theta (case 1) and CADES (cases 1 and 2), with Theta and CADES agreeing for case 1.

A plot of the 2-timestep data with differing numbers of equilibration blocks removed seems to show a very long, slow transient. For comparison, a full equilibration from VMC to the DMC 0.002/Ha equilibrium occurs within 100 blocks. Starting from the 0.02/Ha equilibrium, the new 0.002/Ha equilibrium is not reached even after 1000 blocks.

[figure: GS4_ediff — final DMC energy vs. number of equilibration blocks removed]

Given the rapid equilibration from the high-energy VMC state, it is unclear what mechanism could be responsible for the lack of equilibration from the lower-energy, and presumably closer, DMC 0.02/Ha state. More investigation of this effect is warranted.
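The equilibration scan described above (discard an increasing number of leading blocks and recompute the mean) can be sketched as follows. The data here are synthetic, the target value mimics the SCO energies quoted later in the thread, and the error is a naive standard error; a production analysis (e.g. with qmca) would also reblock for autocorrelation:

```python
import numpy as np

def equilibration_scan(block_energies, cuts):
    """Mean and naive standard error after discarding the first n blocks."""
    out = []
    for n in cuts:
        tail = block_energies[n:]
        mean = tail.mean()
        err = tail.std(ddof=1) / np.sqrt(len(tail))
        out.append((n, mean, err))
    return out

# Synthetic trace: a slow exponential transient decaying toward -449.35 Ha,
# mimicking the long tail seen when starting from the 0.02/Ha equilibrium.
rng = np.random.default_rng(0)
blocks = np.arange(2000)
energies = -449.35 + 0.05 * np.exp(-blocks / 500.0) + rng.normal(0.0, 0.01, blocks.size)

for n, mean, err in equilibration_scan(energies, [0, 100, 500, 1000]):
    print(f"discard {n:4d} blocks: {mean:.4f} +/- {err:.4f}")
```

If the scanned mean keeps drifting by several sigma as the cut increases, the run has not yet equilibrated.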

The code's inability to relax to the same minimum from different initial states casts doubt on the equilibria found by the standard single-shot calculations performed in production.

Issue identified by @mcbennet.

@jtkrogel jtkrogel added the bug label Mar 5, 2020

@prckent prckent commented Mar 5, 2020

Hoping to narrow this bug/feature: if you do a restart, does the problem persist?
T-Moves or not? If so, which flavor?

Also: to help planning debugging, how long does this example take to run?


@prckent prckent commented Mar 5, 2020

Attention @camelto2 . Until resolved, this issue would preclude a smarter equilibration strategy.


@camelto2 camelto2 commented Mar 5, 2020

> Attention @camelto2 . Until resolved, this issue would preclude a smarter equilibration strategy.

Agreed. I've been chatting with @mcbennet about this offline. @lshulen and @rcclay have pointed out a number of issues that could be the culprit similar to what you mention above.


@jtkrogel jtkrogel commented Mar 5, 2020

This is for T-moves version 0. Trying a restart is a good idea. Right now, we are performing reruns on CADES with slightly smaller errorbars for 20, 10, and 5 atom cells as initial runs gave ambiguous results for the smaller cells. This should tell us if the issue is reproducible and whether it appears for smaller systems as well.

The 20 atom runs take about a day on 8 Skylake nodes.


@jtkrogel jtkrogel commented Mar 5, 2020

@camelto2, @lshulen, @rcclay What are your suggestions for possible culprits for this issue?


@camelto2 camelto2 commented Mar 5, 2020

It is possibly worth mentioning that @aannabe has seen similar problems with an independent code (qwalk) using tmoves. The tmoves used there would be equivalent to tmoves v0 in qmcpack.


@rcclay rcclay commented Mar 5, 2020

The DMC propagator needs a trial energy E_trial and the variance of E_local for things like adjusting the population size and determining whether walkers are stepping into weird places. How these values are estimated and carried over between simulations could have measurable consequences, especially at these energy scales.
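Schematically (this is a textbook sketch, not QMCPACK's actual SimpleFixedNodeBranch implementation), the trial energy enters both the branching weight and the population-control feedback, so a biased or stale estimate has compounding effects:

```python
import math

def branching_weight(e_local_new, e_local_old, e_trial, tau):
    """Textbook DMC branching factor: exp(-tau * (mean(E_L) - E_T)).

    A systematically wrong E_T inflates or deflates all walker weights,
    which population control then has to fight against.
    """
    return math.exp(-tau * (0.5 * (e_local_new + e_local_old) - e_trial))

def update_trial_energy(e_best, n_walkers, n_target, tau, feedback=1.0):
    """Feedback update: nudge E_T to steer the population back to target."""
    return e_best - (feedback / tau) * math.log(n_walkers / n_target)
```

How e_best is accumulated and whether its history is carried over between QMC sections is exactly the kind of state passing in question here.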


@jtkrogel jtkrogel commented Mar 5, 2020

@camelto2, @aannabe Any data for methods other than T-moves v0 (i.e. is there any reason to believe this is local to T-moves)?


@jtkrogel jtkrogel commented Mar 5, 2020

@rcclay, but these choices should fade out rapidly and should be no worse than starting from VMC (at least in principle), right?


@rcclay rcclay commented Mar 5, 2020

I need to dig into SimpleFixedNodeBranch and see how precisely and how often we update this stuff.


@camelto2 camelto2 commented Mar 5, 2020

> @camelto2, @aannabe Any data for methods other than T-moves v0 (i.e. is there any reason to believe this is local to T-moves)?

Not that I am aware of. If nothing is obvious from the propagator as @rcclay mentioned above, it would be useful to check the dependence on tmoves/locality


@camelto2 camelto2 commented Mar 5, 2020

Another thing to investigate is whether or not you can recover the correct behavior using smaller jumps in the timestep. Rather than going from 0.02 to 0.002, try a few intermediate values to see if the problem is as drastic.


@jtkrogel jtkrogel commented Mar 5, 2020

A complicating factor is that at the entry point to each DMC sub-run, warmup is performed with stochastic reconfiguration. In this case (and in others), we have observed that the trial energy behaves erratically (rapid fluctuations on the scale of the VMC-DMC energy difference), before settling down immediately following the warmup.

If a large error was introduced to the trial energy at this point, and if the moving average of the trial energy kept this initial bias for a long time, it could affect the population dynamics for an extended period of time. I suppose that this would formally appear as a more pronounced (and transient) population control bias.
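As a toy illustration of this concern: if the running trial-energy estimate were a plain accumulating mean, a single large warmup-time outlier would decay away only as 1/N, consistent with a long-lived bias in the population dynamics (illustrative only; the actual averaging scheme lives in the branching code):

```python
def running_average(samples):
    """Plain accumulating mean; an early outlier's influence decays as 1/N."""
    avg, history = 0.0, []
    for i, s in enumerate(samples, start=1):
        avg += (s - avg) / i
        history.append(avg)
    return history

# One erratic warmup value (-440 Ha instead of ~-449.35 Ha) still biases
# the average by roughly 9 mHa after 1000 samples.
history = running_average([-440.0] + [-449.35] * 999)
print(history[-1])  # ~ -449.3407
```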


@jtkrogel jtkrogel commented Mar 5, 2020

@camelto2 I agree that this is a good thing to investigate. I had planned to move this direction once we have fully settled reproducibility. Restarts are another good thing to try early on.


@prckent prckent commented Mar 6, 2020

In view of the upcoming weekend it would be sensible to queue several independent series of runs that will tell us the most with the least amount of human effort. For example:

  • Repeat these tests with the locality approximation to include/exclude t-moves as a source of problems. This is a few line change to the inputs.
  • Repeat these tests using restarts and not multiple qmc sections. This is a bit more work to set up but will either exclude or implicate the passing of state information between QMC sections as a factor. Add an additional dmc run so that you have a restarted dmc following a normal dmc section and not one with equilibration. The two production sections should have the same results and statistical behavior.

Perhaps this has already been done, but I think it is worth fishing for significant problems in existing data. Unlikely to be successful but low effort:

  • Look at histograms of the existing 0.002 a.u. production data graphed above to see if by eye there are any outliers or changes in the distributions. They should be indistinguishable by eye independent of starting point.
  • Check the acceptance statistics of the two 0.002 a.u. production results graphed above. They should be indistinguishable.
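A quick way to make the histogram comparison quantitative rather than by-eye (the data below are synthetic stand-ins; in practice the inputs would be the per-block local energies of the two 0.002 a.u. runs):

```python
import numpy as np

def histogram_overlap(a, b, bins=50):
    """Overlap of two normalized histograms on a shared grid.

    Returns ~1.0 for identical binned distributions, ~0.0 for disjoint ones.
    """
    edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), bins + 1)
    ha, _ = np.histogram(a, bins=edges, density=True)
    hb, _ = np.histogram(b, bins=edges, density=True)
    return np.minimum(ha, hb).sum() * (edges[1] - edges[0])

rng = np.random.default_rng(1)
a = rng.normal(-449.35, 0.05, 5000)  # synthetic stand-in: run started from VMC
b = rng.normal(-449.35, 0.05, 5000)  # synthetic stand-in: run started from 0.02/Ha
print(f"overlap: {histogram_overlap(a, b):.2f}")  # near 1.0 if indistinguishable
```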


@jtkrogel jtkrogel commented Mar 6, 2020

Chandler is performing a series of runs. He's been able to reproduce the problem in the 10 atom cell, so we can now get conclusive results in 12 hours on 4 nodes.

Runs we have planned:

  • Restart runs w/ 0.002 a.u. starting from walker populations equilibrated to 0.02/Ha and "equilibrated" to 0.002/Ha following 0.02.
  • Runs with larger intermediate timestep: i.e. 0.04 -> 0.002 and 0.1 -> 0.002
  • A run with an identical intermediate timestep 0.002 -> 0.002
  • Running "adiabatically" 0.02->0.0199->0.0198...->0.002
  • Runs for LA and T-moves versions 1 and 3 (version 0 used so far)
  • Run QMCPACK v3.6.0 w/ T-moves v0 and 0.02->0.002 (the original case)
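For the "adiabatic" run, the ladder of timesteps can be generated with a small helper (illustrative only; each value would go into a successive dmc input section):

```python
def timestep_ladder(start, stop, step):
    """Descending list of timesteps from start to stop, inclusive."""
    n = round((start - stop) / step)
    return [round(start - i * step, 10) for i in range(n + 1)]

ladder = timestep_ladder(0.02, 0.002, 0.0001)
print(ladder[:3], "...", ladder[-1])  # [0.02, 0.0199, 0.0198] ... 0.002
```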

Additional observations:

  • The 0.002 acceptance ratio agrees to high precision across all runs
  • Differences in the energy sub-components reveal a pattern similar to "stuck walker" issues, i.e. a higher kinetic energy, a higher electron-electron energy, and a lower local electron-ion energy (more time spent in the core region). The non-change of acceptance ratio is, however, inconsistent with "stuck walker" behavior.


@Hyeondeok-Shin Hyeondeok-Shin commented Mar 6, 2020

@jtkrogel @prckent I'm writing up what I found today because it seems to be related to this issue.

These DMC results are for 2D GeSe, starting from 0.1, 0.05, and 0.01 a.u. time steps down to 0.005 a.u., with v3 T-moves.

I first thought the DMC total energy was converged well enough at 0.005 a.u., but when I restart with the stored walkers at the same final 0.005 a.u. time step, it gives a different equilibrium total energy.

[figure: Bug]

I think it is the same issue as here.


@mcbennet mcbennet commented Mar 6, 2020

Hyeondeok, interesting. The stored walkers that you use for the restart -- they were stored during the 0.005 a.u. timestep in the first run?


@Hyeondeok-Shin Hyeondeok-Shin commented Mar 6, 2020

@mcbennet Yes. They came from the first run, which used the same 0.005 a.u. time step as the restart run.


@prckent prckent commented Mar 6, 2020

@Hyeondeok-Shin Have you checked that VMC restarts correctly and consistently in this system? Clearly it would be very difficult for it not to, but we thought DMC was safe as well...


@Hyeondeok-Shin Hyeondeok-Shin commented Mar 6, 2020

@prckent I didn't check a VMC restart run, but I can quickly check it. I'll post here when the VMC check is done.


@Hyeondeok-Shin Hyeondeok-Shin commented Mar 7, 2020

I quickly did a restart test for VMC and it seems fine.

The VMC test was done with 400 total blocks versus 200 blocks + 200 restart blocks.

Total 400 blocks:

GeSe-S4-dmc series 1 -105.390069 +/- 0.001767 2.487108 +/- 0.081797 0.0236

First 200 + 200 restart blocks:

file series 2 -105.391064 +/- 0.001996 2.507822 +/- 0.072885 0.0238

They are equivalent, so restart works correctly for VMC (or a single QMC section).
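For reference, this equivalence claim can be made quantitative with the usual n-sigma test on two independent estimates (a sketch, using the two VMC energies above):

```python
def consistent(e1, err1, e2, err2, nsigma=3.0):
    """True if two independent estimates agree within nsigma combined errors."""
    return abs(e1 - e2) <= nsigma * (err1**2 + err2**2) ** 0.5

# The 400-block run vs. the 200 + 200 restart run:
print(consistent(-105.390069, 0.001767, -105.391064, 0.001996))  # True
```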


@prckent prckent commented Mar 9, 2020

@mcbennet @jtkrogel assuming that the weekend runs have not flagged a clear culprit, I have an additional request: compile the code without MPI and run on the smaller cell with OpenMP threads on one node only. Do not use the MPI code with one task for this test. The run should use exactly the same number of total walkers as the 4 node runs so that it can be directly compared. This will rule out any problems with either our MPI implementation or that on CADES.


@jtkrogel jtkrogel commented Mar 9, 2020

The weekend runs have not flagged a clear culprit, however we do have some new information:

  1. Running with a larger intermediate timestep magnifies the effect (much larger bias in the following small timestep run), making it easier to study the effect and/or look for it in other systems.
  2. The magnitude of the effect can vary greatly with different initial seeds (one run might show a large effect and an exact rerun with a different seed can show a much smaller effect).
  3. The effect is still present in various T-moves variants, but possibly to varying degrees (possibly conflated with 2 though). We are running now w/ a larger intermediate timestep to clarify.
  4. Restarts "work", i.e. the effect is fully preserved across restarts.

We can run without MPI. Chandler is currently trying to compile v3.6.0 as a historical reference. He will also soon see if the effect is visible in pseudopotential diamond; if so, we can more easily compare e.g. QMCPACK and QWalk in the near future (useful for isolating a code vs. algorithm issue).


@ye-luo ye-luo commented Mar 9, 2020

@jtkrogel could you provide the DMC input sections and the full printout of QMCPACK corresponding to your run with two time steps?


@mcbennet mcbennet commented Mar 9, 2020

@ye-luo attached here
equil_strategy.tar.gz


@ye-luo ye-luo commented Mar 9, 2020

@mcbennet Could you rerun the one_ts and two_ts with warmupSteps=1000 in every DMC section?


@mcbennet mcbennet commented Mar 9, 2020

Yes, I can have a look at that.


@prckent prckent commented Mar 9, 2020

Interesting results so far. Can you confirm that locality approximation shows the problem? I suggest focusing on it since all the different QMC codes have it implemented consistently.


@mcbennet mcbennet commented Mar 18, 2020

I see the following stochastic results for diamond with develop+modification mentioned above (DevMod).

[figure: Energies — diamond stochastic results]

Also, for SCO, the modification appears to have resolved the AoS discrepancy.

One TS Ref: -449.351323 +/- 0.001461

AoS LocalEnergy = -449.350831 +/- 0.001897
SoA LocalEnergy = -449.212763 +/- 0.001374

I am submitting a WIP PR, so that others might test my modification, scrutinize it, etc.


@jtkrogel jtkrogel commented Mar 18, 2020

Thanks Chandler. So with your fix, diamond now looks clean with AoS/SoA, but there remains an additional (and possibly unrelated) bug in SoA, as evidenced by your SCO results, correct?


@mcbennet mcbennet commented Mar 18, 2020

Yes, diamond looks clean to me. I am not yet sure, however, why my stochastic tests show failures for 1704 and develop in AoS while the deterministic tests pass.

Also, SCO looks clean as well for AoS but SoA is still off. I do not know the nature of the SoA error in SCO.


@jtkrogel jtkrogel commented Mar 18, 2020

Good to fix the manifesting issue, anyway. The remaining SoA issue was introduced prior to 3.8.0, right? I guess a return to the bisection search with SoA should be revealing.


@mcbennet mcbennet commented Mar 18, 2020

Follow up regarding deterministic tests.

@Hyeondeok-Shin , I modified your tests by increasing the number of VMC walkers to 20 and then removing the targetwalkers variables in the DMC sections, i.e.,

<qmc method="vmc" move="pbyp">
  <estimator name="LocalEnergy" hdf5="no"/>
  <parameter name="walkers">     20  </parameter>
  <parameter name="substeps">    1   </parameter>
  <parameter name="warmupSteps"> 1   </parameter>
  <parameter name="steps">       1   </parameter>
  <parameter name="blocks">      1   </parameter>
  <parameter name="timestep">    1.0 </parameter>
  <parameter name="usedrift">    no  </parameter>
</qmc>

<qmc method="dmc" move="pbyp" checkpoint="-1">
  <estimator name="LocalEnergy" hdf5="no"/>
  <parameter name="reconfiguration"> no  </parameter>
  <parameter name="warmupSteps">     3   </parameter>
  <parameter name="timestep">        0.3 </parameter>
  <parameter name="steps">           3   </parameter>
  <parameter name="blocks">          10  </parameter>
  <parameter name="nonlocalmoves">   yes </parameter>
</qmc>

<qmc method="dmc" move="pbyp" checkpoint="-1">
  <estimator name="LocalEnergy" hdf5="no"/>
  <parameter name="reconfiguration"> no    </parameter>
  <parameter name="warmupSteps">     3     </parameter>
  <parameter name="timestep">        0.002 </parameter>
  <parameter name="steps">           10    </parameter>
  <parameter name="blocks">          10    </parameter>
  <parameter name="nonlocalmoves">   yes   </parameter>
</qmc>

After doing so, I see that this deterministic test agrees with my stochastic tests:

Build    Layout  Real/Comp  Result
380      AoS     real       REF
380      AoS     comp       REF
380      SoA     real       REF
380      SoA     comp       REF
1704     AoS     real       FAIL
1704     AoS     comp       FAIL
1704     SoA     real       FAIL
1704     SoA     comp       FAIL
Dev      AoS     real       FAIL
Dev      AoS     comp       FAIL
Dev      SoA     real       FAIL
Dev      SoA     comp       FAIL
DevMod   AoS     real       PASS
DevMod   AoS     comp       PASS
DevMod   SoA     real       FAIL
DevMod   SoA     comp       FAIL


@jtkrogel jtkrogel commented Mar 18, 2020

This is an important update to the deterministic test. It's also consistent with the idea that only some trajectories (a subset of the phase space) are affected.


@mcbennet mcbennet commented Mar 18, 2020

More observations worth noting regarding the above deterministic test.

  1. For AoS, all versions agree for series 1 -- the intermediate timestep.
  2. For SoA, 1704 and 3.8.0 agree for series 1 while Dev and DevMod disagree with 3.8.0 for series 1.
  3. For SoA, Dev and DevMod agree with each other for series 1.


@prckent prckent commented Mar 18, 2020

Note that the VMC driver will round up the number of walkers to a multiple of the thread count. There is no way to change this because the existing VMC drivers require an equal number of walkers per thread; this makes sense from efficiency considerations, although a route to more flexibility would be ideal. Therefore, when specifying walkers, you need to be careful about the number of threads you use.


@mcbennet mcbennet commented Mar 18, 2020

For all of my tests of diamond 1x1x1, I have used 1 thread and 1 mpi task.


@prckent prckent commented Mar 24, 2020

Does the discussion above still reflect the current status? If anyone has an update, please post.


@mcbennet mcbennet commented Mar 25, 2020

Some discussion has continued within #2349. That WIP PR corresponds to the modification that fixes AoS (mentioned above in the present thread), though SoA remains broken.


@mcbennet mcbennet commented Mar 30, 2020

For comparison purposes, I am sharing a plot here analogous to the diamond 1x1x1 figure above, but for SCO. This again suggests an underlying issue with SoA, while AoS appears clean for version 3.8.0 and then shows a bias in #1704 and beyond. The suggested modification in #2349 again appears to remove the bias in AoS.

[figure: Energies_SCO]

UPDATE (4/8/2020): There was a mistake in the figure above. The SoA builds were all from the development branch. See corrected plot further down this thread.


@jtkrogel jtkrogel commented Mar 30, 2020

OK, so #2349 is a fix for the bug introduced in #1704 to AoS. SoA has a different bug that predates version 3.8.0.

Interesting that both bugs, introduced separately, affect the total energy in a similar way.

Would you mind running SoA with 3.6.0 and 3.7.0?


@mcbennet mcbennet commented Mar 30, 2020

Submitted earlier today. Will prepend to the plot when complete.


@mcbennet mcbennet commented Apr 2, 2020

Retracted comment.


@mcbennet mcbennet commented Apr 8, 2020

The following figure contains corrected SrCoO3 data. In my previous figure, all SoA builds corresponded to the development branch by mistake.

[figure: Energies_SCO — corrected SrCoO3 data]

From the statistical tests here, we see that a bug was introduced in #1704 and the bug remains in the current develop branch. #2369 fixes this bug. This test is consistent with the statistical test of diamond.


@mcbennet mcbennet commented Apr 9, 2020

Regarding the deterministic test, here I focus on the absolute errors in the first DMC series. I have taken Hyeondeok's test and varied two parameters independently: (1) the number of target walkers in the DMC and (2) the timestep in the initial DMC. The plots below show the series 1 errors in total energy of 1704 and develop (pulled 3/13) w.r.t. 3.8.0.

For all variations, it appears that 1704 is clean on this front (if 3.8.0 is assumed "correct"). For Develop, clearly there is a dependence on both timestep and number of target walkers chosen -- also, from the case where targetwalkers=2 it is clear that the absolute error is not necessarily monotonic with timestep.

Note: the targetwalkers=2 case is exactly Hyeondeok's test but with varied timestep.

Note: All runs below correspond to SoA builds

[figures: 2tw, 4tw, 8tw — series 1 errors for targetwalkers = 2, 4, 8]


@mcbennet mcbennet commented Apr 9, 2020

PR #2177 appears to be the first to show the discrepancy from 3.8.0 for the first DMC series.
PR #2173 went in just before and it looks clean.

[figure: 8tw_New]

@prckent prckent added this to the v3.9.2 milestone Apr 9, 2020
prckent added a commit that referenced this issue Apr 11, 2020
prckent added a commit that referenced this issue Apr 13, 2020

@prckent prckent commented Apr 13, 2020

After the additional fix of #2377, develop should now be cured of this important problem.

We can check the nightlies tomorrow afternoon for the results of the deterministic test, but I think we also need confirmation here that the problem is fixed in SrCoO3 and no other problems have crept in.


@ye-luo ye-luo commented Apr 13, 2020

@mcbennet could you verify that develop no longer shows divergent behaviour from v3.8.0? Thank you.


@mcbennet mcbennet commented Apr 13, 2020

While SrCoO3 runs, I ran my most recent deterministic tests with diamond. So far, both DMC series are consistent with 3.8.0.

[figures: 8tw_S1, 8tw_S2 — DMC series 1 and 2]


@mcbennet mcbennet commented Apr 15, 2020

The latest develop looks clean for this stochastic test.

[figure: Energies_SCO — latest develop]


@prckent prckent commented Apr 17, 2020

From the above, I think we can happily mark this as closed. Thanks all.

@prckent prckent closed this as completed Apr 17, 2020