Block use of DMC reconfiguration since it is incorrect #2254

jtkrogel · 2020-01-28T18:43:23Z

DMC with stochastic reconfiguration is currently broken, and has been for a long time (see #18). Production runs should not be performed with this option until the bug is fixed.

This PR places sentinel code to prevent production use of this broken option. The aborts should be removed only after the bug is fixed.

As an aside, this option being read in 4 different places highlights poor design in this region of the code.

qmc-robot · 2020-01-28T18:44:01Z

Can one of the admins verify this patch?

prckent · 2020-01-28T18:52:26Z

Some regular uses that ignore the incorrectness:

grep -r -n -i "Reconfiguration.*yes" tests/*
tests/performance/C-molecule/sample/dmc-C12-e48-pp/C12-dmc.xml:99:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-molecule/sample/dmc-C18-e72-pp/C18-dmc.xml:105:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-molecule/sample/dmc-C24-e144-ae/C24-dmc.xml:109:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-molecule/sample/dmc-C30-e180-ae/C30-dmc.xml:115:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-molecule/sample/dmc-C12-e72-ae/C12-dmc.xml:97:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-molecule/sample/dmc-C24-e96-pp/C24-dmc.xml:111:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-molecule/sample/dmc-C30-e120-pp/C30-dmc.xml:117:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-molecule/sample/dmc-C60-e240-pp/C60-dmc.xml:147:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-molecule/sample/dmc-C18-e108-ae/C18-dmc.xml:103:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-graphite/sample/dmc-a64-e256-gpu/C-graphite-S256-dmc.xml:69:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-graphite/sample/dmc-a64-e256-cpu/C-graphite-S256-dmc.xml:68:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a512-e6144-cpu/NiO-fcc-S128-dmc.xml:688:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a64-e768-cpu-J3/NiO-fcc-S16-dmc.xml:197:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a512-e6144-gpu/NiO-fcc-S128-dmc.xml:687:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a4-e48-cpu-J3/NiO-fcc-S1-dmc.xml:130:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a8-e96-cpu/NiO-fcc-S2-dmc.xml:121:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a96-e1152-gpu/NiO-fcc-S24-dmc.xml:219:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a192-e2304-cpu/NiO-fcc-S48-dmc.xml:328:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a192-e2304-cpu-J3/NiO-fcc-S48-dmc.xml:341:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a16-e192-cpu-J3/NiO-fcc-S4-dmc.xml:143:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a192-e2304-gpu/NiO-fcc-S48-dmc.xml:327:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a256-e3072-cpu-J3/NiO-fcc-S64-dmc.xml:413:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a96-e1152-cpu-J3/NiO-fcc-S24-dmc.xml:233:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a96-e1152-cpu/NiO-fcc-S24-dmc.xml:220:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a8-e96-gpu/NiO-fcc-S2-dmc.xml:120:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a64-e768-cpu/NiO-fcc-S16-dmc.xml:184:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a4-e48-cpu/NiO-fcc-S1-dmc.xml:117:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a512-e6144-cpu-J3/NiO-fcc-S128-dmc.xml:701:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a128-e1536-cpu-J3/NiO-fcc-S32-dmc.xml:269:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a4-e48-gpu/NiO-fcc-S1-dmc.xml:116:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a64-e768-gpu/NiO-fcc-S16-dmc.xml:183:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a8-e96-cpu-J3/NiO-fcc-S2-dmc.xml:134:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a128-e1536-gpu/NiO-fcc-S32-dmc.xml:255:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a1024-e12288-cpu-J3/NiO-fcc-S256-dmc.xml:1277:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a128-e1536-cpu/NiO-fcc-S32-dmc.xml:256:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a1024-e12288-gpu/NiO-fcc-S256-dmc.xml:1263:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a256-e3072-gpu/NiO-fcc-S64-dmc.xml:399:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a16-e192-cpu/NiO-fcc-S4-dmc.xml:130:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a16-e192-gpu/NiO-fcc-S4-dmc.xml:129:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a1024-e12288-cpu/NiO-fcc-S256-dmc.xml:1264:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a256-e3072-cpu/NiO-fcc-S64-dmc.xml:400:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a32-e384-gpu/NiO-fcc-S8-dmc.xml:147:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a32-e384-cpu-J3/NiO-fcc-S8-dmc.xml:161:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a32-e384-cpu/NiO-fcc-S8-dmc.xml:148:    <parameter name="reconfiguration">      yes </parameter>
tests/solids/diamondC_2x1x1_pp/qmc_long_vmc_dmc_reconf.in.xml:99:     <parameter name="reconfiguration">   yes </parameter>
tests/solids/diamondC_2x1x1_pp/qmc_short_vmc_dmc_reconf.in.xml:99:     <parameter name="reconfiguration">   yes </parameter>

prckent · 2020-01-28T18:57:04Z

I'll note that the relevant areas of the code have unfortunately proven very difficult for multiple people to work with now, reconfiguration or not. "Legacy" status confirmed: complex, opaque, and also wrong. We should develop a proper cleanup plan.

jtkrogel · 2020-01-28T19:18:22Z

We've recently had postdocs at ORNL use this option for production runs with the anticipation that all was well (and understandably because the manual presents reconfiguration as a valid option for use). These runs are now being redone, but time could have been saved had the option been appropriately blocked.

I agree that a way should be left open for legitimate performance testing. Perhaps we do so only via a developer-visible backdoor key? Something like reconfiguration="I really mean yes".

prckent · 2020-01-28T19:29:03Z

Unfortunate that this has wasted time.

Good suggested temporary workaround. I was thinking that we should merge this and deal with the fallout in any case. We can avoid performance test fallout via a temporary "reconfiguration=runwhileincorrect" option. Do you have bandwidth to implement this + update the performance test input xml (only)?
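The gate proposed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `reconfigurationRequested` is hypothetical, and it assumes the option value has already been defaulted to "no" when the user leaves it out.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper, not QMCPACK code. Assumes `value` has already been
// defaulted to "no" when the option is absent from the input.
// Returns true only for the temporary developer key; any other request,
// including "yes", is rejected so production inputs cannot use the option.
bool reconfigurationRequested(const std::string& value)
{
  if (value == "no")
    return false;
  if (value == "runwhileincorrect")
    return true; // performance tests only; energies are known to be incorrect
  throw std::runtime_error("Reconfiguration is currently broken and gives incorrect results. "
                           "Set reconfiguration=\"no\", or use \"runwhileincorrect\" "
                           "for performance tests.");
}
```

With a single gate like this, the performance test inputs would only need "yes" replaced by "runwhileincorrect", while any production input that still says "yes" stops with an explicit message.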

jtkrogel · 2020-01-28T19:54:16Z

Yes.

jtkrogel · 2020-01-28T20:33:56Z

Still checking the code. Wait a bit.

ye-luo

Need to fix compilation.

src/QMCDrivers/DMC/DMC.cpp

src/QMCDrivers/DMC/WalkerControlFactory.cpp

src/QMCDrivers/SimpleFixedNodeBranch.cpp

jtkrogel · 2020-01-29T14:44:59Z

Need a little more time for checks of the behavior of the code.

ye-luo · 2020-01-29T14:51:33Z

Okay to test

ye-luo

My comments are all addressed. So approve. Please wait for Jaron's final check before merging.

jtkrogel · 2020-01-29T20:53:24Z

Checks complete. This is ready to go in now.

prckent · 2020-01-29T21:06:10Z

Do the deterministic tests pass for you? The CI is complaining in the app and driver tests. Possibly the test code needs an update to account for this change.

jtkrogel · 2020-01-29T21:38:25Z

The unit tests for qmcapp do not all pass. Looking into why.

jtkrogel · 2020-01-29T22:10:45Z

It's the batched DMC driver:

tests/bin/test_qmcapp -s "QMCDriverFactory create DMCBatched driver"

...
  QMCHamiltonian::add2WalkerProperty added
    2 to P::PropertyList 
    0 to P::Collectables 
    starting Index of the observables in P::PropertyList = 9
Fatal Error. Aborting at Reconfiguration is currently broken and gives incorrect results. Set reconfiguration="no" or remove the reconfiguration option from the DMC input section. To run performance tests, please set reconfiguration to "runwhileincorrect" instead of "yes" to restore consistent behaviour.
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

All the other driver unit tests pass.

@ye-luo @PDoakORNL Any ideas what the issue might be?

PDoakORNL · 2020-01-30T00:11:18Z

src/QMCDrivers/DMC/DMCDriverInput.cpp

@@ -43,6 +43,9 @@ void DMCDriverInput::readXML(xmlNodePtr node)
    throw std::runtime_error("Illegal input for MaxAge in DMC input section");
  if(branch_interval_ < 0)
    throw std::runtime_error("Illegal input for branchInterval or substeps in DMC input section");
+
+  if(reconfig_str != "no" && reconfig_str != "runwhileincorrect")
+    APP_ABORT("Reconfiguration is currently broken and gives incorrect results. Set reconfiguration=\"no\" or remove the reconfiguration option from the DMC input section. To run performance tests, please set reconfiguration to \"runwhileincorrect\" instead of \"yes\" to restore consistent behaviour.")
 }


Leaving aside that touching reconfig_str after it has been read breaks the new input model (options are parsed once, and typed data is stored as that type rather than as a string converted in situ all over the code), the check has a second problem.

An empty reconfig_str is a valid value, and the check above aborts on it.
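One way to address both points is to convert the string once into a typed value and treat the empty string as "option not given". The sketch below is an assumption-laden illustration: `ReconfigMode` and `parseReconfig` are hypothetical names, not part of DMCDriverInput, and it throws `std::runtime_error` only to match the neighbouring checks in the diff.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical typed option; the names are illustrative, not from QMCPACK.
enum class ReconfigMode
{
  Off,
  RunWhileIncorrect
};

// Convert the raw input string once, at parse time.
ReconfigMode parseReconfig(const std::string& reconfig_str)
{
  // An empty string means the option was not present in the input, which is valid.
  if (reconfig_str.empty() || reconfig_str == "no")
    return ReconfigMode::Off;
  if (reconfig_str == "runwhileincorrect")
    return ReconfigMode::RunWhileIncorrect;
  // Everything else, including "yes", is rejected.
  throw std::runtime_error("Reconfiguration is currently broken and gives incorrect results.");
}
```

Downstream code would then branch on the `ReconfigMode` value instead of comparing strings again.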

PDoakORNL · 2020-01-30T00:20:38Z

The file that needs updating:
https://github.com/QMCPACK/qmcpack/blob/develop/src/QMCDrivers/tests/ValidQMCInputSections.h

This is where I have captured XML nodes for the unit tests. I prefer that unit tests take only minimal input and, if at all possible, do not have to read external files.

See my comment for why the change in DMCDriverInput.cpp causes failure.
I'm not going to block this PR further. @jtkrogel has gotten us patched up, but I think this workaround needs some more discussion at ECP.

My opinion in advance:
I would like to just come back soon and remove the option and the shards of refactored garbage from DMCDriverInput and the DMCBatched driver. I'd rather not spend time debugging the pieces of it that I ported into the batched driver; stripping it out will simplify the code. Then we can add it back with a clean design if it really is useful.

As for the legacy driver, it seems obvious to me that the option should be removed from the performance tests, since the impact of a broken algorithm on runtime means comparing to past performance tests is fairly pointless. Of course, if there isn't any impact, then there is no reason not to update the tests to a valid input rather than leaving an example of invalid input lying around. I also think developer time should not be wasted fixing this feature in the deprecated DMC driver.

prckent · 2020-01-30T02:03:06Z

Re: performance tests. Background: they only look at the workload, i.e. propagating a constant population for a few steps. The algorithm would have to be significantly more buggy than it is now to influence the results significantly (e.g. by putting all the electrons very close to ions). A bigger source of variance is using random starting positions for the electrons rather than an equilibrated ensemble. To do better here would require more and larger reference files.

ye-luo · 2020-01-30T04:38:01Z

We agreed to improve the input section, but it never gets prioritized. Even if we can improve it just a bit to help our users and ourselves, we should do it, or at least learn something. Even if we had time to make a drastic change to fix all the problems we have thought about, new issues would still occur afterwards.

In principle, we should implement a correct version under a new option and remove this option completely. If we just remove it now, we end up running performance tests with population control, and it will be impossible to benchmark on a single node. So far this broken implementation only affects the final energy, not the performance characteristics.

In addition, I think this feature should be implemented in a driver agnostic way.

ye-luo · 2020-01-30T04:38:58Z

@jtkrogel I fixed the unit test.

jtkrogel · 2020-01-30T12:58:57Z

Thanks @ye-luo. I have no further additions to make.

Clearly fixing the input system is a longer discussion. We should go back to the main discussion regarding the overhaul of the input processing system in #407 (#2007 and #2024 are related).

I've added a new "input" label to all of these and others to make the related issues discoverable.

block the user from producing buggy reconfiguration results

75f9f38

prckent changed the title ~~Block user reconfiguration~~ Block use of DMC reconfiguration since it is incorrect Jan 28, 2020

allow reconfig via runwhileincorrect input

3930e11

ye-luo reviewed Jan 29, 2020

jtkrogel added 4 commits January 29, 2020 09:12

fix typo

a0b85c6

fix double paren

7fb355a

update abort message

53833d5

update abort message

fca5331

ye-luo approved these changes Jan 29, 2020

PDoakORNL reviewed Jan 30, 2020

ye-luo added 2 commits January 29, 2020 22:18

Fix unit test

198f6bf

Merge remote-tracking branch 'github/develop' into block_user_reconfiguration

82d1cf0

ye-luo merged commit fbea026 into QMCPACK:develop Jan 30, 2020

jtkrogel deleted the block_user_reconfiguration branch March 22, 2021 13:28
