Block use of DMC reconfiguration since it is incorrect #2254

Merged
merged 8 commits into from
Jan 30, 2020

Conversation

jtkrogel
Contributor

DMC with stochastic reconfiguration is currently broken, and has been for a long time (see #18). Production runs should not be performed with this option until the bug is fixed.

This PR places sentinel code to prevent production use of this broken option. The aborts should be removed only after the bug is fixed.

As an aside, this option being read in 4 different places highlights poor design in this region of the code.

@qmc-robot

Can one of the admins verify this patch?

@prckent prckent changed the title from "Block user reconfiguration" to "Block use of DMC reconfiguration since it is incorrect" Jan 28, 2020
@prckent
Contributor

prckent commented Jan 28, 2020

Some regular uses that ignore the incorrectness:

grep -r -n -i "Reconfiguration.*yes" tests/*
tests/performance/C-molecule/sample/dmc-C12-e48-pp/C12-dmc.xml:99:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-molecule/sample/dmc-C18-e72-pp/C18-dmc.xml:105:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-molecule/sample/dmc-C24-e144-ae/C24-dmc.xml:109:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-molecule/sample/dmc-C30-e180-ae/C30-dmc.xml:115:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-molecule/sample/dmc-C12-e72-ae/C12-dmc.xml:97:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-molecule/sample/dmc-C24-e96-pp/C24-dmc.xml:111:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-molecule/sample/dmc-C30-e120-pp/C30-dmc.xml:117:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-molecule/sample/dmc-C60-e240-pp/C60-dmc.xml:147:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-molecule/sample/dmc-C18-e108-ae/C18-dmc.xml:103:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-graphite/sample/dmc-a64-e256-gpu/C-graphite-S256-dmc.xml:69:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/C-graphite/sample/dmc-a64-e256-cpu/C-graphite-S256-dmc.xml:68:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a512-e6144-cpu/NiO-fcc-S128-dmc.xml:688:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a64-e768-cpu-J3/NiO-fcc-S16-dmc.xml:197:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a512-e6144-gpu/NiO-fcc-S128-dmc.xml:687:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a4-e48-cpu-J3/NiO-fcc-S1-dmc.xml:130:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a8-e96-cpu/NiO-fcc-S2-dmc.xml:121:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a96-e1152-gpu/NiO-fcc-S24-dmc.xml:219:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a192-e2304-cpu/NiO-fcc-S48-dmc.xml:328:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a192-e2304-cpu-J3/NiO-fcc-S48-dmc.xml:341:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a16-e192-cpu-J3/NiO-fcc-S4-dmc.xml:143:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a192-e2304-gpu/NiO-fcc-S48-dmc.xml:327:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a256-e3072-cpu-J3/NiO-fcc-S64-dmc.xml:413:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a96-e1152-cpu-J3/NiO-fcc-S24-dmc.xml:233:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a96-e1152-cpu/NiO-fcc-S24-dmc.xml:220:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a8-e96-gpu/NiO-fcc-S2-dmc.xml:120:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a64-e768-cpu/NiO-fcc-S16-dmc.xml:184:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a4-e48-cpu/NiO-fcc-S1-dmc.xml:117:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a512-e6144-cpu-J3/NiO-fcc-S128-dmc.xml:701:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a128-e1536-cpu-J3/NiO-fcc-S32-dmc.xml:269:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a4-e48-gpu/NiO-fcc-S1-dmc.xml:116:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a64-e768-gpu/NiO-fcc-S16-dmc.xml:183:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a8-e96-cpu-J3/NiO-fcc-S2-dmc.xml:134:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a128-e1536-gpu/NiO-fcc-S32-dmc.xml:255:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a1024-e12288-cpu-J3/NiO-fcc-S256-dmc.xml:1277:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a128-e1536-cpu/NiO-fcc-S32-dmc.xml:256:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a1024-e12288-gpu/NiO-fcc-S256-dmc.xml:1263:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a256-e3072-gpu/NiO-fcc-S64-dmc.xml:399:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a16-e192-cpu/NiO-fcc-S4-dmc.xml:130:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a16-e192-gpu/NiO-fcc-S4-dmc.xml:129:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a1024-e12288-cpu/NiO-fcc-S256-dmc.xml:1264:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a256-e3072-cpu/NiO-fcc-S64-dmc.xml:400:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a32-e384-gpu/NiO-fcc-S8-dmc.xml:147:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a32-e384-cpu-J3/NiO-fcc-S8-dmc.xml:161:    <parameter name="reconfiguration">      yes </parameter>
tests/performance/NiO/sample/dmc-a32-e384-cpu/NiO-fcc-S8-dmc.xml:148:    <parameter name="reconfiguration">      yes </parameter>
tests/solids/diamondC_2x1x1_pp/qmc_long_vmc_dmc_reconf.in.xml:99:     <parameter name="reconfiguration">   yes </parameter>
tests/solids/diamondC_2x1x1_pp/qmc_short_vmc_dmc_reconf.in.xml:99:     <parameter name="reconfiguration">   yes </parameter>

@prckent
Contributor

prckent commented Jan 28, 2020

I'll note that the relevant areas of the code have unfortunately proven very difficult for multiple people to work with now, reconfiguration or not. "Legacy" status confirmed: complex, opaque, and also wrong. We should develop a proper cleanup plan.

@jtkrogel
Contributor Author

We've recently had postdocs at ORNL use this option for production runs with the anticipation that all was well (and understandably because the manual presents reconfiguration as a valid option for use). These runs are now being redone, but time could have been saved had the option been appropriately blocked.

I agree that a way should be left open for legitimate performance testing. Perhaps we do so only via a developer-visible backdoor key? Something like reconfiguration="I really mean yes".

@prckent
Contributor

prckent commented Jan 28, 2020

Unfortunate that this has wasted time.

Good suggested temporary workaround. I was thinking that we should merge this and deal with the fallout in any case. We can avoid performance test fallout via a temporary "reconfiguration=runwhileincorrect" option. Do you have bandwidth to implement this + update the performance test input xml (only)?
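
The temporary workaround discussed here can be sketched as one guard function that every reader of the option calls, so the rule lives in a single place. This is only an illustration: the function name `reconfigurationRequested` is hypothetical and not part of QMCPACK, it throws `std::runtime_error` instead of aborting, and it assumes the caller has already defaulted an absent option to "no".

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not a QMCPACK function): centralizes the sentinel so
// that every place reading the option applies the same rule.
// Assumption: an absent option has already been defaulted to "no" by the caller.
inline bool reconfigurationRequested(const std::string& value)
{
  if (value == "no")
    return false;
  if (value == "runwhileincorrect")
    return true; // temporary backdoor for performance tests; results are known to be incorrect
  // Anything else, including "yes", is rejected.
  throw std::runtime_error("Reconfiguration is currently broken and gives incorrect results. "
                           "Set reconfiguration=\"no\" or remove the reconfiguration option.");
}
```

With a guard like this, the performance-test inputs would switch from "yes" to "runwhileincorrect", and any production input still using "yes" would fail immediately rather than produce incorrect energies.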

@jtkrogel
Contributor Author

Yes.

@jtkrogel
Contributor Author

Still checking the code. Wait a bit.

Contributor

@ye-luo ye-luo left a comment

Need to fix compilation.

src/QMCDrivers/DMC/DMC.cpp (outdated, resolved)
src/QMCDrivers/DMC/WalkerControlFactory.cpp (outdated, resolved)
src/QMCDrivers/DMC/WalkerControlFactory.cpp (resolved)
src/QMCDrivers/SimpleFixedNodeBranch.cpp (resolved)
@jtkrogel
Contributor Author

Need a little more time for checks of the behavior of the code.

@ye-luo
Contributor

ye-luo commented Jan 29, 2020

Okay to test

Contributor

@ye-luo ye-luo left a comment

My comments are all addressed. So approve. Please wait for Jaron's final check before merging.

@jtkrogel
Contributor Author

Checks complete. This is ready to go in now.

@prckent
Contributor

prckent commented Jan 29, 2020

Do the deterministic tests pass for you? The CI is complaining in the app and driver tests. Possibly the test code needs an update to account for this change.

@jtkrogel
Contributor Author

The unit tests for qmcapp do not all pass. Looking into why.

@jtkrogel
Contributor Author

It's the batched DMC driver:

tests/bin/test_qmcapp -s "QMCDriverFactory create DMCBatched driver"

...
  QMCHamiltonian::add2WalkerProperty added
    2 to P::PropertyList 
    0 to P::Collectables 
    starting Index of the observables in P::PropertyList = 9
Fatal Error. Aborting at Reconfiguration is currently broken and gives incorrect results. Set reconfiguration="no" or remove the reconfiguration option from the DMC input section. To run performance tests, please set reconfiguration to "runwhileincorrect" instead of "yes" to restore consistent behaviour.
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

All the other driver unit tests pass.

@ye-luo @PDoakORNL Any ideas what the issue might be?

@@ -43,6 +43,9 @@ void DMCDriverInput::readXML(xmlNodePtr node)
throw std::runtime_error("Illegal input for MaxAge in DMC input section");
if(branch_interval_ < 0)
throw std::runtime_error("Illegal input for branchInterval or substeps in DMC input section");

if(reconfig_str != "no" && reconfig_str != "runwhileincorrect")
APP_ABORT("Reconfiguration is currently broken and gives incorrect results. Set reconfiguration=\"no\" or remove the reconfiguration option from the DMC input section. To run performance tests, please set reconfiguration to \"runwhileincorrect\" instead of \"yes\" to restore consistent behaviour.")
}
Contributor

Leaving aside that touching reconfig_str after it has been read breaks the new input model (i.e., options are parsed once, and typed data is stored as that type rather than as a string converted in situ all over the code):

A valid value for reconfig_str is empty.
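
Both points in this review comment can be illustrated together: convert the string to a typed value exactly once at parse time, and treat an empty string (option not given) as valid. The sketch below is a hypothetical illustration, not QMCPACK's input classes; the names `ReconfigMode` and `parseReconfigMode` are assumptions, and it throws where the PR's code aborts.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical typed representation of the option (names are illustrative only).
enum class ReconfigMode
{
  Off,              // "no", or the option was not given (empty string)
  RunWhileIncorrect // temporary backdoor value used by the performance tests
};

// Convert the raw string exactly once, at input-parsing time.
// Downstream code would then inspect the enum instead of re-reading the string.
inline ReconfigMode parseReconfigMode(const std::string& raw)
{
  if (raw.empty() || raw == "no")
    return ReconfigMode::Off;
  if (raw == "runwhileincorrect")
    return ReconfigMode::RunWhileIncorrect;
  // "yes" and any unrecognized value are rejected while the feature is broken.
  throw std::runtime_error("Illegal input for reconfiguration in DMC input section: \"" + raw + "\"");
}
```

Under this scheme an input section that omits the option parses to `ReconfigMode::Off` instead of tripping the sentinel, which is the case the failing DMCBatched unit test exercised.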

@PDoakORNL
Contributor

PDoakORNL commented Jan 30, 2020

The file that needs updating:
https://github.com/QMCPACK/qmcpack/blob/develop/src/QMCDrivers/tests/ValidQMCInputSections.h

This is where I have captured XML nodes for the unit tests. I prefer that unit tests take only minimal input and, if at all possible, do not have to read external files.

See my comment for why the change in DMCDriverInput.cpp causes failure.
I'm not going to block this PR further. @jtkrogel has gotten us patched up, but I think this workaround needs some more discussion at ECP.

My opinion in advance:
I would like to just come back and remove the option and shards of refactored garbage from the DMCDriverInput and DMCBatched driver soon. I'd rather not spend time debugging the pieces of it I ported into the batched driver; stripping it out will simplify the code. Then we can add it back with a clean design if it really is useful.

As for the legacy driver, it seems obvious to me that it should be removed from the performance tests, since the impact of a broken algorithm on runtime means comparing to past performance tests is fairly pointless. Of course, if there isn't any impact, then there is no reason not to update it to a valid input rather than leaving an example of invalid input lying around. I also think developer time should not be wasted fixing this feature in the deprecated DMC driver.

@prckent
Contributor

prckent commented Jan 30, 2020

Re: performance tests. Background: They only look at the workload, i.e., propagating a constant population for a few steps. The algorithm would have to be significantly more buggy than it is now to influence the results significantly (e.g., by putting all the electrons very close to ions). A bigger source of variance is using random starting positions for the electrons rather than an equilibrated ensemble. To do better here would require more and larger reference files.

@ye-luo
Contributor

ye-luo commented Jan 30, 2020

We agreed to improve the input section, but it never gets prioritized. Even if we can improve things just a bit to help our users and ourselves, we should do it, or at least learn something. Even if we had time to make a drastic change that fixes all the problems we have thought about, new issues would still occur afterwards.

In principle, we should implement a correct one with a new option and remove this option completely. If we just remove it now, we end up running performance tests with population control, and it will be impossible to benchmark on a single node. So far this broken implementation only affects the final energy, not the performance characteristics.

In addition, I think this feature should be implemented in a driver agnostic way.

@ye-luo
Contributor

ye-luo commented Jan 30, 2020

@jtkrogel I fixed the unit test.

@ye-luo ye-luo merged commit fbea026 into QMCPACK:develop Jan 30, 2020
@jtkrogel
Contributor Author

Thanks @ye-luo. I have no further additions to make.

Clearly fixing the input system is a longer discussion. We should go back to the main discussion regarding the overhaul of the input processing system in #407 (#2007 and #2024 are related).

I've added a new "input" label to all of these and others to make the related issues discoverable.

@jtkrogel jtkrogel deleted the block_user_reconfiguration branch March 22, 2021 13:28