-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Block use of DMC reconfiguration since it is incorrect #2254
Conversation
Can one of the admins verify this patch? |
Some regular uses that ignore the incorrectness:
|
I'll note that the relevant areas of the code have unfortunately proven very difficult for multiple people to work with now, reconfiguration or not. "Legacy" status confirmed: complex, opaque, and also wrong. We should develop a proper cleanup plan. |
We've recently had postdocs at ORNL use this option for production runs with the anticipation that all was well (and understandably because the manual presents reconfiguration as a valid option for use). These runs are now being redone, but time could have been saved had the option been appropriately blocked. I agree that a way should be left open for legitimate performance testing. Perhaps we do so only via a developer-visible backdoor key? Something like |
Unfortunate that this has wasted time. Good suggested temporary workaround. I was thinking that we should merge this and deal with the fallout in any case. We can avoid performance test fallout via a temporary "reconfiguration=runwhileincorrect" option. Do you have bandwidth to implement this + update the performance test input xml (only)? |
Yes. |
Still checking the code. Wait a bit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to fix compilation.
Need a little more time for checks of the behavior of the code. |
Okay to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comments are all addressed. So approve. Please wait for Jaron's final check before merging.
Checks complete. This is ready to go in now. |
Do the deterministic tests pass for you? The CI is complaining in the app and driver tests. Possibly the test code needs an update to account for this change. |
The unit tests for qmcapp do not all pass. Looking into why. |
It's the batched DMC driver:
All the other driver unit tests pass. @ye-luo @PDoakORNL Any ideas what the issue might be? |
@@ -43,6 +43,9 @@ void DMCDriverInput::readXML(xmlNodePtr node) | |||
throw std::runtime_error("Illegal input for MaxAge in DMC input section"); | |||
if(branch_interval_ < 0) | |||
throw std::runtime_error("Illegal input for branchInterval or substeps in DMC input section"); | |||
|
|||
if(reconfig_str != "no" && reconfig_str != "runwhileincorrect") | |||
APP_ABORT("Reconfiguration is currently broken and gives incorrect results. Set reconfiguration=\"no\" or remove the reconfiguration option from the DMC input section. To run performance tests, please set reconfiguration to \"runwhileincorrect\" instead of \"yes\" to restore consistent behaviour.") | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leaving aside that touching the reconfig_str after its been read breaks the new input model, i.e. options are parsed once and if they are typed data they are that type not a string converted in situ all over the code.
A valid value for reconfig_str is empty.
The file that needs updating: This Is where I have captured XML nodes for the unit tests, I prefer that unit tests only take minimal input and if it all possible do not have to read external files. See my comment for why the change in DMCDriverInput.cpp causes failure. My Opinion in advance: As for the legacy driver, It seems obvious to me that it should be removed from the performance tests since the impact of a broken algorithm on runtime means comparing to past performance tests is fairly pointless. Of course if there isn't any impact then there is no reason not to update it to a valid input rather than leaving an example of invalid input lying around. I also Developer time should not be wasted fixing this feature of in the deprecated DMC driver. |
Re: performance tests. Background: They only look at the workload - propagating a constant population for a few steps. The algorithm would have to be significantly more buggy than now to influence the results significantly. (e.g. put all the electrons very close to ions). A bigger source of variance is using a random starting positions for the electrons and not an equilibrated ensemble. To do better here would require more and large reference files. |
We agreed to improve the input section but it never gets prioritized. Even if we can improve even just a bit to help our users and ourselves. We should do it or at least learn something. Even if we have time to make a drastic change to fix all the problems we have thought about, new issues will still occur after. In principle, we should implement a correct one with a new option and remove this option completely. If we just remove it now, we ends up running performance tests with population control and it will be impossible to benchmark on a single node. So far this broken implementation only affect final energy but not performance characteristics. In addition, I think this feature should be implemented in a driver agnostic way. |
@jtkrogel I fixed the unit test. |
Thanks @ye-luo. I have no further additions to make. Clearly fixing the input system is a longer discussion. We should go back to the main discussion regarding the overhaul of the input processing system in #407 (#2007 and #2024 are related). I've added a new "input" label to all of these and others to make the related issues discoverable. |
DMC with stochastic reconfiguration is currently broken, and has been for a long time (see #18). Production runs should not be performed with this option until the bug is fixed.
This PR places sentinel code to prevent production use of this broken option. The aborts should be removed only after the bug is fixed.
As an aside, this option being read in 4 different places highlights poor design in this region of the code.