Simplify checkpointing, change defaults, fix legacy, implement in batched #4646

Merged
merged 5 commits into from
Jun 26, 2023

Conversation

prckent
Contributor

@prckent prckent commented Jun 26, 2023

Proposed changes

This PR aims to simplify and improve the checkpointing process.

As mentioned in #4633, the documented every-N-blocks checkpointing feature was not working as expected. In the legacy drivers, it worked only if storeconfigs was also specified, which was undocumented and unexpected. Checkpointing was also not implemented in the batched drivers.

Here:

  • The checkpoint={-1,0,N} driver parameter alone is now sufficient to control checkpointing.
  • The default is changed to checkpoint=0: configs are written at the end of each QMC section. checkpoint=-1 disables checkpointing. checkpoint=N writes every N blocks in addition to the end-of-section write.
  • Removed the storeconfigs option completely.
  • Implemented checkpointing in the batched code.
  • Updated the docs accordingly.
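
The semantics of the three checkpoint settings listed above can be sketched as a single predicate. This is only an illustration, not the code in this PR: the function name `shouldWriteCheckpoint`, its arguments, and the 1-based block counting are assumptions made for the example.

```cpp
// Hypothetical helper, NOT the QMCPACK implementation.
// Encodes the checkpoint semantics described in this PR:
//   checkpoint = -1 : never write
//   checkpoint =  0 : write only at the end of the QMC section
//   checkpoint =  N : write every N blocks, plus at the end of the section
// Assumption: `block` is the 1-based count of blocks completed so far.
constexpr bool shouldWriteCheckpoint(int checkpoint, int block, int total_blocks)
{
  return checkpoint >= 0 &&                                 // -1 disables everything
      (block == total_blocks ||                             // end-of-section write
       (checkpoint > 0 && block % checkpoint == 0));        // periodic write
}

// Compile-time checks of the three modes.
static_assert(!shouldWriteCheckpoint(-1, 10, 10), "-1 disables all writes");
static_assert(!shouldWriteCheckpoint(0, 5, 10), "0 does not write mid-section");
static_assert(shouldWriteCheckpoint(0, 10, 10), "0 writes at the end of the section");
static_assert(shouldWriteCheckpoint(3, 6, 10), "N writes every N blocks");
static_assert(shouldWriteCheckpoint(3, 10, 10), "N also writes at the end of the section");
```

Under these assumptions, checkpoint=3 with 10 blocks writes after blocks 3, 6, 9, and 10.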

Have verified by eye that the checkpoints look sensible and are updated periodically. Open to testing suggestions.

The batched code does not create an info.xml as part of the checkpoint.

Also removes the unused FastGrad option.

What type(s) of changes does this code introduce?

  • Bugfix
  • New feature

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

sulfur gcc build only. PR for CI.

Checklist

Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is
simply a reminder of what we are going to look for before merging your code.

  • Yes. This PR is up to date with the current state of 'develop'
  • Yes. Code added or changed in the PR has been clang-formatted
  • (Existing restart tests use code paths). This PR adds tests to cover any new code, or to catch a bug that is being fixed
  • Yes. Documentation has been added (if appropriate)

@prckent prckent changed the title Implement checkpointing in batched code and fix legacy Checkpointing in batched code, fix legacy, remove storeconfigs Jun 26, 2023
@ye-luo
Contributor

ye-luo commented Jun 26, 2023

According to the documentation, the default value of checkpoint is -1. Would it be better to have 0 as the default, i.e. write at the end of each driver section?

@prckent
Contributor Author

prckent commented Jun 26, 2023

I agree. Pushed change and will update PR text

@prckent prckent changed the title Checkpointing in batched code, fix legacy, remove storeconfigs Simplify checkpointing, change defaults, fix legacy, implement in batched Jun 26, 2023
@prckent prckent marked this pull request as ready for review June 26, 2023 19:23
@prckent
Contributor Author

prckent commented Jun 26, 2023

Test this please

@ye-luo
Contributor

ye-luo commented Jun 26, 2023

@prckent your fix brings to my attention that we don't have corresponding tests for the batched drivers.

@prckent
Contributor Author

prckent commented Jun 26, 2023

Agree we are missing some tests here. Which were you thinking of? batched dmc restart tests? or something else?

I don't think that we have any tests on checkpoints saved within a section (as opposed to at the end), whether legacy or batched. The restart tests use the "checkpoints" saved at the end of a section. It is not obvious how to reliably catch the checkpoints written within a section for testing.

@ye-luo
Contributor

ye-luo commented Jun 26, 2023

> Agree we are missing some tests here. Which were you thinking of? batched dmc restart tests? or something else?
>
> I don't think that we have any tests on checkpoints saved within a section (as opposed to at the end), whether legacy or batched. The restart tests use the "checkpoints" saved at the end of a section. It is not obvious how to reliably catch the checkpoints written within a section for testing.

That type of test can only be done on the unit-test side. We can simply run a driver that dumps coordinates in the middle of a section, then load them back without using a driver and check against fixed numbers.
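
A minimal sketch of that dump, reload, and compare pattern is below. It is an assumption-laden illustration rather than QMCPACK code: `Position`, `dumpWalkers`, `loadWalkers`, and `matchesReference` are hypothetical names, and a plain text stream stands in for the real checkpoint file format.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <iomanip>
#include <istream>
#include <limits>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical walker coordinate type for this sketch.
using Position = std::array<double, 3>;

// Stand-in for a mid-section dump: writes the walker count, then one
// position per line, with enough digits to round-trip a double exactly.
void dumpWalkers(std::ostream& out, const std::vector<Position>& walkers)
{
  out << std::setprecision(std::numeric_limits<double>::max_digits10);
  out << walkers.size() << '\n';
  for (const Position& r : walkers)
    out << r[0] << ' ' << r[1] << ' ' << r[2] << '\n';
}

// Reads the dump back without involving any driver.
std::vector<Position> loadWalkers(std::istream& in)
{
  std::size_t n = 0;
  in >> n;
  std::vector<Position> walkers(n);
  for (Position& r : walkers)
    in >> r[0] >> r[1] >> r[2];
  return walkers;
}

// Compares loaded coordinates against fixed reference numbers.
bool matchesReference(const std::vector<Position>& walkers,
                      const std::vector<Position>& reference,
                      double tol = 1e-12)
{
  if (walkers.size() != reference.size())
    return false;
  for (std::size_t i = 0; i < walkers.size(); ++i)
    for (std::size_t d = 0; d < 3; ++d)
      if (std::abs(walkers[i][d] - reference[i][d]) > tol)
        return false;
  return true;
}
```

In a real unit test the dump would come from a driver run with a fixed seed, and the reference values would be hard-coded numbers recorded from a trusted run.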

@prckent
Contributor Author

prckent commented Jun 26, 2023

Test this please

@ye-luo ye-luo merged commit ee17a8e into QMCPACK:develop Jun 26, 2023
@prckent prckent deleted the removestoreconfigs branch June 26, 2023 23:55
@prckent prckent mentioned this pull request Aug 18, 2023