Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Checkpointing #74

Merged
merged 7 commits into from
Aug 30, 2020
Merged

Add Checkpointing #74

merged 7 commits into from
Aug 30, 2020

Conversation

elbeejay
Copy link
Member

@elbeejay elbeejay commented Jul 14, 2020

I think this covers the gist of #57, if a bit crudely. Am happy to move things "up" to the preprocessor if that seems to make more sense, I just honestly am not as comfortable/familiar with the CLI wrappers and the stuff going on there. So instead, I stuck the bits to save and load the checkpoint files a bit deeper into the model.

new YAML flags

  • save_checkpoint : this is a boolean controlling whether or not checkpoint files are saved, default is False

  • resume_checkpoint : this is a boolean controlling whether or not checkpoint files should be loaded from the out_dir defined in the YAML

  • checkpoint_dt : the save interval at which to record checkpoint information to the disk. If undefined, the checkpoint information will be saved with the frequency save_dt

checkpoint files

All The bulk of the grids and the random number state (per suggestion in #61) are saved to a single .npz file called checkpoint.npz.

The arrays holding information about the stratigraphy are sparse arrays, so they are saved in their own .npz files (strata_eta.npz and strata_sand_frac.npz). Alternatively we could convert them to dense arrays and have all of our checkpoint data in a single file - I don't feel strongly about this either way.

@elbeejay elbeejay requested a review from amoodie July 14, 2020 13:28
@codecov-commenter
Copy link

codecov-commenter commented Jul 14, 2020

Codecov Report

Merging #74 into develop will increase coverage by 0.44%.
The diff coverage is 97.29%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop      #74      +/-   ##
===========================================
+ Coverage    88.87%   89.32%   +0.44%     
===========================================
  Files           11       11              
  Lines         1241     1311      +70     
===========================================
+ Hits          1103     1171      +68     
- Misses         138      140       +2     
Impacted Files Coverage Δ
pyDeltaRCM/model.py 91.72% <87.50%> (-0.65%) ⬇️
pyDeltaRCM/deltaRCM_tools.py 94.56% <100.00%> (+0.55%) ⬆️
pyDeltaRCM/init_tools.py 98.90% <100.00%> (+0.15%) ⬆️
pyDeltaRCM/shared_tools.py 30.46% <100.00%> (+3.41%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update efa579b...f85a4a0. Read the comment docs.

Copy link
Member

@amoodie amoodie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @elbeejay

I'm not sure about moving things to the preprocessor either. I think your implementation is probably the correct way to do it, because the preprocessor would want to call the methods of individual model instances anyway (since each knows where its own files are located). In the preprocessor, we will want to implement some wrapper to have the relevant --resume_checkpoint option.

Overall, looks like a spot on implementation, just made some comments/suggestions below 👍

tests/test_model.py Show resolved Hide resolved
pyDeltaRCM/default.yml Outdated Show resolved Hide resolved
pyDeltaRCM/default.yml Outdated Show resolved Hide resolved
pyDeltaRCM/deltaRCM_tools.py Outdated Show resolved Hide resolved
pyDeltaRCM/deltaRCM_tools.py Outdated Show resolved Hide resolved
pyDeltaRCM/deltaRCM_tools.py Outdated Show resolved Hide resolved
def load_checkpoint(self):
"""Load the checkpoint from the .npz file."""
ckp_file = os.path.join(self.prefix, 'checkpoint.npz')
checkpoint = np.load(ckp_file, allow_pickle=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this line will throw a FileNotFound error if there is no checkpoint created? I think that's probably fine, but is there any reason we would want to check for ourselves whether the file exists first and throw our own exception? Not sure, just thinking out loud.

pyDeltaRCM/model.py Show resolved Hide resolved
@elbeejay
Copy link
Member Author

Thanks for the feedback @amoodie. I added the logger messages but have yet to implement tests to check the log for them. Also need to write the true consistency check to make sure the checkpointing is working as designed.

I did add a separate checkpoint_dt optional flag in the .yml to give users the option to save the checkpoint at whatever frequency they want. When it is not specified the save_dt value is used (couldn't find a way to specify this default in the default.yml itself unfortunately). Since we went to the effort of keeping a NetCDF file open so we could rapidly write to it, I felt like it was a shame to automatically slow things down by writing the .npz checkpoint to file every time you saved data (if you wanted to use the checkpointing feature). So hopefully this allows for more of a balance between saving checkpoint data in the event a run must be resumed, and being able to save the model grids with reasonable regularity to the NetCDF.

@elbeejay elbeejay marked this pull request as draft July 25, 2020 00:05
@elbeejay
Copy link
Member Author

Flipping this to a draft as there seem to be some things I'm missing. Forgot about needing to re-open the netCDF file when the checkpoint is re-loaded. Also need to resolve inconsistencies between runs for the duration and runs that have been 'resumed' from a checkpoint.

@elbeejay elbeejay marked this pull request as ready for review August 28, 2020 23:43
@elbeejay elbeejay requested a review from amoodie August 28, 2020 23:44
@elbeejay
Copy link
Member Author

Just realized that this 'simple' checkpointing implementation neglects subsidence right now...

Copy link
Member

@amoodie amoodie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elbeejay This looks great. I know it was a lot of work, thanks for tackling it.

My only thought is with regard to your point that subsidence is not handled. Perhaps we could add a NotImplementedError during initialization, if both save_checkpoint and toggle_subsidence are true. I sent a PR to your branch with this added, and a test.

if self.save_checkpoint and self.toggle_subsidence:
    raise NotImplementedError('Cannot handle checkpointing with subsidence.')

Would be good to open a reminder issue to handle the subsidence implementation.

add error if both checkpointing and subsidence. Simple xFail test.
@elbeejay
Copy link
Member Author

Thanks for the review, opened an issue so we will be sure to address that in the future.

@elbeejay elbeejay merged commit 2d5fd66 into DeltaRCM:develop Aug 30, 2020
@elbeejay elbeejay deleted the checkpointing branch October 30, 2020 15:29
@amoodie amoodie mentioned this pull request Dec 9, 2020
amoodie pushed a commit to amoodie/pyDeltaRCM that referenced this pull request Feb 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants