Add Checkpointing #74

elbeejay · 2020-07-14T13:28:20Z

I think this covers the gist of #57, if a bit crudely. I'm happy to move things "up" to the preprocessor if that makes more sense; I'm just not as comfortable/familiar with the CLI wrappers and what goes on there. So instead, I put the code that saves and loads the checkpoint files a bit deeper in the model.

new YAML flags

save_checkpoint : this is a boolean controlling whether or not checkpoint files are saved, default is False
resume_checkpoint : this is a boolean controlling whether or not checkpoint files should be loaded from the out_dir defined in the YAML
checkpoint_dt : the interval at which checkpoint information is written to disk. If undefined, the checkpoint information is saved at the save_dt interval

checkpoint files

All of the grids and the random number state (per suggestion in #61) are saved to a single .npz file called checkpoint.npz.

The arrays holding information about the stratigraphy are sparse arrays, so they are saved in their own .npz files (strata_eta.npz and strata_sand_frac.npz). Alternatively we could convert them to dense arrays and have all of our checkpoint data in a single file - I don't feel strongly about this either way.
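For illustration, here is a minimal sketch of how grids and the random number state could be written to a single checkpoint.npz and restored from it. The function names `save_checkpoint_sketch`, `load_checkpoint_sketch`, and `demo`, the `eta` grid, and the choice to store the legacy NumPy global random state as separate plain arrays are all assumptions for this example, not the code in this PR. The sparse strata arrays are left out.

```python
import os
import tempfile

import numpy as np


def save_checkpoint_sketch(out_dir, grids):
    """Write the grids and the global NumPy random state to checkpoint.npz."""
    # get_state() returns ('MT19937', keys, pos, has_gauss, cached_gaussian)
    _, keys, pos, has_gauss, cached_gaussian = np.random.get_state()
    ckp_file = os.path.join(out_dir, 'checkpoint.npz')
    np.savez(ckp_file, rng_keys=keys, rng_pos=pos,
             rng_has_gauss=has_gauss,
             rng_cached_gaussian=cached_gaussian, **grids)
    return ckp_file


def load_checkpoint_sketch(out_dir):
    """Restore the random state and return the saved grids as a dict."""
    ckp_file = os.path.join(out_dir, 'checkpoint.npz')
    with np.load(ckp_file) as checkpoint:
        state = ('MT19937',
                 checkpoint['rng_keys'],
                 int(checkpoint['rng_pos']),
                 int(checkpoint['rng_has_gauss']),
                 float(checkpoint['rng_cached_gaussian']))
        # everything that is not part of the random state is a grid
        grids = {name: checkpoint[name] for name in checkpoint.files
                 if not name.startswith('rng_')}
    np.random.set_state(state)
    return grids


def demo():
    """Draw numbers after saving, reload, and draw again."""
    np.random.seed(0)
    with tempfile.TemporaryDirectory() as out_dir:
        save_checkpoint_sketch(out_dir, {'eta': np.arange(6.0).reshape(2, 3)})
        first = np.random.uniform(size=5)
        grids = load_checkpoint_sketch(out_dir)
        second = np.random.uniform(size=5)
    return first, second, grids
```

If the random state is restored correctly, the two draws returned by `demo()` should be identical, which is the property a resumed run depends on.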

codecov-commenter · 2020-07-14T13:30:24Z

Codecov Report

Merging #74 into develop will increase coverage by 0.44%.
The diff coverage is 97.29%.

@@             Coverage Diff             @@
##           develop      #74      +/-   ##
===========================================
+ Coverage    88.87%   89.32%   +0.44%     
===========================================
  Files           11       11              
  Lines         1241     1311      +70     
===========================================
+ Hits          1103     1171      +68     
- Misses         138      140       +2

Impacted Files	Coverage Δ
pyDeltaRCM/model.py	`91.72% <87.50%> (-0.65%)`	⬇️
pyDeltaRCM/deltaRCM_tools.py	`94.56% <100.00%> (+0.55%)`	⬆️
pyDeltaRCM/init_tools.py	`98.90% <100.00%> (+0.15%)`	⬆️
pyDeltaRCM/shared_tools.py	`30.46% <100.00%> (+3.41%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update efa579b...f85a4a0.

amoodie

Looks great @elbeejay

I'm not sure about moving things to the preprocessor either. I think your implementation is probably the correct way to do it, because the preprocessor would want to call the methods of individual model instances anyway (since each knows where its own files are located). In the preprocessor, we will want to implement some wrapper to have the relevant --resume_checkpoint option.

Overall, looks like a spot on implementation, just made some comments/suggestions below 👍

tests/test_model.py

pyDeltaRCM/default.yml

pyDeltaRCM/deltaRCM_tools.py

amoodie · 2020-07-14T14:38:53Z

pyDeltaRCM/init_tools.py

+    def load_checkpoint(self):
+        """Load the checkpoint from the .npz file."""
+        ckp_file = os.path.join(self.prefix, 'checkpoint.npz')
+        checkpoint = np.load(ckp_file, allow_pickle=True)


I guess this line will raise a FileNotFoundError if no checkpoint has been created? I think that's probably fine, but is there any reason we would want to check whether the file exists ourselves first and raise our own exception? Not sure, just thinking out loud.
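To make the alternative concrete, a rough sketch of an explicit existence check is below. The helper name `find_checkpoint` and the wording of the message are hypothetical; only the `checkpoint.npz` file name and the `os.path.join` call come from the diff above.

```python
import os


def find_checkpoint(prefix):
    """Return the checkpoint path, or raise an error that names the cause."""
    ckp_file = os.path.join(prefix, 'checkpoint.npz')
    if not os.path.isfile(ckp_file):
        # same exception type np.load would raise, but with a clearer message
        raise FileNotFoundError(
            'resume_checkpoint is set, but no checkpoint file was found '
            'at {}'.format(ckp_file))
    return ckp_file
```

The practical difference from letting `np.load` fail is only the message: the exception type stays the same, so callers that already catch `FileNotFoundError` would not need to change.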

pyDeltaRCM/model.py

elbeejay · 2020-07-14T18:28:59Z

Thanks for the feedback @amoodie. I added the logger messages but have yet to implement tests to check the log for them. Also need to write the true consistency check to make sure the checkpointing is working as designed.

I did add a separate checkpoint_dt optional flag in the .yml to give users the option to save the checkpoint at whatever frequency they want. When it is not specified the save_dt value is used (couldn't find a way to specify this default in the default.yml itself unfortunately). Since we went to the effort of keeping a NetCDF file open so we could rapidly write to it, I felt like it was a shame to automatically slow things down by writing the .npz checkpoint to file every time you saved data (if you wanted to use the checkpointing feature). So hopefully this allows for more of a balance between saving checkpoint data in the event a run must be resumed, and being able to save the model grids with reasonable regularity to the NetCDF.
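The fallback described here could look roughly like the sketch below. The function name `resolve_checkpoint_dt` and the plain-dict `config` are assumptions for illustration; the PR's actual handling of the parsed YAML may differ.

```python
def resolve_checkpoint_dt(config):
    """Return the checkpoint interval, falling back to save_dt."""
    checkpoint_dt = config.get('checkpoint_dt')
    if checkpoint_dt is None:
        # not specified in the YAML, so checkpoint as often as data is saved
        return config['save_dt']
    return checkpoint_dt
```

For example, `resolve_checkpoint_dt({'save_dt': 50})` gives 50, while adding `'checkpoint_dt': 500` gives 500, so checkpoints can be written far less often than the NetCDF output.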

elbeejay · 2020-07-25T00:06:22Z

Flipping this to a draft as there seem to be some things I'm missing. Forgot about needing to re-open the netCDF file when the checkpoint is re-loaded. Also need to resolve inconsistencies between runs that go the full duration uninterrupted and runs that have been 'resumed' from a checkpoint.

elbeejay · 2020-08-28T23:52:43Z

Just realized that this 'simple' checkpointing implementation neglects subsidence right now...

amoodie

@elbeejay This looks great. I know it was a lot of work, thanks for tackling it.

My only thought is with regard to your point that subsidence is not handled. Perhaps we could add a NotImplementedError during initialization, if both save_checkpoint and toggle_subsidence are true. I sent a PR to your branch with this added, and a test.

if self.save_checkpoint and self.toggle_subsidence:
    raise NotImplementedError('Cannot handle checkpointing with subsidence.')

Would be good to open a reminder issue to handle the subsidence implementation.
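As a self-contained illustration of how that guard behaves, here is a small sketch. `ModelSketch` and `combination_is_rejected` are hypothetical stand-ins and not the real model class or the test in the PR; only the condition and the error message are taken from the snippet above.

```python
class ModelSketch:
    """Hypothetical stand-in that only performs the initialization check."""

    def __init__(self, save_checkpoint=False, toggle_subsidence=False):
        self.save_checkpoint = save_checkpoint
        self.toggle_subsidence = toggle_subsidence
        if self.save_checkpoint and self.toggle_subsidence:
            raise NotImplementedError(
                'Cannot handle checkpointing with subsidence.')


def combination_is_rejected():
    """Return True if enabling both options raises NotImplementedError."""
    try:
        ModelSketch(save_checkpoint=True, toggle_subsidence=True)
    except NotImplementedError:
        return True
    return False
```

Either option on its own still initializes normally; only the combination is refused.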

add error if both checkpointing and subsidence. Simple xFail test.

elbeejay · 2020-08-30T22:26:06Z

Thanks for the review, opened an issue so we will be sure to address that in the future.

elbeejay requested a review from amoodie July 14, 2020 13:28

amoodie reviewed Jul 14, 2020

elbeejay force-pushed the checkpointing branch from d105a5d to c3aa3f3 Compare July 22, 2020 17:39

elbeejay marked this pull request as draft July 25, 2020 00:05

elbeejay force-pushed the checkpointing branch from bb5cd88 to 8c5c3d7 Compare July 28, 2020 15:27

amoodie mentioned this pull request Jul 29, 2020

refactor model progression api in model and preprocessor #84

Closed

elbeejay added 2 commits August 25, 2020 22:50

checkpointing functionality

9a656b5

checkpointing tests

27a2a67

elbeejay force-pushed the checkpointing branch from c976c15 to 27a2a67 Compare August 26, 2020 02:53

elbeejay added 2 commits August 28, 2020 19:07

minor fixes and changing tests to account for saving of t==0 now

bccae56

close netcdfs for failing windows tests

c2d9267

elbeejay marked this pull request as ready for review August 28, 2020 23:43

elbeejay requested a review from amoodie August 28, 2020 23:44

relocate strata init based on checkpointing

306c20d

elbeejay mentioned this pull request Aug 29, 2020

Additional / Longer Tests of "Checkpointing" #93

Closed

add error if both checkpointing and subsidence. Simple xFail test.

8cb0423

amoodie approved these changes Aug 30, 2020

Merge pull request #2 from amoodie/checkpointing

f85a4a0

add error if both checkpointing and subsidence. Simple xFail test.

elbeejay mentioned this pull request Aug 30, 2020

Add subsidence support to checkpointing #95

Closed

elbeejay merged commit 2d5fd66 into DeltaRCM:develop Aug 30, 2020

elbeejay deleted the checkpointing branch October 30, 2020 15:29

amoodie mentioned this pull request Dec 9, 2020

Implement checkpointing #57

Closed

amoodie pushed a commit to amoodie/pyDeltaRCM that referenced this pull request Feb 25, 2021

Merge pull request DeltaRCM#74 from elbeejay/checkpointing

71bae06
