Add Checkpointing #74
Codecov Report
@@ Coverage Diff @@
## develop #74 +/- ##
===========================================
+ Coverage 88.87% 89.32% +0.44%
===========================================
Files 11 11
Lines 1241 1311 +70
===========================================
+ Hits 1103 1171 +68
- Misses 138 140 +2
Looks great @elbeejay
I'm not sure about moving things to the preprocessor either. I think your implementation is probably the correct way to do it, because the preprocessor would want to call the methods of individual model instances anyway (since each knows where its own files are located). In the preprocessor, we will want to implement some wrapper to have the relevant --resume_checkpoint option.
Overall, looks like a spot on implementation, just made some comments/suggestions below 👍
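For reference, a minimal sketch of what such a preprocessor wrapper could look like, assuming an argparse-based CLI. Only the --resume_checkpoint flag comes from the discussion above; the --config option, the function names, and the override dictionary are hypothetical and not taken from the project.

```python
import argparse


def build_parser():
    """Build a hypothetical command-line parser with a checkpoint-resume flag."""
    parser = argparse.ArgumentParser(description="hypothetical model runner")
    # --config is an assumed option, included only to make the sketch complete.
    parser.add_argument("--config", help="path to a YAML configuration file")
    parser.add_argument(
        "--resume_checkpoint",
        action="store_true",
        help="resume the run from checkpoint files in the output directory",
    )
    return parser


def config_overrides(args):
    """Return the YAML-style overrides implied by the parsed CLI arguments."""
    # The wrapper only sets the flag; each model instance would still call
    # its own load method, since it knows where its own files are located.
    return {"resume_checkpoint": True} if args.resume_checkpoint else {}
```

With this shape, the preprocessor stays thin and simply forwards the flag into the configuration each model instance receives.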
def load_checkpoint(self):
    """Load the checkpoint from the .npz file."""
    ckp_file = os.path.join(self.prefix, 'checkpoint.npz')
    checkpoint = np.load(ckp_file, allow_pickle=True)
I guess this line will throw a FileNotFoundError if there is no checkpoint created? I think that's probably fine, but is there any reason we would want to check for ourselves whether the file exists first and throw our own exception? Not sure, just thinking out loud.
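A minimal sketch of the explicit check floated here, assuming the checkpoint lives at prefix/checkpoint.npz as in the snippet above. The helper name find_checkpoint and the wording of the message are hypothetical.

```python
import os


def find_checkpoint(prefix, filename="checkpoint.npz"):
    """Return the checkpoint path, or raise a descriptive error if it is missing."""
    ckp_file = os.path.join(prefix, filename)
    if not os.path.isfile(ckp_file):
        # Same exception type that opening a missing file would raise,
        # but with a message that points at the likely cause.
        raise FileNotFoundError(
            f"No checkpoint file found at {ckp_file!r}; "
            "was the previous run made with save_checkpoint enabled?"
        )
    return ckp_file
```

The trade-off is small: the exception type is unchanged, so callers catching FileNotFoundError behave the same, and only the message becomes more helpful.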
Thanks for the feedback @amoodie. I added the logger messages but have yet to implement tests to check the log for them. Also need to write the true consistency check to make sure the checkpointing is working as designed. I did add a separate
Flipping this to a draft as there seem to be some things I'm missing. Forgot about needing to re-open the netCDF file when the checkpoint is re-loaded. Also need to resolve inconsistencies between runs for the duration and runs that have been 'resumed' from a checkpoint.
Just realized that this 'simple' checkpointing implementation neglects subsidence right now...
@elbeejay This looks great. I know it was a lot of work, thanks for tackling it.
My only thought is with regard to your point that subsidence is not handled. Perhaps we could add a NotImplementedError during initialization, if both save_checkpoint and toggle_subsidence are true. I sent a PR to your branch with this added, and a test.
if self.save_checkpoint and self.toggle_subsidence:
    raise NotImplementedError('Cannot handle checkpointing with subsidence.')
Would be good to open a reminder issue to handle the subsidence implementation.
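To illustrate the guard in isolation, here is a self-contained sketch. ToyModel is a hypothetical stand-in for the real model class, and the check function uses plain try/except rather than the xfail-style test mentioned in this thread.

```python
class ToyModel:
    """Hypothetical stand-in for the model; only the two flags matter here."""

    def __init__(self, save_checkpoint=False, toggle_subsidence=False):
        self.save_checkpoint = save_checkpoint
        self.toggle_subsidence = toggle_subsidence
        # Fail early, during initialization, instead of writing checkpoints
        # that silently omit the subsidence state.
        if self.save_checkpoint and self.toggle_subsidence:
            raise NotImplementedError(
                'Cannot handle checkpointing with subsidence.')


def checkpoint_with_subsidence_raises():
    """Return True if the unsupported flag combination is rejected."""
    try:
        ToyModel(save_checkpoint=True, toggle_subsidence=True)
    except NotImplementedError:
        return True
    return False
```

Either flag on its own should still initialize normally; only the combination is rejected.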
add error if both checkpointing and subsidence. Simple xFail test.
Thanks for the review, opened an issue so we will be sure to address that in the future.
Add Checkpointing
I think this covers the gist of #57, if a bit crudely.
Am happy to move things "up" to the preprocessor if that seems to make more sense, I just honestly am not as comfortable/familiar with the CLI wrappers and the stuff going on there. So instead, I stuck the bits to save and load the checkpoint files a bit deeper into the model.
new YAML flags
save_checkpoint: this is a boolean controlling whether or not checkpoint files are saved, default is False
resume_checkpoint: this is a boolean controlling whether or not checkpoint files should be loaded from the out_dir defined in the YAML
checkpoint_dt: the save interval at which to record checkpoint information to the disk. If undefined, the checkpoint information will be saved with the frequency save_dt
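As a rough illustration of how these three flags could be resolved once the YAML has been parsed into a dictionary, assuming that resume_checkpoint also defaults to False (the description above only states the default for save_checkpoint). The function name is hypothetical.

```python
def resolve_checkpoint_options(config):
    """Apply the defaults described above to a parsed YAML dictionary."""
    save_checkpoint = bool(config.get("save_checkpoint", False))
    # Assumption: resume_checkpoint defaults to False, like save_checkpoint.
    resume_checkpoint = bool(config.get("resume_checkpoint", False))
    checkpoint_dt = config.get("checkpoint_dt")
    if checkpoint_dt is None:
        # If undefined, checkpoints are written at the regular save interval.
        checkpoint_dt = config["save_dt"]
    return {
        "save_checkpoint": save_checkpoint,
        "resume_checkpoint": resume_checkpoint,
        "checkpoint_dt": checkpoint_dt,
    }
```

For example, a configuration with save_checkpoint enabled and only save_dt set would checkpoint every save_dt.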
checkpoint files
The bulk of the grids and the random number state (per suggestion in #61) are saved to a single .npz file called checkpoint.npz.
The arrays holding information about the stratigraphy are sparse arrays, so they are saved in their own .npz files (strata_eta.npz and strata_sand_frac.npz). Alternatively we could convert them to dense arrays and have all of our checkpoint data in a single file - I don't feel strongly about this either way.
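A minimal sketch of the single-file part of this scheme and of the kind of consistency check a resumed run should pass. It assumes the model draws from NumPy's legacy global generator (np.random.get_state / np.random.set_state); the grid name eta, the key names, and the function names are hypothetical. The RNG state is stored as separate plain arrays so that loading does not need allow_pickle, which differs from the PR's own load call. The sparse stratigraphy arrays are not covered here.

```python
import os
import tempfile

import numpy as np


def save_checkpoint(prefix, eta):
    """Write one dense grid and the global RNG state to checkpoint.npz."""
    kind, keys, pos, has_gauss, cached_gaussian = np.random.get_state()
    np.savez(
        os.path.join(prefix, "checkpoint.npz"),
        eta=eta,
        rng_kind=kind,
        rng_keys=keys,
        rng_pos=pos,
        rng_has_gauss=has_gauss,
        rng_cached_gaussian=cached_gaussian,
    )


def load_checkpoint(prefix):
    """Restore the RNG state from checkpoint.npz and return the saved grid."""
    with np.load(os.path.join(prefix, "checkpoint.npz")) as ckp:
        state = (
            str(ckp["rng_kind"]),
            ckp["rng_keys"],
            int(ckp["rng_pos"]),
            int(ckp["rng_has_gauss"]),
            float(ckp["rng_cached_gaussian"]),
        )
        eta = ckp["eta"]
    np.random.set_state(state)
    return eta


def resumed_run_matches(n_draws=5):
    """Check that draws after a resume equal the draws of an uninterrupted run."""
    with tempfile.TemporaryDirectory() as prefix:
        np.random.seed(0)
        eta = np.random.rand(4, 4)
        save_checkpoint(prefix, eta)
        expected = np.random.rand(n_draws)  # what the uninterrupted run draws next
        np.random.seed(12345)  # disturb the generator, as a fresh process would
        restored = load_checkpoint(prefix)
        resumed = np.random.rand(n_draws)
    return bool(np.array_equal(eta, restored) and np.array_equal(expected, resumed))
```

Calling resumed_run_matches() exercises the round trip: the restored grid and the post-resume random draws should both equal those of the uninterrupted run.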