-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add option to specify archive directory for checkpoint files #225
add option to specify archive directory for checkpoint files #225
Conversation
I'll set up a test case ASAP and do a small node run to test while we wait for Cori to stop having file system issues. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. A few comments/suggestions included.
@@ -50,6 +50,10 @@ | |||
"This will be prepended to any existing IMSIM_IMAGE_PATH " | |||
"environment variable, for which $CWD is included by " | |||
"default.") | |||
parser.add_argument('--ckpt_archive_dir', type=str, default=None, | |||
help="Archive directory for checkpoint files. " | |||
"If None, then delete them (if the checkpointing.cleanup " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some users might not be aware of the config file and think this refers to a non-existent option. Perhaps "if the checkpointing.cleanup variable in the config file being used is True"?
self.ckpt_archive_dir = ckpt_archive_dir | ||
if (self.ckpt_archive_dir is not None and | ||
not os.path.isdir(self.ckpt_archive_dir)): | ||
os.makedirs(self.ckpt_archive_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if this should be in a try: block?
Im thinking that people might put down general shared directories here, and it might turn out they don't exist or aren't available in certain run conditions or systems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the action should be if it catches an error. Generally, unless there's a very specific action to take that resolves the error cleanly, I think it's better to let any exceptions go uncaught so that the failure mode is clear at runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah... well I was thinking of catching it, making a very clear statement about went wrong and then stopping the program. I thought this might be the sort of thing that was confusing to non-experts if they just saw the stack-trace.
So, basically the same but trying to be a bit more user friendly. But, I suppose you are right that the standard error will have more information in it for those who can read it.
python/desc/imsim/ImageSimulator.py
Outdated
src_file = gs_interpreter.checkpoint_file | ||
dest_file = os.path.join(IMAGE_SIMULATOR.ckpt_archive_dir, | ||
os.path.basename(src_file)) | ||
shutil.move(src_file, dest_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the past, I think I recall we had problems with a small fraction of file operations sometimes not working as they weren't atomic. Do we need to pay the same attention to such issues here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the specific use case we added this for, it's not an issue: if the destination is on the same file system the shutil.move
should itself be atomic. For moving to a different fs, the last thing the shutil.move
would do would be to delete the original file after copying, which would be the last thing our code would do if we tried to make the operation atomic. In the event of a crash mid-copy to the other fs, there would be a partially written dest_file
, so I suppose we could copy to a temp file, then rename it and then delete the original.
@villarrealas I made a change in the code to move the checkpoint files to address Chris' comment about the atomicity of that operation. Could please include that change in your tests? Thanks. |
This will allow us to save the checkpoint files to an archive area for our test run without AGNs so that we can restart the runs later and add the AGNs once we have a reliable model. Having the code move the checkpoint files this way allows for the same workflow logic that relied on the deletion of the checkpoint files from the working directory.
@villarrealas Could you give this a try?