Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add option to specify archive directory for checkpoint files #225

Merged
merged 2 commits into from
Aug 26, 2019

Conversation

jchiang87
Copy link
Collaborator

This will allow us to save the checkpoint files to an archive area for our test run without AGNs so that we can restart the runs later and add the AGNs once we have a reliable model. Having the code move the checkpoint files this way allows for the same workflow logic that relied on the deletion of the checkpoint files from the working directory.

@villarrealas Could you give this a try?

@jchiang87 jchiang87 requested a review from cwwalter August 5, 2019 22:25
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 73.29% when pulling 51fb642 on u/jchiang/add_option_to_move_checkpoint_files into c7654ac on master.

@villarrealas
Copy link

I'll set up a test case ASAP and do a small node run to test while we wait for Cori to stop having file system issues.

Copy link
Member

@cwwalter cwwalter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. A few comments/suggestions included.

@@ -50,6 +50,10 @@
"This will be prepended to any existing IMSIM_IMAGE_PATH "
"environment variable, for which $CWD is included by "
"default.")
parser.add_argument('--ckpt_archive_dir', type=str, default=None,
help="Archive directory for checkpoint files. "
"If None, then delete them (if the checkpointing.cleanup "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some users might not be aware of the config file and think this refers to a non-existent option. Perhaps "if the checkpointing.cleanup variable in the config file being used is True"?

self.ckpt_archive_dir = ckpt_archive_dir
if (self.ckpt_archive_dir is not None and
not os.path.isdir(self.ckpt_archive_dir)):
os.makedirs(self.ckpt_archive_dir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this should be in a try: block?

Im thinking that people might put down general shared directories here, and it might turn out they don't exist or aren't available in certain run conditions or systems.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the action should be if it catches an error. Generally, unless there's a very specific action to take that resolves the error cleanly, I think it's better to let any exceptions go uncaught so that the failure mode is clear at runtime.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah... well I was thinking of catching it, making a very clear statement about went wrong and then stopping the program. I thought this might be the sort of thing that was confusing to non-experts if they just saw the stack-trace.

So, basically the same but trying to be a bit more user friendly. But, I suppose you are right that the standard error will have more information in it for those who can read it.

src_file = gs_interpreter.checkpoint_file
dest_file = os.path.join(IMAGE_SIMULATOR.ckpt_archive_dir,
os.path.basename(src_file))
shutil.move(src_file, dest_file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the past, I think I recall we had problems with a small fraction of file operations sometimes not working as they weren't atomic. Do we need to pay the same attention to such issues here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the specific use case we added this for, it's not an issue: if the destination is on the same file system the shutil.move should itself be atomic. For moving to a different fs, the last thing the shutil.move would do would be to delete the original file after copying, which would be the last thing our code would do if we tried to make the operation atomic. In the event of a crash mid-copy to the other fs, there would be a partially written dest_file, so I suppose we could copy to a temp file, then rename it and then delete the original.

@jchiang87
Copy link
Collaborator Author

@villarrealas I made a change in the code to move the checkpoint files to address Chris' comment about the atomicity of that operation. Could please include that change in your tests? Thanks.

@jchiang87 jchiang87 merged commit a3e4b71 into master Aug 26, 2019
@jchiang87 jchiang87 deleted the u/jchiang/add_option_to_move_checkpoint_files branch August 26, 2019 22:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants