Add functionality to restart training from a checkpoint file #53

Open
knc6 opened this issue Jan 4, 2022 · 1 comment
knc6 (Collaborator) commented Jan 4, 2022

Due to GPU walltime limits or other reasons, a job may die before completing all of the requested epochs. We should have a function for restarting a job from a previous checkpoint file. We need a config option such as restart_mode=True/False; when enabled, it would search for checkpoint*.pt files and load the latest one in https://github.com/usnistgov/alignn/blob/main/alignn/train.py#L140
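
A rough sketch of what the lookup could look like (the restart_mode flag, the find_latest_checkpoint helper, and the output_dir field are hypothetical placeholders, not existing ALIGNN config options):

```python
import glob
import os

def find_latest_checkpoint(output_dir):
    """Return the most recently written checkpoint*.pt file in output_dir, or None."""
    candidates = glob.glob(os.path.join(output_dir, "checkpoint*.pt"))
    return max(candidates, key=os.path.getmtime) if candidates else None

# inside train_dgl, guarded by the proposed flag (illustrative only):
# if config.restart_mode:
#     checkpoint_path = find_latest_checkpoint(config.output_dir)
#     if checkpoint_path is not None:
#         state = torch.load(checkpoint_path, map_location="cpu")
```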

bdecost (Collaborator) commented Jan 4, 2022

Since we are using Ignite checkpointing here, resuming should be straightforward enough.

I think train_dgl should take an optional checkpoint to resume from; it can then load model and optimizer state with Ignite's Checkpoint.load_objects.
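
Something along these lines, assuming train_dgl gains an optional checkpoint path argument (the resume_from helper and its signature are illustrative; the keys in to_load have to match those used when the checkpoint was saved):

```python
import torch
from ignite.handlers import Checkpoint

def resume_from(checkpoint_path, model, optimizer, trainer=None):
    """Restore model/optimizer state (and optionally trainer counters) from a checkpoint file."""
    to_load = {"model": model, "optimizer": optimizer}
    if trainer is not None:
        # restoring the Engine also restores its epoch/iteration counters
        to_load["trainer"] = trainer
    state = torch.load(checkpoint_path, map_location="cpu")
    Checkpoint.load_objects(to_load=to_load, checkpoint=state)
```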
