Add functionality to restart training from a checkpoint file #53

Open
knc6 opened this issue Jan 4, 2022 · 1 comment
knc6 (Collaborator) commented Jan 4, 2022

Due to GPU walltime limits or other reasons, a job may die before completing all of the requested epochs. We should have a function for restarting a job from a previous checkpoint file. We need a config option such as restart_mode=True/False; when enabled, it would search for checkpoint*.pt files and load the latest one in https://github.com/usnistgov/alignn/blob/main/alignn/train.py#L140
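
A rough sketch of what the lookup could look like (the restart_mode flag, the find_latest_checkpoint helper, and the output_dir field are hypothetical placeholders, not existing ALIGNN config options):

```python
import glob
import os

def find_latest_checkpoint(output_dir):
    """Return the most recently written checkpoint*.pt file in output_dir, or None."""
    candidates = glob.glob(os.path.join(output_dir, "checkpoint*.pt"))
    return max(candidates, key=os.path.getmtime) if candidates else None

# inside train_dgl, guarded by the proposed flag (illustrative only):
# if config.restart_mode:
#     checkpoint_path = find_latest_checkpoint(config.output_dir)
#     if checkpoint_path is not None:
#         state = torch.load(checkpoint_path, map_location="cpu")
```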

bdecost (Collaborator) commented Jan 4, 2022

Since we are using Ignite checkpointing here, resuming should be straightforward enough.

I think train_dgl should take an optional checkpoint to resume from; it can then load model and optimizer state with Ignite's Checkpoint.load_objects.
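
Something along these lines, assuming train_dgl gains an optional checkpoint path argument (the resume_from helper and its signature are illustrative; the keys in to_load have to match those used when the checkpoint was saved):

```python
import torch
from ignite.handlers import Checkpoint

def resume_from(checkpoint_path, model, optimizer, trainer=None):
    """Restore model/optimizer state (and optionally trainer counters) from a checkpoint file."""
    to_load = {"model": model, "optimizer": optimizer}
    if trainer is not None:
        # restoring the Engine also restores its epoch/iteration counters
        to_load["trainer"] = trainer
    state = torch.load(checkpoint_path, map_location="cpu")
    Checkpoint.load_objects(to_load=to_load, checkpoint=state)
```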
