
restarting with swa error #172

Open
mlfffinder opened this issue Sep 25, 2023 · 1 comment

@mlfffinder

Hi,

I'm currently using the multi-GPU branch.
After training finishes with SWA off, I try to restart the training with SWA on, but this produces the error shown in the attached log file:
log.txt

My batch file for this run is attached (I'm loading the checkpoint from epoch 388):

batch.txt

The problem seems to come from the call to swa.scheduler.step() in train.py, but I cannot figure out why. Also, if SWA starts from epoch 1, it works fine.

Thanks.

@SanggyuChong

SanggyuChong commented Jan 2, 2024

Hey guys! I know this is a bit old, but I recently ran into the exact same issue and wanted to shed some light on it.

What I found on my end is that if, like @mlfffinder and myself, you want to re-run MACE model training later with the SWA feature turned on, then the initial run without SWA must still be launched with SWA enabled, but with start_swa set to a very large value beyond max_num_epochs. That way the swa_lr parameter is stored with the checkpointed model and can be read back in the second training run, where SWA is actually turned on.
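
To make this concrete, here is a rough sketch of the two runs, showing only the SWA-related options. The flag names (--swa, --start_swa, --swa_lr, --max_num_epochs, --restart_latest) and the run_train.py entry point are my reading of the standard MACE options plus the parameters mentioned above; they may differ on the multi-GPU branch, and the usual dataset/model arguments are omitted.

```
# Run 1: effectively no SWA. SWA is formally enabled, but start_swa is pushed
# far beyond max_num_epochs, so the SWA phase never actually begins; the point
# is only that swa_lr gets written into the checkpoint.
python scripts/run_train.py \
    --name="my_model" \
    --max_num_epochs=500 \
    --swa \
    --start_swa=100000 \
    --swa_lr=0.001

# Run 2: restart from the saved checkpoint, now with SWA really kicking in.
python scripts/run_train.py \
    --name="my_model" \
    --max_num_epochs=800 \
    --swa \
    --start_swa=500 \
    --swa_lr=0.001 \
    --restart_latest
```

Since swa_lr is now present in the checkpoint from Run 1, the restarted run can rebuild the SWA scheduler without hitting the error above (at least, that fixed it in my case).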

Of course, it would be nice if the initialization and checkpoint-saving routines could be amended to allow such sequential incorporation of the SWA routine, but IMO this is the ad-hoc solution for now.

Best,
Sanggyu
