Hi,
I'm currently using the multi-GPU branch.
After training finishes with SWA off, I'm trying to restart training with SWA on, but this raises the error shown in the log file below.
log.txt
My batch file for this run is attached (I'm loading the checkpoint from epoch 388).
batch.txt
It seems that calling swa.scheduler.step() in train.py causes the problem, but I cannot figure out why.
Also, if SWA kicks in starting from epoch 1, everything works fine.
Thanks.
Hey guys! I know this is a bit old, but I recently ran into the exact same issue and wanted to shed some light.
What I found is that if, like @mlfffinder and myself, you want to re-run MACE model training later with the SWA feature turned on, then the initial run without SWA must still be launched with SWA enabled (i.e. give start_swa a very large value, beyond max_num_epochs), so that the swa_lr parameter is stored in the checkpoint and can be read back when the second training run is started with SWA turned on.
Of course, it would be nice if the initialization and checkpoint-saving routines were amended to allow the SWA routine to be added sequentially like this, but IMO this is the ad-hoc solution for now; a sketch of what that workflow looks like is below.
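A minimal sketch of the workaround, assuming a SLURM/shell workflow like the attached batch file. The parameter names start_swa, swa_lr, and max_num_epochs come from this thread; the `--swa` and `--restart_latest` flags and the exact values are my assumptions about how SWA is enabled and how a checkpoint is resumed, so adapt them to your own setup.

```bash
# Initial run: enable SWA but push start_swa far beyond max_num_epochs,
# so SWA never actually activates, yet swa_lr ends up in the checkpoint.
python run_train.py \
    --name="my_model" \
    --max_num_epochs=400 \
    --swa \
    --start_swa=100000 \
    --swa_lr=1e-3
    # ...plus your usual dataset/model arguments

# Restart: resume from the saved checkpoint and let SWA start for real.
python run_train.py \
    --name="my_model" \
    --restart_latest \
    --max_num_epochs=600 \
    --swa \
    --start_swa=400 \
    --swa_lr=1e-3
    # ...plus your usual dataset/model arguments
```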