
restarting with swa error #172

Open
mlfffinder opened this issue Sep 25, 2023 · 1 comment

@mlfffinder

Hi,

I'm currently using the multi-GPU branch.
After training finishes with SWA off, I try to restart the training with SWA on, but this produces the error shown in the attached log file:
log.txt

My batch file for this run is attached (I'm loading the checkpoint from epoch 388):

batch.txt

The problem seems to come from the call to swa.scheduler.step() in train.py, but I cannot figure out why. Also, if SWA starts from epoch 1, it works fine.

Thanks.

@SanggyuChong

SanggyuChong commented Jan 2, 2024

Hey guys! I know this is a bit old, but I recently ran into the exact same issue and wanted to shed some light on it.

What I found on my end is that if, like @mlfffinder and myself, you want to re-run MACE model training later with the SWA feature turned on, then the initial run without SWA must still be launched with SWA enabled, but with start_swa set to a very large value beyond max_num_epochs. That way the swa_lr parameter is stored with the checkpointed model and can be read back in the second training run, where SWA is actually turned on.
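
To make this concrete, here is a rough sketch of the two runs, showing only the SWA-related options. The flag names (--swa, --start_swa, --swa_lr, --max_num_epochs, --restart_latest) and the run_train.py entry point are my reading of the standard MACE options plus the parameters mentioned above; they may differ on the multi-GPU branch, and the usual dataset/model arguments are omitted.

```
# Run 1: effectively no SWA. SWA is formally enabled, but start_swa is pushed
# far beyond max_num_epochs, so the SWA phase never actually begins; the point
# is only that swa_lr gets written into the checkpoint.
python scripts/run_train.py \
    --name="my_model" \
    --max_num_epochs=500 \
    --swa \
    --start_swa=100000 \
    --swa_lr=0.001

# Run 2: restart from the saved checkpoint, now with SWA really kicking in.
python scripts/run_train.py \
    --name="my_model" \
    --max_num_epochs=800 \
    --swa \
    --start_swa=500 \
    --swa_lr=0.001 \
    --restart_latest
```

Since swa_lr is now present in the checkpoint from Run 1, the restarted run can rebuild the SWA scheduler without hitting the error above (at least, that fixed it in my case).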

Of course, it would be nice if the initialization and checkpoint-saving routines could be amended to allow such sequential incorporation of the SWA routine, but IMO this is the ad-hoc solution for now.

Best,
Sanggyu
