Cannot resume training without quality loss #30
Comments
By default a checkpoint is saved every 500 iterations.
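For context, this is roughly how a Tacotron 2 / Mellotron-style training loop saves periodic checkpoints: the interval is a hyperparameter (iters_per_checkpoint in NVIDIA's code) and the saved dict includes the optimizer state, which is what makes an exact resume possible. This is a simplified sketch, not the repository's exact code.

```python
import torch

def save_checkpoint(model, optimizer, learning_rate, iteration, filepath):
    # Everything needed to resume exactly: weights, optimizer state,
    # current learning rate and the iteration counter.
    torch.save({'iteration': iteration,
                'state_dict': model.state_dict(),
                'optimizer': optimizer.state_dict(),
                'learning_rate': learning_rate}, filepath)

# Inside the training loop (sketch):
# if iteration % iters_per_checkpoint == 0:
#     save_checkpoint(model, optimizer, learning_rate, iteration,
#                     f"{output_directory}/checkpoint_{iteration}")
```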
Yes, but that's not the issue. Imagine I trained for weeks up to 300,400 iterations and a blackout happened: I'd lose only 400 iterations of progress and still have a checkpoint_300000 file, so is it possible to resume training from that checkpoint? Every attempt I made to resume from a checkpoint has produced a model that sounded much worse than its predecessor (the checkpoint_300000 file). I know resuming training sometimes requires some warming up before it returns to its previous state, but that isn't happening even after a week; the results are not even close to the predecessor. If I had a time machine and could have prevented the blackout, the new checkpoint (i.e. checkpoint_400000) would have sounded better, not worse, than before. Do I have to start over from scratch and lose weeks of training, or did I do something wrong? Thanks for your patience.
@AndroYD84 try changing these hyperparameters before resuming from the checkpoint: ignore_layers=[] and use_saved_learning_rate=True
Note that loading with --warm_start does not include the optimizer state.
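The practical difference between the two loading paths is sketched below, assuming a Tacotron 2 / Mellotron-style train.py: a warm start copies only model weights (optionally dropping the names listed in ignore_layers) and restarts the optimizer and iteration counter, whereas a plain resume restores the optimizer state, stored learning rate and iteration as well. Function names follow NVIDIA's training script but are reproduced from memory, so treat them as illustrative.

```python
import torch

def warm_start_model(checkpoint_path, model, ignore_layers):
    # Weights only: optimizer state, learning rate and iteration are NOT restored,
    # so training effectively continues with a fresh optimizer.
    checkpoint_dict = torch.load(checkpoint_path, map_location='cpu')
    model_dict = checkpoint_dict['state_dict']
    if len(ignore_layers) > 0:
        model_dict = {k: v for k, v in model_dict.items()
                      if k not in ignore_layers}
        dummy_dict = model.state_dict()
        dummy_dict.update(model_dict)
        model_dict = dummy_dict
    model.load_state_dict(model_dict)
    return model

def load_checkpoint(checkpoint_path, model, optimizer):
    # Full resume: weights plus optimizer state, learning rate and iteration,
    # which is what keeps quality from regressing after a crash.
    checkpoint_dict = torch.load(checkpoint_path, map_location='cpu')
    model.load_state_dict(checkpoint_dict['state_dict'])
    optimizer.load_state_dict(checkpoint_dict['optimizer'])
    learning_rate = checkpoint_dict['learning_rate']
    iteration = checkpoint_dict['iteration']
    return model, optimizer, learning_rate, iteration
```

With use_saved_learning_rate=True the resumed run keeps the learning rate stored in the checkpoint instead of the value from hparams, which is the other half of the suggestion above.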
Thank you for your help. I resumed training without the --warm_start option and I can confirm that so far I haven't noticed any quality loss. I haven't tried texpomru13's solution, as the results were already improving without the need to change anything.
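For reference, the resume that worked here appears to be the same invocation quoted in the issue body below, just with --warm_start omitted so the optimizer state is restored, along the lines of (substitute your own latest checkpoint):
python train.py --output_directory=outdir --log_directory=logdir -c models/mylatestmodel.pt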
Due to unfortunate circumstances my training process terminated abruptly. Any attempt to resume it has resulted in a model that is much worse than before the interruption; even after days of training it doesn't seem to get back to the quality it had, as if it were unrecoverable. I attempted 2 different methods:
1. python train.py --output_directory=outdir --log_directory=logdir -c models/mellotron_libritts.pt --warm_start
just like when I began training, and it looks like it resumes from the latest iteration I pointed it to, but the results are far worse than they used to be, even after 3 days of training.
2. python train.py --output_directory=outdir --log_directory=logdir -c models/mylatestmodel.pt --warm_start
but the generated results sound even worse than (1) after 2 days of training.
Is it actually possible to resume an interrupted training and get it back on track? If so, what is the correct method? It can be quite frustrating to lose days or hours of training because of an incident beyond our control.
I think a console logger could be a useful addition too: if the terminal window gets closed unexpectedly you'd still have a record of the run. In my case it would have been useful to check how many epochs it reached before the training was suddenly interrupted, even if it was purely cosmetic.
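A minimal sketch of the kind of console logger suggested here, assuming a plain Python training script: it mirrors progress messages to both the terminal and a file, so the last reached epoch/iteration survives a closed window. The filename and message format are placeholders.

```python
import logging
import sys

def setup_logger(log_path="train_console.log"):
    logger = logging.getLogger("train")
    logger.setLevel(logging.INFO)
    fmt = logging.Formatter("%(asctime)s %(message)s")
    # Write to the terminal as before...
    stream_handler = logging.StreamHandler(sys.stdout)
    stream_handler.setFormatter(fmt)
    logger.addHandler(stream_handler)
    # ...and keep a persistent copy on disk.
    file_handler = logging.FileHandler(log_path)
    file_handler.setFormatter(fmt)
    logger.addHandler(file_handler)
    return logger

# Usage inside the training loop:
# logger = setup_logger()
# logger.info("epoch %d iteration %d loss %.4f", epoch, iteration, loss)
```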